## Leaf Classification - Report 2

Linear/simple models:

* Naive Bayes (sklearn.naive_bayes)

Non-linear model:

* KNeighborsClassifier

Tree:

* Decision tree

Ensemble:

* Random Forests

Neural networks:

* Multi-layer Perceptron ([sklearn.neural_network.MLPClassifier](http://scikit-learn.org/stable/modules/neural_networks_supervised.html#neural-networks-supervised))

SVM:

* SVC ([sklearn.svm](http://scikit-learn.org/stable/modules/svm.html#svm))

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Silence all warnings by replacing warnings.warn with a no-op
import warnings

def warn(*args, **kwargs):
    pass

warnings.warn = warn

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedShuffleSplit  # sklearn.cross_validation was removed in scikit-learn 0.20

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

In [2]:
# Swiss army knife function to organize the data

def encode(train, test):
    le = LabelEncoder().fit(train.species) 
    labels = le.transform(train.species)           # encode species strings
    classes = list(le.classes_)                    # save column names for submission
    test_ids = test.id                             # save test ids for submission
    
    train = train.drop(['species', 'id'], axis=1)  
    test = test.drop(['id'], axis=1)
    
    return train, labels, test, test_ids, classes

train, labels, test, test_ids, classes = encode(train, test)

train.head(1)

| | margin1 | margin2 | margin3 | margin4 | margin5 | margin6 | margin7 | margin8 | margin9 | margin10 | ... | texture55 | texture56 | texture57 | texture58 | texture59 | texture60 | texture61 | texture62 | texture63 | texture64 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.007812 | 0.023438 | 0.023438 | 0.003906 | 0.011719 | 0.009766 | 0.027344 | 0.0 | 0.001953 | 0.033203 | ... | 0.007812 | 0.0 | 0.00293 | 0.00293 | 0.035156 | 0.0 | 0.0 | 0.004883 | 0.0 | 0.025391 |
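The label handling in `encode` can be checked in isolation. The sketch below uses a few made-up species names (an assumption, standing in for `train.species`) to show that `LabelEncoder` assigns integer codes in sorted order of the class names, and that `inverse_transform` recovers the original strings, which is what makes `classes` usable as submission column names.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical species names standing in for train.species
species = ["Quercus_Rubra", "Acer_Opalus", "Quercus_Rubra", "Tilia_Tomentosa"]

le = LabelEncoder().fit(species)
labels = le.transform(species)                 # integer code per sample
classes = list(le.classes_)                    # unique names, sorted alphabetically
decoded = list(le.inverse_transform(labels))   # back to the original strings

print(classes)
print(list(labels))
```

The position of a name in `classes` is its integer code, so column `k` of a predicted probability matrix corresponds to `classes[k]`.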


In [3]:
from sklearn.metrics import accuracy_score, log_loss, recall_score, precision_recall_fscore_support, precision_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict

classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="rbf", C=0.025, probability=True),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    GaussianNB(),
    LinearDiscriminantAnalysis()
]

# Define the stratified folds once so that every model
# is cross-validated on exactly the same split
folds = StratifiedKFold(n_splits=10, random_state=None, shuffle=False)

# Logging for Visual Comparison
log_cols=["Classifier", "Accuracy", "Log Loss"]
log = pd.DataFrame(columns=log_cols)

# Cross validation
def crossValidate(classifiers, train, labels, folds):
    for clf in classifiers:
        name = clf.__class__.__name__
    
        print("="*30)
        print(name)
    
        print('****Results****')
        # Out-of-fold predictions: each sample is predicted by a model that did not see it
        train_predictions = cross_val_predict(clf, train, labels, n_jobs=-1, cv=folds)
        print("Precision: ", precision_score(labels, train_predictions, average='macro'))
        print("Recall: ", recall_score(labels, train_predictions, average='macro'))
        print("Accuracy: ", accuracy_score(labels,train_predictions))
    
        #log_entry = pd.DataFrame([[name, acc*100, ll]], columns=log_cols)
        #log = log.append(log_entry)
    
    print("="*30)
    
crossValidate(classifiers, train, labels, folds)

KNeighborsClassifier
****Results****
Precision:  0.898580375853
Recall:  0.891919191919
Accuracy:  0.891919191919
SVC
****Results****
Precision:  0.815300943805
Recall:  0.80101010101
Accuracy:  0.80101010101
DecisionTreeClassifier
****Results****
Precision:  0.696273710445
Recall:  0.681818181818
Accuracy:  0.681818181818
RandomForestClassifier
****Results****
Precision:  0.913235810206
Recall:  0.905050505051
Accuracy:  0.905050505051
GaussianNB
****Results****
Precision:  0.750538610202
Recall:  0.611111111111
Accuracy:  0.611111111111
LinearDiscriminantAnalysis
****Results****
Precision:  0.980466671376
Recall:  0.977777777778
Accuracy:  0.977777777778
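The cell above imports `log_loss` and defines a "Log Loss" column, but the logging lines are commented out and no log loss is computed. A minimal sketch of one way to fill that column is to ask `cross_val_predict` for out-of-fold probabilities with `method="predict_proba"`. The iris data here is an assumption used only to keep the example self-contained; it is not the leaf dataset, and the numbers it prints say nothing about the results above.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Stand-in data (assumption): iris replaces the leaf features
X, y = load_iris(return_X_y=True)
folds = StratifiedKFold(n_splits=10)
clf = LinearDiscriminantAnalysis()

# Out-of-fold class probabilities, one row per sample
proba = cross_val_predict(clf, X, y, cv=folds, method="predict_proba")

class_values = np.unique(y)                    # column order of proba
pred = class_values[proba.argmax(axis=1)]      # most probable class per sample

acc = accuracy_score(y, pred)
ll = log_loss(y, proba, labels=class_values)

# One row in the same layout as the notebook's log DataFrame
log_cols = ["Classifier", "Accuracy", "Log Loss"]
log = pd.DataFrame([[clf.__class__.__name__, acc * 100, ll]], columns=log_cols)
print(log)
```

Rows built this way for each classifier could be combined with `pd.concat`, since `DataFrame.append` is no longer available in recent pandas versions.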


In [4]:
MLPs = [
    MLPClassifier(hidden_layer_sizes=(1,)),
    MLPClassifier(hidden_layer_sizes=(99,)),
    MLPClassifier(hidden_layer_sizes=(198,))
]

crossValidate(MLPs, train, labels, folds)

MLPClassifier
****Results****
Precision:  0.000408121620243
Recall:  0.010101010101
Accuracy:  0.010101010101
MLPClassifier
****Results****
Precision:  0.908468944075
Recall:  0.89696969697
Accuracy:  0.89696969697
MLPClassifier
****Results****
Precision:  0.942063576154
Recall:  0.934343434343
Accuracy:  0.934343434343
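The accuracy of the one-unit network, 0.0101, is worth comparing against a chance baseline. The sketch below assumes 99 balanced classes with 10 samples each; the notebook does not state this, although the hidden layer sizes 99 and 198 and the reported accuracy are consistent with it. Under that assumption, a predictor that always outputs a single class scores 1/99.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

n_classes, per_class = 99, 10        # assumed class layout, not stated in the notebook
y = np.repeat(np.arange(n_classes), per_class)
X = np.zeros((y.size, 1))            # features are irrelevant to a constant predictor

# Always predicts one class (ties between equally frequent classes resolve to the first)
baseline = DummyClassifier(strategy="most_frequent")
pred = cross_val_predict(baseline, X, y, cv=StratifiedKFold(n_splits=10))

chance = accuracy_score(y, pred)
print(chance, 1 / n_classes)
```

If the assumption holds, the one-unit MLP is doing no better than guessing one class, which points to a bottleneck too narrow to separate the classes rather than to a training bug.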


In [6]:
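from sklearn.preprocessing import StandardScaler
# Note: the scaler is fit on the full training set before cross-validation,
# so each validation fold contributes to the mean and variance used to scale it
scaler = StandardScaler()
scaler.fit(train)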

crossValidate(classifiers[:2]+MLPs, scaler.transform(train), labels, folds)

KNeighborsClassifier
****Results****
Precision:  0.976946622401
Recall:  0.973737373737
Accuracy:  0.973737373737
SVC
****Results****
Precision:  0.98034423489
Recall:  0.977777777778
Accuracy:  0.977777777778
MLPClassifier
****Results****
Precision:  0.00562768429804
Recall:  0.0212121212121
Accuracy:  0.0212121212121
MLPClassifier
****Results****
Precision:  0.983007228462
Recall:  0.980808080808
Accuracy:  0.980808080808
MLPClassifier
****Results****
Precision:  0.986164677074
Recall:  0.984848484848
Accuracy:  0.984848484848
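In the cell above the scaler is fit once on all training rows, so the scaling statistics include the rows later used for validation. A common alternative, sketched here as an assumption rather than as what the notebook does, is to put the scaler and the classifier in a pipeline so that `cross_val_predict` refits the scaler on the training part of each fold. Iris again stands in for the leaf data to keep the example runnable on its own.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): iris replaces the leaf features
X, y = load_iris(return_X_y=True)
folds = StratifiedKFold(n_splits=10)

# The scaler is refit inside every fold, using only that fold's training rows
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(3))

pred = cross_val_predict(scaled_knn, X, y, cv=folds)
acc = accuracy_score(y, pred)
print("Accuracy: ", acc)
```

With standardization the difference between the two approaches is often small, but the pipeline version keeps the cross-validated scores an honest estimate and can be passed to `crossValidate` like any other classifier.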
