## Kaggle - Leaf competition
### Topic: ensembling basic classifiers with a VotingClassifier
* Models: KNN, DecisionTree, SVC

* Ensembling: VotingClassifier

* Tuning: GridSearchCV

* CV: inner/outer: StratifiedKFold

In [1]:
import numpy as np
import pandas as pd
import os
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

In [2]:
# Loading data
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')

In [3]:
# Preprocessing
train_id = train.pop('id')
train_y = train.pop('species')
test_id = test.pop('id')
le = LabelEncoder()
y = le.fit_transform(train_y)
n_samples, n_features = train.shape

The inner CV defines **the random splits used for cross-validation during hyperparameter tuning**. We will use different random splits (of the same data) to estimate the generalization error. The reason is that the inner CV splits were used to select the hyperparameters, so the out-of-fold error rates measured on those same splits will be too optimistic. Scoring on separate splits reduces this bias.

In [4]:
# Inner CV
skf = StratifiedKFold(5, shuffle=True) # 10 obs. per class, select 2 for testing
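The inner/outer idea described above can be sketched as nested cross-validation. This is a minimal illustration, not the notebook's own code: the iris data stands in for the leaf features, and the names `inner_cv`, `outer_cv`, `tuned_knn` and the small KNN grid are assumptions chosen for brevity.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption): iris replaces the leaf features here
X_demo, y_demo = load_iris(return_X_y=True)

# Inner splits choose the hyperparameters; outer splits score the whole tuning procedure
inner_cv = StratifiedKFold(5, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(5, shuffle=True, random_state=1)

tuned_knn = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': [2, 3, 5]}, cv=inner_cv)

# Each outer fold reruns the grid search on its training part only,
# so the held-out part never influences the selected hyperparameters
nested_scores = cross_val_score(tuned_knn, X_demo, y_demo, cv=outer_cv)
print("Nested CV accuracy: %0.3f (+/- %0.3f)" % (nested_scores.mean(), 2*nested_scores.std(ddof=1)))
```

Passing the `GridSearchCV` object itself to `cross_val_score` is what keeps tuning and scoring separate; scoring an already tuned `best_estimator_` on the same data does not.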

Now we set up **the three base classifiers and their voting ensemble**. We put the ensemble into a pipeline. The ensemble is the only step for now; more steps can be added later while developing the model.

In [5]:
clf1 = DecisionTreeClassifier() #max_depth=4)
clf2 = KNeighborsClassifier() #n_neighbors=7)
clf3 = SVC(kernel='rbf') #, probability=True)
estimators = [('dt', clf1), ('knn', clf2), ('svc', clf3)]
n_estimators = len(estimators)
eclf = VotingClassifier(estimators)#, n_jobs=-1) #voting='soft', weights=[2, 1, 2])
pipeline = Pipeline([('vc', eclf)])
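To make the default (hard) voting rule concrete, the sketch below rebuilds the ensemble prediction by hand from the fitted base classifiers. It is an illustration under assumptions: iris is used as stand-in data, and `demo_vc`, `votes` and `manual_vote` are names introduced here.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): iris, whose labels are already the integers 0, 1, 2
X_demo, y_demo = load_iris(return_X_y=True)

demo_vc = VotingClassifier([('dt', DecisionTreeClassifier(random_state=0)),
                            ('knn', KNeighborsClassifier()),
                            ('svc', SVC(kernel='rbf'))])  # voting='hard' is the default
demo_vc.fit(X_demo, y_demo)

# One column of predicted labels per fitted base classifier
votes = np.column_stack([est.predict(X_demo) for est in demo_vc.estimators_])

# Majority label per row; argmax over the counts breaks ties towards the smallest label
manual_vote = np.array([np.bincount(row).argmax() for row in votes])

print((manual_vote == demo_vc.predict(X_demo)).all())
```

Soft voting (`voting='soft'`) averages predicted class probabilities instead, which is why the commented-out option above pairs it with `SVC(probability=True)`.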

#### Part 1.1. Simple fitting without hyperparameter tuning

In [None]:
# fitting (ensemble)
pipeline.fit(train, y)
# validation (ensemble)
scores = cross_val_score(pipeline, train, y, cv=skf, n_jobs=-1)
print("Accuracy: %0.3f (+/- %0.3f)" % (scores.mean(), 2*scores.std(ddof=1)))

In [7]:
# CV score of all classifiers and the ensemble (before GridSearch tuning)
for clf, label in zip([clf1, clf2, clf3, eclf], ['Decision Tree', 'KNN', 'SVC-RBF', 'Ensemble']):
    scores = cross_val_score(clf, train, y, cv=skf, scoring='accuracy')
    print("Accuracy: %0.3f (+/- %0.3f) [%s]" % (scores.mean(), 2*scores.std(ddof=1), label))

Accuracy: 0.676 (+/- 0.101) [Decision Tree]
Accuracy: 0.858 (+/- 0.049) [KNN]
Accuracy: 0.797 (+/- 0.021) [SVC-RBF]
Accuracy: 0.852 (+/- 0.061) [Ensemble]


#### Part 1.2. Tuning hyperparameters with GridSearchCV

In [8]:
g = 1/n_features
parameters = {'vc__dt__max_depth':[5,10,None],
              'vc__knn__n_neighbors':[2,3,5],
              'vc__knn__n_jobs':[-1],
              'vc__svc__gamma':[10*g, g, g/10]}
gs = GridSearchCV(pipeline, parameters, cv=skf, n_jobs=-1)
gs.fit(train, y)
print("The best parameters are %s with a score of %0.3f" % (gs.best_params_, gs.best_score_))

The best parameters are {'vc__svc__gamma': 0.05208333333333333, 'vc__dt__max_depth': None, 'vc__knn__n_jobs': -1, 'vc__knn__n_neighbors': 2} with a score of 0.870
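The grid keys follow the pattern `<pipeline step>__<ensemble member>__<parameter>`. A small sketch of how those names can be discovered and set by hand (the `demo_pipe` and `tunable` names are assumptions introduced for this example):

```python
from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

demo_pipe = Pipeline([('vc', VotingClassifier([('dt', DecisionTreeClassifier()),
                                               ('knn', KNeighborsClassifier()),
                                               ('svc', SVC(kernel='rbf'))]))])

# Parameters of the base classifiers are the names with two '__' separators
tunable = sorted(k for k in demo_pipe.get_params() if k.count('__') == 2)
print(tunable[:5])

# set_params follows the same path that GridSearchCV uses for each grid point
demo_pipe.set_params(vc__dt__max_depth=5)
print(demo_pipe.named_steps['vc'].estimators[0][1].max_depth)
```

Listing `get_params()` before writing a grid is a cheap way to catch a misspelled key, which would otherwise only surface as an error when the search is fitted.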


In [9]:
# Explore the individual estimators
gs.best_estimator_
gs.best_estimator_.named_steps['vc']
gs.best_estimator_.named_steps['vc'].estimators_
gs.best_estimator_.named_steps['vc'].estimators_[0]
# CV score for individual classifiers (after GridSearch tuning)
for i in range(n_estimators):
    scores = cross_val_score(gs.best_estimator_.named_steps['vc'].estimators_[i],
                             train, y, cv=skf, scoring='accuracy')
    print("Accuracy: %0.3f (+/- %0.3f) [%s]" % (scores.mean(), 2*scores.std(ddof=1),
                            estimators[i][1].__class__.__name__))

Accuracy: 0.691 (+/- 0.034) [DecisionTreeClassifier]
Accuracy: 0.873 (+/- 0.040) [KNeighborsClassifier]
Accuracy: 0.785 (+/- 0.055) [SVC]


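Tuning the hyperparameters of the ensemble raised the mean accuracy of the Decision Tree (0.676 to 0.691) and KNN (0.858 to 0.873) and lowered it for the SVC (0.797 to 0.785). All three changes are smaller than the reported +/- spread. In addition, `skf` was created without a `random_state`, so each `cross_val_score` call draws new splits. These differences may therefore be noise rather than an effect of tuning.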
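When scores from separate `cross_val_score` calls are compared, it helps if every call sees the same folds. The sketch below shows the effect of fixing `random_state` on the splitter; the toy labels and the helper `fold_indices` are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels (assumption): 3 classes with 10 observations each
y_demo = np.repeat([0, 1, 2], 10)
X_demo = np.zeros((len(y_demo), 1))

def fold_indices(cv):
    """Return the test indices of every fold as a list of tuples."""
    return [tuple(test_idx) for _, test_idx in cv.split(X_demo, y_demo)]

fixed = StratifiedKFold(5, shuffle=True, random_state=0)
unfixed = StratifiedKFold(5, shuffle=True)

# With an integer random_state the folds are identical on every call
print(fold_indices(fixed) == fold_indices(fixed))
# Without it the folds are redrawn on each call, so this is usually False
print(fold_indices(unfixed) == fold_indices(unfixed))
```

With a fixed splitter, the before/after accuracies would be paired fold by fold, which makes small differences easier to interpret.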

In [10]:
# validation (ensemble after GridSearch tuning) (remember to use the .best_estimator_)
scores = cross_val_score(gs.best_estimator_, train, y, cv=skf, n_jobs=-1)
print("Accuracy: %0.3f (+/- %0.3f)" % (scores.mean(), 2*scores.std(ddof=1)))

Accuracy: 0.874 (+/- 0.052)


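The resulting mean accuracy of the tuned ensemble (0.874) is higher than that of any individual model, although the margin over the tuned KNN (0.873) is well within the fold-to-fold spread. Because the hyperparameters were selected on this same data, the estimate is also likely to be somewhat optimistic.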