# Notebook ICD - 17

### Library

In [1]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import BaggingClassifier
from mlxtend.classifier import StackingCVClassifier
from sklearn.ensemble import VotingClassifier

### Dataset

In [2]:
from sklearn.datasets import load_iris
iris = load_iris()

### Boosting

Gradient Boosting for classification.

This algorithm builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differentiable loss functions. In each stage n_classes_ regression trees are fit on the negative gradient of the loss function, e.g. binary or multiclass log loss. Binary classification is a special case where only a single regression tree is induced.
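The "n_classes_ regression trees per stage" point above can be checked by inspecting the fitted `estimators_` array. This is a minimal sketch, not part of the original notebook. The `random_state=0` value and the choice to drop class 0 for the binary case are arbitrary assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_iris(return_X_y=True)

# Multiclass case: iris has 3 classes, so each stage fits 3 regression trees.
multi = GradientBoostingClassifier(n_estimators=10, random_state=0).fit(X, y)
print(multi.estimators_.shape)

# Binary case (assumption: drop class 0 to get a two-class problem),
# so each stage fits a single regression tree.
mask = y != 0
binary = GradientBoostingClassifier(n_estimators=10, random_state=0).fit(X[mask], y[mask])
print(binary.estimators_.shape)
```

The first axis of `estimators_` is the number of boosting stages and the second is the number of trees per stage, so the two shapes should be `(10, 3)` and `(10, 1)`.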

In [3]:
clf = GradientBoostingClassifier(n_estimators=10)
scores = cross_val_score(clf, iris.data, iris.target, cv=10)
print(scores)
print(scores.mean())

[1.         0.93333333 1.         0.93333333 0.93333333 0.86666667
 0.93333333 1.         1.         1.        ]
0.96


### Bagging

A Bagging classifier is an ensemble meta-estimator that fits each base classifier on a random subset of the original dataset and then aggregates their individual predictions (either by voting or by averaging) to form a final prediction. Such a meta-estimator can typically be used to reduce the variance of a black-box estimator (e.g., a decision tree) by introducing randomization into its construction procedure and then building an ensemble out of it.

This algorithm encompasses several works from the literature. When random subsets of the samples are drawn without replacement, the method is known as Pasting. If samples are drawn with replacement, the method is known as Bagging. When random subsets of the dataset are drawn as random subsets of the features, the method is known as Random Subspaces. Finally, when base estimators are built on subsets of both samples and features, the method is known as Random Patches.

The base estimator is the model fitted on each random subset of the dataset. If none is given, BaggingClassifier uses a DecisionTreeClassifier.
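The four variants named above (Pasting, Bagging, Random Subspaces, Random Patches) can be approximated by changing how `BaggingClassifier` draws samples and features. The sketch below is one possible mapping, not part of the original notebook. The 0.5 fractions and `random_state=0` are arbitrary assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Each entry only changes how samples and features are drawn.
variants = {
    # Subsets of samples, drawn without replacement.
    "Pasting": dict(max_samples=0.5, bootstrap=False),
    # Subsets of samples, drawn with replacement.
    "Bagging": dict(max_samples=0.5, bootstrap=True),
    # All samples, random subsets of the features.
    "Random Subspaces": dict(max_features=0.5, bootstrap=False),
    # Subsets of both samples and features.
    "Random Patches": dict(max_samples=0.5, max_features=0.5, bootstrap=False),
}

results = {}
for name, params in variants.items():
    clf = BaggingClassifier(n_estimators=10, random_state=0, **params)
    results[name] = cross_val_score(clf, X, y, cv=10).mean()
    print(f"{name}: {results[name]:.3f}")
```

On a dataset as small as iris the four variants will likely score close to one another. The point of the sketch is the parameter mapping rather than the numbers.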

In [4]:
clf = BaggingClassifier(n_estimators=10)
scores = cross_val_score(clf, iris.data, iris.target, cv=10)
print(scores)
print(scores.mean())

[1.         0.93333333 1.         0.93333333 0.93333333 0.93333333
 0.8        1.         1.         1.        ]
0.9533333333333334


### Stacking

Stack of estimators with a final classifier.

Stacked generalization consists of stacking the outputs of the individual estimators and using a classifier to compute the final prediction. Stacking exploits the strength of each individual estimator by using their outputs as the input of a final estimator.

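Note that in scikit-learn's StackingClassifier, estimators_ are fitted on the full X, while final_estimator_ is trained on cross-validated predictions of the base estimators obtained with cross_val_predict. The cells below use mlxtend's StackingCVClassifier instead, which likewise trains the meta-classifier on out-of-fold predictions.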
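The idea of training the final classifier on cross-validated predictions can be sketched by hand with `cross_val_predict`. This is a simplified illustration under stated assumptions, not the internals of either library. The two base estimators, `cv=5`, `max_iter=1000`, and the helper `stacked_predict` are all hypothetical choices for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
base = [GaussianNB(), KNeighborsClassifier(n_neighbors=3)]

# Out-of-fold class probabilities: each row is predicted by a model
# that never saw that row during training.
meta_features = np.hstack([
    cross_val_predict(est, X, y, cv=5, method="predict_proba")
    for est in base
])

# The final classifier learns from the base estimators' outputs.
meta = LogisticRegression(max_iter=1000).fit(meta_features, y)

# For prediction, the base estimators are refit on the full data.
fitted_base = [est.fit(X, y) for est in base]

def stacked_predict(X_new):
    """Hypothetical helper: stack base probabilities, then apply the meta model."""
    feats = np.hstack([est.predict_proba(X_new) for est in fitted_base])
    return meta.predict(feats)

print(meta_features.shape)
print(stacked_predict(X[:5]))
```

With two base estimators and three classes, `meta_features` has six columns, one per (estimator, class) pair. Using out-of-fold predictions keeps the final classifier from learning on outputs the base models produced for their own training rows.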

In [5]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm

from sklearn.linear_model import LogisticRegression
from sklearn import model_selection
RANDOM_SEED = 42

In [6]:
X, y = iris.data[:, 1:3], iris.target

In [7]:
clf1 = DecisionTreeClassifier()
clf2 = GaussianNB()
clf3 = KNeighborsClassifier(n_neighbors=3)
clf4 = svm.SVC()

lr = LogisticRegression()

In [8]:
sclf = StackingCVClassifier(classifiers=[clf1, clf2, clf3, clf4],
                            meta_classifier=lr,
                            random_state=RANDOM_SEED)

In [9]:
print('10-fold cross validation:\n')

for clf, label in zip([clf1, clf2, clf3, clf4, sclf], 
                      ['Decision Tree', 
                       'Naive Bayes',
                       'kNN', 
                       'SVM',
                       'StackingClassifier']):

    scores = model_selection.cross_val_score(clf, X, y, 
                                              cv=10, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" 
          % (scores.mean(), scores.std(), label))

10-fold cross validation:

Accuracy: 0.92 (+/- 0.06) [Decision Tree]
Accuracy: 0.91 (+/- 0.05) [Naive Bayes]
Accuracy: 0.95 (+/- 0.05) [kNN]
Accuracy: 0.95 (+/- 0.04) [SVM]
Accuracy: 0.95 (+/- 0.05) [StackingClassifier]


### Voting

The idea behind the VotingClassifier is to combine conceptually different machine learning classifiers and use a majority vote (hard vote) or the average of the predicted probabilities (soft vote) to predict the class labels. Such a classifier can be useful for a set of equally well-performing models, in order to balance out their individual weaknesses.

In majority voting, the predicted class label for a particular sample is the class label that represents the majority (mode) of the class labels predicted by each individual classifier.

E.g., if the prediction for a given sample is

classifier 1 -> class 1

classifier 2 -> class 1

classifier 3 -> class 2

the VotingClassifier (with voting='hard') would classify the sample as “class 1” based on the majority class label.

In the case of a tie, the VotingClassifier selects the class based on ascending sort order. E.g., in the following scenario

classifier 1 -> class 2

classifier 2 -> class 1

the class label 1 will be assigned to the sample.
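The two voting rules described above can be written out in a few lines. This is a minimal sketch for illustration, not the library's implementation. `hard_vote` and `soft_vote` are hypothetical helper names, and the probabilities in the soft-vote example are made up.

```python
from collections import Counter

import numpy as np

def hard_vote(predictions):
    """Majority vote; ties go to the smallest label (ascending sort order)."""
    counts = Counter(predictions)
    top = max(counts.values())
    return min(label for label, count in counts.items() if count == top)

def soft_vote(probas):
    """Average the per-classifier class probabilities and take the argmax."""
    return int(np.argmax(np.mean(probas, axis=0)))

print(hard_vote([1, 1, 2]))  # majority example from the text: class 1
print(hard_vote([2, 1]))     # tie example from the text: class 1

# Made-up probabilities over two classes (indices 0 and 1) for three classifiers.
probas = np.array([[0.9, 0.1],
                   [0.4, 0.6],
                   [0.4, 0.6]])
print(hard_vote(probas.argmax(axis=1).tolist()))  # two of three pick index 1
print(soft_vote(probas))                          # the average favors index 0
```

The last two lines show that hard and soft voting can disagree: one very confident classifier can outweigh two lukewarm ones once the probabilities are averaged.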



In [10]:
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

In [11]:
clf1 = DecisionTreeClassifier()
clf2 = GaussianNB()
clf3 = KNeighborsClassifier(n_neighbors=3)
clf4 = svm.SVC()

In [12]:
eclf = VotingClassifier(
     estimators=[('dt', clf1), ('nb', clf2), ('kNN', clf3), ('svm', clf4)],
     voting='hard')  # 'soft' averages predict_proba, so svm.SVC would then need probability=True

In [13]:
for clf, label in zip([clf1, clf2, clf3, clf4, eclf], ['dt', 'nb', 'kNN', 'svm', 'Ensemble']):
    scores = cross_val_score(clf, X, y, scoring='accuracy', cv=10)
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))

Accuracy: 0.92 (+/- 0.06) [dt]
Accuracy: 0.91 (+/- 0.05) [nb]
Accuracy: 0.95 (+/- 0.05) [kNN]
Accuracy: 0.95 (+/- 0.04) [svm]
Accuracy: 0.94 (+/- 0.06) [Ensemble]
