# Machine Learning 1 - Nearest Neighbors and Decision Trees

## Lab objectives

* Classification with decision trees and random forests.
* Cross-validation and evaluation.

In [10]:
from lab_tools import CIFAR10, get_hog_image

dataset = CIFAR10('./CIFAR10/')

Pre-loading training data
Pre-loading test data


## 1. Nearest Neighbor

The following example uses the Nearest Neighbor algorithm on the Histogram of Oriented Gradients (HoG) descriptors in the dataset.
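To make the idea concrete, here is a minimal from-scratch sketch of 1-nearest-neighbor prediction. It is an illustration only, not how scikit-learn implements it: the function name `nearest_neighbor_predict` and the random arrays are assumptions standing in for the HoG descriptors and labels. It also shows why a 1-NN classifier scored on its own training set reaches perfect accuracy when all training points are distinct: every point is at distance 0 from itself.

```python
import numpy as np

def nearest_neighbor_predict(X_train, y_train, X_query):
    """Label each query point with the label of its closest training point."""
    # Pairwise squared Euclidean distances, shape (n_query, n_train).
    diff = X_query[:, None, :] - X_train[None, :, :]
    dist2 = np.sum(diff ** 2, axis=2)
    return y_train[np.argmin(dist2, axis=1)]

# Hypothetical data: 50 random 8-dimensional descriptors with 3 classes.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 8))
y_train = rng.integers(0, 3, size=50)

# Each training point is its own nearest neighbor, so every label is recovered.
train_pred = nearest_neighbor_predict(X_train, y_train, X_train)
print("Training accuracy:", np.mean(train_pred == y_train))
```

The training accuracy therefore says nothing about how the classifier behaves on unseen data, which is what the test set and cross-validation are for.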

In [11]:
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=1)
clf.fit( dataset.train['hog'], dataset.train['labels'] )

KNeighborsClassifier(n_neighbors=1)

* What is the **descriptive performance** of this classifier?
* Modify the code to estimate the **predictive performance**.
* Use cross-validation to find the best hyper-parameters for this method.
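As a rough illustration of what cross-validation does, the sketch below splits a dataset into five stratified folds by hand and compares the result with `cross_val_score`. The synthetic data from `make_classification` is an assumption standing in for the HoG descriptors, and the explicit `StratifiedKFold(n_splits=5)` is chosen to mirror the default that recent scikit-learn versions use for classifiers.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class data (assumption: a stand-in for dataset.train['hog']).
X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)

clf = KNeighborsClassifier(n_neighbors=3)
cv = StratifiedKFold(n_splits=5)

# Manual loop: train on four folds, score on the held-out fold.
manual_scores = []
for train_idx, val_idx in cv.split(X, y):
    clf.fit(X[train_idx], y[train_idx])
    manual_scores.append(clf.score(X[val_idx], y[val_idx]))
manual_scores = np.array(manual_scores)

# The same procedure in a single call.
auto_scores = cross_val_score(clf, X, y, cv=cv)

print("Manual mean accuracy:", manual_scores.mean())
print("cross_val_score mean:", auto_scores.mean())
```

Because every sample is held out exactly once, the mean of the fold scores estimates predictive performance without touching the test set.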

In [12]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

In [4]:
pred = clf.predict(dataset.train['hog'])
score = accuracy_score(dataset.train['labels'], pred)
print("Descriptive score", score)
cm = confusion_matrix(dataset.train['labels'], pred)
print(cm)

Descriptive score 1.0
[[5000    0    0]
 [   0 5000    0]
 [   0    0 5000]]


In [5]:
pred = clf.predict(dataset.test['hog'])
score = accuracy_score(dataset.test['labels'], pred)
print("Predictive score", score)
cm = confusion_matrix(dataset.test['labels'], pred)
print(cm)

Predictive score 0.694
[[609 258 133]
 [ 63 754 183]
 [ 26 255 719]]


The hyper-parameters explored are the following:

In [6]:
n_neighbors = [1, 2, 3]
weights = ['uniform', 'distance']
p = [1, 2]

In [13]:
from sklearn.model_selection import cross_val_score
import numpy as np

In [8]:
for n in n_neighbors:
    print("Number of neighbors ", n)
    for w in weights:
        print("Weights ", w)
        for power in p:
            print("Power ", power)
            clf = KNeighborsClassifier(n_neighbors=n, weights=w, p=power)
            scores = cross_val_score(clf,dataset.train['hog'],dataset.train['labels'])
            mean = np.mean(scores)
            print("La moyenne est ", mean)

Number of neighbors  1
Weights  uniform
Power  1
La moyenne est  0.7456
Power  2
La moyenne est  0.6878
Weights  distance
Power  1
La moyenne est  0.7456
Power  2
La moyenne est  0.6878
Number of neighbors  2
Weights  uniform
Power  1
La moyenne est  0.7438
Power  2
La moyenne est  0.6868666666666667
Weights  distance
Power  1
La moyenne est  0.7456
Power  2
La moyenne est  0.6878
Number of neighbors  3
Weights  uniform
Power  1
La moyenne est  0.7695333333333333
Power  2
La moyenne est  0.7060666666666666
Weights  distance
Power  1
La moyenne est  0.7635333333333333
Power  2
La moyenne est  0.6988


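Accuracy is highest with 3 neighbors; 1 and 2 neighbors give nearly identical scores. `p = 1` (Manhattan distance) consistently gives better results than `p = 2` (Euclidean distance).
For 3 neighbors, `weights = 'uniform'` scores slightly higher than `weights = 'distance'` (0.770 vs. 0.764 with `p = 1`).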
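The nested loops used for this search can also be written with scikit-learn's `GridSearchCV`, which runs the same cross-validation for every combination in a parameter grid and keeps track of the best one. The sketch below is not the notebook's method; it uses synthetic data from `make_classification` as an assumed stand-in for the HoG descriptors, so its scores are unrelated to the ones reported here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class data (assumption: a stand-in for dataset.train['hog']).
X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)

# Same grid as the manual loops: 3 x 2 x 2 = 12 combinations.
param_grid = {
    "n_neighbors": [1, 2, 3],
    "weights": ["uniform", "distance"],
    "p": [1, 2],
}

search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best mean CV accuracy:", search.best_score_)
```

The full table of per-combination scores is available in `search.cv_results_`, which makes it easier to compare settings than reading printed output.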

In [9]:
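n_neighbors = [3, 4, 5]
# leaf_size only affects the speed and memory use of the tree-based neighbor
# search, not the predictions, so the scores do not depend on it.
# p is left at its default value (2) in this cell.
leaf_size = [20, 30, 40]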
for n in n_neighbors:
    print("Number of neighbors ", n)
    for w in weights:
        print("Weights ", w)
        for leaf in leaf_size:
            print("Leaf size ", leaf)
            clf = KNeighborsClassifier(n_neighbors=n, weights=w, leaf_size=leaf)
            scores = cross_val_score(clf,dataset.train['hog'],dataset.train['labels'])
            mean = np.mean(scores)
            print("La moyenne est ", mean)

Number of neighbors  3
Weights  uniform
Leaf size  20
La moyenne est  0.7060666666666666
Leaf size  30
La moyenne est  0.7060666666666666
Leaf size  40
La moyenne est  0.7060666666666666
Weights  distance
Leaf size  20
La moyenne est  0.6988
Leaf size  30
La moyenne est  0.6988
Leaf size  40
La moyenne est  0.6988
Number of neighbors  4
Weights  uniform
Leaf size  20
La moyenne est  0.7002
Leaf size  30
La moyenne est  0.7002
Leaf size  40
La moyenne est  0.7002
Weights  distance
Leaf size  20
La moyenne est  0.7051999999999999
Leaf size  30
La moyenne est  0.7051999999999999
Leaf size  40
La moyenne est  0.7051999999999999
Number of neighbors  5
Weights  uniform
Leaf size  20
La moyenne est  0.7097333333333333
Leaf size  30
La moyenne est  0.7097333333333333
Leaf size  40
La moyenne est  0.7097333333333333
Weights  distance
Leaf size  20
La moyenne est  0.7029333333333333
Leaf size  30
La moyenne est  0.7029333333333333
Leaf size  40
La moyenne est  0.7029333333333333


In [17]:
n_neighbors = [3, 4, 5]
leaf_size = [30, 50]
for n in n_neighbors:
    print("Number of neighbors ", n)
    for leaf in leaf_size:
        print("Leaf size ", leaf)
        clf = KNeighborsClassifier(n_neighbors=n, weights='uniform', leaf_size=leaf, p=1)
        scores = cross_val_score(clf,dataset.train['hog'],dataset.train['labels'])
        mean = np.mean(scores)
        print("La moyenne est ", mean)

Number of neighbors  3
Leaf size  30
La moyenne est  0.7695333333333333
Leaf size  50
La moyenne est  0.7695333333333333
Number of neighbors  4
Leaf size  30
La moyenne est  0.7636
Leaf size  50
La moyenne est  0.7636
Number of neighbors  5
Leaf size  30
La moyenne est  0.7752666666666668
Leaf size  50
La moyenne est  0.7752666666666668


In [18]:
n_neighbors = [5, 6, 7]
for n in n_neighbors:
    print("Number of neighbors ", n)
    clf = KNeighborsClassifier(n_neighbors=n, p=1)
    scores = cross_val_score(clf,dataset.train['hog'],dataset.train['labels'])
    mean = np.mean(scores)
    print("La moyenne est ", mean)

Number of neighbors  5
La moyenne est  0.7752666666666668
Number of neighbors  6
La moyenne est  0.7689333333333332
Number of neighbors  7
La moyenne est  0.7693333333333333


In [19]:
n_neighbors = [5,6,7,8,9,10]
leaf_size = [30, 90, 120]
for n in n_neighbors:
    print("Number of neighbors ", n)
    for leaf in leaf_size:
        print("Leaf size ", leaf)
        clf = KNeighborsClassifier(n_neighbors=n, weights='uniform', leaf_size=leaf, p=1)
        scores = cross_val_score(clf,dataset.train['hog'],dataset.train['labels'])
        mean = np.mean(scores)
        print("La moyenne est ", mean)

Number of neighbors  5
Leaf size  30
La moyenne est  0.7752666666666668
Leaf size  90
La moyenne est  0.7752666666666668
Leaf size  120
La moyenne est  0.7752666666666668
Number of neighbors  6
Leaf size  30
La moyenne est  0.7689333333333332
Leaf size  90
La moyenne est  0.7689333333333332
Leaf size  120
La moyenne est  0.7689333333333332
Number of neighbors  7
Leaf size  30
La moyenne est  0.7693333333333333
Leaf size  90
La moyenne est  0.7693333333333333
Leaf size  120
La moyenne est  0.7693333333333333
Number of neighbors  8
Leaf size  30
La moyenne est  0.7696
Leaf size  90
La moyenne est  0.7696
Leaf size  120
La moyenne est  0.7696
Number of neighbors  9
Leaf size  30
La moyenne est  0.7725333333333333
Leaf size  90
La moyenne est  0.7725333333333333
Leaf size  120
La moyenne est  0.7725333333333333
Number of neighbors  10
Leaf size  30
La moyenne est  0.7700000000000001
Leaf size  90
La moyenne est  0.7700000000000001
Leaf size  120
La moyenne est  0.7700000000000001


In [None]:
n_neighbors = [5,10,15,20]
power = [1,2]
for n in n_neighbors:
    print("Number of neighbors ", n)
    for p in power:
        print("Power ", p)
        clf = KNeighborsClassifier(n_neighbors=n, weights='uniform', p=p)
        scores = cross_val_score(clf,dataset.train['hog'],dataset.train['labels'])
        mean = np.mean(scores)
        print("La moyenne est ", mean)

## 2. Decision Trees

[Decision Trees](http://scikit-learn.org/stable/modules/tree.html#tree) classify the data by splitting the feature space according to simple, single-feature rules. Scikit-learn uses the [CART](https://en.wikipedia.org/wiki/Predictive_analytics#Classification_and_regression_trees_.28CART.29) algorithm for [its implementation](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) of the classifier. 
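To show what a "simple, single-feature rule" looks like in practice, the sketch below fits a tree limited to one split and reads the rule back from the fitted model. The synthetic two-class data is an assumption used only for illustration; `tree_.feature`, `tree_.threshold` and `tree_.children_left` are the arrays scikit-learn exposes for the fitted tree structure.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic 2-class data (assumption: for illustration only).
X, y = make_classification(n_samples=200, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
# scikit-learn trees work in float32 internally; cast so the manual check is exact.
X = X.astype(np.float32)

# max_depth=1 keeps a single split so the rule is easy to read.
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
stump.fit(X, y)

feature = stump.tree_.feature[0]      # index of the feature tested at the root
threshold = stump.tree_.threshold[0]  # samples with X[:, feature] <= threshold go left
print(f"Root rule: x[{feature}] <= {threshold:.3f}")

# Reproduce the routing by hand: the rule alone decides which leaf a sample reaches.
goes_left = X[:, feature] <= threshold
leaf_ids = stump.apply(X)
left_leaf = stump.tree_.children_left[0]
print("Rule matches tree routing:", np.array_equal(goes_left, leaf_ids == left_leaf))
```

A full tree repeats this kind of split recursively inside each region until a stopping condition such as `min_samples_split` or `max_depth` is met.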

* **Create a simple Decision Tree classifier** using scikit-learn and train it on the HoG training set.
* Use cross-validation to find the best hyper-parameters for this method.

In [20]:
from sklearn import tree

clf = tree.DecisionTreeClassifier()
clf.fit( dataset.train['hog'], dataset.train['labels'] )

DecisionTreeClassifier()

In [21]:
pred = clf.predict(dataset.train['hog'])
score = accuracy_score(dataset.train['labels'], pred)
print("Descriptive score", score)
cm = confusion_matrix(dataset.train['labels'], pred)
print(cm)

Descriptive score 1.0
[[5000    0    0]
 [   0 5000    0]
 [   0    0 5000]]


In [22]:
pred = clf.predict(dataset.test['hog'])
score = accuracy_score(dataset.test['labels'], pred)
print("Predictive score", score)
cm = confusion_matrix(dataset.test['labels'], pred)
print(cm)

Predictive score 0.5696666666666667
[[598 231 171]
 [208 531 261]
 [156 264 580]]


In [28]:
splitter = ["random", "best"]
criterion = ["gini", "entropy"]
min_samples_split = [0.5, 2, 3, 4]

In [29]:
for s in splitter:
    print("Splitter ", s)
    for c in criterion:
        print("Criterion ", c)
        for m in min_samples_split:
            print("Min samples split ", m)
            clf = tree.DecisionTreeClassifier(splitter=s, criterion=c, min_samples_split=m)
            scores = cross_val_score(clf,dataset.train['hog'],dataset.train['labels'])
            mean = np.mean(scores)
            print("La moyenne est ", mean)

Splitter  random
Criterion  gini
Min samples split  0.5
La moyenne est  0.4886666666666667
Min samples split  2
La moyenne est  0.5566000000000001
Min samples split  3
La moyenne est  0.5565333333333333
Min samples split  4
La moyenne est  0.5501333333333334
Criterion  entropy
Min samples split  0.5
La moyenne est  0.4698666666666667
Min samples split  2
La moyenne est  0.5572666666666666
Min samples split  3
La moyenne est  0.5610666666666666
Min samples split  4
La moyenne est  0.5574
Splitter  best
Criterion  gini
Min samples split  0.5
La moyenne est  0.5302
Min samples split  2
La moyenne est  0.5754666666666666
Min samples split  3
La moyenne est  0.5788666666666668
Min samples split  4
La moyenne est  0.5763999999999999
Criterion  entropy
Min samples split  0.5
La moyenne est  0.5006
Min samples split  2
La moyenne est  0.5704666666666667
Min samples split  3
La moyenne est  0.5731333333333333
Min samples split  4
La moyenne est  0.5691333333333334


In [30]:
max_features=["sqrt", "log2"]
min_samples_split = [0.8, 2, 3, 4, 5]

In [31]:
for f in max_features:
    print("Max features ", f)
    for c in criterion:
        print("Criterion ", c)
        for m in min_samples_split:
            print("Min samples split ", m)
            clf = tree.DecisionTreeClassifier(splitter="best", criterion=c, min_samples_split=m, max_features=f)
            scores = cross_val_score(clf,dataset.train['hog'],dataset.train['labels'])
            mean = np.mean(scores)
            print("La moyenne est ", mean)

Max features  sqrt
Criterion  gini
Min samples split  0.8
La moyenne est  0.4468
Min samples split  2
La moyenne est  0.5424666666666667
Min samples split  3
La moyenne est  0.5406
Min samples split  4
La moyenne est  0.5468
Min samples split  5
La moyenne est  0.5486666666666667
Criterion  entropy
Min samples split  0.8
La moyenne est  0.4267333333333333
Min samples split  2
La moyenne est  0.5506
Min samples split  3
La moyenne est  0.5436666666666667
Min samples split  4
La moyenne est  0.5445333333333333
Min samples split  5
La moyenne est  0.5526
Max features  log2
Criterion  gini
Min samples split  0.8
La moyenne est  0.43946666666666667
Min samples split  2
La moyenne est  0.5344
Min samples split  3
La moyenne est  0.5254
Min samples split  4
La moyenne est  0.5275333333333333
Min samples split  5
La moyenne est  0.5346666666666667
Criterion  entropy
Min samples split  0.8
La moyenne est  0.43406666666666666
Min samples split  2
La moyenne est  0.5360666666666666
Min samples sp

In [34]:
min_samples_split = [2, 4, 5, 7, 9, 10]

In [35]:
for m in min_samples_split:
    print("Min samples split ", m)
    clf = tree.DecisionTreeClassifier(splitter="best", criterion="gini", min_samples_split=m)
    scores = cross_val_score(clf,dataset.train['hog'],dataset.train['labels'])
    mean = np.mean(scores)
    print("La moyenne est ", mean)

Min samples split  2
La moyenne est  0.5757333333333333
Min samples split  4
La moyenne est  0.5717333333333333
Min samples split  5
La moyenne est  0.5752666666666667
Min samples split  7
La moyenne est  0.5794666666666666
Min samples split  9
La moyenne est  0.5746666666666667
Min samples split  10
La moyenne est  0.5758666666666666


## 3. Random Forests

[Random Forest](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) classifiers use multiple decision trees trained on "weaker" datasets (less data and/or fewer features), averaging the results so as to reduce over-fitting.
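The idea can be sketched by hand with plain decision trees: each tree is trained on a bootstrap sample of the rows and only considers a random subset of features at each split, and the per-tree class probabilities are averaged. This is a simplified illustration under assumed synthetic data and an arbitrary choice of 25 trees, not a reproduction of `RandomForestClassifier`'s internals.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data (assumption: a stand-in for the HoG descriptors).
X, y = make_classification(n_samples=600, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # arbitrary choice for this sketch
trees = []
for i in range(n_trees):
    # Bootstrap sample: draw rows with replacement, so each tree sees
    # only part of the distinct training rows.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # max_features="sqrt": a random subset of features is considered at each split.
    t = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(t.fit(X_train[idx], y_train[idx]))

# Average the per-tree class probabilities, then pick the most likely class.
proba = np.mean([t.predict_proba(X_test) for t in trees], axis=0)
forest_pred = np.argmax(proba, axis=1)

print("Hand-built forest accuracy:", np.mean(forest_pred == y_test))
print("Single tree accuracy:", trees[0].score(X_test, y_test))
```

Each individual tree still over-fits its own sample, but the errors of different trees are only partly correlated, so averaging tends to cancel some of them out.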

* Use scikit-learn to **create a Random Forest classifier** on the CIFAR data. 
* Use cross-validation to find the best hyper-parameters for this method.

In [37]:
from sklearn import ensemble

clf = ensemble.RandomForestClassifier()
clf.fit( dataset.train['hog'], dataset.train['labels'] )

RandomForestClassifier()

In [38]:
pred = clf.predict(dataset.train['hog'])
score = accuracy_score(dataset.train['labels'], pred)
print("Descriptive score", score)
cm = confusion_matrix(dataset.train['labels'], pred)
print(cm)

Descriptive score 1.0
[[5000    0    0]
 [   0 5000    0]
 [   0    0 5000]]


In [39]:
pred = clf.predict(dataset.test['hog'])
score = accuracy_score(dataset.test['labels'], pred)
print("Predictive score", score)
cm = confusion_matrix(dataset.test['labels'], pred)
print(cm)

Predictive score 0.772
[[787 158  55]
 [121 745 134]
 [ 55 161 784]]


In [47]:
n_estimators = [50, 100, 150, 200]
bootstrap = [True, False]
class_weight = ["balanced", "balanced_subsample", None]

In [42]:
for e in n_estimators:
    print("Nb estimators ", e)
    for b in bootstrap:
        if b :
            print("Bootstrap")
        else:
            print("No bootstrap")
        clf = ensemble.RandomForestClassifier(n_estimators=e, bootstrap=b)
        scores = cross_val_score(clf,dataset.train['hog'],dataset.train['labels'])
        mean = np.mean(scores)
        print("La moyenne est ", mean)            

Nb estimators  50
Bootstrap
La moyenne est  0.7462666666666666
No bootstrap
La moyenne est  0.7494666666666666
Nb estimators  100
Bootstrap
La moyenne est  0.7558666666666667
No bootstrap
La moyenne est  0.7628666666666666
Nb estimators  150
Bootstrap
La moyenne est  0.7661333333333333
No bootstrap
La moyenne est  0.7689333333333332
Nb estimators  200
Bootstrap
La moyenne est  0.7668666666666667
No bootstrap
La moyenne est  0.7710666666666667


In [43]:
n_estimators = [250, 300, 350]

In [44]:
for e in n_estimators:
    print("Nb estimators ", e)
    clf = ensemble.RandomForestClassifier(n_estimators=e, bootstrap=False)
    scores = cross_val_score(clf,dataset.train['hog'],dataset.train['labels'])
    mean = np.mean(scores)
    print("La moyenne est ", mean)    

Nb estimators  250
La moyenne est  0.7719333333333334
Nb estimators  300
La moyenne est  0.7715333333333334
Nb estimators  350
La moyenne est  0.7734666666666666


In [45]:
n_estimators = [350, 400, 450]

In [46]:
for e in n_estimators:
    print("Nb estimators ", e)
    clf = ensemble.RandomForestClassifier(n_estimators=e, bootstrap=False)
    scores = cross_val_score(clf,dataset.train['hog'],dataset.train['labels'])
    mean = np.mean(scores)
    print("La moyenne est ", mean)    

Nb estimators  350
La moyenne est  0.7761333333333333
Nb estimators  400
La moyenne est  0.7734666666666666
Nb estimators  450
La moyenne est  0.7769333333333334


In [48]:
n_estimators = [350, 400, 450, 500, 550]

In [49]:
for e in n_estimators:
    print("Nb estimators ", e)
    for w in class_weight:
        print("Class weight ", w)
        clf = ensemble.RandomForestClassifier(n_estimators=e, bootstrap=False, class_weight=w)
        scores = cross_val_score(clf,dataset.train['hog'],dataset.train['labels'])
        mean = np.mean(scores)
        print("La moyenne est ", mean)    

Nb estimators  350
Class weight  balanced
La moyenne est  0.7738666666666666
Class weight  balanced_subsample
La moyenne est  0.7742000000000001
Class weight  None
La moyenne est  0.7723333333333333
Nb estimators  400
Class weight  balanced
La moyenne est  0.7733333333333333
Class weight  balanced_subsample
La moyenne est  0.7748666666666668
Class weight  None
La moyenne est  0.7734666666666667
Nb estimators  450
Class weight  balanced
La moyenne est  0.7756
Class weight  balanced_subsample
La moyenne est  0.7748666666666667
Class weight  None
La moyenne est  0.7755333333333334
Nb estimators  500
Class weight  balanced
La moyenne est  0.7754
Class weight  balanced_subsample
La moyenne est  0.7750666666666668
Class weight  None
La moyenne est  0.7754666666666666
Nb estimators  550
Class weight  balanced
La moyenne est  0.7771333333333332
Class weight  balanced_subsample
La moyenne est  0.7746666666666666
Class weight  None
La moyenne est  0.7757333333333334
