# TP boosting
## dataset: MNIST
Diane Lingrand
diane.lingrand@univ-cotedazur.fr
2021-2022

In [17]:
from tensorflow.keras.datasets import mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()

print("nb of train samples",len(y_train))

nb of train samples 60000


Display the number of samples in the test dataset:

In [18]:
len(y_test)

10000

Display the first 100 labels of the train dataset:

In [19]:
y_train[:100]

array([5, 0, 4, 1, 9, 2, 1, 3, 1, 4, 3, 5, 3, 6, 1, 7, 2, 8, 6, 9, 4, 0,
       9, 1, 1, 2, 4, 3, 2, 7, 3, 8, 6, 9, 0, 5, 6, 0, 7, 6, 1, 8, 7, 9,
       3, 9, 8, 5, 9, 3, 3, 0, 7, 4, 9, 8, 0, 9, 4, 1, 4, 4, 6, 0, 4, 5,
       6, 1, 0, 0, 1, 7, 1, 6, 3, 0, 2, 1, 1, 7, 9, 0, 2, 6, 7, 8, 3, 9,
       0, 4, 6, 7, 4, 6, 8, 0, 7, 8, 3, 1], dtype=uint8)

For the binary classification we will choose the class of digit '4' and the class of digit '8' in the MNIST dataset. Feel free to change the classes.

In [20]:
import numpy as np
from sklearn.utils import shuffle

In [21]:
# class of '4'
x_train4 = x_train[y_train==4,:]
# class of '8'
x_train8 = x_train[y_train==8,:]

# together
x_trainBinaire = np.append(x_train4,x_train8,axis=0)
# '4' as negative class and '8' as positive class
y_trainBinaire = np.append(np.full(len(x_train4),-1), np.full(len(x_train8),1))

# dimensions ?
print(x_trainBinaire.shape, y_trainBinaire.shape)

# shuffle. why ?
(x_trainBinaire,y_trainBinaire) = shuffle(x_trainBinaire,y_trainBinaire,random_state=0)

(11693, 28, 28) (11693,)


## binary boosting: directly on image pixels
An image = a 1-d array of pixels

In [22]:
n = x_trainBinaire.shape[0]
x_trainBinaire = x_trainBinaire.reshape(n,-1)
print(x_trainBinaire.shape)

(11693, 784)
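
As a small illustration of this flattening, here is a sketch on a made-up array rather than on MNIST itself. `reshape(n, -1)` keeps the first axis and lets NumPy infer the second one, here 28 × 28 = 784, with the pixels stored row by row. The array `images` is a hypothetical stand-in for `x_trainBinaire`.

```python
import numpy as np

# hypothetical stand-in for x_trainBinaire: 3 fake "images" of 28x28 pixels
images = np.arange(3 * 28 * 28).reshape(3, 28, 28)

# flatten each image into one row; -1 lets NumPy infer 28*28 = 784
flat = images.reshape(images.shape[0], -1)
print(flat.shape)

# row-major order: pixel (row r, column c) ends up at index r*28 + c
r, c = 5, 7
print(flat[1, r * 28 + c] == images[1, r, c])

# the operation is reversible: the 2-d images can be recovered when needed
restored = flat.reshape(-1, 28, 28)
print(np.array_equal(restored, images))
```

The last line matters later on: any processing that needs the 2-d layout of the image has to reshape the 784 values back to 28 × 28 first.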


What are the dimensions of x_trainBinaire ? Explain the values.

In [23]:
from sklearn import ensemble
from sklearn.metrics import confusion_matrix
from sklearn.utils import shuffle

In [24]:
x_trainBinaire.shape, y_trainBinaire.shape

((11693, 784), (11693,))

In [50]:
## learning the boosting (Adaboost)
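# create the boosting object
# (algorithm='SAMME.R' is deprecated since scikit-learn 1.4: on recent versions, omit this argument)
myboosting = ensemble.AdaBoostClassifier(n_estimators=20, learning_rate=1, algorithm='SAMME.R')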
# learn on the train dataset
myboosting.fit(x_trainBinaire,y_trainBinaire)
# prediction of train data: should be similar to labels
y_predBinaire = myboosting.predict(x_trainBinaire)
print('confusion matrix on train data',confusion_matrix(y_trainBinaire,y_predBinaire))

confusion matrix on train data [[5733  109]
 [ 111 5740]]


We have displayed the confusion matrix on the train dataset. It should be computed on the test dataset. Let's do it!

In [51]:
# TO DO
# preprocessing of test data (2 classes), same as for the train data
# class of '4'
x_test4 = x_test[y_test==4,:]
# class of '8'
x_test8 = x_test[y_test==8,:]


# together
x_testBinaire = np.append(x_test4,x_test8,axis=0)
# '4' as negative class and '8' as positive class
y_testBinaire = np.append(np.full(len(x_test4),-1), np.full(len(x_test8),1))

# dimensions ?
print(x_testBinaire.shape, y_testBinaire.shape)
x_testBinaire = x_testBinaire.reshape(x_testBinaire.shape[0],-1)
print(x_testBinaire.shape, y_testBinaire.shape)

# # shuffle. why ?
# (x_testBinaire,y_testBinaire) = shuffle(x_testBinaire,y_testBinaire,random_state=0)


# compute and display the confusion matrix
confusion_matrix(y_testBinaire, myboosting.predict(x_testBinaire))


(1956, 28, 28) (1956,)
(1956, 784) (1956,)


array([[963,  19],
       [ 27, 947]])
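
To compare the two confusion matrices with a single number, one option is to turn them into accuracy and per-class recall. The sketch below only reuses the values printed above; the helper names `accuracy` and `recall_per_class` are hypothetical, and it assumes the scikit-learn convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

def accuracy(cm):
    """Fraction of correct predictions: diagonal over total."""
    cm = np.asarray(cm)
    return np.trace(cm) / cm.sum()

def recall_per_class(cm):
    """Correct predictions of each true class divided by the row total."""
    cm = np.asarray(cm)
    return np.diag(cm) / cm.sum(axis=1)

# values copied from the outputs above (row 0 = '4', row 1 = '8')
cm_train = [[5733, 109], [111, 5740]]
cm_test = [[963, 19], [27, 947]]

print("train accuracy: %.4f" % accuracy(cm_train))
print("test accuracy:  %.4f" % accuracy(cm_test))
print("test recall for '4' and '8':", recall_per_class(cm_test).round(4))
```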

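How good is the result? What happens when you modify the variable n_estimators?

Pretty good: with 20 estimators, 46 of the 1956 test images are misclassified (about 97.6% accuracy), close to the 98.1% accuracy obtained on the train data. Increasing `n_estimators` improves the confusion matrix, but training takes more time.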
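
One way to study the effect of `n_estimators` without retraining for each value is `staged_score`, which reports the accuracy after each boosting round of a single fitted model. This is a minimal sketch on synthetic data: `X`, `y` and the split are hypothetical stand-ins for the MNIST '4'/'8' arrays, and the `algorithm` argument is left at its default.

```python
import numpy as np
from sklearn import ensemble
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# synthetic 2-class data, hypothetical stand-in for the MNIST '4'/'8' arrays
X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# train once with the largest number of estimators we want to study
boost = ensemble.AdaBoostClassifier(n_estimators=50, random_state=0)
boost.fit(X_tr, y_tr)

# staged_score yields the accuracy after 1, 2, ..., k weak learners
test_scores = np.array(list(boost.staged_score(X_te, y_te)))
for k in (1, 5, 10, 20, 50):
    if k <= len(test_scores):  # boosting may stop before n_estimators
        print("n_estimators=%2d  test accuracy=%.3f" % (k, test_scores[k - 1]))
```

On the real data, replace the synthetic arrays with `x_trainBinaire`, `y_trainBinaire`, `x_testBinaire` and `y_testBinaire`; the curve shows where adding estimators stops paying off.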

## binary boosting using Haar filters
First step: prepare the data before running the boosting algorithm.

### Haar filters

In [48]:
from skimage import feature
from skimage import transform

For Haar filters, you can choose between two options:
- automatic generation
- hand-made filters

In [60]:
# automatic generation from 2 types: 
#       'type-2-x' and 'type-2-y'
# and dimensions of images: 28x28
feat_coord, feat_type = feature.haar_like_feature_coord(28,28, ['type-2-x','type-2-y'])
feat_coord.shape, feat_type.shape, x_train[0].reshape(-1).shape

((158760,), (158760,), (784,))

How many filters ? Compare to the number of pixels ...
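
To see where such a large number comes from, the filters can be counted by brute force. The sketch below is not the library's code: the enumeration rule (rectangle height from 1 to height-1, half-width from 1 to width-1) is an assumption, checked only against the total of 158760 printed above. The function name `count_type_2_x` is hypothetical.

```python
def count_type_2_x(width, height):
    """Brute-force count of 'type-2-x' filters inside a width x height window.

    Assumption: the rectangle height dy ranges over 1..height-1 and the
    half-width dx over 1..width-1, which reproduces the count shown above.
    """
    count = 0
    for y in range(height):
        for x in range(width):
            for dy in range(1, height):
                for dx in range(1, width):
                    # two rectangles of size dy x dx, side by side along x
                    if y + dy <= height and x + 2 * dx <= width:
                        count += 1
    return count

n_2x = count_type_2_x(28, 28)
# 'type-2-y' is the same pattern with rows and columns swapped,
# so on a square window it has the same count
n_filters = 2 * n_2x
n_pixels = 28 * 28
print(n_2x, n_filters, n_pixels, n_filters / n_pixels)
```

Each filter is defined by a position and a size, so the count grows much faster than the number of pixels: here there are about 200 filters per pixel.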

In [74]:
# transformation of images: apply all filters
ftrain = []

for image in x_trainBinaire:
    # x_trainBinaire was flattened to 784 values: restore the 28x28 image
    image = image.reshape(28, 28)
    # integral image computation (on the 2-d image, not on the flat vector)
    int_image = transform.integral_image(image)
    # Haar filters computation
    features = feature.haar_like_feature(int_image, 0, 0, 28, 28,feature_type=feat_type,feature_coord=feat_coord)
    # collect in a list: np.append inside the loop copies the whole array each time
    ftrain.append(features)

ftrain = np.array(ftrain)

(28, 28)
...

KeyboardInterrupt: 

The previous cell may run into memory problems: with 11693 images and 158760 filters, the feature matrix holds about 1.86 billion values. Try to remove some filters. Which ones? How many?
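
A rough size estimate helps to decide how many filters to keep. The sketch below assumes 8 bytes per feature value, and shows one possible way (not the only one) to reduce the set: drawing a random subset. The helper `select_filters` and the toy arrays are hypothetical; the toy arrays only mimic the layout of `feat_coord` and `feat_type`.

```python
import numpy as np

n_images, n_filters = 11693, 158760
# 8 bytes per value if the features are stored as 64-bit numbers (assumption)
gib = n_images * n_filters * 8 / 2**30
print("full feature matrix: %.1f GiB" % gib)

def select_filters(feat_coord, feat_type, n_keep, seed=0):
    """Hypothetical helper: keep n_keep filters drawn uniformly at random."""
    rng = np.random.default_rng(seed)
    keep = np.sort(rng.choice(len(feat_type), size=n_keep, replace=False))
    return feat_coord[keep], feat_type[keep]

# toy stand-ins with the same layout as feat_coord and feat_type
toy_coord = np.empty(1000, dtype=object)
for i in range(1000):
    toy_coord[i] = [[(0, 0), (0, 0)], [(0, 1), (0, 1)]]
toy_type = np.array(['type-2-x'] * 500 + ['type-2-y'] * 500)

sub_coord, sub_type = select_filters(toy_coord, toy_type, n_keep=50)
print(sub_coord.shape, sub_type.shape)
```

Random selection is only a baseline. Other choices, such as keeping only large rectangles or a regular grid of positions, may be more informative for digits and are worth comparing.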

In [None]:
# for you

Another solution: let's build the list of filters!

In [13]:
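# corners are given as (row, column): [top-left, bottom-right] for each rectangle
feat_coord = np.array([list([[(0, 0), (27, 13)], [(0, 14), (27, 27)]]),
       list([[(0, 0), (13, 13)], [(14, 0), (27, 13)]])])
# this is just an example: write the list of filters you think you need
# first filter: left half vs right half ('type-2-x')
# second filter: upper-left block vs the block below it ('type-2-y')
feat_type = np.array(['type-2-x', 'type-2-y'])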
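
When writing filters by hand, it helps to check on a toy image what a two-rectangle filter computes. The sketch below does this with NumPy only. It assumes corners are given as (row, column) with both corners included; the sign convention (second rectangle minus first) is an assumption and may differ from the library's. All function names are hypothetical, and the two filters are illustrative choices.

```python
import numpy as np

def integral_image(img):
    """Cumulative sum over rows, then over columns."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top_left, bottom_right):
    """Sum of pixels in the rectangle, both corners included, as (row, col)."""
    (r0, c0), (r1, c1) = top_left, bottom_right
    total = ii[r1, c1]
    if r0 > 0:
        total -= ii[r0 - 1, c1]
    if c0 > 0:
        total -= ii[r1, c0 - 1]
    if r0 > 0 and c0 > 0:
        total += ii[r0 - 1, c0 - 1]
    return total

def two_rect_feature(ii, rect_a, rect_b):
    """Difference of the two rectangle sums (sign convention is an assumption)."""
    return rect_sum(ii, *rect_b) - rect_sum(ii, *rect_a)

# toy 28x28 image: dark left half, bright right half
img = np.zeros((28, 28), dtype=np.int64)
img[:, 14:] = 1
ii = integral_image(img)

left_right = [[(0, 0), (27, 13)], [(0, 14), (27, 27)]]   # split along x
top_bottom = [[(0, 0), (13, 27)], [(14, 0), (27, 27)]]   # split along y
print(two_rect_feature(ii, *left_right))   # right half minus left half
print(two_rect_feature(ii, *top_bottom))   # bottom half minus top half
```

The left/right filter responds strongly to this image while the top/bottom filter does not, which is the kind of contrast a weak learner can threshold. Each rectangle sum costs at most four lookups in the integral image, whatever the rectangle size.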

### boosting
Now compute the binary boosting using the Haar filters representation and compare with the previous one.
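
One possible shape for this step is sketched below, as a starting point rather than a reference solution. The helper `haar_features`, the choice of 200 random filters and the random demo images are all assumptions made for illustration; on the real data, the demo arrays would be replaced by `x_trainBinaire` and `y_trainBinaire`, and the same filters applied to the test images.

```python
import numpy as np
from skimage import feature, transform
from sklearn import ensemble

def haar_features(flat_images, feat_coord, feat_type, side=28):
    """Hypothetical helper: one row of Haar responses per flattened image."""
    rows = []
    for flat in flat_images:
        # restore the 2-d image before computing the integral image
        int_image = transform.integral_image(flat.reshape(side, side))
        rows.append(feature.haar_like_feature(
            int_image, 0, 0, side, side,
            feature_type=feat_type, feature_coord=feat_coord))
    return np.array(rows)

# small random subset of the automatically generated filters
# (assumption: 200 filters is only a starting point, to be tuned)
coord, ftype = feature.haar_like_feature_coord(28, 28, ['type-2-x', 'type-2-y'])
keep = np.sort(np.random.default_rng(0).choice(len(ftype), 200, replace=False))
coord, ftype = coord[keep], ftype[keep]

# random stand-in images; replace with x_trainBinaire / y_trainBinaire
rng = np.random.default_rng(1)
x_demo = rng.integers(0, 256, size=(40, 784))
y_demo = np.where(np.arange(40) % 2 == 0, -1, 1)

f_demo = haar_features(x_demo, coord, ftype)
boost_haar = ensemble.AdaBoostClassifier(n_estimators=20)
boost_haar.fit(f_demo, y_demo)
print(f_demo.shape, boost_haar.score(f_demo, y_demo))
```

For a fair comparison with the pixel-based model, keep `n_estimators` identical and compare the confusion matrices on the test data, not on the train data.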

In [None]:
# for you