## Experiments on the MNIST dataset

The goal of this notebook is to train several classifiers on the classic MNIST database. This database is very popular in the machine learning community as a first test for new algorithms. The dataset is quite simple and artificial: good results on MNIST do not mean that your algorithm is good, but bad results almost certainly mean that it needs improvement.


<img src="https://kermorvant.github.io/csexed-ml/dataiku/images/MnistExamples.png">



You can find reference results on the MNIST dataset [here](http://yann.lecun.com/exdb/mnist/).

### Libraries 

We will use the following libraries:
* sklearn (scikit-learn) for machine learning algorithms
* pandas for manipulating data
* PIL for image processing
* seaborn for graphs


In [None]:
import pandas as pd
import seaborn as sns
# Display the graphs in the notebook
%matplotlib inline

### Feature extraction

The first step is to extract features from the images, that is, to convert each image into a feature vector. All the images have the same size, 32x32 pixels. We have reduced them to 16x16 pixels and use the vector of 256 pixel values as features. This data is available here: https://kermorvant.github.io/csexed-ml/data/
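
The exact procedure used to produce the feature file is not described in this notebook. As a minimal sketch of one common way to reduce the resolution of an image, the code below averages non-overlapping blocks of pixels. The `downsample` helper and the random 32x32 image are assumptions made for illustration only.

```python
import numpy as np

def downsample(image, factor):
    """Reduce resolution by averaging non-overlapping factor x factor blocks."""
    h, w = image.shape
    assert h % factor == 0 and w % factor == 0, "size must be divisible by factor"
    blocks = image.reshape(h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(1, 3))

# Hypothetical 32x32 grayscale image filled with random pixel values
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32)).astype(float)

small = downsample(image, factor=2)   # 16x16 image
features = small.flatten()            # feature vector of 256 values
print(small.shape, features.shape)
```

Each value of `small` is the mean of a 2x2 block of the original image, so the feature vector keeps the overall shape of the digit while dividing the number of features by four.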


The features for all the images are read from the file `MNIST_all_features16.csv.gz`. The features are stored in a matrix `X` and the target class in a vector `y`.


In [None]:
READ_N_SAMPLES = 10000
IMG_SIZE = 16
all_df = pd.read_csv("MNIST_all_features16.csv.gz" ,nrows=READ_N_SAMPLES)

# define the features and the targets
X = all_df.iloc[:,:IMG_SIZE*IMG_SIZE]
y = all_df['class']
all_df.head()


At first glance all the features seem to be equal to 0, but this is not the case: since the digits are centered, the first and last rows of pixels are white (=0). The `describe` method lets us check the distribution of the feature values:

In [None]:
pd.set_option('display.max_columns', 70)
all_df.describe()

After loading the data, we plot the class distribution.

In [None]:
# plot class distribution
sns.countplot(data=all_df,y='class')

Let's have a look at the images after the resolution reduction and check that our dataset is correct, i.e. that the labels correspond to the images:


In [None]:
import random
import matplotlib.pyplot as plt
import numpy as np

number_plot_images = 18
# sample among the rows actually loaded (the file may hold fewer than READ_N_SAMPLES rows)
random_indices = random.sample(range(len(all_df)), number_plot_images)

fig, axes = plt.subplots(3,6, 
                        figsize=(10,5),
                        sharex=True, sharey=True,
                        subplot_kw=dict( aspect='equal')) 

for i in range(number_plot_images):
    
    subplot_row = i//6 
    subplot_col = i%6  
    ax = axes[subplot_row, subplot_col]
    # plot image on subplot
    plottable_image = all_df.iloc[random_indices[i],:IMG_SIZE*IMG_SIZE].values.reshape((IMG_SIZE,IMG_SIZE))
    ax.imshow(plottable_image, cmap='gray_r')
    
    ax.set_title('Digit Label: {}'.format(all_df['class'][random_indices[i]]))
    ax.set_xbound([0,IMG_SIZE])

plt.tight_layout()
plt.show()

### Train/dev/test split

When training a classifier, the data **must** be separated into different sets: at least one training set and one test set. The split must be random, and the class distribution should be (approximately) the same in the training and test sets.
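
One way to keep the class distribution the same in both sets is the `stratify` argument of `train_test_split`. The sketch below uses synthetic, deliberately imbalanced labels (not the MNIST data), and the names `X_demo` and `y_demo` are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, deliberately imbalanced labels (not the MNIST data)
rng = np.random.default_rng(0)
y_demo = pd.Series(rng.choice([0, 1, 2], size=1000, p=[0.6, 0.3, 0.1]))
X_demo = pd.DataFrame(rng.normal(size=(1000, 4)))

# stratify=y_demo asks for the same class proportions in both sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0)

# Class proportions should be nearly identical in both sets
proportions = pd.DataFrame({
    'train': y_tr.value_counts(normalize=True),
    'test': y_te.value_counts(normalize=True),
}).sort_index()
print(proportions)
```

Without `stratify`, a purely random split gives similar proportions only on average, and the difference can be noticeable for rare classes or small datasets.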

**Question:**
> * Use [`train_test_split`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) to create `X_train/y_train` and `X_test/y_test`. Use 80% of the data for training and 20% for testing.
> * Train a [k-nearest neighbors](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) classifier with k=1.
> * Test the k-NN with k=1 on both the training and the test set. Print the score produced by [`clf.score()`](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier.score).

When evaluating a classifier, it is important to report the accuracy (or error rate) on both the training set and the test set. Both values are needed to understand what is wrong with the classifier and how to improve it.


In [None]:
from sklearn.model_selection import train_test_split
from sklearn import neighbors

test_percent = None # YOUR CODE HERE
X_train, X_test, y_train, y_test = train_test_split(None, None, test_size=None, random_state=32)# YOUR CODE HERE

# create a kNN classifier with a given k
k = None # YOUR CODE HERE
clf = neighbors.KNeighborsClassifier(None,n_jobs=-1)# YOUR CODE HERE

# Train the classifier on training set
clf.fit(None,None)# YOUR CODE HERE

# Predict and evaluate on train set
print ('Train accuracy:',clf.score(None, None))# YOUR CODE HERE
# Predict and evaluate on test set
print ('Test accuracy:',clf.score(None,None))# YOUR CODE HERE

### Hyperparameter optimization


The main parameter of the kNN algorithm is the number of neighbors (k). The best value for this parameter depends on the classification task and has to be found by trying different values and selecting the one with the best accuracy. However, this search for the best value **must not** be done on the set used to evaluate the classifier (the test set) but on a validation set. 
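
The proportions of a three-way split need a little arithmetic, because the second split is applied to what remains after the first one. The sketch below works on a plain array of 1000 indices standing in for the samples (an assumption for illustration, not the notebook's data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

indices = np.arange(1000)  # stand-in for 1000 samples (synthetic)

# First split: hold out 20% of all samples as the test set
idx_train_dev, idx_test = train_test_split(indices, test_size=0.2, random_state=0)

# Second split: 20% of the total is 25% of the remaining 80% (0.2 / 0.8 = 0.25)
idx_train, idx_dev = train_test_split(idx_train_dev, test_size=0.25, random_state=0)

print(len(idx_train), len(idx_dev), len(idx_test))
```

The three index sets are disjoint, so a value of k chosen on the validation indices never sees the test samples.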

**Question** : 


>  * Create three sets: train set (60%), validation set (20%) and test set (20%), by calling `train_test_split` twice.
>  * Train a kNN classifier with different values of k in [1,3,5,7,9] and report the train/validation/test accuracy.
>  * Select the best value for k according to the accuracy on the validation (dev) set. Report the performance of the classifier on the test set for this value of k.

In [None]:
from IPython.display import clear_output

# Create Train/dev/test sets
# YOUR CODE HERE
# Create validation set so that train = 60% , validation = 20% and test =  20%
# Create X_train, X_dev, y_train, y_dev, X_test, y_test


#  list of k values to test
k_values = [None] # YOUR CODE HERE

# store the score in a dataframe
df_scores = pd.DataFrame(columns=['train','dev','test'],index=k_values)

# iterate on different values of k
for k in None:# YOUR CODE HERE
    print("k={}".format(k))
    
    # create a kNN classifier with a given k
    clf = neighbors.KNeighborsClassifier(None,n_jobs=-1) # YOUR CODE HERE
    
    # Train the classifier on training set
    clf.fit(None,None) # YOUR CODE HERE
    
    # Compute the classification score on the different sets
    for _name,_X,_y in [('train',X_train,y_train),('dev',X_dev,y_dev),('test',X_test,y_test)]:
        df_scores.at[k,_name] = float("{:.3f}".format(clf.score(None,None))) # YOUR CODE HERE
        clear_output(wait=True)
        print(df_scores)
_g = df_scores.plot()


## Logistic regression

Another simple yet very effective classifier is logistic regression. This classifier is a version of the linear regression model adapted to classification.
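
To make the link with linear regression concrete, here is a minimal sketch of the prediction step in the multiclass (softmax) form of logistic regression: a linear score per class, followed by a normalization into probabilities. The weights `W`, the bias `b` and the input `x` are arbitrary values chosen for illustration, and training and regularization are not shown.

```python
import numpy as np

def softmax(scores):
    """Turn a vector of real-valued scores into probabilities that sum to 1."""
    shifted = scores - scores.max()        # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

# Hypothetical model: 3 classes, 4 features (weights chosen arbitrarily)
W = np.array([[ 0.5, -0.2,  0.1,  0.0],
              [-0.3,  0.8,  0.0,  0.2],
              [ 0.1,  0.1, -0.5,  0.4]])
b = np.array([0.0, 0.1, -0.1])
x = np.array([1.0, 2.0, 0.5, -1.0])

scores = W @ x + b                # linear part, as in linear regression
probabilities = softmax(scores)   # adapted to classification
predicted_class = int(np.argmax(probabilities))
print(probabilities, predicted_class)
```

The predicted class is simply the one with the highest linear score; the softmax only turns the scores into values that can be read as probabilities.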

**Question** : 

> Using the MNIST data, train a [Logistic Regression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) classifier (`clf = linear_model.LogisticRegression()`) on the train set and report the accuracy on the dev and test sets. Compare to the best result of the kNN classifier.



In [None]:
from sklearn import linear_model
# Train a logistic regression model
clf = linear_model.LogisticRegression()
clf.fit(None,None) # YOUR CODE HERE

print ("logistic regression accuracy on dev set:",clf.score(None,None)) # YOUR CODE HERE
print ("logistic regression accuracy on test set:",clf.score(None,None)) # YOUR CODE HERE

The accuracy of the classifiers depends on the size of the training set: in general, the more data, the better the accuracy. We will study the impact of the size of the training set on kNN and Logistic Regression.

For this study, we first load 2000 training samples and split the data into train/test sets with an 80%/20% ratio. Then we train the classifiers on increasingly large subsets corresponding to `[1%,10%,30%,50%,70%,100%]` of the training set. The classifiers are always tested on the same test set (the original one). The following figure explains the different splits:

<p align="center">
  <img src="https://kermorvant.github.io/csexed-ml/images/learning_curve_sets.png" width="300" >
</p>


**Question** : 


> * Report the training and test set accuracies for Logistic Regression, 1NN and kNN (k being the best value for `k` you previously found) on an increasing training set of [1,10,30,50,70,100] percent of the initial training set.
> * Plot the training curves on a plot similar to:

<p align="center">
  <img src="https://kermorvant.github.io/csexed-ml/images/training_curves.png" width="300" align="center">
</p>

You can use [pandas plot function](https://pandas.pydata.org/pandas-docs/stable/visualization.html#basic-plotting-plot).
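
As a point of comparison, scikit-learn also provides a `learning_curve` helper that computes this kind of curve with cross-validation instead of a single fixed test set. The sketch below is an alternative to the manual loop asked for in the question, not a replacement for it, and it runs on synthetic data generated with `make_classification` (an assumption for illustration).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic data standing in for the MNIST features
X_demo, y_demo = make_classification(n_samples=500, n_features=20,
                                     n_informative=10, n_classes=3,
                                     random_state=0)

train_sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(max_iter=1000), X_demo, y_demo,
    train_sizes=[0.1, 0.3, 0.5, 0.7, 1.0], cv=5)

# One row per training size, one column per cross-validation fold
for n, tr, te in zip(train_sizes, train_scores.mean(axis=1), test_scores.mean(axis=1)):
    print("n={:4d}  train={:.3f}  test={:.3f}".format(int(n), tr, te))
```

Here the fractions in `train_sizes` are relative to the training part of each cross-validation fold, so the absolute sizes differ from those of a manual 80%/20% split.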



In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
nb_train = X_train.shape[0]
print (nb_train)
SIZES = [int(nb_train*x/100.) for x in [None]] # YOUR CODE HERE
df_scores = pd.DataFrame(columns=['LogRegTrain','LogRegTest','1NNTrain','1NNTest','2NNTrain','2NNTest'],index=SIZES)

# define the 3 classifiers
clf1 = None # YOUR CODE HERE
clf2 = None # YOUR CODE HERE
clf3 = None # YOUR CODE HERE

# loop on the training set size
for sub_size in SIZES:
    # get the index of the selected samples and extract the corresponding targets
    X_sub = X_train.iloc[:sub_size]  
    y_sub = y_train.iloc[:sub_size]
    
    # train the models
    # YOUR CODE HERE

    # compute accuracy on train set
    df_scores.at[sub_size,'LogRegTrain'] =  None # YOUR CODE HERE
    # compute accuracy on test set
    df_scores.at[sub_size,'LogRegTest'] =  None # YOUR CODE HERE
    
    # compute accuracy on train set
    df_scores.at[sub_size,'1NNTrain'] =  None # YOUR CODE HERE
    # compute accuracy on test set
    df_scores.at[sub_size,'1NNTest'] =  None # YOUR CODE HERE
    
    # compute accuracy on train set
    df_scores.at[sub_size,'2NNTrain'] =  None # YOUR CODE HERE
    # compute accuracy on test set
    df_scores.at[sub_size,'2NNTest'] =  None # YOUR CODE HERE
    clear_output(wait=True)
    print(df_scores)
styles1 = ['bs--','bs-','r^--','r^-','go--','go-']
df_scores.plot(style=styles1)

#### Pipelines

Scikit-learn has a special class for dealing with hyperparameter optimization: [GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html).

We can search for the best values of the hyperparameters  by defining a pipeline. 



In [None]:
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('clf', neighbors.KNeighborsClassifier())
    
])

parameters = {
    'clf__n_neighbors':(1,3)
}


Then, the GridSearchCV object is fitted on the train set with all the possible combinations of parameter values and evaluated on a validation set with cross-validation. The train/dev split and the cross-validation are done automatically. Progress can be monitored by increasing the value of the `verbose` parameter.
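
To see what "all the possible combinations" means in practice, the grid can be enumerated with `ParameterGrid`, the scikit-learn class that expands a parameter dictionary. In this sketch the second hyperparameter (`clf__weights`) and the number of folds are assumptions added only to make the enumeration more visible.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid with two hyperparameters (the second one is only for illustration)
demo_parameters = {
    'clf__n_neighbors': (1, 3),
    'clf__weights': ('uniform', 'distance'),
}

# Each entry is one dictionary of parameter values to try
combinations = list(ParameterGrid(demo_parameters))
for params in combinations:
    print(params)

# With cross-validation, one model is fitted per combination and per fold
n_folds = 5
print("combinations:", len(combinations), "-> fits:", len(combinations) * n_folds)
```

The `clf__` prefix refers to the name of the pipeline step, followed by two underscores and the name of the parameter of that step. The cost of a grid search grows with the product of the number of values of each hyperparameter.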

**Question** : 


>  * Reproduce the experiments on the parameter k in [1,3,5,7,9] from the previous question using GridSearchCV.
>  * Add the exploration of different values of the feature dimension.

In [None]:
from sklearn.model_selection import  GridSearchCV
from sklearn.metrics import classification_report
from pprint import pprint
if __name__ == "__main__":
    # multiprocessing requires the fork to happen in a __main__ protected block

    # Define the grid search to find the best value of each hyperparameter.
    # return_train_score=True is needed to get 'mean_train_score' in cv_results_
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=2, return_train_score=True)

    # Split the features and the targets into train+dev and test sets
    # (use X rather than all_df, which still contains the 'class' column)
    df_train_dev, df_test, y_train_dev, y_test = train_test_split(X, y, test_size=0.2)
    
    print("Performing grid search")
    print("pipeline:", [name for name, _ in pipeline.steps])
    print("parameters:")
    pprint(parameters)
    
    # Run the grid search
    grid_search.fit( None,None) # YOUR CODE HERE
    all_score = pd.DataFrame(grid_search.cv_results_)
    print (all_score[['params','mean_train_score','mean_test_score']])
    
    # Print all experiments results
    print("Grid scores on development set:")
    print()
    means = grid_search.cv_results_['mean_test_score']
    stds = grid_search.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, grid_search.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r"
              % (mean, std * 2, params))
    print()
    
    # Print best experiment results
    print("Best score: %0.3f" % grid_search.best_score_)
    print("Best parameters set:")
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))
        
    # predict on test with best parameters
    y_pred = grid_search.predict(None) # YOUR CODE HERE
    print(classification_report(None,None)) # YOUR CODE HERE
    print()


## Support Vector Machines


We will now optimize a Support Vector Machine (SVM, Séparateurs à Vaste Marge in French) classifier on the MNIST database. These classifiers usually give very good results if they are well tuned.

All SVM classifiers share a common parameter `C`, which controls how many training examples are allowed to be misclassified during the optimization. For small values of `C`, some training examples are allowed to be misclassified if this makes the margin (the distance between the separating line and the support vectors) large. For large values of `C`, the algorithm tries to minimize the number of misclassified training examples, even if this leads to a small margin. The impact of the value of `C` is shown in the following figure:

<p align="center">
  <img src="https://kermorvant.github.io/csexed-ml/images/svm_values_for_C.png" width="400" >
</p>
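
The effect of `C` can also be observed numerically. The sketch below fits a linear SVM on two overlapping synthetic clusters (generated with `make_blobs`, not the MNIST data) for a few values of `C`, and reports the number of support vectors and the training accuracy. On data like this one would typically expect fewer support vectors as `C` grows, but the exact numbers depend on the data.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping synthetic clusters (not the MNIST data)
X_demo, y_demo = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

summary = {}
for C in [0.01, 1, 100]:
    clf_demo = SVC(kernel='linear', C=C).fit(X_demo, y_demo)
    summary[C] = {
        # n_support_ holds the number of support vectors of each class
        'n_support_vectors': int(clf_demo.n_support_.sum()),
        'train_accuracy': clf_demo.score(X_demo, y_demo),
    }
    print(C, summary[C])
```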
 


The RBF kernel (see the [scikit-learn kernels documentation](http://scikit-learn.org/stable/modules/svm.html#svm-kernels)) has one more main parameter that must be optimized on the data: `gamma`.

`gamma` controls the *spread* of the RBF kernel: if `gamma` is small, the kernel takes many training samples into account and the decision boundary is smooth. If `gamma` is large, the kernel focuses on a few training examples and the decision boundary is complex. The impact of `gamma` is illustrated in the following figures:

- `gamma` = 1
<p align="center">
  <img src="https://kermorvant.github.io/csexed-ml/images/svc_parameters_using_rbf_kernel_17_0.png" width="300" >
</p>

- `gamma` = 100
<p align="center">
  <img src="https://kermorvant.github.io/csexed-ml/images/svc_parameters_using_rbf_kernel_21_0.png" width="300" >
</p>
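
The role of `gamma` follows from the formula of the RBF kernel, `exp(-gamma * ||x1 - x2||^2)`. The sketch below computes this value by hand for two arbitrary points and compares it with scikit-learn's `rbf_kernel`; the helper `rbf` and the two points are assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def rbf(x1, x2, gamma):
    """RBF kernel value exp(-gamma * ||x1 - x2||^2) between two vectors."""
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

a = np.array([0.0, 0.0])
b = np.array([1.0, 1.0])   # squared distance to a is 2

for gamma in [0.01, 1, 100]:
    manual = rbf(a, b, gamma)
    reference = rbf_kernel(a.reshape(1, -1), b.reshape(1, -1), gamma=gamma)[0, 0]
    print("gamma={:<6} similarity={:.6f} (sklearn: {:.6f})".format(gamma, manual, reference))
```

For the same pair of points, the similarity is close to 1 when `gamma` is small and close to 0 when `gamma` is large: with a large `gamma`, only training samples very close to a point influence its classification.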

Moreover, for the RBF kernel, the data must be normalized: you can use the [StandardScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) to scale to [zero mean and unit variance](https://en.wikipedia.org/wiki/Feature_scaling#Standardization).
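
As a small illustration of what this scaling does, the sketch below applies `StandardScaler` to a toy matrix (an assumption for illustration, not the MNIST features) and checks the result against the formula computed by hand. The last column is constant, like a border pixel that is white in every image.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix; the last column is constant, like an always-white border pixel
X_toy = np.array([[0.0, 10.0, 0.0],
                  [2.0, 20.0, 0.0],
                  [4.0, 60.0, 0.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_toy)

# Same result by hand: subtract the column mean, divide by the column standard deviation
manual = (X_toy[:, :2] - X_toy[:, :2].mean(axis=0)) / X_toy[:, :2].std(axis=0)

print(X_scaled)
print(np.allclose(X_scaled[:, :2], manual))
```

The constant column has zero variance, so it cannot be divided by its standard deviation; `StandardScaler` leaves its scale unchanged and the column simply becomes all zeros.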

**Question** : 
> * Add the StandardScaler as a pipeline step.
> * Add the RBF kernel to the GridSearch [svm.SVC](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC).
> * Add the optimization of `gamma` and `C` with values: `C in [0.1,1,5,10,50]` and `gamma in [0.0005,0.001,0.005,0.01,0.05]`.

You can add a different kernel with its specific parameters this way:
<pre>
        {
         'features__dim': (8,),
         'clf__kernel': ['rbf'],
         'clf__gamma': [1e-3, ],
         'clf__C': [1,]
        },
</pre>



In [None]:
READ_N_SAMPLES = 10000
IMG_SIZE = 16
all_df = pd.read_csv("MNIST_all_features16.csv.gz" ,nrows=READ_N_SAMPLES)
all_df.head()
# define the features and the targets
X = all_df.iloc[:,:IMG_SIZE*IMG_SIZE]

y = all_df['class']

sns.countplot(data=all_df,y='class')

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn import svm
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV


pipeline = Pipeline([
    # YOUR CODE HERE
    ('clf', svm.SVC())    
])

gamma_range = None # YOUR CODE HERE
C_range = None # YOUR CODE HERE
parameters = [
        {
        'clf__kernel': ['linear'],
        'clf__C': C_range
        }
        # YOUR CODE HERE
    
]

if __name__ == "__main__":
    # multiprocessing requires the fork to happen in a __main__ protected
    # block

    # find the best parameters for both the feature extraction and the
    # classifier
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=2, cv=2)

    X_train_dev, X_test, y_train_dev, y_test = train_test_split(X, y, test_size=0.2)
    
    print (X_train_dev.shape,X_test.shape,y_train_dev.shape,y_test.shape)
    print("Performing grid search")

    grid_search.fit( None,  None) # YOUR CODE HERE
    
    # Print all experiments results
    print("Grid scores on development set:")
    print()
    means = grid_search.cv_results_['mean_test_score']
    stds = grid_search.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, grid_search.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r"
              % (mean, std * 2, params))
    print()
    
    # Print best experiment results
    print("Best score: %0.3f" % grid_search.best_score_)

    # predict on test with best parameters
    y_pred = grid_search.predict(None)# YOUR CODE HERE
    print(classification_report(None, None))# YOUR CODE HERE
    print()


In [None]:
# Plots are from https://github.com/ksopyla/svm_mnist_digit_classification 
from matplotlib.colors import Normalize
class MidpointNormalize(Normalize):

    def __init__(self, vmin=None, vmax=None, midpoint=None, clip=False):
        self.midpoint = midpoint
        Normalize.__init__(self, vmin, vmax, clip)

    def __call__(self, value, clip=None):
        x, y = [self.vmin, self.midpoint, self.vmax], [0, 0.5, 1]
        return np.ma.masked_array(np.interp(value, x, y))
def plot_param_space_scores(scores, C_range, gamma_range):
    
    plt.figure(figsize=(8, 6))
    plt.subplots_adjust(left=.2, right=0.95, bottom=0.15, top=0.95)
    plt.imshow(scores, interpolation='nearest', cmap=plt.cm.jet,
               norm=MidpointNormalize(vmin=0.5, midpoint=0.9))
    plt.xlabel('gamma')
    plt.ylabel('C')
    plt.colorbar()
    plt.xticks(np.arange(len(gamma_range)), gamma_range, rotation=45)
    plt.yticks(np.arange(len(C_range)), C_range)
    plt.title('Validation accuracy')
    plt.show()
    


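# The grid also contains the linear-kernel runs: keep only the RBF results,
# assuming the RBF grid was built from C_range and gamma_range
params = grid_search.cv_results_['params']
is_rbf = np.array([p.get('clf__kernel') == 'rbf' for p in params])
scores = grid_search.cv_results_['mean_test_score'][is_rbf].reshape(len(C_range),
                                                                     len(gamma_range))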

plot_param_space_scores(scores, C_range, gamma_range)

Now that we have found the best classifier (SVM RBF) and the best hyperparameter values, we can train the final classifier on all the data and evaluate it.

**Question** : 
> * Train the best classifier with the best hyperparameter values on 80% of the data and evaluate on 20%.


In [None]:
from sklearn import svm
from sklearn.model_selection import train_test_split

# train on all the data
READ_N_SAMPLES = 20000
IMG_SIZE = 16
all_df = pd.read_csv("MNIST_all_features16.csv.gz" ,nrows=READ_N_SAMPLES)
all_df.head()
# define the features and the targets
X = all_df.iloc[:,:IMG_SIZE*IMG_SIZE]
y = all_df['class']
X_train, X_test, y_train, y_test = train_test_split(None, None, test_size=0.2)# YOUR CODE HERE
classifier = svm.SVC(gamma=None,C=None,verbose=True) # YOUR CODE HERE
classifier.fit( None,  None)# YOUR CODE HERE

In [None]:
from sklearn.metrics import confusion_matrix, classification_report
y_pred = classifier.predict(X_test)
# classification_report expects the true labels first, then the predictions
print(classification_report(y_test, y_pred))