# Neural Networks

In this notebook we explore neural networks for image classification. We use the same dataset as in the SVM notebook: Fashion MNIST (https://pravarmahajan.github.io/fashion/), a dataset of small images of clothes and accessories.

The dataset labels are the following:

| Label | Description |
| --- | --- |
| 0 | T-shirt/top |
| 1 | Trouser |
| 2 | Pullover |
| 3 | Dress |
| 4 | Coat |
| 5 | Sandal |
| 6 | Shirt |
| 7 | Sneaker |
| 8 | Bag |
| 9 | Ankle boot |

In [None]:
#load the required packages

%matplotlib inline  

import numpy as np
import scipy as sp
import matplotlib.pyplot as plt

import sklearn
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

In [None]:
# helper function to load Fashion MNIST dataset from disk
def load_mnist(path, kind='train'):
    import os
    import gzip
    import numpy as np
    labels_path = os.path.join(path, '%s-labels-idx1-ubyte.gz' % kind)
    images_path = os.path.join(path, '%s-images-idx3-ubyte.gz' % kind)
    with gzip.open(labels_path, 'rb') as lbpath:
        labels = np.frombuffer(lbpath.read(), dtype=np.uint8,offset=8)
    with gzip.open(images_path, 'rb') as imgpath:
        images = np.frombuffer(imgpath.read(), dtype=np.uint8,offset=16).reshape(len(labels), 784)
    return images, labels
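The offsets 8 and 16 used in `load_mnist` skip the IDX file headers: a label file starts with a magic number and an item count (two big-endian 32-bit integers), while an image file also stores the row and column counts. The following is a minimal sketch on a tiny in-memory buffer built for illustration, not on the real files; it reads the shape from the header instead of hard-coding 784.

```python
import struct
import numpy as np

# Build a toy IDX3 image buffer: magic number 2051, then count, rows, cols
# (four big-endian uint32 values = 16 header bytes), then raw pixel bytes.
n_images, rows, cols = 2, 2, 3
pixels = np.arange(n_images * rows * cols, dtype=np.uint8)
image_buf = struct.pack(">IIII", 2051, n_images, rows, cols) + pixels.tobytes()

# Build a toy IDX1 label buffer: magic number 2049, then count (8 header bytes).
label_buf = struct.pack(">II", 2049, n_images) + bytes([7, 3])

# Parse the header, then read the payload that follows it.
magic, count, n_rows, n_cols = struct.unpack(">IIII", image_buf[:16])
images = np.frombuffer(image_buf, dtype=np.uint8, offset=16).reshape(count, n_rows * n_cols)
labels = np.frombuffer(label_buf, dtype=np.uint8, offset=8)

print(magic, images.shape, labels)
```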

# TODO 
Enter your ID ("numero di matricola"), which is used as the seed for the random generator. You can change the seed to see the impact of randomization.

In [None]:
ID = 1165385
np.random.seed(ID)

In [None]:
#load the Fashion MNIST dataset and normalize the features so that each value is in [0,1]
X, y = load_mnist("data")
# rescale the data
X = X / 255.0

Now split the data into training and test sets. Make sure that each label appears at least 10 times
in the training set (check the printed frequencies).

In [None]:
#randomly permute the data and split into training and test sets, taking the first 500
#samples as training and the rest as test
permutation = np.random.permutation(X.shape[0])

X = X[permutation]
y = y[permutation]

m_training = 500

X_train, X_test = X[:m_training], X[m_training:]
y_train, y_test = y[:m_training], y[m_training:]

labels, freqs = np.unique(y_train, return_counts=True)
print("Labels in training dataset: ", labels)
print("Frequencies in training dataset: ", freqs)
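The requirement that every label appears at least 10 times in the training set can also be checked in code rather than by eye. A possible sketch, assuming synthetic balanced labels as a stand-in for Fashion MNIST; the helper `split_with_min_frequency` and its parameters are hypothetical and not part of the notebook. It redraws the permutation until the condition holds.

```python
import numpy as np

def split_with_min_frequency(y, m_training, n_classes=10, min_count=10,
                             seed=0, max_tries=100):
    """Return a permutation whose first m_training labels contain every class
    at least min_count times (hypothetical helper, not part of the notebook)."""
    rng = np.random.default_rng(seed)
    for _ in range(max_tries):
        permutation = rng.permutation(len(y))
        counts = np.bincount(y[permutation[:m_training]], minlength=n_classes)
        if counts.min() >= min_count:
            return permutation
    raise RuntimeError("no valid split found; increase m_training or max_tries")

# Synthetic stand-in for the Fashion MNIST labels: 10 balanced classes.
y_demo = np.repeat(np.arange(10), 600)
perm = split_with_min_frequency(y_demo, m_training=500)
train_counts = np.bincount(y_demo[perm[:500]], minlength=10)
print(train_counts)
```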


In [None]:
#function for plotting an image and printing the corresponding label
def plot_input(X_matrix, labels, index):
    print("INPUT:")
    plt.imshow(
        X_matrix[index].reshape(28,28),
        cmap          = plt.cm.gray_r,
        interpolation = "nearest"
    )
    plt.show()
    print("LABEL: %i"%labels[index])
    return

In [None]:
#let's try the plotting function
plot_input(X_train,y_train,10)
plot_input(X_test,y_test,100)
plot_input(X_test,y_test,10000)

## TO DO 1

Now use a feed-forward neural network for prediction. Use the multi-layer perceptron classifier with the following parameters: max_iter=300, alpha=1e-4, solver='sgd', tol=1e-4, learning_rate_init=.1, random_state=ID (this last parameter ensures the result is the same even if you run it more than once). The alpha parameter is the regularization term.

Then, using the default activation function, pick four or five architectures to consider, with different numbers of hidden layers and different sizes. It is not necessary to create huge neural networks: you can limit yourself to 3 layers with at most 100 units per layer. Evaluate the architectures you chose using GridSearchCV with cv=5.


In [None]:
parameters = {'hidden_layer_sizes': [(10,), (50,), (10,10,), (50,50,),(50,50,50,),(100,80,50,),(50,80,100,)]}

mlp = MLPClassifier(max_iter=300, alpha=1e-4,
                    solver='sgd', tol=1e-4, random_state=ID,
                    learning_rate_init=.1)

clf = GridSearchCV(mlp, parameters, cv=5)
clf.fit(X_train, y_train)

print ('RESULTS FOR NN\n')

print("Best parameters set found:")
print(clf.best_params_)

print("Score with best parameters:")
# best_score_ is the highest mean cross-validated score on the grid
print("%0.3f" % clf.best_score_)

print("\nAll scores on the grid:")
means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r"% (mean, std * 2, params))

### QUESTION 1

What do you observe for the different architectures and their scores? How do the number of layers and their sizes affect the performance?

##### Answer (1):
The results show that, with this small number of samples, increasing the depth to 3 layers lowers the score, because there is not enough data for the additional nodes to form a meaningful structure that leads to a better answer. In short, a medium-sized architecture gives the best results for a medium-sized sample.
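One way to reason about these observations is to compare the number of trainable parameters of each architecture with the 500 available training samples. The sketch below assumes fully connected layers with 784 inputs and 10 outputs; `count_parameters` is a hypothetical helper introduced only for illustration.

```python
def count_parameters(hidden_layer_sizes, n_inputs=784, n_outputs=10):
    """Count weights and biases of a fully connected network."""
    sizes = [n_inputs, *hidden_layer_sizes, n_outputs]
    # Each layer has fan_in * fan_out weights plus fan_out biases.
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(sizes[:-1], sizes[1:]))

architectures = [(10,), (50,), (10, 10), (50, 50), (50, 50, 50),
                 (100, 80, 50), (50, 80, 100)]
for architecture in architectures:
    print(architecture, count_parameters(architecture))
```

Even the smallest architecture has far more parameters than there are training samples, which is consistent with deeper networks being harder to fit well here.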

### TO DO 2

Now get the training and test error for a NN with the best parameters found above. Pass verbose=True
to see how the loss changes across iterations.

In [None]:
#get training and test error for the best NN model from CV

best_mlp = MLPClassifier(max_iter=300, alpha=1e-4,
                    solver='sgd', tol=1e-4, random_state=ID,
                    learning_rate_init=.1, verbose=True,
                    hidden_layer_sizes = clf.best_params_['hidden_layer_sizes'])
best_mlp.fit(X_train, y_train)


training_error = 1. - best_mlp.score(X_train,y_train)
test_error = 1. - best_mlp.score(X_test,y_test)

print ('\nRESULTS FOR BEST NN\n')

print ("Best NN training error: %f" % training_error)
print ("Best NN test error: %f" % test_error)
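Besides the verbose output, the per-iteration loss is also stored on the fitted model in the `loss_curve_` attribute (available for the 'sgd' and 'adam' solvers). A minimal sketch, assuming scikit-learn's built-in digits dataset as a small stand-in for Fashion MNIST; the names `X_demo`, `y_demo` and `demo_mlp` are illustrative only.

```python
import warnings
from sklearn.datasets import load_digits
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier

# Small built-in dataset used as a stand-in for Fashion MNIST.
X_demo, y_demo = load_digits(return_X_y=True)
X_demo = X_demo / 16.0  # pixel values in this dataset range from 0 to 16

demo_mlp = MLPClassifier(hidden_layer_sizes=(50,), max_iter=30, alpha=1e-4,
                         solver='sgd', tol=1e-4, learning_rate_init=.1,
                         random_state=0)
with warnings.catch_warnings():
    # Stopping at max_iter before convergence is expected in this short demo.
    warnings.simplefilter("ignore", ConvergenceWarning)
    demo_mlp.fit(X_demo, y_demo)

# loss_curve_ holds one training-loss value per completed iteration (epoch).
loss_curve = demo_mlp.loss_curve_
print("iterations run:", demo_mlp.n_iter_)
print("first loss: %.4f, last loss: %.4f" % (loss_curve[0], loss_curve[-1]))
```

Passing `loss_curve` to `plt.plot` gives a quick visual check of whether training has plateaued.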

## More data 
Now let's do the same using 10000 data points for training (or fewer if it takes too long on your machine). Use the same NN architectures as before, but you can try more if you want!

In [None]:
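# X and y were already shuffled above, so they are not permuted again:
# the first 3000 samples then extend the earlier 500-sample training set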

m_training = 3000

X_train, X_test = X[:m_training], X[m_training:]
y_train, y_test = y[:m_training], y[m_training:]

print("Labels and frequencies in training dataset: ")
np.unique(y_train, return_counts=True)

## TO DO 3

Now train the NNs with the added data points. Feel free to try more architectures than before, or fewer if it takes too much time. We suggest using 'verbose=True' to get an idea of how long one iteration takes (if needed, also reduce the number of iterations to 50).

In [None]:
#for NN we try mostly the same architectures as before, with (100,) and (80,80,) added and (50,50,50,) dropped
parameters = {'hidden_layer_sizes': [(10,), (50,),(100,),(10,10,),(50,50,),(80,80,),(100,80,50,),(50,80,100,)]}

mlp_large = MLPClassifier(max_iter=50, alpha=1e-4,
                    solver='sgd', tol=1e-4, random_state=ID,
                    learning_rate_init=.1)

mlp_large_CV = GridSearchCV(mlp_large, parameters, cv=5)
mlp_large_CV.fit(X_train, y_train)

print ('\nRESULTS FOR NN\n')

print("Best parameters set found:")
print(mlp_large_CV.best_params_)

print("Score with best parameters:")
print(mlp_large_CV.best_score_)

print("\nAll scores on the grid:")
print(mlp_large_CV.cv_results_['mean_test_score'])

## QUESTION 2
Describe your architecture choices and the results you observe with respect to the layers and sizes used.

##### Answer (2):
First, because of my laptop's limited performance I had to reduce the number of training samples to 3000. Second, I expected that with more samples the best architecture would have more layers and higher complexity than in the previous question. Instead, the best architecture has a single hidden layer of 50 units. Possible reasons are randomness or the limited number of iterations, which may stop training before it converges. Increasing the number of iterations might favor a more complex architecture.

## TO DO 4

Get the train and test error for the best NN you obtained with 3000 points. This time you can run for 100 iterations. 


In [None]:
#get training and test error for the best NN model from CV

best_mlp_large = MLPClassifier(max_iter=100, alpha=1e-4,
                    solver='sgd', tol=1e-4, random_state=ID,
                    learning_rate_init=.1,hidden_layer_sizes = mlp_large_CV.best_params_['hidden_layer_sizes'])

best_mlp_large.fit(X_train, y_train)


training_error = 1. - best_mlp_large.score(X_train,y_train)
test_error = 1. - best_mlp_large.score(X_test,y_test)

print ('RESULTS FOR BEST NN\n')

print ("Best NN training error: %f" % training_error)
print ("Best NN test error: %f" % test_error)

## QUESTION 3

Compare the train and test error you got with a large number of samples with the best one you obtained with only 500 data points. Are the architectures the same or do they differ? What about the errors you get?

##### Answer (3):


In [None]:
print("Best parameters set found for 500 samples:")
print(clf.best_params_)
print("\nBest parameters set found for 3000 samples:")
print(mlp_large_CV.best_params_)

As mentioned above, the architectures differ, contrary to my expectation. However, as expected, the error decreased with the new architecture and the larger number of training samples.

### TO DO 5

Plot an image that was misclassified by the NN with m=500 training data points and is now correctly classified by the NN with m=10000 training data points.

In [None]:
NN_prediction = clf.predict(X_test)
large_NN_prediction = best_mlp_large.predict(X_test)

# indices misclassified by the small NN but correctly classified by the large one
fixed = np.flatnonzero((NN_prediction != y_test) & (large_NN_prediction == y_test))
if fixed.size > 0:
    i = fixed[0]
    print("NN_prediction", NN_prediction[i])
    print("large_NN_prediction", large_NN_prediction[i])
    plot_input(X_test, y_test, i)
else:
    print("No sample is misclassified by the small NN and fixed by the large one")
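The same comparison can be extended to count how many test samples the larger model fixes and how many it newly gets wrong. A small sketch on toy arrays invented purely for illustration; `pred_small` and `pred_large` stand in for the predictions of the two networks.

```python
import numpy as np

# Toy labels and predictions, invented purely for illustration.
y_true = np.array([0, 1, 2, 3, 4, 5])
pred_small = np.array([0, 9, 2, 8, 7, 5])  # stand-in for the m=500 model
pred_large = np.array([0, 1, 2, 8, 4, 6])  # stand-in for the larger model

small_ok = pred_small == y_true
large_ok = pred_large == y_true

fixed = np.flatnonzero(~small_ok & large_ok)      # wrong before, right now
regressed = np.flatnonzero(small_ok & ~large_ok)  # right before, wrong now

print("fixed by the larger model:", fixed)
print("newly misclassified:", regressed)
```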
        

Let's plot the weights of the multi-layer perceptron classifier for the best NN obtained with 500 data points and with 3000 data points. Note that the code assumes the NNs are called "best_mlp" and "best_mlp_large"; you may need to replace these with your variable names.



In [None]:
print("Weights with 500 data points:")

fig, axes = plt.subplots(4, 4)
vmin, vmax = best_mlp.coefs_[0].min(), best_mlp.coefs_[0].max()
for coef, ax in zip(best_mlp.coefs_[0].T, axes.ravel()):
    ax.matshow(coef.reshape(28, 28), cmap=plt.cm.gray, vmin=.5 * vmin, vmax=.5 * vmax)
    ax.set_xticks(())
    ax.set_yticks(())

plt.show()

print("Weights with 3000 data points:")

fig, axes = plt.subplots(4, 4)
vmin, vmax = best_mlp_large.coefs_[0].min(), best_mlp_large.coefs_[0].max()
for coef, ax in zip(best_mlp_large.coefs_[0].T, axes.ravel()):
    ax.matshow(coef.reshape(28, 28), cmap=plt.cm.gray, vmin=.5 * vmin, vmax=.5 * vmax)
    ax.set_xticks(())
    ax.set_yticks(())
plt.show()
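To make the indexing in the plotting code explicit: the first entry of `coefs_` has shape (784, number of units in the first hidden layer), so its transpose has one row of 784 input weights per hidden unit. A sketch with a random matrix standing in for the real weights; `n_hidden = 50` is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_hidden = 50
# Random stand-in for best_mlp.coefs_[0]; the real matrix has the same shape
# when the first hidden layer has n_hidden units.
first_layer_weights = rng.normal(size=(784, n_hidden))

# Transposing gives one row per hidden unit; each row is a 784-vector of
# input weights that can be viewed as a 28x28 image.
unit_images = first_layer_weights.T.reshape(n_hidden, 28, 28)

# zip() with a 4x4 grid of axes stops after 16 units, so only these are drawn.
shown = unit_images[:16]
print(unit_images.shape, shown.shape)
```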

## QUESTION 4

Describe what you observe by looking at the weights.

##### Answer (4):
First, I should mention that because of my laptop's limited performance I reduced the number of training samples to 3000, so the difference between the two sets of weights is not very obvious. Looking more carefully, however, the accuracy increased, and the weights learned with more data correspond to the better performance.

### TO DO 7

Report the best SVM model, and its parameters, that you found in the last notebook. Fit it on a few data points and compute its training and test scores.

In [None]:
m_training = 700

X_train, X_test = X[:m_training], X[m_training:2*m_training]
y_train, y_test = y[:m_training], y[m_training:2*m_training]

# best parameters found in the SVM notebook
# Create the SVM and perform the fit

best_svc = SVC(kernel = 'rbf')
parameter = {'C': [10], 'gamma': [0.01]}
best_SVM = GridSearchCV(best_svc, parameter, cv=5)
best_SVM.fit(X_train, y_train)

print ('RESULTS FOR SVM\n\n')

SVM_training_error = 1. - best_SVM.score(X_train,y_train)

# the printed values are errors (1 - accuracy), so label them as such
print("Training error SVM:")
print(SVM_training_error)

SVM_test_error = 1. - best_SVM.score(X_test,y_test)
print("\nTest error SVM:")
print(SVM_test_error)
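When the parameters are already fixed to a single (C, gamma) pair, there is nothing left to search over, so the SVM can also be fitted directly; with the default refit, the grid search ends up fitting the same model on the full training set after five extra cross-validation fits. A minimal sketch, assuming scikit-learn's built-in digits dataset as a stand-in for Fashion MNIST; the variable names are illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.svm import SVC

# Built-in digits data as a small stand-in for Fashion MNIST.
X_demo, y_demo = load_digits(return_X_y=True)
X_demo = X_demo / 16.0  # pixel values in this dataset range from 0 to 16
X_tr, y_tr = X_demo[:700], y_demo[:700]
X_te, y_te = X_demo[700:1400], y_demo[700:1400]

# Fit the SVM directly with the fixed parameters instead of a one-point grid.
svc_demo = SVC(kernel='rbf', C=10, gamma=0.01)
svc_demo.fit(X_tr, y_tr)

train_error = 1. - svc_demo.score(X_tr, y_tr)
test_error = 1. - svc_demo.score(X_te, y_te)
print("Training error SVM: %f" % train_error)
print("Test error SVM: %f" % test_error)
```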

## QUESTION 5
Compare the results of SVM and of NN. Which one would you prefer? What are its tradeoffs?

##### Answer (5):


First, the SVM is much faster than the NN for a larger number of samples, but the NN is more accurate than the SVM (lower error), though only when the number of training samples is large. I think the choice is a tradeoff between accuracy, speed, complexity, and so on: the best algorithm depends on the input data and on the hardware restrictions.