Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE" and remove every line containing the expression: "raise ..." (if you leave such a line your code will not run).

Do not remove any cell from the notebook you downloaded. You can add any number of cells (and remove them if they are no longer necessary).

Do not leave any variable initialized to None.

## IMPORTANT: make sure to rerun all the code from the beginning to obtain the results for the final version of your notebook, since this is the way we will do it before evaluating your notebook!!!

## Make sure to name your notebook file (.ipynb) correctly:
### - NL_NAMESURNAME_ID (e.g. NL_MARIOROSSI_2204567)

## Fill in your name, surname and id number (numero matricola) below:

In [None]:
NAME = "Victor Miguel Velazquez Espitia"
ID_number = int("2043179")

import IPython
assert IPython.version_info[0] >= 3, "Your version of IPython is too old, please update it."

---

## HOMEWORK #3

### Non linear models for classification 

In this notebook we are going to explore the use of SVM and Neural Networks for image classification. We are going to use the famous MNIST dataset, a dataset of handwritten digits. We get the data from OpenML (https://www.openml.org), a public repository for machine learning data.

In [None]:
# Load the required packages
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sn

import sklearn
from sklearn.datasets import fetch_openml
from sklearn.neural_network import MLPClassifier

np.random.seed(ID_number)

In [None]:
#load the MNIST dataset 
#Load data from https://www.openml.org/d/554
X,Y = fetch_openml('mnist_784', version=1, return_X_y=True,as_frame = False)

print(f'Each image is represented as a vector of shape {X[0].shape}')
print(f'The image is represented in gray scale levels {X[0]}')
print(f'Here is a label: {Y[0]}')

In [None]:
#let's normalize the features so that each value is between [0,1]

# Rescale the data
X = X / 255.

In a classification problem it is desirable to split the dataset into train and test sets in a way that preserves the same proportions of examples in each class as observed in the original dataset.
We can achieve this by setting the “stratify” argument of the function "train_test_split" to the Y component of our dataset.

We are going to use 500 samples in the train dataset, the remaining ones are used for testing.
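
To make the idea of stratification concrete, here is a minimal NumPy-only sketch of a stratified selection. It is an illustration under stated assumptions, not how `train_test_split` is implemented internally: the helper name `stratified_indices` and the toy labels are hypothetical.

```python
import numpy as np

def stratified_indices(labels, train_size, seed=0):
    """Pick about `train_size` indices while keeping each class's share of the data."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train_idx = []
    for cls in np.unique(labels):
        cls_idx = np.flatnonzero(labels == cls)
        # Number of samples for this class, proportional to its frequency
        n_cls = int(round(train_size * len(cls_idx) / len(labels)))
        train_idx.extend(rng.choice(cls_idx, size=n_cls, replace=False))
    return np.sort(np.array(train_idx))

# Toy labels (illustrative only): 70% class 'a', 30% class 'b'
toy_labels = np.array(['a'] * 700 + ['b'] * 300)
idx = stratified_indices(toy_labels, train_size=100)
classes, counts = np.unique(toy_labels[idx], return_counts=True)
print(list(zip(classes, counts)))
```

The selected subset keeps the 70/30 proportion of the toy labels, which is the property the "stratify" argument is meant to give for the MNIST classes.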

In [None]:
from sklearn.model_selection import train_test_split

m_t = 500
x_train, x_test, y_train, y_test = train_test_split(X, Y, train_size=m_t/len(Y), random_state=ID_number, stratify=Y)

print(f'Length train dataset: {len(y_train)}, Labels and frequencies: \n {list(zip(*np.unique(y_train, return_counts=True)))}')
print(f'Length test dataset: {len(y_test)}, Labels and frequencies: \n {list(zip(*np.unique(y_test, return_counts=True)))}')

In [None]:
# Function to plot a digit and print the corresponding label
def plot_digit(X_matrix, labels, index):
    print("INPUT:")
    plt.imshow(
        X_matrix[index].reshape(28,28),
        cmap          = plt.cm.gray_r,
        interpolation = "nearest"
    )
    plt.show()
    print(f"LABEL: {labels[index]}")
    return

In [None]:
#let's try the plotting function
plot_digit(x_train, y_train, 100)
plot_digit(x_test, y_test, 40000)

## TO DO 1
SVM with cross validation to pick the best model. Use SVC from sklearn.svm and GridSearchCV from sklearn.model_selection (5-fold cross-validation).

Print the best parameters found as well as the best score obtained by the 'optimal' model.
Choose the grid: depending on the kernel you are using, different hyper-parameters are needed (C, gamma, ...).
You do not need to use more than 5 values for each hyper-parameter (otherwise the cell could be very slow). 
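
To see why the grid size matters for running time, the following sketch enumerates every combination in a dict-of-lists grid using only the standard library. The helper `enumerate_grid` and the example grid are hypothetical and only mimic the kind of enumeration a grid search performs; with k-fold cross-validation each combination is fitted k times.

```python
from itertools import product

def enumerate_grid(param_grid):
    """Return every hyper-parameter combination in a dict-of-lists grid."""
    names = sorted(param_grid)
    value_lists = [param_grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]

# Hypothetical rbf grid, chosen only for illustration
example_grid = {'C': [1, 10, 100], 'gamma': [0.001, 0.01]}
combos = enumerate_grid(example_grid)
n_folds = 5
print(f'{len(combos)} combinations x {n_folds} folds = {len(combos) * n_folds} fits')
```

With 5 values for each of two hyper-parameters the same count becomes 25 combinations and 125 fits, which is why the grid should stay small.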

In [None]:
#import SVC
from sklearn.svm import SVC
#import for Cross-Validation
from sklearn.model_selection import GridSearchCV


def compute_best_SVM_with_CV(kernel_type : str, parameters : dict, x_train : np.ndarray, y_train : np.ndarray) -> tuple:
    '''
    Use Cross validation to find the best SVM on the given parameters. Return the best parameters set together with 
    the corresponding score. Return also the scores for all the other parameters given as input.
    :param kernel_type: Type of kernel (i.e. linear, rbf, poly)
    :param parameters: Dict containing kernel parameters (e.g. {'C': [1, 10, 100, 1000], 'gamma': [0.01, 0.001], ...})
    :param x_train: Train dataset
    :param y_train: Train labels
    
    :returns: (best_param, best_score, all_scores)
        WHERE:
        best_param: best parameter set (this is a dictionary)
        best_score: best score obtained for the given parameters (float)
        all_scores: all scores computed for each parameter (np.ndarray)
    '''
    SVM_model = SVC(kernel=kernel_type)
    # Use GridSearchCV to find the best parameter set.
    # YOUR CODE HERE
    cv = GridSearchCV(SVM_model, parameters, cv=5)  # 5-fold cross-validation, as required
    cv.fit(x_train,  y_train)
    #raise NotImplementedError() # Remove this line
    
    print('#####################################')
    print(f'RESULTS for {kernel_type} KERNEL\n')
    # Store the best parameters set and print them
    print("Best parameters set found:")
    best_param = None
    # YOUR CODE HERE
    best_param = cv.best_params_
    #raise NotImplementedError() # Remove this line
    print(best_param)
    
    # Store and print the score of the best parameters set
    print("\nScore with best parameters:")
    best_score = None
    # YOUR CODE HERE
    best_score= cv.best_score_
    #raise NotImplementedError() # Remove this line
    print(best_score)
    
    # Store and print all the scores for the given parameters (average of the validation scores)
    print("\nAll scores on the grid:")
    all_scores = None
    # YOUR CODE HERE
    all_scores= cv.cv_results_['mean_test_score']
    #raise NotImplementedError() # Remove this line
    print(all_scores)
    
    return best_param, best_score, all_scores

# Choose the grid for parameters of the linear SVM kernel
linear_parameters = None

# YOUR CODE HERE
linear_parameters = {'C': [0.01, 0.1, 1, 10, 100]}
#raise NotImplementedError() # Remove this line
best_param_lin, best_score_lin, all_scores_lin = compute_best_SVM_with_CV('linear', linear_parameters, x_train, y_train)
# Choose the grid for parameters of the rbf SVM kernel
rbf_parameters = None
# YOUR CODE HERE
rbf_parameters={'C': [1, 10, 100, 1000] , 'gamma': [0.01, 0.001] }
best_param_rbf, best_score_rbf, all_scores_rbf = compute_best_SVM_with_CV('rbf', rbf_parameters, x_train, y_train)
# Choose the grid for parameters of the poly SVM kernel (do not forget to choose the degree)
poly_parameters = None
# YOUR CODE HERE
poly_parameters ={'C': [1, 10, 100, 1000] , 'degree':[2,3,4] }
#raise NotImplementedError() # Remove this line
best_param_poly, best_score_poly, all_scores_poly = compute_best_SVM_with_CV('poly', poly_parameters, x_train, y_train)

In [None]:
assert type(best_param_rbf) == dict
assert type(best_score_rbf) == np.float64
assert np.prod(np.array([len(params) for params in rbf_parameters.values()])) == len(all_scores_rbf)


In [None]:
# TODO 2: 
# Get training and test error for the best SVM model obtained from CV (you need to choose across different kernels 
# too). You just need to look at the best model for each kernel and choose the best one (you can do this by hand).

# YOUR CODE HERE
best_kernel_type, best_parameters = 'rbf' , {'C': 10, 'gamma': 0.01}

best_SVM = SVC(kernel=best_kernel_type, **best_parameters)
best_SVM.fit(x_train, y_train)

# Compute training and test error for this model (use the usual sklearn built-in functions)
training_error, test_error = 1-best_SVM.score(x_train,y_train) , 1-best_SVM.score(x_test,y_test)
# YOUR CODE HERE

print (f"Best SVM training error: {training_error}")
print (f"Best SVM test error: {test_error}")

In [None]:
assert type(training_error) == np.float64
assert type(test_error) == np.float64


### TO DO 3
Now we use feed-forward neural networks for classification. 
In particular, we use the Multi-Layer-Perceptron (the multi-layer structure we have seen in class, see http://scikit-learn.org/stable/modules/neural_networks_supervised.html).

As before, we use cross validation to pick the best model. You need to complete the function 'compute_best_MLP_with_CV()', which finds the best MLP architecture given a specific activation function.

Note that the starting random state is fixed to make the runs reproducible (random_state=ID_number).
The following options for the MLP are used: max_iter=1000, alpha=1e-4, solver='sgd', tol=1e-4, random_state=ID_number, learning_rate_init=.1, activation = activation_f. 

In [None]:
def compute_best_MLP_with_CV(activation_f : str, parameters : dict, x_train : np.ndarray, y_train : np.ndarray) -> tuple:
    '''
    Use Cross validation to find the best MLP architecture given a specific activation function. 
    Return the best parameters set together with the corresponding score. Return also the scores for all the other parameters given as input.
    :param activation_f: Type of activation function (e.g. 'logistic', 'tanh', 'relu')
    :param parameters: architectures (e.g. {'hidden_layer_sizes': [(10,), (50,), (10,10,), (50,50,)]})
    :param x_train: Train dataset
    :param y_train: Train labels
    
    :returns: (best_param, best_score, all_scores)
        WHERE:
        best_param: best parameter set (this is a dictionary)
        best_score: best score obtained for the given parameters (float)
        all_scores: all scores computed for each parameter (np.ndarray)
    '''
    
    
    mlp = MLPClassifier(max_iter=1000, alpha=1e-4, solver='sgd', tol=1e-4, random_state=ID_number, learning_rate_init=.1,activation = activation_f)
    
    # Use GridSearchCV (5-fold) to find the values the function returns: best_param, best_score, all_scores
    mlp_CV = GridSearchCV(mlp, parameters, cv=5)
    mlp_CV.fit(x_train,  y_train)

    best_param = mlp_CV.best_params_
    best_score= mlp_CV.best_score_
    all_scores= mlp_CV.cv_results_['mean_test_score']


    # YOUR CODE HERE
    return best_param, best_score, all_scores

In [None]:
#test various architectures (hidden_layer_sizes) and activation functions (e.g. 'logistic','tanh','relu') for the MLP.

mlp_parameters = parameters = {'hidden_layer_sizes': [(10,), (50,), (10,10,), (50,50,)]}
 #leave here maximum 3 architectures when you submit

# next test different architectures and activation functions: use compute_best_MLP_with_CV()
# YOUR CODE HERE


In [None]:
#simple autotest with relu
best_param_relu, best_score_relu, all_scores_relu = compute_best_MLP_with_CV('relu', mlp_parameters, x_train, y_train)

assert type(best_param_relu) == dict
assert type(best_score_relu) == np.float64
assert np.prod(np.array([len(params) for params in mlp_parameters.values()])) == len(all_scores_relu)


In [None]:
#Select the best activation function and architecture you found so that it can be used next

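# Use the single best parameter set found by CV (not the whole grid, which MLPClassifier cannot accept)
best_activation_type, mlp_best_param = 'relu', best_param_relu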
# YOUR CODE HERE


## TO DO 4


Now get the training and test error for the NN with the best parameters from above. We pass verbose=True
so that we can see how the loss changes across iterations (see how this changes if the number of iterations is changed).

In [None]:
# Get training and test error for the best NN model found using CV
max_iter = 1000
mlp = MLPClassifier(**mlp_best_param, max_iter=max_iter, alpha=1e-4, solver='sgd', tol=1e-4, random_state=ID_number,
                    learning_rate_init=.1,activation=best_activation_type, verbose=True)

# ADD CODE: FIT MODEL & COMPUTE TRAINING AND TEST ERRORS
mlp.fit(x_train,  y_train)
training_error, test_error = 1-mlp.score(x_train,y_train) , 1-mlp.score(x_test,y_test)# YOUR CODE HERE

print ('\nRESULTS FOR BEST NN\n')

print ("Best NN training error: %f" % training_error)
print ("Best NN test error: %f" % test_error)

plt.plot(mlp.loss_curve_, label='Training Loss')
plt.title('Training loss MLP')
plt.xlabel('Iter'), plt.ylabel('Loss')

In [None]:
assert type(training_error) == np.float64
assert type(test_error) == np.float64


## TO DO  5
Write a function to find and plot the first digit (in x_test) that is misclassified by NN and correctly classified by SVM.

Write a function to compute the confusion matrix for the predictions of a model (on testset). If you are not familiar with what a confusion matrix is, have a look at this link: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html . You are not allowed to use sklearn to create the confusion matrix BUT you can compare your solution with the sklearn implementation to check you wrote it right (see assert checks). 
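
As a reference for the convention (rows are true labels, columns are predicted labels), here is a small NumPy-only sketch that builds a confusion matrix without sklearn. It is one possible approach, not the required solution; the function name `confusion_matrix_sketch` and the toy labels are hypothetical.

```python
import numpy as np

def confusion_matrix_sketch(true_labels, predicted_labels):
    """Count (true, predicted) pairs; rows = true labels, columns = predicted labels."""
    true_labels = np.asarray(true_labels)
    predicted_labels = np.asarray(predicted_labels)
    labels = np.unique(np.concatenate([true_labels, predicted_labels]))
    # Position of each label in the sorted array of unique labels
    true_idx = np.searchsorted(labels, true_labels)
    pred_idx = np.searchsorted(labels, predicted_labels)
    cm = np.zeros((len(labels), len(labels)), dtype=int)
    # np.add.at accumulates correctly even when the same cell appears many times
    np.add.at(cm, (true_idx, pred_idx), 1)
    return cm

# Toy labels, for illustration only
toy_true = np.array(['0', '0', '1', '1', '2', '2'])
toy_pred = np.array(['0', '1', '1', '1', '2', '0'])
toy_cm = confusion_matrix_sketch(toy_true, toy_pred)
print(toy_cm)
```

The diagonal holds the correctly classified counts, so the trace divided by the total number of samples is the accuracy.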

In [None]:
def find_and_print_first_mismatched_prediction(SVM_prediction : np.ndarray, NN_prediction : np.ndarray,
                                               x_test : np.ndarray, y_test : np.ndarray) -> int:
    '''
    Function to find and print the first digit that is misclassified by NN and correctly classified by SVM.
    :param SVM_prediction: SVM predictions.
    :param NN_prediction: MLP predictions.
    :param x_test: Test set inputs.
    :param y_test: Test set labels.
    
    :returns:
        i: returns the first index in which there is a mismatch between NN_prediction and true labels but no mismatch 
           between SVM_prediction and true labels. 
    '''
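    i = 0
    found = False
    while ((not found) and (i<len(y_test))):
        # YOUR CODE HERE
        if (y_test[i] == SVM_prediction[i] and y_test[i] != NN_prediction[i]):
            found = True
        else:
            i = i+1
    # Plot the digit (only if such an index exists) together with both predictions
    if found:
        plot_digit(x_test, y_test, i)
        print(f"SVM prediction: {SVM_prediction[i]}, NN prediction: {NN_prediction[i]}")
    return i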
    
    
def confusion_matrix_by_hand(true_labels : np.ndarray, predicted_labels : np.ndarray) -> np.ndarray:
    '''
    Function used to compute the confusion matrix given true and predicted labels. 
    :param true_labels: True labels.
    :param predicted_labels: Predicted labels (note this function does not require to know which model generated 
                             the predictions).
    
    :returns:
        confusion_matrix: Confusion matrix for the given true and predicted labels.
    '''
    labels = np.unique(true_labels)
    map_labels_to_index = {label:i for i, label in enumerate(labels)}
    confusion_matrix = np.zeros((len(labels), len(labels)))
    # YOUR CODE HERE
    for t, p in zip(true_labels, predicted_labels):
        # rows index the true label, columns index the predicted label
        confusion_matrix[map_labels_to_index[t]][map_labels_to_index[p]] += 1
    return confusion_matrix.astype(int)

# Let's test our functions
SVM_prediction = best_SVM.predict(x_test)
NN_prediction = mlp.predict(x_test)


first_index = find_and_print_first_mismatched_prediction(SVM_prediction, NN_prediction, x_test, y_test)

SVM_CM = confusion_matrix_by_hand(y_test, SVM_prediction)
MLP_CM = confusion_matrix_by_hand(y_test, NN_prediction)

print(f'SVM confusion matrix: {SVM_CM}')
print(f'MLP confusion matrix: {MLP_CM}')

# Convert confusion matrices to pandas data frames
labels = np.unique(y_test)
SVM_CM_df = pd.DataFrame(SVM_CM, index = labels, columns = labels)
MLP_CM_df = pd.DataFrame(MLP_CM, index = labels, columns = labels)

# Plot confusion matrices
fig, axes = plt.subplots(1,2, figsize=(15,5))
sn.heatmap(SVM_CM_df, annot=True, ax=axes[0], cmap='rocket_r', vmax=450)
sn.heatmap(MLP_CM_df, annot=True, ax=axes[1], cmap='rocket_r', vmax=450)
axes[0].set_title('SVM'), axes[1].set_title('MLP')

In [None]:
from sklearn.metrics import confusion_matrix
skl_confusion_matrix_SVM = confusion_matrix(y_test, SVM_prediction)
skl_confusion_matrix_NN = confusion_matrix(y_test, NN_prediction)

# Compare entry by entry: a sum of differences could be zero even if individual entries differ
assert np.array_equal(skl_confusion_matrix_SVM, SVM_CM)
assert np.array_equal(skl_confusion_matrix_NN, MLP_CM)


## TO DO 6: explain the results you got (max 5 lines)
According to the cross-validation results, would you choose SVMs or NNs when 500 data points are available for training? Is this a good choice, given the results on the test set?

Looking at the confusion matrices, what do you observe? On which classes is each model more likely to make mistakes?

(Answer in the next cell, no need to add code)

Comparing the test errors of the best SVM (the one using the rbf kernel, i.e. a Gaussian kernel) and of the best NN (with the (50, 50) architecture), the SVM gives the better model: its test error (0.108) is lower than that of the NN (0.1467). This is what we expect from the theory: while the NN finds a generic decision boundary that separates the data points, the SVM finds the one at maximum distance from the points of the different classes. The confusion matrices confirm that the SVM gives a better model than the NN, since the values on the diagonal, which correspond to the number of correctly classified samples, are greater for the SVM than for the NN. Furthermore, the confusion matrices show the classes on which each model makes the most mistakes: both SVM and NN confuse the digit 4 with 9, 7 with 9, and 3 with 5.
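
The reading of the off-diagonal entries described above can also be done programmatically. The sketch below ranks the largest off-diagonal cells of a confusion matrix; the helper `most_confused_pairs` is hypothetical and the toy matrix contains made-up counts, not the results of this notebook.

```python
import numpy as np

def most_confused_pairs(cm, labels, top_k=3):
    """Return the top_k (true, predicted, count) triples among the off-diagonal cells."""
    off_diag = np.array(cm, dtype=int)  # np.array copies, so the input is untouched
    np.fill_diagonal(off_diag, 0)       # ignore correctly classified samples
    flat_order = np.argsort(off_diag, axis=None)[::-1][:top_k]
    rows, cols = np.unravel_index(flat_order, off_diag.shape)
    return [(labels[r], labels[c], int(off_diag[r, c])) for r, c in zip(rows, cols)]

# Toy confusion matrix with made-up counts, for illustration only
toy_cm = np.array([[50, 2, 0],
                   [1, 40, 9],
                   [0, 4, 45]])
toy_labels = ['4', '7', '9']
pairs = most_confused_pairs(toy_cm, toy_labels, top_k=2)
print(pairs)
```

In this notebook the same helper could be called, for example, as `most_confused_pairs(SVM_CM, labels)` once `SVM_CM` and `labels` are defined.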

## More Data

Now let's do the same but using more data points for training SVM and NN. For SVM we are going to use the best hyperparameter set (kernel, C, gamma, ...) found using 500 data points. For NN we are going to use the best architecture found using 500 data points for the relu activation function, since such an architecture is usually fast to train.

In [None]:
# let's restart the random generator with the given seed
np.random.seed(ID_number)

m_t = 60000
x_train, x_test, y_train, y_test = train_test_split(X, Y, train_size=m_t/len(Y), random_state=ID_number, stratify=Y)

print(f'Length train dataset: {len(y_train)}, Labels and frequencies: \n {list(zip(*np.unique(y_train, return_counts=True)))}')
print(f'Length test dataset: {len(y_test)}, Labels and frequencies: \n {list(zip(*np.unique(y_test, return_counts=True)))}')

In [None]:
# As we did with the first HW let's use a decorator to measure time 
from collections import defaultdict
running_times = defaultdict(list)

def measure_time(function):
    def wrap(*args, **kw):
        import time 
        t_start = time.time()
        result = function(*args, **kw)
        t_end = time.time()
        running_times[type(args[0]).__name__].append(t_end - t_start)
        return result
    return wrap

@measure_time
def fit_classification_model(model, x_train, y_train):
    model.fit(x_train, y_train)

In [None]:
n_data = [250, 500, 1000, 2000, 5000, 7500]
svm_train_err, svm_test_err = [], [] 
mlp_train_err, mlp_test_err = [], [] 
for n in n_data: 
    print(f'Processing with {n} data ...')
    # Initialize models according to the best we got using 500 data
    svm = SVC(kernel=best_kernel_type, **best_parameters)
    mlp = MLPClassifier(**best_param_relu, max_iter=max_iter, alpha=1e-4, solver='sgd', tol=1e-4, 
                        random_state=ID_number, learning_rate_init=.1,activation='relu')

    # fit svc
    fit_classification_model(svm, x_train[:n], y_train[:n])
    # get svc train and test error
    svm_train_err.append(1. - svm.score(x_train[:n], y_train[:n]))
    svm_test_err.append(1. - svm.score(x_test, y_test))
    
    # fit mlp
    fit_classification_model(mlp, x_train[:n], y_train[:n])
    # get mlp train and test error
    mlp_train_err.append(1. - mlp.score(x_train[:n], y_train[:n]))
    mlp_test_err.append(1. - mlp.score(x_test, y_test))

In [None]:
fig, axes = plt.subplots(1,2, figsize=(15, 5))
axes[0].plot(n_data, np.array(svm_train_err), label='SVM train err')
axes[0].plot(n_data, np.array(svm_test_err), label='SVM test err')
axes[0].plot(n_data, np.array(mlp_train_err), label='MLP train err')
axes[0].plot(n_data, np.array(mlp_test_err), label='MLP test err')
axes[0].set_xlabel('N data'), axes[0].set_ylabel('Error')  # the curves are classification errors, not losses
axes[0].legend(), axes[0].set_title('SVC vs MLP Errors')

for model, times in running_times.items():
    axes[1].plot(n_data, times, label=model)
axes[1].set_xlabel('N data'), axes[1].set_ylabel('Time (s)')
axes[1].legend(), axes[1].set_title('Training Time')

# TODO 7: Complete dataset
Just for comparison, since it may not be possible to learn an SVM on too many data points (due to time and memory complexity issues, as you can notice from the plots above), let's use logistic regression (with the standard scikit-learn parameters, except for the number of iterations).

In [None]:
from sklearn import linear_model

# Fit and test a logistic regression model
max_iter = 1000

# YOUR CODE HERE
log_reg = linear_model.LogisticRegression(max_iter = max_iter)
log_reg.fit(x_train, np.ravel(y_train))
training_error_lr = 1 - log_reg.score(x_train, y_train)
test_error_lr = 1 - log_reg.score(x_test, y_test)

print (f"Best logistic regression training error: {training_error_lr:.4f}")
print (f"Best logistic regression test error: {test_error_lr:.4f}")

We now learn the NN. Below we use the same best architecture as before (found with 500 data points for the relu activation function); feel free to try larger ones (and to use CV again), or smaller ones if it takes too much time. (We suggest that you use 'verbose=True' to get an idea of how long it takes to run 1 iteration.)

*Note*: If you run CV again to choose the best architecture, remember to save the best set of parameters into the variable: "best_param_relu".

In [None]:
#get training and test error for the best NN model from CV

# YOUR CODE HERE
parameters = {'hidden_layer_sizes': [(10,), (50,), (10,10,), (50,50,)]}
mlp = MLPClassifier(max_iter=1000, alpha=1e-4, solver='sgd', tol=1e-4, random_state=ID_number, learning_rate_init=.1)
mlp_CV = GridSearchCV(mlp, parameters)
mlp_CV.fit(x_train,  y_train)
mlp_best_param = mlp_CV.best_params_
max_iter = 1000
best_mlp_larger = MLPClassifier(**mlp_best_param, max_iter=max_iter, alpha=1e-4, solver='sgd', tol=1e-4, random_state=ID_number,
                    learning_rate_init=.1, verbose=True)
best_mlp_larger.fit(x_train,  y_train)
training_error, test_error = 1-best_mlp_larger.score(x_train,y_train) , 1-best_mlp_larger.score(x_test,y_test)

print ('\nRESULTS FOR BEST NN\n')

print (f"Best NN training error: {training_error:.4f}")
print (f"Best NN test error: {test_error:.4f}")

In [None]:
assert type(training_error) == np.float64
assert type(test_error) == np.float64


In [None]:
## TODO 8: compute the confusion matrices both on train and test set for Logistic regression (trained on 60k)
# and MLP (trained on 60k).

# Log Reg Confusion matrices
log_reg_CM_train, log_reg_CM_test = None, None
# YOUR CODE HERE
predict_train = log_reg.predict(x_train)
predict_test = log_reg.predict(x_test)
log_reg_CM_train = confusion_matrix_by_hand(y_train, predict_train)
log_reg_CM_test = confusion_matrix_by_hand(y_test, predict_test)
# mlp
mlp_CM_train, mlp_CM_test = None, None
# YOUR CODE HERE
predict_train_mlp = best_mlp_larger.predict(x_train)
predict_test_mlp = best_mlp_larger.predict(x_test)
mlp_CM_train = confusion_matrix_by_hand(y_train, predict_train_mlp)
mlp_CM_test = confusion_matrix_by_hand(y_test, predict_test_mlp)


# Convert confusion matrices to pandas data frames
labels = np.unique(y_test)
log_reg_CM_train_df = pd.DataFrame(log_reg_CM_train, index = labels, columns = labels)
log_reg_CM_test_df = pd.DataFrame(log_reg_CM_test, index = labels, columns = labels)

mlp_CM_train_df = pd.DataFrame(mlp_CM_train, index = labels, columns = labels)
mlp_CM_test_df = pd.DataFrame(mlp_CM_test, index = labels, columns = labels)

# Plot confusion matrices
fig, axes = plt.subplots(1,2, figsize=(15,5))
sn.heatmap(log_reg_CM_train_df, annot=True, ax=axes[0], cmap='rocket_r', vmax=250)
sn.heatmap(log_reg_CM_test_df, annot=True, ax=axes[1], cmap='rocket_r', vmax=250)
axes[0].set_title('Log Reg Train'), axes[1].set_title('Log Reg Test')

fig, axes = plt.subplots(1,2, figsize=(15,5))
sn.heatmap(mlp_CM_train_df, annot=True, ax=axes[0], cmap='rocket_r', vmax=50)
sn.heatmap(mlp_CM_test_df, annot=True, ax=axes[1], cmap='rocket_r', vmax=50)
axes[0].set_title('MLP Train'), axes[1].set_title('MLP Test')

In [None]:
assert log_reg_CM_train.shape == (10, 10)
assert log_reg_CM_test.shape == (10, 10)
assert mlp_CM_train.shape == (10, 10)
assert mlp_CM_test.shape == (10, 10)


## TO DO 9
Compare and discuss:
- compare the computational time required to fit an SVM and an MLP. Which is faster as the number of data points increases? Why? Can you apply both methods in the high data regime?
- the results from SVM m=7500 and NN with m=60000 training data points.
- the results from NN with m=500 and m=60000 training data points.
- What do you observe in the confusion matrices? Which are the hardest classes? Are the hardest and easiest classes the same both for mlp and logistic regression?

(Answer in the next cell, no need to write code)

(1) Every time the code is run, a slightly different training-time plot is obtained, because the running time depends on the speed of the processor at that moment. In any case, the training time increases with the number of data points for both SVM and NN. The SVM is faster when the number of data points is small: it solves a convex optimization problem, for which efficient solvers exist. On the other hand, it is a quadratic problem, so when the number of data points is very large it becomes very hard to solve; in that regime the NN has the better computational time, as can be seen in the plots. In conclusion, the SVM is more convenient when the number of data points is small, while the NN is more convenient when it is large.

(2) The test error of the NN trained on 60000 data points (0.0268) is slightly better than the test error of the SVM trained on 7500 data points (0.0411), but the time needed to train the former is much larger than the time needed by the SVM with 7500 points. Therefore, although the NN with 60000 data points gives a better model, the SVM with 7500 data points is the more convenient choice.

(3) As we expect from the theory, the NN trained on 60000 data points gives a better model than the NN trained on 500 data points: the test error in the first case (0.0268) is lower than in the second case (0.146849), because the method can learn a better model from more data.

(4) The confusion matrices show that the MLP gives a better model than logistic regression: it fits the training set perfectly (no class is confused with another, as the associated confusion matrix shows), while logistic regression makes some mistakes. This is what we expect from the theory: logistic regression can only perform linear classification, while the MLP also works for non-linear classification problems. Looking at the confusion matrices of both methods on the test set, logistic regression confuses more classes with each other than the MLP does. In particular, logistic regression confuses the digit 3 with 5, 8 with 5 and 9 with 7, while the MLP confuses 3 with 5, 8 with 9, and 9 with 7. Furthermore, while with 500 data points the classes most confused by the MLP were 7 and 9, with 60000 data points the MLP makes fewer errors on these classes, and they are no longer the ones associated with the greatest number of mistakes.
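
The scaling argument in point (1) can be checked empirically by fitting a power law to the measured training times. The sketch below estimates the exponent from a log-log linear fit; the helper `scaling_exponent` is hypothetical and the timings are synthetic values generated for illustration, not measurements from this notebook.

```python
import numpy as np

def scaling_exponent(n_samples, times):
    """Slope of log(time) versus log(n): time grows roughly like n**slope."""
    slope, _intercept = np.polyfit(np.log(n_samples), np.log(times), deg=1)
    return slope

toy_n = np.array([250, 500, 1000, 2000, 5000, 7500])
# Synthetic timings (NOT measured): one quadratic and one linear in n
toy_quadratic = 1e-6 * toy_n ** 2
toy_linear = 1e-3 * toy_n
exp_quadratic = scaling_exponent(toy_n, toy_quadratic)
exp_linear = scaling_exponent(toy_n, toy_linear)
print(round(exp_quadratic, 2), round(exp_linear, 2))
```

With the timings collected above, the same function could be applied as `scaling_exponent(n_data, running_times['SVC'])` and `scaling_exponent(n_data, running_times['MLPClassifier'])`; a larger exponent means the training time deteriorates faster as the dataset grows.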


## Data normalization

In the following, the importance of data normalization is investigated. In particular, an MLP with a (50,50,) architecture and a 'logistic' activation function is trained with the original MNIST data and the effects are analyzed.

In [None]:
# data are restored to their original scale 
X = X*255.
print(X[1])

#train-test data split
x_train, x_test, y_train, y_test = train_test_split(X, Y, train_size=m_t/len(Y), random_state=ID_number, stratify=Y)

In [None]:
best_mlp_large = MLPClassifier(hidden_layer_sizes=(50,50,), max_iter=max_iter, alpha=1e-4,activation='logistic', solver='sgd', tol=1e-4, 
                               random_state=None, learning_rate_init=.1, verbose=True)
best_mlp_large.fit(x_train, y_train)
training_error = 1. - best_mlp_large.score(x_train, y_train)
test_error = 1. - best_mlp_large.score(x_test, y_test)


print ('\nRESULTS FOR BEST NN\n')

print (f"Best NN training error: {training_error:.4f}")
print (f"Best NN test error: {test_error:.4f}")

## TO DO 10

Do you think data normalization is important? Why? Do you observe any difference between the results you obtained before and after scaling the data?

(Answer in the next cell, no need to write code)

Yes, data normalization is often important when working with machine learning algorithms. Many algorithms, especially those that use distance measures or gradient-based training, work best when all features are on a comparable scale. If this is not the case, some features may dominate others and make it difficult for the algorithm to learn effectively.
In my experience, data normalization can make a significant difference in the performance of machine learning algorithms: they often perform better and converge faster when the input data is normalized. However, the impact of normalization depends on the specific algorithm and dataset, so it is always a good idea to try both normalized and non-normalized versions of the data to see which one works better.
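
One mechanism behind this can be sketched numerically. With a 'logistic' activation, inputs in [0, 255] produce pre-activations about 255 times larger than inputs in [0, 1], which pushes the sigmoid into its flat region where the gradient is close to zero. The sketch below is only an illustration: it uses synthetic pixel values and hypothetical random weights, not the MNIST data or the actual initialization used by MLPClassifier.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative of the logistic function; its maximum is 0.25, reached at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(0)
# Synthetic stand-in for one raw image with gray levels in [0, 255]
pixels_raw = rng.integers(0, 256, size=784).astype(float)
pixels_scaled = pixels_raw / 255.0
# Hypothetical initial weights of a single hidden unit
weights = rng.normal(0.0, 0.05, size=784)

z_raw = weights @ pixels_raw        # pre-activation with unnormalized input
z_scaled = weights @ pixels_scaled  # pre-activation with input in [0, 1]
grad_raw = sigmoid_grad(z_raw)
grad_scaled = sigmoid_grad(z_scaled)
print(f'scaled input: z = {z_scaled:.3f}, gradient = {grad_scaled:.3e}')
print(f'raw input:    z = {z_raw:.3f}, gradient = {grad_raw:.3e}')
```

A near-zero gradient means the weights of that unit barely move during SGD, which is one plausible reason for slow or failed training on unscaled data.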