# Block 6 Exercise 1: Non-Linear Classification

## MNIST Data
We return to the MNIST data set of handwritten digits to compare non-linear classification algorithms ...

In [1]:
#imports 
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import fetch_openml

In [2]:
# Load data from https://www.openml.org/d/554
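# as_frame=False returns NumPy arrays; newer scikit-learn versions return a pandas DataFrame by default
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)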


In [3]:
# The full MNIST data set contains 70k samples of the digits 0-9 as 28x28 grayscale images (stored as 784-dimensional vectors)
np.shape(X)

(70000, 784)

In [4]:
X.min()

0.0

In [5]:
#look at max/min value in the data
X.max()

255.0
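
The range check above shows raw pixel intensities in [0, 255]. Below is a minimal sketch of how one might rescale them to [0, 1] and recover the 28x28 image layout. It uses a small random stand-in array (`X_demo`, a hypothetical name) instead of the real MNIST matrix, so it runs without a download.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for X: 5 "images" with intensities in [0, 255]
X_demo = rng.integers(0, 256, size=(5, 784)).astype(np.float64)
X_demo[0, 0], X_demo[0, 1] = 0.0, 255.0  # make sure both extremes occur

# Divide by the maximum intensity to map every pixel into [0, 1]
X_scaled = X_demo / 255.0

# Each 784-dimensional row can be viewed as a 28x28 image
images = X_scaled.reshape(-1, 28, 28)
print(X_scaled.min(), X_scaled.max(), images.shape)
```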

### E1.1: Cross-Validation and Support Vector Machines
Train and optimize a C-SVM classifier on MNIST (https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC)
* use an RBF kernel
* use *random search* with cross-validation to find the best settings for *gamma* and *C* (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html#sklearn.model_selection.RandomizedSearchCV)

In [6]:
from sklearn.svm import SVC
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform
from sklearn.model_selection import train_test_split

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15)

In [8]:
svm = SVC(kernel='rbf')  # RBF is already the default kernel; stated explicitly to match the task

In [13]:
parameters = dict(
    C=[1, 1.5, 1.75, 2, 2.25, 2.5],
    gamma=[10, 5, 4, 3, 2.75, 2.5, 1, 0.5, 0.25, 0.1, 'scale'],
)
search = RandomizedSearchCV(svm, parameters, n_jobs=20, n_iter=5)

# Fitting on the full training set is impractical here: search.fit(X_train, y_train)
# produced no result after 5 hours, so only the first 10,000 samples are used.
res = search.fit(X_train[:10000, :], y_train[:10000])


In [14]:
res.cv_results_

{'mean_fit_time': array([368.77409959, 368.31122675, 368.39364691, 368.77333145,
        143.93932977]),
 'std_fit_time': array([0.72971949, 0.75008831, 0.96537832, 0.56344559, 1.56178747]),
 'mean_score_time': array([40.06399422, 40.1446044 , 39.88161807, 40.01777482, 14.36956434]),
 'std_score_time': array([0.31458483, 0.42980579, 0.17255148, 0.13253761, 0.35395314]),
 'param_gamma': masked_array(data=[0.25, 0.25, 5, 1, 10],
              mask=[False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'param_C': masked_array(data=[2.5, 1.5, 2.5, 2.5, 1],
              mask=[False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'gamma': 0.25, 'C': 2.5},
  {'gamma': 0.25, 'C': 1.5},
  {'gamma': 5, 'C': 2.5},
  {'gamma': 1, 'C': 2.5},
  {'gamma': 10, 'C': 1}],
 'split0_test_score': array([0.114, 0.114, 0.114, 0.114, 0.114]),
 'split1_test_score': array([0.114, 0.114, 0.114, 0.114, 0.114]),
 'split2_test_score'

In [15]:
res.best_score_

0.11400000000000002

In [17]:
res.best_params_

{'gamma': 0.25, 'C': 2.5}
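
All five candidates score 0.114, which is close to the share of the most frequent digit in MNIST (about 11%), so the classifier is effectively predicting a single class. A likely cause (an interpretation, not something the run above proves) is that the RBF kernel `exp(-gamma * ||x - x'||^2)` is evaluated on raw pixels in [0, 255]: squared distances between images are then in the millions, and for every sampled gamma >= 0.25 the kernel value between two different images underflows to zero. The sketch below shows one way to address this: scale inside a pipeline and sample `C` and `gamma` log-uniformly. It uses scikit-learn's small bundled `load_digits` data as a stand-in for MNIST so that it runs in seconds; the search ranges are assumptions, not tuned values.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_digits
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Stand-in data (8x8 digits bundled with scikit-learn), not MNIST
X_d, y_d = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_d, y_d, test_size=0.15, random_state=0)

# Scaling lives inside the pipeline, so it is re-fit on each CV training fold
pipe = make_pipeline(MinMaxScaler(), SVC(kernel="rbf"))

# Assumed ranges: several orders of magnitude, sampled on a log scale
param_distributions = {
    "svc__C": loguniform(1e-1, 1e2),
    "svc__gamma": loguniform(1e-3, 1e0),
}
search_demo = RandomizedSearchCV(
    pipe, param_distributions, n_iter=10, cv=3, random_state=0
)
search_demo.fit(X_tr, y_tr)

print(search_demo.best_params_, search_demo.best_score_)
print("held-out accuracy:", search_demo.score(X_te, y_te))
```

Sampling on a log scale matters because useful values of `gamma` and `C` often differ by orders of magnitude; a hand-picked list between 0.1 and 10 covers only a narrow slice of that space.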

### E1.2: Pipelines and simple Neural Networks
Split the MNIST data into train and test sets, then train and evaluate a simple Multi-Layer Perceptron (MLP) network. Since the non-linear activation functions of MLPs are sensitive to the scale of the input (recall the *sigmoid* function), we need to scale all input values to [0,1].

* combine all steps of your training in a SKL pipeline (https://scikit-learn.org/stable/modules/compose.html#pipeline)
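* use a SKL-scaler to scale the data (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html); note that `StandardScaler` standardizes each feature to zero mean and unit variance, whereas `MinMaxScaler` maps values to [0,1]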
* MLP Parameters: https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier
    * use a *SGD* solver
    * use *tanh* as activation function
    * compare networks with 1, 2 and 3 layers, use different numbers of neurons per layer
    * adjust training parameters *alpha* (regularization) and *learning rate* - how sensitive is the model to these parameters?
    * Hint: do not change all parameters at the same time, split into several experiments
* How hard is it to find the best parameters? How many experiments would you need to find the best parameters?
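
To make the scaling step concrete, here is a small sketch comparing two scikit-learn scalers on a toy column of pixel-like values (the array `toy` is made up for illustration). `MinMaxScaler` maps values into [0, 1], whereas `StandardScaler` centers to zero mean and unit variance, so its output is not confined to [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up intensities for a single feature (one column, three samples)
toy = np.array([[0.0], [51.0], [255.0]])

minmax = MinMaxScaler().fit_transform(toy)
standard = StandardScaler().fit_transform(toy)

print(minmax.ravel())    # values lie in [0, 1]
print(standard.ravel())  # zero mean, unit variance; contains negative values
```

Either scaler can be the first step of a pipeline; which one works better for the MLP is an empirical question.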
    


In [18]:
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

In [19]:
clf = make_pipeline(
    StandardScaler(),
    # (32,) is a one-element tuple: a single hidden layer with 32 neurons
    MLPClassifier(random_state=1, activation='tanh', hidden_layer_sizes=(32,), solver='sgd', alpha=0.0001))

clf.fit(X_train,y_train)
clf.score(X_test,y_test)



0.9486666666666667

In [20]:
clf = make_pipeline(
    StandardScaler(),
    MLPClassifier(random_state=1, activation='tanh', hidden_layer_sizes=(32, 32), solver='sgd', alpha=0.0001))

clf.fit(X_train,y_train)
clf.score(X_test,y_test)



0.9453333333333334

In [21]:
clf = make_pipeline(
    StandardScaler(),
    MLPClassifier(random_state=1, activation='tanh', hidden_layer_sizes=(32, 32, 32), solver='sgd', alpha=0.0001))

clf.fit(X_train,y_train)
clf.score(X_test,y_test)



0.9459047619047619

In [None]:
It looks like the number of hidden layers makes little difference here: all three networks score between roughly 0.945 and 0.949 on the test set.
Good hyperparameters are very hard to pick by hand, so maybe we should try to optimize them automatically?
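
One way to automate this is to wrap the pipeline in `RandomizedSearchCV`, as was done for the SVM in E1.1. The sketch below is only an illustration: it uses the bundled `load_digits` data instead of MNIST to keep the runtime short, and the candidate values for layer sizes, `alpha`, and `learning_rate_init` are arbitrary assumptions.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_digits
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (8x8 digits bundled with scikit-learn), not MNIST
X_d, y_d = load_digits(return_X_y=True)

pipe = make_pipeline(
    StandardScaler(),
    MLPClassifier(solver="sgd", activation="tanh", max_iter=300, random_state=1),
)

# Pipeline step names are the lowercased class names, hence "mlpclassifier__"
param_distributions = {
    "mlpclassifier__hidden_layer_sizes": [(32,), (32, 32), (32, 32, 32)],
    "mlpclassifier__alpha": loguniform(1e-5, 1e-1),
    "mlpclassifier__learning_rate_init": loguniform(1e-3, 1e-1),
}
mlp_search = RandomizedSearchCV(
    pipe, param_distributions, n_iter=6, cv=3, random_state=0
)
mlp_search.fit(X_d, y_d)

print(mlp_search.best_params_, mlp_search.best_score_)
```

Inspecting `mlp_search.cv_results_` afterwards gives a rough picture of how sensitive the score is to each parameter, although six samples are far too few to draw firm conclusions.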