<a href="https://colab.research.google.com/github/matteobolner/AML_Basic/blob/master/mnist_svm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import time
import datetime as dt

In [None]:
from sklearn import datasets, svm, metrics
from sklearn.datasets import fetch_openml

In [None]:
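# as_frame=False returns NumPy arrays, so positional indexing such as X[1] works
# (recent scikit-learn versions return a pandas DataFrame by default).
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)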

In [None]:
from sklearn.preprocessing import MinMaxScaler
# Rescale every pixel column from its raw 0-255 range to [0, 1].
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)

In [None]:
X[1]

In [None]:
rescaledX[1]

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
x_train, x_test, y_train, y_test = train_test_split(rescaledX, y, test_size=0.15, random_state=42)
data = {"train": {"X": x_train, "y": y_train}, "test": {"X": x_test, "y": y_test},}

In [None]:
from sklearn.svm import SVC

C -> regularization parameter
In practice, SVMs tend to resist over-fitting, even when the number of attributes exceeds the number of observations, because they use regularization. The key to avoiding over-fitting lies in careful tuning of the regularization parameter, C, and, for non-linear SVMs, careful choice of the kernel and tuning of its parameters.

The SVM is an approximate implementation of a bound on the generalization error that depends on the margin (essentially the distance from the decision boundary to the nearest pattern from each class) but is independent of the dimensionality of the feature space (which is why using the kernel trick to map the data into a very high-dimensional space isn't as bad an idea as it might seem). So in principle SVMs should be highly resistant to over-fitting, but in practice this depends on the careful choice of C and the kernel parameters. Sadly, over-fitting can also occur quite easily when tuning the hyper-parameters, which is my main research area; see

G. C. Cawley and N. L. C. Talbot, Preventing over-fitting in model selection via Bayesian regularisation of the hyper-parameters, Journal of Machine Learning Research, vol. 8, pp. 841-861, April 2007.

and

G. C. Cawley and N. L. C. Talbot, Over-fitting in model selection and subsequent selection bias in performance evaluation, Journal of Machine Learning Research, vol. 11, pp. 2079-2107, July 2010.

Both of those papers use kernel ridge regression, rather than the SVM, but the same problem arises just as easily with SVMs (also similar bounds apply to KRR, so there isn't that much to choose between them in practice). So in a way, SVMs don't really solve the problem of over-fitting, they just shift the problem from model fitting to model selection.

It is often a temptation to make life a bit easier for the SVM by performing some sort of feature selection first. This generally makes matters worse, as unlike the SVM, feature selection algorithms tend to exhibit more over-fitting as the number of attributes increases. Unless you want to know which are the informative attributes, it is usually better to skip the feature selection step and just use regularization to avoid over-fitting the data.

In short, there is no inherent problem with using an SVM (or other regularised model such as ridge regression, LARS, Lasso, elastic net etc.) on a problem with 120 observations and thousands of attributes, provided the regularisation parameters are tuned properly.
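
One common way to keep hyper-parameter tuning from biasing the performance estimate, as discussed above, is nested cross-validation: an inner search picks C and gamma, and an outer loop scores the whole tuning procedure on folds the search never saw. The following is a minimal sketch under stated assumptions, not the method used later in this notebook. It swaps in scikit-learn's small 8x8 `load_digits` set as a stand-in for MNIST so that it runs in seconds, and the names `X_small`, `y_small`, `pipe`, `inner_search` and `outer_scores`, as well as the grid values, are illustrative choices.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Assumption: the small 8x8 digits set stands in for MNIST to keep this fast.
X_small, y_small = load_digits(return_X_y=True)
X_small, y_small = X_small[:600], y_small[:600]

# Putting the scaler in the pipeline means it is re-fit on each training fold only.
pipe = make_pipeline(MinMaxScaler(), SVC(kernel="rbf"))

# Illustrative grid; make_pipeline names the SVC step "svc".
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": [0.001, 0.01, 0.1]}

# Inner loop: model selection. Outer loop: evaluation of the tuned model.
inner_search = GridSearchCV(pipe, param_grid, cv=3)
outer_scores = cross_val_score(inner_search, X_small, y_small, cv=3)

print("Nested CV accuracy: %0.4f +/- %0.4f" % (outer_scores.mean(), outer_scores.std()))
```

The outer score estimates how well "tune, then fit" generalizes, rather than how well the single best grid point happened to score during the search.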


In [None]:
# Degree-9 polynomial kernels are used with virtual SVMs (virtualized data)
#mnist_classifier = SVC(probability=False, kernel="poly", degree=9, C=2, gamma=0.01)

In [None]:
#mnist_classifier = SVC(probability=False, kernel="rbf", C=1, gamma=0.05)

In [None]:
examples = len(data["train"]["X"])
#mnist_classifier.fit(data["train"]["X"][:1000], data["train"]["y"][:1000])


In [None]:
#from sklearn import metrics

#predicted = mnist_classifier.predict(data["test"]["X"])
#print("Confusion matrix:\n%s" % metrics.confusion_matrix(data["test"]["y"], predicted))
#print("Accuracy: %0.4f" % metrics.accuracy_score(data["test"]["y"], predicted))

# try_id = 1
#out = mnist_classifier.predict(data["test"]["X"][try_id:try_id + 1])  # predict expects a 2-D array
#print("out: %s" % out)
#size = int(len(data["test"]["X"][try_id]) ** (0.5))
#view_image(  # note: view_image is not defined anywhere in this notebook
#    data["test"]["X"][try_id].reshape((size, size)), data["test"]["y"][try_id]
#)

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
gamma_range = np.outer(np.logspace(-3, 0, 4),np.array([1,5]))

In [None]:
gamma_range = gamma_range.flatten()

In [None]:
gamma_range
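
To make the construction of the grid above concrete, here is a small standalone sketch of the same outer-product trick. It multiplies each decade from 1e-3 to 1e0 by 1 and by 5, which gives eight roughly log-spaced candidates. The names `decades`, `multipliers` and `grid` are illustrative and not used elsewhere in the notebook.

```python
import numpy as np

# Decades 1e-3, 1e-2, 1e-1, 1e0.
decades = np.logspace(-3, 0, 4)
# Each decade is taken at 1x and 5x.
multipliers = np.array([1, 5])

# np.outer gives a 4x2 table (one row per decade); flatten reads it row by row,
# so the result is already sorted in increasing order.
grid = np.outer(decades, multipliers).flatten()
print(grid)
```

The same pattern with `np.logspace(-1, 1, 3)` gives six candidates for C, so the full search covers 6 x 8 = 48 parameter combinations.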

In [None]:
#gamma_range = [0.001, 0.1, 10]

In [None]:
C_range = np.outer(np.logspace(-1, 1, 3),np.array([1,5]))

In [None]:
C_range = C_range.flatten()

In [None]:
#C_range = [0.1,1,10] #for testing, less parameters

In [None]:
parameters = {'kernel':['rbf'], 'C':C_range, 'gamma': gamma_range}

In [None]:
svm_clsf = svm.SVC()

In [None]:
grid_clsf = GridSearchCV(estimator=svm_clsf,param_grid=parameters,n_jobs=6, verbose=2)

In [None]:
start_time = dt.datetime.now()
print('Start param searching at {}'.format(str(start_time)))

In [None]:

grid_clsf.fit(data["train"]["X"][:examples], data["train"]["y"][:examples])

In [None]:
elapsed_time= dt.datetime.now() - start_time
print('Elapsed time, param searching {}'.format(str(elapsed_time)))

In [None]:
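# Names of the per-candidate result arrays collected by the search.
print(sorted(grid_clsf.cv_results_.keys()))

classifier = grid_clsf.best_estimator_
params = grid_clsf.best_params_
# Rows follow C_range and columns follow gamma_range (the grid iterates C slowest).
scores = grid_clsf.cv_results_['mean_test_score'].reshape(len(C_range), len(gamma_range))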
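
The reshaped score matrix lends itself to a heat map with C on one axis and gamma on the other. The helper below is a hypothetical sketch, not part of the original notebook: `plot_grid_scores` is an illustrative name, and the demo values are random placeholders rather than real search results.

```python
import numpy as np
import matplotlib.pyplot as plt


def plot_grid_scores(score_matrix, C_values, gamma_values):
    """Draw mean CV scores as a heat map (rows: C, columns: gamma)."""
    fig, ax = plt.subplots(figsize=(6, 4))
    im = ax.imshow(score_matrix, cmap="viridis", aspect="auto")
    # Label each column with its gamma value and each row with its C value.
    ax.set_xticks(np.arange(len(gamma_values)))
    ax.set_xticklabels(["%g" % g for g in gamma_values])
    ax.set_yticks(np.arange(len(C_values)))
    ax.set_yticklabels(["%g" % c for c in C_values])
    ax.set_xlabel("gamma")
    ax.set_ylabel("C")
    fig.colorbar(im, ax=ax, label="mean CV accuracy")
    return ax


# Placeholder data only: random numbers standing in for real grid-search scores.
demo_C = [0.1, 1, 10]
demo_gamma = [0.001, 0.01, 0.1, 1]
demo_scores = np.random.default_rng(0).uniform(0.8, 1.0, size=(3, 4))
demo_ax = plot_grid_scores(demo_scores, demo_C, demo_gamma)
```

In this notebook the equivalent call would be `plot_grid_scores(scores, C_range, gamma_range)`, followed by `plt.show()`.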


In [None]:
grid_clsf.cv_results_

In [None]:
from sklearn import metrics

predicted = grid_clsf.predict(data["test"]["X"])
print("Confusion matrix:\n%s" % metrics.confusion_matrix(data["test"]["y"], predicted))
print("Accuracy: %0.4f" % metrics.accuracy_score(data["test"]["y"], predicted))
