Train an SVM classifier on the MNIST dataset. Since SVM classifiers are binary classifiers, you will need to use one-versus-all to classify all 10 digits. You may want to tune the hyperparameters using small validation sets to speed up the process. What accuracy can you reach?

First, let's load the dataset and split it into a training set and a test set. We could use train_test_split() but people usually just take the first 60,000 instances for the training set, and the last 10,000 instances for the test set (this makes it possible to compare your model's performance with others):

In [1]:
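try:
    # Recent scikit-learn releases: download MNIST from OpenML as NumPy arrays
    from sklearn.datasets import fetch_openml
    mnist = fetch_openml('mnist_784', version=1, cache=True, as_frame=False)
except ImportError:
    # Fallback for old scikit-learn releases that predate fetch_openml
    from sklearn.datasets import fetch_mldata
    mnist = fetch_mldata('MNIST original')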

X = mnist["data"]
y = mnist["target"]

X_train = X[:60000]
y_train = y[:60000]
X_test = X[60000:]
y_test = y[60000:]

Many training algorithms are sensitive to the order of the training instances, so it's generally good practice to shuffle them first:

In [3]:
import numpy as np

np.random.seed(42)
rnd_idx = np.random.permutation(60000)
X_train = X_train[rnd_idx]
y_train = y_train[rnd_idx]
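
The key point of the shuffle above is that the same index permutation is applied to both the features and the labels, so each instance keeps its label. Here is a minimal sketch on toy data; `X_toy` and `y_toy` are hypothetical stand-ins for `X_train` and `y_train`, and the sketch uses `np.random.default_rng` rather than the global seed used in the cell above:

```python
import numpy as np

# Toy stand-ins: row i of X_toy holds 10 * i, and its label is i.
X_toy = np.arange(10).reshape(-1, 1) * 10
y_toy = np.arange(10)

rng = np.random.default_rng(42)
idx = rng.permutation(len(X_toy))  # one permutation shared by X and y

X_shuffled = X_toy[idx]
y_shuffled = y_toy[idx]

# Every shuffled row should still match its label.
pairs_intact = bool(np.all(X_shuffled[:, 0] == y_shuffled * 10))
print(pairs_intact)
```

Shuffling `X_train` and `y_train` with two separately drawn permutations would break this pairing.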

Let's start simple, with a linear SVM classifier. It will automatically use the One-vs-All (also called One-vs-the-Rest, OvR) strategy, so there's nothing special we need to do. Easy!

In [5]:
from sklearn.svm import LinearSVC

lin_clf = LinearSVC(random_state=42)
lin_clf.fit(X_train, y_train)



LinearSVC(random_state=42)
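
To see what the one-vs-rest strategy looks like in practice, here is a small sketch on scikit-learn's bundled 8x8 digits dataset (an assumption made to keep the example fast; it is not MNIST). With 10 classes, `LinearSVC` should learn one weight vector per class, and the predicted class is the one with the highest decision score. The `max_iter` value is an arbitrary choice to help convergence:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Small 8x8 digits dataset (64 features, 10 classes), used here only for speed.
X_digits, y_digits = load_digits(return_X_y=True)
X_digits = StandardScaler().fit_transform(X_digits)

ovr_clf = LinearSVC(random_state=42, max_iter=10000)
ovr_clf.fit(X_digits, y_digits)

# One row of weights per class: one binary "this digit vs the rest" classifier each.
print(ovr_clf.coef_.shape)

# Predicting amounts to picking the class whose binary classifier scores highest.
scores = ovr_clf.decision_function(X_digits[:5])
manual_pred = ovr_clf.classes_[np.argmax(scores, axis=1)]
print(manual_pred, ovr_clf.predict(X_digits[:5]))
```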

Let's make predictions on the training set and measure the accuracy (we don't want to measure it on the test set yet, since we have not selected and trained the final model yet):

In [6]:
from sklearn.metrics import accuracy_score

y_pred = lin_clf.predict(X_train)
accuracy_score(y_train, y_pred)

0.8656166666666667

An accuracy of about 86.6% on MNIST is poor, especially since it is measured on the training set. A linear model is probably too simple for MNIST, but perhaps we just need to scale the data first:

In [7]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype(np.float32))
X_test_scaled = scaler.transform(X_test.astype(np.float32))
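
The scaling cell above fits the scaler on the training set only and then reuses the learned statistics on the test set. A rough sketch of what that does, using random toy data (the arrays `train` and `test` are hypothetical; the first column is held at zero to mimic a pixel that is always blank):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train = rng.integers(0, 256, size=(100, 3)).astype(np.float32)
test = rng.integers(0, 256, size=(20, 3)).astype(np.float32)
train[:, 0] = 0.0  # constant feature, like an always-blank pixel
test[:, 0] = 0.0

toy_scaler = StandardScaler()
train_scaled = toy_scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = toy_scaler.transform(test)        # reuses the training statistics

print(train_scaled.mean(axis=0))  # close to 0 for every column
print(train_scaled.std(axis=0))   # close to 1, except the constant column (0)
```

The constant column comes out as all zeros rather than NaN, because `StandardScaler` avoids dividing by a zero standard deviation.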

In [8]:
lin_clf = LinearSVC(random_state=42)
lin_clf.fit(X_train_scaled, y_train)



LinearSVC(random_state=42)

In [9]:
y_pred = lin_clf.predict(X_train_scaled)
accuracy_score(y_train, y_pred)

0.92025
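
To put the two training accuracies side by side, it helps to convert them into error rates. This is plain arithmetic on the two numbers printed above:

```python
acc_unscaled = 0.8656166666666667  # LinearSVC on raw pixels
acc_scaled = 0.92025               # LinearSVC on standardized pixels

err_unscaled = 1 - acc_unscaled
err_scaled = 1 - acc_scaled

# Fraction of the original errors removed by scaling.
relative_reduction = (err_unscaled - err_scaled) / err_unscaled

print(f"error before scaling: {err_unscaled:.4f}")
print(f"error after scaling:  {err_scaled:.4f}")
print(f"relative reduction:   {relative_reduction:.3f}")
```

Scaling removes roughly 40% of the training errors.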

That's much better (the error rate dropped from about 13.4% to about 8.0%), but still not great for MNIST. To do better with an SVM, we will need a kernel. Let's try an SVC with an RBF kernel (the default kernel).

In [11]:
from sklearn.svm import SVC

# Train on the first 10,000 instances only, to keep the kernel SVM's training time manageable
svm_clf = SVC(decision_function_shape="ovr", gamma="auto")
svm_clf.fit(X_train_scaled[:10000], y_train[:10000])

SVC(gamma='auto')

In [12]:
y_pred = svm_clf.predict(X_train_scaled)
accuracy_score(y_train, y_pred)

0.9476

That's promising: we get better performance even though we trained the model on only one-sixth of the data. Let's tune the hyperparameters using a randomized search with cross-validation. We will do this on a small subset (the first 1,000 instances) just to speed up the process:

In [13]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import reciprocal, uniform

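# reciprocal(a, b) is log-uniform on [a, b]; uniform(loc, scale) covers [loc, loc + scale], i.e. [1, 11] here
param_distributions = {"gamma": reciprocal(0.001, 0.1), "C": uniform(1, 10)}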
rnd_search_cv = RandomizedSearchCV(svm_clf, param_distributions, n_iter=10, verbose=2, cv=3)
rnd_search_cv.fit(X_train_scaled[:1000], y_train[:1000])

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[CV] END ....C=8.852316058423087, gamma=0.001766074650481071; total time=   0.3s
[CV] END ....C=8.852316058423087, gamma=0.001766074650481071; total time=   0.3s
[CV] END ....C=8.852316058423087, gamma=0.001766074650481071; total time=   0.2s
[CV] END ...C=1.8271960104746645, gamma=0.006364737055453384; total time=   0.3s
[CV] END ...C=1.8271960104746645, gamma=0.006364737055453384; total time=   0.3s
[CV] END ...C=1.8271960104746645, gamma=0.006364737055453384; total time=   0.3s
[CV] END ....C=9.875199193765326, gamma=0.051349833451870636; total time=   0.3s
[CV] END ....C=9.875199193765326, gamma=0.051349833451870636; total time=   0.3s
[CV] END ....C=9.875199193765326, gamma=0.051349833451870636; total time=   0.3s
[CV] END ......C=6.59992909281409, gamma=0.05991666578466177; total time=   0.3s
[CV] END ......C=6.59992909281409, gamma=0.05991666578466177; total time=   0.3s
[CV] END ......C=6.59992909281409, gamma=0.05991

RandomizedSearchCV(cv=3, estimator=SVC(gamma='auto'),
                   param_distributions={'C': <scipy.stats._distn_infrastructure.rv_frozen object at 0x0000028F621B3790>,
                                        'gamma': <scipy.stats._distn_infrastructure.rv_frozen object at 0x0000028F621B35B0>},
                   verbose=2)
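
The two search distributions are easy to misread, so here is a short sketch that samples from them directly. It uses the same `scipy.stats` objects as the search; the sample size and `random_state` are arbitrary choices:

```python
import numpy as np
from scipy.stats import reciprocal, uniform

gamma_samples = reciprocal(0.001, 0.1).rvs(size=1000, random_state=42)
C_samples = uniform(1, 10).rvs(size=1000, random_state=42)

# gamma is log-uniform: each decade (0.001 to 0.01, 0.01 to 0.1) is equally likely.
print(gamma_samples.min(), gamma_samples.max())
print(np.mean(gamma_samples < 0.01))  # roughly half the samples

# uniform(loc=1, scale=10) spans [1, 11], not [1, 10].
print(C_samples.min(), C_samples.max())
```

A log-uniform distribution suits `gamma` because its useful values span orders of magnitude, whereas `C` is searched on a linear scale here.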

In [14]:
rnd_search_cv.best_estimator_

SVC(C=8.852316058423087, gamma=0.001766074650481071)

In [15]:
rnd_search_cv.best_score_

0.8630037222851593
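
`best_score_` is the mean cross-validated accuracy of the best candidate, not a test-set score. The sketch below illustrates this on scikit-learn's small digits dataset (an assumption made for speed; the subset size, `n_iter`, and `random_state` are arbitrary) by listing every candidate's mean score from `cv_results_`:

```python
import numpy as np
from scipy.stats import reciprocal, uniform
from sklearn.datasets import load_digits
from sklearn.model_selection import RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Small 8x8 digits dataset, first 500 instances only, scaled up front for simplicity.
X_digits, y_digits = load_digits(return_X_y=True)
X_small = StandardScaler().fit_transform(X_digits[:500])
y_small = y_digits[:500]

search = RandomizedSearchCV(
    SVC(),
    {"gamma": reciprocal(0.001, 0.1), "C": uniform(1, 10)},
    n_iter=5, cv=3, random_state=42,
)
search.fit(X_small, y_small)

# One mean validation accuracy per sampled candidate, averaged over the 3 folds.
mean_scores = search.cv_results_["mean_test_score"]
for params, score in zip(search.cv_results_["params"], mean_scores):
    print(f"C={params['C']:.3f}, gamma={params['gamma']:.5f}: {score:.4f}")

print(search.best_score_)  # the highest of the mean scores listed above
```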

This looks pretty low, but remember that the search only used 1,000 instances. Let's retrain the best estimator on the whole training set (this can take hours, so you may want to run it overnight):

In [16]:
rnd_search_cv.best_estimator_.fit(X_train_scaled, y_train)

SVC(C=8.852316058423087, gamma=0.001766074650481071)

In [17]:
y_pred = rnd_search_cv.best_estimator_.predict(X_train_scaled)
accuracy_score(y_train, y_pred)

0.99965

Ah, this looks good, although it is a training-set accuracy and therefore optimistic. Let's select this model. Now we can evaluate it on the test set:

In [18]:
y_pred = rnd_search_cv.best_estimator_.predict(X_test_scaled)
accuracy_score(y_test, y_pred)

0.9709