# Exercise 1

What is the fundamental idea behind Support Vector Machines?

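Fit the widest possible margin (the "street") between the classes for classification. For regression the objective is reversed: fit as many instances as possible inside the margin while limiting margin violations.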

# Exercise 2

What is a support vector?

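A support vector should not be confused with the normal vector of a hyperplane. The normal vector is orthogonal to the hyperplane and points to one side of it. A hyperplane is an (n-1)-dimensional subspace of an n-dimensional space, and it cuts the space into two halves.

In an SVM, a support vector is any training instance that lies on the edge of the margin or inside it, around the separating hyperplane between the classes. Only these instances determine the decision boundary; instances outside the margin have no influence on it.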
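To make this concrete, here is a minimal sketch that fits a linear `SVC` on toy data and inspects its support vectors. The dataset (`make_blobs` with hand-picked centers) and the variable names are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Toy data: two well-separated clusters (illustrative assumption).
X, y = make_blobs(n_samples=40, centers=[[-2, -2], [2, 2]],
                  cluster_std=0.8, random_state=0)

svm = SVC(kernel="linear", C=1.0).fit(X, y)

# Only instances on or inside the margin are stored as support vectors.
print("training instances:", len(X))
print("support vectors:   ", len(svm.support_vectors_))

# Refit using the support vectors alone and compare predictions on all data.
svm_sv_only = SVC(kernel="linear", C=1.0).fit(X[svm.support_], y[svm.support_])
same_predictions = np.array_equal(svm.predict(X), svm_sv_only.predict(X))
print("same predictions when trained on support vectors only:", same_predictions)
```

If the refitted model predicts the same labels, that is consistent with the idea that instances outside the margin do not affect the decision boundary.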

# Exercise 3

Why is it important to scale the inputs when using SVMs?

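If the features are not scaled, features with larger values dominate the distance computations because their deviations are larger. They then have a disproportionate influence on the margin and on which instances become support vectors, while features with smaller values are largely neglected.

Scikit-learn's LinearSVC also regularizes the bias term, so the features should be centered as well. StandardScaler handles both centering and scaling.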
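A minimal sketch of scaling inside a pipeline is shown below. The synthetic data (one small-scale feature, one feature in the thousands) is a hypothetical example, not data from this notebook.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
# Hypothetical data: feature 0 is around [-3, 3], feature 1 is in the thousands.
X = np.column_stack([rng.normal(0, 1, 200), rng.normal(5000, 1000, 200)])
y = (X[:, 0] > 0).astype(int)  # the label depends only on the small-scale feature

# StandardScaler centers each feature and scales it to unit variance.
scaled_svm = make_pipeline(StandardScaler(), LinearSVC(C=1.0, dual=False))
scaled_svm.fit(X, y)

X_scaled = scaled_svm.named_steps["standardscaler"].transform(X)
print("means:", X_scaled.mean(axis=0).round(3))
print("stds: ", X_scaled.std(axis=0).round(3))
train_acc = scaled_svm.score(X, y)
print("train accuracy:", train_acc)
```

Putting the scaler in the pipeline means the same transformation is applied automatically at prediction time.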

# Exercise 4

Can an SVM classifier output a confidence score when it classifies an instance? What about a probability?

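An SVM is a discriminative method and does not estimate class probabilities directly. It can, however, report the signed distance between an instance and the decision boundary, which can be used as a confidence score.

In Scikit-learn, setting probability=True makes SVC fit a logistic regression on the SVM's scores (Platt scaling, calibrated with internal cross-validation) to estimate class probabilities. This is fitted on the decision scores, not on the predicted classes.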
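The two kinds of output can be compared with a short sketch. The toy dataset is an assumption; `decision_function` and `predict_proba` are the Scikit-learn methods involved, and `predict_proba` is only available when `probability=True` is set.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Toy two-class data (illustrative assumption).
X, y = make_blobs(n_samples=100, centers=[[-2, -2], [2, 2]],
                  cluster_std=1.0, random_state=42)

# probability=True enables predict_proba at the cost of extra training time.
svm = SVC(kernel="linear", probability=True, random_state=42).fit(X, y)

scores = svm.decision_function(X[:3])  # signed scores, one per instance
probas = svm.predict_proba(X[:3])      # calibrated class probabilities

print("scores:", scores)
print("probabilities:", probas)
```

Because the probabilities come from a separate calibration step, they may occasionally disagree with the sign of the score for instances close to the boundary.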

# Exercise 5

Should you use the primal or the dual form of the SVM problem to train a model on a training set with millions of instances and hundreds of features?

For a linear kernel, if the number of instances is larger than the number of features, the primal problem should be solved. If the number of features is larger than the number of instances, solving the dual problem is more efficient. Kernelized SVMs can only use the dual form.

In this case, with millions of instances and only hundreds of features, the primal problem should be solved.

# Exercise 6

Say you’ve trained an SVM classifier with an RBF kernel, but it seems to underfit the training set. Should you increase or decrease γ (gamma)? What about C?

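Underfitting suggests that the model is too constrained and needs more degrees of freedom to capture the problem. Gamma should be increased so that each instance's range of influence shrinks and the decision boundary becomes more "wiggly". C should be increased as well: a larger C puts more weight on the slack variables, so margin violations are penalized more heavily and the model is regularized less.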
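The effect can be checked with a small experiment. This is a sketch on a toy `make_moons` dataset; the specific gamma and C values and the helper name `train_accuracy` are illustrative assumptions, not tuned settings.

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Non-linearly separable toy data (illustrative assumption).
X, y = make_moons(n_samples=200, noise=0.15, random_state=42)

def train_accuracy(gamma, C):
    """Fit a scaled RBF SVC and return its accuracy on the training set."""
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma=gamma, C=C))
    return model.fit(X, y).score(X, y)

acc_rigid = train_accuracy(gamma=0.01, C=0.01)     # heavily regularized
acc_flexible = train_accuracy(gamma=5.0, C=100.0)  # much more flexible

print("low gamma, low C:  ", acc_rigid)
print("high gamma, high C:", acc_flexible)
```

Training accuracy alone only shows the reduction in underfitting; a validation set is still needed to check that the more flexible model does not overfit.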

# Exercise 7

How should you set the QP parameters (H, f, A, and b) to solve the soft margin linear SVM classifier problem using an off-the-shelf QP solver?
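One possible setup is sketched below. It assumes the solver expects the standard form "minimize ½ pᵀHp + fᵀp subject to Ap ≤ b", that the parameter vector is ordered as p = [bias, w, ζ] (one slack variable ζ per instance), and that the targets t are encoded as -1 and +1. The function name `soft_margin_qp_params` is hypothetical.

```python
import numpy as np

def soft_margin_qp_params(X, t, C=1.0):
    """Build (H, f, A, b) for a soft margin linear SVM.

    Assumed parameter vector: p = [bias, w_1..w_n, zeta_1..zeta_m].
    Assumed targets: t[i] in {-1, +1}.
    """
    m, n = X.shape
    n_params = 1 + n + m

    # Objective 1/2 * w.w + C * sum(zeta): only the weights are penalized
    # quadratically, the slack variables are penalized linearly.
    H = np.zeros((n_params, n_params))
    H[1:1 + n, 1:1 + n] = np.eye(n)
    f = np.concatenate([np.zeros(1 + n), C * np.ones(m)])

    # Margin constraints: t_i * (bias + w.x_i) >= 1 - zeta_i
    # rewritten as:      -t_i * (bias + w.x_i) - zeta_i <= -1
    X_with_bias = np.hstack([np.ones((m, 1)), X])
    A_margin = np.hstack([-t[:, None] * X_with_bias, -np.eye(m)])

    # Slack constraints: zeta_i >= 0, rewritten as -zeta_i <= 0
    A_slack = np.hstack([np.zeros((m, 1 + n)), -np.eye(m)])

    A = np.vstack([A_margin, A_slack])
    b = np.concatenate([-np.ones(m), np.zeros(m)])
    return H, f, A, b

# Tiny usage example with two instances and two features.
X_demo = np.array([[2.0, 0.0], [-2.0, 0.0]])
t_demo = np.array([1.0, -1.0])
H, f, A, b = soft_margin_qp_params(X_demo, t_demo, C=3.0)
print(H.shape, f.shape, A.shape, b.shape)
```

With m instances and n features this gives 1 + n + m parameters and 2m constraints. The resulting arrays can then be passed to whichever QP solver is available, adapting the argument order to that solver's interface.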

# Exercise 8

Train a LinearSVC on a linearly separable dataset. Then train an SVC and a SGDClassifier on the same dataset. See if you can get them to produce roughly the same model.
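A minimal sketch of one way to approach this is shown below. The toy dataset, the value of C, and the choice `alpha = 1 / (C * m)` for the SGD regularization strength are assumptions made for illustration; other settings may bring the models closer together.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

# Linearly separable toy data (illustrative assumption).
X, y = make_blobs(n_samples=100, centers=[[-3, -3], [3, 3]],
                  cluster_std=0.8, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

C = 5.0
alpha = 1.0 / (C * len(X_scaled))  # assumed mapping from C to SGD's alpha

models = {
    "LinearSVC": LinearSVC(loss="hinge", C=C, dual=True,
                           max_iter=10_000, random_state=42),
    "SVC": SVC(kernel="linear", C=C),
    "SGDClassifier": SGDClassifier(loss="hinge", alpha=alpha,
                                   max_iter=10_000, tol=1e-4, random_state=42),
}

accuracies = {}
for name, model in models.items():
    model.fit(X_scaled, y)
    accuracies[name] = model.score(X_scaled, y)
    # Compare the learned hyperplanes: intercept and coefficients.
    print(name, model.intercept_, model.coef_, accuracies[name])
```

The coefficients should be similar but not identical: among other differences, LinearSVC regularizes the bias term while SVC does not, and SGDClassifier optimizes the objective stochastically.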

# Exercise 9

Train an SVM classifier on the MNIST dataset. Since SVM classifiers are binary classifiers, you will need to use one-versus-the-rest to classify all 10 digits. You may want to tune the hyperparameters using small validation sets to speed up the process. What accuracy can you reach?

In [39]:
from multiprocessing import cpu_count
# Leave one core free for the system, but always use at least one worker.
n_jobs = max(1, cpu_count() - 1)

In [40]:
import joblib

import numpy as np

from sklearn.datasets import fetch_openml
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC, SVC
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier

In [41]:
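# as_frame=False requests NumPy arrays; newer Scikit-learn versions may otherwise return a pandas DataFrame.
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
X = (X/256.).astype(np.float32)  # scale pixel values (0..255) into [0, 1)
y = y.astype(np.uint8)  # labels are loaded as strings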

X_train = X[:60000]
y_train = y[:60000]
X_test = X[60000:]
y_test = y[60000:]

First, train a linear SVM (LinearSVC) to establish **baseline performance**. Solve the dual problem if n_samples <= n_features, otherwise solve the primal problem.

In [42]:
n_samples, n_features = X_train.shape

In [43]:
clf = make_pipeline(StandardScaler(), LinearSVC(dual=(n_samples <= n_features)))

In [44]:
clf.fit(X_train, y_train)



Pipeline(memory=None,
         steps=[('standardscaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('linearsvc',
                 LinearSVC(C=1.0, class_weight=None, dual=False,
                           fit_intercept=True, intercept_scaling=1,
                           loss='squared_hinge', max_iter=1000,
                           multi_class='ovr', penalty='l2', random_state=None,
                           tol=0.0001, verbose=0))],
         verbose=False)

In [45]:
y_pred = clf.predict(X_train)

In [46]:
accuracy_score(y_train, y_pred)

0.9274666666666667

In [47]:
y_pred = clf.predict(X_train)
y_pred_test = clf.predict(X_test)
print('Train Accuracy: {:6.2f}'.format(accuracy_score(y_train, y_pred)))
print('Test Accuracy: {:6.2f}'.format(accuracy_score(y_test, y_pred_test)))

Train Accuracy:   0.93
Test Accuracy:   0.92


Build a pipeline and run a grid search over several model families to find the best approach.

In [48]:
pipeline_model = Pipeline([('scaler',StandardScaler()), ('clf',KNeighborsClassifier())])

In [49]:
param_grid = [
    {'clf':[KNeighborsClassifier()], 'clf__weights': ["uniform", "distance"], 'clf__n_neighbors': [3, 4, 5]},
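    # Note: gamma is ignored by the linear kernel, so the linear candidates are repeated once per gamma value.
    {'clf':[SVC()], 'clf__kernel':['linear', 'rbf'], 'clf__C': [1., 10., 100., 1000.], 'clf__gamma':[0.01, 0.1, 1.]},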
    {'clf':[RandomForestClassifier()], 'clf__n_estimators': [10, 100, 1000], 'clf__max_depth':[None, 5, 10]},
    {'clf':[GradientBoostingClassifier()], 'clf__n_estimators': [10, 100, 1000], 'clf__learning_rate':[0.01, 0.1, 0.25]},
    {'clf':[MLPClassifier(early_stopping=True)], 'clf__hidden_layer_sizes':[(16,), (16,16,), (32,16,),(32,32,)], 'clf__alpha':[0.000001, 0.00001, 0.0001], 'clf__learning_rate':['invscaling', 'adaptive']}
]

In [50]:
grid = GridSearchCV(pipeline_model, param_grid, scoring="accuracy", cv=3, n_jobs=n_jobs, verbose=2)

In [51]:
grid.fit(X_train, y_train)

Fitting 3 folds for each of 72 candidates, totalling 216 fits


[Parallel(n_jobs=11)]: Using backend LokyBackend with 11 concurrent workers.
[Parallel(n_jobs=11)]: Done  19 tasks      | elapsed: 59.4min
[Parallel(n_jobs=11)]: Done 140 tasks      | elapsed: 1521.4min
[Parallel(n_jobs=11)]: Done 216 out of 216 | elapsed: 1689.7min finished


GridSearchCV(cv=3, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('scaler',
                                        StandardScaler(copy=True,
                                                       with_mean=True,
                                                       with_std=True)),
                                       ('clf',
                                        KNeighborsClassifier(algorithm='auto',
                                                             leaf_size=30,
                                                             metric='minkowski',
                                                             metric_params=None,
                                                             n_jobs=None,
                                                             n_neighbors=5, p=2,
                                                             weights='uniform'))],
                                verbose=False),
           

In [52]:
joblib.dump(grid, 'grid.pkl', compress=1)

['grid.pkl']

In [53]:
with open('grid.pkl', 'rb') as f:
    grid = joblib.load(f)

In [54]:
grid.best_estimator_

Pipeline(memory=None,
         steps=[('scaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('clf',
                 GradientBoostingClassifier(ccp_alpha=0.0,
                                            criterion='friedman_mse', init=None,
                                            learning_rate=0.1, loss='deviance',
                                            max_depth=3, max_features=None,
                                            max_leaf_nodes=None,
                                            min_impurity_decrease=0.0,
                                            min_impurity_split=None,
                                            min_samples_leaf=1,
                                            min_samples_split=2,
                                            min_weight_fraction_leaf=0.0,
                                            n_estimators=1000,
                                            n_iter_no_change=None,
                 

In [56]:
y_pred = grid.predict(X_train)
y_pred_test = grid.predict(X_test)
print('Train Accuracy: {:6.2f}'.format(accuracy_score(y_train, y_pred)))
print('Test Accuracy: {:6.4f}'.format(accuracy_score(y_test, y_pred_test)))

Train Accuracy:   1.00
Test Accuracy: 0.9773
