# Support Vector Machines

In [1]:
import random
import numpy as np
import pandas as pd

The cells below reuse the code from the previous exercises to generate the dataset and to define the explanatory and target variables.

In [2]:
n = {'A': 30, 'B': 20}
cov = np.array([[4, 2], [2, 4]])
means = {'A': np.array([-1, -1]), 'B': np.array([2, 2])}

In [3]:
data = pd.DataFrame(index=range(sum(n.values())), columns=['x', 'y', 'label'])
data.loc[:n['A']-1, ['x', 'y']] = np.random.multivariate_normal(means['A'], cov, n['A'])
data.loc[:n['A']-1, ['label']] = 'A'
data.loc[n['A']:, ['x', 'y']] = np.random.multivariate_normal(means['B'], cov, n['B'])
data.loc[n['A']:, ['label']] = 'B'

In [4]:
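# The frame was created empty, so its columns have object dtype; cast the features to float
X = data[['x', 'y']].astype(float)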
y = data['label']

We will check the performance of a Support Vector Machine (SVM) classifier with a linear kernel and built-in L2 regularization. Regularization is a technique that reduces overfitting by adding extra information, typically a penalty term, to the training objective. The most common variants are L1 and L2 regularization, and they differ in the penalty term: L1 penalizes the sum of the absolute values of the weights, while L2 penalizes the sum of their squares. For the SVM classifier we can set the regularization parameter $C$, which controls how heavily misclassifications are penalized, so the strength of the regularization is inversely proportional to $C$. Intuitively, as $C$ grows, fewer misclassified training examples are tolerated; as $C$ tends to 0, more misclassifications are tolerated. We will try several values of $C$ and check which one gives the best cross-validated accuracy.
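To make the role of $C$ concrete, the following minimal sketch fits a linear SVM for three values of $C$ and reports the number of support vectors and the margin width $2 / \lVert w \rVert$. The seed, the names `X_demo`, `y_demo` and `summary`, and the chosen values of $C$ are assumptions made for this illustration; the data is a fresh sample with the same means and covariance as above, not the dataset used in the other cells.

```python
import numpy as np
from sklearn.svm import SVC

# Assumption: a fixed seed and a fresh sample that mirrors the setup above.
rng = np.random.default_rng(0)
cov = np.array([[4, 2], [2, 4]])
X_demo = np.vstack([
    rng.multivariate_normal([-1, -1], cov, 30),  # class A
    rng.multivariate_normal([2, 2], cov, 20),    # class B
])
y_demo = np.array(['A'] * 30 + ['B'] * 20)

summary = {}
for C in [0.01, 1, 100]:
    clf = SVC(kernel='linear', C=C).fit(X_demo, y_demo)
    # For a linear kernel the margin width is 2 / ||w||.
    margin = 2 / np.linalg.norm(clf.coef_)
    summary[C] = (int(clf.n_support_.sum()), margin)
    print(f'C={C}: support vectors={summary[C][0]}, margin width={margin:.2f}')
```

A small $C$ should give a wide margin that many points fall inside, while a large $C$ should narrow the margin in an attempt to classify more training points correctly.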

In [5]:
from sklearn import svm
from sklearn.model_selection import cross_val_score

Cs = [0.01, 0.05, 0.1, 0.2, 0.5, 1, 2, 5, 10]

for C in Cs:
    clf = svm.SVC(kernel='linear', C=C)
    # cross_val_score clones and fits the estimator on every fold, so no separate fit is needed
    accuracy = cross_val_score(clf, X, y, cv=10).mean()
    print('C: {} --> Accuracy: {}'.format(C, accuracy))

C: 0.01 --> Accuracy: 0.6799999999999999
C: 0.05 --> Accuracy: 0.76
C: 0.1 --> Accuracy: 0.74
C: 0.2 --> Accuracy: 0.68
C: 0.5 --> Accuracy: 0.7
C: 1 --> Accuracy: 0.72
C: 2 --> Accuracy: 0.72
C: 5 --> Accuracy: 0.7
C: 10 --> Accuracy: 0.7


From these results, $C = 0.05$ gives the highest cross-validated accuracy (0.76). The lead over the other values is small, though: with 50 samples, a difference of 0.02 in mean accuracy corresponds to a single misclassified point. Now we will check the performance of the LDA classifier and compare it with the SVM.

In [6]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

clf = LinearDiscriminantAnalysis()
# cross_val_score clones and fits the estimator on every fold, so no separate fit is needed
accuracy = cross_val_score(clf, X, y, cv=10).mean()
print('Accuracy: {}'.format(accuracy))

Accuracy: 0.74
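With only 50 samples, each of the 10 folds holds 5 points, so the mean accuracy alone hides a lot of variation. The sketch below reports the per-fold spread for both classifiers on identical folds. It is an illustration under assumptions: the seed, the shuffled `StratifiedKFold` splitter, and the names `X_demo`, `y_demo` and `fold_scores` are not part of the original cells, and the data is a fresh sample drawn with the same parameters.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Assumption: a fixed seed and a fresh sample that mirrors the setup above.
rng = np.random.default_rng(0)
cov = np.array([[4, 2], [2, 4]])
X_demo = np.vstack([
    rng.multivariate_normal([-1, -1], cov, 30),  # class A
    rng.multivariate_normal([2, 2], cov, 20),    # class B
])
y_demo = np.array(['A'] * 30 + ['B'] * 20)

# Use the same folds for both models so that the scores are directly comparable.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
models = {
    'SVM (linear, C=0.05)': SVC(kernel='linear', C=0.05),
    'LDA': LinearDiscriminantAnalysis(),
}

fold_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv)
    fold_scores[name] = scores
    print(f'{name}: mean={scores.mean():.2f}, std={scores.std():.2f}')
```

If the standard deviation across folds is much larger than the gap between the two means, the ranking of the classifiers on this dataset should be read with caution.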


Although the accuracy of LDA (0.74) is close to the best accuracy obtained by the SVM classifier (0.76), the two methods differ in several ways. First, fitting an SVM means solving an optimization problem numerically, while LDA has an analytical solution. LDA uses the entire dataset to estimate the class means and the covariance matrix, so it is sensitive to outliers. The SVM decision boundary, on the other hand, is determined by a subset of the data only, the support vectors. The SVM also makes no distributional assumptions about the data, whereas LDA assumes that each class is normally distributed with a shared covariance matrix; this makes the SVM more flexible, at the cost of interpretability. Finally, LDA is a linear classifier, while an SVM can use a kernel to change the classifier from linear to non-linear.
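Two of these points can be checked directly: that the fitted SVM depends only on its support vectors, and that switching the kernel changes the shape of the decision boundary. The sketch below is a minimal illustration under assumptions: the seed, $C = 1$, the RBF kernel, and the names `X_demo`, `y_demo`, `sv`, `coef_gap` and `agreement` are choices made for this example, and the data is a fresh sample drawn with the same parameters as above.

```python
import numpy as np
from sklearn.svm import SVC

# Assumption: a fixed seed and a fresh sample that mirrors the setup above.
rng = np.random.default_rng(0)
cov = np.array([[4, 2], [2, 4]])
X_demo = np.vstack([
    rng.multivariate_normal([-1, -1], cov, 30),  # class A
    rng.multivariate_normal([2, 2], cov, 20),    # class B
])
y_demo = np.array(['A'] * 30 + ['B'] * 20)

linear_svm = SVC(kernel='linear', C=1).fit(X_demo, y_demo)

# support_ lists the indices of the training points that act as support vectors.
sv = linear_svm.support_
print(f'{len(sv)} of {len(X_demo)} points are support vectors')

# Refit on the support vectors only; the hyperplane should barely move.
refit = SVC(kernel='linear', C=1).fit(X_demo[sv], y_demo[sv])
coef_gap = np.abs(refit.coef_ - linear_svm.coef_).max()
print(f'largest change in coefficients after refit: {coef_gap:.4f}')

# Swapping the kernel makes the boundary non-linear with no other code change.
rbf_svm = SVC(kernel='rbf', C=1).fit(X_demo, y_demo)
agreement = (rbf_svm.predict(X_demo) == linear_svm.predict(X_demo)).mean()
print(f'linear and RBF predictions agree on {agreement:.0%} of training points')
```

The coefficient change after the refit is expected to be close to zero, up to the tolerance of the solver, because points outside the margin do not enter the solution.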