# Support Vector Machine #

SVMs are a class of supervised machine learning algorithms. An SVM resembles other linear classifiers, such as logistic regression, in that it fits a linear boundary separating two classes. However, instead of fitting that boundary against all data points, it uses only a subset of them. There is also a support vector clustering algorithm for unsupervised learning.

The SVM algorithm identifies the hyperplane (a line in 2D space) that separates the classes while maximizing the distance between the hyperplane and the nearest points. The space between the hyperplane and the nearest points is referred to as the *margin*, and the nearest points are referred to as the *support vectors*.

Given data points $x = (x_1, \dotsc, x_n)$ and labels $y = (y_1, \dotsc, y_n)$:

* Hyperplane: $w^Tx + b = 0$
* Distance from a point $x_i$ to the hyperplane (smallest for the support vectors): $d_i = \frac{|w^T x_i + b|}{||w||}$
* Linear classifier:
    * $\hat{y} = 1$ if $w^Tx + b \geq 0$
    * $\hat{y} = 0$ if $w^Tx + b < 0$
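The quantities above can be sketched on a small example. This is a minimal illustration, assuming the $w^T x + b$ sign convention used in the distance and classifier formulas. The toy data and the large `C` value are made up for this example.

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2D data: two linearly separable clusters (made up for illustration).
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard margin on separable data.
clf = SVC(kernel='linear', C=1e6).fit(X, y)

w = clf.coef_[0]        # normal vector of the hyperplane
b = clf.intercept_[0]   # offset

# Score w^T x + b and distance of every point to the hyperplane.
scores = X @ w + b
distances = np.abs(scores) / np.linalg.norm(w)

# Linear classifier: class 1 where the score is non-negative, else class 0.
y_hat = (scores >= 0).astype(int)

print('Support vectors:')
print(clf.support_vectors_)
print('Distances:', distances)
print('Predictions:', y_hat)
```

The scores computed by hand should match `clf.decision_function(X)`, which is scikit-learn's own evaluation of $w^T x + b$ for a linear kernel.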

The algorithm may use a hard or a soft margin. A hard margin requires the hyperplane to separate the data with no misclassifications. A soft margin allows misclassifications but penalizes them (usually with hinge loss).
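As a rough sketch of the soft-margin penalty, the hinge loss for a point with score $w^T x + b$ is $\max(0, 1 - y \cdot \text{score})$. This form assumes labels coded as $-1$ and $+1$ rather than 0 and 1. The scores below are hypothetical values chosen for illustration.

```python
import numpy as np

def hinge_loss(scores, labels):
    """Mean hinge loss; labels must be coded as -1 or +1."""
    return np.mean(np.maximum(0.0, 1.0 - labels * scores))

# Hypothetical decision scores w^T x + b for four points.
scores = np.array([2.0, 0.5, -0.3, -1.5])
labels = np.array([1, 1, 1, -1])

# Per-point penalty.
losses = np.maximum(0.0, 1.0 - labels * scores)
mean_loss = hinge_loss(scores, labels)

print('Per-point hinge loss:', losses)
print('Mean hinge loss:', mean_loss)
```

The first and last points are correctly classified and outside the margin, so they incur no penalty. The second is correct but inside the margin, and the third is misclassified, so both are penalized.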

##### Kernels ##### 

SVMs can classify data that is not linearly separable by using the kernel trick, which implicitly maps inputs into a higher-dimensional space where the data may be separable.

Popular kernels include:

* linear
* polynomial 
* Gaussian RBF (radial basis function)
* sigmoid
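To make the kernel trick concrete, here is a small sketch using a degree-2 polynomial kernel with no constant term, $K(x, y) = (x^T y)^2$, in 2D. The feature map `phi` and the two sample points are illustrative choices, not part of the original notebook.

```python
import numpy as np

def phi(v):
    """Explicit feature map for the kernel (x^T y)^2 on 2D inputs."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

# Inner product computed in the 3D feature space.
explicit = phi(x) @ phi(y)
# Same quantity computed directly in the original 2D space.
kernel = (x @ y) ** 2

print('Explicit feature map:', explicit)
print('Kernel function:', kernel)
```

Both values agree, which is the point of the trick: the kernel gives the inner product in the higher-dimensional space without ever constructing the mapped vectors.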

In [23]:
# import necessary packages

from sklearn import datasets, model_selection, pipeline, metrics
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

### Import Data ###

Import the built-in breast cancer data set and split into training and testing set.

In [85]:
data = datasets.load_breast_cancer()

X = data['data']
Y = data['target']

X_train,X_test,Y_train,Y_test = model_selection.train_test_split(X, Y, train_size=0.7)


### Create Linear Model ###

Create an SVM classification model with a linear kernel. Note that SVMs work best when each feature is standardized to zero mean and unit variance. I have integrated this preprocessing step into the model by using scikit-learn's `make_pipeline` function.

In [24]:
model = pipeline.make_pipeline(StandardScaler(), SVC(kernel='linear'))
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)

### Analyze Linear Model ###

The SVM model does a good job of classifying the malignant and benign cases, as evidenced by a high accuracy on the testing set and high precision and recall. In particular, the latter two measures show that the model labels most positive cases as positive (high recall) with few erroneous positive labels (high precision).

##### For comparison: #####
This data set was also used when implementing logistic regression, which produced the following metrics:

* Accuracy: 0.68
* Precision: 0.64
* Recall: 1
* F1-score: 0.78

The SVM therefore does a better job of classifying this data set than logistic regression did.

In [28]:
train_acc = model.score(X_train, Y_train)
print('Training set accuracy:', train_acc)
test_acc = model.score(X_test, Y_test)
print('Testing set accuracy:', test_acc)
print()

prec = metrics.precision_score(Y_test, Y_pred)
print('Precision:', prec)
recall = metrics.recall_score(Y_test, Y_pred)
print('Recall:', recall)
f1 = metrics.f1_score(Y_test, Y_pred)
print('F1 score:', f1)
print()
# note: this AUC is computed from the hard 0/1 predictions, not from continuous scores
auc = metrics.roc_auc_score(Y_test, Y_pred)
print('AUC:', auc)

Training set accuracy: 0.992462311557789
Testing set accuracy: 0.9532163742690059

Precision: 0.9433962264150944
Recall: 0.9803921568627451
F1 score: 0.9615384615384616

AUC: 0.9467178175618074
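The AUC in the cell above is computed from the hard 0/1 predictions. With hard labels, the ROC curve has a single interior point, and the resulting AUC equals the balanced accuracy. A more conventional AUC uses continuous scores, which the pipeline exposes through `decision_function`. The sketch below compares the two. The `random_state` value is an assumption added for reproducibility, so its numbers will not match the run above.

```python
from sklearn import datasets, metrics, model_selection, pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

data = datasets.load_breast_cancer()
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(
    data['data'], data['target'], train_size=0.7,
    random_state=0)  # seed is an assumption, not in the original run

model = pipeline.make_pipeline(StandardScaler(), SVC(kernel='linear'))
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)

# AUC from hard 0/1 predictions.
auc_labels = metrics.roc_auc_score(Y_test, Y_pred)
# AUC from the continuous decision scores.
auc_scores = metrics.roc_auc_score(Y_test, model.decision_function(X_test))
# Balanced accuracy, for comparison with the label-based AUC.
bal_acc = metrics.balanced_accuracy_score(Y_test, Y_pred)

print('AUC (hard labels):', auc_labels)
print('AUC (decision scores):', auc_scores)
print('Balanced accuracy:', bal_acc)
```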


## Non-Linearly Separable Data Example ##

As mentioned above, SVMs can be used on data that is not linearly separable by using the kernel trick. In this example, I'll compare outcomes under two possible kernels: polynomial and RBF.

### Load Data Set ###

Import the built-in Iris data set and divide into training and testing sets.

In [86]:
data = datasets.load_iris()
X = data['data']
Y = data['target']

X_train,X_test,Y_train,Y_test = model_selection.train_test_split(X, Y, train_size=0.7)

### Create Model: Polynomial Kernel ###

Create an SVM model using a polynomial kernel. The polynomial kernel requires a degree, which acts as a hyperparameter. I will therefore hold out a subset of the training data as a validation set for choosing the degree.

Polynomial kernel:
<center>$K(x, y) = (x^T y + c)^d$</center>

where $K$ is the kernel applied to a pair of data points $(x, y)$ (here $y$ denotes a second data point, not a label), $d$ is the degree, and $c$ is a constant offset.
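The polynomial kernel formula can be checked numerically against scikit-learn's `polynomial_kernel`, which computes $(\gamma \, x^T y + c)^d$ with `coef0` playing the role of $c$. Setting `gamma=1.0` recovers the formula as written above. The sample vectors and the values of `c` and `d` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

# Two arbitrary sample points, as row vectors.
x = np.array([[1.0, 2.0, 3.0]])
y = np.array([[0.5, -1.0, 2.0]])
c, d = 1.0, 3

# Kernel value computed directly from the formula (x^T y + c)^d.
manual = (x @ y.T + c) ** d
# Same value from scikit-learn, with gamma fixed to 1.
library = polynomial_kernel(x, y, degree=d, gamma=1.0, coef0=c)

print('Manual:', manual[0, 0])
print('scikit-learn:', library[0, 0])
```

Note that `SVC(kernel='poly')` defaults to `coef0=0.0` and `gamma='scale'`, so the kernel it uses differs from this formula unless those parameters are set explicitly.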

In [87]:
# make validation set
X_train,X_val,Y_train,Y_val = model_selection.train_test_split(X_train, Y_train, train_size=0.8)

# degrees to test
degrees = [2, 3, 4]

for d in degrees:
    model = pipeline.make_pipeline(StandardScaler(), SVC(kernel='poly', degree=d))
    model.fit(X_train, Y_train)
    # test accuracy on the hold out validation set
    acc = model.score(X_val, Y_val)
    print('Degree:', d, 'Accuracy:', acc)

Degree: 2 Accuracy: 0.8571428571428571
Degree: 3 Accuracy: 0.8571428571428571
Degree: 4 Accuracy: 0.7619047619047619


Note that the random assignment of points to the training and validation sets can play a role here. The above code block was therefore run several times, which resulted in three cases:

* all degrees had the same accuracy
* degrees 2 and 3 had the same accuracy, which was better than degree 4
* degree 3 had the best accuracy

Based on these results, I will use degree 3.
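Because a single validation split is noisy, one alternative is to average accuracy over several folds with k-fold cross-validation. The sketch below is one possible way to do that, not the procedure used in this notebook. The fold count and the `random_state` values are assumptions.

```python
from sklearn import datasets, model_selection, pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

data = datasets.load_iris()
X, Y = data['data'], data['target']

# Hold out a test set first so it plays no part in choosing the degree.
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(
    X, Y, train_size=0.7, random_state=0)  # seed is an assumption

# 5 stratified folds; the fold count and seed are assumptions.
cv = model_selection.StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

cv_scores = {}
for d in [2, 3, 4]:
    model = pipeline.make_pipeline(StandardScaler(), SVC(kernel='poly', degree=d))
    scores = model_selection.cross_val_score(model, X_train, Y_train, cv=cv)
    cv_scores[d] = scores.mean()
    print('Degree:', d, 'Mean CV accuracy:', cv_scores[d])
```

Averaging over folds reduces the dependence on any one random split, at the cost of fitting each candidate model several times.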

In [51]:
model = pipeline.make_pipeline(StandardScaler(), SVC(kernel='poly', degree=3))
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)

### Analyze Model: Polynomial Kernel ###

Analyze the results using the accuracy and the confusion matrix.

In [54]:
acc = model.score(X_test, Y_test)
print('Accuracy:', acc)
print()

conf_matrix = metrics.confusion_matrix(Y_test, Y_pred)
print('Confusion matrix:')
print(conf_matrix)


Accuracy: 0.8666666666666667

Confusion matrix:
[[16  0  0]
 [ 0 14  0]
 [ 0  6  9]]


The results show a decent accuracy.

However, the confusion matrix shows that the model performs well on the first class but less well on the second and third classes. In particular, it misclassifies some members of the third class as members of the second.

*Note:* in this data set, the first class is linearly separable from the other two, but the second and third are not linearly separable from each other. These errors are therefore expected.

| | predict 0 | predict 1 | predict 2|
|-------| ------- | ------- | -------|
| actual 0 | 16 | 0 | 0 |
| actual 1 | 0 | 14 | 0 |
| actual 2 | 0 | 6 | 9 |

Calculating precision, recall, and F1 score as:

* Precision: $\frac{TP}{TP + FP}$
* Recall: $\frac{TP}{TP + FN}$
* F1 score: $2 \times \frac{precision \times recall}{precision + recall}$

| Class | Precision | Recall | F1 score|
| ------- | ------- | -------- | ------- |
| 0 | 1.0 | 1.0 | 1.0 |
| 1 | 0.7 | 1.0 | 0.82 |
| 2 | 1.0 | 0.6 | 0.75 |
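The per-class values in the table can be reproduced from the confusion matrix with a few NumPy operations. This is a minimal sketch using the matrix printed above, where rows are actual classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix from the polynomial-kernel run (rows: actual, columns: predicted).
conf_matrix = np.array([[16, 0, 0],
                        [0, 14, 0],
                        [0, 6, 9]])

tp = np.diag(conf_matrix)                  # true positives per class
precision = tp / conf_matrix.sum(axis=0)   # column sums are TP + FP
recall = tp / conf_matrix.sum(axis=1)      # row sums are TP + FN
f1 = 2 * precision * recall / (precision + recall)

for k in range(3):
    print('Class', k, 'precision:', precision[k],
          'recall:', recall[k], 'F1:', f1[k])
```

The same numbers are available from `metrics.classification_report(Y_test, Y_pred)` when the label arrays are at hand.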

These results show that the perfect recall for class 1 comes at the expense of precision: every actual class 1 case is found, but 6 class 2 cases are also labeled as class 1. Conversely, class 2 has perfect precision but misses some positive cases (recall of 0.6). As noted above, the issue is distinguishing class 1 from class 2.

When compared to the Naive Bayes model, I find that the SVM performs the same for class 0, but worse for classes 1 and 2.

### Create Model: RBF Kernel ###

Create an SVM model with the RBF Kernel, defined as:

<center>$K(x, x') = \exp\left(-\frac{||x-x'||^2}{2\sigma^2}\right)$</center>

where $||x-x'||^2$ is the squared Euclidean distance between two data points and $\sigma$ is a free parameter that controls the width of the kernel.
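The RBF formula can be checked against scikit-learn's `rbf_kernel`, which is parameterized by `gamma` instead of $\sigma$. The two are related by $\gamma = \frac{1}{2\sigma^2}$. The sample points and the value of `sigma` below are hypothetical choices for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Two arbitrary sample points, as row vectors.
x = np.array([[1.0, 2.0]])
x_prime = np.array([[2.0, 0.0]])
sigma = 1.5  # hypothetical kernel width

# Kernel value computed directly from the formula.
sq_dist = np.sum((x - x_prime) ** 2)
manual = np.exp(-sq_dist / (2 * sigma ** 2))
# Same value from scikit-learn, using gamma = 1 / (2 * sigma^2).
library = rbf_kernel(x, x_prime, gamma=1.0 / (2 * sigma ** 2))

print('Manual:', manual)
print('scikit-learn:', library[0, 0])
```

A larger `gamma` (smaller $\sigma$) makes the kernel decay faster with distance, so each training point influences a smaller region.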

In [89]:
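# the validation set is no longer needed, so merge it back into the training set
import numpy as np

# np.concatenate returns a new array, so the result must be assigned
X_train = np.concatenate((X_train, X_val))
Y_train = np.concatenate((Y_train, Y_val))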

# construct model
model = pipeline.make_pipeline(StandardScaler(), SVC(kernel='rbf'))
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)

### Analyze Output: RBF Kernel ###

Calculate the accuracy of the model and print the confusion matrix.

In [90]:
acc = model.score(X_test, Y_test)
print('Accuracy:', acc)
print()

conf_matrix = metrics.confusion_matrix(Y_test, Y_pred)
print(conf_matrix)

Accuracy: 1.0

[[15  0  0]
 [ 0 18  0]
 [ 0  0 12]]


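Accuracy on this model is perfect, so the confusion matrix also implies perfect precision, recall, and F1 score for every class.

In this instance, the RBF kernel does better than the polynomial kernel at mapping the data into a space where the classes are linearly separable, which in turn makes classification possible. Note that the class totals in the two confusion matrices differ, so the two models were evaluated on different random test splits and the comparison is indicative rather than exact.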

### Create Linear Model ###

For reference, run a linear kernel SVM on this data set.

In [91]:
# construct model using the same data set as in the RBF example.
model = pipeline.make_pipeline(StandardScaler(), SVC(kernel='linear'))
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)

acc = model.score(X_test, Y_test)
print('Accuracy:', acc)
print()
conf_matrix = metrics.confusion_matrix(Y_test, Y_pred)
print(conf_matrix)

Accuracy: 0.9777777777777777

[[15  0  0]
 [ 0 18  0]
 [ 0  1 11]]


Confusion matrix:

| | predict 0 | predict 1 | predict 2|
| ------- | ------- | ------- | -------- |
| actual 0 | 15 | 0 | 0 |
| actual 1 | 0 | 18 | 0 |
| actual 2 | 0 | 1 | 11 |

Recall that class 0 can be linearly separated from the other two classes, and the results here show it. In fact, a simple linear model still distinguishes between classes 1 and 2 better than the polynomial kernel did. The RBF kernel still performs best, however.
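To put the three kernels on an equal footing, they can be scored on identical cross-validation folds. The sketch below is one way to do that and is not the procedure that produced the results above. The fold count and `random_state` are assumptions, and the polynomial kernel uses `SVC`'s default degree of 3.

```python
from sklearn import datasets, model_selection, pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

data = datasets.load_iris()
X, Y = data['data'], data['target']

# The same folds are reused for every kernel; fold count and seed are assumptions.
cv = model_selection.StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

kernel_scores = {}
for kernel in ['linear', 'poly', 'rbf']:
    model = pipeline.make_pipeline(StandardScaler(), SVC(kernel=kernel))
    scores = model_selection.cross_val_score(model, X, Y, cv=cv)
    kernel_scores[kernel] = scores.mean()
    print('Kernel:', kernel, 'Mean CV accuracy:', kernel_scores[kernel])
```

Because every kernel sees the same folds, differences in mean accuracy reflect the kernels and not the luck of a particular split.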