# Support Vector Machines

In this demo, we will use the SVM implementation in sklearn to classify 2D datasets and learn how to tune the hyperparameters of SVMs.
This notebook is adapted from http://nbviewer.jupyter.org/github/jdwittenauer/ipython-notebooks/blob/master/notebooks/ml/ML-Exercise6.ipynb

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.io import loadmat
from sklearn import svm  # needed by the svm.SVC calls later in the notebook
%matplotlib inline

### Simple Linear 2D Dataset
Load the dataset:

In [None]:
raw_data = loadmat('ex6data1.mat')
raw_data

We'll visualize it as a scatter plot where the class label is denoted by a symbol (x for positive, o for negative).

In [None]:
data = pd.DataFrame(raw_data['X'], columns=['X1', 'X2'])
data['y'] = raw_data['y']

positive = data[data['y'].isin([1])]
negative = data[data['y'].isin([0])]

fig, ax = plt.subplots(figsize=(12,8))
ax.scatter(positive['X1'], positive['X2'], s=50, marker='x', label='Positive')
ax.scatter(negative['X1'], negative['X2'], s=50, marker='o', label='Negative')
ax.legend()

Notice that there is one outlier positive example that sits apart from the others. The classes are still linearly separable, but it's a very tight fit. We're going to train a linear support vector machine, experimenting with various values of C, to learn the class boundary.

#### Recap: How to Choose C
###### Large C:
* Similar to hard margin SVM - goal is to misclassify few training points
* Often results in small margins
* Very sensitive to outliers
* Risk of overfitting

###### Small C:
* Maximizes margin at cost of misclassifying training data points
* Risk of underfitting
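
The trade-off in the recap can be checked numerically. The following is a minimal sketch on synthetic data (not the `ex6data1.mat` set): the cluster centers, the outlier position, and the two values of C are all illustrative assumptions. It fits a linear `SVC` for a small and a large C and reports the margin width, which for a linear SVM is 2 / ||w||.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic stand-in for the exercise data (an assumption, not ex6data1.mat):
# two clusters plus one positive outlier placed near the negative cluster.
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=[3.0, 3.0], scale=0.3, size=(20, 2))
X_neg = rng.normal(loc=[1.0, 1.0], scale=0.3, size=(20, 2))
outlier = np.array([[1.6, 1.6]])
X = np.vstack([X_pos, outlier, X_neg])
y = np.array([1] * 21 + [0] * 20)

margins = {}
for C in (0.01, 1000.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    # For a linear SVM the margin width is 2 / ||w||.
    margins[C] = 2.0 / np.linalg.norm(clf.coef_)
    print(f"C={C}: margin width={margins[C]:.3f}, "
          f"training accuracy={clf.score(X, y):.3f}")
```

The small-C model keeps a wide margin and tolerates the outlier, while the large-C model shrinks the margin to fit it.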

In [None]:
### YOUR CODE HERE - build svm

In [None]:
### YOUR CODE HERE - test svm

It appears that the classifier misclassified the outlier.  Let's see what happens with a larger value of C.

In [None]:
### YOUR CODE HERE - build svm
### YOUR CODE HERE - test svm

This time we got a perfect classification of the training data. However, by increasing the value of C we've created a decision boundary that is no longer a natural fit for the data.  We can visualize this by looking at the confidence level for each class prediction, which is a function of the point's distance from the hyperplane.
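
To make the link between confidence and distance concrete, here is a small sketch on synthetic data (the clusters and `C=1.0` are illustrative assumptions). For a linear kernel, `decision_function` returns w·x + b, and dividing by ||w|| gives the signed geometric distance to the hyperplane.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic two-cluster data, used only for illustration.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([3.0, 3.0], 0.4, size=(15, 2)),
               rng.normal([1.0, 1.0], 0.4, size=(15, 2))])
y = np.array([1] * 15 + [0] * 15)

clf = SVC(kernel='linear', C=1.0).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

scores = clf.decision_function(X)       # confidence values used for coloring
manual = X @ w + b                      # the same quantity computed by hand
distance = scores / np.linalg.norm(w)   # signed distance to the hyperplane

print(np.allclose(scores, manual))
print(distance[:3])
```

Positive scores correspond to the class listed second in `clf.classes_` (here, label 1), and larger magnitudes mean the point lies farther from the boundary.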

In [None]:
data['SVM 1 Confidence'] = svc.decision_function(data[['X1', 'X2']])

fig, ax = plt.subplots(figsize=(12,8))
ax.scatter(data['X1'], data['X2'], s=50, c=data['SVM 1 Confidence'], cmap='seismic')
ax.set_title('SVM (C=1) Decision Confidence')
### YOUR CODE HERE - plotting the decision boundary

In [None]:
data['SVM 2 Confidence'] = svc2.decision_function(data[['X1', 'X2']])

fig, ax = plt.subplots(figsize=(12,8))
ax.scatter(data['X1'], data['X2'], s=50, c=data['SVM 2 Confidence'], cmap='seismic')
ax.set_title('SVM (C=100) Decision Confidence')
### YOUR CODE HERE - plotting the decision boundary

The difference is a bit subtle, but when C=1 the decision boundary is a much steeper negative line than when C=100. In its attempt to classify the outlier correctly, the C=100 model overfits.
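
A slope comparison like this can also be read directly off a fitted linear model. The sketch below uses synthetic data and a hypothetical helper, `boundary_line`, neither of which is part of the original exercise. It rewrites the boundary w0*x1 + w1*x2 + b = 0 as a line in the (X1, X2) plane.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic two-cluster data, used only for illustration.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal([3.0, 3.5], 0.4, size=(15, 2)),
               rng.normal([1.0, 1.5], 0.4, size=(15, 2))])
y = np.array([1] * 15 + [0] * 15)

def boundary_line(clf):
    """Return (slope, intercept) of the line w0*x1 + w1*x2 + b = 0.

    Hypothetical helper; assumes a fitted linear-kernel SVC with w1 != 0.
    """
    w0, w1 = clf.coef_[0]
    b = clf.intercept_[0]
    return -w0 / w1, -b / w1

lines = {}
for C in (1, 100):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    lines[C] = boundary_line(clf)
    slope, intercept = lines[C]
    print(f"C={C}: x2 = {slope:.3f} * x1 + {intercept:.3f}")
```

The same slope and intercept can be passed to `ax.plot` to draw the boundary on top of a scatter plot.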

### Nonlinear Dataset

In [None]:
raw_data = loadmat('ex6data2.mat')

data = pd.DataFrame(raw_data['X'], columns=['X1', 'X2'])
data['y'] = raw_data['y']

positive = data[data['y'].isin([1])]
negative = data[data['y'].isin([0])]

fig, ax = plt.subplots(figsize=(12,8))
ax.scatter(positive['X1'], positive['X2'], s=30, marker='x', label='Positive')
ax.scatter(negative['X1'], negative['X2'], s=30, marker='o', label='Negative')
ax.legend()

For this data set we'll build a support vector machine classifier using the built-in RBF kernel and examine its accuracy on the training data.  To visualize the decision boundary, this time we'll shade the points based on the predicted probability that the instance has a negative class label.  We'll see from the result that it gets most of them right.
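
As a rough illustration of this workflow, the sketch below fits an RBF-kernel `SVC` on a synthetic nonlinear dataset. `make_moons` stands in for `ex6data2.mat`, and the `C` and `gamma` values are assumptions rather than the exercise's settings. It then extracts the predicted probability of the negative class.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Synthetic nonlinear data standing in for ex6data2.mat (an assumption).
X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

# probability=True enables predict_proba at the cost of extra fitting work.
clf = SVC(kernel='rbf', C=10, gamma=2.0, probability=True, random_state=0)
clf.fit(X, y)

proba = clf.predict_proba(X)
# Columns of predict_proba follow clf.classes_, so look up the column for
# the negative label (0) rather than assuming its position.
neg_col = list(clf.classes_).index(0)
p_negative = proba[:, neg_col]

print(f"training accuracy: {clf.score(X, y):.3f}")
```

Looking up the column through `classes_` keeps the code correct if the labels are ever encoded differently.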

In [None]:
### YOUR CODE HERE - build svm

In [None]:
### YOUR CODE HERE - test svm

In [None]:
data['Probability'] = svc.predict_proba(data[['X1', 'X2']])[:,0]

fig, ax = plt.subplots(figsize=(12,8))
ax.scatter(data['X1'], data['X2'], s=30, c=data['Probability'], cmap='Reds')

### Using sklearn GridSearchCV to find optimal hyperparameters
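
The cells in this section call a helper named `svc_param_selection(X, y, nfolds)`, whose body is left as an exercise. One possible shape for it is sketched below, assuming an RBF kernel and a small illustrative grid over `C` and `gamma`. The grid values and the synthetic `make_moons` data are assumptions, not part of the original exercise.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def svc_param_selection(X, y, nfolds):
    """Cross-validated grid search over C and gamma for an RBF SVC.

    The candidate values are illustrative assumptions.
    """
    param_grid = {'C': [0.1, 1, 10, 100],
                  'gamma': [0.1, 1, 10, 100]}
    search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=nfolds)
    search.fit(X, y)
    return search.best_params_

# Synthetic data used only to exercise the helper.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
best = svc_param_selection(X, y, 5)
print(best)
```

Because the search scores each candidate on held-out folds, the selected parameters reflect validation accuracy rather than accuracy on the training data alone.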

In [None]:
### YOUR CODE HERE - do cross val

In [None]:
svc_param_selection(data[['X1', 'X2']], data['y'], 5)

In [None]:
svc = svm.SVC(C=10, gamma=100, probability=True)
svc

In [None]:
svc.fit(data[['X1', 'X2']], data['y'])
svc.score(data[['X1', 'X2']], data['y'])

In [None]:
data['Probability'] = svc.predict_proba(data[['X1', 'X2']])[:,0]

fig, ax = plt.subplots(figsize=(12,8))
ax.scatter(data['X1'], data['X2'], s=30, c=data['Probability'], cmap='Reds')