## Introduction

Support Vector Machines (SVMs) are non-probabilistic binary linear classifiers. Because they rely on labeled data, they belong to the class of supervised learning models. SVMs can be used for both classification and regression; this exercise focuses on classification.

The documentation for SVMs in scikit-learn can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).

## Imports and Helper Functions

In [None]:
# enables inline plotting
%matplotlib inline
# alternative: inline plotting with interactivity (uncomment to use)
#%matplotlib notebook
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
# use seaborn plotting defaults
import seaborn as sns; sns.set()
from sklearn.datasets import make_blobs

In [None]:
def plot_svc_decision_function(model, ax=None, plot_support=True):
    """Plot the decision function for a 2D SVC"""
    if ax is None:
        ax = plt.gca()
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()
    
    # create grid to evaluate model
    x = np.linspace(xlim[0], xlim[1], 30)
    y = np.linspace(ylim[0], ylim[1], 30)
    Y, X = np.meshgrid(y, x)
    xy = np.vstack([X.ravel(), Y.ravel()]).T
    P = model.decision_function(xy).reshape(X.shape)
    
    # plot decision boundary and margins
    ax.contour(X, Y, P, colors='k',
               levels=[-1, 0, 1], alpha=0.5,
               linestyles=['--', '-', '--'])
    
    # plot support vectors
    if plot_support:
        ax.scatter(model.support_vectors_[:, 0],
                   model.support_vectors_[:, 1],
                   s=300, linewidth=1, facecolor='none', edgecolor='red')
    ax.set_xlim(xlim)
    ax.set_ylim(ylim)



## Linear Classification
#### Task 1: Constructing linear separators (5 Minutes)
Consider the following simple case of a classification task, in which the two classes of points are well separable.

**TODO:** Modify the coefficients of the linear equations so that three lines separate the data into two classes, creating three different classification models. How did you decide on the values of $m$ and $b$?

In [None]:
# create 50 samples with 2 centers and a standard deviation of 0.7
data, labels = make_blobs(n_samples=50, centers=2, random_state=0, cluster_std=0.70)
# plot the created data, with the labels as the color, the size set to 50 and the color map set to summer
plt.figure(figsize=(16, 9))
plt.scatter(data[:, 0], data[:, 1], c=labels, s=50, cmap='summer');

# create an equidistantly spaced array from the min to the max values of the data (x)
xfit = np.linspace(np.min(data[:, 0]), np.max(data[:, 0]))

#TODO: modify the coefficients of the linear equation so that the two classes are being classified correctly
for m, b in [(2, 1), (2, 2), (2, 3)]:
    yfit = m * xfit + b
    plt.plot(xfit, yfit, '-')

# set the axis limits to the min and max values of the data
plt.xlim(np.min(data[:, 0]), np.max(data[:, 0]));
plt.ylim(np.min(data[:, 1]), np.max(data[:, 1]));
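
One way to check a candidate pair $m$, $b$ without relying on the plot is to test on which side of the line each point falls. The following is a minimal sketch, assuming exactly two classes; the helper name `separates` and the toy points are made up for illustration and are not part of the exercise data.

```python
import numpy as np

def separates(points, labels, m, b):
    """Return True if the line y = m*x + b puts each class entirely on one side."""
    # signed vertical offset of every point from the line
    side = points[:, 1] - (m * points[:, 0] + b)
    above = side > 0  # points exactly on the line count as "not above"
    classes = np.unique(labels)
    first = above[labels == classes[0]]
    second = above[labels == classes[1]]
    # one class must lie fully above the line and the other fully below
    return bool((first.all() and not second.any()) or
                (second.all() and not first.any()))

# hypothetical toy data, not the blobs from the task
points = np.array([[0.0, 2.0], [1.0, 3.0], [0.0, 0.0], [1.0, 0.5]])
labels = np.array([1, 1, 0, 0])

print(separates(points, labels, m=1.0, b=1.0))  # line y = x + 1
print(separates(points, labels, m=1.0, b=3.0))  # line y = x + 3
```

In the notebook, the same helper could be called as `separates(data, labels, m, b)` for each line drawn in the cell above.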

#### Task 2: Maximizing Margins (5 Minutes)
While the lines in the previous task acted as perfect classifiers for the given data, the choice of $m$ and $b$ seemed arbitrary. Support vector machines offer a way to improve on this. The intuition is this: rather than simply drawing a zero-width line between the classes, we can draw a margin of some width around each line, up to the nearest point (maximum margin classifier).

**TODO:** Maximize the margin of the linear equation so that the two classes are still being classified correctly. What does a larger margin imply? Where are the support vectors?

In [None]:
# create 50 samples with 2 centers and a standard deviation of 0.7
data, labels = make_blobs(n_samples=50, centers=2, random_state=0, cluster_std=0.70)
# plot the created data, with the labels as the color, the size set to 50 and the color map set to summer
plt.figure(figsize=(16, 9))
plt.scatter(data[:, 0], data[:, 1], c=labels, s=50, cmap='summer');

# create an equidistantly spaced array from the min to the max values of the data (x)
xfit = np.linspace(np.min(data[:, 0]), np.max(data[:, 0]))

#TODO: maximize the margin of the linear equation so that the two classes are still being classified correctly
for m, b, margin in [(2, 1, 0.1), (2, 2, 0.1), (2, 3, 0.1)]:
    yfit = m * xfit + b
    plt.plot(xfit, yfit, '-')
    # draw a margin of the given (vertical) width around the line
    plt.fill_between(xfit, yfit - margin, yfit + margin, edgecolor='none', color='r', alpha=0.1)

# set the axis limits to the min and max values of the data
plt.xlim(np.min(data[:, 0]), np.max(data[:, 0]));
plt.ylim(np.min(data[:, 1]), np.max(data[:, 1]));
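
The margin of a line can also be computed instead of tuned by eye: it is the perpendicular distance from the line to the nearest point. The sketch below assumes the line is given as $y = m x + b$; the function name `perpendicular_margin` and the toy points are illustrative assumptions. Note that `fill_between` in the cell above shades a vertical band `yfit ± margin`, so the perpendicular distance has to be converted before it is passed as `margin`.

```python
import numpy as np

def perpendicular_margin(points, m, b):
    """Smallest perpendicular distance from any point to the line y = m*x + b."""
    # distance of (x, y) to the line m*x - y + b = 0 is |m*x - y + b| / sqrt(m^2 + 1)
    distances = np.abs(m * points[:, 0] - points[:, 1] + b) / np.sqrt(m**2 + 1)
    return distances.min()

# hypothetical toy data, not the blobs from the task
points = np.array([[0.0, 2.0], [1.0, 3.0], [0.0, 0.0], [1.0, 0.5]])
m, b = 1.0, 1.0

margin = perpendicular_margin(points, m, b)
# equivalent half-width of the vertical band drawn by fill_between
vertical_half_width = margin * np.sqrt(m**2 + 1)
print("perpendicular margin: {:.3f}".format(margin))
print("vertical half-width:  {:.3f}".format(vertical_half_width))
```

A larger value means the nearest point of either class is further away from the line, which is the quantity a maximum margin classifier tries to maximize.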

## A first SVM
Let's see the result of an actual support vector fit to this data: we will use Scikit-Learn's support vector classifier to train an SVM model on this data. For the time being, we will use a linear kernel and set the C parameter to 10, a comparatively large value (the meaning of both is discussed in more depth below).

**Problem**: Classification (i.e. split the two classes)  
**Solution**: Hyperplane with maximum margin  
**Reasoning**: The maximum margin tends to generalize best

In [None]:
# import train_test_split function
from sklearn.model_selection import train_test_split
# import svm model
from sklearn import svm

# create data that is harder to split
#data, labels = make_blobs(n_samples=50, centers=2, random_state=0, cluster_std=0.90)

# split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.33)

# create an svm classifier
model = svm.SVC(kernel='linear', C=10)
#model = svm.SVC(kernel='poly')
#model = svm.SVC(kernel='rbf')

# train the classifier using the training set
model.fit(X_train, y_train)

# predict the response for test dataset
y_pred = model.predict(X_test)

# plot the training data
plt.figure(figsize=(16, 9))
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, s=50, cmap='summer');
# plot the test data as triangles, marker='^'
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, s=50, marker='^', cmap='summer');
# plot the decision boundary
plot_svc_decision_function(model);
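
To compare the fitted model with the hand-picked lines from Task 1, the parameters of the separating line can be read from the trained classifier. The sketch below is an illustration under a few assumptions: it refits on the full blob data rather than on the training split, uses the separate name `clf` to avoid overwriting `model`, and relies on `coef_`, which scikit-learn only provides for the linear kernel.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_blobs

# same blob parameters as in the tasks above
data, labels = make_blobs(n_samples=50, centers=2, random_state=0, cluster_std=0.70)
clf = svm.SVC(kernel='linear', C=10).fit(data, labels)

w = clf.coef_[0]       # normal vector of the separating line
b = clf.intercept_[0]  # offset of the separating line

# the decision boundary is w[0]*x + w[1]*y + b = 0; rewrite it as y = slope*x + intercept
slope = -w[0] / w[1]
intercept = -b / w[1]

# distance between the two dashed margin lines (decision function = -1 and +1)
margin_width = 2 / np.linalg.norm(w)

print("slope = {:.2f}, intercept = {:.2f}".format(slope, intercept))
print("margin width = {:.2f}".format(margin_width))
print("number of support vectors:", len(clf.support_vectors_))
```

The support vectors are the points circled in red by `plot_svc_decision_function`; only these points determine `w` and `b`.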

In [None]:
# import scikit-learn metrics module for accuracy calculation
from sklearn import metrics

# model Accuracy: how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

# model Precision: what percentage of tuples labeled as positive are actually positive?
print("Precision:",metrics.precision_score(y_test, y_pred))

# model Recall: what percentage of positive tuples are labelled as such?
print("Recall:",metrics.recall_score(y_test, y_pred))

As an example, consider a binary classifier that labels images as either cat or non-cat.
It classifies a total of 10 images, including 7 cat images (+) and 3 non-cat images (-).  
Out of the 7 cat images 4 cat images are classified as cats (tp).  
Out of the remaining 3 images 1 image is classified as a cat (fp).  
Out of the 7 cat images 3 cat images are classified as non-cats (fn).  
Out of the remaining 3 images 2 images are classified as non-cat (tn).  

**Accuracy** is the number of correct results for all classes divided by the number of all results.  
$\text{accuracy} = \frac{tp+tn}{tp+tn+fp+fn} =\frac{4+2}{4+2+1+3} = 0.6$

**Precision** is the number of correct results of a single class divided by the number of all returned results of that class.  
$\text{precision} = \frac{\text{tp}}{\text{tp} + \text{fp}} = \frac{\text{4}}{\text{4} + \text{1}} = 0.8$

**Recall** is the number of correct results of a single class divided by the number of results that should have been returned of that class.  
$\text{recall} = \frac{\text{tp}}{\text{tp} + \text{fn}} = \frac{\text{4}}{\text{4} + \text{3}} \approx 0.57$
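
The numbers of the cat example can be reproduced with a few lines of plain Python. This is a small sketch in which the label lists `y_true` and `y_pred` are constructed by hand to match the counts given above (1 = cat, 0 = non-cat); the ordering of the images is an arbitrary assumption.

```python
# 7 cat images followed by 3 non-cat images
y_true = [1] * 7 + [0] * 3
# 4 cats found, 3 cats missed, 1 non-cat mistaken for a cat, 2 non-cats correct
y_pred = [1] * 4 + [0] * 3 + [1] * 1 + [0] * 2

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print("tp={}, fp={}, fn={}, tn={}".format(tp, fp, fn, tn))
print("accuracy  = {:.2f}".format(accuracy))   # 0.60
print("precision = {:.2f}".format(precision))  # 0.80
print("recall    = {:.2f}".format(recall))     # 0.57
```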

## Non-Linear SVM: Kernel SVM
While linearly separable data is nice to play with, it is rarely found in practice. SVMs become extremely powerful when they are combined with kernels: using what is called the kernel trick, they implicitly map their inputs into high-dimensional feature spaces and thereby perform a non-linear classification. To motivate the need for kernels, let's look at some data that is not linearly separable.

In [None]:
from sklearn.datasets import make_circles
data, labels = make_circles(100, factor=.1, noise=.1)

# create an svm classifier
model = svm.SVC(kernel='linear', C=10)
# train the classifier
model.fit(data, labels)

plt.figure(figsize=(16, 9))
plt.scatter(data[:, 0], data[:, 1], c=labels, s=50, cmap='summer')
# plot the decision boundary
plot_svc_decision_function(model, plot_support=False);

# predict on the circle data (X_test still belongs to the earlier blob dataset)
y_pred = model.predict(data)


#### Task 3: Transforming Data (5 Minutes)
It is clear that no linear discriminator will ever be able to separate this data. But we might be able to transform the data into a higher dimension in which a linear separator is sufficient.

**TODO:** Choose and apply a kernel to the data so that it becomes linearly separable in a higher dimension. Why did you choose this specific kernel?

In [None]:
#TODO: apply a kernel to the data so that it becomes separable in a higher dimension.
r = np.sum(1)

from mpl_toolkits import mplot3d

def plot_3D(data, labels, r, elev=10, azim=30):
    plt.figure(figsize=(16, 9))
    ax = plt.subplot(projection='3d')
    ax.scatter3D(data[:, 0], data[:, 1], r, c=labels, s=50, cmap='summer')
    ax.view_init(elev=elev, azim=azim)
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    ax.set_zlabel('r')
    return ax

def plot_3D_hyperplane(data,r,ax):
    x = np.linspace(np.min(data[:, 0]), np.max(data[:, 0]),2)
    y = np.linspace(np.min(data[:, 1]), np.max(data[:, 1]),2)
    X, Y = np.meshgrid(x, y)
    r = np.ones(X.shape)*r
    ax.plot_surface(X, Y, r, color='r', alpha=0.4);

# plot the data in 3d
ax = plot_3D(data,labels,r)
# draw a possible hyperplane for linear separation
plot_3D_hyperplane(data,np.average(r),ax)

Using an *RBF* kernel, the SVM implicitly maps the data into a higher-dimensional space in which it becomes linearly separable again. Projected back into 2D, the resulting hyperplane becomes a highly nonlinear decision boundary.

In [None]:
# create an svm classifier with an RBF kernel
model = svm.SVC(kernel='rbf', gamma='auto')
# train the classifier
model.fit(data, labels);

In [None]:
# plot the data in 2d
plt.figure(figsize=(16, 9))
plt.scatter(data[:, 0], data[:, 1], c=labels, s=50, cmap='summer')
# project the resulting hyperplane to 2D
plot_svc_decision_function(model)
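
The RBF kernel itself is a simple similarity measure between two points, $K(x, x') = \exp(-\gamma \lVert x - x' \rVert^2)$. The sketch below computes it by hand with NumPy for three made-up points; the function name `rbf_kernel_matrix` is an illustrative assumption, and `gamma` is set to `1 / n_features`, which is what `gamma='auto'` resolves to in scikit-learn.

```python
import numpy as np

def rbf_kernel_matrix(X, gamma):
    """K[i, j] = exp(-gamma * ||X[i] - X[j]||^2) for all pairs of rows in X."""
    # pairwise differences via broadcasting, shape (n, n, n_features)
    diff = X[:, None, :] - X[None, :, :]
    sq_dist = np.sum(diff**2, axis=-1)
    return np.exp(-gamma * sq_dist)

# hypothetical points chosen only for illustration
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
gamma = 1 / X.shape[1]  # 1 / n_features

K = rbf_kernel_matrix(X, gamma)
print(np.round(K, 3))
```

Each point has similarity 1 with itself, and the similarity decays towards 0 with growing distance. The SVM only ever needs these pairwise values, which is why the high-dimensional mapping never has to be computed explicitly.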

## Tuning the SVM: Softening Margins
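Once we have noise in our data, we need to tune our classifier to achieve the best compromise. To make this possible, each training point $x_i$ gets a slack variable $\xi_i \geq 0$ that relaxes the margin constraint: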

$$y_{i}\left(w^{T} x_{i}+b\right) \geq 1-\xi_{i}$$
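
For a fixed line, the smallest slack that satisfies this constraint is $\xi_i = \max(0, 1 - y_i(w^T x_i + b))$. The following sketch evaluates it for three made-up points; it assumes labels $y_i \in \{-1, +1\}$ (not the 0/1 labels returned by `make_blobs`), and `w`, `b` and the points are hypothetical values chosen for illustration.

```python
import numpy as np

def slack(X, y, w, b):
    """Slack xi_i = max(0, 1 - y_i * (w^T x_i + b)) for labels y_i in {-1, +1}."""
    return np.maximum(0.0, 1.0 - y * (X @ w + b))

# hypothetical separating line x = 0 with margin lines at x = -1 and x = +1
w = np.array([1.0, 0.0])
b = 0.0

X = np.array([[2.0, 0.0],    # correct side, outside the margin
              [0.5, 1.0],    # correct side, but inside the margin
              [-1.0, 0.0]])  # wrong side of the decision boundary
y = np.array([1, 1, 1])

xi = slack(X, y, w, b)
print(xi)  # [0.  0.5 2. ]
```

A slack of 0 means the point respects the margin, a value between 0 and 1 means it lies inside the margin but is still classified correctly, and a value above 1 means it is misclassified.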

In [None]:
# create 50 samples with 2 centers and a standard deviation of 2
data, labels = make_blobs(n_samples=50, centers=2, random_state=12, cluster_std=2)
# split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.33, random_state=4)
# plot the created data, with the labels as the color, the size set to 50 and the color map set to summer
plt.figure(figsize=(16, 9))

plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, s=50, cmap='summer');
# plot the test data as triangles, marker='^'
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, s=50, marker='^', cmap='summer');

To handle these cases, we can soften the margin of the SVM so that some of the points are allowed to creep into the margin if that allows a better fit.

The hyperparameter C controls the hardness of the margin: large values of C penalize margin violations heavily and push points out of the margin (harder margin, fewer training errors), while small values tolerate violations in exchange for a wider margin (softer margin).

In [None]:
# create a figure with a 1 by 2 grid of subplots
fig, ax = plt.subplots(1, 2, figsize=(16, 9))

# pair each axis with a C value (margin hardness)
for axi, C in zip(ax, [10, 0.3]):
    model = svm.SVC(kernel='linear', C=C)
    model.fit(X_train,y_train)
    # predict the response for test dataset
    y_pred = model.predict(X_test)
    # model Accuracy: how often is the classifier correct?
    print("C = {:.2f}, Accuracy = {:.2f}".format(C,metrics.accuracy_score(y_test, y_pred)))
    axi.scatter(X_train[:, 0], X_train[:, 1], c=y_train, s=50, cmap='summer');
    # plot the test data as triangles, marker='^'
    axi.scatter(X_test[:, 0], X_test[:, 1], c=y_test, s=50, marker='^', cmap='summer');
    plot_svc_decision_function(model, axi)
    axi.set_title('C = {:.1f}'.format(C), size=14)
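
The effect of C can also be read off the number of support vectors: a softer margin typically lets more points fall on or inside the margin, and every such point becomes a support vector. The sketch below is an illustration that, as an assumption for brevity, fits on the full noisy blob data instead of the training split and stores the counts in a dictionary named `counts`.

```python
from sklearn import svm
from sklearn.datasets import make_blobs

# same noisy blobs as above
data, labels = make_blobs(n_samples=50, centers=2, random_state=12, cluster_std=2)

counts = {}
for C in [10, 0.3]:
    clf = svm.SVC(kernel='linear', C=C).fit(data, labels)
    # n_support_ holds the number of support vectors per class
    counts[C] = int(clf.n_support_.sum())
    print("C = {:.1f}: {} support vectors".format(C, counts[C]))
```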

## Example: Face Recognition
As an example of support vector machines in action, let's take a look at the facial recognition problem. We will use the Labeled Faces in the Wild dataset, which consists of several thousand collated photos of various public figures. A fetcher for the dataset is built into Scikit-Learn:

In [None]:
from sklearn.datasets import fetch_lfw_people
faces = fetch_lfw_people(min_faces_per_person=60)

print('Number of different faces in the dataset: {}.\n'.format(len(faces.target_names)))

print('The following names are present:\n {}\n'.format(faces.target_names))

print('The size of the images is: {}.'.format(faces.images.shape))

In [None]:
# create a figure with a 3 by 3 grid of subplots
fig, ax = plt.subplots(3, 3, figsize=(8,6))

# plot 9 of the faces to get an idea what we are working with
for i, axi in enumerate(ax.flat):
    axi.imshow(faces.images[i], cmap='bone')
    axi.set(xticks=[], yticks=[], xlabel=faces.target_names[faces.target[i]])

Each image contains 62×47, or nearly 3,000, pixels. We could proceed by simply using each pixel value as a feature, but it is often more effective to use a preprocessor to extract more meaningful features; here we will use principal component analysis (PCA) to extract 150 fundamental components to feed into our support vector machine classifier. The most straightforward way to do this is to package the preprocessor and the classifier into a single pipeline:


In [None]:
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

# create the principal component analysis
pca = PCA(n_components=150, whiten=True, random_state=42)
# create the classifier
svc = SVC(kernel='rbf', class_weight='balanced')
# create the model combining both
model = make_pipeline(pca, svc)

For the sake of testing our classifier output, we will split the data into a training and testing set:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(faces.data, faces.target, test_size=0.33)

Finally, we can use a [grid search](https://en.wikipedia.org/wiki/Hyperparameter_optimization#Grid_search_2) with cross-validation to explore combinations of parameters. Here we will adjust C (which controls the margin hardness) and gamma (which controls the width of the radial basis function kernel; larger values give a narrower kernel), and determine the best model:

In [None]:
from sklearn.model_selection import GridSearchCV
# set the grid for the grid search
param_grid = {'svc__C': [1, 5, 10, 50],
              'svc__gamma': [0.0001, 0.0005, 0.001, 0.005]}

# create an exhaustive search over specified parameter values for a given estimator
grid = GridSearchCV(model, param_grid)

# time the search for the best hyperparameters
%time grid.fit(X_train, y_train);
print(grid.best_params_)
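
The keys `svc__C` and `svc__gamma` follow from how `make_pipeline` names its steps: each step is named after its lowercased class name, and a double underscore addresses a parameter of that step. The sketch below rebuilds the pipeline without fitting it and enumerates the grid with `ParameterGrid`; the variable names `pipeline`, `step_names` and `candidates` are chosen for illustration.

```python
from sklearn.decomposition import PCA
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

pipeline = make_pipeline(PCA(n_components=150, whiten=True, random_state=42),
                         SVC(kernel='rbf', class_weight='balanced'))

# step names generated by make_pipeline
step_names = list(pipeline.named_steps)
print(step_names)  # ['pca', 'svc']

param_grid = {'svc__C': [1, 5, 10, 50],
              'svc__gamma': [0.0001, 0.0005, 0.001, 0.005]}

# every combination that the exhaustive search will try
candidates = list(ParameterGrid(param_grid))
print(len(candidates), "parameter combinations")  # 16 parameter combinations
```

Every candidate is fitted once per cross-validation fold, so the search time grows with the product of the list lengths; with the 5-fold default of recent scikit-learn versions this would be 16 × 5 = 80 fits plus a final refit of the best model.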



In [None]:
# create the model with the best hyperparameters
model = grid.best_estimator_

# predict the response for test dataset
yfit = model.predict(X_test)

In [None]:
# plot the result for the test data
fig, ax = plt.subplots(4, 6, figsize=(16,9))
for i, axi in enumerate(ax.flat):
    axi.imshow(X_test[i].reshape(62, 47), cmap='bone')
    axi.set(xticks=[], yticks=[])
    axi.set_ylabel(faces.target_names[yfit[i]].split()[-1], color='black' if yfit[i] == y_test[i] else 'red')
fig.suptitle('Predicted Names; Incorrect Labels in Red', size=16);

In [None]:
# display some metrics
from sklearn.metrics import classification_report
print(classification_report(y_test, yfit, target_names=faces.target_names))

In [None]:
from sklearn.metrics import confusion_matrix
# create a confusion matrix to analyze the SVM's performance per class
plt.figure(figsize=(16, 9))
mat = confusion_matrix(y_test, yfit)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False, 
            xticklabels=faces.target_names,
            yticklabels=faces.target_names, annot_kws={"size": 14})
plt.xlabel('true label', fontsize=14)
plt.ylabel('predicted label', fontsize=14);
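
The per-class precision and recall shown in the classification report can be derived directly from a confusion matrix. The sketch below uses a made-up 3-class matrix (not results from the face dataset) in the orientation returned by `confusion_matrix`, where rows are true labels and columns are predicted labels; the heatmap above plots the transpose `mat.T`, so its axes are swapped.

```python
import numpy as np

# hypothetical confusion matrix: rows = true label, columns = predicted label
mat = np.array([[8, 1, 1],
                [2, 6, 2],
                [0, 1, 9]])

correct = np.diag(mat)                 # correctly classified samples per class
recall = correct / mat.sum(axis=1)     # divide by the number of true samples (row sums)
precision = correct / mat.sum(axis=0)  # divide by the number of predictions (column sums)
accuracy = correct.sum() / mat.sum()

print("recall:   ", np.round(recall, 2))
print("precision:", np.round(precision, 2))
print("accuracy:  {:.2f}".format(accuracy))
```

Large off-diagonal entries point to pairs of classes that the classifier tends to confuse with each other.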