# Support vector machines

Support vector machines (SVMs) are a particularly powerful and flexible class of supervised algorithms for both classification and regression. In this section, we will develop the intuition behind support vector machines and their use in classification problems.
The idea is to find a line or curve (in two dimensions) or a manifold (in higher dimensions) that divides the classes from each other.


In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# use seaborn plotting defaults
import seaborn as sns; sns.set()

In [None]:
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=50, centers=2,
                  random_state=0, cluster_std=0.60)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn');

A linear discriminative classifier would attempt to draw a straight line separating the two sets of data. However, there are many viable solutions to this problem.

TASK: manually find 3 lines that separate the points. For each line you draw, to which class will the red cross belong?

In [None]:
xfit = np.linspace(-1, 3.5)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
plt.plot([0.6], [2.1], 'x', color='red', markeredgewidth=2, markersize=10)

plt.plot(xfit, 1 * xfit + 0.65, '-k')

plt.xlim(-1, 3.5);
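
As a sanity check for the task above, the side of each candidate line on which the marked point falls can be computed directly. The sketch below is only illustrative: of the three (slope, intercept) pairs, only the first appears in the plotting cell, the other two are assumed examples, and `side_of_line` is a hypothetical helper.

```python
# Point marked with the red cross in the plot above
point = (0.6, 2.1)

# Candidate separating lines as (slope, intercept) pairs.
# Only the first pair comes from the plotting cell; the other two are
# assumed examples, so check visually that they separate the blobs.
candidate_lines = [(1.0, 0.65), (0.5, 1.6), (-0.2, 2.9)]


def side_of_line(point, slope, intercept):
    """Return 'above' or 'below' depending on where the point lies
    relative to the line y = slope * x + intercept."""
    x, y = point
    return 'above' if y > slope * x + intercept else 'below'


sides = [side_of_line(point, m, b) for m, b in candidate_lines]
for (m, b), side in zip(candidate_lines, sides):
    print('y = {} * x + {}: point lies {} the line'.format(m, b, side))
```

The point does not land on the same side of every line, so its predicted class depends on which separator is chosen.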

TASK: create an SVC and fit it to the dataset above.

In [None]:
from sklearn.svm import SVC # "Support vector classifier"
model = ...
model.fit(X, y)
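
One possible way to complete the task is sketched below; it is not the only valid answer. The linear kernel and the very large `C` (which approximates a hard margin) are assumptions, and the `_demo` names are hypothetical so that the notebook's own `X`, `y`, and `model` are left untouched.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Same data-generating call as in the cell above
X_demo, y_demo = make_blobs(n_samples=50, centers=2,
                            random_state=0, cluster_std=0.60)

# Assumed settings: linear kernel, very large C (close to a hard margin)
model_demo = SVC(kernel='linear', C=1e10)
model_demo.fit(X_demo, y_demo)

# Fraction of training points classified correctly
train_accuracy = model_demo.score(X_demo, y_demo)
print('training accuracy:', train_accuracy)
print('number of support vectors:', len(model_demo.support_vectors_))
```

A smaller `C` softens the margin and tolerates some misclassified points, which matters once the classes overlap.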

In [None]:
def plot_svc_decision_function(model, ax=None, plot_support=True):
    """Plot the decision function for a 2D SVC"""
    if ax is None:
        ax = plt.gca()
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()
    
    # create grid to evaluate model
    x = np.linspace(xlim[0], xlim[1], 30)
    y = np.linspace(ylim[0], ylim[1], 30)
    Y, X = np.meshgrid(y, x)
    xy = np.vstack([X.ravel(), Y.ravel()]).T
    P = model.decision_function(xy).reshape(X.shape)
    
    # plot decision boundary and margins
    ax.contour(X, Y, P, colors='k',
               levels=[-1, 0, 1], alpha=0.5,
               linestyles=['--', '-', '--'])
    
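    # plot support vectors as hollow circles; an explicit edge colour is
    # set because hollow markers can otherwise be drawn invisibly
    if plot_support:
        ax.scatter(model.support_vectors_[:, 0],
                   model.support_vectors_[:, 1],
                   s=300, linewidth=1, edgecolors='k', facecolors='none')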
    ax.set_xlim(xlim)
    ax.set_ylim(ylim)

In [None]:
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
plot_svc_decision_function(model);

In [None]:
model.support_vectors_
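
The contour levels -1, 0, and 1 used in `plot_svc_decision_function` correspond to the two margins and the decision boundary. The sketch below illustrates this, assuming a linear kernel with a very large `C` (the `_demo` names are hypothetical): the decision function evaluates to roughly ±1 on the support vectors, and the distance between the margins is 2 / ||w||.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X_demo, y_demo = make_blobs(n_samples=50, centers=2,
                            random_state=0, cluster_std=0.60)
# Assumed settings: linear kernel, very large C (close to a hard margin)
model_demo = SVC(kernel='linear', C=1e10).fit(X_demo, y_demo)

# Signed decision value of each support vector
sv_scores = model_demo.decision_function(model_demo.support_vectors_)
print('decision values on the support vectors:', np.round(sv_scores, 3))

# For a linear kernel the boundary is w . x + b = 0 and the margins are
# w . x + b = +/-1, so the distance between the margins is 2 / ||w||
w = model_demo.coef_[0]
margin_width = 2 / np.linalg.norm(w)
print('margin width:', margin_width)
```

Points that lie strictly outside the margins do not enter this calculation, which is why the fitted model depends only on its support vectors.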

How should we handle unbalanced classes? Below is an example of an unbalanced problem: there are 1000 blue samples and just 100 red samples. An SVM is very sensitive to this kind of imbalance. Here we will explore the SVM's behaviour in the unbalanced case.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm

# we create clusters with 1000 and 100 points
rng = np.random.RandomState(0)
n_samples_1 = 1000
n_samples_2 = 100
X = np.r_[1.5 * rng.randn(n_samples_1, 2),
          0.5 * rng.randn(n_samples_2, 2) + [2, 2]]
y = [0] * (n_samples_1) + [1] * (n_samples_2)

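scatter = plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired, edgecolors='k')
# build the legend from the class labels; a bare plt.legend() finds no labelled artists
plt.legend(*scatter.legend_elements(), title='class')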

In [None]:
# fit the model and get the separating hyperplane
clf = svm.SVC(kernel='linear', C=1.0)
clf.fit(X, y)

We see that the fitted line goes right through the red cluster, which leads to very poor classification performance.

TASK: print the F1 score, the precision score, and the recall score for the unbalanced classifier.

In [None]:
from sklearn.metrics import f1_score, recall_score, precision_score
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired, edgecolors='k')
plot_svc_decision_function(clf);

y_pred = clf.predict(X)
f1_u = ...
rec_u = ...
prec_u = ...

print('F1: {}, Recall: {}, Precision: {}'.format(f1_u, rec_u, prec_u))
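
The three metric functions all take the true labels first and the predicted labels second. The toy labels below are made up for illustration and are unrelated to the dataset above; the manual counts are included only to show what each score measures for the positive class (label 1, the default `pos_label`).

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up labels: 6 negatives and 4 positives
y_true_toy = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# Counts for the positive class (label 1)
pairs = list(zip(y_true_toy, y_pred_toy))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives

precision_manual = tp / (tp + fp)  # share of predicted positives that are correct
recall_manual = tp / (tp + fn)     # share of actual positives that are found
f1_manual = (2 * precision_manual * recall_manual
             / (precision_manual + recall_manual))

# The same quantities from scikit-learn
precision_toy = precision_score(y_true_toy, y_pred_toy)
recall_toy = recall_score(y_true_toy, y_pred_toy)
f1_toy = f1_score(y_true_toy, y_pred_toy)

print('manual : F1: {:.3f}, Recall: {:.3f}, Precision: {:.3f}'.format(
    f1_manual, recall_manual, precision_manual))
print('sklearn: F1: {:.3f}, Recall: {:.3f}, Precision: {:.3f}'.format(
    f1_toy, recall_toy, precision_toy))
```

Because all three scores focus on the positive class, they are more informative than plain accuracy when that class is rare.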

We use a technique called class balancing, which gives more weight to underrepresented classes.

TASK: class weights are passed as a dictionary that maps each class label to its weight. Select a weight that you think is suitable for the underrepresented class.

TASK: print the F1 score, the precision score, and the recall score for the balanced classifier. What differences do you detect in terms of precision and recall?

In [None]:
# fit the model and get the separating hyperplane using weighted classes
wclf = svm.SVC(kernel='linear', class_weight={0: 1.0, 1: ...})
wclf.fit(X, y)
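
One common heuristic for choosing class weights, offered here only as a reference point for the task and not as its intended answer, is to weight each class inversely to its frequency: n_samples / (n_classes * count). This is the rule scikit-learn applies when `class_weight='balanced'` is passed. The sketch below computes these weights for the class sizes used above; the variable names are illustrative.

```python
import numpy as np

# Same class sizes as in the unbalanced example above
y_demo = np.array([0] * 1000 + [1] * 100)

counts = np.bincount(y_demo)   # samples per class
n_classes = len(counts)

# Inverse-frequency weights: n_samples / (n_classes * count)
balanced_weights = len(y_demo) / (n_classes * counts)
weight_dict = {label: float(w) for label, w in enumerate(balanced_weights)}
print(weight_dict)

# How much more a minority sample counts than a majority sample
weight_ratio = balanced_weights[1] / balanced_weights[0]
print('minority / majority weight ratio:', weight_ratio)
```

The minority class ends up weighted about ten times more heavily than the majority class, mirroring the 10:1 imbalance. Other ratios are equally legitimate choices and shift the trade-off between precision and recall.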

In [None]:
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired, edgecolors='k')
plot_svc_decision_function(wclf);

y_wpred = wclf.predict(X)
f1_b = ...
rec_b = ...
prec_b = ...

print('F1: {}, Recall: {}, Precision: {}'.format(f1_b, rec_b, prec_b))