# Support Vector Machines (SVMs)

SVMs find hyperplanes that separate clusters of data points (samples) belonging to different classes in feature space.

The optimal hyperplane is the one that maximizes the margin, i.e., the distance to the closest data points of both clusters; the data points closest to that plane are the support vectors.
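As a minimal sketch of the maximum-margin idea, the following fits a linear SVM on six made-up 2D points (the points and the large `C` value are assumptions chosen for illustration, not part of the dataset used later) and reads off the support vectors and the margin width `2 / ||w||`.

```python
import numpy as np
from sklearn.svm import SVC

# Two small, linearly separable clusters in 2D (hypothetical points)
X = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
              [4.0, 4.0], [4.0, 5.0], [5.0, 4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A linear kernel with a very large C approximates a hard-margin SVM
clf = SVC(kernel='linear', C=1e6)
clf.fit(X, y)

w = clf.coef_[0]        # normal vector of the hyperplane w.x + b = 0
b = clf.intercept_[0]   # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)

print('support vectors:\n', clf.support_vectors_)
print('margin width:', margin_width)
```

Only the points closest to the separating line end up in `support_vectors_`; moving any of the other points (without crossing the margin) would leave the hyperplane unchanged.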

The method can be extended to non-linear cases (beyond hyperplanes) by adding dimensions to the data. For example, two concentric rings of samples can be represented in 3D by adding Z values on a paraboloid, so that one class has lower Z values than the other. Basically, mapping functions take the features of a sample and compute a new value in a new dimension; in that higher-dimensional space the clusters can be separated linearly, and the separating element is then projected back to the lower-dimensional space, where it is no longer linear.
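The concentric-rings example can be sketched with NumPy alone. The ring radii, the noise level, and the helper names `ring` and `lift` are assumptions made for this illustration; the mapping used is `z = x^2 + y^2`.

```python
import numpy as np

rng = np.random.default_rng(0)

def ring(radius, n):
    """Sample n 2D points close to a circle of the given radius."""
    angle = rng.uniform(0.0, 2.0 * np.pi, n)
    r = radius + rng.uniform(-0.1, 0.1, n)
    return np.column_stack([r * np.cos(angle), r * np.sin(angle)])

def lift(points):
    """Append z = x^2 + y^2 as a third coordinate (a paraboloid)."""
    z = (points ** 2).sum(axis=1)
    return np.column_stack([points, z])

inner = ring(1.0, 100)   # class 0, radius about 1
outer = ring(3.0, 100)   # class 1, radius about 3

inner_3d, outer_3d = lift(inner), lift(outer)

# In 3D, the horizontal plane z = 4 lies between the two classes
z_threshold = 4.0
separable = inner_3d[:, 2].max() < z_threshold < outer_3d[:, 2].min()
print(separable)  # True
```

Projected back to 2D, the plane `z = 4` becomes the circle `x^2 + y^2 = 4`, which is the non-linear boundary between the two rings.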

However, explicitly computing the mapping into a higher-dimensional space is expensive for large datasets; therefore, the preferred method is the **kernel trick**.

Landmarks are placed in the dataset and a kernel function is evaluated around each landmark; the typical kernel is the radial basis function (RBF), i.e., the Gaussian function N(mu,s2) centered on the landmark. The landmarks are chosen strategically, e.g., where blobs of one class appear inside a larger cluster of another class. The variance s2 of the kernel controls the width of the Gaussian, which acts as a local classifier between the interior blob and the surrounding cluster. This way, we do not add an explicit extra dimension, but work with kernels placed on landmarks. In addition, kernels let us solve complex concave cases. Typical kernels:
- Gaussian or radial basis function (RBF): N(mu,s2)
- Sigmoid and hyperbolic tangent (this one is directional, not symmetric)
- Polynomial kernels
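
As a rough illustration of the Gaussian kernel above, the sketch below evaluates `exp(-gamma * ||x - landmark||^2)` for a point at the landmark and a point far from it. The function name `rbf_kernel_value`, the landmark position, and the value of `sigma` are assumptions for this example; the relation `gamma = 1 / (2 * sigma^2)` links the variance s2 to the `gamma` parameter used later.

```python
import numpy as np

def rbf_kernel_value(x, landmark, gamma):
    """Gaussian similarity exp(-gamma * ||x - landmark||^2)."""
    diff = np.asarray(x, dtype=float) - np.asarray(landmark, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))

landmark = np.array([0.0, 0.0])   # hypothetical landmark position
sigma = 1.0                       # standard deviation of the Gaussian
gamma = 1.0 / (2.0 * sigma ** 2)  # equivalent gamma parameter

at_landmark = rbf_kernel_value([0.0, 0.0], landmark, gamma)
far_away = rbf_kernel_value([3.0, 4.0], landmark, gamma)

print(at_landmark)  # 1.0, maximum similarity at the landmark itself
print(far_away)     # close to 0 for points far from the landmark

# A wider Gaussian (larger sigma, smaller gamma) decays more slowly
wide = rbf_kernel_value([3.0, 4.0], landmark, gamma=1.0 / (2.0 * 5.0 ** 2))
print(wide > far_away)  # True
```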

The Gaussian or RBF kernel is the default in Scikit-Learn when using SVM.

Grid search is often necessary to select the optimum parameters for SVMs, mainly C and gamma; see below.

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [11]:
# We load the breast cancer dataset
from sklearn.datasets import load_breast_cancer

In [12]:
# We load it; the result is a dictionary-like object
cancer = load_breast_cancer()

In [13]:
# We extract the keys of the dictionary to see what's inside
cancer.keys()

dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])

In [10]:
# Description of the dataset
print(cancer['DESCR'])

Breast Cancer Wisconsin (Diagnostic) Database

Notes
-----
Data Set Characteristics:
    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry 
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 3 is Mean Radius, field
        13 is Radius SE, field 23 is Worst Radius.

        

In [14]:
# Data frame itself
df_feat = pd.DataFrame(cancer['data'],columns=cancer['feature_names'])

In [19]:
df_feat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 30 columns):
mean radius                569 non-null float64
mean texture               569 non-null float64
mean perimeter             569 non-null float64
mean area                  569 non-null float64
mean smoothness            569 non-null float64
mean compactness           569 non-null float64
mean concavity             569 non-null float64
mean concave points        569 non-null float64
mean symmetry              569 non-null float64
mean fractal dimension     569 non-null float64
radius error               569 non-null float64
texture error              569 non-null float64
perimeter error            569 non-null float64
area error                 569 non-null float64
smoothness error           569 non-null float64
compactness error          569 non-null float64
concavity error            569 non-null float64
concave points error       569 non-null float64
symmetry error             569 

In [20]:
df_feat.describe()

,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,...,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0
mean,14.127292,19.289649,91.969033,654.889104,0.09636,0.104341,0.088799,0.048919,0.181162,0.062798,...,16.26919,25.677223,107.261213,880.583128,0.132369,0.254265,0.272188,0.114606,0.290076,0.083946
std,3.524049,4.301036,24.298981,351.914129,0.014064,0.052813,0.07972,0.038803,0.027414,0.00706,...,4.833242,6.146258,33.602542,569.356993,0.022832,0.157336,0.208624,0.065732,0.061867,0.018061
min,6.981,9.71,43.79,143.5,0.05263,0.01938,0.0,0.0,0.106,0.04996,...,7.93,12.02,50.41,185.2,0.07117,0.02729,0.0,0.0,0.1565,0.05504
25%,11.7,16.17,75.17,420.3,0.08637,0.06492,0.02956,0.02031,0.1619,0.0577,...,13.01,21.08,84.11,515.3,0.1166,0.1472,0.1145,0.06493,0.2504,0.07146
50%,13.37,18.84,86.24,551.1,0.09587,0.09263,0.06154,0.0335,0.1792,0.06154,...,14.97,25.41,97.66,686.5,0.1313,0.2119,0.2267,0.09993,0.2822,0.08004
75%,15.78,21.8,104.1,782.7,0.1053,0.1304,0.1307,0.074,0.1957,0.06612,...,18.79,29.72,125.4,1084.0,0.146,0.3391,0.3829,0.1614,0.3179,0.09208
max,28.11,39.28,188.5,2501.0,0.1634,0.3454,0.4268,0.2012,0.304,0.09744,...,36.04,49.54,251.2,4254.0,0.2226,1.058,1.252,0.291,0.6638,0.2075


In [21]:
df_feat.head()

,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [22]:
# We have many variables/features
# Unless we are experts in the field, it is difficult to interpret them
# Therefore, instead of exploring them with plots, we focus on how to use SVM

In [23]:
from sklearn.model_selection import train_test_split

In [None]:
X = df_feat
y = cancer['target']
#y = pd.DataFrame(cancer['target'],columns=['malignant'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [37]:
# Load Support Vector Classification module SVC
from sklearn.svm import SVC

In [46]:
# Instantiate SVC
# IMPORTANT NOTE: several parameters usually need tuning; with the defaults, the model often performs poorly
# We keep the defaults for now to see how it fails, and then perform a grid search to tune the params
# The most important params are:
# - C: controls the cost of misclassification; if small -> less penalization, higher bias and lower variance
# - kernel: here we choose whether to use the kernel trick or not
#     rbf = Gaussian Radial Basis Function, the default, often a good first choice
#     linear = no kernel trick, just a linear boundary; often worse on non-linear data
#     there are some other possibilities, look in the docs!
# - gamma: free param of the RBF kernel; if small -> wide Gaussian (large variance s2): higher bias, lower variance
model = SVC()

In [32]:
# Train/Fit
model.fit(X_train,y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [41]:
# Predict/Infer
predictions = model.predict(X_test)

In [42]:
# Evaluate
from sklearn.metrics import classification_report, confusion_matrix

In [43]:
# We see how the classifier fails: it predicts class 1 for every sample
# Precision for class 0 is undefined (no sample is predicted as 0) and is reported as 0.00
print(classification_report(y_test,predictions))

             precision    recall  f1-score   support

          0       0.00      0.00      0.00        63
          1       0.63      1.00      0.77       108

avg / total       0.40      0.63      0.49       171



In [44]:
print(confusion_matrix(y_test,predictions))

[[  0  63]
 [  0 108]]
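
One plausible explanation for the failure is the value of `gamma` relative to the scale of the unscaled features. The sketch below is an assumption-laden check rather than a diagnosis: it rebuilds the same split and compares `gamma='auto'` (the value shown in the fit output above, equal to `1 / n_features`) with a much smaller, hand-picked `gamma`. Note that more recent scikit-learn releases use a different default (`gamma='scale'`), so `gamma='auto'` is set explicitly here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# gamma='auto' means 1 / n_features: narrow Gaussians for unscaled features
auto_model = SVC(kernel='rbf', C=1.0, gamma='auto').fit(X_train, y_train)

# A much smaller gamma (hand-picked for this sketch) gives wider Gaussians
small_model = SVC(kernel='rbf', C=1.0, gamma=0.0001).fit(X_train, y_train)

acc_auto = auto_model.score(X_test, y_test)
acc_small = small_model.score(X_test, y_test)
print('accuracy with gamma=auto  :', acc_auto)
print('accuracy with gamma=0.0001:', acc_small)
```

With narrow Gaussians, almost every training sample becomes its own isolated landmark and the model falls back to the majority class on unseen data; this is why the search over `gamma` matters so much here.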


In [45]:
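# We need to perform a grid search to get the params for SVM
# GridSearchCV lives in sklearn.model_selection (the old sklearn.grid_search module is deprecated)
from sklearn.model_selection import GridSearchCV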

In [47]:
# We set up a dictionary with the params to be tested
# These params are usually C & gamma; we pass testing values that vary in orders of magnitude
param_grid = {'C':[0.1,1,10,100,1000], 'gamma':[1,0.1,0.01,0.001,0.0001]}
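
As a small sketch of what this grid contains, the same values can be generated with `np.logspace` and enumerated with `ParameterGrid`. The fold count `n_folds = 3` is an assumption taken from the fitting log further below; other versions or settings may use a different number of folds.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same values as the hand-written lists, generated on a log scale
param_grid = {
    'C': np.logspace(-1, 3, 5).tolist(),      # 0.1 ... 1000
    'gamma': np.logspace(0, -4, 5).tolist(),  # 1 ... 0.0001
}

# Every combination of C and gamma is one candidate
candidates = list(ParameterGrid(param_grid))
n_folds = 3  # assumption: 3-fold cross-validation
total_fits = len(candidates) * n_folds

print(len(candidates), 'candidates,', total_fits, 'fits')
```

The number of fits grows with the product of the list lengths, so adding a third parameter or more values per parameter quickly increases the running time.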

In [48]:
# We create our grid search object with:
# - the estimator (here SVC; other estimator types work too)
# - the param_grid we created for our specific type of classifier
# - verbose, to see what's happening, because it takes long if many params are tested
grid = GridSearchCV(SVC(),param_grid,verbose=3)

In [50]:
# We fit the grid to our data set so that the best param set is found
# All combinations in the grid are tested and the best one is chosen
# We can later extract the best params, estimator (classifier), and score
grid.fit(X_train,y_train)

Fitting 3 folds for each of 25 candidates, totalling 75 fits
[CV] C=0.1, gamma=1 ..................................................
[CV] ......................... C=0.1, gamma=1, score=0.624060 -   0.0s
[CV] C=0.1, gamma=1 ..................................................
[CV] ......................... C=0.1, gamma=1, score=0.624060 -   0.0s
[CV] C=0.1, gamma=1 ..................................................
[CV] ......................... C=0.1, gamma=1, score=0.628788 -   0.0s
[CV] C=0.1, gamma=0.1 ................................................
[CV] ....................... C=0.1, gamma=0.1, score=0.624060 -   0.0s
[CV] C=0.1, gamma=0.1 ................................................
[CV] ....................... C=0.1, gamma=0.1, score=0.624060 -   0.0s
[CV] C=0.1, gamma=0.1 ................................................
[CV] ....................... C=0.1, gamma=0.1, score=0.628788 -   0.0s
[CV] C=0.1, gamma=0.01 ...............................................
[CV] ...........

[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.1s remaining:    0.0s


[CV] .................... C=0.1, gamma=0.0001, score=0.902256 -   0.0s
[CV] C=0.1, gamma=0.0001 .............................................
[CV] .................... C=0.1, gamma=0.0001, score=0.909774 -   0.0s
[CV] C=0.1, gamma=0.0001 .............................................
[CV] .................... C=0.1, gamma=0.0001, score=0.901515 -   0.0s
[CV] C=1, gamma=1 ....................................................
[CV] ........................... C=1, gamma=1, score=0.624060 -   0.0s
[CV] C=1, gamma=1 ....................................................
[CV] ........................... C=1, gamma=1, score=0.624060 -   0.0s
[CV] C=1, gamma=1 ....................................................
[CV] ........................... C=1, gamma=1, score=0.628788 -   0.0s
[CV] C=1, gamma=0.1 ..................................................
[CV] ......................... C=1, gamma=0.1, score=0.624060 -   0.0s
[CV] C=1, gamma=0.1 ..................................................
[CV] .

[Parallel(n_jobs=1)]: Done  75 out of  75 | elapsed:    0.8s finished


GridSearchCV(cv=None, error_score='raise',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'C': [0.1, 1, 10, 100, 1000], 'gamma': [1, 0.1, 0.01, 0.001, 0.0001]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=3)

In [51]:
# We can see the best params
grid.best_params_

{'C': 10, 'gamma': 0.0001}

In [55]:
# We can see the best estimator/classifier (SVC)
grid.best_estimator_

SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.0001, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [56]:
# We can see the score of the best param-estimator (best score)
grid.best_score_

0.9447236180904522
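
Beyond the single best score, the per-candidate scores can be inspected too. The following is a sketch assuming the `sklearn.model_selection` version of `GridSearchCV`, which exposes a `cv_results_` dictionary; the reduced grid and `cv=3` are choices made here to keep the example quick, so the numbers will not match the run above exactly.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Smaller grid than in the text, only to keep this sketch fast
param_grid = {'C': [1, 10], 'gamma': [0.001, 0.0001]}
grid = GridSearchCV(SVC(), param_grid, cv=3)
grid.fit(X_train, y_train)

# One row per candidate, with its mean cross-validation score and rank
results = pd.DataFrame(grid.cv_results_)
columns = ['param_C', 'param_gamma', 'mean_test_score', 'rank_test_score']
print(results[columns].sort_values('rank_test_score'))
```

Looking at the whole table helps to see whether the best candidate sits at the edge of the grid, in which case extending the grid in that direction may be worthwhile.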

In [57]:
# We can directly use grid to predict
grid_predictions = grid.predict(X_test)

In [60]:
# The results are much better now, with the optimal params
print(classification_report(y_test,grid_predictions))
print('\n')
print(confusion_matrix(y_test,grid_predictions))

             precision    recall  f1-score   support

          0       0.97      0.94      0.95        63
          1       0.96      0.98      0.97       108

avg / total       0.96      0.96      0.96       171



[[ 59   4]
 [  2 106]]
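
To see how the report relates to the confusion matrix, the metrics can be recomputed by hand. This sketch copies the matrix printed above (rows are true classes, columns are predicted classes) and uses NumPy only.

```python
import numpy as np

# Confusion matrix from the output above: rows = true, columns = predicted
cm = np.array([[59, 4],
               [2, 106]])

correct = np.diag(cm)                # correctly classified samples per class
precision = correct / cm.sum(axis=0)  # correct / all predicted as that class
recall = correct / cm.sum(axis=1)     # correct / all truly in that class
f1 = 2 * precision * recall / (precision + recall)
accuracy = correct.sum() / cm.sum()

print('precision:', np.round(precision, 2))  # [0.97 0.96]
print('recall   :', np.round(recall, 2))     # [0.94 0.98]
print('f1-score :', np.round(f1, 2))         # [0.95 0.97]
print('accuracy :', round(accuracy, 2))      # 0.96
```

The same computation on the first, failing matrix would divide by zero for the precision of class 0, because no sample was predicted as class 0.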
