# SVM Demonstration

In this tutorial we will demonstrate how to use the `SVC` class in `scikit-learn` to perform classification with a support vector machine (SVM) on a dataset.

The synthetic dataset we will use is the cancer dataset that is produced by the data_gen notebook. 

This is a simple dataset for predicting whether someone has cancer based on the total number of kilograms of tobacco they have smoked.

This dataset, therefore, has only one feature and a binary target variable (1 if they have cancer, 0 if they don't).

We will use the `SVC` class to fit a model to the data and then plot the decision boundary.

We will also use the `SVC` class to predict the probability of a person having cancer based on the number of kilograms of tobacco they have smoked.

We will use `GridSearchCV` to find the best hyperparameters for the model, testing rbf, linear and polynomial kernels. The scoring metric is a custom F-beta score with a beta of 2, which weights recall more heavily than precision.

The reason for this metric is that there is a difference between the cost of a false positive and a false negative in this case. A false negative is much more costly than a false positive, as it means someone with cancer is not being treated. However, we cannot fully ignore precision, as we don't want to treat people who don't have cancer.
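
To make the effect of a beta of 2 concrete, here is a minimal sketch of the F-beta formula in plain Python. The helper name `f_beta` and the two example models are hypothetical and are not part of scikit-learn or of this dataset; the sketch assumes precision and recall have already been computed.

```python
def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """F-beta score: weighted harmonic mean of precision and recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid dividing by zero when there are no true positives
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Two hypothetical models with swapped precision and recall (same F1)
high_recall = f_beta(precision=0.5, recall=1.0, beta=2)
high_precision = f_beta(precision=1.0, recall=0.5, beta=2)
print(round(high_recall, 3), round(high_precision, 3))
```

With beta set to 2, the model that favours recall scores noticeably higher, even though both models would have the same F1.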

## 1. Setup

Import modules

In [1]:
import pandas as pd
from sklearn.svm import SVC
from matplotlib import pyplot as plt
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score, precision_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

np.random.seed(1)

## 2. Load data

Load data (it's already cleaned and preprocessed)

In [2]:
# Uncomment the following snippet of code to debug problems with finding the .csv file path
# This snippet of code will print the current working directory.
#import os
#print(os.getcwd())

In [3]:
df = pd.read_csv('./data/cancer.csv') # let's use the same data as we did in the logistic regression example
df.head(3)

Unnamed: 0,kgs_smoked,cancer
0,-0.65956,0
1,5.78149,0
2,-8.247713,0


In [4]:
# Use sklearn to split df into a training set and a test set

X = df[['kgs_smoked']]
y = df['cancer']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
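
With a small dataset, a purely random split can leave the training and test sets with different class proportions. As a hedged aside, the sketch below shows the `stratify` argument of `train_test_split`, which keeps the class ratio the same in both parts. The arrays `X_demo` and `y_demo` are made-up stand-ins, not the cancer data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 20 rows, one feature, half of them positive
X_demo = np.arange(20, dtype=float).reshape(-1, 1)
y_demo = np.array([0] * 10 + [1] * 10)

# stratify=y_demo preserves the 50/50 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)
print(y_tr.mean(), y_te.mean())
```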


## 3. Model the data

First, let's create a dataframe to load the model performance metrics into.

In [5]:
performance = pd.DataFrame({"model": [], "Accuracy": [], "Precision": [], "Recall": [], "F1": [], "F2": [], "Parameters": []})

In [6]:
# create an F-beta scorer with beta=2
from sklearn.metrics import make_scorer
from sklearn.metrics import fbeta_score
f2_scorer = make_scorer(fbeta_score, beta=2)


### 3.1 Fit an SVM classification model using linear kernel

In [7]:
# defining parameter range 
param_grid = {'C': [0.01, 0.1, 0.5, 1, 5, 10, 50, 100],  
              'kernel': ['linear']}
  
#grid = GridSearchCV(SVC(), param_grid, scoring='f1', refit = True, verbose = 3, n_jobs=-1) 

grid = GridSearchCV(SVC(), param_grid, scoring=f2_scorer, refit = True, verbose = 3, n_jobs=-1) 
  
# fitting the model for grid search 
_ = grid.fit(X_train, y_train)

Fitting 5 folds for each of 8 candidates, totalling 40 fits


[CV 3/5] END ..............C=0.1, kernel=linear;, score=1.000 total time=   0.0s
[CV 2/5] END ..............C=0.1, kernel=linear;, score=0.556 total time=   0.0s
[CV 2/5] END ..............C=0.5, kernel=linear;, score=0.556 total time=   0.0s
[CV 5/5] END ..............C=0.1, kernel=linear;, score=0.000 total time=   0.0s
[CV 1/5] END ..............C=0.1, kernel=linear;, score=0.500 total time=   0.0s
[CV 5/5] END ..............C=0.5, kernel=linear;, score=0.000 total time=   0.0s
[CV 5/5] END .............C=0.01, kernel=linear;, score=0.000 total time=   0.0s
[CV 1/5] END ..............C=0.5, kernel=linear;, score=0.500 total time=   0.0s
[CV 4/5] END ..............C=0.5, kernel=linear;, score=1.000 total time=   0.0s
[CV 4/5] END ..............C=0.1, kernel=linear;, score=1.000 total time=   0.0s
[CV 2/5] END .............C=0.01, kernel=linear;, score=0.556 total time=   0.0s
[CV 3/5] END ...............C=50, kernel=linear;, score=1.000 total time=   0.0s
[CV 4/5] END ...............
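
The verbose log is hard to scan because the parallel workers report out of order. One alternative, sketched here under assumptions, is to read the fitted search object's `cv_results_` attribute into a dataframe. The data below is a made-up one-feature stand-in (`X_demo`, `y_demo`), and `demo_grid` is a hypothetical name used so it does not clash with `grid` in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Hypothetical stand-in data: one feature, two slightly overlapping classes
rng = np.random.default_rng(0)
X_demo = np.concatenate([rng.normal(-2, 1, 30), rng.normal(2, 1, 30)]).reshape(-1, 1)
y_demo = np.array([0] * 30 + [1] * 30)

f2_scorer = make_scorer(fbeta_score, beta=2)
demo_grid = GridSearchCV(
    SVC(), {"C": [0.01, 1, 100], "kernel": ["linear"]}, scoring=f2_scorer
)
demo_grid.fit(X_demo, y_demo)

# One row per candidate: mean and spread of the F2 score across the 5 folds
cv_table = pd.DataFrame(demo_grid.cv_results_)[
    ["param_C", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(cv_table)
```

A large `std_test_score` relative to `mean_test_score` is a sign that the folds are too small for the ranking of candidates to be reliable.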

In [8]:
# print best parameter after tuning 
print(grid.best_params_) 
  
# print how our model looks after hyper-parameter tuning 
print(grid.best_estimator_)

y_pred = grid.predict(X_test) 

recall = recall_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
f2 = fbeta_score(y_test, y_pred, beta=2)

performance = pd.concat([performance, pd.DataFrame({"model": ["SVM Linear"], "Accuracy": [accuracy], "Precision": [precision], "Recall": [recall], "F1": [f1], "F2": [f2], "Parameters": [grid.best_params_]})])


{'C': 0.01, 'kernel': 'linear'}
SVC(C=0.01, kernel='linear')


In [9]:
performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1,F2,Parameters
0,SVM Linear,0.625,1.0,0.5,0.666667,0.555556,"{'C': 0.01, 'kernel': 'linear'}"
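
For a linear kernel with a single feature, the decision boundary mentioned in the introduction reduces to a single threshold on that feature. The sketch below recovers it from the fitted model's `coef_` and `intercept_`. The toy data is hypothetical and chosen to be separable around zero; it is not the cancer data.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical one-feature data, separable around x = 0
X_demo = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]).reshape(-1, 1)
y_demo = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X_demo, y_demo)

# decision_function(x) = w * x + b; the boundary is where it equals zero
w = clf.coef_[0, 0]
b = clf.intercept_[0]
threshold = -b / w
print(threshold)
```

On the notebook's model, the same calculation could be applied to `grid.best_estimator_`, and `plt.axvline(threshold)` would draw the boundary on a plot of the feature.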


### 3.2 Fit an SVM classification model using rbf kernel

In [10]:
# defining parameter range 
param_grid = {'C': [0.1, 1, 10, 100],  
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001], 
              'kernel': ['rbf']}
  
grid = GridSearchCV(SVC(), param_grid, scoring=f2_scorer, refit = True, verbose = 3, n_jobs=-1) 
  
# fitting the model for grid search 
_ = grid.fit(X_train, y_train)

Fitting 5 folds for each of 20 candidates, totalling 100 fits
[CV 1/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.000 total time=   0.0s
[CV 2/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.000 total time=   0.0s
[CV 5/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.000 total time=   0.0s
[CV 4/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.000 total time=   0.0s
[CV 3/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.000 total time=   0.0s
[CV 3/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.000 total time=   0.0s
[CV 1/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.000 total time=   0.0s
[CV 1/5] END .....C=0.1, gamma=0.01, kernel=rbf;, score=0.000 total time=   0.0s
[CV 5/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.000 total time=   0.0s
[CV 2/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.000 total time=   0.0s
[CV 4/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=1.000 total time=   0.0s
[CV 3/5] END .....C=0.1, gamma=0.01, kernel=rbf

In [11]:
# print best parameter after tuning 
print(grid.best_params_) 
  
# print how our model looks after hyper-parameter tuning 
print(grid.best_estimator_)

y_pred = grid.predict(X_test) 

recall = recall_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
f2 = fbeta_score(y_test, y_pred, beta=2)

performance = pd.concat([performance, pd.DataFrame({"model": ["SVM rbf"], "Accuracy": [accuracy], "Precision": [precision], "Recall": [recall], "F1": [f1], "F2": [f2],"Parameters": [grid.best_params_]})])

{'C': 1, 'gamma': 0.01, 'kernel': 'rbf'}
SVC(C=1, gamma=0.01)


### 3.3 Fit an SVM classification model using polynomial kernel

In [12]:
# defining parameter range 
param_grid = {'C': [0.01, 0.1, 0.5, 1, 5, 10, 50, 100],  
              'coef0': [0.01, 0.1, 0.5, 1, 5, 10, 50, 100],
              'kernel': ['poly']}
  
grid = GridSearchCV(SVC(), param_grid, scoring=f2_scorer, refit = True, verbose = 3, n_jobs=-1) 
  
# fitting the model for grid search 
_ = grid.fit(X_train, y_train)

Fitting 5 folds for each of 64 candidates, totalling 320 fits
[CV 1/5] END ...C=0.01, coef0=0.01, kernel=poly;, score=0.556 total time=   0.0s
[CV 2/5] END ...C=0.01, coef0=0.01, kernel=poly;, score=0.556 total time=   0.0s
[CV 5/5] END ...C=0.01, coef0=0.01, kernel=poly;, score=0.000 total time=   0.0s
[CV 1/5] END ....C=0.01, coef0=0.1, kernel=poly;, score=0.556 total time=   0.0s
[CV 3/5] END ...C=0.01, coef0=0.01, kernel=poly;, score=1.000 total time=   0.0s
[CV 4/5] END ....C=0.01, coef0=0.1, kernel=poly;, score=0.000 total time=   0.0s
[CV 4/5] END ...C=0.01, coef0=0.01, kernel=poly;, score=0.000 total time=   0.0s
[CV 2/5] END ....C=0.01, coef0=0.1, kernel=poly;, score=0.556 total time=   0.0s
[CV 3/5] END ....C=0.01, coef0=0.1, kernel=poly;, score=1.000 total time=   0.0s
[CV 5/5] END ....C=0.01, coef0=0.1, kernel=poly;, score=0.000 total time=   0.0s
[CV 1/5] END ....C=0.01, coef0=0.5, kernel=poly;, score=0.556 total time=   0.0s
[CV 5/5] END ....C=0.01, coef0=0.5, kernel=poly

In [13]:
# print best parameter after tuning 
print(grid.best_params_) 
  
# print how our model looks after hyper-parameter tuning 
print(grid.best_estimator_)

y_pred = grid.predict(X_test) 

recall = recall_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
f2 = fbeta_score(y_test, y_pred, beta=2)

performance = pd.concat([performance, pd.DataFrame({"model": ["SVM Poly"], "Accuracy": [accuracy], "Precision": [precision], "Recall": [recall], "F1": [f1], "F2": [f2], "Parameters": [grid.best_params_]})])

{'C': 0.01, 'coef0': 5, 'kernel': 'poly'}
SVC(C=0.01, coef0=5, kernel='poly')
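
The evaluation cell is repeated almost verbatim for each kernel. One possible way to remove the repetition is a small helper that returns a one-row dataframe. The name `evaluate` is hypothetical, and the labels below are made up to give a precision of 1.0 and a recall of 0.5.

```python
import pandas as pd
from sklearn.metrics import (accuracy_score, f1_score, fbeta_score,
                             precision_score, recall_score)


def evaluate(name, y_true, y_pred, params):
    """Return a one-row DataFrame of test-set metrics for one model."""
    return pd.DataFrame({
        "model": [name],
        "Accuracy": [accuracy_score(y_true, y_pred)],
        "Precision": [precision_score(y_true, y_pred)],
        "Recall": [recall_score(y_true, y_pred)],
        "F1": [f1_score(y_true, y_pred)],
        "F2": [fbeta_score(y_true, y_pred, beta=2)],
        "Parameters": [params],
    })


# Hypothetical labels: 6 positives and 2 negatives, 3 of the positives found
y_true = [1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0]
row = evaluate("demo", y_true, y_pred, {"C": 0.01, "kernel": "linear"})
print(row)
```

In the notebook, each kernel's cell could then append `evaluate("SVM Linear", y_test, y_pred, grid.best_params_)` to `performance` with `pd.concat(..., ignore_index=True)`, which would also give every row a unique index.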


## 4. Summary

From our results, we can see that the linear and rbf kernel models perform the best, with identical test-set scores (F2 of 0.556). For the SVC model with a linear kernel, the best C value is 0.01. For the SVC model with an rbf kernel, the best C value is 1 and the best gamma value is 0.01. The polynomial kernel model performed worse than the other two (F2 of 0.2), and therefore we will not use it.

In [14]:
performance.sort_values(by="F2", ascending=False)

Unnamed: 0,model,Accuracy,Precision,Recall,F1,F2,Parameters
0,SVM Linear,0.625,1.0,0.5,0.666667,0.555556,"{'C': 0.01, 'kernel': 'linear'}"
0,SVM rbf,0.625,1.0,0.5,0.666667,0.555556,"{'C': 1, 'gamma': 0.01, 'kernel': 'rbf'}"
0,SVM Poly,0.375,1.0,0.166667,0.285714,0.2,"{'C': 0.01, 'coef0': 5, 'kernel': 'poly'}"
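
The introduction mentions predicting the probability of cancer. `SVC` does not expose probabilities by default; a minimal sketch, assuming the rbf parameters found above and made-up stand-in data, is to construct the model with `probability=True` and call `predict_proba`. The arrays `X_demo` and `y_demo` are hypothetical.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical stand-in data: one feature, two slightly overlapping classes
rng = np.random.default_rng(0)
X_demo = np.concatenate([rng.normal(-2, 1, 40), rng.normal(2, 1, 40)]).reshape(-1, 1)
y_demo = np.array([0] * 40 + [1] * 40)

# probability=True enables predict_proba via an internal calibration step
clf = SVC(kernel="rbf", C=1, gamma=0.01, probability=True, random_state=0)
clf.fit(X_demo, y_demo)

# Columns of proba follow the order of clf.classes_
proba = clf.predict_proba(np.array([[-3.0], [0.0], [3.0]]))
print(clf.classes_)
print(proba.round(3))
```

Because the probabilities come from a separate calibration step, they can occasionally disagree with the labels returned by `predict` for points near the boundary.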
