# Support Vector Machines

Support Vector Machines (SVMs) are supervised learning models with associated learning algorithms that analyze data and recognize patterns, used for classification and regression analysis. They were extremely popular around the time they were developed in the 1990s and continue to be a go-to method for a high-performing algorithm that needs little tuning.

Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary classifier.

An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. In two dimensions, a hyperplane is a line that splits the input variable space. In SVM, a hyperplane is selected to best separate the points in the input variable space by their class, either class 0 or class 1.

The distance between the line and the closest data points is referred to as the margin. The best or optimal line that can separate the two classes is the line that has the largest margin. This is called the Maximal-Margin hyperplane. The margin is calculated as the perpendicular distance from the line to only the closest points. Only these points are relevant in defining the line and in the construction of the classifier. These points are called the support vectors. They support or define the hyperplane. The hyperplane is learned from training data using an optimization procedure that maximizes the margin.
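The margin calculation described above can be sketched in a few lines. This is a minimal illustration only: the line `w`, `b` and the four toy points are made-up values, not taken from the dataset used later, and the line is given rather than learned.

```python
import math

# Hypothetical separating line w·x + b = 0 and toy 2-D points (illustrative values).
w = (1.0, 1.0)
b = -3.0
points = [(0.0, 1.0), (1.0, 1.0), (2.0, 2.0), (4.0, 4.0)]


def distance_to_line(point, w, b):
    """Perpendicular distance from a point to the line w·x + b = 0."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return abs(sum(wi * xi for wi, xi in zip(w, point)) + b) / norm_w


distances = [distance_to_line(p, w, b) for p in points]

# The margin is set by the closest points only; these act as the support vectors.
margin = min(distances)
closest = [p for p, d in zip(points, distances) if math.isclose(d, margin)]

print("distances:", [round(d, 3) for d in distances])
print("margin:", round(margin, 3))
print("closest points:", closest)
```

Moving either of the two far points (without bringing it closer than the margin) leaves the margin unchanged, which is why only the closest points matter.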

New examples are then mapped into the same space and predicted to belong to a category based on which side of the gap they fall on.

The SVM algorithm is implemented in practice using a kernel.

The learning of the hyperplane in linear SVM is done by transforming the problem using some linear algebra, which is out of the scope of this introduction to SVM.

A powerful insight is that the linear SVM can be rephrased using the inner product of any two given observations, rather than the observations themselves. The inner product between two vectors is the sum of the multiplication of each pair of input values.

For example, the inner product of the vectors $[2, 3]$ and $[5, 6]$ is $2 * 5 + 3 * 6$ or $28$.
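The same arithmetic can be checked in plain Python. The helper name `inner_product` is chosen here for illustration only.

```python
def inner_product(u, v):
    """Sum of the products of each pair of corresponding values."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


result = inner_product([2, 3], [5, 6])
print(result)  # 2*5 + 3*6 = 28
```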

A prediction for a new input is made using the dot product between the input ($x$) and each support vector ($x_i$):

$f(x) = B_0 + \sum_i a_i \langle x, x_i \rangle$

This is an equation that involves calculating the inner products of a new input vector (x) with all support vectors in training data. The coefficients $B_0$ and $a_i$ (for each input) must be estimated from the training data by the learning algorithm.
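As a rough sketch of how this prediction rule works, the snippet below evaluates $f(x)$ for hand-picked values. The support vectors, the coefficients $a_i$ and the intercept $B_0$ are hypothetical; in practice they are estimated by the learning algorithm. The sign of the score decides which side of the hyperplane a point falls on.

```python
def inner_product(u, v):
    return sum(a * b for a, b in zip(u, v))


def predict_score(x, support_vectors, coefficients, intercept):
    """f(x) = B0 + sum_i a_i * <x, x_i>."""
    return intercept + sum(
        a_i * inner_product(x, x_i)
        for a_i, x_i in zip(coefficients, support_vectors)
    )


def classify(x, support_vectors, coefficients, intercept):
    """Class 1 on the non-negative side of the hyperplane, class 0 otherwise."""
    return 1 if predict_score(x, support_vectors, coefficients, intercept) >= 0 else 0


# Hypothetical values, chosen by hand for illustration (not learned from data).
support_vectors = [(1.0, 1.0), (2.0, 2.0)]
coefficients = [-1.0, 1.0]  # a_i
intercept = -3.0            # B0

for x in [(0.0, 1.0), (4.0, 4.0)]:
    score = predict_score(x, support_vectors, coefficients, intercept)
    print(x, score, classify(x, support_vectors, coefficients, intercept))
```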

## Libraries

In [23]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.neural_network import MLPClassifier
from sklearnex import patch_sklearn 
patch_sklearn()
from sklearn.svm import SVC
from collections import Counter
from imblearn.over_sampling import SMOTE, ADASYN, SMOTENC
from imblearn.combine import SMOTETomek, SMOTEENN 
from numpy import where
import time

Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)


## Read the data from csv

In [2]:
df_train = pd.read_csv('../data/df_train.csv')
df_test = pd.read_csv('../data/df_test.csv')

X_train = df_train.drop('kill', axis=1)
y_train = df_train['kill']
X_test = df_test.drop(['kill'], axis=1)
y_test = df_test['kill']

X_train = X_train.values
y_train = y_train.values
X_test = X_test.values
y_test = y_test.values

## Scaling

In [3]:
scaler = StandardScaler()
#scaler = MinMaxScaler()

scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
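What the scaling step does can be sketched by hand for a single feature. This is a simplified stand-in for StandardScaler, not its implementation; the numbers are made up. The point it shows is that the mean and standard deviation come from the training data only and are then reused for the test data.

```python
import statistics


def fit_standardizer(column):
    """Learn the mean and (population) standard deviation from training values."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    if std == 0:
        std = 1.0  # avoid dividing by zero for a constant feature
    return mean, std


def standardize(column, mean, std):
    return [(value - mean) / std for value in column]


# Toy one-feature example (illustrative values).
train_feature = [1.0, 2.0, 3.0, 4.0, 5.0]
test_feature = [3.0, 6.0]

mean, std = fit_standardizer(train_feature)              # fit on training data only
train_scaled = standardize(train_feature, mean, std)
test_scaled = standardize(test_feature, mean, std)       # reuse training statistics

print([round(v, 3) for v in train_scaled])
print([round(v, 3) for v in test_scaled])
```

Fitting the scaler on the test set as well would leak information about the test data into the model.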

**Call the SVC() model from sklearn and fit the model to the training data.**

In [4]:
model = SVC()

Now it's time to train a Support Vector Machine Classifier.

In [5]:
model.fit(X_train,y_train)

SVC()

## Predictions and Evaluations

**Now get predictions from the model and create a confusion matrix and a classification report.**

In [6]:
predictions = model.predict(X_test)

In [7]:
from sklearn.metrics import classification_report,confusion_matrix

In [8]:
print(confusion_matrix(y_test,predictions))

[[20086   135]
 [ 2434   386]]


In [9]:
print(classification_report(y_test,predictions))

              precision    recall  f1-score   support

           0       0.89      0.99      0.94     20221
           1       0.74      0.14      0.23      2820

    accuracy                           0.89     23041
   macro avg       0.82      0.57      0.59     23041
weighted avg       0.87      0.89      0.85     23041



In [10]:
def fit_and_print(model, X_train, y_train):
    """Fit the model, then print test-set metrics.

    Uses the global X_test and y_test defined earlier in the notebook.
    """
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print("Confusion Matrix: \n", confusion_matrix(y_test, y_pred))
    print("Classification Report: \n", classification_report(y_test, y_pred))
    print("Accuracy: ", round(accuracy_score(y_test, y_pred),3))
    print("Precision:", round(precision_score(y_test, y_pred),3))
    print("Recall:", round(recall_score(y_test, y_pred),3))
    print("f1: ", round(f1_score(y_test, y_pred),3))

In [11]:
fit_and_print(model,X_train,y_train)

Confusion Matrix: 
 [[20086   135]
 [ 2434   386]]
Classification Report: 
               precision    recall  f1-score   support

           0       0.89      0.99      0.94     20221
           1       0.74      0.14      0.23      2820

    accuracy                           0.89     23041
   macro avg       0.82      0.57      0.59     23041
weighted avg       0.87      0.89      0.85     23041

Accuracy:  0.889
Precision: 0.741
Recall: 0.137
f1:  0.231
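The summary numbers above follow directly from the confusion matrix. A short check, assuming the scikit-learn layout in which rows are true classes and columns are predicted classes:

```python
# Confusion matrix reported above: rows = true class, columns = predicted class.
confusion = [[20086, 135],
             [2434, 386]]

tn, fp = confusion[0]
fn, tp = confusion[1]

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)   # of the predicted positives, how many are correct
recall = tp / (tp + fn)      # of the actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)

print("Accuracy: ", round(accuracy, 3))
print("Precision:", round(precision, 3))
print("Recall:   ", round(recall, 3))
print("f1:       ", round(f1, 3))
```

High accuracy together with low recall is typical when one class dominates: predicting the majority class most of the time is already right for most examples.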


Whoa! Notice that the model assigns almost everything to the majority class: it finds only 386 of the 2,820 examples of class 1 (recall 0.137). This suggests the model needs to have its parameters adjusted. The data is already standardized, so scaling is not the issue here.

We can search for parameters using a grid search!

Let's see if tuning the parameters gives better results.

## Grid Search

Finding the right parameters (like what C or gamma values to use) is a tricky task! But luckily, we can be a little lazy and just try a bunch of combinations and see what works best. This idea of creating a 'grid' of parameters and trying out all the possible combinations is called a grid search. The method is common enough that Scikit-learn has this functionality built in with GridSearchCV. The CV stands for cross-validation.

GridSearchCV takes a dictionary that describes the parameters that should be tried and a model to train. The grid of parameters is defined as a dictionary, where the keys are the parameters and the values are the settings to be tested. 

**Create a dictionary called param_grid and fill out some parameters for C and gamma.**

**Soft Margin Classifier**

In practice, real data is messy and cannot be separated perfectly with a hyperplane.

The constraint of maximizing the margin of the line that separates the classes must be relaxed. This is often called the soft margin classifier. This change allows some points in the training data to violate the separating line.

An additional set of coefficients is introduced that gives the margin wiggle room in each dimension. These coefficients are sometimes called slack variables. This increases the complexity of the model, as there are more parameters to fit to the data.

A tuning parameter, called simply $C$, defines the magnitude of the wiggle allowed across all dimensions, that is, the amount of margin violation permitted. With $C=0$ no violation is allowed and we are back to the inflexible Maximal-Margin Classifier described above. The larger the value of $C$, the more violations of the hyperplane are permitted. Note that the C argument of scikit-learn's SVC, used in the code below, works in the opposite direction: it is a penalty on violations, so a larger value permits fewer of them.

During the learning of the hyperplane from data, all training instances that lie within the distance of the margin will affect the placement of the hyperplane and are referred to as support vectors. And as $C$ affects the number of instances that are allowed to fall within the margin, $C$ influences the number of support vectors used by the model.

- The smaller the value of $C$ (the smaller the budget for violations), the more sensitive the algorithm is to the training data (higher variance and lower bias).
- The larger the value of $C$ (the larger the budget for violations), the less sensitive the algorithm is to the training data (lower variance and higher bias).
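The role of the slack variables can be sketched with the penalty form of the soft-margin objective, $\frac{1}{2}\lVert w \rVert^2 + C \sum_i \xi_i$ with $\xi_i = \max(0, 1 - y_i(w \cdot x_i + b))$. This is the form documented for scikit-learn's SVC; in it, $C$ multiplies the total slack, so a larger $C$ penalizes violations more heavily, which is the reverse of a violation budget. The line, the points and the labels in $\{-1, +1\}$ below are illustrative assumptions.

```python
def slack(w, b, x, y):
    """Slack for one point with label y in {-1, +1}: max(0, 1 - y * (w·x + b))."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(0.0, 1.0 - y * score)


def soft_margin_objective(w, b, X, y, C):
    """0.5 * ||w||^2 + C * (sum of slacks): the penalty form of the soft margin."""
    regularizer = 0.5 * sum(wi * wi for wi in w)
    total_slack = sum(slack(w, b, xi, yi) for xi, yi in zip(X, y))
    return regularizer + C * total_slack


# Hypothetical line and toy data; the last point sits on the wrong side.
w, b = (1.0, 1.0), -3.0
X = [(0.0, 1.0), (1.0, 1.0), (2.0, 2.0), (4.0, 4.0), (1.5, 1.0)]
y = [-1, -1, 1, 1, 1]

slacks = [slack(w, b, xi, yi) for xi, yi in zip(X, y)]
print("slacks:", slacks)  # only the violating point has non-zero slack

for C in (0.1, 10):
    print("C =", C, "objective =", round(soft_margin_objective(w, b, X, y, C), 3))
```

With a small $C$ the single violation costs little, so the optimizer can tolerate it; with a large $C$ the same violation dominates the objective.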

Finally, we can also have a more complex radial kernel. For example:

\begin{equation} K(x,x_{i}) = e^{-\gamma \sum (x - x_{i})^2} \end{equation}

Where $\gamma$ (gamma) is a parameter that must be specified to the learning algorithm. A good default value for gamma is 0.1, where gamma is often $0 < \gamma < 1$. The radial kernel is very local and can create complex regions within the feature space, like closed polygons in two-dimensional space.
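A small sketch of the radial kernel, using the vectors from the inner product example earlier. The function name `rbf_kernel` is chosen here for illustration.

```python
import math


def rbf_kernel(x, x_i, gamma=0.1):
    """K(x, x_i) = exp(-gamma * sum((x - x_i)^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, x_i))
    return math.exp(-gamma * squared_distance)


x, x_i = [2, 3], [5, 6]

same_point = rbf_kernel(x, x)  # identical points have similarity 1
similarities = {gamma: rbf_kernel(x, x_i, gamma) for gamma in (0.01, 0.1, 1)}

print(same_point)
for gamma, value in similarities.items():
    print("gamma =", gamma, "K =", round(value, 6))
```

The similarity falls off faster as gamma grows, which is what makes the kernel "local": with a large gamma only very close points influence each other.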

In [12]:
param_grid = {'C': [0.1,1, 10, 100, 1000], 'gamma': [1,0.1,0.01,0.001,0.0001], 'kernel': ['rbf']} 
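To see how many models this grid implies, the combinations can be enumerated with the standard library. This only illustrates the counting; it is not how GridSearchCV is implemented, and the 5 folds are an assumption matching a `cv=5` setting.

```python
from itertools import product

param_grid = {'C': [0.1, 1, 10, 100, 1000],
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'kernel': ['rbf']}

# Every combination of one value per parameter is one candidate model.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_folds = 5  # assumed number of cross-validation folds
total_fits = len(candidates) * n_folds

print(candidates[0])
print(len(candidates), "candidates,", total_fits, "fits")
```

Each extra value in the grid multiplies the work, which is worth keeping in mind for SVMs on large training sets.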

**Import GridsearchCV from SciKit Learn.**

In [13]:
from sklearn.model_selection import GridSearchCV

One of the great things about GridSearchCV is that it is a meta-estimator. It takes an estimator like SVC and creates a new estimator that behaves exactly the same, in this case like a classifier. You can set refit=True (the default) and choose verbose to be whatever number you want: the higher the number, the more verbose the output (verbose just means the text output describing the process).

In [14]:
grid = GridSearchCV(estimator=SVC(), param_grid=param_grid, verbose=2, cv = 5, n_jobs = -1)

What fit does is a bit more involved than usual. First, it runs the same loop with cross-validation to find the best parameter combination. Once it has the best combination, it runs fit again on all data passed to fit (without cross-validation), to build a single new model using the best parameter setting.
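The two stages described above, a cross-validated search followed by a refit on all the data, can be sketched in plain Python. Everything here is a simplified stand-in: `ThresholdClassifier` is a made-up toy estimator with a hypothetical `offset` parameter, the folds are contiguous rather than stratified, and the data is invented.

```python
from itertools import product


def k_fold_indices(n_samples, n_folds):
    """Yield (train, validation) index lists for contiguous folds."""
    sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
             for i in range(n_folds)]
    start = 0
    for size in sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, val
        start += size


class ThresholdClassifier:
    """Toy estimator: predicts 1 when x exceeds the training mean plus an offset."""

    def __init__(self, offset=0.0):
        self.offset = offset

    def fit(self, X, y):
        self.threshold_ = sum(X) / len(X) + self.offset
        return self

    def predict(self, X):
        return [1 if x > self.threshold_ else 0 for x in X]


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def grid_search(X, y, param_grid, n_folds=3):
    names = list(param_grid)
    best_params, best_score = None, -1.0
    # Stage 1: score every parameter combination with cross-validation.
    for values in product(*param_grid.values()):
        params = dict(zip(names, values))
        scores = []
        for train, val in k_fold_indices(len(X), n_folds):
            model = ThresholdClassifier(**params)
            model.fit([X[i] for i in train], [y[i] for i in train])
            scores.append(accuracy([y[i] for i in val],
                                   model.predict([X[i] for i in val])))
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    # Stage 2: refit a single model on all the data with the best setting.
    best_model = ThresholdClassifier(**best_params).fit(X, y)
    return best_params, best_score, best_model


X = [1.0, 9.0, 2.0, 8.0, 3.0, 7.0, 4.0, 6.0, 5.0, 10.0]
y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
param_grid = {"offset": [-2.0, 0.0, 2.0]}

best_params, best_score, best_model = grid_search(X, y, param_grid)
print(best_params, round(best_score, 3))
```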

**Now take that grid model and create some predictions using the test set and create classification reports and confusion matrices for them. Were we able to improve?**

In [15]:
# May take a while!
grid.fit(X_train,y_train)

Fitting 5 folds for each of 25 candidates, totalling 125 fits
[CV] END ...........................C=1, gamma=1, kernel=rbf; total time= 5.5min
[CV] END ......................C=100, gamma=0.01, kernel=rbf; total time= 8.1min
[CV] END ......................C=0.1, gamma=0.01, kernel=rbf; total time= 2.3min
[CV] END ........................C=10, gamma=0.1, kernel=rbf; total time= 9.4min
[CV] END ...................C=1000, gamma=0.0001, kernel=rbf; total time= 1.9min
[CV] END .........................C=0.1, gamma=1, kernel=rbf; total time= 2.9min
[CV] END ......................C=10, gamma=0.001, kernel=rbf; total time= 1.8min
[CV] END ......................C=100, gamma=0.01, kernel=rbf; total time= 9.5min
[CV] END ....................C=0.1, gamma=0.0001, kernel=rbf; total time=  22.5s
[CV] END .......................C=1, gamma=0.001, kernel=rbf; total time= 2.4min
[CV] END ........................C=10, gamma=0.1, kernel=rbf; total time= 9.5min
[CV] END ...................C=1000, gamma=0.000

GridSearchCV(cv=5, estimator=SVC(), n_jobs=-1,
             param_grid={'C': [0.1, 1, 10, 100, 1000],
                         'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
                         'kernel': ['rbf']},
             verbose=2)

You can inspect the best parameters found by GridSearchCV in the best_params_ attribute, and the best estimator in the best_estimator_ attribute:

In [16]:
grid.best_params_

{'C': 10, 'gamma': 0.1, 'kernel': 'rbf'}

In [17]:
best_grid = grid.best_estimator_
best_grid

SVC(C=10, gamma=0.1)

Then you can re-run predictions on this grid object just like you would with a normal model.

In [18]:
grid_predictions = grid.predict(X_test)

In [19]:
print(confusion_matrix(y_test,grid_predictions))

[[20014   207]
 [ 2298   522]]


In [20]:
print(classification_report(y_test,grid_predictions))

              precision    recall  f1-score   support

           0       0.90      0.99      0.94     20221
           1       0.72      0.19      0.29      2820

    accuracy                           0.89     23041
   macro avg       0.81      0.59      0.62     23041
weighted avg       0.87      0.89      0.86     23041



In [21]:
fit_and_print(best_grid, X_train, y_train)

Confusion Matrix: 
 [[20014   207]
 [ 2298   522]]
Classification Report: 
               precision    recall  f1-score   support

           0       0.90      0.99      0.94     20221
           1       0.72      0.19      0.29      2820

    accuracy                           0.89     23041
   macro avg       0.81      0.59      0.62     23041
weighted avg       0.87      0.89      0.86     23041

Accuracy:  0.891
Precision: 0.716
Recall: 0.185
f1:  0.294


In [24]:
def calculate_pred_and_inf_time(best_grid, X_test):
    # get the start time
    st_wall_inf = time.time()

    # Generate generalization metrics
    grid_predictions = best_grid.predict(X_test)

    # get the end time
    et_wall_inf = time.time()

    # get execution time
    wall_time_inf = et_wall_inf - st_wall_inf
    print(f'Inference Time: {1000*wall_time_inf:.3f} miliseconds')

calculate_pred_and_inf_time(best_grid, X_test)

Inference Time: 266.469 miliseconds


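The tuned model does only slightly better than the default one. Accuracy is essentially unchanged (0.889 versus 0.891), while recall for class 1 rises from 0.137 to 0.185 and F1 from 0.231 to 0.294. The model still misses most of the positive class, which is unsurprising given how imbalanced the data is (20,221 negatives against 2,820 positives in the test set).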

### Further Reading

Support Vector Machines are a huge area of study, with numerous books and papers on the topic. Here are some of the seminal and most useful resources if you are looking to dive deeper into the background and theory of the technique.

- Check Chapter 9 of **An Introduction to Statistical Learning** by Gareth James, et al.: http://faculty.marshall.usc.edu/gareth-james/

There are countless tutorials and journal articles on SVM. Below is a link to a seminal paper on SVM by Cortes and Vapnik and another to an excellent introductory tutorial.

- Support-Vector Networks by Cortes and Vapnik, 1995: https://link.springer.com/article/10.1007/BF00994018
- A Tutorial on Support Vector Machines for Pattern Recognition, 1998: https://www.di.ens.fr/~mallat/papiers/svmtutorial.pdf

Finally, there are a lot of posts on Q&A sites asking for simple explanations of SVM. Below are two picks that you might find useful.

- What does support vector machine (SVM) mean in layman’s terms?: https://www.quora.com/What-does-support-vector-machine-SVM-mean-in-laymans-terms
- Please explain Support Vector Machines (SVM) like I am a 5 year old: https://www.reddit.com/r/MachineLearning/comments/15zrpp/please_explain_support_vector_machines_svm_like_i/

## Resampling

### SMOTE

In [25]:
# Oversample the imbalanced training set with SMOTE

# summarize class distribution
counter = Counter(y_train)
print(counter)
# transform the dataset
oversample = SMOTE(random_state=42)
X_train_rel, y_train_rel = oversample.fit_resample(X_train, y_train)
# summarize the new class distribution
counter = Counter(y_train_rel)
print(counter)

fit_and_print(best_grid, X_train_rel, y_train_rel)

calculate_pred_and_inf_time(best_grid, X_test)

Counter({0: 114988, 1: 15577})
Counter({0: 114988, 1: 114988})
Confusion Matrix: 
 [[15591  4630]
 [  764  2056]]
Classification Report: 
               precision    recall  f1-score   support

           0       0.95      0.77      0.85     20221
           1       0.31      0.73      0.43      2820

    accuracy                           0.77     23041
   macro avg       0.63      0.75      0.64     23041
weighted avg       0.87      0.77      0.80     23041

Accuracy:  0.766
Precision: 0.308
Recall: 0.729
f1:  0.433
Inference Time: 654.650 miliseconds
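The idea behind SMOTE is to create new minority-class points by interpolating between an existing minority point and one of its nearest minority neighbors. The sketch below shows only that interpolation step on made-up 2-D points; it is a simplified illustration, not imblearn's implementation, and the function name and parameters are hypothetical.

```python
import random


def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def smote_like_samples(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points on segments between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base_idx = rng.randrange(len(minority))
        base = minority[base_idx]
        # k nearest minority neighbors of the chosen base point
        others = [p for j, p in enumerate(minority) if j != base_idx]
        neighbors = sorted(others, key=lambda p: squared_distance(base, p))[:k]
        neighbor = rng.choice(neighbors)
        u = rng.random()  # position along the segment from base to neighbor
        synthetic.append(tuple(b + u * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Toy minority-class points (illustrative values).
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (3.0, 3.0)]
new_points = smote_like_samples(minority, n_new=5)

for point in new_points:
    print(tuple(round(v, 3) for v in point))
```

Because every synthetic point lies between two real minority points, the new examples fill in the region the minority class already occupies rather than duplicating rows.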


### ADASYN

In [26]:
# Oversample the imbalanced training set with ADASYN

# summarize class distribution
counter = Counter(y_train)
print(counter)
# transform the dataset
oversample = ADASYN(random_state=42)
X_train_rel, y_train_rel = oversample.fit_resample(X_train, y_train)
# summarize the new class distribution
counter = Counter(y_train_rel)
print(counter)

fit_and_print(best_grid, X_train_rel, y_train_rel)

calculate_pred_and_inf_time(best_grid, X_test)

Counter({0: 114988, 1: 15577})
Counter({1: 119427, 0: 114988})
Confusion Matrix: 
 [[14423  5798]
 [  605  2215]]
Classification Report: 
               precision    recall  f1-score   support

           0       0.96      0.71      0.82     20221
           1       0.28      0.79      0.41      2820

    accuracy                           0.72     23041
   macro avg       0.62      0.75      0.61     23041
weighted avg       0.88      0.72      0.77     23041

Accuracy:  0.722
Precision: 0.276
Recall: 0.785
f1:  0.409
Inference Time: 752.793 miliseconds


### SMOTE and Tomek Links (TL)

In [27]:
# Resample the imbalanced training set with SMOTE followed by Tomek links

# summarize class distribution
counter = Counter(y_train)
print(counter)
# transform the dataset
oversample = SMOTETomek(random_state=42)
X_train_rel, y_train_rel = oversample.fit_resample(X_train, y_train)
# summarize the new class distribution
counter = Counter(y_train_rel)
print(counter)

fit_and_print(best_grid, X_train_rel, y_train_rel)

calculate_pred_and_inf_time(best_grid, X_test)

Counter({0: 114988, 1: 15577})
Counter({0: 111748, 1: 111748})
Confusion Matrix: 
 [[15597  4624]
 [  769  2051]]
Classification Report: 
               precision    recall  f1-score   support

           0       0.95      0.77      0.85     20221
           1       0.31      0.73      0.43      2820

    accuracy                           0.77     23041
   macro avg       0.63      0.75      0.64     23041
weighted avg       0.87      0.77      0.80     23041

Accuracy:  0.766
Precision: 0.307
Recall: 0.727
f1:  0.432
Inference Time: 614.533 miliseconds


### SMOTE and Edited Nearest Neighbours (ENN)

In [28]:
# Resample the imbalanced training set with SMOTE followed by ENN

# summarize class distribution
counter = Counter(y_train)
print(counter)
# transform the dataset
oversample = SMOTEENN(random_state=42)
X_train_rel, y_train_rel = oversample.fit_resample(X_train, y_train)
# summarize the new class distribution
counter = Counter(y_train_rel)
print(counter)

fit_and_print(best_grid, X_train_rel, y_train_rel)

calculate_pred_and_inf_time(best_grid, X_test)

Counter({0: 114988, 1: 15577})
Counter({1: 96382, 0: 81992})
Confusion Matrix: 
 [[14785  5436]
 [  635  2185]]
Classification Report: 
               precision    recall  f1-score   support

           0       0.96      0.73      0.83     20221
           1       0.29      0.77      0.42      2820

    accuracy                           0.74     23041
   macro avg       0.62      0.75      0.62     23041
weighted avg       0.88      0.74      0.78     23041

Accuracy:  0.737
Precision: 0.287
Recall: 0.775
f1:  0.419
Inference Time: 321.345 miliseconds
