# Support Vector Machine (SVM)
## <a href="#I">I Simple SVM</a>
## <a href="#II">II Implementing SVM with Scikit-Learn</a>
### <a href="#II.1">II.1 Preparing the Data</a>
### <a href="#II.2">II.2 Training the Algorithm</a>
### <a href="#II.3">II.3 Making Predictions</a>
### <a href="#II.4">II.4 Evaluating the Algorithm</a>
## <a href="#III">III Other Kernel SVM</a>
### <a href="#III.1">III.1 Preparing the Data</a>
### <a href="#III.2">III.2 Training the Algorithm and making predictions</a>
### <a href="#III.3">III.3 Evaluating the Algorithm</a>

# Support Vector Machine (SVM)

A __support vector machine__ (__SVM__) is a type of supervised machine learning classification algorithm. 

<a id="I"></a>
## I Simple SVM

For linearly separable data in two dimensions, a typical machine learning algorithm tries to find a boundary that divides the data so that the misclassification error is minimized. 
As the picture below shows, several boundaries (three lines here) can divide the data points correctly. 

<img src="nbimages/svm1.jpg" width=300 height=300/>

An SVM differs from other classification algorithms in how it chooses among these boundaries: it selects the decision boundary that maximizes the distance to the nearest data points of every class.

This optimal decision boundary is the one with the maximum margin from the nearest points of all the classes. 
The points closest to the _decision boundary_, which determine the margin, are called __support vectors__. 
For support vector machines, the _decision boundary_ is called the __maximum margin classifier__, or the __maximum margin hyperplane__.

<img src="nbimages/svm2.jpg" width=300 height=300/>
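
The relationship between the decision boundary, the support vectors and the margin can be checked numerically on a tiny example. The sketch below is only an illustration: the four points, their labels and the hyperplane `w`, `b` are hypothetical values chosen by hand (no SVM is trained), and the hyperplane is scaled so that the closest points satisfy |w · x + b| = 1.

```python
import math

# Toy 2-D data (hypothetical values, for illustration only).
points = [(3.0, 3.0), (5.0, 4.0), (1.0, 1.0), (0.0, 0.0)]
labels = [1, 1, -1, -1]

# Hand-picked separating hyperplane w . x + b = 0 (an assumption, not a
# fitted model), scaled so that the closest points give |w . x + b| = 1.
w = (0.5, 0.5)
b = -2.0

def decision_value(x):
    """Signed score of a point; its sign is the predicted class."""
    return w[0] * x[0] + w[1] * x[1] + b

def distance_to_boundary(x):
    """Euclidean distance from a point to the hyperplane."""
    return abs(decision_value(x)) / math.hypot(*w)

distances = [distance_to_boundary(p) for p in points]
min_distance = min(distances)

# The support vectors are the points closest to the boundary.
support_vectors = [p for p, d in zip(points, distances)
                   if math.isclose(d, min_distance)]

# With this scaling, the margin (gap between the two classes) is 2 / ||w||.
margin_width = 2.0 / math.hypot(*w)

print("support vectors:", support_vectors)
print("margin width:", round(margin_width, 4))
```

Only the two closest points determine the margin; moving the other two points further away would not change the boundary.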

Finding the support vectors, computing the margin between them and the decision boundary, and maximizing that margin is a constrained optimization problem that involves fairly complex mathematics. 
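
For reference, the standard textbook (hard-margin) formulation for linearly separable data is sketched below. The notation is an assumption introduced here rather than taken from the text above: x_i are the n training points, y_i ∈ {-1, +1} their labels, and w and b the normal vector and offset of the hyperplane.

```latex
\min_{w,\,b}\ \frac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \left( w^{\top} x_i + b \right) \ge 1,
\qquad i = 1, \dots, n
```

Under this scaling the margin width is 2 / ‖w‖, so minimizing ‖w‖ maximizes the margin, and the support vectors are the points for which the constraint holds with equality. Scikit-learn's `SVC` solves a soft-margin variant of this problem, in which the parameter `C` controls how strongly margin violations are penalized.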

<a id="II"></a>
## II Implementing SVM with Scikit-Learn

Our task is to predict whether a banknote is authentic, based on four attributes of the note: the variance, skewness and kurtosis of the wavelet-transformed image, and the entropy of the image (the dataset spells the kurtosis column `Curtosis`). This is a __binary classification__ problem.


In [1]:
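import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the banknote data: four numeric attributes plus the binary Class label
dataset = pd.read_csv('datasets/bill_authentication.csv')
dataset.head()   # first five rows
dataset.info()   # column types and non-null counts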

| | Variance | Skewness | Curtosis | Entropy | Class |
|---|---|---|---|---|---|
| 0 | 3.6216 | 8.6661 | -2.8073 | -0.44699 | 0 |
| 1 | 4.5459 | 8.1674 | -2.4586 | -1.4621 | 0 |
| 2 | 3.866 | -2.6383 | 1.9242 | 0.10645 | 0 |
| 3 | 3.4566 | 9.5228 | -4.0112 | -3.5944 | 0 |
| 4 | 0.32924 | -4.4552 | 4.5718 | -0.9888 | 0 |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1372 entries, 0 to 1371
Data columns (total 5 columns):
Variance    1372 non-null float64
Skewness    1372 non-null float64
Curtosis    1372 non-null float64
Entropy     1372 non-null float64
Class       1372 non-null int64
dtypes: float64(4), int64(1)
memory usage: 53.7 KB


<a id="II.1"></a>
### II.1 Preparing the Data

Preparing the data involves:

1. Dividing the data into attributes and labels.
2. Dividing the data into training and testing sets.

In [2]:
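# Attributes: every column except the last; labels: the last column (Class)
X = dataset.iloc[:, :-1]
y = dataset.iloc[:, -1]
type(X)
type(y)
y.head()

# Hold out 20% of the rows for testing; random_state makes the split reproducible
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train.head()
y_train.head()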

pandas.core.frame.DataFrame

pandas.core.series.Series

0    0
1    0
2    0
3    0
4    0
Name: Class, dtype: int64

| | Variance | Skewness | Curtosis | Entropy |
|---|---|---|---|---|
| 1326 | -1.2943 | 2.6735 | -0.84085 | -2.0323 |
| 1109 | -0.40857 | 3.0977 | -2.9607 | -2.6892 |
| 1139 | -1.5228 | -6.4789 | 5.7568 | 0.87325 |
| 657 | -0.278 | 8.1881 | -3.1338 | -2.5276 |
| 704 | 3.7022 | 6.9942 | -1.8511 | -0.12889 |


1326    1
1109    1
1139    1
657     0
704     0
Name: Class, dtype: int64

<a id="II.2"></a>
### II.2 Training the Algorithm

We have divided the data into training and testing sets; now we train our SVM on the training data.
The `SVC` class constructor takes the kernel type as a parameter (`kernel`). 
For a simple SVM we set this parameter to "__linear__" (a linear kernel can only learn a linear decision boundary, so it suits linearly separable data). 
We will see non-linear kernels ("__rbf__", "__poly__") in the next section.

In [3]:
from sklearn.svm import SVC
svclassifier = SVC(kernel='linear')
svclassifier.fit(X_train, y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='linear', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)

<a id="II.3"></a>
### II.3 Making Predictions


In [4]:
y_pred = svclassifier.predict(X_test)

<a id="II.4"></a>
### II.4 Evaluating the Algorithm

The __confusion matrix__, __precision__, __recall__ and __F1 score__ are among the most commonly used metrics for classification tasks. 
The following code computes them:

In [5]:
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))

[[155   2]
 [  0 118]]
              precision    recall  f1-score   support

           0       1.00      0.99      0.99       157
           1       0.98      1.00      0.99       118

    accuracy                           0.99       275
   macro avg       0.99      0.99      0.99       275
weighted avg       0.99      0.99      0.99       275



The confusion matrix shows only two misclassifications out of 275 test samples (two class 0 notes predicted as class 1): a very good result!
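
The figures in the classification report can be re-derived by hand from the confusion matrix printed above. The sketch below copies that matrix and recomputes the metrics for class 1; it assumes scikit-learn's convention that rows are true classes and columns are predicted classes, and the variable names are illustrative.

```python
# Confusion matrix copied from the output above:
# rows = true class, columns = predicted class.
cm = [[155, 2],
      [0, 118]]

tn, fp = cm[0]   # true class 0: correctly predicted 0 / wrongly predicted 1
fn, tp = cm[1]   # true class 1: wrongly predicted 0 / correctly predicted 1

total = tn + fp + fn + tp
misclassified = fp + fn

accuracy = (tn + tp) / total
precision_1 = tp / (tp + fp)   # of the notes predicted as class 1, how many are class 1
recall_1 = tp / (tp + fn)      # of the true class 1 notes, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print("misclassified:", misclassified, "of", total)
print("accuracy:", round(accuracy, 2))
print("class 1 precision / recall / F1:",
      round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))
```

The rounded values match the class 1 row and the accuracy line of the report.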

<a id="III"></a>
## III Other Kernel SVM

The simple SVM algorithm (the one using the __linear__ kernel) can be used to find a decision boundary for linearly separable data. 
For data that is not linearly separable, such as the data shown below, a straight line cannot serve as the decision boundary, so a different SVM kernel is used.

<img src="nbimages/svm3.jpg" width=300 height=300/>

In essence, with a non-linear kernel the SVM algorithm maps data that is not linearly separable in its original, lower-dimensional space into a higher-dimensional space in which it becomes linearly separable, so that a hyperplane can separate the points of different classes. 
Again, there is complex mathematics behind this algorithm.
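
One piece of that mathematics, often called the kernel trick, can be illustrated in a few lines: a kernel function returns the dot product two points would have in the higher-dimensional space without ever computing the mapping. The sketch below uses a degree 2 polynomial kernel on 2-D points; the feature map `phi` and the sample points are illustrative assumptions, not something scikit-learn exposes.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length tuples."""
    return sum(ai * bi for ai, bi in zip(a, b))

def phi(x):
    """Explicit degree 2 feature map: 2-D point -> 3-D point."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly2_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: (x . z) ** 2."""
    return dot(x, z) ** 2

x = (1.0, 2.0)
z = (3.0, -1.0)

explicit = dot(phi(x), phi(z))   # dot product computed in the 3-D space
implicit = poly2_kernel(x, z)    # same quantity computed in the 2-D space

print(explicit, implicit)
```

Both computations give the same number (up to floating-point rounding), which is why an SVM can work in the higher-dimensional space while only evaluating the kernel on the original data.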

In this example we will try to predict the category (species) to which an iris plant belongs based on four attributes: _sepal-length_, _sepal-width_, _petal-length_ and _petal-width_.


In [6]:
colnames = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
dataset2 = pd.read_csv('datasets/iris.data', names=colnames)
dataset2.head()

| | sepal-length | sepal-width | petal-length | petal-width | Class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


<a id="III.1"></a>
### III.1 Preparing the Data

In [7]:
X = dataset2.drop('Class', axis=1)
y = dataset2['Class']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)


<a id="III.2"></a>
### III.2 Training the Algorithm and making predictions

For the simple SVM we used "__linear__" as the value of the kernel parameter.<br> 
Here we treat the data as not linearly separable, so we use a different SVM kernel: __radial basis function__ (__rbf__), __polynomial__ (__poly__) or __sigmoid__. <br>

__Note__: the parameter __degree__ sets the degree of the polynomial kernel function (it is ignored by all other kernels).
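
To make the role of these parameters concrete, here is a minimal pure-Python sketch of the three kernel functions, following the formulas given in the scikit-learn documentation. The function names and default values are illustrative assumptions and are not scikit-learn's own code; `gamma`, `coef0` and `degree` mirror the `SVC` parameters of the same names.

```python
import math

def dot(x, z):
    """Plain dot product of two equal-length tuples."""
    return sum(xi * zi for xi, zi in zip(x, z))

def poly_kernel(x, z, gamma=1.0, coef0=0.0, degree=3):
    """Polynomial kernel: (gamma * <x, z> + coef0) ** degree."""
    return (gamma * dot(x, z) + coef0) ** degree

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian (RBF) kernel: exp(-gamma * ||x - z||^2). No degree involved."""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * sq_dist)

def sigmoid_kernel(x, z, gamma=1.0, coef0=0.0):
    """Sigmoid kernel: tanh(gamma * <x, z> + coef0). No degree involved."""
    return math.tanh(gamma * dot(x, z) + coef0)

x = (1.0, 2.0)
z = (2.0, 0.0)

print(poly_kernel(x, z, degree=1))   # degree 1 reduces to gamma * <x, z> + coef0
print(poly_kernel(x, z, degree=3))
print(rbf_kernel(x, z, gamma=0.5))
print(sigmoid_kernel(x, z))
```

Only `poly_kernel` takes a `degree` argument, which mirrors the note above.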

In [8]:
from IPython.display import display
from ipywidgets import interactive
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix

def testKernel(kernelType, degree):
    # Train an SVC with the chosen kernel and report its test-set metrics.
    # degree only affects the 'poly' kernel.
    svclassifier = SVC(kernel=kernelType, degree=degree, gamma='scale')
    svclassifier.fit(X_train, y_train)
    y_pred = svclassifier.predict(X_test)
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))

# Dropdown for the kernel and a slider (1 to 20) for the polynomial degree
inter = interactive(
    testKernel,
    kernelType=['poly', 'rbf', 'sigmoid'],
    degree=(1, 20),
)

display(inter)

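
The widget only shows results when the notebook is run interactively. As a non-interactive alternative, the sketch below loops over the same three kernels and prints the test accuracy of each. It makes two assumptions: scikit-learn's bundled iris data stands in for `datasets/iris.data`, and the polynomial degree is fixed at 3. The variable names are illustrative and chosen not to overwrite the ones used above.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Assumption: the bundled iris data stands in for 'datasets/iris.data'.
X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=0)

results = {}
for kernel in ("poly", "rbf", "sigmoid"):
    # degree is only used by the 'poly' kernel
    model = SVC(kernel=kernel, degree=3, gamma="scale")
    model.fit(X_tr, y_tr)
    results[kernel] = accuracy_score(y_te, model.predict(X_te))

for kernel, acc in results.items():
    print(f"{kernel}: accuracy = {acc:.2f}")
```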

<a id="III.3"></a>
### III.3 Evaluating the Algorithm

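Comparing the different kernels, we can clearly see that the __sigmoid__ kernel performs the worst. The number of classes is not the cause: `SVC` handles multi-class problems with a one-vs-one scheme whatever the kernel. A more likely explanation is the sigmoid kernel itself, tanh(γ⟨x, x'⟩ + r): it is not a valid (positive semi-definite) kernel for every choice of γ and r (`gamma` and `coef0`), and its results are sensitive to those parameters, which are left at `gamma='scale'` and the default `coef0` here.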

Between the __radial basis function__ (Gaussian) kernel and the __polynomial__ kernel, the Gaussian kernel achieved a perfect 100% prediction rate while the polynomial kernel misclassified one instance, so the Gaussian kernel performed slightly better. <br>
However, there is no hard and fast rule as to which kernel performs best in every scenario. The practical approach is to try the candidate kernels and select the one with the best results on held-out data, ideally a validation set or cross-validation rather than the final test set.
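
One common way to compare kernels without consulting the test set is k-fold cross-validation on the training data. The sketch below is only one possible setup, and several things in it are assumptions rather than part of the notebook above: it uses scikit-learn's bundled iris data, 5 folds, default hyperparameters apart from `gamma='scale'`, and illustrative variable names.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

# Assumption: the bundled iris data stands in for 'datasets/iris.data'.
X_all, y_all = load_iris(return_X_y=True)
X_fit, X_holdout, y_fit, y_holdout = train_test_split(
    X_all, y_all, test_size=0.2, random_state=0)

candidates = ("linear", "poly", "rbf", "sigmoid")
cv_scores = {}
for kernel in candidates:
    # Mean accuracy over 5 folds of the training portion only
    scores = cross_val_score(SVC(kernel=kernel, gamma="scale"), X_fit, y_fit, cv=5)
    cv_scores[kernel] = scores.mean()

best_kernel = max(cv_scores, key=cv_scores.get)

# Refit the selected kernel on all training data, then score once on the hold-out set
final_model = SVC(kernel=best_kernel, gamma="scale").fit(X_fit, y_fit)
print(best_kernel, final_model.score(X_holdout, y_holdout))
```

The hold-out set is used once, at the end, so its score is not biased by the kernel selection.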

