# Support Vector Machine

<img src='s1.png' />

In [None]:
real world examples

- image classification
- OCR(Optical Character Recognition)
- bioinformatics - gene classification, protein structure prediction
- speech recognition

For images and OCR, see:
https://pyimagesearch.com/

## When to use SVM

In [None]:
- binary classification
- small or medium size dataset
- high dimensional data
- linear or non-linear

## Hyperplane, Margin, Support Vector

<img src='s2.png' />


## Margin and Hyperplane to choose

<img src='s3.png' />

In [None]:
objective -> select the hyperplane with the maximum possible margin between the support vectors of the two classes

2 steps ->

1 - generate candidate hyperplanes that separate the classes.
    the best hyperplane is the one with the largest separation (margin) between the two classes.
    
2 - choose the maximum margin hyperplane -> the linear classifier it defines is known as
                                 the maximum margin classifier.
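
A minimal sketch of the margin idea, assuming a made-up, linearly separable toy dataset (it is not part of the pulsar data). For a linear SVM with weight vector `w`, the margin width is `2 / ||w||`, so maximizing the margin is the same as minimizing `||w||`. A very large `C` is used here only to approximate a hard margin.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable points (invented for illustration).
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y = np.array([0, 0, 1, 1])

# A very large C approximates a hard-margin classifier.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]          # normal vector of the separating hyperplane
b = clf.intercept_[0]     # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)

print("w =", w, "b =", b)
print("margin width =", margin_width)
print("support vectors:\n", clf.support_vectors_)
```

The two classes sit at x = 0 and x = 2, so the widest possible gap between them is 2, and the fitted margin width should come out close to that.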



## Kernel Trick


<img src='s4.png' />

In [None]:
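kernel trick -> a kernel function computes similarities (inner products) as if the data had been
                mapped to a higher dimension, without computing that mapping explicitly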

4 types of kernels

1 - Linear Kernel -> when you have a linearly separable dataset

2 - Polynomial Kernel -> when the decision boundary is a polynomial (curved plane or line)
                        often used in NLP

3 - Radial Basis Function (rbf) / Gaussian kernel -> when you have no prior knowledge of the data.
                            non-linear data with complex decision boundaries

4 - sigmoid kernel -> the data has sigmoidal decision boundary
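
A small sketch of why kernels are useful, assuming a homogeneous degree-2 polynomial kernel on 2-D vectors. The helper names `phi` and `poly2_kernel` are hypothetical and exist only for this illustration. The kernel value computed in the original 2-D space equals the inner product after an explicit mapping into 3-D, so the mapping never has to be carried out.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D vector (hypothetical helper)."""
    x1, x2 = v
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

def poly2_kernel(a, b):
    """Homogeneous degree-2 polynomial kernel: (a . b) ** 2."""
    return float(np.dot(a, b)) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

explicit = float(np.dot(phi(a), phi(b)))  # inner product in the 3-D feature space
implicit = poly2_kernel(a, b)             # same value, computed in the original 2-D space

print(explicit, implicit)
```

Both numbers agree up to floating-point rounding. For the RBF kernel the corresponding feature space is infinite-dimensional, which is why working through the kernel function is the only practical route there.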

# Dataset Pulsar

#### predicting a pulsar star

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv('pulsar_data_train.csv')

df.head()

Unnamed: 0,Mean of the integrated profile,Standard deviation of the integrated profile,Excess kurtosis of the integrated profile,Skewness of the integrated profile,Mean of the DM-SNR curve,Standard deviation of the DM-SNR curve,Excess kurtosis of the DM-SNR curve,Skewness of the DM-SNR curve,target_class
0,121.15625,48.372971,0.375485,-0.013165,3.168896,18.399367,7.449874,65.159298,0.0
1,76.96875,36.175557,0.712898,3.388719,2.399666,17.570997,9.414652,102.722975,0.0
2,130.585938,53.229534,0.133408,-0.297242,2.743311,22.362553,8.508364,74.031324,0.0
3,156.398438,48.865942,-0.215989,-0.171294,17.471572,,2.958066,7.197842,0.0
4,84.804688,36.117659,0.825013,3.274125,2.790134,20.618009,8.405008,76.291128,0.0


In [3]:
df.shape

(12528, 9)

In [4]:
col_names = df.columns

col_names

Index([' Mean of the integrated profile',
       ' Standard deviation of the integrated profile',
       ' Excess kurtosis of the integrated profile',
       ' Skewness of the integrated profile', ' Mean of the DM-SNR curve',
       ' Standard deviation of the DM-SNR curve',
       ' Excess kurtosis of the DM-SNR curve', ' Skewness of the DM-SNR curve',
       'target_class'],
      dtype='object')

In [5]:
df.columns = df.columns.str.strip()
df.columns

Index(['Mean of the integrated profile',
       'Standard deviation of the integrated profile',
       'Excess kurtosis of the integrated profile',
       'Skewness of the integrated profile', 'Mean of the DM-SNR curve',
       'Standard deviation of the DM-SNR curve',
       'Excess kurtosis of the DM-SNR curve', 'Skewness of the DM-SNR curve',
       'target_class'],
      dtype='object')

In [6]:
df.columns = ['IP Mean', 'IP Sd', 'IP Kurtosis', 'IP Skewness', 'DM-SNR Mean', 'DM-SNR Sd',
             'DM-SNR Kurtosis', 'DM-SNR Skewness', 'target_class']

df.columns

Index(['IP Mean', 'IP Sd', 'IP Kurtosis', 'IP Skewness', 'DM-SNR Mean',
       'DM-SNR Sd', 'DM-SNR Kurtosis', 'DM-SNR Skewness', 'target_class'],
      dtype='object')

In [7]:
df.target_class.value_counts()  # the classes are imbalanced

0.0    11375
1.0     1153
Name: target_class, dtype: int64

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12528 entries, 0 to 12527
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   IP Mean          12528 non-null  float64
 1   IP Sd            12528 non-null  float64
 2   IP Kurtosis      10793 non-null  float64
 3   IP Skewness      12528 non-null  float64
 4   DM-SNR Mean      12528 non-null  float64
 5   DM-SNR Sd        11350 non-null  float64
 6   DM-SNR Kurtosis  12528 non-null  float64
 7   DM-SNR Skewness  11903 non-null  float64
 8   target_class     12528 non-null  float64
dtypes: float64(9)
memory usage: 881.0 KB


In [9]:
df.dropna(inplace=True)

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9273 entries, 0 to 12527
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   IP Mean          9273 non-null   float64
 1   IP Sd            9273 non-null   float64
 2   IP Kurtosis      9273 non-null   float64
 3   IP Skewness      9273 non-null   float64
 4   DM-SNR Mean      9273 non-null   float64
 5   DM-SNR Sd        9273 non-null   float64
 6   DM-SNR Kurtosis  9273 non-null   float64
 7   DM-SNR Skewness  9273 non-null   float64
 8   target_class     9273 non-null   float64
dtypes: float64(9)
memory usage: 724.5 KB


In [10]:
df.target_class.value_counts()

0.0    8423
1.0     850
Name: target_class, dtype: int64
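
Because the classes are imbalanced, it helps to know what accuracy a trivial "always predict 0" model would reach. The sketch below uses only the class counts printed above (after `dropna`). It describes the whole cleaned dataset, not the later test split, so treat it as a rough baseline.

```python
# Class counts copied from the value_counts() output above (after dropna).
counts = {0.0: 8423, 1.0: 850}

total = sum(counts.values())
majority_share = max(counts.values()) / total

print(f"total rows: {total}")
print(f"majority-class baseline accuracy: {majority_share:.4f}")
```

Any accuracy score reported for a classifier on this data is best read against that baseline.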

In [11]:
df.isnull().sum()

IP Mean            0
IP Sd              0
IP Kurtosis        0
IP Skewness        0
DM-SNR Mean        0
DM-SNR Sd          0
DM-SNR Kurtosis    0
DM-SNR Skewness    0
target_class       0
dtype: int64

In [13]:
round(df.describe(), 2)

Unnamed: 0,IP Mean,IP Sd,IP Kurtosis,IP Skewness,DM-SNR Mean,DM-SNR Sd,DM-SNR Kurtosis,DM-SNR Skewness,target_class
count,9273.0,9273.0,9273.0,9273.0,9273.0,9273.0,9273.0,9273.0,9273.0
mean,111.13,46.51,0.48,1.79,12.74,26.33,8.33,105.78,0.09
std,25.69,6.78,1.07,6.29,29.77,19.54,4.55,108.17,0.29
min,6.19,24.77,-1.74,-1.79,0.21,7.37,-2.64,-1.98,0.0
25%,100.98,42.4,0.02,-0.19,1.91,14.38,5.79,34.92,0.0
50%,115.23,46.9,0.22,0.2,2.8,18.44,8.43,83.15,0.0
75%,127.33,51.0,0.47,0.93,5.46,28.39,10.72,139.77,0.0
max,189.73,91.81,8.07,68.1,211.95,110.64,34.54,1191.0,1.0


In [None]:
# how to handle outliers with SVM

2 variants

1 - hard margin -> allows no margin violations, so it is sensitive to outliers

2 - soft margin -> tolerates outliers
        - every data point that falls inside the margin or on the wrong side
            adds a penalty, weighted by the C parameter

low C -> small penalty, allows more margin violations (outliers)

high C -> large penalty, allows fewer margin violations (outliers)
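
A rough sketch of the effect of `C`, assuming synthetic overlapping blobs rather than the pulsar data. It counts support vectors for a low and a high `C`. A low `C` usually gives a wider margin, so more points end up on or inside it.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic, overlapping two-class data (not the pulsar dataset).
X_toy, y_toy = make_blobs(n_samples=200, centers=[[0, 0], [3, 3]],
                          cluster_std=1.5, random_state=0)

n_support = {}
for C in (0.01, 100.0):
    model = SVC(kernel="linear", C=C).fit(X_toy, y_toy)
    # Support vectors are the points on or inside the margin.
    n_support[C] = int(model.n_support_.sum())
    print(f"C={C}: {n_support[C]} support vectors")
```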

In [14]:
x = df.drop(['target_class'], axis=1)

y = df['target_class']

In [15]:
from sklearn.model_selection import train_test_split

xtrain, xtest, ytrain, ytest = train_test_split(x,y, test_size=0.2, random_state=101)

### feature scaling

In [16]:
cols = xtrain.columns

In [18]:
cols

Index(['IP Mean', 'IP Sd', 'IP Kurtosis', 'IP Skewness', 'DM-SNR Mean',
       'DM-SNR Sd', 'DM-SNR Kurtosis', 'DM-SNR Skewness'],
      dtype='object')

In [17]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

xtrain = scaler.fit_transform(xtrain)

xtest = scaler.transform(xtest)

In [19]:
# pass the Index itself; wrapping it in a list would create MultiIndex columns
xtrain = pd.DataFrame(xtrain, columns=cols)
xtest = pd.DataFrame(xtest, columns=cols)

In [20]:
xtrain.describe()

Unnamed: 0,IP Mean,IP Sd,IP Kurtosis,IP Skewness,DM-SNR Mean,DM-SNR Sd,DM-SNR Kurtosis,DM-SNR Skewness
count,7418.0,7418.0,7418.0,7418.0,7418.0,7418.0,7418.0,7418.0
mean,2.2330180000000002e-17,-3.011581e-16,3.5919859999999997e-19,1.36271e-17,1.1022910000000001e-17,-5.772621000000001e-17,1.600305e-16,-5.441111e-17
std,1.000067,1.000067,1.000067,1.000067,1.000067,1.000067,1.000067,1.000067
min,-4.071793,-3.200599,-2.0604,-0.5652381,-0.4198631,-0.9729856,-2.37769,-0.9903806
25%,-0.3987415,-0.6028331,-0.4222028,-0.3144755,-0.3630773,-0.6149471,-0.5575565,-0.6525814
50%,0.1601401,0.05847557,-0.2397311,-0.251749,-0.3337741,-0.402084,0.02183587,-0.2096842
75%,0.632007,0.6617423,-0.006742316,-0.1373019,-0.2447974,0.1083867,0.5220626,0.3083765
max,3.057494,6.684538,7.039803,10.43051,6.557319,4.337836,5.742929,9.953967
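
As an alternative to scaling by hand, the scaler and the classifier can be chained so that the scaler is always fitted on training data only. This is a sketch on synthetic data, using scikit-learn's `make_pipeline`. The variable names are placeholders and the score says nothing about the pulsar dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in data with 8 features.
X_demo, y_demo = make_classification(n_samples=500, n_features=8,
                                     weights=[0.9, 0.1], random_state=101)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2,
                                      random_state=101, stratify=y_demo)

# fit() scales with training statistics only; score() reuses them on the test split.
pipe = make_pipeline(StandardScaler(), SVC())
pipe.fit(Xtr, ytr)
demo_score = pipe.score(Xte, yte)

print(demo_score)
```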


# SVC classifier

In [21]:
from sklearn.svm import SVC

from sklearn.metrics import accuracy_score

In [22]:
svc = SVC()  # default hyperparameters: C=1.0, kernel='rbf'

svc.fit(xtrain, ytrain)

ypred = svc.predict(xtest)



In [23]:
accuracy_score(ytest, ypred)

0.9719676549865229

In [24]:
# C = 100 and kernel = rbf (the default kernel)

svc = SVC(C=100.0)

svc.fit(xtrain, ytrain)

ypred = svc.predict(xtest)

print(accuracy_score(ytest, ypred))



0.9725067385444744




In [25]:
# C = 1000 and kernel = rbf (the default kernel)

svc = SVC(C=1000.0)

svc.fit(xtrain, ytrain)

ypred = svc.predict(xtest)

print(accuracy_score(ytest, ypred))



0.9735849056603774




In [26]:
# C = 1 and kernel = linear

linear_svc = SVC(C=1, kernel='linear')

linear_svc.fit(xtrain, ytrain)

ypred = linear_svc.predict(xtest)

print(accuracy_score(ytest, ypred))

0.9730458221024259




# Assignment

- polynomial kernel
- sigmoid kernel

compare the train and test results
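
One possible scaffold for the comparison, sketched on synthetic stand-in data so that it runs on its own. To do the assignment, the scaled pulsar `xtrain`, `xtest`, `ytrain`, `ytest` would be swapped in. The numbers it prints are not pulsar results.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; replace with the scaled pulsar splits.
X_syn, y_syn = make_classification(n_samples=600, n_features=8, random_state=101)
Xa, Xb, ya, yb = train_test_split(X_syn, y_syn, test_size=0.2, random_state=101)

scaler_syn = StandardScaler()
Xa = scaler_syn.fit_transform(Xa)   # fit on the training split only
Xb = scaler_syn.transform(Xb)

results = {}
for kernel in ("poly", "sigmoid"):
    model = SVC(kernel=kernel).fit(Xa, ya)
    train_acc = accuracy_score(ya, model.predict(Xa))
    test_acc = accuracy_score(yb, model.predict(Xb))
    results[kernel] = (train_acc, test_acc)
    print(f"{kernel}: train={train_acc:.3f} test={test_acc:.3f}")
```

A large gap between train and test accuracy for a kernel would suggest overfitting.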

https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html