# Support Vector Machine

In [None]:
!pip install six
!pip install pandas
!pip install numpy
!pip install scikit-learn
!pip install matplotlib
!pip install imbalanced-learn



In [None]:
import pandas as pd
import numpy as np
import sklearn
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from imblearn.under_sampling import RandomUnderSampler

In [None]:
train_set = pd.read_csv('train_set_with_features.csv')


## Data Prep

In [None]:
# Random undersampler: reduces the majority class to the same number of instances as the minority class.
undersample = RandomUnderSampler(sampling_strategy='majority')

# Keep only the engineered features in x; y holds the is_duplicate label.
x = train_set.drop(['id', 'qid1', 'qid2', 'question1', 'question2', 'is_duplicate', 'Unnamed: 0'], axis=1)
y = train_set[['is_duplicate']]

# Grid search is slow, so only 50% of the data is used to train the grid-search model while searching for the best parameters for the final SVM.
x_grid_train, x_grid_test, y_grid_train, y_grid_test = train_test_split(x, y, test_size = 0.5, random_state = 42)

# Split 80% of data for the final model training and 20% for testing.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 42)

# Standardize, then undersample, the data used by the final model (the scaler is fit on the training split only).
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
x_train, y_train = undersample.fit_resample(x_train, y_train)

# Standardize, then undersample, the data used by the grid-search model.
# A separate scaler is used so the one fitted on x_train above is not overwritten.
grid_scaler = StandardScaler()
x_grid_train = grid_scaler.fit_transform(x_grid_train)
x_grid_test = grid_scaler.transform(x_grid_test)
x_grid_train, y_grid_train = undersample.fit_resample(x_grid_train, y_grid_train)
# GridSearchCV and SVC expect 1-D label arrays, so flatten the single-column DataFrames.
y_grid_train = y_grid_train.to_numpy().reshape(-1)
y_grid_test = y_grid_test.to_numpy().reshape(-1)
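
To make the undersampling step concrete, here is a minimal NumPy sketch of what randomly undersampling the majority class amounts to. It is an illustration under assumptions, not imblearn's implementation: the helper name `undersample_majority` and the toy data are hypothetical.

```python
import numpy as np

def undersample_majority(X, y, seed=0):
    """Randomly drop rows until every class has as many rows as the smallest one.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority_count = counts.min()
    keep = []
    for cls in classes:
        idx = np.flatnonzero(y == cls)
        # Sample without replacement; the minority class keeps every row.
        keep.append(rng.choice(idx, size=minority_count, replace=False))
    keep = np.sort(np.concatenate(keep))
    return X[keep], y[keep]

# Toy data: 8 non-duplicates (0) and 3 duplicates (1).
X_toy = np.arange(22).reshape(11, 2)
y_toy = np.array([0] * 8 + [1] * 3)
X_bal, y_bal = undersample_majority(X_toy, y_toy)
print(np.bincount(y_bal))  # [3 3]
```

Only the training splits are resampled in the cell above; the test split keeps its original class balance, which is why the reports further down show unequal support for the two classes.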

## Parameter tuning

In [None]:
# Run GridSearchCV over these candidate SVM parameters.
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [1, 0.1, 0.01, 0.001],
    'kernel': ['rbf', 'sigmoid'],
}
grid = GridSearchCV(SVC(), param_grid, refit=True, verbose=2, n_jobs=3)
grid.fit(x_grid_train, y_grid_train)

Fitting 5 folds for each of 32 candidates, totalling 160 fits


GridSearchCV(estimator=SVC(), n_jobs=3,
             param_grid={'C': [0.1, 1, 10, 100], 'gamma': [1, 0.1, 0.01, 0.001],
                         'kernel': ['rbf', 'sigmoid']},
             verbose=2)
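
One possible refinement, sketched here on synthetic data rather than the notebook's features: wrapping the scaler and the SVM in a scikit-learn `Pipeline` makes `GridSearchCV` refit the scaler inside every fold, so validation folds never contribute to the scaling statistics. The `make_classification` data, the step names `scale` and `svc`, and the reduced grid are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the engineered features (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

# Scaling becomes part of the estimator, so it is re-fit on each training fold.
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
demo_grid = {
    "svc__C": [1, 10],
    "svc__gamma": [0.1, 0.01],
    "svc__kernel": ["rbf", "sigmoid"],
}
search = GridSearchCV(pipe, demo_grid, cv=3)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

To keep the undersampling inside the folds as well, imblearn provides `imblearn.pipeline.Pipeline`, which accepts samplers such as `RandomUnderSampler` as steps.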

In [None]:
# Best parameters for the SVM. The kernel is not printed because 'rbf' is SVC's default value.
print(grid.best_estimator_)

SVC(C=10, gamma=0.1)
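
The printed estimator leaves out `kernel` because, with scikit-learn's default print settings in recent versions, only parameters that differ from their defaults are shown, and `'rbf'` is the default kernel of `SVC`. The standalone check below (separate from the notebook's `grid` object; the name `best_like` is hypothetical) illustrates this.

```python
from sklearn.svm import SVC

# An estimator configured like the one the grid search reported.
best_like = SVC(C=10, gamma=0.1, kernel="rbf")

# With default print settings, parameters equal to their defaults are omitted.
print(best_like)

# get_params() always lists every parameter, including the kernel.
print(best_like.get_params()["kernel"])  # rbf
```

On a fitted search, `grid.best_params_` returns the selected value of every tuned parameter explicitly, kernel included.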


In [None]:
# Print out the performance of the SVM model trained by gridSearch using the best parameters.
grid_predictions = grid.predict(x_test)
print(confusion_matrix(y_test,grid_predictions))
print(classification_report(y_test,grid_predictions))

[[32556 18600]
 [ 3580 26118]]
              precision    recall  f1-score   support

           0       0.90      0.64      0.75     51156
           1       0.58      0.88      0.70     29698

    accuracy                           0.73     80854
   macro avg       0.74      0.76      0.72     80854
weighted avg       0.78      0.73      0.73     80854
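
As a sanity check, the per-class figures in the report can be recomputed by hand from the confusion matrix printed above it. The snippet below is plain arithmetic on those four counts; the variable names are chosen here for illustration.

```python
# Confusion matrix from the output above: rows are true classes, columns are predictions.
tn, fp = 32556, 18600  # true class 0 (not duplicate)
fn, tp = 3580, 26118   # true class 1 (duplicate)

precision_0 = tn / (tn + fn)  # of all "not duplicate" predictions, how many were right
recall_0 = tn / (tn + fp)     # of all true non-duplicates, how many were found
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
accuracy = (tn + tp) / (tn + fp + fn + tp)

for name, value in [
    ("precision 0", precision_0),
    ("recall 0", recall_0),
    ("precision 1", precision_1),
    ("recall 1", recall_1),
    ("accuracy", accuracy),
]:
    print(f"{name}: {value:.2f}")
```

The rounded values (0.90, 0.64, 0.58, 0.88 and 0.73) match the classification report.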



## Fitting model based on tuned parameters

In [None]:
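# Train the final SVM on more data (80% instead of 50%) using the tuned parameters.
# The grid search selected kernel='rbf' (the SVC default); manually trying other kernel types
# also showed 'rbf' to be the best.
# Note: gamma=0.01 is used here, whereas the grid search output above reports gamma=0.1.

# Kernel = 'rbf' (the degree parameter only applies to the 'poly' kernel, so it is omitted)
SVM = SVC(C=10, kernel='rbf', gamma=0.01)
# Flatten the single-column labels to 1-D, which is the shape scikit-learn expects for y.
clf = SVM.fit(x_train, np.ravel(y_train))
predictions_SVM = SVM.predict(x_test)

# Print the performance of the SVM trained with these parameters on the 20% test split.
print(classification_report(y_test, predictions_SVM))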



              precision    recall  f1-score   support

           0       0.91      0.59      0.71     51156
           1       0.56      0.90      0.69     29698

    accuracy                           0.70     80854
   macro avg       0.74      0.74      0.70     80854
weighted avg       0.78      0.70      0.70     80854



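### Process:
1. Standardize the feature-engineered training data and undersample the majority class.
2. Tune the parameters with GridSearchCV, which fits the SVM for every combination of the candidate parameter values and evaluates each one with 5-fold cross-validation. (10000 rows)
3. The grid search output above reports C = 10 and gamma = 0.1 as the best parameters; the final model was fit with C = 10 and gamma = 0.01.
4. Upon testing, the best kernel for those parameters is rbf.

The results suggest that the model is more reliable when it predicts that a question pair is NOT a duplicate: the final model's precision is 0.91 for class 0 versus 0.56 for class 1. The trade-off is recall, since it finds 90% of the true duplicates but only 59% of the true non-duplicates.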


### Advantages:
1. By using a kernel, the classes can be separated even if the data is not linearly separable. (https://core.ac.uk/reader/6302770)

2. SVM generalizes well out of sample because it uses regularization, which helps prevent overfitting to the training data.

3. SVM can classify new data points faster than some other models (such as kNN) because the decision boundary depends only on the support vectors, not on every data point used to train the model.
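
The third point can be checked directly. The sketch below, on synthetic data (an assumption; the notebook's features are not used), rebuilds the decision scores of a fitted RBF `SVC` from its support vectors, dual coefficients and intercept alone, then compares them with `decision_function`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

# Synthetic binary problem standing in for the real features (assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
clf_demo = SVC(C=10, kernel="rbf", gamma=0.01).fit(X_demo, y_demo)

# Kernel values between each query point and the support vectors only.
K = rbf_kernel(X_demo, clf_demo.support_vectors_, gamma=0.01)

# Binary SVC score: weighted sum over support vectors plus the intercept.
manual_scores = K @ clf_demo.dual_coef_.ravel() + clf_demo.intercept_[0]

print(len(clf_demo.support_vectors_), "support vectors out of", len(X_demo), "training points")
print(np.allclose(manual_scores, clf_demo.decision_function(X_demo)))
```

Prediction cost therefore grows with the number of support vectors rather than with the size of the training set, although on noisy data a large share of the training points can end up as support vectors.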

### Disadvantages:
1. Does not perform well on skewed datasets such as ours. The decision boundary has high variance because the few points of the under-represented class can shift it considerably.
https://www.quora.com/Why-does-SVM-not-perform-well-for-imbalanced-data

2. Takes a long time to train when the dataset is large. "As you mention, storing the kernel matrix requires memory that scales quadratically with the number of data points. Training time for traditional SVM algorithms also scales superlinearly with the number of data points. So, these algorithms aren't feasible for large data sets."
https://stats.stackexchange.com/questions/314329/can-support-vector-machine-be-used-in-large-data
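
To put the quadratic memory growth from the quote into numbers, the following back-of-the-envelope sketch computes the size of a full dense kernel matrix of 64-bit floats. The helper `kernel_matrix_gib` and the row counts are hypothetical, chosen only to show the trend.

```python
def kernel_matrix_gib(n_samples, bytes_per_value=8):
    """Size in GiB of a dense n x n kernel matrix (float64 by default)."""
    return n_samples ** 2 * bytes_per_value / 2 ** 30

# Doubling the number of rows quadruples the memory needed.
for n in (10_000, 20_000, 100_000, 300_000):
    print(f"{n:>7} rows -> {kernel_matrix_gib(n):8.1f} GiB")
```

In practice `SVC` does not hold the whole matrix; it keeps a bounded kernel cache (the `cache_size` parameter) and recomputes entries as needed, which trades memory for the longer training times noted above.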

