# Lesson 6.05 Support Vector Machines

## What are Support Vector Machines (SVMs)?

An SVM is a classifier that finds the optimal hyperplane, i.e. the one that maximises the margin between two classes.

* SVMs are fantastic models if all you care about is predictive ability
* They are largely black boxes, i.e. the significance of predictors is unknown (only a linear kernel exposes per-feature coefficients, via `coef_`)
* You must **scale your data** since SVM tries to maximize the distance between the separating plane and the support vectors. If one feature (i.e. one dimension in this space) has very large values, it will dominate the other features when calculating the distance. If you rescale all features, they all have a comparable influence on the distance metric.
* SVMs with a polynomial kernel of degree 2 have been reported to work really well for NLP data!
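
The scaling point above can be sketched with a `Pipeline` that standardises features before they reach the SVM. This is a minimal sketch, not the lesson's own code: the data is synthetic (`make_classification`), and inflating one feature by 1000 is an assumption made purely to mimic a feature measured on a much larger scale.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data standing in for a real dataset (assumption for illustration)
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Inflate one feature so it dominates any distance calculation
X[:, 0] *= 1000

# Same classifier, with and without standardisation in front of it
unscaled_model = SVC()
scaled_model = make_pipeline(StandardScaler(), SVC())

# 5-fold cross-validated accuracy for each variant
unscaled_acc = cross_val_score(unscaled_model, X, y, cv=5).mean()
scaled_acc = cross_val_score(scaled_model, X, y, cv=5).mean()

print('Unscaled accuracy: {:.3f}'.format(unscaled_acc))
print('Scaled accuracy:   {:.3f}'.format(scaled_acc))
```

Putting the scaler inside the pipeline means it is re-fitted on each training fold only, so no information from the validation fold leaks into the scaling step.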


### Pros
- Exceptional performance (historically widely used)
- Effective in high-dimensional data
- Can work with non-linear boundaries
- Fast to compute with most datasets (kernel trick)
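
To illustrate the point about non-linear boundaries, here is a small sketch comparing a linear kernel with the RBF kernel on data that no straight line can separate. The concentric-circles data from `make_circles` is an assumption chosen for illustration and is not part of the lesson's dataset.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: one class inside, the other outside
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A linear kernel can only draw a straight separating line
linear_acc = SVC(kernel='linear').fit(X_train, y_train).score(X_test, y_test)

# The RBF kernel implicitly maps the data to a space where the rings separate
rbf_acc = SVC(kernel='rbf').fit(X_train, y_train).score(X_test, y_test)

print('Linear kernel accuracy: {:.3f}'.format(linear_acc))
print('RBF kernel accuracy:    {:.3f}'.format(rbf_acc))
```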

### Cons
- Black box method, i.e. significance of predictors is unknown
- Can be slow on large datasets, i.e. a massive number of rows


### Import Library

Import [Support Vector Machines](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) from `sklearn` and explore the hyperparameters.

In [1]:
from sklearn.svm import SVC

# Explore the hyperparameters and default values of SVC class
# Note we do not tune all the hyperparameters
# Rather we focus on hyperparameters that are most impactful
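
One way to do the exploration described in the comments above is `get_params()`, which every scikit-learn estimator provides. The choice of which four hyperparameters to print is an assumption about what counts as "most impactful"; the lesson itself only tunes `C`.

```python
from sklearn.svm import SVC

# Dictionary of every constructor hyperparameter and its default value
defaults = SVC().get_params()

# Print a handful of commonly tuned hyperparameters
# (this selection is an assumption, not taken from the lesson)
for name in ('C', 'kernel', 'degree', 'gamma'):
    print('{} = {!r}'.format(name, defaults[name]))
```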

In [2]:
# Explore the attributes and methods of the SVC class
dir(SVC)

['__abstractmethods__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_abc_impl',
 '_check_n_features',
 '_check_proba',
 '_compute_kernel',
 '_decision_function',
 '_dense_decision_function',
 '_dense_fit',
 '_dense_predict',
 '_dense_predict_proba',
 '_estimator_type',
 '_get_coef',
 '_get_param_names',
 '_get_tags',
 '_impl',
 '_more_tags',
 '_pairwise',
 '_predict_log_proba',
 '_predict_proba',
 '_repr_html_',
 '_repr_html_inner',
 '_repr_mimebundle_',
 '_sparse_decision_function',
 '_sparse_fit',
 '_sparse_kernels',
 '_sparse_predict',
 '_sparse_predict_proba',
 '_validate_data',
 '_validate_for_predict',
 '_validate_targets',


## Fit and evaluate a model

We will be using the Titanic dataset from [this](https://www.kaggle.com/c/titanic/overview) Kaggle competition.

In this section, we will fit and evaluate a simple Support Vector Machine model.

### Read in Data

In [3]:
import joblib
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
import warnings

# if you are keen to remove the warnings in the output
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=DeprecationWarning)

tr_features = pd.read_csv('data/train_features.csv')
tr_labels = pd.read_csv('data/train_labels.csv', header=None)

## Hyperparameter Tuning


![c](img/c.png)

In [4]:
# Display the best parameter values after hyperparameter tuning with GridSearchCV
def print_results(results):
    print('BEST PARAMS: {}\n'.format(results.best_params_))
    
    # mean accuracy of classification
    means = results.cv_results_['mean_test_score']
    
    # std deviation of classification accuracy
    stds = results.cv_results_['std_test_score']
    
    for mean, std, params in zip(means, stds, results.cv_results_['params']):
        print('{} (+/-{}) for {}'.format(round(mean, 3), round(std * 2, 3), params))

In [5]:
svc = SVC()
parameters = {
    'C': [0.1, 1, 10],
}

cv = GridSearchCV(svc, parameters, cv=5)
cv.fit(tr_features, tr_labels.values.ravel())

print_results(cv)

BEST PARAMS: {'C': 10}

0.654 (+/-0.06) for {'C': 0.1}
0.661 (+/-0.048) for {'C': 1}
0.684 (+/-0.07) for {'C': 10}


In [6]:
cv.best_estimator_

SVC(C=10)
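
As a hedged extension of the search above, the sketch below tunes `C` and `kernel` together while scaling the features inside a `Pipeline`, in line with the earlier advice to scale data for SVMs. The synthetic data and the step names `scaler` and `svc` are assumptions for illustration; the lesson's own grid only varies `C` on the Titanic features.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the training features and labels (assumption)
X, y = make_classification(n_samples=200, n_features=6, random_state=42)

# Scale first, then classify; the step names are arbitrary labels
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC()),
])

# Pipeline parameters are addressed as <step name>__<parameter name>
param_grid = {
    'svc__C': [0.1, 1, 10],
    'svc__kernel': ['linear', 'rbf'],
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

print('BEST PARAMS: {}'.format(search.best_params_))
print('BEST CV ACCURACY: {}'.format(round(search.best_score_, 3)))
```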

### Save model to external file
Save the best fitted model to a .pkl file so that it can be compared against other models, reused in other Jupyter notebooks, and shared with stakeholders.

This can be useful on projects where each member focuses on a separate set of models.

In [7]:
# Save model to file
joblib.dump(cv.best_estimator_, 'data/SVM_model.pkl')

# Read models back from file (assumes each model was saved with the same naming pattern)
# models = {}
# for mdl in ['LR', 'SVM', 'KNN', 'RF', 'GB']:
#     models[mdl] = joblib.load('data/{}_model.pkl'.format(mdl))

['data/SVM_model.pkl']
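
A saved model can be loaded back and used for prediction straight away. The round trip below is a minimal sketch under stated assumptions: it uses synthetic data and a temporary directory rather than the lesson's `data/` folder, so that it runs on its own.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in for the training data (assumption for illustration)
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
model = SVC(C=10).fit(X, y)

# Dump the fitted model to disk, then load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'SVM_model.pkl')
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model keeps its hyperparameters and its fitted state
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print('Restored C: {}'.format(restored.get_params()['C']))
print('Predictions identical: {}'.format(same_predictions))
```

Only load `.pkl` files from sources you trust, since unpickling can execute arbitrary code.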