<div id="container" style="position:relative;">
<div style="float:left">

***Kazi Shahid***

***BrainStation Data Science Diploma Candidate***

***Capstone Project***

=============================================================

***Project SteamBuzz: Will Our Game Create a Buzz in the Steam community?***

***Part 4 (d): Sentiment Analysis ML Model 4 - Support Vector Machines***
</div>
<div style="position:relative; float:right"><img style="height:100px" src ="https://i.ibb.co/mcvpL4Z/Steam-Buzz-logo.png" />
</div>
</div>

---
# Overview

In this part of the project, we will train a Support Vector classifier ("SVC") on the data.

A Support Vector Machine, or SVM, is a classifier that finds the hyperplane separating two classes with the largest possible margin. It can also be applied to multi-class problems by combining several binary SVMs (for example, in a "one-versus-rest" scheme), but at its core it remains a binary classifier. An SVM looks for a decision boundary that sits midway between the two classes, which it achieves by maximizing the distance between the boundary and the closest training points on either side. Those closest points are called the support vectors, and the distance between the decision boundary and the support vectors is the margin, which an SVM tries to make as wide as possible.
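
To make the idea of a maximum-margin boundary concrete, here is a minimal sketch on a tiny, made-up dataset (`X_toy` and `y_toy` are hypothetical and unrelated to the project data). It fits a linear SVC and reads off the training points closest to the boundary, which scikit-learn exposes as `support_vectors_`, along with the total margin width, computed as 2 / ||w||.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two points per class, separated along the first axis
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y_toy = np.array([0, 0, 1, 1])

# A very large C approximates a hard-margin SVM on separable data
svm_toy = SVC(kernel="linear", C=1e6).fit(X_toy, y_toy)

# Weight vector w and intercept b of the hyperplane w.x + b = 0
w = svm_toy.coef_[0]
b = svm_toy.intercept_[0]

# Total width of the margin between the two classes is 2 / ||w||
margin_width = 2.0 / np.linalg.norm(w)

print("Support vectors:\n", svm_toy.support_vectors_)
print("Margin width:", round(margin_width, 3))
```

In this toy layout the two classes sit 2 units apart, so the fitted margin width should come out close to 2.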

An SVC is computationally expensive to train, however. We will still give it a run to see whether it can be trained within a reasonable amount of time.

---
# Process Flow

The intended process flow for this part of the project is as follows:

1. Loading the dataset as was prepared and preprocessed in Part 3 of the project
2. Employing the classifier and optimizing its hyperparameters through Grid Search and cross validation
3. Choosing the parameter values for which the classifier performed the best and re-employing the classifier with the optimized hyperparameters
4. Evaluating the model using the appropriate performance measures
5. Deriving any valuable insights from the model
6. Wrapping up with concluding remarks, summarizing the findings
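
Steps 2 to 4 of the flow above can be sketched roughly as follows. This is only an illustration on a small synthetic dataset generated with `make_classification`; the names `X_demo`, `y_demo`, `param_grid` and `search` are placeholders, and the grid values are assumptions rather than the settings used later in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Small synthetic stand-in for the real data (assumption: binary target)
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Step 2: grid search with cross validation over candidate hyperparameter values
param_grid = {"kernel": ["linear"], "C": [0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X_tr, y_tr)

# Step 3: with the default refit=True, best_estimator_ is already refitted
# on the whole training set using the best parameter combination
best_model = search.best_estimator_

# Step 4: evaluate on the held-out test set (accuracy for a classifier)
test_accuracy = best_model.score(X_te, y_te)
print(search.best_params_, round(test_accuracy, 3))
```

Because `GridSearchCV` refits the best combination by default, a separate "re-employing" step is only needed if the final model should be configured differently from the search.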

In [1]:
# Importing the necessary data analysis and visualization toolkits
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# To display ALL the columns in the dataframes
pd.options.display.max_columns=None

# To display a considerable extent (first 100 characters) of the content of each column of the dataframes
pd.set_option('display.max_colwidth', 100)

# Filtering out potential warnings
import warnings
warnings.filterwarnings('ignore')

---
# Loading the Training and Test Datasets

SVC is a computationally expensive classifier, and our first run of it (fitted to the full datasets) had to be terminated after almost a day of running. To gain at least some insight from this model within a workable time frame, we opt for the PCA-transformed X train and test datasets instead. These still retain 80% of the variance in the data, so the insights gained from a model trained on them should not be unreasonably far from those derived from the full datasets.

Therefore, instead of the full X train and test sets (which include all features), we will load the PCA-transformed data.
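
The PCA transformation itself was done in Part 3 and is not repeated here. Purely as an illustration of what "keeping 80% of the variance" means, the sketch below applies scikit-learn's `PCA` to made-up data (`X_full` is hypothetical and is not the project dataset, and this is not necessarily how Part 3 configured it).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples, 50 correlated features built from 5 latent factors plus noise
latent = rng.normal(size=(200, 5))
X_full = latent @ rng.normal(size=(5, 50)) + 0.1 * rng.normal(size=(200, 50))

# A float n_components in (0, 1) keeps the fewest components whose cumulative
# explained variance ratio reaches at least that fraction
pca = PCA(n_components=0.8, svd_solver="full")
X_reduced = pca.fit_transform(X_full)

retained = pca.explained_variance_ratio_.sum()
print(X_full.shape, "->", X_reduced.shape, "variance retained:", round(retained, 3))
```

In a train/test setting, the usual practice is to fit the PCA on the training set only and then apply the same fitted transformation to the test set.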

In [None]:
# Importing the X_train, X_test, y_train, and y_test datasets from the respective pickle files into Pandas DataFrame forms
X_train = pd.read_pickle("data\\x_train.pkl")
X_test = pd.read_pickle("data\\x_test.pkl")
y_train = pd.read_pickle("data\\y_train.pkl")
y_test = pd.read_pickle("data\\y_test.pkl")

# Note: The paths above use a doubled backslash ("\\") rather than a single one ("\"), as otherwise Python raises the error below
# "SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 4-5: truncated \xXX escape"
# This is because "\x" starts a hexadecimal escape sequence that must be followed by exactly two hex digits
# In "\x_train", no hex digits follow the "\x", which makes the escape invalid and throws the error
# The "\y" paths are doubled as well for consistency, since "\y" is not a valid escape sequence either
# This error has been resolved based on https://stackoverflow.com/a/1347854
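
As a side note on the escape issue described in the comments above, the following is a minimal sketch of two other common ways to write such paths: raw strings and `pathlib`. The folder and file names mirror the ones used above, and nothing is read from disk.

```python
from pathlib import Path

# "\x" expects exactly two hex digits: "\x41" is the single character "A"
assert "\x41" == "A"

# Doubling the backslash yields one literal backslash in the resulting string
escaped = "data\\x_train.pkl"

# A raw string (r"...") switches escape processing off entirely
raw = r"data\x_train.pkl"

# Both spellings produce the identical string
same_string = escaped == raw

# pathlib joins path parts with the separator of the current operating system,
# so no backslashes need to be typed at all
portable = Path("data") / "x_train.pkl"
print(same_string, portable.name)
```

The `pathlib` form has the added benefit of working unchanged on Windows, macOS and Linux (including Google Colab).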

In [4]:
# Checking that the datasets loaded correctly, displaying the shapes
## Displaying the dataframes themselves take a lot of time, hence choosing to display their shapes
print(f"Shape of X_train dataset: {X_train.shape}")
print(f"Shape of X_test dataset: {X_test.shape}")
print(f"Shape of y_train dataset: {y_train.shape}")
print(f"Shape of y_test dataset: {y_test.shape}")

Shape of X_train dataset: (55248, 876)
Shape of X_test dataset: (13812, 876)
Shape of y_train dataset: (55248,)
Shape of y_test dataset: (13812,)


The shapes of the four sets match the outputs in Part 3 of the project. We can proceed with working them into our ML model in this part.

# Selection of Hyperparameters for Support Vector Classifier

The definition and overview of the Support Vector Machine (SVM) were discussed earlier in Part 4 of this project (see the Overview above). A full list and description of the hyperparameters we can tune in an SVM can be found in [its documentation](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).

We will consider setting / tuning the three most impactful and important hyperparameters for a Support Vector Classifier (SVC), as discussed below.


## Kernel Method

The Kernel Method (aka Kernel Trick) allows the classifier to handle data that is not linearly separable in its original n-dimensional space by mapping it to a higher-dimensional space where it is linearly separable. An example is given below: in the left picture, we cannot separate the red cluster of data in the middle using a linear boundary, but once the data points are transformed as in the second picture, we can use a plane to separate the red cluster from the green cluster of data:

<img src="https://miro.medium.com/max/1400/1*mCwnu5kXot6buL7jeIafqQ.png">

*(Further discussion on Kernel Trick, along with the image above, can be found in [this article on Medium](https://medium.com/@zxr.nju/what-is-the-kernel-trick-why-is-it-important-98a98db0961d).)*
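
The same idea can be sketched numerically. The example below uses made-up 2-D data (an inner cluster surrounded by a ring) and an explicit mapping, `lift`, that adds the squared distance from the origin as a third dimension. Both the data and the `lift` helper are hypothetical. A real kernel performs an equivalent computation implicitly rather than building the extra dimension.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-D data: an inner cluster (class 0) surrounded by a ring (class 1)
angles = rng.uniform(0, 2 * np.pi, size=100)
inner_radius = rng.uniform(0.0, 1.0, size=100)
outer_radius = rng.uniform(2.0, 3.0, size=100)
inner = np.column_stack([inner_radius * np.cos(angles), inner_radius * np.sin(angles)])
outer = np.column_stack([outer_radius * np.cos(angles), outer_radius * np.sin(angles)])


def lift(points):
    """Map (x1, x2) to (x1, x2, x1^2 + x2^2), adding a third dimension."""
    return np.column_stack([points, (points ** 2).sum(axis=1)])


inner_3d, outer_3d = lift(inner), lift(outer)

# In the lifted space, the flat plane z = 2.5 separates the two classes
plane_height = 2.5
separable = bool((inner_3d[:, 2] < plane_height).all() and (outer_3d[:, 2] > plane_height).all())
print("Linearly separable after lifting:", separable)
```

No straight line separates the two classes in the original two dimensions, yet a flat plane does once the third coordinate is added.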

The options for the `kernel` parameter include `linear` for a linear hyperplane (i.e., a line for two-dimensional data), and `rbf` and `poly` for non-linear decision boundaries.

[The `linear` kernel is generally recommended for text classification](https://www.svm-tutorial.com/2014/10/svm-linear-kernel-good-text-classification/), for the following reasons:

- Most text classification problems are linearly separable.
- The linear kernel works well when the data has a large number of features, which is most often the case in text classification problems as we vectorize the text data into hundreds / thousands of features.
- The linear kernel is faster than the other kernels when training an SVM model.
- With the linear kernel, we need to optimize only the penalty strength parameter, whereas with other kernels an additional parameter (gamma) needs to be optimized.

Considering our particular dataset, we will select the `linear` kernel.


## Penalty Strength

The `C` hyperparameter is a penalty term that determines how closely the model fits the training set. It controls the trade-off between a smooth decision boundary that captures the general trend in the data (gravitating towards underfitting) and a boundary that classifies every training point correctly (gravitating towards overfitting).

For large values of C, the optimization will choose a smaller-margin hyperplane if that hyperplane does a better job of getting all the training points classified correctly. Conversely, a very small value of C will cause the optimizer to look for a larger-margin separating hyperplane, even if that hyperplane misclassifies more points. For very tiny values of C, we can get misclassified examples, often even if our training data is linearly separable.

This acts in the same way as the `C` parameter of logistic regression, which we covered in Part 4(a) of this project.

For our model, we will attempt to cycle through a range of `C` values (e.g., `C = 0.01, 0.1, 1, 10, 100` etc.).
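
The effect of `C` on the margin can be seen in a small experiment. The sketch below uses made-up, overlapping 2-D classes (`X_c`, `y_c` are hypothetical, not the project data) and reports the margin width, 2 / ||w||, together with the number of support vectors for a few `C` values.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical overlapping 2-D classes centred at (-1, -1) and (1, 1)
X_c = np.vstack([rng.normal(-1.0, 1.0, size=(100, 2)), rng.normal(1.0, 1.0, size=(100, 2))])
y_c = np.array([0] * 100 + [1] * 100)

margins = {}
for C in [0.01, 1, 100]:
    model = SVC(kernel="linear", C=C).fit(X_c, y_c)
    # Margin width of a linear SVM is 2 / ||w||
    margins[C] = 2.0 / np.linalg.norm(model.coef_[0])
    print(f"C={C}: margin width={margins[C]:.3f}, support vectors={model.n_support_.sum()}")
```

A smaller `C` should yield a wider margin that tolerates more points inside it, while a larger `C` narrows the margin in an attempt to classify more training points correctly.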


## Gamma (γ)

The `gamma` parameter applies to the non-linear kernels (such as `rbf` and `poly`) and determines how far the influence of a single training example reaches. The higher the `gamma` value, the more the model will try to exactly fit the training data.

Considering we selected the `linear` kernel, which does not use the `gamma` parameter, we will disregard this parameter (unless we choose a non-linear kernel later on).
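
For reference, should a non-linear kernel be chosen later, the role of `gamma` in the `rbf` kernel can be illustrated with the kernel formula itself, exp(-gamma * ||x - z||^2). The helper `rbf_kernel_value` and the two points below are hypothetical and only serve the illustration.

```python
import math


def rbf_kernel_value(x, z, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance between x and z)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


point_a = (0.0, 0.0)
point_b = (1.0, 1.0)  # squared distance to point_a is 2

# The same pair of points looks less and less "similar" as gamma grows
for gamma in (0.1, 1.0, 10.0):
    print(f"gamma={gamma}: similarity={rbf_kernel_value(point_a, point_b, gamma):.6f}")
```

With a high `gamma`, only points very close to each other count as similar, so each training point influences a small neighbourhood and the decision boundary can bend tightly around the training data.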

# Employing Support Vector Classifier

We start by instantiating an SVC object for our modelling. Then we intend to set up the search that iterates through the two hyperparameters chosen for this model, in line with our [hyperparameter selection section above](#Selection-of-Hyperparameters-for-Support-Vector-Classifier).

Before attempting hyperparameter optimization though, we will perform a test run with set parameters to see how long the classifier takes to fit to the train data.

## Performing a Test Run with Set Parameters

For the test run, we will choose the linear kernel and a single `C` value. There is no well-established theory for picking a good starting value of `C`, so we will start with the common `C=1.0`, which is also the scikit-learn default.

In [None]:
# For the sake of consistency with the usual pathway, the test run of model fitting is done with GridSearchCV
# This should not matter, as we are setting each parameter to a single value only

# Importing GridSearchCV from scikit-learn library's model_selection module
from sklearn.model_selection import GridSearchCV

In [None]:
# Importing Support Vector Classifier (SVC) from scikit-learn library's SVM module
from sklearn.svm import SVC

# Instantiating an SVC object for our modelling (not passing any parameters yet as we will set them and iterate through later)
svc_test = SVC()

In [None]:
# Creating a dictionary for the hyperparameters to iterate through and optimize
## Performing a test run with set hyperparameter values
## Keeping the dictionary structure so it can be used later with more hyperparameter values to iterate through
parameters_test = {
    'kernel': ['linear'],
    'C': [1.0]
}

# Creating a GridSearchCV object around the SVC instantiated above (svc_test)
## Asking the GridSearchCV to iterate through the parameter values we set above
## Setting the cross validation fold to 3 considering how computationally expensive SVM already is
svc_gscv_test = GridSearchCV(svc_test, parameters_test, cv=3)

In [None]:
# Fitting to train dataset
svc_gscv_test.fit(X_train, y_train)

It took a staggering number of hours just to fit the model with a single pre-set combination of parameter values (about 8 hours on the local machine with 16 gigabytes of RAM, and about 4.5 hours on a high-powered machine on Google Colab Pro+ with 52 gigabytes of RAM). We can already rule this model out of consideration: when the best model is deployed to production, this one would be extremely slow to generate results.

In [None]:
svc_gscv_test.best_estimator_

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

Taking a quick look at the GridSearchCV results:

In [None]:
svc_gscv_test.best_score_

0.5002896032435563

In [None]:
pd.DataFrame(svc_gscv_test.cv_results_)

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_C | param_kernel | params | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2034.838971 | 1.048127 | 889.446101 | 1.163496 | 1 | linear | {'C': 1.0, 'kernel': 'linear'} | 0.495765 | 0.503095 | 0.502009 | 0.50029 | 0.00323 | 1 |


The `mean_test_score` above does not inspire much confidence either: with a typical `C` value, the mean score on the validation folds (constructed out of the training set) is only 50.03%.
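
As a quick cross-check of the reported figures, the sketch below recomputes the summary statistics from the per-fold numbers displayed in the `cv_results_` output above, using only the Python standard library. The variable names are placeholders, and the time estimate is a rough approximation that ignores any overhead.

```python
from statistics import mean, pstdev

# Per-fold validation scores copied from the cv_results_ output above
split_scores = [0.495765, 0.503095, 0.502009]

# mean_test_score is the plain average of the fold scores
mean_score = mean(split_scores)

# std_test_score corresponds to the population standard deviation of the fold scores
std_score = pstdev(split_scores)

# Rough cross-validation cost: (fit time + score time) per fold, times 3 folds
mean_fit_time, mean_score_time, n_folds = 2034.838971, 889.446101, 3
cv_hours = n_folds * (mean_fit_time + mean_score_time) / 3600

print(f"mean={mean_score:.5f}, std={std_score:.5f}, cv_hours={cv_hours:.2f}")
```

The recomputed mean and standard deviation agree with the `mean_test_score` and `std_test_score` columns. The time estimate covers the three cross-validation folds only and excludes the final refit on the whole training set.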

# Conclusion

Considering the extremely long runtime even on a high-powered machine, we can consider the SVC out of the race. It would not be practicable to deploy in production given the very lengthy runtime, and with a mean validation score of only about 50% it is nowhere near a good candidate for solving our business problem, considering the very high performance of the other ML models in the previous parts of this project.