# Table of Contents (Code Part 5 of 6)

1. [Libraries](#libraries)
2. [Support Vector Classifier](#support-vector-classifier)
3. [SVC with Countvectorizer](#svc-with-countvectorizer)
4. [SVC with TFIDF Vectorizer](#svc-with-tfidf-vectorizer)
5. [SVC TFIDFVectorizer Optimization](#svc-tfidfvectorizer-optimization)

# Libraries

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.svm import SVC
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 100)

In [88]:
# Read allposts csv file
allposts = pd.read_csv('../Data/allposts.csv')

# Support Vector Classifier

## SVC with Countvectorizer

In the Model Selection section, we used the default hyperparameters and only the CountVectorizer transformer. In this section, we try different hyperparameters as well as the TF-IDF transformer to determine which gives the better CV score and test accuracy.
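To make the difference between the two transformers concrete, here is a minimal pure-Python sketch on a toy corpus. It assumes the smoothed IDF formula and L2 row normalization that scikit-learn documents as the `TfidfVectorizer` defaults, and it tokenizes on whitespace only, which is a simplification. The helper names (`count_vectors`, `tfidf_vectors`) and the toy documents are illustrative and not part of this project.

```python
import math
from collections import Counter

def count_vectors(docs):
    """Return (vocabulary, rows of raw term counts), i.e. a bag-of-words matrix."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[term] for term in vocab])
    return vocab, rows

def tfidf_vectors(count_rows):
    """Reweight counts by smoothed IDF, then L2-normalize each row."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency: number of documents containing each term
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    out = []
    for row in count_rows:
        weighted = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted)) or 1.0
        out.append([v / norm for v in weighted])
    return out

# Hypothetical toy corpus: 'cat' appears in two documents, 'sat' in only one
toy_docs = ["cat sat mat", "cat ate fish", "dog ate bone"]
vocab, counts = count_vectors(toy_docs)
tfidf = tfidf_vectors(counts)

i_cat, i_sat = vocab.index("cat"), vocab.index("sat")
print(counts[0][i_cat], counts[0][i_sat])        # raw counts are equal
print(tfidf[0][i_cat] < tfidf[0][i_sat])         # TF-IDF down-weights the more common term
```

With raw counts, both words get the same weight in the first document. With TF-IDF, the word that occurs in fewer documents gets the larger weight, which is the behavior this section compares against plain counts.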

In [None]:
# Initialize X and y
X = allposts['stemmatized_tokenized_combinedtext']
y = allposts['subreddit']

# Perform train-test-split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=12)

In [None]:
# Set up pipeline with countvectorizer and support vector classifier
pipe_svc_c = Pipeline([
    ('cvec', CountVectorizer(stop_words= 'english')),
    ('svc', SVC())
])

In [None]:
# Pipeline parameters
pipe_svc_c_params = {
    'cvec__max_df': [0.9, 1.0],
    'svc__gamma': ['scale', 'auto'],
    'svc__kernel': ['poly','rbf']
}

In [None]:
# Perform Gridsearch with 3 fold cross validation
gs_svc_c = GridSearchCV(pipe_svc_c, pipe_svc_c_params, cv=3, verbose=10)
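As a quick sanity check on the cost of this search, the number of fits can be worked out from the grid alone. The sketch below enumerates the combinations in `pipe_svc_c_params` with the standard library; the variable names `candidates` and `n_fits` are illustrative.

```python
from itertools import product

# Same grid as pipe_svc_c_params
param_grid = {
    'cvec__max_df': [0.9, 1.0],
    'svc__gamma': ['scale', 'auto'],
    'svc__kernel': ['poly', 'rbf'],
}
n_folds = 3

# Every combination of one value per parameter is one candidate
names = sorted(param_grid)
candidates = [
    dict(zip(names, values))
    for values in product(*(param_grid[name] for name in names))
]
n_fits = len(candidates) * n_folds
print(f'{len(candidates)} candidates x {n_folds} folds = {n_fits} fits')
```

This gives 8 candidates and 24 cross-validation fits. With the default `refit=True`, `GridSearchCV` additionally refits the best candidate once on the full training set.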

In [None]:
# Train the model
gs_svc_c.fit(X_train, y_train);

Fitting 3 folds for each of 8 candidates, totalling 24 fits
[CV 1/3; 1/8] START cvec__max_df=0.9, svc__gamma=scale, svc__kernel=poly........
[CV 1/3; 1/8] END cvec__max_df=0.9, svc__gamma=scale, svc__kernel=poly;, score=0.676 total time=  11.4s
[CV 2/3; 1/8] START cvec__max_df=0.9, svc__gamma=scale, svc__kernel=poly........
[CV 2/3; 1/8] END cvec__max_df=0.9, svc__gamma=scale, svc__kernel=poly;, score=0.666 total time=  12.4s
[CV 3/3; 1/8] START cvec__max_df=0.9, svc__gamma=scale, svc__kernel=poly........
[CV 3/3; 1/8] END cvec__max_df=0.9, svc__gamma=scale, svc__kernel=poly;, score=0.651 total time=  15.4s
[CV 1/3; 2/8] START cvec__max_df=0.9, svc__gamma=scale, svc__kernel=rbf.........
[CV 1/3; 2/8] END cvec__max_df=0.9, svc__gamma=scale, svc__kernel=rbf;, score=0.847 total time=  11.0s
[CV 2/3; 2/8] START cvec__max_df=0.9, svc__gamma=scale, svc__kernel=rbf.........
[CV 2/3; 2/8] END cvec__max_df=0.9, svc__gamma=scale, svc__kernel=rbf;, score=0.853 total time=   9.9s
[CV 3/3; 2/8] STA

In [None]:
# Gridsearch CV best score
gs_svc_c.best_score_

0.8521254329330364

In [None]:
# Gridsearch best parameters
gs_svc_c.best_params_

{'cvec__max_df': 0.9, 'svc__gamma': 'scale', 'svc__kernel': 'rbf'}

The best max_df for CountVectorizer is 0.9 (the default is 1.0), while the best gamma ('scale') and kernel ('rbf') are the same as the SVC defaults.
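One way to see which selected values differ from the defaults is to diff the two dictionaries. The defaults below are hard-coded from the scikit-learn documentation, assuming a version where `gamma='scale'` is the `SVC` default (0.22 or later); on a live install they could instead be read with `get_params()`.

```python
# Documented defaults (assumption: scikit-learn >= 0.22)
documented_defaults = {
    'cvec__max_df': 1.0,
    'svc__gamma': 'scale',
    'svc__kernel': 'rbf',
}

# Value reported by gs_svc_c.best_params_ above
best_params = {'cvec__max_df': 0.9, 'svc__gamma': 'scale', 'svc__kernel': 'rbf'}

# Keep only the parameters whose best value differs from the default
changed = {
    name: (documented_defaults[name], value)
    for name, value in best_params.items()
    if documented_defaults[name] != value
}
print(changed)  # maps parameter name -> (default, selected)
```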

In [None]:
# Train score
print(f'train score: {gs_svc_c.score(X_train, y_train)}')
# Test score
print(f'test score: {gs_svc_c.score(X_test, y_test)}')

train score: 0.9165
test score: 0.8555


The model is slightly overfitting: the train score (0.9165) is about 0.06 above the test score (0.8555). Let's see whether the TF-IDF vectorizer gives a better result.

## SVC with TFIDF Vectorizer

Support vector classifier with the TF-IDF vectorizer transformer, applied to the same stemmatized, tokenized text.

In [None]:
# Set up pipeline with TFIDF Vectorizer and support vector classifier
pipe_svc_t = Pipeline([
    ('tvec', TfidfVectorizer(stop_words='english')),
    ('svc', SVC())
])

In [None]:
# Pipeline parameters
pipe_svc_t_params = {
    'tvec__max_df': [0.9,1.0],
    'svc__gamma': ['scale', 'auto'],
    'svc__kernel': ['poly','rbf']
}

In [None]:
# Perform Gridsearch with 3 fold cross validation
gs_svc_t = GridSearchCV(pipe_svc_t, pipe_svc_t_params, cv=3, verbose=10)

In [None]:
# Train the model
gs_svc_t.fit(X_train, y_train);

Fitting 3 folds for each of 8 candidates, totalling 24 fits
[CV 1/3; 1/8] START svc__gamma=scale, svc__kernel=poly, tvec__max_df=0.9........
[CV 1/3; 1/8] END svc__gamma=scale, svc__kernel=poly, tvec__max_df=0.9;, score=0.822 total time=  17.7s
[CV 2/3; 1/8] START svc__gamma=scale, svc__kernel=poly, tvec__max_df=0.9........
[CV 2/3; 1/8] END svc__gamma=scale, svc__kernel=poly, tvec__max_df=0.9;, score=0.814 total time=  18.0s
[CV 3/3; 1/8] START svc__gamma=scale, svc__kernel=poly, tvec__max_df=0.9........
[CV 3/3; 1/8] END svc__gamma=scale, svc__kernel=poly, tvec__max_df=0.9;, score=0.821 total time=  19.0s
[CV 1/3; 2/8] START svc__gamma=scale, svc__kernel=poly, tvec__max_df=1.0........
[CV 1/3; 2/8] END svc__gamma=scale, svc__kernel=poly, tvec__max_df=1.0;, score=0.822 total time=  18.5s
[CV 2/3; 2/8] START svc__gamma=scale, svc__kernel=poly, tvec__max_df=1.0........
[CV 2/3; 2/8] END svc__gamma=scale, svc__kernel=poly, tvec__max_df=1.0;, score=0.814 total time=  18.1s
[CV 3/3; 2/8] S

In [None]:
# Gridsearch CV best score
gs_svc_t.best_score_

0.8617496518861625

In [None]:
# Gridsearch CV best parameters
gs_svc_t.best_params_

{'svc__gamma': 'scale', 'svc__kernel': 'rbf', 'tvec__max_df': 0.9}

As with CountVectorizer, the grid search with the TF-IDF vectorizer selects max_df = 0.9, which differs from the default of 1.0; the best gamma ('scale') and kernel ('rbf') are again the SVC defaults.

In [None]:
# Train score
print(f'train score: {gs_svc_t.score(X_train, y_train)}')
# Test score
print(f'test score: {gs_svc_t.score(X_test, y_test)}')

train score: 0.9755
test score: 0.8615


The CV score (0.8617 vs 0.8521) and the test score (0.8615 vs 0.8555) are both slightly better than with CountVectorizer, although the gap between train and test scores is larger. We will use the TF-IDF vectorizer for further optimization to reduce overfitting.

## SVC TFIDFVectorizer Optimization

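Since the model is overfitting, we will try stronger regularization. In SVC, the regularization strength is inversely proportional to C, so we search values of C below the default of 1.0.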
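The role of C can be illustrated with the soft-margin objective, 0.5 * ||w||^2 + C * (sum of hinge losses). The sketch below evaluates it for a fixed linear decision function on made-up 2-D points. This is a simplification, since the SVC in this notebook uses an RBF kernel, but C weights the margin violations in the same way. The function name `soft_margin_objective` and the toy data are hypothetical.

```python
def soft_margin_objective(w, b, X, y, C):
    """Soft-margin SVM objective: 0.5 * ||w||^2 + C * sum of hinge losses."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    hinge_losses = [
        max(0.0, 1.0 - yi * (sum(wi * xi for wi, xi in zip(w, x)) + b))
        for x, yi in zip(X, y)
    ]
    return margin_term + C * sum(hinge_losses)

# Made-up data: two points lie inside the margin (|x1| = 0.5 < 1)
X_toy = [[2.0, 0.0], [0.5, 0.0], [-2.0, 0.0], [-0.5, 0.0]]
y_toy = [1, 1, -1, -1]
w, b = [1.0, 0.0], 0.0

# Same C values as the grid searched in this section
objectives = {C: soft_margin_objective(w, b, X_toy, y_toy, C) for C in (0.85, 0.9, 0.95)}
for C, value in objectives.items():
    print(f'C={C}: objective={value:.2f}')
```

A smaller C makes margin violations cheaper relative to the 0.5 * ||w||^2 term, so the optimizer tolerates more training errors in exchange for a wider margin, which is the regularizing effect being tuned here.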

In [None]:
# TfidfVectorizer with max_df=0.9, the best value found by the grid search above.
# Note: unlike the pipeline above, stop_words='english' is not set here.
tvec = TfidfVectorizer(max_df=0.9)

X_train_c = tvec.fit_transform(X_train)
X_test_c = tvec.transform(X_test)

In [None]:
# Instantiate Support Vector Classifier
svc_t_opt = SVC(random_state=12)

In [82]:
# Set different regularization strengths in new parameters
svc_t_opt_params = {
    'C': [0.85,0.9,0.95]
}

In [83]:
# Perform Gridsearch with 5 fold cross validation
gs_svc_t_opt = GridSearchCV(svc_t_opt, svc_t_opt_params, cv=5, verbose=10)

In [84]:
# Train the model with new parameters
gs_svc_t_opt.fit(X_train_c, y_train)

Fitting 5 folds for each of 3 candidates, totalling 15 fits
[CV 1/5; 1/3] START C=0.85......................................................
[CV 1/5; 1/3] END .......................C=0.85;, score=0.868 total time=  19.3s
[CV 2/5; 1/3] START C=0.85......................................................
[CV 2/5; 1/3] END .......................C=0.85;, score=0.868 total time=  19.9s
[CV 3/5; 1/3] START C=0.85......................................................
[CV 3/5; 1/3] END .......................C=0.85;, score=0.871 total time=  20.2s
[CV 4/5; 1/3] START C=0.85......................................................
[CV 4/5; 1/3] END .......................C=0.85;, score=0.876 total time=  20.3s
[CV 5/5; 1/3] START C=0.85......................................................
[CV 5/5; 1/3] END .......................C=0.85;, score=0.856 total time=  19.7s
[CV 1/5; 2/3] START C=0.9.......................................................
[CV 1/5; 2/3] END ........................C=0.9;,

In [85]:
# Gridsearch CV best score
gs_svc_t_opt.best_score_

0.868375

In [86]:
# Gridsearch CV best parameters
gs_svc_t_opt.best_params_

{'C': 0.9}

In [87]:
# Train score
print(f'train score: {gs_svc_t_opt.score(X_train_c, y_train)}')
# Test score
print(f'test score: {gs_svc_t_opt.score(X_test_c, y_test)}')

train score: 0.968
test score: 0.861


After optimization, the CV score improved (0.8684 vs 0.8617), but the test score is essentially unchanged (0.861) and the model still has high variance: the train score (0.968) remains well above the test score. In this respect, SVC with CountVectorizer might even be preferable, since its gap between train and test scores is smaller.
However, Random Forest overfits much less. We will select the random forest classifier for further tuning.
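The overfitting comparison can be made explicit by computing the train-test gap from the scores printed in this notebook. The dictionary below copies those reported values; the labels are only for display.

```python
# (train score, test score) as printed earlier in this notebook
reported_scores = {
    'CountVectorizer + SVC': (0.9165, 0.8555),
    'TfidfVectorizer + SVC': (0.9755, 0.8615),
    'TfidfVectorizer + SVC (C=0.9)': (0.968, 0.861),
}

# A larger gap between train and test accuracy indicates more overfitting
gaps = {
    name: round(train - test, 4)
    for name, (train, test) in reported_scores.items()
}
for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f'{name}: train-test gap = {gap:.4f}')
```

Lowering C to 0.9 narrows the TF-IDF gap only slightly, and the CountVectorizer model has the smallest gap of the three.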

In [89]:
# SVC model with the best parameters found above.
# C=0.9 was tuned on TF-IDF features, so the TF-IDF vectorizer is used here.
tvec_best = TfidfVectorizer(max_df=0.9)
svc_best = SVC(C=0.9, random_state=12)