<a href="https://colab.research.google.com/github/matthewpecsok/4482_fall_2022/blob/main/tutorials/4482_classification_SVC_titanic_cleaned.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Welcome to the Support Vector classification notebook. Please start with the Hyperparameter Tuning notebook if you have not already. 

## Setup

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns

from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix,\
 recall_score, precision_score, f1_score, accuracy_score, make_scorer,\
  precision_recall_fscore_support

from sklearn.model_selection import train_test_split, cross_validate
from sklearn.model_selection import GridSearchCV,RandomizedSearchCV

from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
import matplotlib.pyplot as plt
import random

import warnings
warnings.filterwarnings('ignore')

## Data

In [None]:
titanic_cleaned = pd.read_csv('https://raw.githubusercontent.com/matthewpecsok/4482_fall_2022/main/data/titanic_cleaned.csv').drop('Cabin', axis=1) # drop cabin

In [None]:
titanic_cleaned.head()

In [None]:
titanic_cleaned['Pclass'] = titanic_cleaned.Pclass.astype(str)

In [None]:
titanic_cleaned.info()

In [None]:
y = titanic_cleaned.pop('Survived')

In [None]:
X = pd.get_dummies(titanic_cleaned)
print(X.shape, y.shape)

In [None]:
X.head()

## SVC Modeling

### GridSearchCV

Exploring multiple SVC models using grid search

In [None]:
import numpy as np
np.arange(0,10000,200)

In [None]:
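from sklearn.model_selection import GridSearchCV
# 12 values of C x 3 kernels = 36 candidate models
parameters = {'C':[5,10,15,20,30,40,50,60,70,80,90,100],
              'kernel':['rbf','poly','sigmoid']
              }
svc = SVC(random_state=42)
# each candidate is scored by cross-validated F1 (default cv is 5-fold)
clf = GridSearchCV(svc, parameters, scoring='f1')
clf.fit(X, y)

# F1 of the refit best model on the same data it was trained on, so this is optimistic
clf.score(X, y)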


In [None]:
clf.best_estimator_

In [None]:
grid_search_df = pd.DataFrame(clf.cv_results_)
print(grid_search_df.shape) 
# sort by mean F1, best first; append .head() to show only the top five rows
grid_search_df.sort_values('mean_test_score', ascending=False)

Visualize the F1 score by hyperparameter C.

In [None]:
colors = {'sigmoid':'orange','rbf':'red', 'poly':'green' }


# x and y are passed by keyword; recent seaborn releases reject them as positional arguments
sns.lmplot(x='param_C', y='mean_test_score', data=grid_search_df, palette=colors, hue='param_kernel', fit_reg=False)
plt.title("All models: mean F1 score by C")
plt.show()

# Notice how the poly kernel is substantially worse than the rbf kernel.
# With the rbf kernel, the F1 score appears to peak at a C of 60.


In [None]:
# x and y are passed by keyword; recent seaborn releases reject them as positional arguments
sns.lmplot(x='param_C', y='rank_test_score', data=grid_search_df, palette=colors, hue='param_kernel', fit_reg=False)
plt.title("All models: grid search rank by C")
plt.show()

The plot above visualizes the grid search rank by C. In this case lower is better: rank 1 is the best-scoring model.
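A minimal sketch of how a rank column such as rank_test_score relates to the mean score. The scores below are made-up illustrative values, not results from this notebook, and ties are ignored for simplicity.

```python
# Hypothetical mean F1 scores per kernel (illustrative values only)
mean_f1 = {'rbf': 0.70, 'sigmoid': 0.45, 'poly': 0.30}

# Sort from highest to lowest score, then number the positions from 1
ordered = sorted(mean_f1, key=mean_f1.get, reverse=True)
ranks = {name: position for position, name in enumerate(ordered, start=1)}

print(ranks)  # {'rbf': 1, 'sigmoid': 2, 'poly': 3}
```

The highest score gets rank 1, which is why the best models sit at the bottom of a rank plot.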

In [None]:
pd.DataFrame(cross_validate(clf, 
               X,
               y,
               cv = 3,
               return_train_score=True,
               scoring=['accuracy','recall','precision','f1']))

To conclude, we evaluated 36 hyperparameter combinations (12 values of C times 3 kernels), and some models performed substantially better than others. Our best-performing model achieved an F1 score of 71.5%, compared with an F1 of 29.6% at the low end. That is a difference of 41.9 percentage points.
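The two numbers in this summary can be checked with a short sketch. It enumerates the same parameter grid using only the standard library (scikit-learn is not needed for the count), then computes the gap between the two F1 scores quoted above.

```python
from itertools import product

parameters = {
    'C': [5, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90, 100],
    'kernel': ['rbf', 'poly', 'sigmoid'],
}

# Every (C, kernel) pair that an exhaustive grid search would fit
combinations = [dict(zip(parameters, values))
                for values in product(*parameters.values())]
print(len(combinations))  # 36

# F1 scores quoted in the text, in percent
best_f1, low_f1 = 71.5, 29.6
gap_points = round(best_f1 - low_f1, 1)                     # absolute gap, in percentage points
relative_gain = round((best_f1 - low_f1) / low_f1 * 100, 1)  # gain relative to the lower score, in percent
print(gap_points, relative_gain)
```

The absolute gap is measured in percentage points. Relative to the lower score, the gain is considerably larger, so it helps to say which of the two is meant.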

What if you have potentially thousands of combinations? Running every model may not be practical. In that case a randomized search is a good strategy: randomly pick combinations from the full set of possible models and keep the best estimator found. For example, the decision tree model has a parameter ccp_alpha that is continuous, so it has infinitely many possible values. One cannot enumerate them all, so one must sample a subset instead, for example from a uniform distribution.
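As a rough sketch of the ccp_alpha case, RandomizedSearchCV accepts a scipy.stats distribution in place of a list and draws n_iter values from it. The synthetic data, the range [0, 0.05], and n_iter=20 below are assumptions chosen for illustration. They are not settings from this notebook.

```python
from scipy.stats import uniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; this is not the Titanic data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# uniform(loc, scale) is the uniform distribution on [loc, loc + scale], here [0, 0.05]
param_distributions = {'ccp_alpha': uniform(loc=0.0, scale=0.05)}

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_distributions,
    n_iter=20,        # number of sampled candidates (illustrative choice)
    scoring='f1',
    cv=3,
    random_state=42,  # makes the sampled values reproducible
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(len(search.cv_results_['params']))  # one entry per sampled candidate
```

Raising n_iter explores more of the range at the cost of more model fits, so it is the main knob for trading search quality against run time.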

In [None]:
parameters = {'C':[5,10,15,20,30,40,50,60,70,80,90,100],
              'kernel':['rbf','poly','sigmoid']
              }
svc = SVC(random_state=42)
# n_iter defaults to 10, so only 10 of the 36 combinations are sampled and fit
clf = RandomizedSearchCV(svc, parameters, scoring='f1')
clf.fit(X, y)

clf.score(X, y)

In [None]:
grid_search_df = pd.DataFrame(clf.cv_results_)
print(grid_search_df.shape) 
# sort by mean F1, best first; append .head() to show only the top five rows
grid_search_df.sort_values('mean_test_score', ascending=False)