## Support Vector Machines - Part 1

#### Table of Contents

- [Preliminaries](#Preliminaries)
- [Null Model](#Null-Model)
- [10% Correlation](#10%-Correlation)
- [5% Correlation](#5%-Correlation)
- [1% Correlation](#1%-Correlation)
- [Comparison](#Comparison)

***
# Preliminaries
[TOP](#Support-Vector-Machines---Part-1)

Here we have our usual setup.

However, this time we are going to compare feature sets chosen by their correlation with the label `pos_net_jobs`.
We will do so at three thresholds:

- 10%
- 5%
- 1%

As a result, the train-test split is postponed until after the features have been selected.

In [None]:
# utilities
import numpy as np
import pandas as pd

# processing
from sklearn.metrics import ConfusionMatrixDisplay  # plot_confusion_matrix was removed in scikit-learn 1.2
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

# algorithms
from sklearn.svm import LinearSVC

In [None]:
df = pd.read_pickle('C:/Users/johnj/Documents/Data/aml in econ 02 spring 2021/class data/class_data.pkl')
df_prepped = df.drop(columns = ['urate_bin', 'year']).join([
    pd.get_dummies(df['urate_bin'], drop_first = True),
    pd.get_dummies(df.year, drop_first = True)    
])
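
To see what `drop_first = True` does, here is a small sketch on a made-up categorical column (the category names are hypothetical and not taken from the class data). The first category in sorted order is dropped and acts as the reference group.

```python
import pandas as pd

# hypothetical categorical column; not the class data
toy = pd.Series(['low', 'mid', 'high', 'mid'], name = 'urate_bin')

# categories are sorted ('high', 'low', 'mid'); drop_first removes 'high'
dummies = pd.get_dummies(toy, drop_first = True)
dummy_columns = list(dummies.columns)
print(dummy_columns)
```

Dropping one category avoids a set of dummy columns that always sum to one.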

**********
# Null Model 
[TOP](#Support-Vector-Machines---Part-1)

In [None]:
y = df_prepped['pos_net_jobs'].astype(float)
y_train, y_test = train_test_split(y, train_size = 2/3, random_state = 490)

In [None]:
yhat_null = y_train.value_counts().index[0]
acc_null = np.mean(y_test == yhat_null)
acc_null
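
As a minimal sketch of what the null model does, the example below uses made-up labels (not the class data): it predicts the most common training label for every test observation and reports the share of test labels that match. The `_toy` names are illustrative only.

```python
import numpy as np
import pandas as pd

# made-up labels for illustration only
y_train_toy = pd.Series([1.0, 1.0, 1.0, 0.0, 0.0])
y_test_toy = pd.Series([1.0, 0.0, 1.0, 1.0])

# the null model always predicts the most common training label
yhat_null_toy = y_train_toy.value_counts().idxmax()

# accuracy is the share of test labels equal to that constant prediction
acc_null_toy = np.mean(y_test_toy == yhat_null_toy)
print(yhat_null_toy, acc_null_toy)
```

Here three of the four test labels equal the majority class, so the null accuracy is 0.75.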

*****
# 10% Correlation
[TOP](#Support-Vector-Machines---Part-1)

First, let's produce a correlation matrix with the DataFrame method `.corr()`.

In [None]:
df_prepped.corr()

This is far too much information.
We really only want the values for `pos_net_jobs`.

Remember that Python is zero-indexed, so `.iloc[:, 1]` selects the second column, which is where `pos_net_jobs` sits in this data.

In [None]:
df_prepped.corr().iloc[:, 1]

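Now we are going to select the variables that have at least a 10% correlation with our label.
Specifically, we want the absolute value of the correlation to be weakly greater than 0.10.
The label is perfectly correlated with itself, so it is selected too; we drop it later when building `x`.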

In [None]:
pos_net_job_cor = np.abs(df_prepped.corr().iloc[:, 1])
vrbls = pos_net_job_cor[pos_net_job_cor >= 0.10].index
vrbls
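
Selecting the label's column by position depends on the column order. As a hedged alternative, the sketch below selects it by name on a small made-up frame (the column names `label`, `strong`, and `noise` are hypothetical) and then applies the same absolute-correlation filter.

```python
import pandas as pd

# small made-up frame: 'strong' tracks the label, 'noise' does not
toy = pd.DataFrame({
    'label':  [0, 0, 1, 1, 0, 0, 1, 1],
    'strong': [0.1, 0.2, 0.9, 1.1, 0.0, 0.3, 1.0, 0.8],
    'noise':  [0, 1, 0, 1, 0, 1, 0, 1]
})

# pick the label's correlations by column name instead of by position
label_cor = toy.corr()['label'].abs()

# keep variables whose absolute correlation is at least 0.10
selected = label_cor[label_cor >= 0.10].index
print(list(selected))
```

The label always passes the filter because its correlation with itself is 1, while `noise` is uncorrelated with the label by construction and is dropped.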

Neat.

Now we can select the variables that we want.

In [None]:
df_prepped2 = df_prepped.loc[:, vrbls]

In [None]:
y = df_prepped2['pos_net_jobs'].astype(float)
x = df_prepped2.drop(columns = 'pos_net_jobs')

x_train, x_test, y_train, y_test = train_test_split(x, y, train_size = 2/3, random_state = 490)

ss = StandardScaler()
x_train_std = pd.DataFrame(ss.fit(x_train).transform(x_train),
                           columns = x_train.columns,
                           index = x_train.index)

# reuse the scaler fitted on the training data; do not refit on the test set
x_test_std = pd.DataFrame(ss.transform(x_test),
                          columns = x_test.columns, 
                          index = x_test.index)
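
A common convention is to estimate the mean and standard deviation on the training data only and reuse them for the test data, so both sets are on the same scale. The sketch below shows this on a made-up single feature; the `_toy` names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# made-up single-feature data for illustration
x_train_toy = np.array([[0.0], [2.0], [4.0]])
x_test_toy = np.array([[2.0], [6.0]])

# learn the mean and standard deviation from the training data only
ss_toy = StandardScaler().fit(x_train_toy)

# apply those same training statistics to both sets
x_train_toy_std = ss_toy.transform(x_train_toy)
x_test_toy_std = ss_toy.transform(x_test_toy)

print(ss_toy.mean_, ss_toy.scale_)
print(x_test_toy_std.ravel())
```

The standardized test set need not have mean zero, because it is centered on the training mean rather than its own.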

Now let's use cross-validation to find the optimal value of `C`.

In [None]:
%%time
param_grid = {
    'C': 10.0**np.linspace(-5, 2, num = 20)
}

svc_cv = LinearSVC(dual = False)

grid_search = GridSearchCV(svc_cv, param_grid, 
                          cv = 5,
                          scoring = 'accuracy')
grid_search.fit(x_train_std, y_train)
best_10 = grid_search.best_params_
best_10

Alternatively, fixed arguments such as `dual` can go in the parameter grid as single-element lists:

In [None]:
%%time
param_grid = { # each value must be a list or NumPy array
    'C': 10.0**np.linspace(-5, 2, num = 20),
    'dual': [False]
}

svc_cv = LinearSVC()

grid_search = GridSearchCV(svc_cv, param_grid, 
                          cv = 5,
                          scoring = 'accuracy')
grid_search.fit(x_train_std, y_train)
best_10 = grid_search.best_params_
best_10
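
To look beyond `best_params_`, the fitted search object also stores the score of every candidate in `cv_results_`. The sketch below runs a smaller grid on synthetic data from `make_classification` (a stand-in for the standardized training set, not the class data) and tabulates the mean cross-validated accuracy per value of `C`. It also checks that `np.logspace` produces the same grid as `10.0**np.linspace`.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# synthetic data standing in for the standardized training set
x_toy, y_toy = make_classification(n_samples = 300, n_features = 5, random_state = 490)

# np.logspace gives the same grid as 10.0**np.linspace
c_grid = np.logspace(-5, 2, num = 8)
same_grid = np.allclose(c_grid, 10.0**np.linspace(-5, 2, num = 8))

gs_toy = GridSearchCV(LinearSVC(dual = False), {'C': c_grid},
                      cv = 5,
                      scoring = 'accuracy')
gs_toy.fit(x_toy, y_toy)

# one row per candidate C with its mean cross-validated accuracy
cv_table = pd.DataFrame(gs_toy.cv_results_)[['param_C', 'mean_test_score', 'rank_test_score']]
print(cv_table)
print(gs_toy.best_params_, gs_toy.best_score_)
```

A table like this shows whether accuracy is flat across a range of `C` values, in which case the exact optimum matters less.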

Now we refit the model on the full training data using the optimal value of `C` and find its accuracy on the testing data.

In [None]:
svc_tuned_10 = LinearSVC(C = best_10['C'], dual = False)
acc_tuned_10 = svc_tuned_10.fit(x_train_std, y_train).score(x_test_std, y_test)
acc_tuned_10
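
Note that `GridSearchCV` defaults to `refit = True`, which means it already refits the best candidate on all of the data passed to `.fit()` and exposes it as `best_estimator_`. The sketch below, on synthetic data (not the class data), compares a manual refit with that stored estimator; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import LinearSVC

# synthetic data standing in for the prepared features and label
x_toy, y_toy = make_classification(n_samples = 300, n_features = 5, random_state = 490)
x_tr, x_te, y_tr, y_te = train_test_split(x_toy, y_toy, train_size = 2/3, random_state = 490)

gs_toy = GridSearchCV(LinearSVC(dual = False), {'C': np.logspace(-3, 1, num = 5)},
                      cv = 5,
                      scoring = 'accuracy')
gs_toy.fit(x_tr, y_tr)

# manual refit with the best C, as in the cell above
acc_manual = LinearSVC(C = gs_toy.best_params_['C'], dual = False).fit(x_tr, y_tr).score(x_te, y_te)

# the estimator that GridSearchCV refit on the full training data
acc_refit = gs_toy.best_estimator_.score(x_te, y_te)
print(acc_manual, acc_refit)
```

With `dual = False` the solver is not random, so the two accuracies should agree.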

*****
# 5% Correlation
[TOP](#Support-Vector-Machines---Part-1)

Let's do the same thing with a 5% threshold, again weakly greater than.

In [None]:
pos_net_job_cor = np.abs(df_prepped.corr().iloc[:, 1])
vrbls = pos_net_job_cor[pos_net_job_cor >= 0.05].index
df_prepped2 = df_prepped.loc[:, vrbls]

In [None]:
y = df_prepped2['pos_net_jobs'].astype(float)
x = df_prepped2.drop(columns = 'pos_net_jobs')

x_train, x_test, y_train, y_test = train_test_split(x, y, train_size = 2/3, random_state = 490)

ss = StandardScaler()
x_train_std = pd.DataFrame(ss.fit(x_train).transform(x_train),
                           columns = x_train.columns,
                           index = x_train.index)

# reuse the scaler fitted on the training data; do not refit on the test set
x_test_std = pd.DataFrame(ss.transform(x_test),
                          columns = x_test.columns, 
                          index = x_test.index)

Now let's use cross-validation to find the optimal value of `C`.

In [None]:
%%time
param_grid = {
    'C': 10.0**np.linspace(-5, 2, num = 20)
}

svc_cv = LinearSVC(dual = False)

grid_search = GridSearchCV(svc_cv, param_grid, 
                          cv = 5,
                          scoring = 'accuracy')
grid_search.fit(x_train_std, y_train)
best_5 = grid_search.best_params_
best_5

Now we refit the model on the full training data using the optimal value of `C` and find its accuracy on the testing data.

In [None]:
svc_tuned_5 = LinearSVC(C = best_5['C'], dual = False)
acc_tuned_5 = svc_tuned_5.fit(x_train_std, y_train).score(x_test_std, y_test)
acc_tuned_5
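
Each threshold repeats the same steps: select, split, standardize, tune, and score. One way to avoid copying cells is to wrap the steps in a function. The sketch below is only an illustration: `fit_at_threshold` is a hypothetical helper, the data are synthetic, and the scaler is fitted on the training data and reused for the test data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

def fit_at_threshold(data, label, threshold, c_grid):
    """Select features by |correlation| with the label, then tune and score a LinearSVC."""
    label_cor = data.corr()[label].abs()
    vrbls = label_cor[label_cor >= threshold].index

    y = data[label].astype(float)
    x = data.loc[:, vrbls].drop(columns = label)
    x_train, x_test, y_train, y_test = train_test_split(x, y, train_size = 2/3, random_state = 490)

    # fit the scaler on the training data and reuse it for the test data
    ss = StandardScaler().fit(x_train)
    x_train_std = ss.transform(x_train)
    x_test_std = ss.transform(x_test)

    gs = GridSearchCV(LinearSVC(dual = False), {'C': c_grid}, cv = 5, scoring = 'accuracy')
    gs.fit(x_train_std, y_train)
    return {'n_features': x.shape[1],
            'best_C': gs.best_params_['C'],
            'accuracy': gs.score(x_test_std, y_test)}

# synthetic stand-in for the prepared data: one informative and two noise columns
rng = np.random.default_rng(490)
n = 600
signal = rng.normal(size = n)
toy = pd.DataFrame({
    'label': (signal + rng.normal(scale = 0.5, size = n) > 0).astype(int),
    'signal': signal,
    'noise_a': rng.normal(size = n),
    'noise_b': rng.normal(size = n)
})

results = {t: fit_at_threshold(toy, 'label', t, np.logspace(-3, 1, num = 5))
           for t in [0.10, 0.05, 0.01]}
print(pd.DataFrame(results).T)
```

Lower thresholds can only add features, so `n_features` never decreases as the threshold falls.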

*****
# 1% Correlation
[TOP](#Support-Vector-Machines---Part-1)

Let's do the same thing with a 1% threshold, again weakly greater than.

In [None]:
pos_net_job_cor = np.abs(df_prepped.corr().iloc[:, 1])
vrbls = pos_net_job_cor[pos_net_job_cor >= 0.01].index
df_prepped2 = df_prepped.loc[:, vrbls]

In [None]:
y = df_prepped2['pos_net_jobs'].astype(float)
x = df_prepped2.drop(columns = 'pos_net_jobs')

x_train, x_test, y_train, y_test = train_test_split(x, y, train_size = 2/3, random_state = 490)

ss = StandardScaler()
x_train_std = pd.DataFrame(ss.fit(x_train).transform(x_train),
                           columns = x_train.columns,
                           index = x_train.index)

# reuse the scaler fitted on the training data; do not refit on the test set
x_test_std = pd.DataFrame(ss.transform(x_test),
                          columns = x_test.columns, 
                          index = x_test.index)

Now let's use cross-validation to find the optimal value of `C`.

In [None]:
%%time
param_grid = {
    'C': 10.0**np.linspace(-5, 2, num = 20)
}

svc_cv = LinearSVC(dual = False)

grid_search = GridSearchCV(svc_cv, param_grid, 
                          cv = 5,
                          scoring = 'accuracy')
grid_search.fit(x_train_std, y_train)
best_1 = grid_search.best_params_
best_1

Now we refit the model on the full training data using the optimal value of `C` and find its accuracy on the testing data.

In [None]:
svc_tuned_1 = LinearSVC(C = best_1['C'], dual = False)
acc_tuned_1 = svc_tuned_1.fit(x_train_std, y_train).score(x_test_std, y_test)
acc_tuned_1

********************
# Comparison 
[TOP](#Support-Vector-Machines---Part-1)

Print the percent improvement in accuracy over the null model for each of the three models.
Which model was the best performer?

In [None]:
pct_10 = 100*(acc_tuned_10 - acc_null)/acc_null
pct_5  = 100*(acc_tuned_5  - acc_null)/acc_null
pct_1  = 100*(acc_tuned_1  - acc_null)/acc_null

print('10% Corr. Accuracy Improvement: {0: .2f}'.format(pct_10))
print('5% Corr. Accuracy Improvement: {0: .2f}'.format(pct_5))
print('1% Corr. Accuracy Improvement: {0: .2f}'.format(pct_1))
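
As a worked example of the formula, the sketch below uses made-up accuracies (not results from the class data). A model with accuracy 0.75 against a null accuracy of 0.60 is a 25% relative improvement, even though the gap is only 15 percentage points. `pct_improvement` is a hypothetical helper name.

```python
def pct_improvement(acc_model, acc_baseline):
    """Percent improvement of a model's accuracy relative to a baseline accuracy."""
    return 100*(acc_model - acc_baseline)/acc_baseline

# made-up accuracies for illustration; not results from the class data
example = pct_improvement(0.75, 0.60)
print('Accuracy Improvement: {0:.2f}%'.format(example))
```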

Print the optimal value of `C` for each model. 
Which model has the least amount of regularization?

In [None]:
print(best_10['C'])
print(best_5['C'])
print(best_1['C'])
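
When reading these values, recall that in `LinearSVC` the strength of regularization is inversely proportional to `C`: a larger `C` means less regularization. The sketch below shows this on synthetic data (not the class data) by fitting three values of `C` and comparing the size of the coefficient vector.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# synthetic data for illustration only
x_toy, y_toy = make_classification(n_samples = 300, n_features = 5, random_state = 490)

# fit the same model at three values of C and record the coefficient norm
coef_norms = {}
for c in [1e-5, 1e-2, 1e2]:
    svc = LinearSVC(C = c, dual = False).fit(x_toy, y_toy)
    coef_norms[c] = np.linalg.norm(svc.coef_)

print(coef_norms)
```

The smallest `C` shrinks the coefficients toward zero the most, which is the heaviest regularization.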