# LSG Assignment

We already did some hyper-parameter tuning in previous lectures, but we were a little loose about how we did it: (1) we didn't use validation data like we should have, and (2) we had to write a lot of custom code to collect results. If we try only a few different models, we can get away with being a little sloppy, but now we're going to do things right. Since hyper-parameter tuning is a common ML task, it should be no surprise that `sklearn` has functionality to help us with it. In this assignment, we are going to use it to try different combinations of hyper-parameters for the SVM classifier we trained in the lecture.

In [1]:
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.svm import SVC
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

from sklearn.model_selection import GridSearchCV

Let's load and pre-process the data in the same way we did in the lecture.

Note: 1/16 of the data was used to debug and 1/4 of the data to train.

In [2]:
bank = pd.read_csv("../data/bank-full-1.csv", sep = ";")

# bank is too large for a grid search assignment.  We will speed up calculations by reducing the data set to 1/4
OldShape = bank.shape
bank = bank.sample(int(bank.shape[0]/4)) #changed to 16 while debugging
print("Dataset size is reduced from ", OldShape, " to ", bank.shape)

num_cols = bank.select_dtypes(['integer', 'float']).columns
cat_cols = bank.select_dtypes(['object']).drop(columns = "y").columns

X_train, X_test, y_train, y_test = train_test_split(bank.drop(columns = "y"), bank["y"], 
                                                    test_size = 0.10, random_state = 42)

X_train = X_train.reset_index(drop = True)
X_test = X_test.reset_index(drop = True)

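# Note: this cell was run on an older scikit-learn. From version 1.2 on, the equivalents
# are OneHotEncoder(sparse_output = False) and onehoter.get_feature_names_out(cat_cols).
onehoter = OneHotEncoder(sparse = False)
onehoter.fit(X_train[cat_cols])
onehot_cols = onehoter.get_feature_names(cat_cols)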
X_train_onehot = pd.DataFrame(onehoter.transform(X_train[cat_cols]), columns = onehot_cols)
X_test_onehot = pd.DataFrame(onehoter.transform(X_test[cat_cols]), columns = onehot_cols)

znormalizer = StandardScaler()
znormalizer.fit(X_train[num_cols])
X_train_norm = pd.DataFrame(znormalizer.transform(X_train[num_cols]), columns = num_cols)
X_test_norm = pd.DataFrame(znormalizer.transform(X_test[num_cols]), columns = num_cols)

X_train_featurized = X_train_onehot # add one-hot-encoded columns
X_test_featurized = X_test_onehot   # add one-hot-encoded columns
X_train_featurized[num_cols] = X_train_norm # add numeric columns
X_test_featurized[num_cols] = X_test_norm   # add numeric columns

del X_train_norm, X_test_norm, X_train_onehot, X_test_onehot

print("Featurized training data has {} rows and {} columns.".format(*X_train_featurized.shape))
print("Featurized test data has {} rows and {} columns.".format(*X_test_featurized.shape))

X_train_featurized.head()

Dataset size is reduced from  (45211, 17)  to  (11302, 17)
Featurized training data has 10171 rows and 51 columns.
Featurized test data has 1131 rows and 51 columns.


| | job_admin. | job_blue-collar | job_entrepreneur | job_housemaid | job_management | job_retired | job_self-employed | job_services | job_student | job_technician | ... | poutcome_other | poutcome_success | poutcome_unknown | age | balance | day | duration | campaign | pdays | previous |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.871497 | 0.939251 | 0.387423 | -0.350663 | -0.562904 | -0.413454 | -0.291715 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | -0.268481 | 0.127069 | -0.691799 | 0.923129 | -0.562904 | -0.413454 | -0.291715 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.156492 | 0.83751 | 0.747164 | 0.184865 | -0.562904 | 2.397009 | 0.677903 |
| 3 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.871497 | -0.392126 | -1.291367 | -0.343012 | -0.562904 | -0.413454 | -0.291715 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | -0.743473 | -0.398419 | 0.267509 | -0.916792 | 0.380153 | -0.413454 | -0.291715 |
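
The featurization above can also be expressed with `ColumnTransformer`, which bundles the one-hot encoding and the z-normalization into a single object that is fit on the training data only. The following is a minimal sketch of that alternative, not the method used in the lecture. The `toy` data frame and its values are made up purely for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for a few columns of the bank data set.
toy = pd.DataFrame({
    "age": [30, 45, 52, 23],
    "balance": [100.0, -20.0, 3500.0, 0.0],
    "job": ["admin.", "retired", "admin.", "student"],
})

# One transformer per column group: one-hot for categoricals, z-scores for numerics.
featurizer = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["job"]),
    ("znorm", StandardScaler(), ["age", "balance"]),
])

# fit_transform learns the categories and the means/standard deviations, then applies them.
features = featurizer.fit_transform(toy)
print(features.shape)  # 3 one-hot columns for job + 2 numeric columns
```

On real data we would call `featurizer.fit` on the training split and `featurizer.transform` on the test split, mirroring what the cell above does by hand.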


There are three main ways to search the **hyper-parameter space**:

- **Grid search:** tries every combination of hyper-parameters
- **Random search:** tries a random subset of all combinations of hyper-parameters
- **Bayesian optimization:** tries a subset of all combinations of hyper-parameters (like random search) but does so in a more intelligent way, based on trading off the need to **explore** (trying a part of the hyper-parameter space thus far unexplored) and the need to **exploit** (focusing on a part of the hyper-parameter space that thus far seems promising)
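
To make the difference between the first two strategies concrete, here is a small sketch that only enumerates candidates and trains nothing. The search space is a hypothetical one chosen for illustration. Bayesian optimization is not shown because `sklearn` does not ship an implementation of it.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Hypothetical search space, chosen only for illustration.
space = {"kernel": ["linear", "rbf"], "C": [1, 10], "gamma": [1, 10]}

# Grid search: every combination (2 * 2 * 2 = 8 candidates).
grid_candidates = list(ParameterGrid(space))

# Random search: a random subset of those combinations (here 3 of them).
random_candidates = list(ParameterSampler(space, n_iter=3, random_state=0))

print(len(grid_candidates), "grid candidates")
print(len(random_candidates), "random candidates")
for candidate in random_candidates:
    print(candidate)
```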

We will use grid search here, as implemented by the `GridSearchCV` class. As a bonus, `GridSearchCV` uses cross-validation (CV) to evaluate each model. Cross-validation slows down the search because every combination is trained once per fold, but we can use a lower number of **folds** to speed it up.
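
As a rough illustration of what 3 folds means, the sketch below splits a made-up label vector into three stratified folds. For a classifier and an integer `cv`, `GridSearchCV` uses stratified folds of this kind, so each validation fold keeps roughly the same class balance. The toy labels are an assumption for illustration only.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 6 'no' and 3 'yes'.
y_toy = np.array(["no"] * 6 + ["yes"] * 3)
X_toy = np.zeros((len(y_toy), 1))  # the features do not matter for the split

folds = StratifiedKFold(n_splits=3)
fold_sizes = []
yes_per_fold = []
for train_idx, valid_idx in folds.split(X_toy, y_toy):
    # Each model is trained on train_idx and scored on valid_idx.
    fold_sizes.append(len(valid_idx))
    yes_per_fold.append(int((y_toy[valid_idx] == "yes").sum()))

print("validation fold sizes:", fold_sizes)
print("'yes' rows per validation fold:", yes_per_fold)
```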

SVMs have two important **high-level hyper-parameters** and then some lower-level ones that depend on the high-level ones. The high-level hyper-parameters are `C` and `kernel`. Depending on the choice of `kernel`, we can also specify `degree` and `gamma`. You can read more about that [here](https://scikit-learn.org/stable/modules/svm.html#kernel-functions).

In addition to the hyper-parameters mentioned above, `SVC` also has some important arguments that we should be aware of, such as `max_iter`, `class_weight`, and `cache_size`.

- Use `GridSearchCV` to train multiple `SVC` classifiers with different hyper-parameter combinations. <span style="color:red" float:right>[15 points]</span>
  - The hyper-parameters you want to try are `kernel`, `C` and `gamma`. You should pick two different choices for each.
  - For `SVC`, setting `probability = True` slows down training considerably, so it's not a good idea to use it during grid search. (Instead, we can retrain the final model using the hyper-parameter combination that we found and set `probability = True` if we need soft predictions, but we won't worry about that here.) Increase the cache size to 1024.
  - We leave it to you to read the documentation for `SVC` to see what choices make sense. Moreover, your grid search should perform 3-fold cross-validation to select the best model, execute 4 parallel jobs (`n_jobs`), and return the training score (`return_train_score = True`).
  - It's best to avoid running everything in one line. So try to break your code into a few different steps to make it easy to follow.

In [3]:
## your code goes here
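# candidate values; note that the linear kernel ignores gamma
parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10], 'gamma': [1, 10]}
svc = SVC()
# 3-fold CV, 4 parallel jobs, and keep the training scores
clf = GridSearchCV(svc, parameters, n_jobs = 4, return_train_score = True, cv = 3)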
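
One thing to watch for in a grid like the one above: `gamma` has no effect on the linear kernel, so crossing it with `kernel = 'linear'` trains the same model more than once. A possible alternative, sketched here on synthetic data from `make_classification` (an assumption, not the bank data), is to pass a list of dictionaries so that `gamma` is only varied for the `rbf` kernel. The sketch also sets `cache_size = 1024`, as the instructions ask.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data, used only so the sketch runs on its own.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Each dictionary is expanded separately: 2 linear candidates + 4 rbf candidates.
conditional_grid = [
    {"kernel": ["linear"], "C": [1, 10]},
    {"kernel": ["rbf"], "C": [1, 10], "gamma": [1, 10]},
]

search = GridSearchCV(SVC(cache_size=1024), conditional_grid, cv=3, return_train_score=True)
search.fit(X_demo, y_demo)

n_candidates = len(search.cv_results_["params"])
print(n_candidates, "candidates instead of 8")
```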

- While coding and debugging, reduce your training data further, to 1/4 of its rows. (After your code works, you can try the full dataset and then pick the model with the best combination of hyper-parameters. In this context, hyper-parameter tuning is often also referred to as **model selection**.)
- Run your grid search to train all the models. <span style="color:red" float:right>[5 points]</span>

In [4]:
## your code goes here
clf.fit(X_train_featurized, y_train)

GridSearchCV(cv=3, estimator=SVC(), n_jobs=4,
             param_grid={'C': [1, 10], 'gamma': [1, 10],
                         'kernel': ('linear', 'rbf')},
             return_train_score=True)

All the results generated by the grid search are stored in the `cv_results_` attribute. For example, if we want to know the combination of hyper-parameters that was tried in the 10th iteration, we can run `clf.cv_results_['params'][9]` (assuming the fitted grid search object is called `clf`), and if we want to know the cross-validated evaluation score for that 10th iteration, we can run `clf.cv_results_['mean_test_score'][9]`.

Note that we need to be careful about terminology here. Unfortunately, the hyper-parameters are called `params` by `GridSearchCV`. But in ML, **parameters** are the things that the algorithm learns from the data (such as the coefficients in the prediction equation), whereas **hyper-parameters** cannot be learned from the data, which is why we have to tune them by trying different combinations. Also, the cross-validated score is called `mean_test_score` even though we are not using the test data to evaluate it, at least not during model selection. We will use the test data later to evaluate the final model.

Present a data frame of the grid search parameters and the validation scores (`'mean_test_score'`). <span style="color:red" float:right>[5 points]</span>

In [5]:
clf.cv_results_['mean_test_score']

array([0.89538895, 0.880936  , 0.89538895, 0.8819192 , 0.89519229,
       0.87916629, 0.89519229, 0.88182087])

In [6]:
clf.cv_results_['params']

[{'C': 1, 'gamma': 1, 'kernel': 'linear'},
 {'C': 1, 'gamma': 1, 'kernel': 'rbf'},
 {'C': 1, 'gamma': 10, 'kernel': 'linear'},
 {'C': 1, 'gamma': 10, 'kernel': 'rbf'},
 {'C': 10, 'gamma': 1, 'kernel': 'linear'},
 {'C': 10, 'gamma': 1, 'kernel': 'rbf'},
 {'C': 10, 'gamma': 10, 'kernel': 'linear'},
 {'C': 10, 'gamma': 10, 'kernel': 'rbf'}]

In [7]:
results=pd.DataFrame(clf.cv_results_['params'])
results['mean_test_score']=clf.cv_results_['mean_test_score']
results.head()

| | C | gamma | kernel | mean_test_score |
|---|---|---|---|---|
| 0 | 1 | 1 | linear | 0.895389 |
| 1 | 1 | 1 | rbf | 0.880936 |
| 2 | 1 | 10 | linear | 0.895389 |
| 3 | 1 | 10 | rbf | 0.881919 |
| 4 | 10 | 1 | linear | 0.895192 |


In [8]:
results.loc[results.mean_test_score==results.mean_test_score.max()]

| | C | gamma | kernel | mean_test_score |
|---|---|---|---|---|
| 0 | 1 | 1 | linear | 0.895389 |
| 2 | 1 | 10 | linear | 0.895389 |
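
An alternative to assembling the table column by column is to turn the whole `cv_results_` dictionary into a data frame and keep the columns of interest. The sketch below does this on synthetic data from `make_classification` (an assumption so it runs on its own). The `param_*`, `mean_train_score`, `mean_test_score` and `rank_test_score` keys are part of `cv_results_`, with the training score present because `return_train_score=True`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data, used only so the sketch runs on its own.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

search = GridSearchCV(SVC(), {"kernel": ["linear", "rbf"], "C": [1, 10]},
                      cv=3, return_train_score=True)
search.fit(X_demo, y_demo)

# Every key of cv_results_ becomes a column; keep the ones of interest.
all_results = pd.DataFrame(search.cv_results_)
columns = ["param_kernel", "param_C", "mean_train_score", "mean_test_score", "rank_test_score"]
summary = all_results[columns].sort_values("rank_test_score").reset_index(drop=True)
print(summary)
```

Sorting by `rank_test_score` puts the best combination (or combinations, in case of a tie) at the top.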


Time to pull the best model. We can explicitly access it through the `clf.best_estimator_` attribute. However, doing so is not necessary: with the default `refit = True`, the grid search refits the best model on all the training data, and calling `clf.predict` is **implied** to use that best estimator. (Note that `clf.estimator` is something different: it is the unfitted `SVC` we passed in.)
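
The sketch below checks this behavior on synthetic data from `make_classification` (an assumption so it runs on its own): predictions from the grid search object match those of `best_estimator_`, while the `estimator` attribute still holds the unfitted model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data, used only so the sketch runs on its own.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

search = GridSearchCV(SVC(), {"C": [1, 10]}, cv=3)  # refit=True is the default
search.fit(X_demo, y_demo)

# predict on the search object is delegated to the refitted best estimator.
via_search = search.predict(X_demo)
via_best = search.best_estimator_.predict(X_demo)
same_predictions = np.array_equal(via_search, via_best)

# The estimator we passed in is only a template; it has no fitted attributes.
base_is_fitted = hasattr(search.estimator, "support_")

print("same predictions:", same_predictions)
print("base estimator fitted:", base_is_fitted)
```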

- Get predictions on the training and test data for the best model. Finally, get the precision and recall of the best estimator to see how they compare to what we got from logistic regression during the lecture. <span style="color:red" float:right>[5 points]</span>

In [9]:
## your code goes here
y_hat_train=clf.predict(X_train_featurized)
y_hat_test=clf.predict(X_test_featurized)
precision_train = precision_score(y_train, y_hat_train, pos_label = 'no') * 100 # per the Piazza discussion, 'no' is treated as the positive class
precision_test = precision_score(y_test, y_hat_test, pos_label = 'no') * 100

recall_train = recall_score(y_train, y_hat_train, pos_label = 'no') * 100
recall_test = recall_score(y_test, y_hat_test, pos_label = 'no') * 100

print("Precision = {:.0f}% and recall = {:.0f}% on the training data.".format(precision_train, recall_train))
print("Precision = {:.0f}% and recall = {:.0f}% on the validation data.".format(precision_test, recall_test))

Precision = 91% and recall = 98% on the training data.
Precision = 94% and recall = 98% on the validation data.
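
Because precision and recall depend on which class is treated as positive, it helps to see how much `pos_label` can change the numbers. The toy labels below are made up for illustration: the majority class is `'no'`, and the classifier finds only one of the two `'yes'` rows.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels: 8 'no' and 2 'yes'; one 'yes' is predicted as 'no'.
y_true = ["no"] * 8 + ["yes"] * 2
y_pred = ["no"] * 8 + ["no", "yes"]

scores = {}
for label in ("no", "yes"):
    scores[label] = (
        precision_score(y_true, y_pred, pos_label=label),
        recall_score(y_true, y_pred, pos_label=label),
    )
    print("pos_label = {!r}: precision = {:.2f}, recall = {:.2f}".format(label, *scores[label]))
```

With `'no'` as the positive class, recall is perfect even though half of the `'yes'` rows are missed, so scores computed with different values of `pos_label` are not directly comparable.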


### Results from lecture
Precision = 65% and recall = 35% on the training data.

Precision = 63% and recall = 34% on the validation data.

### Conclusion
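Using the grid search, we increased both precision and recall on both the training and validation data relative to the lecture results. The best model uses `C = 1` (which happens to be the default), `gamma = 1`, and `kernel = 'linear'`, whereas the class notebook used `gamma = 'scale'` and `cache_size = 1024`. The grid search models used the default `cache_size` of 200. Two caveats apply: the linear kernel ignores `gamma`, which is why the `gamma = 1` and `gamma = 10` rows tie for the best score, and the scores above were computed with `pos_label = 'no'`, so the comparison with the lecture holds only if the lecture numbers used the same positive class.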

### Trained a stand-alone SVC model with `probability = True` to see what it outputs

In [10]:
clf.best_estimator_

SVC(C=1, gamma=1, kernel='linear')

In [11]:
## for fun, train a plain SVC with the hyper-parameters found above and probability = True to see what the outputs look like
svc_fin = SVC(kernel = 'linear', C = 1, gamma = 1, probability = True)
svc_fin.fit(X_train_featurized, y_train)

SVC(C=1, gamma=1, kernel='linear', probability=True)

In [12]:
y_hat_train_fin_prob=svc_fin.predict_proba(X_train_featurized)
y_hat_train_fin=svc_fin.predict(X_train_featurized)
print('y_hat_train_final_probability \n',y_hat_train_fin_prob[:5])
print('y_hat_train_final \n',y_hat_train_fin[:5])

y_hat_train_final_probability 
 [[0.93413699 0.06586301]
 [0.84846366 0.15153634]
 [0.85015186 0.14984814]
 [0.95022764 0.04977236]
 [0.96449301 0.03550699]]
y_hat_train_final 
 ['no' 'no' 'no' 'no' 'no']
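
The columns of `predict_proba` follow the order of the fitted model's `classes_` attribute, so in the output above the first column is the probability of `'no'` and the second the probability of `'yes'`. The sketch below shows how to look up a column by class name instead of by position. It uses synthetic data from `make_classification` with labels recoded to `'no'`/`'yes'`, which is an assumption so it runs on its own.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in data with string labels like the bank data's target.
X_demo, y_num = make_classification(n_samples=200, n_features=5, random_state=0)
y_demo = np.where(y_num == 1, "yes", "no")

svc_demo = SVC(kernel="linear", C=1, probability=True, random_state=0)
svc_demo.fit(X_demo, y_demo)

proba = svc_demo.predict_proba(X_demo)

# Find the column that belongs to 'yes' rather than assuming its position.
yes_column = list(svc_demo.classes_).index("yes")
p_yes = proba[:, yes_column]

print("class order:", svc_demo.classes_)
print("first five P(yes):", p_yes[:5])
```

For `SVC`, these probabilities come from a calibration step fitted with internal cross-validation, so they may occasionally disagree with the hard labels returned by `predict`.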


# End of assignment