# Lab 6 - Validation and Tuning
In this lab, you will learn how to use **cross-validation** to
evaluate your models more reliably and to tune the parameters of
various types of classification algorithms.

Until now, we have been using a single train and test set to evaluate our models.
This method has some limitations. First, since the train and test sets
are typically chosen randomly, it is possible that either of these
samples (or both) does not reflect the general data well. This could
make a model appear much better or much worse than it actually
performs on new data. In addition, it can lead to "guessing" which
parameters might work, rather than *searching* over parameters in
a systematic way.

Cross-validation (CV) works by breaking your training data
into smaller subsets (folds) of roughly equal size.
For example, in 5-fold CV, we divide the data into 5 subsets.
In the first round, we train a model on four of the subsets and
test it on the remaining one. In the next round, a different subset is
held out for testing and the other four are used for training, and so on
until every subset has been used for testing exactly once. The result is that
we train and test five different models.
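
The fold rotation described above can be sketched on a tiny example. This is only an illustration: `X_toy` is a made-up array of ten observations, not the lab data, and plain `KFold` is used here rather than the stratified splitter used later in the lab.

```python
import numpy as np
from sklearn.model_selection import KFold

# ten toy observations (hypothetical data, not the lab dataset)
X_toy = np.arange(10).reshape(-1, 1)

kf = KFold(n_splits=5)
folds = list(kf.split(X_toy))

for i, (train_idx, test_idx) in enumerate(folds, start=1):
    # each round trains on four subsets and tests on the one left out
    print("fold", i, "train:", train_idx, "test:", test_idx)
```

Each index appears in exactly one test subset, so every observation is used for testing once and for training four times.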

First, import the necessary tools for this lab.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.model_selection import (KFold, ShuffleSplit,
                                     StratifiedKFold, 
                                     StratifiedShuffleSplit)

from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_validate
from sklearn.model_selection import GridSearchCV

# Data
The data for this lab comes from a study of deception in high-stakes situations. These observations 
are statements made by individuals in the trials of high-profile murders. Statements from guilty
parties are marked as `lie` and all others as `truth`. 

The language of these utterances has been quantified using the total number of words and the sentiment
(positive or negative) of the utterances.

In [2]:
trial_data = pd.read_csv('data/trial_data_language.csv')
trial_data.describe(include="all")

| | audio_file | title_name | verdict | title | name | condition | transcript | total_words | sentiment |
|---|---|---|---|---|---|---|---|---|---|
| count | 120 | 120 | 120 | 120 | 120 | 120 | 120 | 120.0 | 120.0 |
| unique | 119 | 44 | 9 | 11 | 31 | 2 | 119 | | |
| top | trial_truth_008.mp4 | Defendant / Jodi Arias | Guilty ... | Defendant | Jodi Arias | lie | She was fine, laughing about simple little thi... | | |
| freq | 2 | 18 | 54 | 70 | 33 | 60 | 2 | | |
| mean | | | | | | | | 66.666667 | 0.064516 |
| std | | | | | | | | 37.843404 | 0.542237 |
| min | | | | | | | | 8.0 | -0.9844 |
| 25% | | | | | | | | 42.75 | -0.3071 |
| 50% | | | | | | | | 59.5 | 0.0 |
| 75% | | | | | | | | 82.75 | 0.5045 |


Next, specify the variables that we will use for prediction.

In [3]:
pred_vars = ['total_words', 'sentiment']

# Using cross-validation in training
We will now use cross-validation to train a model. With cross validation,
we can get the performance of a model with several iterations of training 
and testing.

We will still hold a portion of the data out so that we can compare the
results of cross-validation to the test data.

## Create train and test sets
For this lab, we will use an 80/20 split, and we will
use stratified sampling. This ensures that the train and test sets
contain the same proportions of each value of the `condition` column
(and therefore of the `class` column derived from it).

In [None]:
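# encode the label: truth = 0, anything else (lie) = 1
trial_data['class'] = np.where(trial_data['condition'] == 'truth', 0, 1)
# seed numpy's global random generator so the split is reproducible
np.random.seed(516)

# create train and test, stratified on the label
train, test = train_test_split(trial_data, test_size=0.20, stratify=trial_data['condition'])
print("Rows in train:", len(train))
print("Rows in test:", len(test))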
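
To see what stratification does, here is a minimal sketch on made-up data. The `toy` frame (60 rows of each class) and the names `toy_train` and `toy_test` are assumptions for illustration only; the point is that the class share is the same in both parts of the split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# hypothetical balanced toy data: 60 rows of each class
toy = pd.DataFrame({'x': np.arange(120), 'class': np.repeat([0, 1], 60)})

toy_train, toy_test = train_test_split(
    toy, test_size=0.20, stratify=toy['class'], random_state=516)

# share of class 1 in each part of the split
train_share = toy_train['class'].mean()
test_share = toy_test['class'].mean()
print("rows:", len(toy_train), len(toy_test))
print("share of class 1:", train_share, test_share)
```

Without `stratify`, the two shares would generally differ from split to split.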



In [5]:
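from sklearn.model_selection import GroupShuffleSplit
# https://stackoverflow.com/questions/54797508/how-to-generate-a-train-test-split-based-on-a-group-id

# Split by speaker instead: all statements from the same `name` land in the
# same set. Note that this replaces the `train` and `test` created above,
# and that test_size is now a share of the groups (names), not of the rows.
splitter = GroupShuffleSplit(test_size=.20, n_splits=2, random_state=516)
split = splitter.split(trial_data, groups=trial_data['name'])
# only the first of the generated splits is used
train_inds, test_inds = next(split)

train = trial_data.iloc[train_inds]
test = trial_data.iloc[test_inds]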
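
A quick way to check a group-based split is to confirm that no group appears on both sides. The sketch below does this on made-up data; the `toy` frame and its `speaker_*` names are hypothetical and only stand in for the `name` column.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# hypothetical toy data: 5 speakers with 4 statements each
toy = pd.DataFrame({
    'name': [f'speaker_{i}' for i in range(5) for _ in range(4)],
    'x': range(20),
})

gss = GroupShuffleSplit(test_size=0.20, n_splits=1, random_state=516)
tr_idx, te_idx = next(gss.split(toy, groups=toy['name']))

train_names = set(toy.iloc[tr_idx]['name'])
test_names = set(toy.iloc[te_idx]['name'])
overlap = train_names & test_names

print("speakers in train:", sorted(train_names))
print("speakers in test:", sorted(test_names))
print("speakers in both:", overlap)
```

The same check could be run on `train['name']` and `test['name']` from the cell above.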

## Train a model and view results
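Now, we will train a random forest classifier and evaluate it
with the stratified shuffle split. This will take longer than
regular training, since it will train 5 different models, one
for each of 5 random stratified splits of the training data.
We can have the models scored using a number of criteria.
The output is a dictionary with the fit times, the score times,
and one array of test scores per criterion, with one entry for
each model.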

In [None]:
n_splits = 5
scoring = ['accuracy', 'neg_log_loss', 'f1', 'roc_auc']
rf_base = RandomForestClassifier()
cv_rf = cross_validate(rf_base, train[pred_vars], train['class'], cv=StratifiedShuffleSplit(n_splits), scoring=scoring)
print(cv_rf)

In [None]:
# view a single statistic over the models
print(cv_rf['test_roc_auc'])
print("mean model AUC", np.mean(cv_rf['test_roc_auc']))

In [None]:
print(cv_rf['test_neg_log_loss'])
# note that this uses negative log loss (log loss * (-1)), meaning that 
# higher values (those closer to 0) are better.
print("mean model negative log loss", np.mean(cv_rf['test_neg_log_loss']))
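
The dictionary returned by `cross_validate` can be easier to read as a table. The sketch below is one possible way to summarize it, run on synthetic data from `make_classification` as a stand-in for the lab data; the names `X_demo`, `y_demo`, and `summary`, and the choice of 25 trees, are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedShuffleSplit, cross_validate

# synthetic stand-in data (an assumption; not the lab dataset)
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=0)

scores = cross_validate(
    RandomForestClassifier(n_estimators=25, random_state=0),
    X_demo, y_demo,
    cv=StratifiedShuffleSplit(n_splits=5, random_state=0),
    scoring=['accuracy', 'neg_log_loss', 'roc_auc'],
)

# one row per model, one column per timing or score
summary = pd.DataFrame(scores)
# flip the sign so that log loss reads in its usual direction (lower is better)
summary['test_log_loss'] = -summary['test_neg_log_loss']

print(summary.filter(like='test_').agg(['mean', 'std']))
```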

# Model Tuning
One major strength of cross-validation is the ability to
search over different parameters in the model specification to 
see which work best. Before, this was a very slow and manual
process. 

You can get an overview of model tuning 
[here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html).

First, we will specify some parameters for random forests that we would like to try.
In this example, we will vary the criterion (Gini or entropy) and the maximum tree
depth (5, 10, 15, or None).

In [None]:
params = {'criterion': ['gini', 'entropy'], 'max_depth': [5, 10, 15, None]}

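A grid search trains a model for every combination of parameters, which can mean a
large number of models and may be slow. This dataset is small, so it will not be much
of an issue here.

The total number of models can be found by multiplying the number of options for each
parameter and the number of folds. In this example:
2 criteria * 4 tree depths * 5 folds = 40 models to train.
By default, `GridSearchCV` then fits one more model on all of the training data
using the best combination.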
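
The count can be checked in code with `ParameterGrid`, which enumerates the combinations in a parameter dictionary. In this small sketch the names `rf_grid`, `n_folds`, `n_combos`, and `n_fits` are chosen for illustration only.

```python
from sklearn.model_selection import ParameterGrid

# same grid as in the lab, under a separate name for this illustration
rf_grid = {'criterion': ['gini', 'entropy'], 'max_depth': [5, 10, 15, None]}
n_folds = 5

n_combos = len(ParameterGrid(rf_grid))   # number of parameter combinations
n_fits = n_combos * n_folds              # models trained during the search

print(n_combos, "combinations *", n_folds, "folds =", n_fits, "models")
```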

We must also specify a scoring method to optimize. Using different
scoring methods may yield very different model parameters, so this 
must be thought out in advance! Here, we are using the AUC as our
optimization criterion.
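
One way to see how the scoring method affects the choice is to score the same search with several metrics at once. The sketch below assumes synthetic data from `make_classification` and a small, made-up grid over `max_depth`; it is an illustration, not the search used in this lab.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit

# synthetic stand-in data (an assumption; not the lab dataset)
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=0)

metrics_to_try = ['roc_auc', 'accuracy', 'neg_log_loss']
multi = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=0),
    param_grid={'max_depth': [2, 5, None]},
    cv=StratifiedShuffleSplit(n_splits=5, random_state=0),
    scoring=metrics_to_try,
    refit='roc_auc',   # the metric that decides best_params_
)
multi.fit(X_demo, y_demo)

# best parameter setting according to each metric
best_by_metric = {}
for metric in metrics_to_try:
    best_idx = int(np.argmin(multi.cv_results_['rank_test_' + metric]))
    best_by_metric[metric] = multi.cv_results_['params'][best_idx]
    print(metric, "->", best_by_metric[metric])
```

If the metrics disagree, the setting reported by `best_params_` is the one chosen by the `refit` metric.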

In [None]:
rf_base = RandomForestClassifier()
rf_tuned = GridSearchCV(rf_base, param_grid=params, cv=StratifiedShuffleSplit(n_splits), scoring='roc_auc')
rf_tuned.fit(train[pred_vars], train['class'])

After training the model, you can view lots of details about the optimization process. The code below prints the results for each parameter combination that we specified, including the score on each fold and the mean and standard deviation over all of the folds. 

In [None]:
print(rf_tuned.cv_results_)

View the mean test score for each parameter combination:

In [None]:
print(rf_tuned.cv_results_['mean_test_score'])
print(rf_tuned.cv_results_['mean_test_score'].mean())

View the parameter settings that generated the best model:

In [None]:
print(rf_tuned.best_estimator_)
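
The raw `cv_results_` dictionary is hard to scan. A possible alternative, sketched here on synthetic data with a smaller made-up grid, is to load it into a DataFrame and sort by rank; `X_demo`, `y_demo`, `search`, and `ranked` are names assumed for this illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit

# synthetic stand-in data (an assumption; not the lab dataset)
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=0),
    param_grid={'criterion': ['gini', 'entropy'], 'max_depth': [5, None]},
    cv=StratifiedShuffleSplit(n_splits=5, random_state=0),
    scoring='roc_auc',
)
search.fit(X_demo, y_demo)

# one row per parameter combination, best combination first
results = pd.DataFrame(search.cv_results_)
cols = ['param_criterion', 'param_max_depth',
        'mean_test_score', 'std_test_score', 'rank_test_score']
ranked = results[cols].sort_values('rank_test_score')

print(ranked)
print("best parameters:", search.best_params_)
print("best mean CV score:", search.best_score_)
```

The same pattern should work with `rf_tuned.cv_results_` from the cells above.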

## Tune a neural network
We will now apply this same process to a neural network. Since
these are very different from random forests under the hood,
the parameters to search over will be specific to multi-layer perceptrons.

The neural networks may raise convergence warnings during the search. You can ignore them; the cell below uses the `warnings` module to hide them.

In [None]:
import warnings
warnings.filterwarnings('ignore')

nnet_base = MLPClassifier()

params = {'hidden_layer_sizes': [(100,), (10,10), (5,5,5)], 
          'solver': ['adam', 'lbfgs', 'sgd']}
# 9 different param combos, 5 cv = total of 45 models
# try with different scoring methods
# we get different results with accuracy & log_loss
nnet_tuned = GridSearchCV(nnet_base, param_grid=params, cv=StratifiedShuffleSplit(n_splits), scoring='roc_auc')
nnet_tuned.fit(train[pred_vars], train['class'])
# nnet_tuned.get_params()
print(nnet_tuned.cv_results_)
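
The cell above hides every warning, which can also hide useful ones. A narrower option, sketched below, is to ignore only scikit-learn's `ConvergenceWarning`. The two `warnings.warn` calls are made-up messages used to show the effect; they are not output from the lab's models.

```python
import warnings
from sklearn.exceptions import ConvergenceWarning

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    # ignore only convergence warnings; everything else still gets through
    warnings.filterwarnings('ignore', category=ConvergenceWarning)

    warnings.warn("did not converge", ConvergenceWarning)  # hidden
    warnings.warn("something else", UserWarning)           # still reported

n_caught = len(caught)
print("warnings that got through:", [str(w.message) for w in caught])
```

In a notebook, the single `warnings.filterwarnings('ignore', category=ConvergenceWarning)` line would take the place of `warnings.filterwarnings('ignore')`.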

In [None]:
print(nnet_tuned.cv_results_['mean_test_score'])

In [None]:
print(nnet_tuned.best_estimator_)

# Evaluation
Next, we will compare the results of the two models, much like we did last time.
We will use the same method as last week to plot the ROC curves for each model.

In [None]:
fitted = [rf_tuned, nnet_tuned]

# collect one dictionary of results per model, then build the table once
# (DataFrame.append was removed in pandas 2.0)
rows = []

for clf in fitted:
    print(clf.estimator)
    # the fitted grid search predicts with its best (refit) model
    yproba = clf.predict_proba(test[pred_vars])
    yclass = clf.predict(test[pred_vars])
    
    # auc information
    fpr, tpr, _ = metrics.roc_curve(test['class'],  yproba[:,1])
    auc = metrics.roc_auc_score(test['class'], yproba[:,1])
    
    # log loss
    log_loss = metrics.log_loss(test['class'], yproba[:,1])
    
    # add some other stats based on confusion matrix
    clf_report = metrics.classification_report(test['class'], yclass)
    
    rows.append({'classifier_name': str(clf.estimator),
                 'fpr': fpr, 
                 'tpr': tpr, 
                 'auc': auc,
                 'log_loss': log_loss,
                 'clf_report': clf_report})

result_table = pd.DataFrame(rows, columns=['classifier_name', 'fpr', 'tpr', 'auc', 
                                           'log_loss', 'clf_report'])
result_table.set_index('classifier_name', inplace=True)
# print(result_table)

In [None]:
for i in result_table.index:
    print('\n---- statistics for', i, "----\n")
    print(result_table.loc[i, 'clf_report'])
    print("Model AUC:", result_table.loc[i, 'auc'])
    print("Model log loss:", result_table.loc[i, 'log_loss'])

In [None]:
fig = plt.figure(figsize=(14,12))

for i in result_table.index:
    plt.plot(result_table.loc[i]['fpr'], 
             result_table.loc[i]['tpr'], 
             label="{}, AUC={:.3f}".format(i, result_table.loc[i]['auc']))
    
plt.plot([0,1], [0,1], color='orange', linestyle='--')

plt.xticks(np.arange(0.0, 1.1, step=0.1))
plt.xlabel("False Positive Rate", fontsize=15)

plt.yticks(np.arange(0.0, 1.1, step=0.1))
plt.ylabel("True Positive Rate", fontsize=15)

plt.title('ROC Curve Analysis', fontweight='bold', fontsize=15)
plt.legend(prop={'size':13}, loc='lower right')

plt.show()


# Exercises

1. Tune the parameters for two additional algorithms. They can be those which were used in the last lab, or
    you can try some other algorithm. A list of the models in `sklearn` is
    [here](https://scikit-learn.org/stable/supervised_learning.html). 
    Use a grid search on at least two different parameters for each model. For example, in the random forest
    built in the lab, we modified the `criterion` and `max_depth` parameters.
    1. You should now have four models to compare. Which model performs best on out-of-sample (`test`)
       data? Include precision, recall, accuracy, $F_1$, and ROC/AUC when deciding.


## Optional

2. Try using the grid search technique on the netattacks data from earlier labs. 
    1. Does model performance improve with the optimized parameters? 
    2. How long does model training take? (this could be a long time!)