<a href="https://colab.research.google.com/github/matthewpecsok/4482_fall_2022/blob/main/tutorials/4482_classification_MLP_titanic_cleaned.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Welcome to the MLP classification notebook. In this notebook we will explore two new topics.

* The first is the multilayer perceptron (MLP) algorithm. Once an MLP has several hidden layers it is usually called a "deep neural network" (DNN), and training one is called "deep learning". This is an extremely powerful modeling technique.

* The second, which is especially relevant for DNNs, is the selection and tuning of hyperparameters. This search can be computationally expensive and time consuming, so we introduce sklearn's GridSearchCV to explore the parameter space without coding every combination by hand.

The notebook begins by creating MLP models with varying numbers of layers and neurons per layer, so the models become progressively more complex and more expensive to train. Pay attention to the metrics to see whether the added complexity actually improves our predictions. DNNs are notorious for overfitting because they can become arbitrarily complex.

After you have grasped the basic idea of the DNN, we will look at other hyperparameters that are often tuned to improve the model, and finally we will use GridSearchCV to bring the code complexity back down to a reasonable level.

## Setup

In [None]:
import numpy as np
import pandas as pd

from sklearn import metrics
from sklearn.metrics import (classification_report, confusion_matrix,
                             recall_score, precision_score, f1_score,
                             accuracy_score, make_scorer,
                             precision_recall_fscore_support)

from sklearn.model_selection import train_test_split, cross_validate

from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

import warnings
warnings.filterwarnings('ignore')


## Data

In [None]:
titanic_cleaned = pd.read_csv('https://raw.githubusercontent.com/matthewpecsok/4482_fall_2022/main/data/titanic_cleaned.csv').drop('Cabin', axis=1) # drop cabin

In [None]:
titanic_cleaned.head()

In [None]:
titanic_cleaned['Pclass'] = titanic_cleaned.Pclass.astype(str)

In [None]:
titanic_cleaned.info()

In [None]:
y = titanic_cleaned.pop('Survived')

In [None]:
X = pd.get_dummies(titanic_cleaned)
print(X.shape, y.shape)

In [None]:
X.head()

## MLP

# Changing the number of hidden layers on the model

### Model 1 (no hidden layers)

In [None]:
model_1 = MLPClassifier(random_state=2021,hidden_layer_sizes=()).fit(X,y)


In [None]:
model_1

In [None]:
model_1.n_layers_

In [None]:
model_1.hidden_layer_sizes

In [None]:
model_1.classes_

In [None]:
len(model_1.coefs_)

In [None]:
model_1.coefs_[0].shape

In [None]:
model_1.coefs_[0]
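To make the attributes inspected above easier to read, here is a minimal sketch of how `n_layers_` and the shapes in `coefs_` follow from `hidden_layer_sizes`. It uses synthetic data from `make_classification` rather than the Titanic data, and the names `X_demo`, `y_demo` and `layer_shapes` are made up for this illustration.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

warnings.filterwarnings("ignore")  # hide ConvergenceWarning from the short training runs

# Synthetic stand-in data (assumption): 200 rows, 5 features, binary target
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)


def layer_shapes(hidden_layer_sizes):
    """Fit a small MLP and return (n_layers_, shapes of the weight matrices)."""
    mlp = MLPClassifier(hidden_layer_sizes=hidden_layer_sizes,
                        max_iter=50, random_state=0)
    mlp.fit(X_demo, y_demo)  # n_layers_ and coefs_ only exist after fitting
    return mlp.n_layers_, [w.shape for w in mlp.coefs_]


no_hidden = layer_shapes(())       # input layer connected straight to the output
two_hidden = layer_shapes((8, 4))  # two hidden layers with 8 and 4 neurons

print(no_hidden)
print(two_hidden)
```

`n_layers_` counts the input and output layers as well as the hidden ones, and there is one weight matrix between each pair of consecutive layers. For a binary target the output layer has a single neuron, so with no hidden layers the only weight matrix has shape (number of features, 1).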

In [None]:
model_1_cv_results = pd.DataFrame(cross_validate(model_1,
               X,
               y,
               cv = 3,
               return_train_score=True,
               scoring=['accuracy','recall','precision','f1']))

model_1_cv_results

### Model 2 (1 hidden layer with 50 neurons)

In [None]:
# why do we fit here? without fitting we have no layers
model_2 = MLPClassifier(random_state=2021,hidden_layer_sizes=(50,)).fit(X,y)
print("hidden layers sizes",model_2.hidden_layer_sizes)
print("n_layers_",model_2.n_layers_)

In [None]:
model_2_cv_results = pd.DataFrame(cross_validate(model_2,
               X,
               y,
               cv = 3,
               return_train_score=True,
               scoring=['accuracy','recall','precision','f1']))

model_2_cv_results

### Model 3 (2 hidden layers, the first has 15 neurons and the second has 10 neurons)


In [None]:
# why do we fit here? without fitting we have no layers
model_3 = MLPClassifier(random_state=2021,hidden_layer_sizes=(15,10)).fit(X,y)
print("hidden layers sizes",model_3.hidden_layer_sizes)
print("n_layers_",model_3.n_layers_)

In [None]:
model_3_cv_results = pd.DataFrame(cross_validate(model_3,
               X,
               y,
               cv = 3,
               return_train_score=True,
               scoring=['accuracy','recall','precision','f1']))

model_3_cv_results

### Model 4 (3 hidden layers, the first has 50 neurons, the second has 25 neurons, the third has 10 neurons)

In [None]:
# why do we fit here? without fitting we have no layers
model_4 = MLPClassifier(random_state=2021,hidden_layer_sizes=(50,25,10)).fit(X,y)
print("hidden layers sizes",model_4.hidden_layer_sizes)
print("n_layers_",model_4.n_layers_)

In [None]:
model_4_cv_results = pd.DataFrame(cross_validate(model_4,
               X,
               y,
               cv = 3,
               return_train_score=True,
               scoring=['accuracy','recall','precision','f1']))

model_4_cv_results

### Model 5 (4 hidden layers, the first has 100 neurons, the second has 50 neurons, the third has 25 neurons and the fourth has 10 neurons)

In [None]:
# why do we fit here? without fitting we have no layers
model_5 = MLPClassifier(random_state=2021,hidden_layer_sizes=(100,50,25,10)).fit(X,y)
print("hidden layers sizes",model_5.hidden_layer_sizes)
print("n_layers_",model_5.n_layers_)

In [None]:
model_5_cv_results = pd.DataFrame(cross_validate(model_5,
               X,
               y,
               cv = 3,
               return_train_score=True,
               scoring=['accuracy','recall','precision','f1']))

model_5_cv_results

### Model 6 (5 hidden layers, the first has 100 neurons, the second has 50 neurons, the third has 25 neurons, the fourth has 25 neurons, the fifth has 10 neurons)

In [None]:
model_6_cv_results = pd.DataFrame(cross_validate(MLPClassifier(random_state=2021,hidden_layer_sizes=(100,50,25,25,10)),
               X,
               y,
               cv = 3,
               return_train_score=True,
               scoring=['accuracy','recall','precision','f1']))

model_6_cv_results

### Model 7 (3 hidden layers each with 500 neurons)

Notice how much longer this model took to train compared to the others!

In [None]:
model_7_cv_results = pd.DataFrame(cross_validate(MLPClassifier(random_state=2021,hidden_layer_sizes=(500,500,500)),
               X,
               y,
               cv = 3,
               return_train_score=True,
               scoring=['accuracy','recall','precision','f1']))

model_7_cv_results

In [None]:
model_1_cv_results

In [None]:
model_2_cv_results

In [None]:
model_3_cv_results

In [None]:
model_4_cv_results

In [None]:
model_5_cv_results

In [None]:
model_6_cv_results

In [None]:
model_7_cv_results

Looking at the results above, notice that model performance does not always improve as complexity increases. The added complexity does seem to increase the fit time, which is expected: a more complex model takes longer to train. So there is a trade-off between the quality of the model and its complexity.
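One rough way to see why fit time grows is to count the weights and biases each architecture has to learn. The sketch below does that for the seven hidden layer settings used above. The helper `count_parameters` and the value `N_FEATURES = 10` are assumptions for illustration; for the real data you would use `X.shape[1]`.

```python
def count_parameters(n_inputs, hidden_layer_sizes, n_outputs=1):
    """Count the weights and biases of a fully connected network."""
    layer_units = [n_inputs, *hidden_layer_sizes, n_outputs]
    # each pair of consecutive layers has n_in * n_out weights plus n_out biases
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_units[:-1], layer_units[1:]))


N_FEATURES = 10  # hypothetical; use X.shape[1] for the real data

architectures = [(), (50,), (15, 10), (50, 25, 10),
                 (100, 50, 25, 10), (100, 50, 25, 25, 10), (500, 500, 500)]

param_counts = {arch: count_parameters(N_FEATURES, arch) for arch in architectures}

for arch, n_params in param_counts.items():
    print(arch, n_params)
```

Under these assumptions the model with three layers of 500 neurons has roughly half a million parameters, while the model with no hidden layers has 11, which goes some way toward explaining the difference in fit time.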

In addition, notice how much code we had to write just to get these results. If you were starting to feel that the code for each section is redundant, you are correct! A simple function lets us do exactly the same thing with much less code, as shown below.

In [None]:
model_1_hidden_layers = ()
model_2_hidden_layers = (50,)
model_3_hidden_layers = (15,10)
model_4_hidden_layers = (50,25,10)
model_5_hidden_layers = (100,50,25,10)
model_6_hidden_layers = (100,50,25,25,10)
model_7_hidden_layers = (500,500,500)

In [None]:
# create a function to train/evaluate the model using cross validation and return the
# results as a dataframe
def cv_mlp_models(hidden_layer_param):
  return_df = pd.DataFrame(cross_validate(MLPClassifier(random_state=2021,hidden_layer_sizes=hidden_layer_param),
               X,
               y,
               cv = 3,
               return_train_score=True,
               scoring=['accuracy','recall','precision','f1']))
  return_df['hidden'] = str(hidden_layer_param)
  return return_df

In [None]:
# now simply join all the result dataframes together
def join_dfs(df,concat_df):
  # DataFrame.append was removed in pandas 2.0, pd.concat does the same job
  concat_df = pd.concat([concat_df, df])
  return concat_df

In [None]:
# make a list of the hidden layer tuples so we can iterate over it with a for loop
list_of_hidden_layers = [model_1_hidden_layers,model_2_hidden_layers,model_3_hidden_layers,model_4_hidden_layers,model_5_hidden_layers,model_6_hidden_layers,model_7_hidden_layers]

In [None]:
# create an empty dataframe with the correct column names, we need this so we have a dataframe that can collect all our results
concat_df = pd.DataFrame(columns=['fit_time', 'score_time', 'test_accuracy', 'train_accuracy', 'test_recall', 'train_recall', 'test_precision', 'train_precision', 'test_f1', 'train_f1'])

# now simply iterate over our list of hidden layers, run the cv function for each hidden layer setting, and append all the results
for model_hidden_layer in list_of_hidden_layers:
  result_df = cv_mlp_models(model_hidden_layer)
  concat_df = join_dfs(result_df,concat_df)

concat_df
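The combined dataframe has one row per fold, which can be hard to compare at a glance. A possible next step, sketched here on synthetic data with small models so it runs quickly, is to average the folds for each hidden layer setting. The names `X_demo`, `y_demo` and `cv_results_for` are made up for this example and are not part of the notebook above.

```python
import warnings

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.neural_network import MLPClassifier

warnings.filterwarnings("ignore")  # hide ConvergenceWarning from the short training runs

# Synthetic stand-in data (assumption), not the Titanic data
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)


def cv_results_for(hidden):
    """Cross validate one hidden layer setting and label the rows with it."""
    results = pd.DataFrame(cross_validate(
        MLPClassifier(hidden_layer_sizes=hidden, max_iter=50, random_state=0),
        X_demo, y_demo, cv=3, return_train_score=True,
        scoring=["accuracy", "f1"]))
    results["hidden"] = str(hidden)
    return results


# build every result first, then concatenate once
all_results = pd.concat([cv_results_for(h) for h in [(), (10,), (10, 5)]],
                        ignore_index=True)

# one row per hidden layer setting, averaged over the 3 folds
summary = all_results.groupby("hidden", sort=False).mean()
print(summary[["fit_time", "test_accuracy", "test_f1"]])
```

Collecting the per-model dataframes in a list and calling `pd.concat` once also removes the need for an empty starting dataframe.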

# Learning Rate Hyperparameter

Until now we have only varied the number of hidden layers and the number of neurons in those hidden layers to improve our model. Another commonly tuned hyperparameter is the learning rate. This controls the size of the step taken each time the weights are adjusted, and it matters for two reasons. First, the learning rate affects how quickly or slowly the model learns. Second, it can help the model stay out of local minima and hopefully find the global minimum as the loss is reduced during training.
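To build some intuition for the step size, here is a toy sketch of plain gradient descent on the one-dimensional function f(w) = w**2. This is not what MLPClassifier does internally (its default solver is Adam, and its loss surface is far more complicated); the function `gradient_descent` and the three learning rates are chosen purely for illustration.

```python
def gradient_descent(learning_rate, n_steps=20, w_start=1.0):
    """Minimise f(w) = w**2 with plain gradient descent and return the final w."""
    w = w_start
    for _ in range(n_steps):
        gradient = 2 * w                  # derivative of w**2
        w = w - learning_rate * gradient  # step against the gradient
    return w


# the minimum is at w = 0, so a final w closer to 0 means more progress was made
final_w = {lr: gradient_descent(lr) for lr in (0.01, 0.1, 1.1)}

for lr, w in final_w.items():
    print(f"learning_rate={lr}: w after 20 steps = {w:.4g}")
```

With the same number of steps, the smallest learning rate makes the least progress toward the minimum, the middle one gets close, and the largest one overshoots on every step and moves further away. This toy function has only one minimum, so it illustrates the speed effect but not the local minima effect.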

Let's do a quick exploration of the learning rate and its impact on training using a simple example.

In [None]:
pd.DataFrame(cross_validate(MLPClassifier(random_state=2021,hidden_layer_sizes=(500,500,500),
                                          learning_rate_init=0.001), # note that 0.001 is the DEFAULT learning rate
               X,
               y,
               cv = 3,
               return_train_score=True,
               scoring=['accuracy','recall','precision','f1']))

Now we make the learning rate smaller to see if we can improve the model's accuracy, precision, recall, etc.

In [None]:
pd.DataFrame(cross_validate(MLPClassifier(random_state=2021,hidden_layer_sizes=(500,500,500),
                                          learning_rate_init=0.0001), 
               X,
               y,
               cv = 3,
               return_train_score=True,
               scoring=['accuracy','recall','precision','f1']))

It appears the model trained with the smaller learning rate is almost as good as the original, but it took almost twice as long to train! While 18 seconds may not be long to wait, imagine a model that takes 6 hours: cutting the training time to 3 hours by keeping the larger learning rate could be a significant advantage, and if the accuracy is almost the same there is almost no disadvantage to doing so.

# GridSearchCV

In addition to hidden layer size and depth and the learning rate, we could also tune momentum, the maximum number of passes over the data, and MANY others! This gets overwhelming quite quickly, and testing them all by hand isn't efficient. Imagine writing the code to test the following:

* 5 possible hidden layer sizes
* 6 possible learning rates
* 4 possible momentums
* 6 possible total numbers of passes

That is 5 * 6 * 4 * 6 = 720 possible combinations! That's way too many to write by hand.
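The count can be checked with sklearn's `ParameterGrid`, which enumerates every combination of a parameter dictionary. The specific values in `hypothetical_grid` below are invented for this sketch; only the number of options per hyperparameter matches the list above.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical values; only the number of options per hyperparameter matters here
hypothetical_grid = {
    "hidden_layer_sizes": [(10,), (50,), (50, 25), (100, 50), (100, 50, 25)],  # 5 options
    "learning_rate_init": [0.3, 0.1, 0.03, 0.01, 0.003, 0.001],                # 6 options
    "momentum": [0.0, 0.5, 0.9, 0.99],                                         # 4 options
    "max_iter": [50, 100, 200, 400, 800, 1600],                                # 6 options
}

# ParameterGrid builds the full cross product of the lists above
n_combinations = len(ParameterGrid(hypothetical_grid))
print(n_combinations)
```

Note that in MLPClassifier the `momentum` parameter is only used when `solver='sgd'`, so a real search over momentum would also need to set that solver.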

### Introducing GridSearch

GridSearch can try all of those combinations for you and simplify your code.

Note that I have chosen F1 as the primary metric the grid search uses to evaluate the hyperparameter combinations. The grid search will return the combination that maximizes the cross-validated F1 score.

In [None]:
from sklearn.model_selection import GridSearchCV
parameters = {'learning_rate_init':[0.1,0.01,0.001],
              'hidden_layer_sizes':[(50,),(50,50,50),(4,4,4)],
              # 'adam' is a solver, not an activation function, so it does not belong in this list
              'activation':['identity', 'logistic', 'tanh', 'relu']
              }
mlp = MLPClassifier(random_state=42)
clf = GridSearchCV(mlp, parameters,scoring='f1')
clf.fit(X, y)

clf.score(X, y)


In [None]:
# this is a complex dataframe that shows all of the various metrics you might want to look
# into if you want to understand how the grid search worked.
# a scenario might be: what if the grid search returns the "best" model but it takes 2 days to fit? (some DNNs take this long or longer to train)
# and is only slightly better than another model which takes 10 minutes to train. you might not bother
# picking the "best" model in that case

grid_search_df = pd.DataFrame(clf.cv_results_)

print(grid_search_df.shape) # one row per hyperparameter combination in the grid; is the row count what you would expect?

grid_search_df.sort_values('mean_test_score',ascending=False).head() #only taking the top five rows as this is a large dataframe sort by the best f1 scores found
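The trade-off described in the comments above (a slightly better model that takes far longer to fit) can be handled with a simple filter. The sketch below uses a small dataframe of made-up numbers in place of `pd.DataFrame(clf.cv_results_)`; the column names `mean_test_score` and `mean_fit_time` are real `cv_results_` keys, but the values, the labels in `params`, and the `TOLERANCE` threshold are assumptions.

```python
import pandas as pd

# Made-up numbers standing in for pd.DataFrame(clf.cv_results_) (assumption)
demo_results = pd.DataFrame({
    "params": ["A", "B", "C", "D"],            # placeholder labels for parameter sets
    "mean_test_score": [0.760, 0.755, 0.700, 0.758],
    "mean_fit_time": [120.0, 4.0, 1.0, 30.0],  # seconds
})

TOLERANCE = 0.01  # hypothetical: accept any score within 0.01 of the best

best_score = demo_results["mean_test_score"].max()

# keep only candidates that are "good enough", then pick the quickest to fit
good_enough = demo_results[demo_results["mean_test_score"] >= best_score - TOLERANCE]
fastest_good = good_enough.sort_values("mean_fit_time").iloc[0]

print(fastest_good["params"], fastest_good["mean_test_score"], fastest_good["mean_fit_time"])
```

In this made-up example candidate A has the best score, but B is within the tolerance and fits thirty times faster, so B is the one selected.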

After the grid search has explored all the combinations, it selects the best set of hyperparameters and refits a model with them. This model may improve on our previous efforts. Notice that I have not re-tested every model we created previously; instead I opted to explore a variety of other hyperparameters to see whether a better prediction is possible than we achieved earlier in the notebook.

The points to take away from this are:

1. There are an infinite number of hyperparameter combinations
1. It's impossible to explore them all by hand
1. For each combination a model must be trained



In [None]:
# note the hyperparameters that the gridsearch discovered achieve the best results
clf.best_estimator_
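Besides `best_estimator_`, a fitted grid search also exposes `best_params_` and `best_score_`. Here is a minimal sketch on synthetic data with a deliberately tiny grid so it runs quickly; `X_demo`, `y_demo`, `small_grid` and `search` are names invented for this example.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

warnings.filterwarnings("ignore")  # hide ConvergenceWarning from the short training runs

# Synthetic stand-in data (assumption), not the Titanic data
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# 2 x 2 = 4 combinations, each cross validated on 3 folds
small_grid = {"hidden_layer_sizes": [(5,), (10,)],
              "learning_rate_init": [0.01, 0.001]}

search = GridSearchCV(MLPClassifier(max_iter=50, random_state=0),
                      small_grid, scoring="f1", cv=3)
search.fit(X_demo, y_demo)

print(search.best_params_)                # the winning combination as a dictionary
print(search.best_score_)                 # its mean cross-validated F1
print(len(search.cv_results_["params"]))  # number of combinations tried
```

`best_score_` is the mean score across the cross validation folds for the winning combination, which makes it a fairer number to compare against earlier cross validation results than a score computed on the data the final model was refit on.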

In [None]:
pd.DataFrame(cross_validate(clf, 
               X,
               y,
               cv = 3,
               return_train_score=True,
               scoring=['accuracy','recall','precision','f1']))

Compare this test_f1 to the test_f1 results we had from the other models. Have we improved the model's F1 score on the test set? And was it easier than all the code we wrote before?