## Mission

## A note on cross-validation/validation

This notebook is for learning purposes. It includes two approaches to validation: cross-validating the training data and validating using a separate validation dataset. In practice, you generally will only use one or the other for a given project. 

Cross-validation is more rigorous, because it maximizes the usage of the training data, but if you have a very large dataset or limited computing resources, it may be better to validate with a separate validation dataset.
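
As a rough illustration of the two approaches, the sketch below scores the same model once with 5-fold cross-validation and once against a single held-out validation split. It uses synthetic data from `make_classification` rather than the churn data, and the names `X_demo`, `y_demo`, `cv_f1`, and `val_f1` are placeholders chosen for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic, imbalanced stand-in data (assumption: not the churn dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=8,
                                     weights=[0.8, 0.2], random_state=0)

# Hold back a test set first; both validation approaches only see the rest
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25,
                                          stratify=y_demo, random_state=0)

model = RandomForestClassifier(n_estimators=25, random_state=0)

# Approach 1: 5-fold cross-validation on the training data (one F1 per fold)
cv_f1 = cross_val_score(model, X_tr, y_tr, cv=5, scoring='f1')

# Approach 2: a single, separate validation split carved out of the training data
X_fit, X_val, y_fit, y_val = train_test_split(X_tr, y_tr, test_size=0.2,
                                              stratify=y_tr, random_state=0)
val_f1 = f1_score(y_val, model.fit(X_fit, y_fit).predict(X_val))

print('Mean cross-validated F1:', cv_f1.mean())
print('Single validation-split F1:', val_f1)
```

Cross-validation fits the model five times here, while the single split fits it once, which is the trade-off described above.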

This notebook covers the following topics:

*   Encoding of categorical features as dummies
*   Stratification during data splitting
*   Fitting a model
*   Using `GridSearchCV` to cross-validate the model and tune the following hyperparameters:  
    - `max_depth`  
    - `max_features`  
    - `min_samples_split`
    - `n_estimators`  
    - `min_samples_leaf`  
*   Model evaluation using precision, recall, and f1 score

---

>  **Modeling objective:** To predict whether a customer will churn&mdash;a binary classification task.


>  **Target variable:** `Exited` column&mdash;0 or 1.  

>  **Class balance:** The data is imbalanced 80/20 (not churned/churned), but we will not perform class balancing.

>  **Primary evaluation metric:** F1 score.

>  **Modeling workflow and model selection:** The champion model will be the model with the best validation F1 score. Only the champion model will be used to predict on the test data. See the annotated decision tree notebook for details and limitations of this approach.
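
To see why F1 is a more informative metric than accuracy on an 80/20 class split, consider the toy labels below. They are invented for illustration and are not predictions from any model in this notebook.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# 100 hypothetical customers: 80 stayed (0), 20 churned (1)
y_true = [0] * 80 + [1] * 20

# A "model" that always predicts "not churned"
y_naive = [0] * 100
naive_accuracy = accuracy_score(y_true, y_naive)
naive_f1 = f1_score(y_true, y_naive, zero_division=0)

# A model that finds 10 of the 20 churners and raises 5 false alarms
y_model = [0] * 75 + [1] * 5 + [1] * 10 + [0] * 10
precision = precision_score(y_true, y_model)  # TP / (TP + FP) = 10 / 15
recall = recall_score(y_true, y_model)        # TP / (TP + FN) = 10 / 20
f1 = f1_score(y_true, y_model)                # 2 * P * R / (P + R) = 20 / 35

print('Naive accuracy:', naive_accuracy, '| naive F1:', naive_f1)
print('Model accuracy:', accuracy_score(y_true, y_model), '| model F1:', round(f1, 4))
```

The always-negative predictor reaches 80% accuracy while its F1 score is 0, because it never identifies a single churner.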

In [1]:
## Import statements

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max_columns', None)

## scikit-learn
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import (accuracy_score, f1_score, recall_score, precision_score,
                             confusion_matrix, ConfusionMatrixDisplay, classification_report)

from sklearn.ensemble import RandomForestClassifier

## Saving models with this module
import pickle

In [2]:
## Usual parameters to adjust
df = pd.read_csv('rfParameters.csv')
df.head(10)

Unnamed: 0,Parameter,Description,Possible Values,Best Options
0,n_estimators,The number of trees in the forest.,"Integer, e.g. 100, 200, 500",100-1000
1,max_depth,The maximum depth of the trees in the forest.,"Integer, e.g. 3, 5, 10",5-10
2,min_samples_split,The minimum number of samples required to spli...,"Integer, e.g. 2, 5, 10",2-10
3,min_samples_leaf,The minimum number of samples required to be a...,"Integer, e.g. 1, 2, 5",1-5
4,max_features,The number of features to consider when lookin...,"Integer, ""auto"", ""sqrt"", ""log2""","""auto"""
5,criterion,The function to measure the quality of a split.,"""gini"", ""entropy""","""gini"""
6,bootstrap,Whether to bootstrap the samples when building...,"Boolean, True, False",TRUE
7,oob_score,Whether to calculate an out-of-bag score.,"Boolean, True, False",TRUE


In [16]:
## locations:
loc1 = '/home/gato/Scripts/DS/MachineLearning/data/churn.csv'
loc2 = '/home/gato/Scripts/DS/MachineLearning/data/churnModeling.csv'

In [17]:
## Reading data
df = pd.read_csv(loc1)
df.head()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


In [18]:
## Reading data 2
## This data was cleaned, checked, and brought to this stage in an earlier task
df2 = pd.read_csv(loc2)
df2.head()

Unnamed: 0,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Loyalty,Geography_Germany,Geography_Spain
0,619,42,2,0.0,1,1,1,101348.88,1,0.047619,0,0
1,608,41,1,83807.86,1,0,1,112542.58,0,0.02439,0,1
2,502,42,8,159660.8,3,1,0,113931.57,1,0.190476,0,0
3,699,39,1,0.0,2,0,0,93826.63,0,0.025641,0,0
4,850,43,2,125510.82,1,1,1,79084.1,0,0.046512,0,1
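
The cleaning itself happened in an earlier task and is not shown here, so the following is only a guess at how the extra columns could be produced. It assumes `Geography` was one-hot encoded with the first category dropped, and that `Loyalty` is `Tenure / Age`, which is consistent with the displayed values (for example, 2 / 42 ≈ 0.047619). The first three rows reuse values shown above; the Germany row is made up so that all three categories appear.

```python
import pandas as pd

# Small stand-in for the raw data
raw = pd.DataFrame({
    'Geography': ['France', 'Spain', 'France', 'Germany'],
    'Age': [42, 41, 42, 29],
    'Tenure': [2, 1, 8, 4],
})

# One-hot encode Geography; drop_first=True drops the first category (France),
# leaving Geography_Germany and Geography_Spain
encoded = pd.get_dummies(raw, columns=['Geography'], drop_first=True, dtype=int)

# Assumed definition of the engineered feature: tenure as a fraction of age
encoded['Loyalty'] = encoded['Tenure'] / encoded['Age']

print(encoded)
```

Dropping one dummy column avoids encoding redundant information: a customer with 0 in both remaining columns is in France.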


In [19]:
x_columns = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard',
       'IsActiveMember', 'EstimatedSalary', 'Loyalty',
       'Geography_Germany', 'Geography_Spain']

In [20]:
## Splitting the data
y = df2['Exited']
X = df2[x_columns]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y, random_state=42)
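
A quick way to check what `stratify=y` does is to compare class proportions after the split. The sketch below uses a synthetic 80/20 target (`y_demo`, a placeholder for `Exited`) with the same `test_size` and `random_state` as above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic target with the same 80/20 imbalance as the churn data
y_demo = np.array([0] * 800 + [1] * 200)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # dummy single feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

# With stratification, both subsets keep the original share of positives
print('Train churn rate:', y_tr.mean())
print('Test churn rate: ', y_te.mean())
```

Without `stratify`, the two rates would generally drift apart by chance, which matters more as the minority class gets smaller.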


### Cross-validated hyperparameter tuning

The cross-validation process is the same as it was for the decision tree model. The only difference is that we're tuning more hyperparameters now. The steps are included below as a review. 

For details on cross-validating with `GridSearchCV`, refer back to the decision tree notebook, or to the [GridSearchCV documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV) in scikit-learn.

1. Instantiate the classifier (and set the `random_state`). 

2. Create a dictionary of hyperparameters to search over.

3. Create a collection of scoring metrics to capture (a list of scorer names in this notebook).

4. Instantiate the `GridSearchCV` object. Pass as arguments:
  - The classifier (`rf`)
  - The dictionary of hyperparameters to search over (`cv_params`)
  - The scoring metrics to capture (`scoring`)
  - The number of cross-validation folds you want (`cv=5`)
  - The scoring metric that you want GridSearch to use when it selects the "best" model (i.e., the model that performs best on average over all validation folds) (`refit='f1'`)

5. Fit the data (`X_train`, `y_train`) to the `GridSearchCV` object (`rf_cv`).

Note that we use the `%%time` magic at the top of the cell. This outputs the final runtime of the cell. (Magic commands, often just called "magics," are commands that are built into IPython to simplify common tasks. They begin with `%` or `%%`.)


### Hyperparameter grid

Tune five hyperparameters:
- `max_depth`
- `min_samples_leaf`
- `min_samples_split`
- `max_features`
- `n_estimators`

Note that `None` is included among the `max_depth` values. This means that one of the options allows the trees to grow without a specific limit on their depth.

Instantiate the classifier with a **random state for reproducibility**, and specify the metrics that the grid search will capture.

**Instantiate the grid search object.**
It takes two positional arguments:
- the classifier
- the parameter grid

Pass the scoring metrics specified above and set `cv=5`, which means that each hyperparameter combination is cross-validated using five folds.

Lastly, specify **`refit='f1'`**. This is necessary when multiple scoring metrics are given, because it tells the grid search which one to use when it selects the best model: we want to check a few different metrics, but the one we care most about is the F1 score.
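
The role of `refit` can be checked on a small scale. The sketch below runs a deliberately tiny grid on synthetic data (all `demo`-style names are placeholders for this example). It first shows that leaving `refit` at its default with several metrics makes `fit` raise a `ValueError`, then that with `refit='f1'` the reported `best_score_` is the mean cross-validated F1 of the selected candidate.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: not the churn dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     weights=[0.8, 0.2], random_state=0)

rf_demo = RandomForestClassifier(n_estimators=10, random_state=0)
demo_params = {'max_depth': [2, None]}
demo_scoring = ['precision', 'recall', 'f1']

# Several metrics but no refit choice: fit() cannot pick a "best" model
try:
    GridSearchCV(rf_demo, demo_params, scoring=demo_scoring, cv=3).fit(X_demo, y_demo)
    refit_error = None
except ValueError as err:
    refit_error = str(err)

# With refit='f1', candidates are ranked by mean F1 across the folds
demo_cv = GridSearchCV(rf_demo, demo_params, scoring=demo_scoring, cv=3, refit='f1')
demo_cv.fit(X_demo, y_demo)

mean_f1 = demo_cv.cv_results_['mean_test_f1']
print('Mean F1 per candidate:', mean_f1)
print('Selected candidate index:', demo_cv.best_index_)
print('best_score_:', demo_cv.best_score_)
```

After fitting, `best_estimator_` holds a copy of the winning candidate retrained on all of the data passed to `fit`.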

In [None]:
%%time

rf = RandomForestClassifier(random_state=0)

cv_params = {'max_depth': [2,3,4,5, None], 
             'min_samples_leaf': [1,2,3],
             'min_samples_split': [2,3,4],
             'max_features': [2,3,4],
             'n_estimators': [75, 100, 125, 150]
             }  

scoring = ['accuracy', 'precision', 'recall', 'f1']

rf_cv = GridSearchCV(rf, cv_params, scoring=scoring, cv=5, refit='f1')

rf_cv.fit(X_train, y_train)
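
To get a sense of why this cell can take a while, it helps to count how many models the grid implies. The sketch below copies the `cv_params` grid from the cell above and uses scikit-learn's `ParameterGrid` to enumerate the combinations; `n_candidates`, `n_folds`, and `n_fits` are names introduced for this example.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the tuning cell
cv_params = {'max_depth': [2, 3, 4, 5, None],
             'min_samples_leaf': [1, 2, 3],
             'min_samples_split': [2, 3, 4],
             'max_features': [2, 3, 4],
             'n_estimators': [75, 100, 125, 150]}

n_candidates = len(ParameterGrid(cv_params))  # 5 * 3 * 3 * 3 * 4 combinations
n_folds = 5
n_fits = n_candidates * n_folds               # one fit per combination per fold

print(n_candidates, 'candidates,', n_fits, 'cross-validation fits')
```

That comes to 540 candidates and 2,700 cross-validation fits, plus one final refit of the best candidate on the full training data.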

The good news is that there's a way to save the fit model object to a specified location and then quickly read it back in.

**Once you find a model you're happy with**, you don't want to start from scratch every time you open your notebook, and that's where pickling comes in.

Pickle is a tool that saves the fit model object to a specified location and then quickly reads it back in.
It also allows you to use models that were fit somewhere else without having to train them yourself. Let's pick up where we left off and pickle the model.

- First, specify a file path to the directory where the model will be saved.
- Then create a `with open()` statement, passing to it the file path plus the name you want to use to save this model, followed by `.pickle`. This creates an empty pickle file. The second argument, `'wb'`, opens the file for writing **in binary mode, which is the format pickle uses.** Use `as` to assign the return value of `open()` to a local variable named `to_write`. Call `pickle.dump()` and pass it the fit model object, followed by the `to_write` variable.
- In the next cell, read the pickled model back in from where it's saved. The only differences in syntax are using `'rb'` to specify that we'll be reading binary, and using `pickle.load()` to assign a new variable, which points to the fit model. Make sure you give this new variable the same name you used for your fit model above, in this case `rf_cv`. If you comment out the line of code where you fit the model and the cell where you pickle the model, you can close the notebook, reopen it, and rerun all the cells without having to wait for the model to fit.

## Pickle  

When models take a long time to fit, you don’t want to have to fit them more than once. If your kernel disconnects or you shut down the notebook and lose the cell’s output, you’ll have to refit the model, which can be frustrating and time-consuming. 

`pickle` is a tool that saves the fit model object to a specified location, then quickly reads it back in. It also allows you to use models that were fit somewhere else, without having to train them yourself.

In [9]:
# Define a path to the folder where you want to save the model
path = '/home/gato/Scripts/DS/MachineLearning/'


This step will ***W***rite (i.e., save) the model, in ***B***inary (hence, `wb`), to the folder designated by the above path. In this case, the name of the file we're writing is `rf_cv_model.pickle`.

In [10]:
# Pickle the model
with open(path+'rf_cv_model.pickle', 'wb') as to_write:
    pickle.dump(rf_cv, to_write)

Once we save the model, we'll never have to re-fit it when we run this notebook. Ideally, we could open the notebook, select "Run all," and the cells would run successfully all the way to the end without any model retraining. 

For this to happen, we'll need to return to the cell where we defined our grid search and comment out the line where we fit the model. Otherwise, when we re-run the notebook, it would refit the model. 

Similarly, we'll also need to go back to where we saved the model as a pickle and comment out those lines.  

Next, we'll add a new cell that reads in the saved model from the folder we already specified. For this, we'll use `rb` (read binary) and be sure to assign the model to the same variable name as we used above, `rf_cv`.

In [11]:
# Read in the pickled model
with open(path + 'rf_cv_model.pickle', 'rb') as to_read:
    rf_cv = pickle.load(to_read)
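
As an alternative to commenting lines in and out, the load-or-fit decision can be made in code. The sketch below is one possible way to do it, not part of the original workflow: `load_or_fit` is a hypothetical helper, and the model, data, and temporary file path are stand-ins so the example runs on its own.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def load_or_fit(model_path, fit_fn):
    """Hypothetical helper: load a pickled model if the file exists, else fit and save it."""
    model_path = Path(model_path)
    if model_path.exists():
        with open(model_path, 'rb') as to_read:
            return pickle.load(to_read)
    model = fit_fn()
    with open(model_path, 'wb') as to_write:
        pickle.dump(model, to_write)
    return model


# Synthetic stand-in data and a small model (assumption: not the notebook's rf_cv)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
fit_calls = []


def fit_demo_model():
    fit_calls.append(1)  # record that an actual fit happened
    return RandomForestClassifier(n_estimators=5, random_state=0).fit(X_demo, y_demo)


demo_path = Path(tempfile.mkdtemp()) / 'demo_model.pickle'
first = load_or_fit(demo_path, fit_demo_model)   # file is missing: fits and saves
second = load_or_fit(demo_path, fit_demo_model)  # file exists: loads from disk

print('Number of fits performed:', len(fit_calls))
```

In this notebook, the equivalent call would wrap `rf_cv.fit(X_train, y_train)` in the fitting function and point at `path + 'rf_cv_model.pickle'`.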

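Note that a pickled model is not guaranteed to load under a different scikit-learn version than the one that saved it. In the error below, the saved tree nodes lack the `missing_go_to_left` field that the installed version expects. See the scikit-learn documentation on model persistence: https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations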


ValueError: node array from the pickle has an incompatible dtype:
- expected: {'names': ['left_child', 'right_child', 'feature', 'threshold', 'impurity', 'n_node_samples', 'weighted_n_node_samples', 'missing_go_to_left'], 'formats': ['<i8', '<i8', '<i8', '<f8', '<f8', '<i8', '<f8', 'u1'], 'offsets': [0, 8, 16, 24, 32, 40, 48, 56], 'itemsize': 64}
- got     : [('left_child', '<i8'), ('right_child', '<i8'), ('feature', '<i8'), ('threshold', '<f8'), ('impurity', '<f8'), ('n_node_samples', '<i8'), ('weighted_n_node_samples', '<f8')]
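
One way to catch this kind of mismatch before unpickling is to record the scikit-learn version next to the pickle and compare it on load. This is only a sketch of that idea: the sidecar file name, the temporary directory, and the small decision tree are all assumptions made for the example.

```python
import pickle
import tempfile
from pathlib import Path

import sklearn
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Small stand-in model (assumption: not the notebook's rf_cv)
X_demo, y_demo = make_classification(n_samples=100, n_features=6, random_state=0)
model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)

demo_dir = Path(tempfile.mkdtemp())
model_path = demo_dir / 'demo_model.pickle'
version_path = demo_dir / 'demo_model.sklearn_version.txt'  # hypothetical sidecar file

# Save the model and, separately, the version that produced it
with open(model_path, 'wb') as to_write:
    pickle.dump(model, to_write)
version_path.write_text(sklearn.__version__)

# On load, compare versions first; unpickle only when they match
saved_version = version_path.read_text().strip()
if saved_version == sklearn.__version__:
    with open(model_path, 'rb') as to_read:
        loaded = pickle.load(to_read)
else:
    loaded = None  # versions differ: refit the model instead of unpickling
    print(f'Pickle was saved with scikit-learn {saved_version}; refit the model.')
```

Keeping the version in a separate text file means it can still be read when the pickle itself cannot be loaded.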

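Once the pickled model loads successfully, everything above can run quickly and without refitting. If loading fails, as with the version mismatch above, refit the model instead. Either way, we can continue by using the model's `best_params_` attribute to check the hyperparameters that had the best average F1 score across all the cross-validation folds.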

In [None]:
%%time
rf_cv.fit(X_train, y_train)
rf_cv.best_params_
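
Beyond `best_params_`, the fitted search object keeps every metric for every candidate in `cv_results_`, which is easier to read as a DataFrame. The sketch below shows this on a tiny synthetic search. `demo_cv` stands in for `rf_cv`, and the grid is shrunk so the example runs in seconds.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data and a deliberately small grid
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     weights=[0.8, 0.2], random_state=0)
demo_cv = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {'max_depth': [2, None], 'n_estimators': [10, 20]},
    scoring=['accuracy', 'precision', 'recall', 'f1'],
    cv=3,
    refit='f1',
)
demo_cv.fit(X_demo, y_demo)

# One row per hyperparameter combination, one column per recorded statistic
results = pd.DataFrame(demo_cv.cv_results_)

# Mean cross-validated scores for the combination that won on F1
metric_cols = ['mean_test_accuracy', 'mean_test_precision',
               'mean_test_recall', 'mean_test_f1']
best_row = results.loc[demo_cv.best_index_, metric_cols]

print(demo_cv.best_params_)
print(best_row)
```

The same two lookups work on `rf_cv` once it has been fit or loaded.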