# Vehicle Loan Prediction Machine Learning Model

# Chapter 7 - Random Forest

### Recap and Load
- As always, let's begin by importing our libraries and loading the data
- Notice that we are importing RandomForestClassifier from sklearn.ensemble

*Throughout this chapter you may see slightly different results from those in the demo videos. The outputs vary due to the random nature of the random forest algorithm, but they should be similar to those in the videos*

*Some of the models we will build here are a bit more complex. If you run into memory-related issues, try to free up memory by closing any programs that you do not need to complete the chapter*

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, f1_score, accuracy_score, recall_score, roc_curve, auc, precision_score, plot_confusion_matrix

In [None]:
loan_df = pd.read_csv('../data/vehicle_loans_feat.csv', index_col='UNIQUEID')

Just like we did for Logistic Regression let's convert our categorical variables to the 'category' data type

In [None]:
#convert to category
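
As a rough sketch of the conversion, the cell below casts a few columns of a toy data frame to the 'category' data type. The toy frame and its column names are stand-ins chosen for illustration; in the notebook you would apply the same pattern to loan_df with the categorical columns chosen in the earlier chapters.

```python
import pandas as pd

# Toy frame standing in for loan_df; the columns here are illustrative assumptions
toy_df = pd.DataFrame({
    "EMPLOYMENT_TYPE": ["Salaried", "Self employed", "Salaried"],
    "STATE_ID": [3, 6, 3],
    "LTV": [73.2, 89.6, 65.0],
})

# Columns we want pandas to treat as categorical rather than numeric or text
category_cols = ["EMPLOYMENT_TYPE", "STATE_ID"]

# Convert each column in turn; numeric columns such as LTV are left untouched
for col in category_cols:
    toy_df[col] = toy_df[col].astype("category")

print(toy_df.dtypes)
```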

Now we can bring in the plot_roc_curve and eval_model functions we defined in chapter 6

In [None]:
#get plot roc curve

In [None]:
#get eval models
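
If you do not have the chapter 6 functions to hand, the sketch below is a simplified stand-in, not the chapter 6 definition. It assumes a fitted binary classifier with predict and predict_proba methods and simply returns the metrics discussed later in this chapter. The name eval_model_sketch and the synthetic data are made up for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, auc, f1_score, precision_score,
                             recall_score, roc_curve)


def eval_model_sketch(model, x_test, y_test):
    """Hypothetical stand-in for eval_model: returns key metrics as a dict."""
    preds = model.predict(x_test)
    probs = model.predict_proba(x_test)[:, 1]  # probability of the positive class
    fpr, tpr, _ = roc_curve(y_test, probs)     # points on the ROC curve
    return {
        "accuracy": accuracy_score(y_test, preds),
        "precision": precision_score(y_test, preds, zero_division=0),
        "recall": recall_score(y_test, preds),
        "f1": f1_score(y_test, preds),
        "auc": auc(fpr, tpr),
    }


# Synthetic data purely for demonstration
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
demo_model = LogisticRegression(max_iter=1000).fit(X[:400], y[:400])
metrics = eval_model_sketch(demo_model, X[400:], y[400:])
print(metrics)
```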


## Lesson 1 - Building The Forest

In this lesson, we will use the [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) from sklearn to build a Random Forest model for our data


### EXERCISE 

- We seem to be duplicating the code for creating training/test sets and dummy variables 
- Fill in the function definition below to take in a data frame, create dummy variables and split the data into train/test sets 
- The return statement has been filled out for you

### SOLUTION

In [None]:
def encode_and_split(loan_df):
    #type solution here

    return x_train, x_test, y_train, y_test
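
One possible shape for the solution is sketched below. It is a sketch under assumptions rather than the official answer: the target column name (LOAN_DEFAULT), the 80/20 split, the stratified sampling and the fixed random_state are all assumptions, so adjust them to match what you used in chapter 6. The toy data frame is made up so the cell can run on its own.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def encode_and_split_sketch(df, target_col="LOAN_DEFAULT", test_size=0.2, random_state=42):
    # One-hot encode every object/category column; numeric columns pass through
    encoded = pd.get_dummies(df, prefix_sep="_")

    # Separate the features from the target
    x = encoded.drop(columns=[target_col])
    y = encoded[target_col]

    # Stratify so train and test keep the same share of defaults (an assumption)
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=test_size, stratify=y, random_state=random_state
    )
    return x_train, x_test, y_train, y_test


# Toy data standing in for loan_df (25% defaults)
toy_df = pd.DataFrame({
    "LTV": [60.0 + i for i in range(20)],
    "EMPLOYMENT_TYPE": pd.Series(["Salaried", "Self employed"] * 10, dtype="category"),
    "LOAN_DEFAULT": [1, 0, 0, 0] * 5,
})

x_train, x_test, y_train, y_test = encode_and_split_sketch(toy_df)
print(x_train.shape, x_test.shape)
print(y_train.value_counts(normalize=True))
```

Stratifying on the target is one way to keep the class distribution the same in both sets, which matters here because defaults are the minority class.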

Now let's test our new function and create a training and test set for RandomForest, this time using the full set of features 

In [None]:
#run encode and split

In [None]:
#get training shape

In [None]:
#get testing shape

In [None]:
#check class distribution

Ok great, looks like we have a train and test set with the class distribution we want

### EXERCISE 

- Use [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) to train and evaluate a Random Forest Model
- HINT: The model can be trained using the [fit](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.fit) function

### SOLUTION

In [None]:
#type solution here
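
A minimal sketch of the train-and-evaluate pattern is shown below. It uses synthetic, imbalanced data from make_classification rather than the loan data, so the scores will not match the ones discussed in this chapter. In the notebook you would pass the output of encode_and_split to fit and then call eval_model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loan data: roughly 20% positive cases
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.8, 0.2],
                           random_state=42)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Default hyperparameters: 100 trees and no limit on tree depth
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(x_train, y_train)

# Evaluate on the unseen test set
test_preds = rf_model.predict(x_test)
accuracy = accuracy_score(y_test, test_preds)
print("Accuracy: ", accuracy)
print("Precision:", precision_score(y_test, test_preds, zero_division=0))
print("Recall:   ", recall_score(y_test, test_preds))
print("F1:       ", f1_score(y_test, test_preds))
```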

Let's take a minute to interpret these results 

### Accuracy 

- ~78%, similar to the simple logistic regression model we built already

### Precision 

- 39%, better than the simple Logistic Regression model, which had ~33%
- More of the instances we classified as defaults actually were defaults 
- However, most of the instances we classify as defaults are actually not defaults

### Recall 

- Recall has increased dramatically, from 0.03% to 4.5%!
- Random Forest picked up a lot more of the actual positive cases
- It still missed most of them

### F1

- The F1 score has also increased dramatically from 0.0006 to ~0.08! 
- There is a better balance between Precision and Recall for Random Forest
- Although this is still generally poor

### AUC 

- The area under the ROC curve has increased very slightly

### Probability Distributions 

- Plot shows bad class separation 
- Majority of cases unlikely to be classified as defaults 

Generally the random forest is better than Logistic Regression but it is still not doing a good job
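
To make the precision, recall and F1 readings above more concrete, here is a small worked example on ten made-up labels (not taken from the loan data). The model flags two cases as defaults and only one is right, so precision is 1/2. There are four real defaults and it finds one, so recall is 1/4.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Made-up labels: 1 = default, 0 = no default
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]

# ravel() flattens the 2x2 matrix in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = precision_score(y_true, y_pred)  # tp / (tp + fp) = 1 / 2
recall = recall_score(y_true, y_pred)        # tp / (tp + fn) = 1 / 4
f1 = f1_score(y_true, y_pred)                # 2 * p * r / (p + r) = 1 / 3
accuracy = accuracy_score(y_true, y_pred)    # (tp + tn) / total = 6 / 10

print(tn, fp, fn, tp)
print(precision, recall, f1, accuracy)
```

Notice that accuracy is 60% even though the model misses three of the four defaults, which is why accuracy alone is a poor guide on imbalanced data.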

## Lesson 2 - Overfitting

A model is said to be overfitted if it performs very well on training data but does not generalize well to unseen test data

We can evaluate our model's performance on the training data to investigate overfitting

In [None]:
#eval on training data
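
The comparison can be sketched as below: score the same fitted model on the data it was trained on and on the held-out test data, then look at the gap. This uses noisy synthetic data instead of the loan data, so the numbers are only illustrative. In the notebook you would call eval_model with x_train and y_train.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# flip_y adds label noise, which a fully grown forest tends to memorise
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

rf_model = RandomForestClassifier(random_state=42).fit(x_train, y_train)


def score(model, x, y_actual):
    # Accuracy and F1 for one data set
    preds = model.predict(x)
    return {"accuracy": accuracy_score(y_actual, preds), "f1": f1_score(y_actual, preds)}


train_scores = score(rf_model, x_train, y_train)
test_scores = score(rf_model, x_test, y_test)

# A large gap between the two is the typical sign of overfitting
print("train:", train_scores)
print("test: ", test_scores)
```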

Wow! Pretty clear evidence that our random forest is overfitting: it has nearly perfect results on the training data and poor results on the test data

## Lesson 3 - Hyperparameters 

The classification performance of a random forest can be heavily influenced by its hyperparameters

### Hyperparameter Tuning 

- The process of selecting optimal hyperparameters 
- Can be tricky and time-consuming; many automated methods exist for finding the parameters that yield the best classification results
- Out of scope of this course but if you are interested look at [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)
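
For the curious, a minimal GridSearchCV sketch is shown below. The grid values, the roc_auc scoring choice, the 3 folds and the synthetic data are all assumptions picked to keep the example small and quick. A real search over the loan data would be much slower.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic data set so the search finishes quickly
X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# Every combination of these values is tried (2 x 3 = 6 candidates)
param_grid = {
    "n_estimators": [10, 50],
    "max_depth": [3, 6, None],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",  # rank candidates by cross-validated AUC
    cv=3,               # 3-fold cross-validation
)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```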

### Number of Trees
- How big is the forest?
- Typically, increasing the number of trees improves model performance up to a certain point
- Beyond that point, adding more trees increases the computational cost and does little to benefit classification performance
- Set through the n_estimators parameter

### Maximum Depth
- The longest path between a tree root node and its deepest leaf node
- Exposed through the max_depth parameter, which defaults to None, meaning the depth is not limited
- Limiting the depth of the trees can be used to reduce overfitting

### Number of Trees 

Let's do some manual exploration of the forest size parameter, remember the default value is 100 

In [None]:
# 1 estimator

- With a forest size of 1, the random forest behaves as a standalone decision tree and is unable to distinguish between the two classes
- With an AUC of 0.52 it is only marginally better than a random classifier

Let's see what happens if we increase the number of trees to 10

In [None]:
#10 estimators

- We see here that with a forest size of 10 the separation ability of the model improves, with an AUC of 0.58
- Multiple peaks on the distribution chart suggest that this is not a very stable model

How about with the default value of 100 trees?

In [None]:
#100 estimators


- With 100 estimators the AUC improves from 0.58 to 0.62
- Class distributions appear more defined and settled

What about if we increase to 300?

*NB - You might not be able to run all the scenarios due to system capacity. If you receive a "MemoryError", it means the model is too expensive to run on your computer. Try reducing n_estimators to 200 or 150.*

In [None]:
#300 estimators

Very similar performance to the default value of 100! 

Increasing the size of the forest helps classification performance up to a point

However, it also increases the computational cost of training the model
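
The manual exploration above can also be written as a loop. The sketch below fits forests of a few sizes and records the test AUC for each. It uses synthetic data, so the AUC values will not match the 0.52, 0.58 and 0.62 quoted above, and roc_auc_score is used here as a shortcut for roc_curve followed by auc.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Noisy synthetic data standing in for the loan data
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

forest_sizes = [1, 10, 100]
auc_by_size = {}

for n_trees in forest_sizes:
    model = RandomForestClassifier(n_estimators=n_trees, random_state=42)
    model.fit(x_train, y_train)
    probs = model.predict_proba(x_test)[:, 1]  # probability of the positive class
    auc_by_size[n_trees] = roc_auc_score(y_test, probs)

for n_trees, score in auc_by_size.items():
    print(n_trees, "trees -> test AUC", round(score, 3))
```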

### Maximum Depth

We observed earlier that our random forest model is overfitting

One way of tackling overfitting in random forest is by limiting the Maximum Depth of the trees. This prevents the individual trees from growing too large and picking up noise in the training data

The default value of max_depth is None (it is not limited!)

Let's do some experiments

*NB - You might not be able to run all the scenarios due to system capacity. If you receive a "MemoryError", it means the model is too expensive to run on your computer.*

In [None]:
#max depth 5

We have increased the AUC but the model is failing to identify any loan defaults

Let's take a look at how it performs on the training data

In [None]:
#check the overfitting

As with the test data, the model is not identifying any defaults.

Very similar performance between training and test data tells us we are not overfitting anymore, but the model has very little predictive power

Limiting the tree depth to 5 has probably oversimplified the model and actually given us an underfit model!

Let's try again with a larger max_depth

In [None]:
#max depth 15

A few things to note here! 

We have increased the AUC to ~0.65. This model has the best ability to separate classes that we have seen so far! 

It also has a very good precision score of 67%, but we are still identifying very few loan defaults, hence the poor recall

Let's have a look at the training set performance!

In [None]:
#check the overfitting

Our model does perform better on the training data so it could be a little overfitted. However, it certainly is much less dramatic than before! 

We have now limited the complexity of the trees in our forest which has reduced overfitting. 
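
The same experiment can be sketched as a loop over max_depth values, comparing train and test AUC for each. This is a sketch on synthetic data with 50 trees to keep it quick, so the numbers will differ from the ones above. It also reports the deepest tree actually grown, read from each fitted tree with get_depth.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Noisy synthetic data standing in for the loan data
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

depth_results = []
for depth in [5, 15, None]:  # None means the depth is not limited
    model = RandomForestClassifier(n_estimators=50, max_depth=depth, random_state=42)
    model.fit(x_train, y_train)
    depth_results.append({
        "max_depth": depth,
        "train_auc": roc_auc_score(y_train, model.predict_proba(x_train)[:, 1]),
        "test_auc": roc_auc_score(y_test, model.predict_proba(x_test)[:, 1]),
        # Deepest tree in the forest, to confirm the limit was applied
        "deepest_tree": max(tree.get_depth() for tree in model.estimators_),
    })

for row in depth_results:
    print(row)
```

A smaller gap between train_auc and test_auc suggests less overfitting, but as we saw with a depth of 5, a small gap on its own does not mean the model is useful.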

### A Note on Hyperparameter Tuning 

- We have discussed the effects of n_estimators and max_depth in isolation 
- Random Forest has many more parameters that can be tuned 
- In reality, parameters are dependent on each other, i.e. the best value for one depends on the values chosen for the others
- Automated methods to find the right balance exist, look at [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)