* Actual coding begins at `17:39`
* Updated version of this lesson available at: https://www.kaggle.com/code/ailobe/fastai-ml1-lesson2-rf-interpretation/notebook
* The command `git restore .` will remove any changes and allow fresh pull from repo

# Random Forest Interpretation - Chapter 4

# 0. Pre-Discussion

* `set_rf_samples` sets how many rows are sampled to build each tree
* Before we start building trees we have two choices:
    * Sample with replacement from the entire dataset
    * Subsample from the dataset
* In the latter, each tree is built from only a small subset of the dataset
* This trick is often used when the dataset is very large
* The sample of rows that each tree is trained on is sometimes called a `bootstrap sample`
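
The two sampling choices above can be sketched with the standard library alone. This is only an illustration: the dataset size and subsample size are made-up numbers, and `subsample_size` merely plays the role of the argument passed to `set_rf_samples`.

```python
import random

rng = random.Random(0)          # fixed seed so the sketch is repeatable
n_rows = 1000                   # hypothetical dataset size
row_ids = range(n_rows)

# Choice 1: n_rows draws WITH replacement from the whole dataset
bootstrap_rows = [rng.choice(row_ids) for _ in range(n_rows)]

# Choice 2: a much smaller sample of rows for one tree
subsample_size = 100            # hypothetical, stands in for the set_rf_samples argument
subsample_rows = [rng.choice(row_ids) for _ in range(subsample_size)]

print(len(set(bootstrap_rows)), "distinct rows in the full-size sample")
print(len(set(subsample_rows)), "distinct rows in the subsample")
```

Because rows are drawn with replacement, the full-size sample contains repeats and therefore fewer distinct rows than the dataset has.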

---

* On the `growth scale` of the RF, consider the depth of each tree to be about $\log_2 (set\ rf\ samples)$
* The `no. of leaf nodes` of a fully grown tree is equal to set_rf_samples
* Hence there is a `linear relationship` between set_rf_samples and the number of leaf nodes
* So, in a sense, set_rf_samples also decides the number of decisions made by each tree
* Therefore, with fewer samples the RF is going to be `less rich` in what it can predict, as it will make `fewer binary choices`
* How does this relate to overfitting? --> a low set_rf_samples means a `lower chance` of `overfitting`
* But it also means each individual tree in the forest will be `less accurate`
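
As a rough worked example of the depth and leaf-count relationship, the sketch below assumes a perfectly balanced tree grown until every leaf holds one row. Real trees are not perfectly balanced, so treat the numbers as orders of magnitude. The helper name `balanced_tree_depth` is made up for this illustration.

```python
import math

def balanced_tree_depth(n_samples):
    """Depth needed for a balanced binary tree to give every row its own leaf."""
    return math.ceil(math.log2(n_samples))

for n in (20_000, 50_000, 1_000_000):
    # leaves grow linearly with n, depth only logarithmically
    print(f"{n:>9} samples -> {n:>9} leaves, depth about {balanced_tree_depth(n)}")
```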

---

* Now looking in depth at the idea behind models with `bagging`
* You are trying to do two things:
    * A) Make each individual estimator as accurate as possible $\uparrow$ on the training set
    * B) Keep the correlation between the estimators as low as possible $\downarrow$
    * So when you `average them` together you end up with `better generalization`
* Hence, by giving set_rf_samples a low number, you make each tree less accurate (`A` gets worse) but also make the trees less correlated (`B` gets better)
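
Why low correlation matters can be seen from the standard formula for the variance of an average of equally noisy, equally correlated estimators. The sketch below assumes every tree's error has the same variance `sigma2` and every pair of trees has the same correlation `rho`. That is a simplification, and the function name is hypothetical.

```python
def ensemble_variance(sigma2, rho, n_trees):
    """Variance of the average of n_trees estimators with pairwise correlation rho."""
    return rho * sigma2 + (1 - rho) * sigma2 / n_trees

for rho in (0.0, 0.5, 1.0):
    # with rho = 1 averaging does not help at all; with rho = 0 the variance shrinks by 1/n_trees
    print(rho, ensemble_variance(sigma2=1.0, rho=rho, n_trees=40))
```

The first term does not shrink as more trees are added, which is why lowering the correlation (B) can be worth a small loss in per-tree accuracy (A).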

---

* Now what happens when you set `oob_score` to True
* Recall that there are `residual` rows that didn't get included in a tree's training sample at the `subsampling stage`
* You can essentially construct a `quasi` validation set from these
* If you would rather not subsample, `reset_rf_samples()` undoes `set_rf_samples` so that each tree again samples from the entire dataset. Note that `oob_score` only works while some rows are left out of each tree: with `bootstrap=False` every tree sees every row and no OOB score can be computed
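
To get a feel for how many rows end up in this quasi validation set, here is a small standard-library simulation. It assumes rows are drawn with replacement; the helper `oob_rows` and the sizes are made up for the illustration.

```python
import random

def oob_rows(n_rows, sample_size, seed):
    """Row ids never drawn when one tree samples `sample_size` rows with replacement."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n_rows) for _ in range(sample_size)}
    return set(range(n_rows)) - drawn

n_rows = 10_000
full_bootstrap = oob_rows(n_rows, sample_size=n_rows, seed=1)
small_subsample = oob_rows(n_rows, sample_size=1_000, seed=1)

frac_full = len(full_bootstrap) / n_rows     # close to 1/e, roughly 0.37
frac_small = len(small_subsample) / n_rows   # at least 0.90, since at most 1,000 rows are drawn
print(frac_full, frac_small)
```

Each tree has its own left-out rows, so a row's OOB prediction is averaged only over the trees that never saw it.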

---

* Next up is `min_samples_leaf`: setting this from (e.g.) 1 to 2 means that the depth of each decision tree will be `reduced by 1`
* Because every time we `double` min_samples_leaf, we are removing `one layer` from each tree
* And the number of leaves will be `halved` if min_samples_leaf is `doubled`
* In this case, increasing min_samples_leaf makes each tree less accurate (`(A)` gets worse) and makes the trees less correlated (`(B)` gets better), which `might` help against `overfitting`
* Values of min_samples_leaf worth trying: *1, 3, 5, 10, 25, 100*
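
The effect on tree size can be checked on a single scikit-learn tree. This is a sketch on synthetic data (not the bulldozers dataset), so only the trend is meaningful: leaf counts and depth shrink as `min_samples_leaf` grows, though not by exactly one half per doubling because real trees are unbalanced.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                      # synthetic features
y = X[:, 0] + rng.normal(scale=0.5, size=1000)      # synthetic target

leaf_counts = {}
for msl in (1, 4, 16):
    tree = DecisionTreeRegressor(min_samples_leaf=msl, random_state=0).fit(X, y)
    leaf_counts[msl] = tree.get_n_leaves()
    print(f"min_samples_leaf={msl:>2}: {tree.get_n_leaves()} leaves, depth {tree.get_depth()}")
```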

---

* Finally, `max_features` determines what proportion of the features is considered `per split`
* So if max_features = 0.5, then at each split we consider a random half of the features
* This will `reduce` the `correlation` between the individual trees and *MAY* help with overfitting
* The trade-off is that each tree will be `less accurate`
* Options you can have for `max_features` are:
    * `sqrt` to allow the square root of the number of features
    * `log2` to allow log2 of the number of features
    * `None` to have all of them available at each split
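
To make the options concrete, the sketch below computes how many features would be considered at each split for the 66 columns of `X_train`. The helper `features_per_split` is hypothetical; it mirrors the rounding that scikit-learn documents for `max_features` rather than calling the library.

```python
import math

def features_per_split(max_features, n_features):
    """Number of candidate features per split for a given max_features setting."""
    if max_features is None:
        return n_features
    if max_features == "sqrt":
        return max(1, int(math.sqrt(n_features)))
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))
    if isinstance(max_features, float):
        return max(1, int(max_features * n_features))   # a fraction of the features
    return int(max_features)                            # an absolute count

for setting in (0.5, "sqrt", "log2", None):
    print(setting, "->", features_per_split(setting, n_features=66))
```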

# 1. Libraries and Modules Import

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
import os
currDir = os.getcwd()
os.chdir("../fastai/")
from structured import *       
from imports import *
os.chdir(currDir)
# ____________________________________________________________ #
from pandas_summary import DataFrameSummary
from IPython.display import display

from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
# ____________________________________________________________ #
import math
import numpy as np
import warnings
warnings.filterwarnings("ignore")

In [3]:
PATH2DATA = "../datasets/kaggle/bluebook_bulldozers/"
# !dir "../datasets/kaggle/corporcion_favorita_grocery_sales/"

# below is just the param to control different features in a graph plot
set_plot_sizes(12, 14, 16)


# 2. Load dataset and Pre-Process

In [4]:
df_raw = pd.read_csv(f'{PATH2DATA}Train.csv',
                     low_memory=False, parse_dates=["saledate"])

# take the log of SalePrice, since the competition metric is RMSLE
df_raw.SalePrice = np.log(df_raw.SalePrice)

# extract date parts (year, month, day of week, etc.) from saledate
add_datepart(df_raw, 'saledate')

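# convert string columns to pandas categoricals
train_cats(df_raw)
# assign the result instead of using inplace=True, which newer pandas versions no longer accept
df_raw.UsageBand = df_raw.UsageBand.cat.set_categories(
    ["High", "Medium", "Low"], ordered=True)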

# use proc_df to turn the categorical columns into numeric codes and split off the target
df_train, y_train, _ = proc_df(df_raw, 'SalePrice')


In [5]:
def split_vals(a, n):
    """
    a: the full dataset (DataFrame or array) to split
    n: number of rows to put in the training set
    """
    # a[:n] retrieves the first n (= N - n_valid) rows for the TRAINING set
    # a[n:] retrieves the last n_valid rows for the VALIDATION set
    return a[:n].copy(), a[n:].copy()

n_valid = 12000

# the number of training rows will be len(df_train) - n_valid
n_trn = len(df_train) - n_valid

# now split the entire dataset into training and validation
raw_train, raw_valid = split_vals(df_raw, n_trn)

# the split above was on the raw frame; now split the pre-processed features and target
X_train, X_valid = split_vals(df_train, n_trn)
y_train, y_valid = split_vals(y_train, n_trn)

In [6]:
print("X_train shape: {}, y_train shape: {}, X_valid shape: {}, y_valid shape: {}".format(
    X_train.shape, y_train.shape, X_valid.shape, y_valid.shape))

X_train shape: (389125, 66), y_train shape: (389125,), X_valid shape: (12000, 66), y_valid shape: (12000,)


# 3.0. Model Setup and Initial Run

## 3.1. Some pre-defined functions for output and visualization

In [7]:
# function that computes the RMSE between predictions and known values
def rmse(pred, known):
    return np.sqrt(((pred-known)**2).mean())

# function to round answers to 5 d.p. like Kaggle leaderboard scores
def rounded(value):
    return np.round(value, 5)

# function to print the RMSE scores and R^2 values for the training and validation sets
def print_scores(model):
    RMSE_train = rmse(model.predict(X_train), y_train)
    RMSE_valid = rmse(model.predict(X_valid), y_valid)
    R2_train = model.score(X_train, y_train)
    R2_valid = model.score(X_valid, y_valid)

    # list the scores and check if oob_score is present
    scores = [rounded(RMSE_train), rounded(RMSE_valid),
              rounded(R2_train), rounded(R2_valid)]
    if hasattr(model, 'oob_score_'):
        scores.append(model.oob_score_)
    print(scores)

## 3.2. Subsampling

* Interpretation of the model has less to do with getting the `best accuracy` and more to do with gaining `insights` about the data
* In other words, how the features within the data are `correlated`
* For this to be tested, the model must first be `reliable`
* So when subsampling we need to make sure the subsample is `large enough` for the model to be reliable
* The lecture uses about 50000 samples per tree; in the code below the `set_rf_samples` call is commented out, so each tree is trained on a bootstrap sample of the full training set
* Recall that `oob_score=True` scores each row using only the trees that did not see it, which is most useful when the dataset is too small to spare a separate validation split
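
Since the old `set_rf_samples` helper errors out in this notebook, one possible substitute is scikit-learn's own `max_samples` argument, which caps the number of rows drawn for each tree when `bootstrap=True`. The sketch below assumes scikit-learn 0.22 or newer and uses small synthetic data so it runs quickly; on the bulldozers data you would pass `X_train`, `y_train` and something like `max_samples=50000` instead. It is not the lecture's method, only a stand-in.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))                                        # synthetic features
y = 2 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=2000)         # synthetic target

# each tree is trained on 500 rows drawn with replacement from the 2000 available
model = RandomForestRegressor(n_estimators=10, min_samples_leaf=3, max_features=0.5,
                              max_samples=500, oob_score=True, n_jobs=1, random_state=0)
model.fit(X, y)
print("OOB R^2:", round(model.oob_score_, 3))
```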

In [9]:
# set_rf_samples(5000)   # the old set_rf_samples is causing error

# first run of the model
model1st = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, max_features=0.5,
                                 n_jobs=-1, oob_score=True, random_state=17190)

In [10]:
# fit the model and measure time taken
%time model1st.fit(X_train, y_train)

Wall time: 20.7 s


RandomForestRegressor(max_features=0.5, min_samples_leaf=3, n_estimators=40,
                      n_jobs=-1, oob_score=True, random_state=17190)

## 3.3. One-Hot Encoding (@ 41:00)

### 3.3.1. Lecture Notes

* Earlier we had converted `UsageBand` to integer codes (its `cat.codes`), following the order passed to `set_categories`
    * High $\rightarrow$ 0
    * Medium $\rightarrow$ 1
    * Low $\rightarrow$ 2
* Obviously the RF doesn't know the categories, it just sees 0, 1 and 2
* To isolate one category it may need two or three `nested` splits
* But let's say, for the sake of argument, we had a wider range, with levels such as Very Low, Very Medium, Very High or Unknown
* Having a larger `range` of levels would `increase` the `number of splits` needed
* This is inefficient, as every time we do a nested split we are `halving` the amount of data
* Instead we can split the column into separate `binary` columns, e.g. *isVeryLow*, *isVeryMedium*, or *isVeryHigh*
* This reduces the number of `nested splits` needed to isolate a category down to 1, which is ideal for efficiency

---

* This is known as `one-hot encoding`
* With linear models, having too many columns that are too similar is a problem, since `linear models hate collinearity`
* In our case (random forests) it is not that big of a deal
* We are mainly doing one-hot encoding for the sake of `interpreting` the ML model
* Sometimes it reveals the true `influence/importance` of a feature that stayed latent because the nested splits could not capture it
* One-hot encoding is performed by pandas using `pd.get_dummies` $\rightarrow$ you can get more info on this using `??numericalize`
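
A minimal `pd.get_dummies` example on a toy frame (the rows are made up; only the column names echo the bulldozers data) shows what the binary columns look like:

```python
import pandas as pd

df = pd.DataFrame({
    "UsageBand": ["High", "Low", "Medium", "Low"],
    "YearMade": [2004, 1996, 2001, 2007],
})

# one binary column per category; the numeric column is left untouched
encoded = pd.get_dummies(df, columns=["UsageBand"])
print(list(encoded.columns))
print(encoded)
```

A tree can now isolate, say, the `Low` rows with a single split on `UsageBand_Low`.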

---

* It is mainly implemented by setting the `max_n_cat` argument in `proc_df`
* This argument is the limit on `cardinality`: a categorical column is one-hot encoded only if its number of categories does not exceed max_n_cat
* For example, UsageBand has Low, Medium and High, i.e. cardinality = 3; Sex has Male, Female, i.e. cardinality = 2
* So if we set max_n_cat = 7, all categorical columns with cardinality of 7 or less will be one-hot encoded
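
The cardinality rule can be sketched in a few lines of pandas. The helper `one_hot_candidates` and the `ModelCode` column are hypothetical; this only illustrates the selection logic described above, not the actual `proc_df` source.

```python
import pandas as pd

def one_hot_candidates(df, max_n_cat):
    """Names of categorical columns whose cardinality does not exceed max_n_cat."""
    return [name for name, col in df.items()
            if isinstance(col.dtype, pd.CategoricalDtype)
            and len(col.cat.categories) <= max_n_cat]

df = pd.DataFrame({
    "UsageBand": pd.Categorical(["High", "Medium", "Low"] * 4),      # cardinality 3
    "ModelCode": pd.Categorical([f"m{i}" for i in range(12)]),       # cardinality 12 (made up)
    "YearMade": range(1996, 2008),                                   # numeric, never one-hot encoded
})

print(one_hot_candidates(df, max_n_cat=7))
```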

---

* Awkward question asked `@53:02 - @53:35`, actually not so awkward :D ... some data is naturally ordered, for example by a `grading system`
* A grading system might, for example, run from poor, to good, to very good
* Using dummy variables (one-hot encoding) might destroy this order, so how do we overcome it?

[ANS] You can keep the order by turning the column into an integer before calling `proc_df`, i.e. by setting it equal to its `cat.codes`, for example **df_raw.UsageBand = df_raw.UsageBand.cat.codes**
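
As a small illustration of the answer, here is a made-up grading column turned into order-preserving integers with an ordered categorical dtype:

```python
import pandas as pd

grades = pd.Series(["good", "poor", "very good", "good"])

# declare the order explicitly, from worst to best
grade_type = pd.CategoricalDtype(["poor", "good", "very good"], ordered=True)
codes = grades.astype(grade_type).cat.codes

print(codes.tolist())   # [1, 0, 2, 1]
```

Because the column is now a plain integer, it is no longer categorical and so would not be expanded into dummy columns.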

### 3.3.2. Code

## 3.4. Removing Redundant Features (@ 54:50)

### 3.4.1. Lecture Notes

* This builds on the hierarchical sorting of feature importance
* You can find redundant features to remove with the help of `dendrograms`
* A dendrogram is produced by a type of `hierarchical clustering` algorithm
* Cluster analysis allows us to look at rows or columns and decide which ones are similar
* A well-known example of cluster analysis is `k-means`
* In hierarchical or agglomerative clustering, we look at every `pair` of `objects/points`
* And then decide which two objects are the closest
* Given those, delete them and replace them with an average that sits in the middle of those points
* Then iteratively perform this `pairwise combining`, that is, taking a pair of points and replacing them with their average
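
The pairwise combining described above can be sketched for one-dimensional points in plain Python. This is a deliberately simplified version (it uses an unweighted average and a brute-force search for the closest pair), not what a clustering library actually does internally.

```python
from itertools import combinations

def merge_closest_pairs(points):
    """Repeatedly replace the two closest points with their average.

    Returns the merges in order, each as (left, right, distance)."""
    points = list(points)
    merges = []
    while len(points) > 1:
        # find the indices of the closest pair
        i, j = min(combinations(range(len(points)), 2),
                   key=lambda ij: abs(points[ij[0]] - points[ij[1]]))
        a, b = points[i], points[j]
        merges.append((a, b, abs(a - b)))
        # delete the pair and put their midpoint in its place
        points = [p for k, p in enumerate(points) if k not in (i, j)]
        points.append((a + b) / 2)
    return merges

merges = merge_closest_pairs([1.0, 2.0, 10.0, 11.0, 30.0])
for left, right, dist in merges:
    print(f"merge {left} and {right} (distance {dist})")
```

The recorded merge distances are what a dendrogram draws: pairs merged at a small distance are the most similar.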

---

* In our example, instead of looking at pairs of points/objects, we look at pairs of `columns`, i.e. `variables`
* We want to know which two variables are the most similar
* The horizontal axis of the dendrogram shows how similar the two variables being compared are
* If the vertical line joining them is further to the right, the variables are more similar
* In this particular example we use `Spearman's R` (rank correlation) to measure the similarity between the variables
* `Correlation coeff.` are almost the same as `R^2`, except correlation is between two variables, and R^2 is between a variable and its prediction
* Also, instead of comparing the values directly, we compare their `rank`, so the correlation does not rely on a `linearity assumption` and picks up any monotonic relationship
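
The difference between ordinary (Pearson) and rank (Spearman) correlation can be seen on a relationship that is monotonic but far from linear. The helpers below are a plain-Python sketch that assumes there are no tied values; in practice a library routine would be used.

```python
import math

def pearson(x, y):
    """Ordinary correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """Rank of each value (1 = smallest); assumes no ties to keep the sketch short."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(x, y):
    """Spearman's R: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5, 6]
y = [math.exp(v) for v in x]        # grows monotonically with x, but not linearly

print("Pearson :", round(pearson(x, y), 3))
print("Spearman:", round(spearman(x, y), 3))
```

Spearman's R comes out as a perfect correlation because the ordering of `y` matches the ordering of `x`, while Pearson is pulled down by the curvature.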

---

* After that we get the (@ 1:04:33) `out-of-bag (OOB) score`
* Here a helper fits an RF on a given data frame and returns its oob_score_
* The idea is to compare the effect on the oob_score_ of removing some of the variables one at a time
* First you get a baseline oob_score_ by training on the entire data frame
* Then you remove variables one at a time and check the score; if the score does not get worse, that variable can be removed
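
The procedure above might look like the following sketch. It runs on synthetic data with a deliberately redundant column (`a_copy`); the helper name `get_oob`, the hyperparameters and the column names are assumptions for the illustration, not the lecture's exact code.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def get_oob(df, y):
    """Fit a small forest on df and return its out-of-bag R^2."""
    model = RandomForestRegressor(n_estimators=30, min_samples_leaf=5,
                                  oob_score=True, n_jobs=1, random_state=0)
    model.fit(df, y)
    return model.oob_score_

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
df = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.01, size=500),   # nearly a duplicate of "a"
    "b": b,
})
y = 2 * a + b + rng.normal(scale=0.1, size=500)

baseline = get_oob(df, y)
# OOB score with each column removed in turn
scores = {col: get_oob(df.drop(columns=col), y) for col in df.columns}

print("baseline:", round(baseline, 4))
for col, score in scores.items():
    print(f"without {col}: {round(score, 4)}")
```

Columns whose removal leaves the score about where the baseline was are candidates for dropping; a clear drop, as expected here for `b`, means the column carries information nothing else provides.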

### 3.4.2. Code

## 3.5. Partial Dependence (@ 1:07:20)

### 3.5.1. Lecture Notes

* The technique is not very well known but is very powerful

### 3.5.2. Code