* Actual coding begins at `17:39`

# Random Forest Interpretation - Chapter 4

# Pre-Discussion

* `set_rf_samples` sets how many rows each tree in the forest is trained on
* Before we start building trees we have two choices:
    * Sample with replacement from the entire dataset
    * Subsample from the dataset
* In the latter, each tree is built from only a small subset of the dataset
* This trick is often used when the dataset is very large
* The subsamples are sometimes loosely called `bootstrap samples`, although strictly a bootstrap sample is one drawn with replacement
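
The two choices above can be sketched with plain NumPy index sampling. This is only an illustration: `n_rows` and `n_sub` are made-up sizes, and the code is not taken from the fastai implementation of `set_rf_samples`.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
n_rows = 1_000                   # hypothetical dataset size
n_sub = 100                      # hypothetical subsample size

# Choice 1: bootstrap sample, n_rows draws WITH replacement from all rows.
bootstrap_idx = rng.integers(0, n_rows, size=n_rows)

# Choice 2: subsample, only n_sub draws (here also with replacement).
subsample_idx = rng.integers(0, n_rows, size=n_sub)

print("distinct rows in bootstrap sample:", len(np.unique(bootstrap_idx)))
print("distinct rows in subsample:", len(np.unique(subsample_idx)))
```

Either way some rows are never drawn for a given tree, and the subsample leaves far more of them out.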

---

* To gauge how large each tree grows, take its depth to be roughly $\log_2 (set\ rf\ samples)$, assuming a balanced tree grown until every leaf holds one sample
* The `no. of leaf nodes` is then equal to set_rf_samples
* Hence there is a `linear relationship` between set_rf_samples and the number of leaf nodes
* So, in a sense, the number of rf samples also decides the number of decisions made by the rf
* Therefore, with fewer samples the RF is going to be `less rich` in what it can predict, as it will make `fewer binary choices`
* How does this relate to overfitting? Basically, a low rf samples value means `less chance` of `overfitting`
* But it also means each individual tree in the forest will be `less accurate`
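
A quick arithmetic check of the depth and leaf-count rule of thumb. The helper `full_tree_size` is hypothetical and assumes a perfectly balanced tree grown to one sample per leaf, which real trees only approximate.

```python
import math

def full_tree_size(n_samples):
    """Approximate depth and leaf count of a balanced tree with 1 sample per leaf."""
    depth = math.log2(n_samples)   # each level halves the samples per node
    leaves = n_samples             # one leaf per training row
    return depth, leaves

for n in (1_000, 20_000, 389_125):
    depth, leaves = full_tree_size(n)
    print(f"{n:>7} rows -> about {depth:.1f} levels, {leaves} leaves")
```

Depth grows only logarithmically with the sample count, while the number of leaves grows linearly.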

---

* Now looking in depth at the idea behind models built with `bagging`
* You are trying to do two things:
    * A) Make each individual estimator as accurate as possible $\uparrow$ on the training set
    * B) Make the correlation between the estimators as low as possible $\downarrow$
    * So when you `average them out` together you end up with `better generalization`
* Hence, by setting set_rf_samples to a low number, you make each tree less accurate (hurting `A`) but make the trees less correlated (helping `B`)
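
The A versus B trade-off can be made concrete with the textbook variance of an average of equally correlated estimators, $\rho\sigma^2 + (1-\rho)\sigma^2 / T$. The sketch below assumes every tree has the same error variance `sigma2` and the same pairwise correlation `rho`, which is a simplification; the numbers are invented for illustration.

```python
def ensemble_variance(sigma2, rho, n_trees):
    """Variance of the mean of n_trees estimators with variance sigma2
    and pairwise correlation rho (idealised, identical trees)."""
    return rho * sigma2 + (1 - rho) * sigma2 / n_trees

# Accurate but highly correlated trees (strong on A, weak on B).
correlated = ensemble_variance(sigma2=1.0, rho=0.9, n_trees=40)

# Noisier but less correlated trees (weaker on A, stronger on B).
decorrelated = ensemble_variance(sigma2=1.5, rho=0.3, n_trees=40)

print("correlated forest:  ", correlated)
print("decorrelated forest:", decorrelated)
```

Under these assumed numbers, the forest of individually worse trees ends up with the lower variance once averaged.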

---

* Now what happens when you set `oob_score` to True?
* In this case, remember that there are `residual` rows that didn't get included in a given tree's training set after the `subsampling stage`
* You can essentially construct a `quasi` validation set from these rows
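* If you do not want subsampling, `reset_rf_samples()` undoes `set_rf_samples`, so each tree again draws a full-size bootstrap sample from the entire dataset. `oob_score` relies on some rows being left out of each tree, so it stops working only when every tree sees every row (e.g. `bootstrap=False` in scikit-learn)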
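
A minimal sketch of the out-of-bag idea using scikit-learn directly on synthetic data. It assumes scikit-learn 0.22 or newer, where the `max_samples` argument plays a role similar to `set_rf_samples`; the data and sizes are made up.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                          # synthetic features
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=500)    # synthetic target

# Each tree sees 200 sampled rows; the rows it did not see are "out of bag".
m = RandomForestRegressor(n_estimators=40, max_samples=200,
                          oob_score=True, random_state=0)
m.fit(X, y)

# R^2 computed only from predictions on rows each tree did not train on.
print("OOB R^2:", m.oob_score_)
```

The OOB score needs no separate validation split, which is why it acts as a quasi validation set.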

---

* Next up is `min_samples_leaf`. Raising it from, e.g., 1 to 2 means the depth of each decision tree is `reduced by about 1`
* Because every time we `double` min_samples_leaf, we remove roughly `one layer` from each tree in the forest
* And the number of leaves is roughly `halved` when min_samples_leaf is `doubled`
* In this case, increasing min_samples_leaf makes each tree less accurate (hurting `A`) and the trees less correlated (helping `B`), which `might` help against `overfitting`
* Values worth trying for min_samples_leaf: *1, 3, 5, 10, 25, 100*
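
To see the effect on tree size, here is a small experiment with a single scikit-learn `DecisionTreeRegressor` on random data. It assumes scikit-learn 0.21 or newer for `get_depth()` and `get_n_leaves()`; the data is pure noise, chosen only so the tree keeps splitting.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 1024
X = rng.normal(size=(n, 3))   # synthetic features
y = rng.normal(size=n)        # noise target, so the tree never runs out of splits

sizes = {}
for k in (1, 2, 4, 8):
    tree = DecisionTreeRegressor(min_samples_leaf=k, random_state=0).fit(X, y)
    sizes[k] = (tree.get_depth(), tree.get_n_leaves())
    print(f"min_samples_leaf={k}: depth={sizes[k][0]}, leaves={sizes[k][1]}")
```

Real trees are not perfectly balanced, so the depth will not drop by exactly one per doubling, but the leaf count can never exceed `n / min_samples_leaf`.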

---

* Finally, `max_features` determines what fraction of the features is considered at `each split`
* So if max_features = 0.5, then at each split a random half of the features is considered
* This will `reduce` the `correlation` between the individual trees and *MAY* help with overfitting
* The trade-off is that each tree will be `less accurate`
* Options for `max_features` include:
    > `sqrt`: consider the square root of the number of features

    > `log2`: consider log2 of the number of features

    > `None`: make all features available at each split
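
As a rough illustration, the helper below turns each `max_features` option into a feature count for the 66 columns used in this notebook. `features_per_split` is a hypothetical function written for these notes; it follows my understanding of how scikit-learn resolves the setting, and exact rounding may differ between versions.

```python
import math

def features_per_split(max_features, n_features):
    """Number of candidate features examined at each split (illustrative only)."""
    if max_features is None:
        k = n_features                          # all features
    elif max_features == "sqrt":
        k = int(math.sqrt(n_features))
    elif max_features == "log2":
        k = int(math.log2(n_features))
    elif isinstance(max_features, float):
        k = int(max_features * n_features)      # a fraction of the features
    else:
        k = int(max_features)                   # an explicit integer count
    return max(1, k)

for option in (None, 0.5, "sqrt", "log2"):
    print(option, "->", features_per_split(option, 66))
```

The smaller the count, the more the trees differ from one another, at the cost of each split having fewer candidates to choose from.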

# Libraries and Modules Import

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
import os
currDir = os.getcwd()
os.chdir("../fastai/")
from structured import *       
from imports import *
os.chdir(currDir)
# ____________________________________________________________ #
from pandas_summary import DataFrameSummary
from IPython.display import display

from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
# ____________________________________________________________ #
import math
import numpy as np
import warnings
warnings.filterwarnings("ignore")

In [11]:
PATH2DATA = "../datasets/kaggle/bluebook_bulldozers/"
# !dir "../datasets/kaggle/corporcion_favorita_grocery_sales/"

# below is just the param to control different features in a graph plot
set_plot_sizes(12, 14, 16)


# Load dataset and Pre-Process

In [18]:
df_raw = pd.read_csv(f'{PATH2DATA}Train.csv',
                     low_memory=False, parse_dates=["saledate"])

# replace SalePrice with its natural log
df_raw.SalePrice = np.log(df_raw.SalePrice)

# expand saledate into date-part columns (year, month, day, etc.)
add_datepart(df_raw, 'saledate')

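# convert string columns to pandas categorical dtype
train_cats(df_raw)
# give UsageBand an explicit order; assigning the result (rather than
# inplace=True) also works on newer pandas versions
df_raw["UsageBand"] = df_raw.UsageBand.cat.set_categories(
    ["High", "Medium", "Low"], ordered=True)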

# use proc_df to turn categorical columns into numeric codes and split off the target
df_train, y_train, _ = proc_df(df_raw, 'SalePrice')


In [21]:
def split_vals(a, n):
    """
    a: the full dataset (DataFrame or array) to split
    n: number of leading rows to put in the training set
    """
    # a[:n] retrieves the first n = (N - n_valid) rows for the TRAINING set
    # a[n:] retrieves the last n_valid rows for the VALIDATION set
    return a[:n].copy(), a[n:].copy()

n_valid = 12000

# the number of training rows is len(df_train) - n_valid
n_trn = len(df_train) - n_valid

# now split the entire dataset into training and validation
raw_train, raw_valid = split_vals(df_raw, n_trn)

# the split above was on the raw frame; now split the pre-processed features and target the same way
X_train, X_valid = split_vals(df_train, n_trn)
y_train, y_valid = split_vals(y_train, n_trn)

In [22]:
print("X_train shape: {}, y_train shape: {}, X_valid shape: {}, y_valid shape: {}".format(
    X_train.shape, y_train.shape, X_valid.shape, y_valid.shape))

X_train shape: (389125, 66), y_train shape: (389125,), X_valid shape: (12000, 66), y_valid shape: (12000,)
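
To make the positional split explicit, here is `split_vals` applied to a toy list. The list `rows` is a hypothetical stand-in for the real DataFrame; the point is only that the split is by position, not random, so the last `n_valid` rows become the validation set.

```python
def split_vals(a, n):
    """Return the first n items and the remaining items as separate copies."""
    return a[:n].copy(), a[n:].copy()

rows = list(range(10))   # hypothetical stand-in for 10 ordered rows
n_valid = 3

train, valid = split_vals(rows, len(rows) - n_valid)
print("train:", train)
print("valid:", valid)
```

The same call works on a pandas DataFrame or a NumPy array, since both support slicing and `.copy()`.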
