# Random Forest in Depth - Chapter 02

## 0. Learning Objectives

* Dive further into random forests
* Understand why they work really well for some datasets
* Learn what to do when they don't work
* Weigh the pros and cons of random forests
* See which parameters can be tuned
* Learn how to interpret the results of a random forest

## 1. Import Dataset

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [14]:
import os
currDir = os.getcwd()
os.chdir("../fastai/")
from structured import *
from imports import *
os.chdir(currDir)
# ____________________________________________________________ #
from pandas_summary import DataFrameSummary
from IPython.display import display

from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
# ____________________________________________________________ #

## 2. Data Pre-Processing

In [3]:
PATH2DATA = "../datasets/kaggle/bluebook_bulldozers/"
os.listdir(PATH2DATA)  # view what is in the data directory


---

**STEP 1)**

* Read the data and tell pandas which columns contain dates via `parse_dates`

In [4]:
df_raw = pd.read_csv(f'{PATH2DATA}Train.csv',
                     low_memory=False, parse_dates=["saledate"])
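
To see what `parse_dates` changes, here is a minimal sketch on a made-up two-row CSV. The values are illustrative assumptions, not rows taken from `Train.csv`.

```python
import io
import pandas as pd

# Hypothetical two-row CSV standing in for Train.csv
csv_text = "SalesID,SalePrice,saledate\n1,66000,2006-11-16\n2,57000,2004-03-26\n"

# Without parse_dates the column is read as plain text
plain = pd.read_csv(io.StringIO(csv_text))
# With parse_dates the column becomes a datetime column
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["saledate"])

print(plain["saledate"].dtype)
print(parsed["saledate"].dtype)
# Datetime columns expose the .dt accessor, e.g. to pull out the year
years = parsed["saledate"].dt.year.tolist()
print(years)
```

Only a parsed column supports the `.dt` accessor, which is what later date feature extraction relies on.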


---

**STEP 2)**

* Define a better way of displaying the results
* It temporarily raises the number of rows and columns that pandas will display

In [8]:
def display_all(df):
    with pd.option_context("display.max_rows", 1000):
        with pd.option_context("display.max_columns", 1000):
            display(df)

---

**STEP 3)**

* Then show only the last few rows using the `tail()` method
* The table has too many columns to fit across the screen, so it reads much better flipped with the `transpose()` method, which turns each column into a row

In [None]:
display_all(df_raw.tail().transpose())

---

**STEP 4)**

* Check which evaluation metric the competition uses to score predictions
* You can confirm it in the `Evaluation` section of the competition page on Kaggle
* Then transform the data accordingly
* For example, Bluebook for Bulldozers is scored with `RMSLE` (root mean squared log error), so the dependent variable (in our case `SalePrice`) needs to be logged as shown below

In [12]:
df_raw.SalePrice = np.log(df_raw.SalePrice)
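
Taking the log once up front means that a plain RMSE on the logged column matches RMSLE on the raw prices. The sketch below checks this on made-up numbers; the `rmse` helper is a hypothetical name, not part of fast.ai or scikit-learn. It uses plain `np.log` as in the cell above, whereas some libraries define RMSLE with `log(1 + x)` instead.

```python
import numpy as np

def rmse(pred, actual):
    """Root mean squared error between two arrays (hypothetical helper)."""
    pred, actual = np.asarray(pred), np.asarray(actual)
    return np.sqrt(np.mean((pred - actual) ** 2))

# Made-up prices, for illustration only
actual = np.array([66000.0, 57000.0, 10000.0])
pred = np.array([60000.0, 59000.0, 12000.0])

# RMSLE written out directly on the raw prices
rmsle_direct = np.sqrt(np.mean((np.log(pred) - np.log(actual)) ** 2))

# Plain RMSE on values that were logged beforehand
rmse_on_logs = rmse(np.log(pred), np.log(actual))

print(np.isclose(rmsle_direct, rmse_on_logs))
```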

---

**STEP 5)**

* Initialise the random forest algorithm
* Set the dependent variable to `SalePrice` and use all the other columns as independent variables
* `df_raw.drop("SalePrice", axis=1)` returns the table without the `SalePrice` column
* At this stage the fit is expected to raise an error, because the table still contains non-numeric columns; steps 6 and 7 fix that

In [None]:
rfalg = RandomForestRegressor(n_jobs=-1)
rfalg.fit(df_raw.drop("SalePrice", axis=1), df_raw.SalePrice)
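
scikit-learn estimators need numeric input, so fitting on a table that still holds strings is expected to fail. The following is a small sketch of that behaviour on a toy frame with assumed values, not on the bulldozer data itself.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy frame with one string column (hypothetical values)
toy = pd.DataFrame({
    "UsageBand": ["Low", "High", "Medium", "Low"],
    "YearMade": [2004, 1996, 2001, 2007],
    "SalePrice": [11.1, 10.9, 9.2, 10.6],
})

model = RandomForestRegressor(n_estimators=10, random_state=0)
try:
    model.fit(toy.drop("SalePrice", axis=1), toy.SalePrice)
    fit_failed = False
except ValueError as err:
    # The string column cannot be converted to floats
    fit_failed = True
    print("fit failed:", err)
```

This is why the remaining steps convert dates and categorical columns into numbers before training.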

---

**STEP 6)**

* Make sure all the entries of the table are numbers
* In particular, extract further numeric features from the `saledate` column
* You can use the `add_datepart` function in the fast.ai library

In [17]:
add_datepart(df_raw, 'saledate')
df_raw.saleYear.head()


0    2006
1    2004
2    2004
3    2011
4    2009
Name: saleYear, dtype: int64
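
The idea behind this step can be sketched with the pandas `.dt` accessor. This is not fast.ai's implementation of `add_datepart`; the helper name `add_date_features` and the chosen parts are assumptions made for illustration.

```python
import pandas as pd

def add_date_features(df, col, prefix="sale"):
    """Hypothetical helper: expand a datetime column into numeric parts."""
    dates = df[col]
    for part in ["year", "month", "day", "dayofweek", "dayofyear"]:
        # e.g. part "year" becomes column "saleYear"
        df[prefix + part.capitalize()] = getattr(dates.dt, part)
    # Whole seconds elapsed since the Unix epoch
    epoch = pd.Timestamp("1970-01-01")
    df[prefix + "Elapsed"] = (dates - epoch) // pd.Timedelta(seconds=1)
    # Drop the original datetime column, leaving only numbers
    return df.drop(columns=[col])

toy = pd.DataFrame({"saledate": pd.to_datetime(["2006-11-16", "2004-03-26"])})
toy = add_date_features(toy, "saledate")
print(toy)
```

Splitting a date this way lets the trees pick up effects such as seasonality or day of week that a single timestamp would hide.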

---

**STEP 7)**

* Make sure all the categorical entries of the table are converted into numbers
* You can use the `train_cats` or the `apply_cats` function in the fast.ai library
* This maps each category to an integer code 0, 1, ... n
* You can inspect the categories through the pandas `.cat.categories` attribute
* And use `.cat.codes` to see the integer code of every entry; missing values get the code `-1`

In [20]:
train_cats(df_raw)
df_raw.UsageBand.cat.categories

Index(['High', 'Low', 'Medium'], dtype='object')

In [21]:
df_raw.UsageBand.cat.codes

0         1
1         1
2         0
3         0
4         2
         ..
401120   -1
401121   -1
401122   -1
401123   -1
401124   -1
Length: 401125, dtype: int8
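
The same mapping can be reproduced with plain pandas. The sketch below is an approximation of the idea on a toy column with assumed values, not the fast.ai code itself.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"UsageBand": ["Low", "High", np.nan, "Medium"]})

# Convert every text column to the pandas category dtype
for name in list(toy.columns):
    col = toy[name]
    if pd.api.types.is_object_dtype(col) or pd.api.types.is_string_dtype(col):
        toy[name] = col.astype("category")

categories = toy.UsageBand.cat.categories.tolist()
codes = toy.UsageBand.cat.codes.tolist()
print(categories)  # categories are sorted alphabetically by default
print(codes)       # the missing entry is coded as -1
```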

---

**STEP 8)**

* Manually set the category order to improve random forest performance slightly
* This is done using the pandas `.cat.set_categories` method

In [None]:
# Assign the result back; the inplace argument was removed in recent pandas versions
df_raw.UsageBand = df_raw.UsageBand.cat.set_categories(
    ["High", "Medium", "Low"], ordered=True)
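
A quick way to see the effect of reordering is to compare the integer codes before and after, here on a toy series with assumed values. The sketch assigns the result back rather than passing `inplace=True`, since recent pandas releases removed that argument.

```python
import pandas as pd

usage = pd.Series(["Low", "High", "Medium", "Low"], dtype="category")

# Default order is alphabetical: High=0, Low=1, Medium=2
codes_before = usage.cat.codes.tolist()

# Impose a meaningful order: High=0, Medium=1, Low=2
usage = usage.cat.set_categories(["High", "Medium", "Low"], ordered=True)
codes_after = usage.cat.codes.tolist()

print(codes_before)
print(codes_after)
```

With the ordered codes, a single tree split such as `code <= 0` separates `High` from the rest, and `code <= 1` separates `Low` from the rest.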

---

**STEP 9)**

* Deal with any empty entries in the table
* First inspect them by computing the fraction of empty or NaN entries in each column and showing it with `display_all`

In [22]:
display_all(df_raw.isnull().sum().sort_index() / len(df_raw))
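
On a toy frame with assumed values, the same expression yields one fraction per column, which makes it easy to read off which columns are mostly empty.

```python
import numpy as np
import pandas as pd

# Toy frame with some missing entries (hypothetical values)
toy = pd.DataFrame({
    "UsageBand": ["Low", None, None, "High"],
    "YearMade": [2004, 1996, np.nan, 2007],
    "SalePrice": [11.1, 10.9, 9.2, 10.6],
})

# Count missing entries per column, sort by column name, divide by row count
missing_fraction = toy.isnull().sum().sort_index() / len(toy)
print(missing_fraction)
```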


## 3. Algorithm

## 4. Outputs