# 01 | Basics - Introduction to Random Forests

## 1.1. About autoreload

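* The first two lines (`%load_ext autoreload` and `%autoreload 2`) reload imported modules automatically, so edits to their source take effect without restarting the kernel
* `%matplotlib inline` renders plots inside the notebook itself
* If you want to use a Python variable inside a shell command run from Jupyter (a line starting with `!`), wrap it in CURLIES, e.g. `!ls {PATH}`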

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [17]:
# Absolute imports: a notebook has no parent package, so relative imports
# such as `from .fastai.imports import *` raise ImportError
from fastai.imports import *
from fastai.structured import *  # fastai 0.7 layout; later releases moved this to fastai.tabular
# ____________________________________________________________ #
from pandas_summary import DataFrameSummary
from IPython.display import display

from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
# ____________________________________________________________ #
import numpy as np
import pandas as pd

## 1.2. About Software development practices in Data Science

* `PEP 8` is the standard style guide for Python source code
* Most DS/ML work is `prototyping models`
* The software-engineering component is, loosely speaking, set aside
* Best practices for prototyping models are barely standardised, so you are on your own when setting up a style and standards

## 1.3. About Blue Book for Bulldozers

* The task is to predict the `sale price` of heavy equipment
* Keep a separate `PATH` variable that stores the location of the dataset
* Note that the *structured* module from fastai has since been moved to *tabular*
* This is an example of `structured` data --> this is where pandas comes in

In [None]:
PATH2DATA = '../datasets/kaggle/bluebook_bulldozers/'


## 1.4. Basics on Visualizing the Data

* You can use the `head` shell command to peek at the first few lines of the file

In [None]:
!head '../datasets/kaggle/bluebook_bulldozers/Train.csv'

SalesID,SalePrice,MachineID,ModelID,datasource,auctioneerID,YearMade,MachineHoursCurrentMeter,UsageBand,saledate,fiModelDesc,fiBaseModel,fiSecondaryDesc,fiModelSeries,fiModelDescriptor,ProductSize,fiProductClassDesc,state,ProductGroup,ProductGroupDesc,Drive_System,Enclosure,Forks,Pad_Type,Ride_Control,Stick,Transmission,Turbocharged,Blade_Extension,Blade_Width,Enclosure_Type,Engine_Horsepower,Hydraulics,Pushblock,Ripper,Scarifier,Tip_Control,Tire_Size,Coupler,Coupler_System,Grouser_Tracks,Hydraulics_Flow,Track_Type,Undercarriage_Pad_Width,Stick_Length,Thumb,Pattern_Changer,Grouser_Type,Backhoe_Mounting,Blade_Type,Travel_Controls,Differential_Type,Steering_Controls
1139246,66000,999089,3157,121,3,2004,68,Low,11/16/2006 0:00,521D,521,D,,,,Wheel Loader - 110.0 to 120.0 Horsepower,Alabama,WL,Wheel Loader,,EROPS w AC,None or Unspecified,,None or Unspecified,,,,,,,,2 Valve,,,,,None or Unspecified,None or Unspecified,,,,,,,,,,,,,Standard,Conventional
1139248,57000,117657,77,121,3,1996,4640,Lo

* Or use the pandas library
* `f'{PATH2DATA}Train.csv'` is a formatted string literal (f-string), available since Python 3.6
* e.g. if you have a variable `NAME = 'deedo'` and you evaluate `f'My name is {NAME}'`, you get `My name is deedo`
* Keep in mind that if you remove the `f` it WON'T WORK: the `f` prefix enables `interpolation` of variables into the string
* The curlies also accept integers and arbitrary expressions, for example `{NAME.upper()}`
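
The f-string behaviour described above can be checked with a small sketch. `NAME` and the greeting come from the notes; the other variable names are made up for illustration.

```python
NAME = 'deedo'
PATH2DATA = '../datasets/kaggle/bluebook_bulldozers/'

# With the f prefix, the expression inside the curlies is evaluated
greeting = f'My name is {NAME}'

# Any expression works inside the curlies, including method calls
shouted = f'My name is {NAME.upper()}'

# Without the f prefix the curlies are kept literally
plain = 'My name is {NAME}'

# The same mechanism builds the path passed to pd.read_csv
csv_path = f'{PATH2DATA}Train.csv'

print(greeting)
print(shouted)
print(plain)
print(csv_path)
```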

In [None]:
# low_memory=True parses the file in chunks to save RAM, which can give mixed dtypes in a column;
# low_memory=False reads the whole file before inferring each column's dtype
df_raw = pd.read_csv(f'{PATH2DATA}Train.csv',
                     low_memory=False, parse_dates=["saledate"])


* You will find that a pandas `DataFrame` behaves much like an R data frame

In [None]:
df_raw

Unnamed: 0,SalesID,SalePrice,MachineID,ModelID,datasource,auctioneerID,YearMade,MachineHoursCurrentMeter,UsageBand,saledate,...,Undercarriage_Pad_Width,Stick_Length,Thumb,Pattern_Changer,Grouser_Type,Backhoe_Mounting,Blade_Type,Travel_Controls,Differential_Type,Steering_Controls
0,1139246,66000,999089,3157,121,3.0,2004,68.0,Low,2006-11-16,...,,,,,,,,,Standard,Conventional
1,1139248,57000,117657,77,121,3.0,1996,4640.0,Low,2004-03-26,...,,,,,,,,,Standard,Conventional
2,1139249,10000,434808,7009,121,3.0,2001,2838.0,High,2004-02-26,...,,,,,,,,,,
3,1139251,38500,1026470,332,121,3.0,2001,3486.0,High,2011-05-19,...,,,,,,,,,,
4,1139253,11000,1057373,17311,121,3.0,2007,722.0,Medium,2009-07-23,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
401120,6333336,10500,1840702,21439,149,1.0,2005,,,2011-11-02,...,None or Unspecified,None or Unspecified,None or Unspecified,None or Unspecified,Double,,,,,
401121,6333337,11000,1830472,21439,149,1.0,2005,,,2011-11-02,...,None or Unspecified,None or Unspecified,None or Unspecified,None or Unspecified,Double,,,,,
401122,6333338,11500,1887659,21439,149,1.0,2005,,,2011-11-02,...,None or Unspecified,None or Unspecified,None or Unspecified,None or Unspecified,Double,,,,,
401123,6333341,9000,1903570,21435,149,2.0,2005,,,2011-10-25,...,None or Unspecified,None or Unspecified,None or Unspecified,None or Unspecified,Double,,,,,


## 1.5. Visualizing your Data to Inspect Types

* It is important to look at your data and understand `the format`, `how it is stored`, and `what type of values` it holds
* Even if you read the description of the data, that is never enough
* `.tail()` shows the last few rows
* Keep in mind not to study the data for too long at this stage, as it risks `overfitting`
* You should generally leave most of the `EDA` (exploratory data analysis) until towards the end

In [None]:
def display_all(df):
    with pd.option_context("display.max_rows", 1000):
        with pd.option_context("display.max_columns", 1000):
            display(df)

In [None]:
display_all(df_raw.tail().T)

Unnamed: 0,401120,401121,401122,401123,401124
SalesID,6333336,6333337,6333338,6333341,6333342
SalePrice,10500,11000,11500,9000,7750
MachineID,1840702,1830472,1887659,1903570,1926965
ModelID,21439,21439,21439,21435,21435
datasource,149,149,149,149,149
auctioneerID,1.0,1.0,1.0,2.0,2.0
YearMade,2005,2005,2005,2005,2005
MachineHoursCurrentMeter,,,,,
UsageBand,,,,,
saledate,2011-11-02 00:00:00,2011-11-02 00:00:00,2011-11-02 00:00:00,2011-10-25 00:00:00,2011-10-25 00:00:00


## 1.6. Inspecting the Loss-Function

* Quite often it is important to check which evaluation metric (loss function) the competition uses
* Handling it may simply become part of the pre-processing stage
* On Kaggle, you can check this in the `evaluation` section
* E.g. Blue Book for Bulldozers is evaluated using `Root Mean Squared Log Error (RMSLE)`
* Hence we take the log of our `dependent variable`, the sale price, so that plain RMSE on the logged values matches the metric

In [None]:
df_raw.SalePrice = np.log(df_raw.SalePrice)
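
Why logging the target helps can be seen in a minimal sketch. It assumes the metric is computed with a plain `np.log`, as in the cell above; many RMSLE definitions use `log(1 + x)` instead. The helper names `rmse` and `rmsle` and the predicted prices are made up for illustration.

```python
import numpy as np

def rmse(pred, actual):
    """Root mean squared error between two arrays."""
    return np.sqrt(np.mean((pred - actual) ** 2))

def rmsle(pred, actual):
    """Root mean squared error between the logs of two arrays."""
    return np.sqrt(np.mean((np.log(pred) - np.log(actual)) ** 2))

# Illustrative prices only (actual values borrowed from the first rows of Train.csv)
actual = np.array([66000.0, 57000.0, 10000.0])
pred = np.array([60000.0, 60000.0, 12000.0])

# Once both sides are logged, ordinary RMSE is the same number as RMSLE
on_logged_target = rmse(np.log(pred), np.log(actual))
direct = rmsle(pred, actual)
print(on_logged_target, direct)
```

So a model trained on the logged `SalePrice` can be scored with ordinary RMSE.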

* You can inspect each column of the DataFrame as a separate pandas `Series`
* A pandas Series can easily be converted to a `numpy array`
* Hence numpy and pandas work together smoothly
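
A small sketch of that round trip, using a toy frame rather than the real `df_raw` (the values are only illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"SalePrice": [66000, 57000, 10000],
                   "YearMade": [2004, 1996, 2001]})

# A single column is a pandas Series (df.SalePrice is equivalent)
prices = df["SalePrice"]

# Convert the Series to a plain numpy array
price_array = prices.to_numpy()

# numpy functions also accept a Series directly and return a Series
log_prices = np.log(prices)

print(type(prices).__name__, type(price_array).__name__, type(log_prices).__name__)
```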

## 1.7. Curse of Dimensionality and No Free Lunch Theorem

* ML theoreticians in the 90s essentially believed that ...
* ... data points get `stretched` towards the edges of the space with each dimension you add to the data
* Hence the distance between them, which is essentially a `feature`, becomes meaningless
* Empirical modern ML disagrees: Jeremy says that adding more columns generally improves the model's performance
* Even though the points sit near the edges, they still vary in how far apart they are
* According to Jeremy, even kNN, which relies on the distance between points, still works well on high-dimensional data
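
The "stretched to the edges" claim can be made concrete with a short calculation. This sketch assumes points spread uniformly over a unit hypercube; the helper `edge_fraction` and the 5% margin are hypothetical choices for illustration.

```python
def edge_fraction(n_dims, margin=0.05):
    """Fraction of a unit hypercube's volume lying within `margin` of its boundary."""
    # The inner cube that stays clear of every face has side length 1 - 2 * margin
    inner = (1 - 2 * margin) ** n_dims
    return 1 - inner

fractions = {d: edge_fraction(d) for d in (1, 2, 10, 100)}
for d, frac in fractions.items():
    print(f"{d:>3} dims: {frac:.4f} of uniformly spread points sit near an edge")
```

The fraction grows towards 1 as dimensions are added, which is the geometric fact behind the theory; the notes' point is that distances between points still carry information in practice.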

---

* Now the No Free Lunch theorem --> "there is NO TYPE of model that WORKS WELL for every TYPE of dataset"
* This makes sense mathematically: a truly random dataset cannot be handled by any method, because there is no structure to learn
* Why? Because the models we use rely on `features` to make sense of data
* A random dataset has no `features`, unless you already know the process that generated it, but then there is no point in using ML since you already know the answer
* In real applications, data is `NOT RANDOM`
* Mathematically, we believe that real data sits on a `lower dimensional manifold`
* It was created by some `Causal Structure`, even if that structure is stochastic in nature

## 1.8. Initialise the Random Forest Algorithm

* A regressor is used when the dependent variable is `continuous`
* A classifier is used when the dependent variable is `discrete` or `categorical`
* Random Forest is a good first algorithm: if your data does not work with RF to begin with, there is possibly something already wrong with it
* In the `.fit` method you first pass in the `independent variables`, the things you use to predict; in our case we drop 'SalePrice', meaning we use EVERYTHING EXCEPT SALEPRICE
* `DataFrame.drop` returns a new DataFrame with the `SalePrice` column REMOVED (`axis=1` means columns; it works on rows as well, FYI)
* Then you pass in the `dependent variable`, the thing you want to predict; for us this is `df_raw.SalePrice`
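
How `drop` separates the independent variables from the dependent one can be sketched on a toy frame (values are illustrative; `X`, `y`, and `without_first_row` are names chosen for this example):

```python
import pandas as pd

df = pd.DataFrame({
    "SalePrice": [11.1, 10.9, 9.2],
    "YearMade": [2004, 1996, 2001],
    "ModelID": [3157, 77, 7009],
})

# axis=1 drops a column and returns a new frame; df itself is unchanged
X = df.drop("SalePrice", axis=1)
y = df["SalePrice"]

# axis=0 drops rows by index label instead
without_first_row = df.drop(0, axis=0)

print(list(X.columns), len(without_first_row))
```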

In [None]:
m = RandomForestRegressor(n_jobs=-1)
m.fit(df_raw.drop('SalePrice', axis=1), df_raw.SalePrice)

ValueError: could not convert string to float: 'Low'

## 1.9. How to Pre-Process (Convert All Columns to Numbers) Before Using the RF Algorithm, as Part of Feature Engineering

* You can tell from the error in the previous cell that our data contains `NON-NUMBERS`
* The Blue Book dataset contains both `continuous` (numeric) and `categorical` columns (non-numeric, or numeric codes such as a zip code)
* One example of non-numeric data in Blue Book is the `saledate` column
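
One way to list the offending columns before fitting is `select_dtypes`. This is a sketch on a toy frame, not the notebook's own approach; the column values are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "SalePrice": [66000, 57000, 10000],
    "YearMade": [2004, 1996, 2001],
    "UsageBand": ["Low", "Low", "High"],
})

# Columns whose dtype is not numeric still need converting before a forest can use them
non_numeric = df.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```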

In [None]:
df_raw.saledate

0        2006-11-16
1        2004-03-26
2        2004-02-26
3        2011-05-19
4        2009-07-23
            ...    
401120   2011-11-02
401121   2011-11-02
401122   2011-11-02
401123   2011-10-25
401124   2011-10-25
Name: saledate, Length: 401125, dtype: datetime64[ns]

* As you can see, the **dtype** of this column is **datetime64[ns]**
* That is not useful as it stands for the RF algorithm
* Now we are going to do our very first `feature engineering`
* You need to **transform** the information you have into something that is **useful** to the model
* For example: was it Christmas on that day? If you are analysing sport, was there a football match on that date?
* No ML algorithm can work that out by itself; you have to use **feature engineering** to get value out of the raw data
* `add_datepart` **splits** the date into Year, Month, Day, and several more columns

In [None]:
fastaicore.add_datepart(df_raw, 'saledate')

Unnamed: 0,SalesID,SalePrice,MachineID,ModelID,datasource,auctioneerID,YearMade,MachineHoursCurrentMeter,UsageBand,fiModelDesc,...,saleDay,saleDayofweek,saleDayofyear,saleIs_month_end,saleIs_month_start,saleIs_quarter_end,saleIs_quarter_start,saleIs_year_end,saleIs_year_start,saleElapsed
0,1139246,11.097410,999089,3157,121,3.0,2004,68.0,Low,521D,...,16,3,320,False,False,False,False,False,False,1.163635e+09
1,1139248,10.950807,117657,77,121,3.0,1996,4640.0,Low,950FII,...,26,4,86,False,False,False,False,False,False,1.080259e+09
2,1139249,9.210340,434808,7009,121,3.0,2001,2838.0,High,226,...,26,3,57,False,False,False,False,False,False,1.077754e+09
3,1139251,10.558414,1026470,332,121,3.0,2001,3486.0,High,PC120-6E,...,19,3,139,False,False,False,False,False,False,1.305763e+09
4,1139253,9.305651,1057373,17311,121,3.0,2007,722.0,Medium,S175,...,23,3,204,False,False,False,False,False,False,1.248307e+09
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
401120,6333336,9.259131,1840702,21439,149,1.0,2005,,,35NX2,...,2,2,306,False,False,False,False,False,False,1.320192e+09
401121,6333337,9.305651,1830472,21439,149,1.0,2005,,,35NX2,...,2,2,306,False,False,False,False,False,False,1.320192e+09
401122,6333338,9.350102,1887659,21439,149,1.0,2005,,,35NX2,...,2,2,306,False,False,False,False,False,False,1.320192e+09
401123,6333341,9.104980,1903570,21435,149,2.0,2005,,,30NX,...,25,1,298,False,False,False,False,False,False,1.319501e+09


* Now if you inspect all the fields of your data you see that `saledate` is gone
* and additional fields have appeared: **saleYear**, **saleMonth**, **saleWeek**, **saleDay**, and many more ...

In [None]:
df_raw.columns

Index(['SalesID', 'SalePrice', 'MachineID', 'ModelID', 'datasource',
       'auctioneerID', 'YearMade', 'MachineHoursCurrentMeter', 'UsageBand',
       'fiModelDesc', 'fiBaseModel', 'fiSecondaryDesc', 'fiModelSeries',
       'fiModelDescriptor', 'ProductSize', 'fiProductClassDesc', 'state',
       'ProductGroup', 'ProductGroupDesc', 'Drive_System', 'Enclosure',
       'Forks', 'Pad_Type', 'Ride_Control', 'Stick', 'Transmission',
       'Turbocharged', 'Blade_Extension', 'Blade_Width', 'Enclosure_Type',
       'Engine_Horsepower', 'Hydraulics', 'Pushblock', 'Ripper', 'Scarifier',
       'Tip_Control', 'Tire_Size', 'Coupler', 'Coupler_System',
       'Grouser_Tracks', 'Hydraulics_Flow', 'Track_Type',
       'Undercarriage_Pad_Width', 'Stick_Length', 'Thumb', 'Pattern_Changer',
       'Grouser_Type', 'Backhoe_Mounting', 'Blade_Type', 'Travel_Controls',
       'Differential_Type', 'Steering_Controls', 'saleYear', 'saleMonth',
       'saleWeek', 'saleDay', 'saleDayofweek', 'saleDayofyear',


* This is the most basic type of `feature engineering`
* `[BONUS]` In your spare time, try to do this step manually so you can understand the pre-processing properly
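
For the bonus exercise, here is a minimal sketch of doing the split by hand with the pandas `.dt` accessor. It is not fastai's implementation and only builds a few of the columns; the names mirror the output above, and the two dates are the first two `saledate` values.

```python
import pandas as pd

df = pd.DataFrame({"saledate": pd.to_datetime(["2006-11-16", "2004-03-26"])})
dates = df["saledate"]

# Each date component becomes its own numeric (or boolean) column
df["saleYear"] = dates.dt.year
df["saleMonth"] = dates.dt.month
df["saleDay"] = dates.dt.day
df["saleDayofweek"] = dates.dt.dayofweek   # Monday = 0
df["saleDayofyear"] = dates.dt.dayofyear
df["saleIs_month_end"] = dates.dt.is_month_end

# Seconds since 1970-01-01, so the model also sees time as one continuous number
df["saleElapsed"] = (dates - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)

# The original datetime column is no longer needed
df = df.drop("saledate", axis=1)
print(df.T)
```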

## 1.10. Feature Engineering 2 -- Take Care of the Strings

* As you can see below, you still have certain columns that contain **Strings**
* How do you get rid of those ?

In [None]:
df_raw.head()

Unnamed: 0,SalesID,SalePrice,MachineID,ModelID,datasource,auctioneerID,YearMade,MachineHoursCurrentMeter,UsageBand,fiModelDesc,...,saleDay,saleDayofweek,saleDayofyear,saleIs_month_end,saleIs_month_start,saleIs_quarter_end,saleIs_quarter_start,saleIs_year_end,saleIs_year_start,saleElapsed
0,1139246,11.09741,999089,3157,121,3.0,2004,68.0,Low,521D,...,16,3,320,False,False,False,False,False,False,1163635000.0
1,1139248,10.950807,117657,77,121,3.0,1996,4640.0,Low,950FII,...,26,4,86,False,False,False,False,False,False,1080259000.0
2,1139249,9.21034,434808,7009,121,3.0,2001,2838.0,High,226,...,26,3,57,False,False,False,False,False,False,1077754000.0
3,1139251,10.558414,1026470,332,121,3.0,2001,3486.0,High,PC120-6E,...,19,3,139,False,False,False,False,False,False,1305763000.0
4,1139253,9.305651,1057373,17311,121,3.0,2007,722.0,Medium,S175,...,23,3,204,False,False,False,False,False,False,1248307000.0


* You can use the `train_cats` function in the fastai library
* It converts string columns to pandas categoricals, which map each string to an integer code in a defined order
* But `WARNING`: make sure to use the same mapping for the **TRAIN**, **VAL**, and **TEST** data
* Otherwise your model will be **un-predictive**
* To do that, use `apply_cats` on the validation and test sets, which reuses the categories learned from the training set

In [None]:
df_raw.UsageBand.cat.categories

AttributeError: Can only use .cat accessor with a 'category' dtype
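
The error above suggests that `UsageBand` is still a plain string column, i.e. no categorical conversion has been applied to `df_raw` yet. The idea behind the conversion, and behind reusing the training categories on other data, can be sketched with pandas alone. This is an approximation under that assumption, not fastai's code; the frames `train` and `valid` and the value `"Unseen"` are made up.

```python
import pandas as pd

train = pd.DataFrame({"UsageBand": ["Low", "High", "Medium", "Low"]})
valid = pd.DataFrame({"UsageBand": ["Medium", "Low", "Unseen"]})

# Convert the string column to an ordered categorical; categories are sorted alphabetically
train["UsageBand"] = train["UsageBand"].astype("category").cat.as_ordered()

# Reuse the TRAIN categories on the validation data so the integer codes line up
valid["UsageBand"] = pd.Categorical(
    valid["UsageBand"],
    categories=train["UsageBand"].cat.categories,
    ordered=True,
)

categories = list(train["UsageBand"].cat.categories)
train_codes = train["UsageBand"].cat.codes.tolist()
valid_codes = valid["UsageBand"].cat.codes.tolist()  # values unseen in train get code -1

print(categories, train_codes, valid_codes)
```

After such a conversion, `.cat.categories` works because the column has a `category` dtype.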