# 01 | Basics - Introduction to Random Forests

## 1.1. About autoreload

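* `%load_ext autoreload` and `%autoreload 2` reload imported modules automatically before each cell runs, so edits to a module's source take effect without restarting the kernel
* `%matplotlib inline` renders plots within the notebook itself
* If you want to use a Python variable inside a shell command in Jupyter (a line starting with `!`), wrap it in CURLIES, e.g. `{PATH}`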

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
from fastai.imports import *
from fastai.tabular import *
# ____________________________________________________________ #
from pandas_summary import DataFrameSummary
from IPython.display import display

from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
# ____________________________________________________________ #

## 1.2. About Software development practices in Data Science

* `PEP 8` is the official style guide for Python source code
* Most DS/ML work is model prototyping
* Much of the usual software development discipline is set aside (although only loosely)
* Best practices for model prototyping are barely standardised, so you are on your own to set up your own style and standards

## 1.3. About Blue Book for Bulldozers

* It is about predicting the `sale price` of heavy equipment
* Keep a separate `PATH` variable to store the location of the dataset
* The *structured* module from fastai has been replaced by *tabular*
* This is an example of `structured` data --> this is where pandas comes in

In [3]:
PATH2DATA = '../datasets/kaggle/bluebook_bulldozers/'
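Keeping the dataset location in one variable means every file path is built from it. A minimal sketch of that idea, assuming `pathlib.Path` is used instead of the plain string above (the notebook itself uses a string and f-strings):

```python
from pathlib import Path

# Same location as PATH2DATA in the notebook; wrapping it in Path is an
# assumption made for this sketch, not what the notebook does.
PATH2DATA = Path('../datasets/kaggle/bluebook_bulldozers/')

# The / operator joins path components, so the trailing slash no longer matters.
train_csv = PATH2DATA / 'Train.csv'

print(train_csv.name)                  # file name only
print(train_csv.parent == PATH2DATA)   # the base directory is preserved
```

If the dataset moves, only `PATH2DATA` needs to change.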


## 1.4. Basics on Visualizing the Data

* You can use the `head` command to preview the first few lines of the raw file

In [4]:
!head '../datasets/kaggle/bluebook_bulldozers/Train.csv'

SalesID,SalePrice,MachineID,ModelID,datasource,auctioneerID,YearMade,MachineHoursCurrentMeter,UsageBand,saledate,fiModelDesc,fiBaseModel,fiSecondaryDesc,fiModelSeries,fiModelDescriptor,ProductSize,fiProductClassDesc,state,ProductGroup,ProductGroupDesc,Drive_System,Enclosure,Forks,Pad_Type,Ride_Control,Stick,Transmission,Turbocharged,Blade_Extension,Blade_Width,Enclosure_Type,Engine_Horsepower,Hydraulics,Pushblock,Ripper,Scarifier,Tip_Control,Tire_Size,Coupler,Coupler_System,Grouser_Tracks,Hydraulics_Flow,Track_Type,Undercarriage_Pad_Width,Stick_Length,Thumb,Pattern_Changer,Grouser_Type,Backhoe_Mounting,Blade_Type,Travel_Controls,Differential_Type,Steering_Controls
1139246,66000,999089,3157,121,3,2004,68,Low,11/16/2006 0:00,521D,521,D,,,,Wheel Loader - 110.0 to 120.0 Horsepower,Alabama,WL,Wheel Loader,,EROPS w AC,None or Unspecified,,None or Unspecified,,,,,,,,2 Valve,,,,,None or Unspecified,None or Unspecified,,,,,,,,,,,,,Standard,Conventional
1139248,57000,117657,77,121,3,1996,4640,Lo
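On a system without the `head` command, the same peek can be done in plain Python. The sketch below is an illustration only: the `head` helper and the temporary `sample.csv` file are hypothetical, and the sample rows are the first three columns of the output above.

```python
import tempfile
from itertools import islice
from pathlib import Path


def head(path, n=10):
    """Return the first n lines of a text file without reading the rest."""
    with open(path, encoding='utf-8') as f:
        # islice stops after n lines, so a large file is never fully loaded.
        return [line.rstrip('\n') for line in islice(f, n)]


# Tiny stand-in file (hypothetical) so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / 'sample.csv'
    sample.write_text(
        'SalesID,SalePrice,MachineID\n'
        '1139246,66000,999089\n'
        '1139248,57000,117657\n',
        encoding='utf-8',
    )
    first_two = head(sample, n=2)

print(first_two)
```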

* Or use the pandas library
* `f'{PATH2DATA}Train.csv'` is an f-string (formatted string literal), available since Python 3.6
* e.g. if you have a variable `NAME = 'deedo'` and you evaluate `f'My name is {NAME}'`, the result is `My name is deedo`
* But keep in mind that if you remove the `f` it WON'T WORK: the `f` prefix enables `interpolation` of variables into the string
* It also works for `integers`, and you can put any expression inside the curlies, for example `{NAME.upper()}`
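The f-string points above can be checked with a short sketch. `NAME` comes from the example in the notes; `COUNT` is a hypothetical variable added here to show an integer.

```python
NAME = 'deedo'
COUNT = 3  # hypothetical integer, for illustration only

# The f prefix replaces {NAME} with the value of the variable.
greeting = f'My name is {NAME}'

# Without the prefix, the braces are kept literally.
literal = 'My name is {NAME}'

# Any expression can go inside the braces: method calls, arithmetic, etc.
shout = f'My name is {NAME.upper()}'
total = f'{COUNT} + 1 = {COUNT + 1}'

print(greeting)
print(literal)
print(shout)
print(total)
```

Note that the method call goes inside the braces (`{NAME.upper()}`); writing `{NAME}.upper()` would insert the name and then leave `.upper()` as literal text.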

In [5]:
# low_memory=True parses the file in chunks to save RAM, but may infer mixed types
# within a column; low_memory=False avoids that at the cost of more memory
df_raw = pd.read_csv(f'{PATH2DATA}Train.csv',
                     low_memory=False, parse_dates=["saledate"])


In [6]:
df_raw

Unnamed: 0,SalesID,SalePrice,MachineID,ModelID,datasource,auctioneerID,YearMade,MachineHoursCurrentMeter,UsageBand,saledate,...,Undercarriage_Pad_Width,Stick_Length,Thumb,Pattern_Changer,Grouser_Type,Backhoe_Mounting,Blade_Type,Travel_Controls,Differential_Type,Steering_Controls
0,1139246,66000,999089,3157,121,3.0,2004,68.0,Low,2006-11-16,...,,,,,,,,,Standard,Conventional
1,1139248,57000,117657,77,121,3.0,1996,4640.0,Low,2004-03-26,...,,,,,,,,,Standard,Conventional
2,1139249,10000,434808,7009,121,3.0,2001,2838.0,High,2004-02-26,...,,,,,,,,,,
3,1139251,38500,1026470,332,121,3.0,2001,3486.0,High,2011-05-19,...,,,,,,,,,,
4,1139253,11000,1057373,17311,121,3.0,2007,722.0,Medium,2009-07-23,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
401120,6333336,10500,1840702,21439,149,1.0,2005,,,2011-11-02,...,None or Unspecified,None or Unspecified,None or Unspecified,None or Unspecified,Double,,,,,
401121,6333337,11000,1830472,21439,149,1.0,2005,,,2011-11-02,...,None or Unspecified,None or Unspecified,None or Unspecified,None or Unspecified,Double,,,,,
401122,6333338,11500,1887659,21439,149,1.0,2005,,,2011-11-02,...,None or Unspecified,None or Unspecified,None or Unspecified,None or Unspecified,Double,,,,,
401123,6333341,9000,1903570,21435,149,2.0,2005,,,2011-10-25,...,None or Unspecified,None or Unspecified,None or Unspecified,None or Unspecified,Double,,,,,
