### Imports and Setup

In [2]:
%load_ext autoreload
%autoreload 2

%matplotlib inline

In [None]:
import sys
import os
sys.path.insert(0, "/Users/JI/Documents/Github/fastai/old/")
# print(sys.path)
import fastai
print(sys.modules['fastai'])

In [33]:
from fastai.structured import add_datepart,train_cats,proc_df,fix_missing,numericalize,set_rf_samples
import pandas as pd
import numpy as np
from pandas_summary import DataFrameSummary
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from IPython.display import display
from sklearn import metrics


In [6]:
PATH = "./data/bulldozers/"
# !ls {PATH}

### Load Dataset

In [7]:
df_raw = pd.read_csv(f'{PATH}Train.csv', low_memory=False, 
                     parse_dates=["saledate"])

### Pre-Processing Steps
All columns need to be numeric before modeling:
1. Change dates to numerics (add_datepart)
2. Change all string columns to categorical variables (train_cats). To apply the same categorical mappings to the test set, use (apply_cats). Make sure the category order makes sense, i.e. Low, Med, High instead of High, Low, Med etc.
3. Take care of missing/null values
    - if numeric, add a new col (_na) with 1 or 0, and fill the nulls with the median value (proc_df, fix_missing)
    - pandas automatically codes null categorical values as -1, so add 1 to all codes using (numericalize), which maps nulls to 0
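The steps above can be sketched in plain pandas. This is only an illustration on a made-up frame, not the fastai implementation: the frame `toy`, its values, and the chosen category order are assumptions, and the `_na` flag is stored as True/False rather than 1/0.

```python
import pandas as pd

# Hypothetical toy frame; the values are illustrative only.
toy = pd.DataFrame({
    "saledate": pd.to_datetime(["2006-11-16", "2004-03-26", "2011-05-19"]),
    "UsageBand": ["Low", None, "High"],
    "MachineHours": [68.0, None, 3486.0],
})

# 1. Dates -> numeric parts (add_datepart creates many more columns than this).
toy["saleYear"] = toy["saledate"].dt.year
toy["saleMonth"] = toy["saledate"].dt.month
toy = toy.drop(columns="saledate")

# 2. Strings -> ordered categories, so the codes follow Low < Medium < High.
toy["UsageBand"] = pd.Categorical(
    toy["UsageBand"], categories=["Low", "Medium", "High"], ordered=True
)

# 3a. Numeric nulls: flag them in a *_na column, then fill with the median.
toy["MachineHours_na"] = toy["MachineHours"].isna()
toy["MachineHours"] = toy["MachineHours"].fillna(toy["MachineHours"].median())

# 3b. Category codes use -1 for missing; adding 1 maps missing to 0.
toy["UsageBand"] = toy["UsageBand"].cat.codes + 1

print(toy)
```

After this, every column in `toy` is numeric or boolean, which is the form a random forest can take as input.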

#### Display all cols in a df

In [9]:
def display_all(df):
    with pd.option_context("display.max_rows", 1000, "display.max_columns", 1000): 
        display(df)

#### Save Pre-Processed data in feather format

### Random Forests
- A tree consists of a sequence of binary decisions/splits
- How do you find the best single split (which variable, which split point)?
    - for every feature and for every candidate split point within that feature, compute the weighted avg of the mse of the two resulting groups, then pick the feature and split point with the lowest value
    - stop splitting when you hit a limit (e.g. min_samples_leaf), or when each leaf node has only one sample left
- How can you make a decision tree better?
    - **FORESTS!** RFs are simply a way of *Bagging* trees
- What is Bagging?
    - construct multiple uncorrelated models whose errors are close to random, then average their predictions so the errors tend to cancel out
- What are some RF hyperparameters?
    - n_estimators (number of trees) - as many as you have time to fit; more trees give a better r2 (mostly test with 20-30, then use 1k or so for the final model)
    - min_samples_leaf - minimum number of samples required in each leaf node
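A minimal sketch of the split search described above, in plain Python so nothing is hidden behind a library. The function names `mse` and `best_split` and the toy data are made up for illustration; scikit-learn's actual implementation is far more optimized.

```python
def mse(values):
    """Mean squared error of predicting the mean for every value."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_split(rows, targets):
    """Try every feature and every split point; keep the lowest weighted MSE.

    rows is a list of equal-length feature lists, targets the values to predict.
    Returns (score, feature_index, threshold); rows with value <= threshold go left.
    """
    best = None
    n = len(rows)
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [t for row, t in zip(rows, targets) if row[feature] <= threshold]
            right = [t for row, t in zip(rows, targets) if row[feature] > threshold]
            if not left or not right:
                continue  # a split must put samples on both sides
            score = (len(left) * mse(left) + len(right) * mse(right)) / n
            if best is None or score < best[0]:
                best = (score, feature, threshold)
    return best


# Hypothetical data: feature 0 separates cheap from expensive, feature 1 is noise.
rows = [[1, 7], [2, 3], [3, 9], [10, 4], [11, 8], [12, 2]]
targets = [10.0, 12.0, 11.0, 50.0, 52.0, 51.0]
score, feature, threshold = best_split(rows, targets)
print(feature, threshold, round(score, 3))
```

A full tree would call `best_split` again on each of the two resulting groups until a stopping limit is reached.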

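Why averaging helps can be seen with a small simulation. This is an idealized sketch, not a random forest: each "model" is just the true value plus independent Gaussian noise, and the constants (`TRUE_VALUE`, `N_MODELS`, `N_TRIALS`) are arbitrary. Real trees are only partly uncorrelated, so the gain in practice is smaller.

```python
import random
import statistics

random.seed(0)

TRUE_VALUE = 100.0
N_MODELS = 50
N_TRIALS = 1000


def noisy_prediction():
    """Stand-in for one tree: the truth plus an independent random error."""
    return TRUE_VALUE + random.gauss(0, 10)


single_errors = []
bagged_errors = []
for _ in range(N_TRIALS):
    preds = [noisy_prediction() for _ in range(N_MODELS)]
    # Error of one model on its own versus the average of all of them.
    single_errors.append(abs(preds[0] - TRUE_VALUE))
    bagged_errors.append(abs(statistics.mean(preds) - TRUE_VALUE))

print("mean abs error, one model:  ", round(statistics.mean(single_errors), 2))
print("mean abs error, 50 averaged:", round(statistics.mean(bagged_errors), 2))
```

The averaged prediction ends up much closer to the truth because independent errors partly cancel each other out.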
#### Speeding things up
If something takes longer than 10s to run, it's too slow to be interactive. Ideally, you want to create a model and tune hyperparameters quickly, then run on the entire dataset when you head home
- Run your models on subsamples of the data; this gives you most of the insights you would get from training on the whole huge dataset, in far less time
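One simple way to get such a subsample is `DataFrame.sample`. A minimal sketch, where the frame `big` and the sample size are hypothetical:

```python
import pandas as pd

# Hypothetical stand-in for a large training frame.
big = pd.DataFrame({"x": range(100_000), "y": range(100_000)})

# Draw a fixed-size random subset; random_state makes the draw repeatable.
small = big.sample(n=5_000, random_state=42)

print(len(big), "->", len(small))
```

For time-ordered data a purely random subset may not be what you want for validation, but it is usually fine for quick exploration.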

## Notes

- It's good not to know much about the dataset beforehand; it keeps you open-minded as to what the data is saying

- Kaggle API: https://github.com/Kaggle/kaggle-api

- you can open a terminal within jupyter

- you can run shell commands from jupyter by putting ! in front, e.g. **!ls /path** or **!ls {PATH}**

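- read_csv low_memory=False makes pandas read the whole file before inferring column types instead of processing it in chunks; this avoids mixed-type columns at the cost of more memory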

- look at the evaluation metric to decide how to transform the variables (e.g. take the log of the dependent variable when the metric is computed on logs)

- RFs are very robust, great place to start

- Scikit-learn steps:
    1. Create an instance of the model you want
    2. Call fit, passing in the independent variables and the dependent variable


- in jupyter, ? before a name shows its docs, ?? shows its source code

- pandas has a Category data type but doesn't convert anything to it by default; use train_cats(df) to do so. A category column stores a mapping from integers to the strings

- RFs are trivially parallelizable: each tree is fit independently, so n_jobs=-1 spreads the trees across all CPUs and training time scales roughly linearly with the number of cores

- the fastai library in this directory is a symlink (file that points to another file) pointing to the original fastai folder ../../old/fastai
ln -s ../../old/fastai

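- R^2 is useful: it is 1 minus the ratio of your model's squared error (sum of squared residuals) to the squared error of the naive model that always predicts the mean. 1 is perfect, 0 is no better than the mean, and a negative value is worse than the mean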

- creating a validation set is the most important thing to do in an ML project. The test set should never be touched until you are completely done with modeling.

- when dealing with time series data, you want your test set to be from a different (future) time period than your training set. Build your validation set the same way, from the most recent rows, not by random sampling.

- an effective ML model is accurate for the training set and also generalizes well
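The R^2 note above can be checked by hand. A minimal sketch in plain Python; the function name `r_squared` and the numbers are made up for illustration:

```python
def r_squared(actual, predicted):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # model error
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)           # mean-model error
    return 1 - ss_res / ss_tot


actual = [3.0, 5.0, 7.0, 9.0]

perfect = r_squared(actual, actual)               # no error at all
mean_only = r_squared(actual, [6.0] * 4)          # always predict the mean
worse = r_squared(actual, [9.0, 7.0, 5.0, 3.0])   # worse than the mean

print(perfect, mean_only, worse)
```

The three cases give 1, 0, and a negative number, which matches the reading of R^2 as "how much better than always predicting the mean".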

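Several of the notes above (the two scikit-learn steps, n_jobs=-1, and holding out the most recent rows for validation) fit in one short sketch. The data here is synthetic and assumed to be already sorted from oldest to newest; the sizes and hyperparameter values are arbitrary choices for illustration, not recommendations.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical data, assumed to be sorted from oldest to newest sale.
n_rows = 1_000
X = rng.normal(size=(n_rows, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=n_rows)

# Hold out the most recent rows instead of a random sample.
n_valid = 200
X_train, X_valid = X[:-n_valid], X[-n_valid:]
y_train, y_valid = y[:-n_valid], y[-n_valid:]

# 1. Create an instance of the model; n_jobs=-1 uses every available core.
model = RandomForestRegressor(n_estimators=20, min_samples_leaf=3,
                              n_jobs=-1, random_state=0)
# 2. Fit on the independent variables and the dependent variable.
model.fit(X_train, y_train)

# For regressors, score() returns R^2.
print("train R^2:", round(model.score(X_train, y_train), 3))
print("valid R^2:", round(model.score(X_valid, y_valid), 3))
```

Comparing the training and validation scores is a quick check on whether the model generalizes or only fits the training set.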
### Figuring out what add_datepart does

In [19]:
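# Note: this is a slice of df_raw, so add_datepart below raises SettingWithCopyWarning; df_raw[['saledate']].copy() avoids it
test_df = df_raw[['saledate']]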

In [26]:
test_df.columns = ['testing']
test_df

| | testing |
|---|---|
| 0 | 2006-11-16 |
| 1 | 2004-03-26 |
| 2 | 2004-02-26 |
| 3 | 2011-05-19 |
| 4 | 2009-07-23 |
| ... | ... |
| 401120 | 2011-11-02 |
| 401121 | 2011-11-02 |
| 401122 | 2011-11-02 |
| 401123 | 2011-10-25 |


In [27]:
add_datepart(test_df,'testing')


SettingWithCopyWarning, raised three times (once for each flagged line below):
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  for n in attr: df[targ_pre + n] = getattr(fld.dt, n.lower())
  df[targ_pre + 'Elapsed'] = fld.astype(np.int64) // 10 ** 9
  errors=errors,


In [28]:
test_df

| | testingYear | testingMonth | testingWeek | testingDay | testingDayofweek | testingDayofyear | testingIs_month_end | testingIs_month_start | testingIs_quarter_end | testingIs_quarter_start | testingIs_year_end | testingIs_year_start | testingElapsed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2006 | 11 | 46 | 16 | 3 | 320 | False | False | False | False | False | False | 1163635200 |
| 1 | 2004 | 3 | 13 | 26 | 4 | 86 | False | False | False | False | False | False | 1080259200 |
| 2 | 2004 | 2 | 9 | 26 | 3 | 57 | False | False | False | False | False | False | 1077753600 |
| 3 | 2011 | 5 | 20 | 19 | 3 | 139 | False | False | False | False | False | False | 1305763200 |
| 4 | 2009 | 7 | 30 | 23 | 3 | 204 | False | False | False | False | False | False | 1248307200 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 401120 | 2011 | 11 | 44 | 2 | 2 | 306 | False | False | False | False | False | False | 1320192000 |
| 401121 | 2011 | 11 | 44 | 2 | 2 | 306 | False | False | False | False | False | False | 1320192000 |
| 401122 | 2011 | 11 | 44 | 2 | 2 | 306 | False | False | False | False | False | False | 1320192000 |
| 401123 | 2011 | 10 | 43 | 25 | 1 | 298 | False | False | False | False | False | False | 1319500800 |
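Going by the output above, add_datepart replaces the date column with a set of numeric and boolean columns plus an Elapsed column of seconds since 1970-01-01. A rough sketch of the same idea using only pandas' `.dt` accessor follows; it is an approximation for illustration, not the fastai source, and the names `dates` and `parts` are made up. It covers only some of the columns.

```python
import pandas as pd

# Two of the dates shown in the output above.
dates = pd.Series(pd.to_datetime(["2006-11-16", "2004-03-26"]))

parts = pd.DataFrame({
    "Year": dates.dt.year,
    "Month": dates.dt.month,
    # .dt.week is deprecated or unavailable in recent pandas; isocalendar() replaces it.
    "Week": dates.dt.isocalendar().week.astype(int),
    "Day": dates.dt.day,
    "Dayofweek": dates.dt.dayofweek,        # Monday=0 ... Sunday=6
    "Dayofyear": dates.dt.dayofyear,
    "Is_month_end": dates.dt.is_month_end,
    # Seconds since 1970-01-01, independent of the datetime resolution.
    "Elapsed": (dates - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1),
})

print(parts)
```

The values for these two dates can be compared against rows 0 and 1 of the table above.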


In [29]:
add_datepart(df_raw,'saledate')

In [30]:
df_raw

| | SalesID | SalePrice | MachineID | ModelID | datasource | auctioneerID | YearMade | MachineHoursCurrentMeter | UsageBand | fiModelDesc | ... | saleDay | saleDayofweek | saleDayofyear | saleIs_month_end | saleIs_month_start | saleIs_quarter_end | saleIs_quarter_start | saleIs_year_end | saleIs_year_start | saleElapsed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1139246 | 66000 | 999089 | 3157 | 121 | 3.0 | 2004 | 68.0 | Low | 521D | ... | 16 | 3 | 320 | False | False | False | False | False | False | 1163635200 |
| 1 | 1139248 | 57000 | 117657 | 77 | 121 | 3.0 | 1996 | 4640.0 | Low | 950FII | ... | 26 | 4 | 86 | False | False | False | False | False | False | 1080259200 |
| 2 | 1139249 | 10000 | 434808 | 7009 | 121 | 3.0 | 2001 | 2838.0 | High | 226 | ... | 26 | 3 | 57 | False | False | False | False | False | False | 1077753600 |
| 3 | 1139251 | 38500 | 1026470 | 332 | 121 | 3.0 | 2001 | 3486.0 | High | PC120-6E | ... | 19 | 3 | 139 | False | False | False | False | False | False | 1305763200 |
| 4 | 1139253 | 11000 | 1057373 | 17311 | 121 | 3.0 | 2007 | 722.0 | Medium | S175 | ... | 23 | 3 | 204 | False | False | False | False | False | False | 1248307200 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 401120 | 6333336 | 10500 | 1840702 | 21439 | 149 | 1.0 | 2005 | NaN | NaN | 35NX2 | ... | 2 | 2 | 306 | False | False | False | False | False | False | 1320192000 |
| 401121 | 6333337 | 11000 | 1830472 | 21439 | 149 | 1.0 | 2005 | NaN | NaN | 35NX2 | ... | 2 | 2 | 306 | False | False | False | False | False | False | 1320192000 |
| 401122 | 6333338 | 11500 | 1887659 | 21439 | 149 | 1.0 | 2005 | NaN | NaN | 35NX2 | ... | 2 | 2 | 306 | False | False | False | False | False | False | 1320192000 |
| 401123 | 6333341 | 9000 | 1903570 | 21435 | 149 | 2.0 | 2005 | NaN | NaN | 30NX | ... | 25 | 1 | 298 | False | False | False | False | False | False | 1319500800 |
