# Lesson 1

## 00:00:00 - Intro

## 00:01:41 - Setting up development environment

* Crestle gives you a Jupyter notebook for 3c an hour.
* Paperspace is another option.
* All course data is in Fast.ai repo under `fastai` > `courses` > `ml1`.

## 00:05:14 - Recommendations for watching video

* Watch, then follow along with the video later (probably more useful to in-person students).

## 00:06:15 - Course approach

* Top-down approach: lots of practical work upfront, then theory later.
* Course is a summary of 25 years of Jeremy's research - not a summary of other people's research.
* Chance to practise technical writing by authoring blog posts on stuff you learn.

## 00:08:08 - Importing libraries in Jupyter notebook

* Autoreload commands let you edit source code and have it immediately available in Jupyter.

In [1]:
%load_ext autoreload
%autoreload 2

%matplotlib inline

In [66]:
from math import sqrt
from pathlib import Path

from fastai.imports import *
from fastai.structured import *

import pandas as pd
from pandas.api.types import is_string_dtype, is_numeric_dtype
from pandas_summary import DataFrameSummary
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from IPython.display import display
from sklearn import metrics

## 00:08:42 - Why not follow Python code standards?

* Doesn't follow PEP8.
* Basic idea: data science is not software engineering, even if the models eventually end up as software.
  * Prototyping models requires thinking about some new paradigms.
* Can figure out where a function is from by putting its name into Jupyter:

In [3]:
display

<function IPython.core.display.display(*objs, include=None, exclude=None, metadata=None, transient=None, display_id=None, **kwargs)>

* 1 question mark shows docs: `?display`, 2 shows source: `??display`.
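
Outside Jupyter, roughly the same information is available from the standard library's `inspect` module. A minimal sketch, assuming plain Python; `where_from` is a hypothetical helper name, not something Jupyter or Fast.ai provides:

```python
import inspect
import json
from math import sqrt


def where_from(obj):
    """Name of the module that defines obj (hypothetical helper, for illustration only)."""
    return inspect.getmodule(obj).__name__


print(where_from(sqrt))                     # module the function comes from
print(inspect.getdoc(sqrt))                 # docstring, similar to `?sqrt`
print(inspect.getsource(json.dumps)[:80])   # start of the source, similar to `??json.dumps`
```

`inspect.getsource` only works for objects written in Python, which is why the last line uses `json.dumps` rather than the C-implemented `sqrt`.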

## 00:12:08 - Kaggle competition: Blue Book for Bulldozers

* Kaggle comps allow you to download a real-world dataset.
* Can submit to leaderboard of old competitions.
  * No other way to know if you're competent at solving that type of problem.
* Machine learning can help us understand a dataset, not just make predictions about it.
* Downloading data:

In [4]:
PATH = Path('./data/bluebook')

In [5]:
PATH.mkdir(parents=True, exist_ok=True)

In [9]:
!kaggle competitions download -c bluebook-for-bulldozers --path={PATH}

Train.7z: Downloaded 7MB of 7MB to data/bluebook
Train.zip: Downloaded 9MB of 9MB to data/bluebook
Valid.7z: Downloaded 209KB of 209KB to data/bluebook
Valid.csv: Downloaded 3MB of 3MB to data/bluebook
Valid.zip: Downloaded 297KB of 297KB to data/bluebook
Data%20Dictionary.xlsx: Downloaded 11KB of 11KB to data/bluebook
median_benchmark.csv: Downloaded 192KB of 192KB to data/bluebook
Machine_Appendix.csv: Downloaded 49MB of 49MB to data/bluebook
ValidSolution.csv: Downloaded 316KB of 316KB to data/bluebook
TrainAndValid.7z: Downloaded 7MB of 7MB to data/bluebook
TrainAndValid.csv: Downloaded 114MB of 114MB to data/bluebook
TrainAndValid.zip: Downloaded 10MB of 10MB to data/bluebook
Test.csv: Downloaded 3MB of 3MB to data/bluebook
random_forest_benchmark_test.csv: Downloaded 207KB of 207KB to data/bluebook


* Can also download using the [copy as cURL](https://daniel.haxx.se/blog/2015/11/23/copy-as-curl/) command in Firefox.
  * Make sure to use the `-o` flag in cURL to specify the output location.

In [6]:
!ls {PATH}

Data%20Dictionary.xlsx           TrainAndValid.zip
Machine_Appendix.csv             Valid.7z
Test.csv                         Valid.csv
Train.7z                         Valid.zip
Train.csv                        ValidSolution.csv
Train.zip                        median_benchmark.csv
TrainAndValid.7z                 random_forest_benchmark_test.csv
TrainAndValid.csv                tmp


### 00:24:32 - Audience questions

* Q1: What are the curly brackets?
* A1: Expand Python variables before passing to the shell.

## 00:25:14 - Exploring the dataset

* It's in CSV format. Can use `head` to look at the first few lines (after unzipping):

In [16]:
!unzip {PATH}/Train.zip -d {PATH}

Archive:  data/bluebook/Train.zip
  inflating: data/bluebook/Train.csv  


In [7]:
!head {PATH}/Train.csv

SalesID,SalePrice,MachineID,ModelID,datasource,auctioneerID,YearMade,MachineHoursCurrentMeter,UsageBand,saledate,fiModelDesc,fiBaseModel,fiSecondaryDesc,fiModelSeries,fiModelDescriptor,ProductSize,fiProductClassDesc,state,ProductGroup,ProductGroupDesc,Drive_System,Enclosure,Forks,Pad_Type,Ride_Control,Stick,Transmission,Turbocharged,Blade_Extension,Blade_Width,Enclosure_Type,Engine_Horsepower,Hydraulics,Pushblock,Ripper,Scarifier,Tip_Control,Tire_Size,Coupler,Coupler_System,Grouser_Tracks,Hydraulics_Flow,Track_Type,Undercarriage_Pad_Width,Stick_Length,Thumb,Pattern_Changer,Grouser_Type,Backhoe_Mounting,Blade_Type,Travel_Controls,Differential_Type,Steering_Controls
1139246,66000,999089,3157,121,3,2004,68,Low,11/16/2006 0:00,521D,521,D,,,,Wheel Loader - 110.0 to 120.0 Horsepower,Alabama,WL,Wheel Loader,,EROPS w AC,None or Unspecified,,None or Unspecified,,,,,,,,2 Valve,,,,,None or Unspecified,None or Unspecified,,,,,,,,,,,,,Standard,Conventional
1139248,57000,117657,77,121,3,1996,464

* Jeremy considers this data structured data (vs unstructured: images, audio).
  * NLP people refer to structured data as something else.
* Pandas most commonly used tool for dealing with structured data.
  * Everyone uses the same abbreviation: `pd`.
* Can read a csv file using the `read_csv` command.
  * Args:
    * `parse_dates` picks which columns are dates.
    * `low_memory=False` - read more of the file to decide what the types are.
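
To see the effect of `parse_dates` in isolation, here is a small sketch that reads a tiny in-memory CSV twice. The rows are made up for illustration (they are not taken from `Train.csv`), and `plain` / `parsed` are hypothetical variable names:

```python
import io

import pandas as pd

# Made-up rows in the same spirit as Train.csv (not real data).
csv_text = (
    "SalesID,SalePrice,saledate\n"
    "1,66000,2006-11-16\n"
    "2,57000,2004-03-26\n"
)

# Without parse_dates the date column is read as text.
plain = pd.read_csv(io.StringIO(csv_text), low_memory=False)

# With parse_dates the named column becomes a datetime column.
parsed = pd.read_csv(io.StringIO(csv_text), low_memory=False, parse_dates=['saledate'])

print(plain.dtypes)
print(parsed.dtypes)
```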

In [9]:
df_raw = pd.read_csv(f'{PATH}/Train.csv', low_memory=False, parse_dates=['saledate'])

In [10]:
df_raw.head()

Unnamed: 0,SalesID,SalePrice,MachineID,ModelID,datasource,auctioneerID,YearMade,MachineHoursCurrentMeter,UsageBand,saledate,...,Undercarriage_Pad_Width,Stick_Length,Thumb,Pattern_Changer,Grouser_Type,Backhoe_Mounting,Blade_Type,Travel_Controls,Differential_Type,Steering_Controls
0,1139246,66000,999089,3157,121,3.0,2004,68.0,Low,2006-11-16,...,,,,,,,,,Standard,Conventional
1,1139248,57000,117657,77,121,3.0,1996,4640.0,Low,2004-03-26,...,,,,,,,,,Standard,Conventional
2,1139249,10000,434808,7009,121,3.0,2001,2838.0,High,2004-02-26,...,,,,,,,,,,
3,1139251,38500,1026470,332,121,3.0,2001,3486.0,High,2011-05-19,...,,,,,,,,,,
4,1139253,11000,1057373,17311,121,3.0,2007,722.0,Medium,2009-07-23,...,,,,,,,,,,


* Can see the last rows using the `tail()` method.
* If you have a lot of columns, can be worth transposing it with `transpose()` so each column prints as a row:

In [11]:
df_raw.tail().transpose()

Unnamed: 0,401120,401121,401122,401123,401124
SalesID,6333336,6333337,6333338,6333341,6333342
SalePrice,10500,11000,11500,9000,7750
MachineID,1840702,1830472,1887659,1903570,1926965
ModelID,21439,21439,21439,21435,21435
datasource,149,149,149,149,149
auctioneerID,1,1,1,2,2
YearMade,2005,2005,2005,2005,2005
MachineHoursCurrentMeter,,,,,
UsageBand,,,,,
saledate,2011-11-02 00:00:00,2011-11-02 00:00:00,2011-11-02 00:00:00,2011-10-25 00:00:00,2011-10-25 00:00:00


* The value you want to predict (`SalePrice` in this example) is called the "dependent variable".

### 00:33:08 - Audience questions

* Q1: Aren't you at risk of overfitting if you spend too much time looking at the data?
* A1: Prefer "machine learning driven" exploratory data analysis.

## 00:34:06 - Evaluation metrics

* For Kaggle projects, there's an evaluation section that describes how the project is evaluated.
* Bluebook example: root mean squared log error.
* Can replace column with the log of its value, as follows:

In [12]:
sale_price = np.log(df_raw.SalePrice)
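
The reason taking the log up front is convenient: plain RMSE computed on log values is the same number as RMSLE computed on the raw values. A minimal sketch in plain Python, assuming the natural log as in the cell above (some RMSLE definitions use `log(1 + x)` instead); the predictions are made up:

```python
from math import log, sqrt


def rmse(predicted, actual):
    """Root mean squared error."""
    return sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


def rmsle(predicted, actual):
    """Root mean squared log error, using the natural log."""
    return sqrt(sum((log(p) - log(a)) ** 2 for p, a in zip(predicted, actual)) / len(actual))


actual = [66000, 57000, 10000]      # first three sale prices shown earlier
predicted = [60000, 60000, 12000]   # made-up predictions, for illustration only

log_actual = [log(a) for a in actual]
log_predicted = [log(p) for p in predicted]

print(rmsle(predicted, actual))           # RMSLE on raw prices
print(rmse(log_predicted, log_actual))    # RMSE on logged prices: same value
```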

## 00:36:31 - Intro to Random Forests

* Brief: universal machine learning technique.
  * Can be used for categorical or continuous.
  * Can be used with columns of any kind.
  * Doesn't generally overfit, and it's very easy to stop it overfitting if it does.
  * Don't need a separate validation set in general.
  * Few statistical assumptions.
  * Great place to start.

#### 00:38:13 - Curse of dimensionality (audience question)

* Q1: What about curse of dimensionality?
* A1: Idea: the more columns you have, the more empty space you have. Higher dimensions tend to have lots of points on the edges.
  * Doesn't tend to be a problem in practice.
  * Even K nearest neighbours works well in high dimensions.
  * "Theory took over machine learning in the 90s" - today's ML is more empirical.
  
* Related: no free lunch theorem.
   * Claim: no type of model works well for every kind of dataset.
   * It's true in the sense that a dataset could be random, so obviously there won't be a good model for it.
   * In the real world: we aren't using random datasets, so there are techniques that work for almost all kinds of real datasets.
     * Ensembles of decision trees (which is what a random forest is) are one example.
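
The "lots of points on the edges" claim in the curse-of-dimensionality answer can be checked with a little arithmetic. A sketch, assuming points spread uniformly over a unit hypercube and calling a point "near an edge" if it falls in the outer 5% of at least one dimension; `fraction_near_edge` and the 5% margin are illustrative choices, not from the lecture:

```python
def fraction_near_edge(dims, margin=0.05):
    """Fraction of a unit hypercube within `margin` of a face in at least one dimension."""
    # A point avoids every edge only if it sits in the inner band of all dimensions.
    inner = (1 - 2 * margin) ** dims
    return 1 - inner


for dims in (1, 2, 10, 50):
    print(dims, round(fraction_near_edge(dims), 3))
```

With one column only a tenth of the space is near an edge; with fifty columns almost all of it is.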

#### 00:42:53 - Sklearn, Regressors vs Classifiers

* Sklearn: by far the most important package for ML in Python.
  * Does almost everything, though not the best of everything.
* Two types of random forests in sklearn: regressor and classifier:

In [13]:
print(RandomForestRegressor)
print(RandomForestClassifier)

<class 'sklearn.ensemble.forest.RandomForestRegressor'>
<class 'sklearn.ensemble.forest.RandomForestClassifier'>


* A lot of people think "regressor" means linear regression, which is not true.
* Regressor = something which predicts a continuous output.

* Putting the cursor over a method/function and pressing Shift-tab in Jupyter will return the docs.

* First attempt at running regressor:

In [14]:
m = RandomForestRegressor(n_jobs=-1)
m.fit(df_raw, sale_price)

ValueError: could not convert string to float: 'Conventional'

* Error tells you that you need to convert strings into numbers: that's what an ML model expects.
* First issue: `saledate` is a date. Need to convert to ints.
  * Can be converted with Fast.ai's `add_datepart`
  * Adds columns like: day of month, day of year, is it a public holiday and so on.
    * Any important stuff you can tell the model about the date? Special events etc.

In [15]:
df_raw.head()

Unnamed: 0,SalesID,SalePrice,MachineID,ModelID,datasource,auctioneerID,YearMade,MachineHoursCurrentMeter,UsageBand,saledate,...,Undercarriage_Pad_Width,Stick_Length,Thumb,Pattern_Changer,Grouser_Type,Backhoe_Mounting,Blade_Type,Travel_Controls,Differential_Type,Steering_Controls
0,1139246,66000,999089,3157,121,3.0,2004,68.0,Low,2006-11-16,...,,,,,,,,,Standard,Conventional
1,1139248,57000,117657,77,121,3.0,1996,4640.0,Low,2004-03-26,...,,,,,,,,,Standard,Conventional
2,1139249,10000,434808,7009,121,3.0,2001,2838.0,High,2004-02-26,...,,,,,,,,,,
3,1139251,38500,1026470,332,121,3.0,2001,3486.0,High,2011-05-19,...,,,,,,,,,,
4,1139253,11000,1057373,17311,121,3.0,2007,722.0,Medium,2009-07-23,...,,,,,,,,,,


In [16]:
add_datepart(df_raw, 'saledate')

In [17]:
df_raw.head()

Unnamed: 0,SalesID,SalePrice,MachineID,ModelID,datasource,auctioneerID,YearMade,MachineHoursCurrentMeter,UsageBand,fiModelDesc,...,saleDay,saleDayofweek,saleDayofyear,saleIs_month_end,saleIs_month_start,saleIs_quarter_end,saleIs_quarter_start,saleIs_year_end,saleIs_year_start,saleElapsed
0,1139246,66000,999089,3157,121,3.0,2004,68.0,Low,521D,...,16,3,320,False,False,False,False,False,False,1163635200
1,1139248,57000,117657,77,121,3.0,1996,4640.0,Low,950FII,...,26,4,86,False,False,False,False,False,False,1080259200
2,1139249,10000,434808,7009,121,3.0,2001,2838.0,High,226,...,26,3,57,False,False,False,False,False,False,1077753600
3,1139251,38500,1026470,332,121,3.0,2001,3486.0,High,PC120-6E,...,19,3,139,False,False,False,False,False,False,1305763200
4,1139253,11000,1057373,17311,121,3.0,2007,722.0,Medium,S175,...,23,3,204,False,False,False,False,False,False,1248307200


* Can access datetime-related attributes on datetime columns using `df.<column>.dt.<attribute>`.
  * No harm in adding more columns, might as well use all datetime attributes.
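
A rough sketch of the idea behind `add_datepart`, built only from the Pandas `.dt` accessor. This is not Fast.ai's implementation: it covers only a handful of attributes, and the `sales` DataFrame is a hypothetical two-row example using the first two sale dates shown above:

```python
import pandas as pd

sales = pd.DataFrame({'saledate': pd.to_datetime(['2006-11-16', '2004-03-26'])})

# Expand the date into several numeric/boolean columns.
# Column names follow the pattern seen in the output above (saleYear, saleDayofweek, ...).
for attr in ['year', 'month', 'day', 'dayofweek', 'dayofyear', 'is_month_end']:
    sales['sale' + attr.capitalize()] = getattr(sales['saledate'].dt, attr)

# Seconds since 1970-01-01, comparable to the saleElapsed column.
sales['saleElapsed'] = (sales['saledate'] - pd.Timestamp('1970-01-01')) // pd.Timedelta(seconds=1)

# The original datetime column is no longer needed once it has been expanded.
sales = sales.drop(columns='saledate')

print(sales.transpose())
```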

* Also need to convert strings:
  * Can use Pandas `category` type to convert strings to categorical.

In [18]:
df_raw.UsageBand.head()

0       Low
1       Low
2      High
3      High
4    Medium
Name: UsageBand, dtype: object

In [19]:
for col_name, col in df_raw.items():
    if is_string_dtype(col):
        df_raw[col_name] = col.astype('category').cat.as_ordered()

In [20]:
df_raw.UsageBand.cat.categories

Index(['High', 'Low', 'Medium'], dtype='object')

* May want to order certain categories, where it makes sense (like above).
  * Can use `set_categories` to do that:

In [21]:
# Assign the result back: the `inplace` argument was removed in newer Pandas releases.
df_raw['UsageBand'] = df_raw.UsageBand.cat.set_categories(['High', 'Medium', 'Low'], ordered=True)

In [22]:
df_raw.UsageBand.cat.codes.head()

0    2
1    2
2    0
3    0
4    1
dtype: int8
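
A self-contained sketch of how category codes behave, including what happens to a missing value. The `usage` Series is a made-up example, not the real `UsageBand` column:

```python
import pandas as pd

# Made-up values, including one missing entry.
usage = pd.Series(['Low', 'Low', 'High', None, 'Medium']).astype('category')

# By default the categories are sorted alphabetically.
default_categories = list(usage.cat.categories)
print(default_categories)

# Reorder so the codes follow a meaningful order.
usage = usage.cat.set_categories(['High', 'Medium', 'Low'], ordered=True)

# Each value is stored as the position of its category; missing values get -1.
codes = usage.cat.codes.tolist()
print(codes)
```

The `-1` for the missing entry is why later preprocessing adds 1 to the codes.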

#### 01:00:56 - Audience question

* Q1: Can you explain the column ordering?
* A1: (reexplains ordering)

#### 01:04:17 - Find missing values

* Can use `isnull` with `sum` to find the fraction of missing values in each column:

In [23]:
df_raw.isnull().sum().sort_index() / len(df_raw)

Backhoe_Mounting            0.803872
Blade_Extension             0.937129
Blade_Type                  0.800977
Blade_Width                 0.937129
Coupler                     0.466620
Coupler_System              0.891660
Differential_Type           0.826959
Drive_System                0.739829
Enclosure                   0.000810
Enclosure_Type              0.937129
Engine_Horsepower           0.937129
Forks                       0.521154
Grouser_Tracks              0.891899
Grouser_Type                0.752813
Hydraulics                  0.200823
Hydraulics_Flow             0.891899
MachineHoursCurrentMeter    0.644089
MachineID                   0.000000
ModelID                     0.000000
Pad_Type                    0.802720
Pattern_Changer             0.752651
ProductGroup                0.000000
ProductGroupDesc            0.000000
ProductSize                 0.525460
Pushblock                   0.937129
Ride_Control                0.629527
Ripper                      0.740388
S

### 01:05:23 - Saving state of dataframe to "feather"

* Can use `to_feather` to save data to disk in the same way as it's stored in RAM.
  * By far the fastest way to save and read a DataFrame.
  * Becoming the standard even in Spark and Java.

In [24]:
(PATH / 'tmp').mkdir(exist_ok=True)
df_raw.to_feather(PATH / 'tmp' / 'raw')

* Can be read with `pd.read_feather`.

In [25]:
df_raw = pd.read_feather(PATH / 'tmp' / 'raw')

### 01:07:37 - Final preprocessing

* Want to replace strings with numeric codes, handle missing continuous values and split the dependent variable out.
* Can do it all with Fast.ai's `proc_df` function.

In [26]:
proc_df

<function fastai.structured.proc_df(df, y_fld=None, skip_flds=None, ignore_flds=None, do_scale=False, na_dict=None, preproc_fn=None, max_n_cat=None, subset=None, mapper=None)>

In [36]:
df, y, nas = proc_df(df_raw, 'SalePrice')

### 01:08:01 - `proc_df` internals

* What `proc_df` does:
  1. Takes a DataFrame and the name of the dependent variable column as input.
  2. Makes a copy of the DataFrame.
  3. Extracts the dependent variable.
  4. Prepares continuous columns: fills missing values with the column median and adds a column that records whether the value was missing.
  5. Prepares categorical columns by replacing the values with their numeric codes, plus 1. Pandas uses the code -1 for missing values, so adding 1 maps missing to 0 and keeps every code non-negative.

In [76]:
df_copy = df_raw.copy()

sale_price = np.log(df_copy.pop('SalePrice'))

for col_name, col in df_copy.items():
    
    if is_numeric_dtype(col):

        # Add a column that defines whether the value is NA or not.
        if pd.isna(col).sum():
            
            df_copy[f'{col_name}_is_na'] = pd.isna(col)
        
        # Set the value to the median of the dataset.
        df_copy[col_name] = col.fillna(col.median())

        continue
        
    # Assume categorical
    
    # Add 1 to move -1 to 0.
    df_copy[col_name] = df_copy[col_name].cat.codes + 1

In [70]:
df_copy.head()

Unnamed: 0,SalesID,MachineID,ModelID,datasource,auctioneerID,YearMade,MachineHoursCurrentMeter,UsageBand,fiModelDesc,fiBaseModel,...,saleDayofyear,saleIs_month_end,saleIs_month_start,saleIs_quarter_end,saleIs_quarter_start,saleIs_year_end,saleIs_year_start,saleElapsed,auctioneerID_is_na,MachineHoursCurrentMeter_is_na
0,1139246,999089,3157,121,3.0,2004,68.0,3,950,296,...,320,False,False,False,False,False,False,1163635200,False,False
1,1139248,117657,77,121,3.0,1996,4640.0,3,1725,527,...,86,False,False,False,False,False,False,1080259200,False,False
2,1139249,434808,7009,121,3.0,2001,2838.0,1,331,110,...,57,False,False,False,False,False,False,1077753600,False,False
3,1139251,1026470,332,121,3.0,2001,3486.0,1,3674,1375,...,139,False,False,False,False,False,False,1305763200,False,False
4,1139253,1057373,17311,121,3.0,2007,722.0,2,4208,1529,...,204,False,False,False,False,False,False,1248307200,False,False


In [71]:
sale_price.head()

0    11.097410
1    10.950807
2     9.210340
3    10.558414
4     9.305651
Name: SalePrice, dtype: float64

In [72]:
df_copy.columns

Index(['SalesID', 'MachineID', 'ModelID', 'datasource', 'auctioneerID',
       'YearMade', 'MachineHoursCurrentMeter', 'UsageBand', 'fiModelDesc',
       'fiBaseModel', 'fiSecondaryDesc', 'fiModelSeries', 'fiModelDescriptor',
       'ProductSize', 'fiProductClassDesc', 'state', 'ProductGroup',
       'ProductGroupDesc', 'Drive_System', 'Enclosure', 'Forks', 'Pad_Type',
       'Ride_Control', 'Stick', 'Transmission', 'Turbocharged',
       'Blade_Extension', 'Blade_Width', 'Enclosure_Type', 'Engine_Horsepower',
       'Hydraulics', 'Pushblock', 'Ripper', 'Scarifier', 'Tip_Control',
       'Tire_Size', 'Coupler', 'Coupler_System', 'Grouser_Tracks',
       'Hydraulics_Flow', 'Track_Type', 'Undercarriage_Pad_Width',
       'Stick_Length', 'Thumb', 'Pattern_Changer', 'Grouser_Type',
       'Backhoe_Mounting', 'Blade_Type', 'Travel_Controls',
       'Differential_Type', 'Steering_Controls', 'saleYear', 'saleMonth',
       'saleWeek', 'saleDay', 'saleDayofweek', 'saleDayofyear',
       'saleI

* Notice that we leave `ModelID` and `MachineID` in. These identifiers shouldn't make much sense for a model to learn from, but they don't tend to cause problems with random forests.
* Random forests are "trivially parallelisable".
  * Pass `n_jobs=-1` to create a separate job for each CPU.

In [60]:
m = RandomForestRegressor(n_jobs=-1)
m.fit(df_copy, sale_price)
m.score(df_copy, sale_price)

0.9825776291317143

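* The score is measured using $r^2$, where 1 is best and 0 means the model is no better than always predicting the mean. It can go negative if the model is worse than that.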
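
To make the $r^2$ score concrete, here is a hand-rolled sketch of the usual definition, 1 minus the ratio of the model's squared error to the squared error of always predicting the mean. The numbers are made up and `r_squared` is a hypothetical helper, not the sklearn implementation:

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_tot = sum((a - mean) ** 2 for a in actual)                 # error of predicting the mean
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted)) # error of the model
    return 1 - ss_res / ss_tot


actual = [9.2, 10.6, 11.1, 9.3]        # made-up log prices
mean = sum(actual) / len(actual)

perfect = r_squared(actual, actual)                   # predictions match exactly
baseline = r_squared(actual, [mean] * len(actual))    # always predict the mean
bad = r_squared(actual, [12.0, 8.0, 9.0, 12.0])       # worse than predicting the mean

print(perfect, baseline, bad)
```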

## 01:13:24 - Measuring overfitting

* Want to separate data into training and validation to measure how well your training is actually doing.

In [80]:
def split_vals(a, n):
    return a[:n].copy(), a[n:].copy()

num_valid = 12000
num_train = len(df_copy) - num_valid

X_train, X_valid = split_vals(df_copy, num_train)
y_train, y_valid = split_vals(sale_price, num_train)

X_train.shape, y_train.shape, X_valid.shape, y_valid.shape

((389125, 66), (389125,), (12000, 66), (12000,))

In [81]:
m = RandomForestRegressor(n_jobs=-1)
m.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=-1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [82]:
predictions = m.predict(X_valid)
errors_squared = (predictions - y_valid) ** 2
mean_error = errors_squared.mean()
print('Root mean squared error (validation):', sqrt(mean_error))

Root mean squared error (validation): 0.24580048459670867


* Would get us to about 28th on the private leaderboard and 136th on the public.

## 01:16:09 - Assignment

* Try these steps on as many Kaggle competitions as you can.