# Lesson 3 Notes: Large Datasets
This notebook looks at Rossmann store sales prediction.
Random forests (RFs) allow us to understand our data more deeply than traditional ML techniques:
- for structured data - random forests
- for unstructured data - deep learning
- collaborative filtering - another ML model use case

## Large Datasets
- Usually, reading from and writing to RAM is the bottleneck
- check the datatypes before loading and use the smallest dtype that can hold each column's values
- dates are usually important; make sure the dates in the training and test sets don't overlap
    - understand the date ranges in the training set and the test set
- `%prun` profiles a statement and shows which function calls take the most time
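
As a rough illustration of the dtype point above, the sketch below downcasts integer columns and compares memory use. The frame `df_demo` is a made-up stand-in (only the column names are borrowed from the Rossmann data), so the byte counts say nothing about the real file.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for a large CSV; values are synthetic.
df_demo = pd.DataFrame({
    "Store": np.arange(1, 1001, dtype="int64"),
    "DayOfWeek": np.tile(np.arange(1, 8, dtype="int64"), 143)[:1000],
    "Promo": np.tile(np.array([0, 1], dtype="int64"), 500),
})

before = df_demo.memory_usage(deep=True).sum()

# Downcast each integer column to the smallest unsigned type that holds its values.
small = df_demo.copy()
for col in ["Store", "DayOfWeek"]:
    small[col] = pd.to_numeric(small[col], downcast="unsigned")
# A 0/1 flag can be stored as bool.
small["Promo"] = small["Promo"].astype(bool)

after = small.memory_usage(deep=True).sum()
print(small.dtypes)
print(f"{before} bytes -> {after} bytes")
```

When the value ranges are known up front, passing an explicit `dtype` mapping to `pd.read_csv` avoids ever materialising the wide types.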

### How to know if there are errors in your model
- ML models are tricky; a small error can cause mistakes without you knowing
- ask "What do I know about the outputs?"
- Need a good validation set.
- Plot validation results against test set results; the points should follow the y=x line if it's a good validation set (this is the only time to use the test set before the final model is tuned)
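
One way to put a number on the y=x check is to fit a line through (validation score, test score) pairs from several models. This is only a sketch: the scores below are invented for illustration and do not come from this notebook.

```python
import numpy as np

# Hypothetical scores for five models: made-up numbers, for illustration only.
val_scores = np.array([0.62, 0.70, 0.75, 0.81, 0.86])
test_scores = np.array([0.60, 0.71, 0.74, 0.80, 0.87])

# A good validation set should give a slope near 1, an intercept near 0,
# and a high correlation between the two sets of scores.
slope, intercept = np.polyfit(val_scores, test_scores, deg=1)
corr = np.corrcoef(val_scores, test_scores)[0, 1]
print(f"slope={slope:.2f}, intercept={intercept:.2f}, corr={corr:.3f}")
```

If the points scatter widely or the slope is far from 1, improvements on the validation set are unlikely to carry over to the test set.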

In [21]:
%load_ext autoreload
%autoreload 2

%matplotlib inline
import matplotlib.pyplot as plt
import math

In [19]:
import sys
import os
sys.path.insert(0, "/Users/JI/Documents/Github/fastai/old/")
# print(sys.path)
import fastai
print(sys.modules['fastai'])

<module 'fastai' from '/Users/JI/Documents/Github/fastai/courses/ml1/fastai/__init__.py'>


In [50]:
from fastai.structured import *
import pandas as pd
import numpy as np
from pandas_summary import DataFrameSummary
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from IPython.display import display
from sklearn import metrics

#### Load Rossmann data: Training set

In [3]:
types = {"Store": "uint16",
         "DayOfWeek": "uint8",
         "Sales": "uint16",
         "Customers": "uint16",
         "Open": "bool",
         "Promo": "bool",
         "StateHoliday":"object",
         "SchoolHoliday": "bool"}
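# NOTE: a bare %time line times an empty statement (hence the microsecond readings below).
# To time the read itself, put %%time on the first line of the cell.
%time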
df = pd.read_csv("./data/rossmann/train.csv",parse_dates=['Date'],dtype=types,
                 infer_datetime_format=True)

CPU times: user 3 µs, sys: 1 µs, total: 4 µs
Wall time: 6.2 µs


In [4]:
df.dtypes

Store                    uint16
DayOfWeek                 uint8
Date             datetime64[ns]
Sales                    uint16
Customers                uint16
Open                       bool
Promo                      bool
StateHoliday             object
SchoolHoliday              bool
dtype: object

In [5]:
df.Promo.isnull().values.any()

False

In [6]:
df.describe()

| | Store | DayOfWeek | Sales | Customers |
|---|---|---|---|---|
| count | 1017209.0 | 1017209.0 | 1017209.0 | 1017209.0 |
| mean | 558.4297 | 3.998341 | 5773.819 | 633.1459 |
| std | 321.9087 | 1.997391 | 3849.926 | 464.4117 |
| min | 1.0 | 1.0 | 0.0 | 0.0 |
| 25% | 280.0 | 2.0 | 3727.0 | 405.0 |
| 50% | 558.0 | 4.0 | 5744.0 | 609.0 |
| 75% | 838.0 | 6.0 | 7856.0 | 837.0 |
| max | 1115.0 | 7.0 | 41551.0 | 7388.0 |


In [7]:
%time df.to_feather('/tmp/rossmann')

CPU times: user 116 ms, sys: 24.4 ms, total: 141 ms
Wall time: 182 ms


In [8]:
%time df.describe(include='all')

CPU times: user 274 ms, sys: 18 ms, total: 292 ms
Wall time: 331 ms


| | Store | DayOfWeek | Date | Sales | Customers | Open | Promo | StateHoliday | SchoolHoliday |
|---|---|---|---|---|---|---|---|---|---|
| count | 1017209.0 | 1017209.0 | 1017209 | 1017209.0 | 1017209.0 | 1017209 | 1017209 | 1017209.0 | 1017209 |
| unique | | | 942 | | | 2 | 2 | 4.0 | 2 |
| top | | | 2015-06-09 00:00:00 | | | True | False | 0.0 | False |
| freq | | | 1115 | | | 844392 | 629129 | 986159.0 | 835488 |
| first | | | 2013-01-01 00:00:00 | | | | | | |
| last | | | 2015-07-31 00:00:00 | | | | | | |
| mean | 558.4297 | 3.998341 | | 5773.819 | 633.1459 | | | | |
| std | 321.9087 | 1.997391 | | 3849.926 | 464.4117 | | | | |
| min | 1.0 | 1.0 | | 0.0 | 0.0 | | | | |
| 25% | 280.0 | 2.0 | | 3727.0 | 405.0 | | | | |


In [9]:
df.StateHoliday.unique()

array(['0', 'a', 'b', 'c'], dtype=object)

#### Test Set

In [10]:
types = {"Id":"uint16",
         "Store": "uint16",
         "DayOfWeek": "uint8",
         "Sales": "uint16",
         "Customers": "uint16",
         "Open": "object",
         "Promo": "bool",
         "StateHoliday":"object",
         "SchoolHoliday": "bool"}
df_test = pd.read_csv("./data/rossmann/test.csv",parse_dates=['Date'],dtype=types,
                 infer_datetime_format=True)
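# Caution: the keys 'False'/'True' below appear to match none of the values in Open, so
# map() returns NaN for every row and astype(bool) then turns each NaN into True.
# The summary that follows (unique = 1, top = True, freq = 41088) is consistent with this.
df_test.Open.fillna(False,inplace=True)
df_test.Open = df_test.Open.map({'False':False,'True':True})
df_test.Open = df_test.Open.astype(bool)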
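
A more defensive way to turn an object-typed flag column into booleans is to parse it as numbers first and only then cast. The sketch below uses a toy Series; the raw strings (`"1.0"`, `"0.0"`) are an assumption about how the column looks when read with `dtype=object`, and treating a missing value as closed simply follows the `fillna(False)` choice in the cell above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for an Open column read with dtype=object (raw strings are assumed).
open_raw = pd.Series(["1.0", "0.0", np.nan, "1.0"], dtype="object")

# Failure mode: no dict key matches, so map() yields NaN everywhere,
# and NaN is truthy when cast to bool.
all_true = open_raw.fillna(False).map({"False": False, "True": True}).astype(bool)

# Safer: parse to numbers, decide explicitly what "missing" means, then cast.
open_num = pd.to_numeric(open_raw, errors="coerce")
open_bool = open_num.fillna(0).astype(bool)  # missing -> closed

print(all_true.tolist())
print(open_bool.tolist())
```

Checking `value_counts(dropna=False)` on the raw column before and after the conversion is a cheap way to catch this kind of silent change.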

In [11]:
%time df_test.describe(include='all')

CPU times: user 33.4 ms, sys: 2.14 ms, total: 35.5 ms
Wall time: 50.4 ms


| | Id | Store | DayOfWeek | Date | Open | Promo | StateHoliday | SchoolHoliday |
|---|---|---|---|---|---|---|---|---|
| count | 41088.0 | 41088.0 | 41088.0 | 41088 | 41088 | 41088 | 41088.0 | 41088 |
| unique | | | | 48 | 1 | 2 | 2.0 | 2 |
| top | | | | 2015-09-15 00:00:00 | True | False | 0.0 | False |
| freq | | | | 856 | 41088 | 24824 | 40908.0 | 22866 |
| first | | | | 2015-08-01 00:00:00 | | | | |
| last | | | | 2015-09-17 00:00:00 | | | | |
| mean | 20544.5 | 555.899533 | 3.979167 | | | | | |
| std | 11861.228267 | 320.274496 | 2.015481 | | | | | |
| min | 1.0 | 1.0 | 1.0 | | | | | |
| 25% | 10272.75 | 279.75 | 2.0 | | | | | |


In [16]:
df_test.dtypes, len(df_test)

(Id                       uint16
 Store                    uint16
 DayOfWeek                 uint8
 Date             datetime64[ns]
 Open                       bool
 Promo                      bool
 StateHoliday             object
 SchoolHoliday              bool
 dtype: object,
 41088)
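
The summaries above give the training dates as 2013-01-01 to 2015-07-31 and the test dates as 2015-08-01 to 2015-09-17. A small check like the following confirms that two date ranges do not overlap; `ranges_overlap` is a hypothetical helper, and the two Series are rebuilt from those endpoints rather than read from the CSV files.

```python
import pandas as pd

# Daily ranges rebuilt from the first/last dates reported in the summaries.
train_dates = pd.Series(pd.date_range("2013-01-01", "2015-07-31", freq="D"))
test_dates = pd.Series(pd.date_range("2015-08-01", "2015-09-17", freq="D"))

def ranges_overlap(a: pd.Series, b: pd.Series) -> bool:
    """True if the closed intervals [a.min, a.max] and [b.min, b.max] intersect."""
    return bool(a.min() <= b.max() and b.min() <= a.max())

# Distance between the last training day and the first test day.
gap = test_dates.min() - train_dates.max()
print(ranges_overlap(train_dates, test_dates), gap)
```

On the real frames the same function can be called on `df.Date` and `df_test.Date` before `add_datepart` removes the column.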

#### Cleaning

In [13]:
df.tail()

| | Store | DayOfWeek | Date | Sales | Customers | Open | Promo | StateHoliday | SchoolHoliday |
|---|---|---|---|---|---|---|---|---|---|
| 1017204 | 1111 | 2 | 2013-01-01 | 0 | 0 | False | False | a | True |
| 1017205 | 1112 | 2 | 2013-01-01 | 0 | 0 | False | False | a | True |
| 1017206 | 1113 | 2 | 2013-01-01 | 0 | 0 | False | False | a | True |
| 1017207 | 1114 | 2 | 2013-01-01 | 0 | 0 | False | False | a | True |
| 1017208 | 1115 | 2 | 2013-01-01 | 0 | 0 | False | False | a | True |


#### Check sales values to match evaluation metric
https://gist.github.com/bshishov/5dc237f59f019b26145648e2124ca1c9

- In this case the evaluation metric is RMSPE (root mean squared percentage error), so there is no need to take the log of the sales values. Sales are left in raw units.

In [15]:
df.Sales

0           5263
1           6064
2           8314
3          13995
4           4822
           ...  
1017204        0
1017205        0
1017206        0
1017207        0
1017208        0
Name: Sales, Length: 1017209, dtype: uint16

#### Add datepart

In [22]:
%time add_datepart(df,'Date')

CPU times: user 848 ms, sys: 152 ms, total: 1 s
Wall time: 1.35 s


In [23]:
df.head()

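| | Store | DayOfWeek | Sales | Customers | Open | Promo | StateHoliday | SchoolHoliday | Year | Month | ... | Day | Dayofweek | Dayofyear | Is_month_end | Is_month_start | Is_quarter_end | Is_quarter_start | Is_year_end | Is_year_start | Elapsed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 5 | 5263 | 555 | True | True | 0 | True | 2015 | 7 | ... | 31 | 4 | 212 | True | False | False | False | False | False | 1438300800 |
| 1 | 2 | 5 | 6064 | 625 | True | True | 0 | True | 2015 | 7 | ... | 31 | 4 | 212 | True | False | False | False | False | False | 1438300800 |
| 2 | 3 | 5 | 8314 | 821 | True | True | 0 | True | 2015 | 7 | ... | 31 | 4 | 212 | True | False | False | False | False | False | 1438300800 |
| 3 | 4 | 5 | 13995 | 1498 | True | True | 0 | True | 2015 | 7 | ... | 31 | 4 | 212 | True | False | False | False | False | False | 1438300800 |
| 4 | 5 | 5 | 4822 | 559 | True | True | 0 | True | 2015 | 7 | ... | 31 | 4 | 212 | True | False | False | False | False | False | 1438300800 |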
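
To make the new columns less opaque, here is a pandas-only approximation of this kind of date expansion. `add_datepart_sketch` is a hypothetical name; it is not fastai's implementation and covers only a subset of the columns shown above.

```python
import pandas as pd

def add_datepart_sketch(frame: pd.DataFrame, col: str) -> pd.DataFrame:
    """Expand a datetime column into numeric/boolean parts (illustrative only)."""
    out = frame.copy()
    dates = out[col].dt
    out["Year"] = dates.year
    out["Month"] = dates.month
    out["Day"] = dates.day
    out["Dayofweek"] = dates.dayofweek  # Monday=0 ... Sunday=6
    out["Dayofyear"] = dates.dayofyear
    out["Is_month_end"] = dates.is_month_end
    out["Is_month_start"] = dates.is_month_start
    # Whole seconds since the Unix epoch.
    out["Elapsed"] = (out[col] - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)
    # Drop the original datetime column once its parts are extracted.
    return out.drop(columns=[col])

demo = pd.DataFrame({"Date": pd.to_datetime(["2015-07-31", "2013-01-01"])})
parts = add_datepart_sketch(demo, "Date")
print(parts)
```

For 2015-07-31 this gives the same `Dayofweek`, `Dayofyear`, `Is_month_end` and `Elapsed` values as the first rows of the table above.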


#### Split raw df into train and val sets; the val set should contain the most recent values


In [24]:
def split_vals(a,n):
    return a[:n].copy(),a[n:].copy()

In [30]:
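n_valid = len(df_test)
n_train = len(df)-n_valid
# Caution: df is ordered newest-first (head() shows 2015-07-31, tail() shows 2013-01-01),
# so this positional split puts the oldest rows in val_set, not the most recent ones.
training_set, val_set = split_vals(df,n_train)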
training_set.shape, val_set.shape

((976121, 21), (41088, 21))
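
A positional split only holds out the most recent rows if the frame is sorted oldest-first. The sketch below sorts on a time column before slicing; `split_by_time` is a hypothetical helper and the frame is a toy example ordered newest-first, like the training file appears to be.

```python
import pandas as pd

def split_by_time(frame: pd.DataFrame, time_col: str, n_valid: int):
    """Sort oldest-first, then hold out the last n_valid rows for validation."""
    ordered = frame.sort_values(time_col, kind="stable").reset_index(drop=True)
    n_train = len(ordered) - n_valid
    return ordered.iloc[:n_train].copy(), ordered.iloc[n_train:].copy()

# Toy frame ordered newest-first.
demo = pd.DataFrame({"Elapsed": [50, 40, 30, 20, 10], "Sales": [5, 4, 3, 2, 1]})
train_part, val_part = split_by_time(demo, "Elapsed", n_valid=2)
print(train_part["Elapsed"].tolist(), val_part["Elapsed"].tolist())
```

Splitting by row count can still cut through the middle of a single day when many stores share a date; splitting on a cutoff date instead avoids that.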

#### Separate response variables in df and convert everything to numeric

In [39]:
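# NOTE: a bare %time line times an empty statement (hence the microsecond readings below).
# To time both proc_df calls, put %%time on the first line of the cell.
%time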
X_train, y_train, _ = proc_df(training_set,y_fld='Sales')
X_val, y_val, _ = proc_df(val_set,y_fld='Sales')

CPU times: user 12 µs, sys: 1 µs, total: 13 µs
Wall time: 35 µs
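
For intuition about what "separate the response and convert everything to numeric" involves, here is a pandas-only sketch on a toy frame. It is not fastai's `proc_df`, which does more than this; it only shows the response being pulled out and an object column being replaced by integer category codes.

```python
import pandas as pd

# Toy frame; StateHoliday uses the same four labels as the training data.
demo = pd.DataFrame({
    "StateHoliday": ["0", "a", "b", "c", "0"],
    "Sales": [10, 0, 0, 0, 12],
})

# Response variable as a plain array, features as everything else.
y = demo["Sales"].to_numpy()
X = demo.drop(columns=["Sales"])

# Integer codes for the categorical column; the +1 shifts pandas' code for
# missing values (-1) to 0, so real categories start at 1.
X["StateHoliday"] = X["StateHoliday"].astype("category").cat.codes + 1

print(X["StateHoliday"].tolist(), y.tolist())
```

When training and validation frames are encoded separately like this, the codes only line up if both frames contain the same set of categories, so it is worth fixing the category list once and reusing it.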


### Models

In [40]:
EPSILON = 1e-10

def _error(actual: np.ndarray, predicted: np.ndarray):
    """ Simple error """
    return actual - predicted

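def _percentage_error(actual: np.ndarray, predicted: np.ndarray):
    """
    Percentage error
    Note: result is NOT multiplied by 100
    Note: where actual == 0 the denominator is just EPSILON, so any nonzero
    error on those rows becomes enormous and dominates the mean.
    """
    return _error(actual, predicted) / (actual + EPSILON)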

def rmspe(actual: np.ndarray, predicted: np.ndarray):
    """
    Root Mean Squared Percentage Error
    Note: result is NOT multiplied by 100
    """
    return np.sqrt(np.mean(np.square(_percentage_error(actual, predicted))))

In [41]:
def print_score(m):
    # Prints [train RMSPE, val RMSPE, train R^2, val R^2, (OOB R^2 if available)]
    res = [rmspe(y_train,m.predict(X_train)),rmspe(y_val,m.predict(X_val)),
           m.score(X_train,y_train),m.score(X_val,y_val)]
    if hasattr(m,'oob_score_'): res.append(m.oob_score_)
    print(res)

In [51]:
# set_rf_samples(100_000)
reset_rf_samples()
m = RandomForestRegressor(n_estimators=100,min_samples_leaf=10,n_jobs=4)
%prun m.fit(X_train,y_train)

 

         127481 function calls (126254 primitive calls) in 291.382 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      422  290.214    0.688  290.214    0.688 {method 'acquire' of '_thread.lock' objects}
    18/16    0.592    0.033    1.061    0.066 {built-in method numpy.array}
        5    0.244    0.049    0.244    0.049 {method 'astype' of 'numpy.ndarray' objects}
        1    0.127    0.127    0.469    0.469 managers.py:834(_interleave)
        4    0.098    0.024    0.098    0.024 {built-in method numpy.empty}
        2    0.019    0.010    0.019    0.010 {method 'reduce' of 'numpy.ufunc' objects}
        1    0.010    0.010  291.382  291.382 <string>:1(<module>)
      500    0.010    0.000    0.032    0.000 inspect.py:2102(_signature_from_function)
     6500    0.009    0.000    0.015    0.000 inspect.py:2452(__init__)
      500    0.003    0.000    0.005    0.000 base.py:164(<listcomp>)
      500    0.003    0.000

In [52]:
print_score(m)

[1042345421.5645587, 2319965865.481266, 0.9638837404090848, 0.9553316424467648]
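
The two RMSPE values above are in the billions while R² is around 0.96. The likely cause is the rows where Sales is 0 (closed stores), where the percentage error divides by `EPSILON`. As far as I recall, the competition's scoring ignores days with zero sales, so a version that masks those rows is a fairer check. The function name `rmspe_nonzero` and the sample numbers below are illustrative assumptions.

```python
import numpy as np

def rmspe_nonzero(actual, predicted) -> float:
    """RMSPE computed only over rows where the actual value is nonzero."""
    actual = np.asarray(actual, dtype=np.float64)
    predicted = np.asarray(predicted, dtype=np.float64)
    mask = actual != 0
    pct_err = (actual[mask] - predicted[mask]) / actual[mask]
    return float(np.sqrt(np.mean(np.square(pct_err))))

# Made-up values: the third row has zero sales and is excluded from the score.
actual = np.array([100, 200, 0, 400])
predicted = np.array([110, 180, 5, 400])
score = rmspe_nonzero(actual, predicted)
print(score)
```

If every actual value is zero the mask is empty and the result is NaN, so a guard for that case may be worth adding before using this on small slices.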
