# Performance Validation and Model Interpretation - Chapter 03

## 0. Learning Objectives

* To understand and `interpret` the predictive model
* We demystify the idea that ML/DL models are `black-boxes`
* Instead that RForest actual gives us useful `insights` regarding the data
* We will also consider a larger dataset this chapter, particulary with over `1million` rows
* This is Kaggle competition for `grocery forecasting`
* Look at a model called `collaborative filtering`

* Also learn a bit of tweaking today

---

* Q) Question's been asked, how to choose ML models
* A) For `unstructured` dataset, it is always good to use `deep learning` methods
* S

## 1. Import Modules and Dataset Info

### 1.1 Reading Third Party Modules

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
import os
currDir = os.getcwd()
os.chdir("../fastai/")
from structured import *       
from imports import *
os.chdir(currDir)
# ____________________________________________________________ #
from pandas_summary import DataFrameSummary
from IPython.display import display

from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
# ____________________________________________________________ #
import math
import numpy as np
import warnings
warnings.filterwarnings("ignore")

In [3]:
PATH = "../datasets/kaggle/corporcion_favorita_grocery_sales/"
# !dir "../datasets/kaggle/corporcion_favorita_grocery_sales/"

### 1.2 Information on The Data

* The `dependant` variable is the one you are trying to **PREDICT**
* In this dataset you are trying to predict ... "How many `UNITS` of each kind of product was sold in `EACH STORE` on each day during the `two-week period`
* The info that you want to predict is the "How many `UNITS` each project at each store, on each day were sold in the last few years and for each store, date, product there is a metadata
* `Metadata` based on the store includes information for example
    > where is the store located
    > what class of store is it 
* Meta data on the product type can include
    > what was the oil price on this date ?
    > what was the overall sales likes from the point of view of competitors ?
---

* The grocery store dataset is a type of `Relational Dataset`
* Meaning there are a number of different piece of information that we can `relate` together
* This type of relational dataset is a type of `Star Schema`
* A star schema is a kind of a `data warehousing` schema where we say there is some `central transaction`
* You can think of this as star schema becaouse we can have a central transaction (i.e. the `train.csv`) and this branches out with different metadata based on targets such as `unit_sales`, `date`, `item_number` etc.
* This is different to what is known as `Snowflake Schema`
* Where there might be extra information available that may join targets across the central transaction
---

## 2. Data Importing and Pre-Processing

**STEP 1:**

* Begin with some basicimporting of the data
* When using `pd.read_csv` if you say `limit_memory=False`, then we will set to use as much as memory as we like
* This helps with figuring out what kind of data it is with more introspection possible
* However, the system will run out of memory regardless of how big is your RAM
* To limit the amount of memory to be used, we make a seperate columns of `types` of data we would like to store, this is demonstrated below
* And as usual, you assign the column you would like to be parsed as dates in the `parse_dates` argument for which you pass the column name `[date]`

---


* The logic behding chosing types is that the author is looking for the `smallest possible bits` needed to store the data
* When working with large datasets, the `reading` and `writing` of the data is considerably slow
* As a rule of thumb, `smaller datatypes` will RUN faster
* In particularly if you use SIMD
* SIMD, stands for `Single Instructure Multiple Data` vectorized code --> SIMD can pack more numbers into a single vector to `run at once`

---

In [4]:
types = {'id': 'int64',
         'item_nbr': 'int32',
         'store_nbr': 'int8',
         'unit_sales': 'float32',
         'onpromotion': 'object'}

# below is only used when working on non_lenovo machine -- here, set chunksize=1000
%time df_chunk = pd.read_csv(PATH+'train.csv', parse_dates=['date'], dtype=types, infer_datetime_format=True, chunksize=10000)
df_all = df_chunk.get_chunk()


Wall time: 16.9 ms


---

**STEP 2**

* Also, we set `onpromotion` to `object`
* By default, the column onpromotion stores boolean variables
* But we instead set its type to be `object`
* Why ? because we need to pre-process it b4hand we name it as it has `missing values`
* The pre-processing is done so as to avoid any gaps unexplainable to data holders or analytics
* Keep in mind, setting to 'object' is not a good choice since it is a general purpose type which consumes `large amount of memory` and is `slow to use`
* But it is the best we have so far
* Now to fill all the missing values in the `onpromotion` columns with some binary values
* After removing all the missing values, use the `.map` function to set all the `string booleans` to actual booleans
* And then in the final line of code, convert it into a boolean
* After the save file you can see that the data drops in memory from `train.csv` going from 4.65GB to the `tmp` file taking only 878MB
* This saving memory technique allows us to inspect large scale datasets on less powerfull PCs

In [5]:
df_all.onpromotion.fillna(False, inplace=True)
df_all.onpromotion = df_all.onpromotion.map({'False': False, 'True': True})
df_all.onpromotion = df_all.onpromotion.astype(bool)

# save the temporary modified date
# %time df_all.to_feather(PATH+'/tmp_grocery_sales')

In [6]:
%time df_all.describe(include='all')

Wall time: 15.7 ms


Unnamed: 0,id,date,store_nbr,item_nbr,unit_sales,onpromotion
count,10000.0,10000,10000.0,10000.0,10000.0,10000
unique,,2,,,,1
top,,2013-01-02 00:00:00,,,,True
freq,,9422,,,,10000
first,,2013-01-01 00:00:00,,,,
last,,2013-01-02 00:00:00,,,,
mean,4999.5,,6.006,606755.1,11.498948,
std,2886.89568,,5.284047,302236.2,17.2225,
min,0.0,,1.0,103501.0,0.252,
25%,2499.75,,3.0,346065.0,3.0,


* As you can see first thing first, the dates look wrong in terms of format and has various NaNs there
* Also we havent fixed the NaN values yet as well
* So why formatting date is important ? --> because if you train your model at an earlier date and deploy it in a later date, your model should be `adaptable` enough to encorporate the changes
* So you always need to make sure that in your data, the dates dont `overlap`

---

**STEP 3:**

* Repeat the same steps above but for the test sets
* And always be on the look out for discrepancy between the training set and the test set

In [7]:
# df_test = pd.read_feather(PATH+'tmp_test_grocery')
df_test_chunk = pd.read_csv(
    PATH+'test.csv', parse_dates=['date'], dtype=types, infer_datetime_format=True, chunksize=1000)

df_test = df_test_chunk.get_chunk()

df_test.onpromotion.fillna(False, inplace=True)
df_test.onpromotion = df_test.onpromotion.map({'False': False, 'True': True})
df_test.onpromotion = df_test.onpromotion.astype(bool)
df_test.describe(include='all')

# write to a file
# df_test.to_feather(PATH+'tmp_test_grocery')

# display the table
df_test.describe(include='all')


Unnamed: 0,id,date,store_nbr,item_nbr,onpromotion
count,1000.0,1000,1000.0,1000.0,1000
unique,,1,,,2
top,,2017-08-16 00:00:00,,,False
freq,,1000,,,941
first,,2017-08-16 00:00:00,,,
last,,2017-08-16 00:00:00,,,
mean,125497500.0,,1.0,451994.296,
std,288.8194,,0.0,208186.720121,
min,125497000.0,,1.0,96995.0,
25%,125497300.0,,1.0,269286.75,


* Now you can clearly see that in the test set notice the dates begin one day later form the training set
* So then your model should be able to `forecast` based on the date you are given
* This is fundamental level of ML that all should know, the test set must **TEST** the ability of the model to forecast
* Instead of randomly sampling, why not look at the latest dates in the test set using `.tail()`, think about this, you need to be able to predict on the latest information and your model should be able to predict on it

In [8]:
df_all.tail()

Unnamed: 0,id,date,store_nbr,item_nbr,unit_sales,onpromotion
9995,9995,2013-01-02,9,698643,20.0,True
9996,9996,2013-01-02,9,716241,5.0,True
9997,9997,2013-01-02,9,716242,12.0,True
9998,9998,2013-01-02,9,716245,7.0,True
9999,9999,2013-01-02,9,716250,10.0,True


---

**STEP 4:**

* Next we demonstrate how to load from the saved temporary folder and set that to the `df_all` variable
* So that there is no overlap between the one loaded from the original .csv file
* And the `truncated` version we just made
* And now we take the `log` of the sales, just like in previous data
* This the dependant variable remember, we are trying to predict the sales, and we want it in logs so we can predict something that `varies according to ratios` and the loss function will again be `RMLSE`
* Also always be attentive as to what the project description is saying
* For example, in grocery sales it says that the `negative sales` should be counted as `zeros`
* So we `clip` the sales so they fall between `0` and `None`, where none means undefined maximum val
* The usage of $\ln(sales) + 1$ is also there as per suggestion of the project description $\rightarrow$ hence why we use `np.log1p`

In [9]:
df_all.unit_sales = np.log1p(np.clip(df_all.unit_sales, 0, None))

---

**STEP 5:**

* Now we resume with the pre-processing of dates as carried out in previous databases
* We again use the `add_date_part` function provided by the Fast.AI library
* Usually, you'd run with a smaller subsample to make sure your function runs correctly
* Also, we dont use `train_cats` here because all the columns are numeric
* We do, however, need to run `proc_df` on the target/dependant variable `Unit_Sales` for appending missing values to numeric ones

In [10]:
%time add_datepart(df_all, 'date')

Wall time: 1.35 s


**STEP 6:**

* Here comes the usual split again

In [11]:
def split_values(a, n):
    return a[:n].copy(), a[n:].copy()
    
n_valid = len(df_test)
n_trn = len(df_all) - n_valid
train, valid = split_values(df_all, n_trn)
print("Trainset shape: {}  Validset shape: {}".format(train.shape, valid.shape))


Trainset shape: (9000, 18)  Validset shape: (1000, 18)


**STEP 7:**

* Run the `proc_df` function for missing values replacement to numeric

In [12]:
%time trn, y, _ = proc_df(train, 'unit_sales')
val, y_val, _ = proc_df(valid, 'unit_sales')

Wall time: 7.67 ms


## 3. Model

### 3.1 Algorithm Procedure

**STEP 1:**

* We still care about using *rmse* as measure of performance

In [13]:
def rmse(x, y):
    return math.sqrt(((x - y)**2).mean())


def print_score(m, x, y, val=val, y_val=y_val):
    res = [rmse(m.predict(x), y), rmse(m.predict(val), y_val),
           m.score(x, y), m.score(val, y_val)]
    if hasattr(m, 'oob_score_'):
        res.append(m.oob_score_)
    print("RMSE_X_train  :  RMSE_X_valid  :  Score_X_train  :  Score_X_valid")
    print(res)

---

**STEP 2:**

* Now we look back at the function `set_rf_samples`
* 1st consider the background, we have about 1e6 samples, and we dont want to create a tree with 120million records
* So instead start with 10,000 or 100,000 according to author, using 1million (1,000,000) runs within under a minute
* Hence why we use set_rf_samples=1million, for Lenovo I used `1000`, maybe `100` for Asus or Dell
* Also it completely never came across my notice that you could use underscore to set large numbers in Python ... Wow !

---

* Also author converted datatypes to floats
* Why ? because save time, it is done anyways in the internal pandas libraries, however, doing it seperately and only once, saves you a couple of minutes
* You can also use the magic command `%prun` to be able to profile the code and see which line takes the most amount of time in running
* You can practice this on any line of code that is taking more than 20secons and inspect which line needs to be optimised

---

* Also you cannot use `oob_score` if you run `rf_sample` so instead I would use `chunksize` in order to section the data 
* Alternatively, you have the option of manually writing in the oob_score as well
* But author said it is NOT recommended for large datasets, for this calculating the oob_score will take too much time
* You can also inspect if oob_score is available to you using the command `rfalg.oob_score`

In [14]:
# not sure if you need this if you use chunksize
# set_rf_samples(1_000)  
%time x = np.array(trn, dtype=np.float32)

Wall time: 10.1 ms


---

**STEP 3:**

* Finally create the random forrest and fit it with at least `20 estimators/trees`
* The number of jobs is equal to number of cores, setting `n_jobs=-1` means use every single core

In [15]:
rfalg = RandomForestRegressor(n_estimators=20, min_samples_leaf=100, n_jobs=2)
%time rfalg.fit(x, y)
print_score(rfalg, x, y)

Wall time: 118 ms
RMSE_X_train  :  RMSE_X_valid  :  Score_X_train  :  Score_X_valid
[0.8434397863716047, 0.9453934295747514, 0.13949736216972342, 0.00719903865985505]


**STEP 4:**

* Improving further (reducing the error further) by reducing `min_samples_leaf`
* In this example we drop this from 100 to 10
* We clearly saw a jump in score on both x_train and x_valid from `0.139` to `0.312` when reducing min_samples_leaf

In [16]:
rfalg = RandomForestRegressor(n_estimators=20, min_samples_leaf=10, n_jobs=2)
%time rfalg.fit(x, y)
print_score(rfalg, x, y) 

Wall time: 218 ms
RMSE_X_train  :  RMSE_X_valid  :  Score_X_train  :  Score_X_valid
[0.7551882462056887, 0.9138796843603745, 0.3101502950698689, 0.07228393813131406]


* Now keep decreasing min_samples_leaf down to 3 and observe

In [17]:
rfalg = RandomForestRegressor(n_estimators=20, min_samples_leaf=3, n_jobs=2)
%time rfalg.fit(x, y)
print_score(rfalg, x, y) 

Wall time: 239 ms
RMSE_X_train  :  RMSE_X_valid  :  Score_X_train  :  Score_X_valid
[0.6084471930368329, 0.9121156036047839, 0.5521940881441207, 0.07586206013318464]


### 3.2 Discussion on Limitations of Random Forest Algorithms

* RForest only knows how to make splits at various columns
* It doesnt know where the location of store is
* It has a hunch on the correlation and then it makes the splits
* Coding for ML is very difficult, small changes in details you get bad performance
* And if you are NOT on kaggle you wont be able to tell where you are wrong
* Debugging is very intricate in ML programming
* Also, Might not be very good for RForest to work based on very old as in 4 years of data, so Kaggle offers kernel that follows that: takes the last two weeks and take the average sales by `date`, `store_number`, `on_promotion` and take mean across date just submit that -=> and you get 30th position


### 3.3 Ways to Improve Performance on this Dataset

Always remeber the STAR schema : the `supplementary` data will always help the data organisation

* `Student A` suggestion : model seasonality and trend effects in different column e.g. average sales per month
* `Jeremy` suggestion : finish testing a model and train a new and draw a scatter plot (x_axis predictions of old model Vs predictions of new model) they should form a line and if it doesnt then you have screwed sth up. ALso check upon `Rossman challenge` from Kaggle, it is very relatable to Ecuador's grocery challenge
* `Student B` suggestion : consider Ecuador's holidays 

### 3.4 Advice on Bad Validation Set Cases

* When you have bad valid_set, it can be difficult to make a good model out of it
* Validation set must be reliable : i.e. it must be able to tell if your model is going to do well in production case or not
* You can use the test to calibrate your validation set

<img src='../fastai/images/terrence_validation.png' height=300px width=600px>