# Performance Validation and Model Interpretation - Chapter 03

## 0. Learning Objectives

* To understand and `interpret` the predictive model
* To demystify the idea that ML/DL models are `black-boxes`
* Instead, a random forest actually gives us useful `insights` into the data
* We will also consider a larger dataset in this chapter, one with well over `1 million` rows
* The data comes from a Kaggle competition on `grocery forecasting`
* Look at a model called `collaborative filtering`
* Also learn a bit of model tweaking today

---

* Q) A common question: how do you choose an ML model?
* A) For `unstructured` datasets, `deep learning` methods are generally a good choice

## 1. Import Modules and Dataset Info

### 1.1 Reading Third Party Modules

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline


In [2]:
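import os
# temporarily switch into the local fastai folder so its modules can be imported
currDir = os.getcwd()
os.chdir("../fastai/")
from structured import *
from imports import *
os.chdir(currDir)  # switch back to the notebook's folder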
# ____________________________________________________________ #
from pandas_summary import DataFrameSummary
from IPython.display import display

from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
# ____________________________________________________________ #
import math
import numpy as np
import warnings
warnings.filterwarnings("ignore")

In [3]:
PATH = "../datasets/kaggle/corporcion_favorita_grocery_sales/"
# !dir "../datasets/kaggle/corporcion_favorita_grocery_sales/"

### 1.2 Information on The Data

* The `dependent` variable is the one you are trying to **PREDICT**
* In this dataset you are trying to predict how many `UNITS` of each kind of product were sold in `EACH STORE` on each day during a `two-week period`
* The information you have to predict with is how many `UNITS` of each product were sold at each store on each day over the last few years, plus metadata for each store, date, and product
* `Metadata` on the store includes, for example:
    * where the store is located
    * what class of store it is
* Metadata on the date or product can include, for example:
    * what the oil price was on that date
    * what overall sales looked like from the point of view of competitors
---

* The grocery store dataset is a `Relational Dataset`
* Meaning there are a number of different pieces of information that we can `relate` (join) together
* This particular kind of relational dataset is a `Star Schema`
* A star schema is a `data warehousing` schema built around one `central transaction` table
* You can think of it as a star because the central transaction table (i.e. `train.csv`, with columns such as `unit_sales`, `date`, `store_nbr`, and `item_nbr`) sits in the middle and each metadata table branches out from one of its key columns
* This is different from what is known as a `Snowflake Schema`
* There, the metadata tables may themselves link to further tables holding extra information
---
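
A minimal sketch of the star-schema idea using pandas and tiny made-up tables. The metadata columns `city`, `type`, and `family`, and all of the values, are assumptions for illustration; only `date`, `store_nbr`, `item_nbr`, and `unit_sales` appear in the data shown in this chapter.

```python
import pandas as pd

# Central transaction table (stand-in for train.csv); values are made up.
train = pd.DataFrame({
    "date": pd.to_datetime(["2017-08-14", "2017-08-14", "2017-08-15"]),
    "store_nbr": [1, 2, 1],
    "item_nbr": [100, 100, 200],
    "unit_sales": [3.0, 1.0, 7.0],
})

# Metadata tables hanging off the central table (hypothetical columns).
stores = pd.DataFrame({"store_nbr": [1, 2],
                       "city": ["Quito", "Guayaquil"],
                       "type": ["A", "B"]})
items = pd.DataFrame({"item_nbr": [100, 200],
                      "family": ["GROCERY", "DAIRY"]})

# Each metadata table joins directly to the central table on its own key.
joined = (train
          .merge(stores, on="store_nbr", how="left")
          .merge(items, on="item_nbr", how="left"))
print(joined)
```

In a snowflake schema, a table such as `items` would itself carry a key into yet another table, so reaching that extra information would take a further merge.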

## 2. Data Importing and Pre-Processing

**STEP 1:**

* Begin with some basic importing of the data
* When using `pd.read_csv`, passing `low_memory=False` tells pandas to use as much memory as it likes while working out the column types
* This allows more thorough introspection of what kind of data each column holds
* However, on a dataset this large the system will likely run out of memory regardless of how much RAM you have
* To limit the amount of memory used, we create a dictionary of the `types` we would like each column stored as; this is demonstrated below
* And as usual, you list the columns to be parsed as dates in the `parse_dates` argument, here `['date']`
---


* The logic behind choosing the types is to find the `smallest number of bits` needed to store each column
* When working with large datasets, `reading` and `writing` the data is considerably slow
* As a rule of thumb, `smaller datatypes` will RUN faster
* In particular if you use SIMD
* SIMD stands for `Single Instruction, Multiple Data` vectorized code --> with smaller types, SIMD can pack more numbers into a single vector to `run at once`

---
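
To make the effect of narrower types concrete, here is a small sketch that reads the same made-up CSV twice, once with pandas' default types and once with an explicit `dtype` dictionary, and compares the memory used. The CSV rows are invented for illustration; the column names follow the `types` dictionary used in this chapter.

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for train.csv (values are made up).
csv_text = """id,date,store_nbr,item_nbr,unit_sales
0,2013-01-01,25,103665,7.0
1,2013-01-01,25,105574,1.0
2,2013-01-02,54,105575,2.0
"""

# Let pandas pick its default types (typically int64 / float64).
df_default = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# Ask for the narrowest types that still hold the values.
types = {"id": "int64", "item_nbr": "int32",
         "store_nbr": "int8", "unit_sales": "float32"}
df_small = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"], dtype=types)

bytes_default = df_default.memory_usage(index=False).sum()
bytes_small = df_small.memory_usage(index=False).sum()
print(df_small.dtypes)
print(bytes_default, "bytes ->", bytes_small, "bytes")
```

The saving per row is small, but it is multiplied by every row in the file, which is what matters at this dataset's scale.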

In [4]:
types = {'id': 'int64',
         'item_nbr': 'int32',
         'store_nbr': 'int8',
         'unit_sales': 'float32',
         'onpromotion': 'object'}

# df_all = pd.read_csv(
#     PATH+'train.csv', parse_dates=['date'], dtype=types, infer_datetime_format=True)
%time df_all = pd.read_feather(PATH+'tmp_all_grocery')


---

**STEP 2**

* Also, we set `onpromotion` to `object`
* Conceptually, the `onpromotion` column holds boolean values
* But we instead read it in as type `object`
* Why? Because the column has `missing values`, so we need to pre-process it before it can be stored as a boolean
* The pre-processing is done so that no gaps remain that would be unexplainable to data holders or analysts
* Keep in mind that `object` is not a good choice in general, since it is a general-purpose type that consumes a `large amount of memory` and is `slow to use`
* But it is the best we can do at read time
* Next, fill all the missing values in the `onpromotion` column with a default value
* Then use the `.map` function to turn the `string booleans` into actual booleans
* And in the final line of code, convert the column to the `bool` type
* After saving the file, the data shrinks from 4.65GB for `train.csv` to only 878MB for the `tmp` file
* This memory-saving technique allows us to inspect large-scale datasets on less powerful PCs
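
The clean-up can be sketched on a tiny made-up column. One detail worth checking in any version of this step: a string-keyed `.map` returns NaN for entries it does not recognise, and `astype(bool)` treats NaN as `True`, so this sketch fills the gaps with the string `'False'` before mapping. The sample values are assumptions for illustration.

```python
import pandas as pd

# Made-up column read as `object`: string booleans plus a missing value.
onpromotion = pd.Series(["True", "False", None, "False"], dtype="object")

# Fill gaps with the *string* 'False' so the mapping below recognises them,
# then map strings to real booleans and switch to the compact bool dtype.
cleaned = (onpromotion
           .fillna("False")
           .map({"False": False, "True": True})
           .astype(bool))

print(cleaned.tolist())
print(onpromotion.memory_usage(index=False, deep=True), "bytes ->",
      cleaned.memory_usage(index=False, deep=True), "bytes")
```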

In [5]:
# fill gaps with the string 'False' so that the .map below recognises them
# df_all.onpromotion = df_all.onpromotion.fillna('False')
# df_all.onpromotion = df_all.onpromotion.map({'False': False, 'True': True})
# df_all.onpromotion = df_all.onpromotion.astype(bool)

# save the temporary modified data
# %time df_all.to_feather(PATH+'tmp_all_grocery')

In [7]:
%time df_all.describe(include='all')

Wall time: 14.7 s


| | id | date | store_nbr | item_nbr | unit_sales | onpromotion |
|---|---|---|---|---|---|---|
| count | 125497000.0 | 125497040 | 125497000.0 | 125497000.0 | 125497000.0 | 125497040 |
| unique | | 1684 | | | | 2 |
| top | | 2017-07-01 00:00:00 | | | | False |
| freq | | 118194 | | | | 96028767 |
| first | | 2013-01-01 00:00:00 | | | | |
| last | | 2017-08-15 00:00:00 | | | | |
| mean | 62748520.0 | | 27.46458 | 972769.2 | 8.554856 | |
| std | 36227880.0 | | 16.33051 | 520533.6 | 23.60515 | |
| min | 0.0 | | 1.0 | 96995.0 | -15372.0 | |
| 25% | 31374260.0 | | 12.0 | 522383.0 | 2.0 | |


* As you can see, the first thing to notice is that the `date` column is not summarised like the numeric columns, and the table has various NaNs where a statistic does not apply to a column
* Also, we haven't fixed the remaining NaN values yet
* So why are dates important? --> because if you train your model on an earlier period and deploy it on a later one, your model should be `adaptable` enough to incorporate the changes
* So you always need to make sure that the dates in your training data and your test data don't `overlap`

---

**STEP 3:**

* Repeat the same steps as above, but for the test set
* And always be on the lookout for discrepancies between the training set and the test set

In [10]:
df_test = pd.read_csv(
    PATH+'test.csv', parse_dates=['date'], dtype=types, infer_datetime_format=True)

# same clean-up as for the training set
df_test.onpromotion = df_test.onpromotion.fillna('False')
df_test.onpromotion = df_test.onpromotion.map({'False' : False, 'True' : True})
df_test.onpromotion = df_test.onpromotion.astype(bool)

df_test.to_feather(PATH+'tmp_test_grocery')
df_test.describe(include='all')

| | id | date | store_nbr | item_nbr | onpromotion |
|---|---|---|---|---|---|
| count | 3370464.0 | 3370464 | 3370464.0 | 3370464.0 | 3370464 |
| unique | | 16 | | | 2 |
| top | | 2017-08-16 00:00:00 | | | False |
| freq | | 210654 | | | 3171867 |
| first | | 2017-08-16 00:00:00 | | | |
| last | | 2017-08-31 00:00:00 | | | |
| mean | 127182300.0 | | 27.5 | 1244798.0 | |
| std | 972969.3 | | 15.58579 | 589836.2 | |
| min | 125497000.0 | | 1.0 | 96995.0 | |
| 25% | 126339700.0 | | 14.0 | 805321.0 | |


* Notice that the dates in the test set begin one day after the training set ends (`2017-08-16` versus `2017-08-15`)
* So your model should be able to `forecast` from the dates you are given
* This is a fundamental point in ML that everyone should know: the test set must **TEST** the ability of the model to forecast
* Instead of randomly sampling, look at the latest dates in the training data using `.tail()`; you need to be able to predict from the latest information, and your validation should reflect that

In [25]:
df_all.tail()

| | id | date | store_nbr | item_nbr | unit_sales | onpromotion |
|---|---|---|---|---|---|---|
| 125497035 | 125497035 | 2017-08-15 | 54 | 2089339 | 4.0 | False |
| 125497036 | 125497036 | 2017-08-15 | 54 | 2106464 | 1.0 | True |
| 125497037 | 125497037 | 2017-08-15 | 54 | 2110456 | 192.0 | False |
| 125497038 | 125497038 | 2017-08-15 | 54 | 2113914 | 198.0 | True |
| 125497039 | 125497039 | 2017-08-15 | 54 | 2116416 | 2.0 | False |


---
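
The point about forecasting can be sketched as a time-ordered holdout: keep the most recent rows for validation and confirm that the two date ranges do not overlap. The frame below is made up, and `n_valid = 3` is a hypothetical size chosen only to keep the example small.

```python
import pandas as pd

# Made-up daily data; in the chapter the frame would be df_all.
df = pd.DataFrame({
    "date": pd.date_range("2017-08-01", periods=10, freq="D"),
    "unit_sales": [float(i) for i in range(10)],
})

# Hold out the most recent rows instead of sampling at random.
n_valid = 3  # hypothetical size for illustration
df = df.sort_values("date").reset_index(drop=True)
train_part, valid_part = df.iloc[:-n_valid], df.iloc[-n_valid:]

# The validation period must start strictly after the training period ends.
assert train_part["date"].max() < valid_part["date"].min()
print(train_part["date"].max().date(), "<", valid_part["date"].min().date())
```

A random sample would mix later dates into the training rows and make the validation score look better than a real forecast would be.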

**STEP 4:**

* Next we demonstrate how to load the data from the saved temporary file and assign it to the `df_all` variable
* So that there is no mix-up between the version loaded from the original .csv file
* And the `truncated` (reduced-memory) version we just made
* And now we take the `log` of the sales, just like with the previous dataset
* This is the dependent variable, remember: we are trying to predict the sales, and we want it in logs because we care about errors that `vary according to ratios`, so the loss function will again be `RMSLE`
* Also, always pay attention to what the project description says
* For example, for the grocery sales it says that `negative sales` should be counted as `zeros`
* So we `clip` the sales so they fall between `0` and `None`, where `None` means there is no upper bound
* The use of $\ln(sales + 1)$ also follows the project description $\rightarrow$ hence why we use `np.log1p`
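
A short sketch of why the log transform and `RMSLE` go together: once both the targets and the predictions are passed through `np.log1p`, an ordinary RMSE on those values is the same number as RMSLE, $\sqrt{\frac{1}{n}\sum_i (\ln(\hat{y}_i + 1) - \ln(y_i + 1))^2}$, on the clipped raw values. The arrays below are made up for illustration.

```python
import numpy as np

sales = np.array([-2.0, 0.0, 3.0, 10.0])   # made-up actual sales, one negative
preds = np.array([0.0, 1.0, 2.0, 12.0])    # made-up predictions

# Negative sales count as zero, then take ln(1 + x).
log_sales = np.log1p(np.clip(sales, 0, None))
log_preds = np.log1p(np.clip(preds, 0, None))

# RMSE computed on the log1p values...
rmse_of_logs = np.sqrt(np.mean((log_preds - log_sales) ** 2))

# ...matches RMSLE written out on the clipped raw values.
clipped = np.clip(sales, 0, None)
rmsle = np.sqrt(np.mean((np.log(preds + 1) - np.log(clipped + 1)) ** 2))

print(rmse_of_logs, rmsle)
```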

In [26]:
df_all.unit_sales = np.log1p(np.clip(df_all.unit_sales, 0, None))

---

**STEP 5:**

* Now we resume with the pre-processing of dates, as carried out for previous datasets
* We again use the `add_datepart` function provided by the Fast.AI library
* Usually, you'd first run it on a smaller subsample to make sure your function runs correctly
* Also, we don't use `train_cats` here because all the columns are numeric
* We do, however, need to run `proc_df` to split off the dependent variable `unit_sales` and to replace missing values in numeric columns
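
To illustrate the idea of expanding a date into several numeric columns, here is a simplified sketch run on a two-row subsample. The helper `add_date_features` is hypothetical and is not the Fast.AI implementation of `add_datepart`; it only uses pandas' `.dt` accessor to show the kind of columns such a step produces.

```python
import pandas as pd

def add_date_features(df, col):
    """Hypothetical, simplified stand-in for the idea behind add_datepart."""
    dates = df[col].dt
    df["Year"] = dates.year
    df["Month"] = dates.month
    df["Day"] = dates.day
    df["Dayofweek"] = dates.dayofweek        # Monday = 0
    df["Is_month_end"] = dates.is_month_end
    # Seconds since the Unix epoch, so a model can still see a linear trend.
    df["Elapsed"] = (df[col] - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)
    return df.drop(columns=[col])

# Small made-up subsample, useful for checking the function before a full run.
sample = pd.DataFrame({"date": pd.to_datetime(["2017-08-15", "2017-08-31"]),
                       "unit_sales": [4.0, 1.0]})
sample = add_date_features(sample, "date")
print(sample)
```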

In [29]:
%time add_datepart(df_all, 'date')

Wall time: 0 ns


**STEP 6:**

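* Here comes the usual split again: the last `len(df_test)` rows become the validation set, so it is the same size as the test set and covers the most recent rows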

In [34]:
def split_values(a, n):
    return a[:n].copy(), a[n:].copy()
    
n_valid = len(df_test)
n_trn = len(df_all) - n_valid
train, valid = split_values(df_all, n_trn)
print("Trainset shape: {}  Validset shape: {}".format(train.shape, valid.shape))


Trainset shape: (122126576, 18)  Validset shape: (3370464, 18)


**STEP 7:**

* Run the `proc_df` function to separate the dependent variable and replace missing values with numeric ones

In [35]:
%%time
trn, y, _ = proc_df(train, 'unit_sales')
val, y_val, _ = proc_df(valid, 'unit_sales')
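
As a rough picture of what this step does, the sketch below separates the dependent variable and fills gaps in numeric columns with the median, keeping a flag for where values were missing. The helper `split_and_fill` and the `oil_price` column are hypothetical; this is an assumption-laden stand-in for the idea, not Fast.AI's `proc_df`.

```python
import numpy as np
import pandas as pd

def split_and_fill(df, y_col):
    """Hypothetical stand-in: NOT fastai's proc_df, just the core idea."""
    df = df.copy()
    y = df.pop(y_col).values
    na_fills = {}
    for name in list(df.columns):
        col = df[name]
        if pd.api.types.is_numeric_dtype(col) and col.isnull().any():
            # Keep a flag column so a model can see where values were missing.
            df[name + "_na"] = col.isnull()
            median = col.median()
            df[name] = col.fillna(median)
            na_fills[name] = median
    return df, y, na_fills

raw = pd.DataFrame({"store_nbr": [1, 2, 3],
                    "oil_price": [50.0, np.nan, 48.0],  # made-up feature with a gap
                    "unit_sales": [0.7, 1.1, 2.3]})
X, y_out, fills = split_and_fill(raw, "unit_sales")
print(X)
```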


## 3. Model

In [None]:
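def rmse(x, y):
    # root mean squared error between predictions x and targets y
    return math.sqrt(((x - y)**2).mean())


def print_score(m, x, y, val=val, y_val=y_val):
    # RMSE and R^2 on the training and validation sets
    res = [rmse(m.predict(x), y), rmse(m.predict(val), y_val),
           m.score(x, y), m.score(val, y_val)]
    header = "RMSE_X_train  :  RMSE_X_valid  :  Score_X_train  :  Score_X_valid"
    if hasattr(m, 'oob_score_'):
        # out-of-bag score, only present when the forest was fit with oob_score=True
        res.append(m.oob_score_)
        header += "  :  OOB_score"
    print(header)
    print(res)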
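
A sketch of how these scores might be produced, using synthetic data rather than the grocery dataset. The data, the split sizes, and the forest settings are assumptions chosen to keep the example fast; `rmse` is repeated so the block runs on its own.

```python
import math
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rmse(x, y):
    return math.sqrt(((x - y) ** 2).mean())

# Synthetic data (made up): the target depends mostly on the first feature.
rng = np.random.RandomState(0)
X = rng.rand(500, 3)
y = 2.0 * X[:, 0] + 0.1 * rng.randn(500)

# Keep the last rows as the validation set, mirroring the split above.
X_trn, X_val = X[:400], X[400:]
y_trn, y_val = y[:400], y[400:]

m = RandomForestRegressor(n_estimators=40, oob_score=True, random_state=0)
m.fit(X_trn, y_trn)

print("train RMSE:", rmse(m.predict(X_trn), y_trn))
print("valid RMSE:", rmse(m.predict(X_val), y_val))
print("valid R^2 :", m.score(X_val, y_val))
print("OOB R^2   :", m.oob_score_)
```

Since the target in this chapter is already `log1p` of the sales, the same `rmse` applied to the real data would be the RMSLE of the raw sales.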