# Performance Validation and Model Interpretation - Chapter 03

## 0. Learning Objectives

* Understand and `interpret` the predictive model
* Demystify the idea that ML/DL models are `black-boxes`
* See instead that a Random Forest actually gives us useful `insights` about the data
* Work with a larger dataset this chapter, one with over `1 million` rows
* The dataset comes from a Kaggle competition on `grocery forecasting`
* Look at a model called `collaborative filtering`
* Also learn a bit of tweaking

---

* Q) A question that has been asked: how do we choose an ML model?
* A) For `unstructured` datasets, `deep learning` methods are generally a good choice

## 1. Import Dataset and Modules

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
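import os
# Temporarily switch to the fastai directory so its local modules can be imported
currDir = os.getcwd()
os.chdir("../fastai/")
from structured import *
from imports import *
# Return to the original working directory
os.chdir(currDir)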
# ____________________________________________________________ #
from pandas_summary import DataFrameSummary
from IPython.display import display

from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
# ____________________________________________________________ #
import math
import numpy as np
import warnings
warnings.filterwarnings("ignore")

In [3]:
PATH = "../datasets/kaggle/corporcion_favorita_grocery_sales/"
# !dir "../datasets/kaggle/corporcion_favorita_grocery_sales/"

### 1.1 Information on The Data

* The `dependent` variable is the one you are trying to **PREDICT**
* In this dataset you are trying to predict "how many `UNITS` of each kind of product were sold in `EACH STORE` on each day during a `two-week period`"
* The information you have for making that prediction is how many `UNITS` of each product were sold at each store on each day over the last few years, plus metadata for each store, date and product
* `Metadata` on the store includes, for example:
    > where is the store located?
    > what class of store is it?
* Metadata on the date or product type can include:
    > what was the oil price on this date?
    > what did overall sales look like from the point of view of competitors?
---

* The grocery store dataset is a `Relational Dataset`
* Meaning there are a number of different pieces of information that we can `relate` (join) together
* This particular relational dataset follows a `Star Schema`
* A star schema is a kind of `data warehousing` schema built around a `central transaction` table
* You can think of it as a star because the central transaction table (i.e. `train.csv`, which holds `unit_sales`) sits in the middle, and metadata tables branch out from it through keys such as `date`, `store_nbr` and `item_nbr`
* This is different from what is known as a `Snowflake Schema`
* Where the metadata tables may themselves link out to further tables of extra information
---
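To make the star schema idea concrete, here is a minimal sketch that joins a toy central transaction table to three small metadata tables with `pd.merge`. All values are made up for illustration, and the table and column names `stores`, `items`, `oil`, `city`, `family` and `oil_price` are assumptions rather than the exact layout of the competition files; only `date`, `store_nbr`, `item_nbr` and `unit_sales` come from this notebook.

```python
import pandas as pd

# Toy central transaction table (stands in for train.csv); values are made up.
train = pd.DataFrame({
    "date": pd.to_datetime(["2017-08-01", "2017-08-01", "2017-08-02"]),
    "store_nbr": [1, 2, 1],
    "item_nbr": [101, 101, 102],
    "unit_sales": [3.0, 5.0, 2.0],
})

# Hypothetical metadata tables, one per key of the central table.
stores = pd.DataFrame({"store_nbr": [1, 2], "city": ["Quito", "Guayaquil"]})
items = pd.DataFrame({"item_nbr": [101, 102], "family": ["GROCERY I", "DAIRY"]})
oil = pd.DataFrame({
    "date": pd.to_datetime(["2017-08-01", "2017-08-02"]),
    "oil_price": [49.2, 49.6],
})

# Each metadata table joins directly onto the central table: the "points" of the star.
# Left joins keep every transaction row even if a metadata row were missing.
joined = (
    train
    .merge(stores, on="store_nbr", how="left")
    .merge(items, on="item_nbr", how="left")
    .merge(oil, on="date", how="left")
)
print(joined)
```

In a snowflake schema, a table such as the hypothetical `items` above would itself need a further join (for example to a category table) before all of its information was available.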

## 2. Read the Data

**STEP 1:**

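* Begin with some basic importing of the data
* When using `pd.read_csv`, passing `low_memory=False` lets pandas use as much memory as it needs while reading the file
* pandas can then inspect whole columns at once when working out what kind of data each one holds, instead of guessing chunk by chunk
* Here we instead pass an explicit `dtype` for each column, which keeps memory use down on a file of this size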
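The effect of explicit dtypes on memory can be checked on a small scale. The sketch below builds a synthetic CSV in memory (the rows are generated, not real competition data) with the same numeric column names as the `types` dictionary used in this notebook, reads it once with pandas' inferred dtypes and once with compact ones, and compares the column memory.

```python
import io
import pandas as pd

# Synthetic CSV with 1000 generated rows; not real competition data.
csv_text = "id,store_nbr,item_nbr,unit_sales\n" + "\n".join(
    f"{i},{i % 50 + 1},{100000 + i},{(i % 7) * 1.5}" for i in range(1000)
)

# Let pandas infer the dtypes (64-bit integers and floats by default).
default_df = pd.read_csv(io.StringIO(csv_text), low_memory=False)

# Same data, but with compact dtypes chosen up front.
types = {"id": "int64", "item_nbr": "int32", "store_nbr": "int8", "unit_sales": "float32"}
typed_df = pd.read_csv(io.StringIO(csv_text), dtype=types)

# Compare memory used by the columns (index excluded).
default_bytes = default_df.memory_usage(index=False).sum()
typed_bytes = typed_df.memory_usage(index=False).sum()
print(f"inferred dtypes: {default_bytes} bytes, explicit dtypes: {typed_bytes} bytes")
```

The compact dtypes are only safe when the values fit: `int8`, for instance, covers -128 to 127, which is enough for the store numbers generated here.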

In [6]:
# Explicit, compact dtypes for each column keep memory use down
types = {'id': 'int64',
         'item_nbr': 'int32',
         'store_nbr': 'int8',
         'unit_sales': 'float32',
         'onpromotion': 'object'}

In [10]:
%%time
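# Parse `date` as datetimes and apply the dtypes defined in `types`.
# Note: `infer_datetime_format` is deprecated in pandas 2.0+ and can be dropped there.
df_all = pd.read_csv(
    PATH+'train.csv', parse_dates=['date'], dtype=types, infer_datetime_format=True)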

Wall time: 59.4 s


In [4]:
PATH

'../datasets/kaggle/corporcion_favorita_grocery_sales/'