# Retail Demand Forecasting: The M5 Kaggle Competition
#### Kartikey Vyas

The M5 Kaggle competition was the fifth iteration of the Makridakis Competitions, organised by the University of Nicosia. The aim of the competition was to forecast the next 28 days of sales for Walmart in the US. We are provided with hierarchical sales data, broken down at the item, department, product category, store and state levels. Additionally, the data set has information on price, promotions and special events. This competition serves as a very close example of what we are trying to achieve with the RETAILER project at FORECASTING_STARTUP.

This notebook explores effective approaches to this competition and relevant literature. An emphasis is put on describing key concepts and identifying opportunities for FORECASTING_STARTUP to experiment with new methods.

## Contents
1. [Data](#dataset)
2. [Feature Engineering](#fe)
3. [Cross-validation](#cv)
4. [Models](#models)
    1. [Boosting Trees](#tree)
    2. [Neural Networks](#nn)
5. [Forecasting Strategies](#forecast)
    1. [Recursive](#recursive)
    2. [Direct](#direct)
    3. [Ensemble and Others](#dirrec)
6. [Forecast Reconciliation](#recon)
7. [Caveats](#caveats)
8. [References](#refs)

## Data <a name="dataset"></a>

We are given **42,840** hierarchical time-series. There are 3049 individual products from 3 categories and 7 departments, sold in 10 stores in 3 states. The sales information covers Jan 2011 to June 2016.
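
The figure of 42,840 can be checked by counting the series at each aggregation level. The sketch below assumes the 12-level breakdown described in the competition guidelines (the level names are informal labels, not taken from this notebook).

```python
# Number of series at each of the 12 aggregation levels of the M5 hierarchy.
# The level structure is an assumption based on the competition guidelines.
n_items, n_categories, n_departments, n_stores, n_states = 3049, 3, 7, 10, 3

levels = {
    "total": 1,
    "state": n_states,
    "store": n_stores,
    "category": n_categories,
    "department": n_departments,
    "state x category": n_states * n_categories,
    "state x department": n_states * n_departments,
    "store x category": n_stores * n_categories,
    "store x department": n_stores * n_departments,
    "item": n_items,
    "item x state": n_items * n_states,
    "item x store": n_items * n_stores,
}

n_series = sum(levels.values())
print(n_series)  # 42840
```

Only the bottom level (item x store, 30,490 series) appears as rows in the training file; the other levels are aggregates of it.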

In [150]:
import pandas as pd
import time
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

In [157]:
start = time.time()
## READ DATA #####################################
train = pd.read_csv('data/raw/sales_train_validation.csv')
prices = pd.read_csv('data/raw/sell_prices.csv')
calendar = pd.read_csv('data/raw/calendar.csv')
end = time.time()
print(end-start)

8.020843982696533


In [143]:
train.head()

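|  | id | item_id | dept_id | cat_id | store_id | state_id | d_1 | d_2 | d_3 | d_4 | ... | d_1904 | d_1905 | d_1906 | d_1907 | d_1908 | d_1909 | d_1910 | d_1911 | d_1912 | d_1913 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | HOBBIES_1_001_CA_1_validation | HOBBIES_1_001 | HOBBIES_1 | HOBBIES | CA_1 | CA | 0 | 0 | 0 | 0 | ... | 1 | 3 | 0 | 1 | 1 | 1 | 3 | 0 | 1 | 1 |
| 1 | HOBBIES_1_002_CA_1_validation | HOBBIES_1_002 | HOBBIES_1 | HOBBIES | CA_1 | CA | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | HOBBIES_1_003_CA_1_validation | HOBBIES_1_003 | HOBBIES_1 | HOBBIES | CA_1 | CA | 0 | 0 | 0 | 0 | ... | 2 | 1 | 2 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
| 3 | HOBBIES_1_004_CA_1_validation | HOBBIES_1_004 | HOBBIES_1 | HOBBIES | CA_1 | CA | 0 | 0 | 0 | 0 | ... | 1 | 0 | 5 | 4 | 1 | 0 | 1 | 3 | 7 | 2 |
| 4 | HOBBIES_1_005_CA_1_validation | HOBBIES_1_005 | HOBBIES_1 | HOBBIES | CA_1 | CA | 0 | 0 | 0 | 0 | ... | 2 | 1 | 1 | 0 | 1 | 1 | 2 | 2 | 2 | 4 |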


Here we see that this file contains historical sales for each product for 1,913 days (`d_1` to `d_1913`); the full competition data set covers 1,941 days. We also have information on which department, category, store and state the product was sold in. The objects `prices` and `calendar` contain weekly prices for each product and date features respectively. A full exploratory data analysis can be found on Kaggle (Interactive M5 EDA) [[1]](#eda).
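
Most modelling approaches need the wide sales table reshaped to one row per product and day, joined to the calendar and price tables. The following is a minimal sketch on toy data; the toy values are made up, and the `sell_prices.csv` column names (`store_id`, `item_id`, `wm_yr_wk`, `sell_price`) are an assumption, since that table is not displayed in this notebook.

```python
import pandas as pd

# Toy stand-ins for the three M5 tables (values are made up for illustration).
train_toy = pd.DataFrame({
    "id": ["HOBBIES_1_001_CA_1_validation"],
    "item_id": ["HOBBIES_1_001"],
    "store_id": ["CA_1"],
    "d_1": [0], "d_2": [2], "d_3": [1],
})
calendar_toy = pd.DataFrame({
    "d": ["d_1", "d_2", "d_3"],
    "date": pd.to_datetime(["2011-01-29", "2011-01-30", "2011-01-31"]),
    "wm_yr_wk": [11101, 11101, 11101],
})
prices_toy = pd.DataFrame({
    "store_id": ["CA_1"],
    "item_id": ["HOBBIES_1_001"],
    "wm_yr_wk": [11101],
    "sell_price": [9.58],
})

# Wide -> long: one row per (series, day).
id_cols = ["id", "item_id", "store_id"]
long = train_toy.melt(id_vars=id_cols, var_name="d", value_name="sales")

# Attach the calendar date and week key, then the weekly price for that week.
long = long.merge(calendar_toy, on="d", how="left")
long = long.merge(prices_toy, on=["store_id", "item_id", "wm_yr_wk"], how="left")
print(long)
```

On the full data set the long table has tens of millions of rows, so downcasting dtypes before the merge is usually worthwhile.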

In [4]:
calendar.head()

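|  | date | wm_yr_wk | weekday | wday | month | year | d | event_name_1 | event_type_1 | event_name_2 | event_type_2 | snap_CA | snap_TX | snap_WI |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2011-01-29 | 11101 | Saturday | 1 | 1 | 2011 | d_1 |  |  |  |  | 0 | 0 | 0 |
| 1 | 2011-01-30 | 11101 | Sunday | 2 | 1 | 2011 | d_2 |  |  |  |  | 0 | 0 | 0 |
| 2 | 2011-01-31 | 11101 | Monday | 3 | 1 | 2011 | d_3 |  |  |  |  | 0 | 0 | 0 |
| 3 | 2011-02-01 | 11101 | Tuesday | 4 | 2 | 2011 | d_4 |  |  |  |  | 1 | 1 | 0 |
| 4 | 2011-02-02 | 11101 | Wednesday | 5 | 2 | 2011 | d_5 |  |  |  |  | 1 | 0 | 1 |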


An interesting set of variables provided here is the SNAP indicators (`snap_CA`, `snap_TX`, `snap_WI`). They mark the days on which each state allowed purchases with SNAP food stamps.

We'll have a look at the visualisations produced in the most popular EDA notebook on Kaggle for this competition.

![overall](overall_sales.png)  
![monthly](monthly_sales.png)  

Next, we'll take a brief look at the sales data for three different products.  
![plot1](product_plot_1.png)  

We can see here the date the product was introduced and the general change in sales over the past few years.

![plot2](product_plot_2.png)  

This product has been in stock since the beginning of the data set, but there are significant periods of zero sales.

![plot3](product_plot_3.png)  

This product is from a different department (household) but shows a similar pattern of sales to the previous product.

## Feature Engineering <a name="fe"></a>
Most of the high-scoring submissions in the M5 competition used relatively simple features.
- Item, Department and Category (given)
    - often just using default LGBM categorical feature encoding
    - one solution used GLMM encoding for item_id to some success
- Price Features
    - Current price
    - Price momentum
    - max, min, std, mean...
- Day Lags
    - Sales 7 days ago, 14 days ago, 28 days ago
    - 1-day lags were not very useful
- Rolling Mean of Lags
    - Mean of sales 7 days ago over previous 7 days
- Mean encoding at item, dept, category, store and state levels
- Date Features
    - day, week, month, year
    - week of the month, day of the week
    - weekend
- SNAP days
- Special events
    - event type
    - event name
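
The lag and rolling-mean features listed above can be sketched with pandas as follows. This is a minimal illustration on synthetic data, assuming a long table with `id`, `date` and `sales` columns; the function name and the output column names are hypothetical.

```python
import numpy as np
import pandas as pd

def add_lag_features(df, lags=(7, 14, 28), window=7):
    """Add per-series sales lags and a rolling mean of the shortest lag."""
    df = df.sort_values(["id", "date"]).reset_index(drop=True)
    for lag in lags:
        # Shift within each series so values never leak across products.
        df[f"lag_{lag}"] = df.groupby("id")["sales"].shift(lag)
    base = f"lag_{min(lags)}"
    # Mean of the lagged sales over the previous `window` days.
    df[f"{base}_roll_mean_{window}"] = df.groupby("id")[base].transform(
        lambda s: s.rolling(window).mean()
    )
    return df

# Two synthetic series, used only to demonstrate the function.
dates = pd.date_range("2011-01-29", periods=40)
demo = pd.concat([
    pd.DataFrame({"id": "A", "date": dates, "sales": np.arange(40)}),
    pd.DataFrame({"id": "B", "date": dates, "sales": np.arange(100, 140)}),
])
features = add_lag_features(demo)
print(features.tail())
```

Because every feature is built from sales at least 7 days old, the first rows of each series contain `NaN` and are typically dropped or left for the model to handle.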

Most submissions used all or a subset of the listed features. Interestingly, the 2nd place submission did not make use of any lag features, instead having the hypothesis that such features were not drivers for sales [[2]](#2nd). There was no discussion surrounding the use of `tsfresh` for extracting time series features in any of the top solutions and it was only mentioned twice in all of the discussion posts on Kaggle. Furthermore, there were no notebooks that explored the application of `tsfresh` on this competition.

Due to the size of the data set and memory limitations, most submissions opted to use a limited number of features. As such, feature selection and validation need to be taken into consideration. This will be covered in more detail in the `tsfresh` section.

Some more creative features are as follows:
- Moon phase (used in the 4th place solution)
- Number of consecutive days of 0 sales until today
- Number of days until next event (used in 99th place)
- Nearest upcoming/past event name
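
Two of these features can be sketched in a few lines of pandas. The helper names and toy values below are hypothetical, and the code is one plausible reading of the feature descriptions rather than the implementation used in any particular submission.

```python
import pandas as pd

def zero_run_length(sales):
    """Number of consecutive zero-sales days up to and including each day."""
    is_zero = sales.eq(0).astype(int)
    # A new run starts after every day with non-zero sales.
    run_id = sales.ne(0).cumsum()
    return is_zero.groupby(run_id).cumsum()

def days_until_next_event(calendar):
    """Days from each date to the next date that has an event (0 on event days)."""
    dates = calendar["date"]
    event_dates = dates.where(calendar["event_name_1"].notna())
    next_event = event_dates.bfill()  # carry the next event date backwards
    return (next_event - dates).dt.days

sales = pd.Series([0, 0, 3, 0, 0, 0, 1])
zero_runs = zero_run_length(sales)

calendar_toy = pd.DataFrame({
    "date": pd.date_range("2011-01-29", periods=5),
    "event_name_1": [None, None, None, "EventA", None],  # hypothetical event
})
until_event = days_until_next_event(calendar_toy)
print(zero_runs.tolist(), until_event.tolist())
```

When used as a model input, the zero-run feature should be shifted by at least the forecast lag so that it only uses sales known at prediction time.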

An interesting problem was the frequent occurrence of days with zero sales for many products, which some solutions addressed through feature engineering. Other approaches built a classifier to identify zero-sales days, while many simply relied on a Tweedie loss function (more on this later).
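
To illustrate why a Tweedie loss copes with zero-heavy data, here is a small NumPy sketch of the Tweedie negative log-likelihood for a variance power between 1 and 2 (terms that do not depend on the prediction are dropped). The sales values are made up. LightGBM exposes a loss of this family through `objective="tweedie"` and `tweedie_variance_power`; the code below is a standalone illustration, not LightGBM's internal implementation.

```python
import numpy as np

def tweedie_nll(y, mu, power=1.5):
    """Tweedie negative log-likelihood up to a constant, for 1 < power < 2.

    `mu` must be strictly positive; `y` may be zero.
    """
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    return -y * mu ** (1 - power) / (1 - power) + mu ** (2 - power) / (2 - power)

# Intermittent demand: mostly zeros with occasional sales (made-up values).
sales = np.array([0, 0, 0, 1, 0, 4, 0, 0, 2, 0], dtype=float)

# Evaluate the mean loss for a grid of constant predictions (step 0.01).
candidates = np.linspace(0.05, 3.0, 296)
mean_loss = np.array([tweedie_nll(sales, m).mean() for m in candidates])
best = candidates[mean_loss.argmin()]
print(best, sales.mean())
```

The loss stays finite when the observed value is zero, and the best constant prediction is the mean of the series, so the model targets expected demand without requiring a log transform of the sales.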

## Cross-validation <a name="cv"></a>
Cross validation for time series forecasting requires that models are trained on only past data. A few approaches were described in this competition.

The forecast horizon for this competition was 28 days, so the most common approach was to split the data into 28 day blocks and use the chronologically last 3 blocks as validation folds. However, it was noted by many Kaggle contributors that this strategy is very computationally expensive, given the size of the data set.
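
The block-based split can be written down explicitly. The sketch below uses 1-based day numbers matching the `d_1` ... `d_1913` columns; the function name and return format are hypothetical.

```python
def last_block_folds(n_days, horizon=28, n_folds=3):
    """Return (train_last_day, val_first_day, val_last_day) per fold, oldest first.

    Each fold trains on all days before the validation block, so the model
    never sees data from the period it is evaluated on.
    """
    folds = []
    for k in range(n_folds, 0, -1):
        val_last = n_days - (k - 1) * horizon
        val_first = val_last - horizon + 1
        folds.append((val_first - 1, val_first, val_last))
    return folds

folds = last_block_folds(1913)
for train_last, val_first, val_last in folds:
    print(f"train d_1..d_{train_last}, validate d_{val_first}..d_{val_last}")
```

Each fold requires a full retrain on nearly all of the history, which is where the computational cost comes from.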

Some alternative, less computationally expensive strategies are:
- Group K Fold
    - This involves grouping data by certain variables (e.g. store_id, dept_id)
    - It ensures that data from the same group do not appear in different folds
    - The result is that we never test on data from a group that was trained on. This tackles the problem of data leakage and reduces overfitting.
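
A minimal NumPy sketch of the grouping idea is shown below, using `store_id` as the grouping variable. The helper is hypothetical and assigns groups to folds round-robin; scikit-learn's `GroupKFold` offers the same guarantee with balanced fold sizes.

```python
import numpy as np

def group_k_fold(groups, n_splits=3):
    """Yield (train_idx, test_idx) such that no group appears on both sides."""
    groups = np.asarray(groups)
    unique_groups = np.unique(groups)
    if n_splits > len(unique_groups):
        raise ValueError("n_splits cannot exceed the number of distinct groups")
    # Assign each group to exactly one fold (round-robin over sorted groups).
    fold_of_group = {g: i % n_splits for i, g in enumerate(unique_groups)}
    fold_of_row = np.array([fold_of_group[g] for g in groups])
    for fold in range(n_splits):
        yield np.flatnonzero(fold_of_row != fold), np.flatnonzero(fold_of_row == fold)

store_ids = np.array(["CA_1", "CA_1", "CA_2", "TX_1", "TX_1", "WI_1"])
splits = list(group_k_fold(store_ids, n_splits=3))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

Note that grouping alone does not enforce chronological ordering, so it is usually combined with a time-based cut-off.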

## Models <a name="models"></a>
There were two broad categories of models that were effective in this competition: neural networks and boosting trees.

List of some of the top solutions and their model types:  

| Rank   | Model                                         |
| ---    | ---                                           |
| 1st.   | lgbm (single)                                 |
| 2nd.   | lgbm + NN (N-BEATS) (forecast reconciliation) |
| 3rd.   | NN (DeepAR)                                   |
| 4th.   | lgbm (single)                                 |
| 5th.   | lgbm (per department)                         |
| 7th.   | lgbm (per store)                              |
| 14th.  | lgbm (per department in each store)           |
| 68th.  | lgbm (single)                                 |
| 178th. | lgbm (regression + classification)            |
| 219th. | lgbm (stacked)                                |


Clearly, boosted trees were the way to go for this task. A couple of solutions used neural networks, but the vast majority of high-ranking submissions used LightGBM. To read more about each of the best submissions, please see this Confluence page [[3]](#confluence).

### Boosting Trees <a name="tree"></a>
The most popular and successful algorithm in this competition was LightGBM. This algorithm 

### Neural Networks <a name="nn"></a>
These were often chosen for the flexibility of their loss functions, which allowed models to be optimised more directly for the competition's accuracy metric.

## Forecasting Strategies <a name="forecast"></a>

### Direct <a name="direct"></a>

### Recursive <a name="recursive"></a>

### Ensemble <a name="dirrec"></a>

## Forecast Reconciliation <a name="recon"></a>

## Caveats <a name="caveats"></a>
- The competition accuracy metric raised concerns in the community, owing to its complexity and to questions about its stability as more features were added. Perhaps approaches would have been different under a different metric?
- We need to see whether re-contextualising this competition using WAPE and BIAS as metrics would change how people approach it.
- The 'ranking' of submissions is based completely on their performance on one specific unseen test set. With questions around the way that data was collected and how this test set was assembled, high scores do not necessarily mean a better, more generalisable model.
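
For reference, WAPE and BIAS can be computed as in the sketch below. Definitions of both metrics vary between teams; the formulas here (total absolute error over total actuals, and total signed error over total actuals) are an assumption and should be checked against the definitions used on the RETAILER project.

```python
import numpy as np

def wape(actual, forecast):
    """Weighted absolute percentage error: sum|actual - forecast| / sum|actual|."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return np.abs(actual - forecast).sum() / np.abs(actual).sum()

def bias(actual, forecast):
    """Relative bias: positive values mean over-forecasting on aggregate."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return (forecast - actual).sum() / actual.sum()

# Made-up values for illustration.
actual = [0, 2, 4, 4]
forecast = [1, 2, 3, 6]
print(wape(actual, forecast), bias(actual, forecast))  # 0.4 0.2
```

Unlike a per-row percentage error, both metrics remain defined when individual days have zero sales, as long as total sales over the evaluation window are non-zero.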

## References <a name="refs"></a>

[1] <a name="eda"></a> Back to (predict) the future - Interactive M5 EDA. https://www.kaggle.com/headsortails/back-to-predict-the-future-interactive-m5-eda/report

[2] <a name="2nd"></a> 2nd place solution. https://www.kaggle.com/c/m5-forecasting-accuracy/discussion/164599

[3] <a name="confluence"></a> M5 Research Confluence Page (Kartikey). https://FORECASTING_STARTUP-confluence.atlassian.net/wiki/spaces/~171079063/pages/1072005169/RETAILER+Project+Research