# Exploratory data analysis for Favorita Grocery Sales Forecasting

(If you found this EDA useful don't forget to upvote it, thanks a lot!)

Note: The Kaggle Docker image for kernels does not ship an up-to-date version of bokeh, so in the interactive time-series plots the hover tooltip may show a raw number instead of a date. That shouldn't be a problem, but I didn't find a way to display the date properly with an older version of bokeh.


## 1. Intro & global analysis

The data comes as multiple csv files which, given their structure, seem to have been extracted from a relational database. The 2 main files, `train.csv` and `test.csv`, contain the sales by date, store, and item. The test data contains the same features without the sales information, which we are tasked to predict. The train vs test split is based on the date. In addition, **some test items are not included in the train data**.

Furthermore, there are 5 additional data files that provide the following information:

 - `stores.csv`: Details about the stores, such as location and type.

 - `items.csv`: Item metadata, such as class and whether they are perishable.

 - `transactions.csv`: Count of sales transactions for the training data.

 - `oil.csv`: Daily oil price. This is relevant, because “Ecuador is an oil-dependent country and its economical health is highly vulnerable to shocks in oil prices.”

 - `holidays_events.csv`: Holidays in Ecuador. Some holidays can be transferred to another day (possibly from weekend to weekday).

In addition, the [data description](https://www.kaggle.com/c/favorita-grocery-sales-forecasting/data) notes the importance of public sector pay days (on the 15th and the end of the month) as well as the impact of **a major earthquake on April 16 2016 which greatly affected supermarket sales for several weeks after the earthquake**.
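The pay-day rule quoted above (the 15th and the last day of the month) is easy to turn into a feature. The snippet below is a minimal sketch on made-up dates; the helper name `add_payday_flag` and the column name `is_payday` are assumptions introduced for illustration, not part of the competition data.

```python
import pandas as pd

def add_payday_flag(df, date_col="date"):
    """Return a copy of df with a boolean `is_payday` column (hypothetical name)."""
    out = df.copy()
    dates = pd.to_datetime(out[date_col])
    # Public sector pay days: the 15th and the last day of each month
    out["is_payday"] = (dates.dt.day == 15) | dates.dt.is_month_end
    return out

# Toy dates only, not taken from the dataset
toy = pd.DataFrame({"date": ["2016-02-14", "2016-02-15", "2016-02-29", "2016-03-01"]})
flagged = add_payday_flag(toy)
print(flagged)
```

Using `is_month_end` avoids hard-coding month lengths, so leap years are handled automatically.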

Import the necessary libraries:

In [None]:
import os
import numpy as np
import pandas as pd
import seaborn as sns
from bokeh.io import show, output_notebook
from bokeh.models import ColumnDataSource, FactorRange, HoverTool
from bokeh.plotting import figure
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from tqdm import tqdm
from IPython.display import display
from random import randint

output_notebook()

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [None]:
input_files_path = '../input/'

files = ["train", "test", "transactions", "stores", "oil", "items", "holidays_events"]
datasets_path = [input_files_path + f + ".csv" for f in files]

Now let's take a look at the data files:

In [None]:
print('# File sizes')
for f in datasets_path:
    # `f` already includes the input directory, so it must not be prefixed again
    print(f.ljust(30) + str(round(os.path.getsize(f) / 1000000, 2)) + 'MB')

The `train.csv` file is about 5 GB, so we won't load all of it into a pandas DataFrame. Instead we will conduct this EDA on ~20% of the data by reading only the first chunk of each file (set `chunksize` to `None` if you want to read the whole data). Keep in mind that the first chunk holds the first rows of the file, not a random sample. Be careful: the whole dataset takes ~10 GB of RAM. **The training set has 125 497 040 entries and the testing set has 3 370 464 entries.**

In [None]:
chunksize = 25_000_000 # ~20% of the training set
data_df = {}
for filename, filepath in tqdm(zip(files, datasets_path), total=len(files)):
    if chunksize:
        data_df[filename] = pd.read_csv(filepath, chunksize=chunksize, low_memory=False).get_chunk()
    else:
        data_df[filename] = pd.read_csv(filepath, low_memory=False)

Get a sense of the [dataset](https://www.kaggle.com/c/favorita-grocery-sales-forecasting/data):

In [None]:
for chunk, df in data_df.items():
    print("Dataset {} of size {} with fields:\n {}".format(chunk, len(df), df.dtypes))
    display(df.tail())
    print(df.describe(), end='\n\n\n')


Let's now detail what each field means and how the fields relate to each other across the different csv files:
 - `train` & `test`:
     - `id`: The entry id (_non null_)
     - `date`: The date relative to each `unit_sales` (/!\ The training data does not include rows for items that had zero `unit_sales` for a store/date combination) (_non null_)
     - `store_nbr`: The id of the store selling that item on the specified date (to find it on `stores.csv`)
     - `item_nbr`: The id of the item (to find it on `items.csv`)
     - `unit_sales`: The **target** variable expressed as integer (e.g. a bag of chips) or float (e.g. 1.5 kg of cheese). Negative values of unit_sales represent returns of that particular item.
     - `onpromotion`: Whether the item was on promotion on that day (/!\ This column contains `NaN` values. In our case we won't consider it a problem and will treat all `NaN` values as `False`.)
     
 - stores:
     - `store_nbr`: Unique id for the store
     - `city`: City in which the store is located
     - `state`: State in which the store is located
     - `type`: (guess) Type of stores?
     - `cluster`: A ["grouping of similar stores"](https://www.kaggle.com/c/favorita-grocery-sales-forecasting/data)
 - items: 
     - `item_nbr`: Unique id for the item
     - `family`: Kind of item regrouped by a broad category such as `BEVERAGES` or `GROCERY`
     - `class`: ?
     - `perishable`: `0` or `1`, whether the item is perishable or not
 - transactions:
     - `date`: The date on which the transactions were made (unique together with `store_nbr`)
     - `store_nbr`: The store id on which the transaction was made
     - `transactions`: The transaction count
 - oil: 
     - `date`: The date, which acts as a unique identifier
     - `dcoilwtico`: The oil price index
 - holidays_events:
     - `date`: The date of the event (serves as the unique identifier)
     - `type`: Type of event 
     - `locale`: Whether the event was regional, national or local
     - `locale_name`: The region in which the event applies
     - `description`: The name of the holiday
     - `transferred`: Boolean, indicates whether the holiday was transferred to another day. If this field is `True`, the entry can be considered a normal day, and we need to find the corresponding row whose `type` is `Transfer`, which gives the day when the event was actually celebrated.
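
The handling of transferred holidays described above can be sketched as a simple filter: rows flagged `transferred` are dropped because they behave like normal days, while the rows carrying the moved celebration stay in. The helper name `effective_holidays` and the toy rows are assumptions made for illustration.

```python
import pandas as pd

def effective_holidays(holidays):
    """Keep only the rows for days on which an event was actually celebrated."""
    # Rows with transferred == True are the original dates, now ordinary days
    celebrated = holidays[~holidays["transferred"].astype(bool)]
    return celebrated.reset_index(drop=True)

# Made-up rows mimicking the structure of holidays_events.csv
toy_holidays = pd.DataFrame({
    "date": pd.to_datetime(["2012-10-09", "2012-10-12", "2012-12-25"]),
    "type": ["Holiday", "Transfer", "Holiday"],
    "description": ["Holiday A", "Holiday A (moved)", "Holiday B"],
    "transferred": [True, False, False],
})
kept = effective_holidays(toy_holidays)
print(kept)
```

The row typed `Transfer` survives the filter, so the moved date is the one that ends up flagged as a holiday.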

Let's retrieve and summarize some useful properties of a few specific fields:

In [None]:
train_df = data_df["train"]
train_df["date"] = pd.to_datetime(train_df["date"])
test_df = data_df["test"]
test_df["date"] = pd.to_datetime(test_df["date"])
oil_df = data_df["oil"]
oil_df["date"] = pd.to_datetime(oil_df["date"])
items_df = data_df["items"]
stores_df = data_df["stores"]
transactions_df = data_df["transactions"]
transactions_df["date"] = pd.to_datetime(transactions_df["date"])
holidays_events_df = data_df["holidays_events"]
holidays_events_df["date"] = pd.to_datetime(holidays_events_df["date"])

print("Train set date range: {} to {}".format(train_df["date"].min(), train_df["date"].max()))
print("Test set date range: {} to {}".format(test_df["date"].min(), test_df["date"].max()))
print("Transactions date range: {} to {}".format(transactions_df["date"].min(), transactions_df["date"].max()))
# Count the rows where the item is actually on promotion (NaN is treated as not promoted)
print("Promotions count on train set: {}".format((train_df["onpromotion"] == True).sum()))
print("Promotions count on test set: {}".format((test_df["onpromotion"] == True).sum()))
print("Number of different stores: {}".format(len(stores_df)))
# Count distinct cluster ids instead of assuming they are numbered 1..max
print("Number of different stores clusters: {}".format(stores_df["cluster"].nunique()))
print("Oil price index range: {} - {}".format(oil_df["dcoilwtico"].min(), oil_df["dcoilwtico"].max()))
print("Oil unknown price count: {} ".format(oil_df["dcoilwtico"].isnull().sum()))
print("Items {} uniques families: {}\n".format(len(items_df["family"].unique()), items_df["family"].unique()))
print("Different types of holiday events: {}\n".format(holidays_events_df["type"].unique()))
print("Different holiday locales: {}\n".format(holidays_events_df["locale"].unique()))
print("Region list where the holidays applies: {}\n".format(holidays_events_df["locale_name"].unique()))
print("All different kind of holidays: {}".format(holidays_events_df["description"].unique()))
print("Example of transferred holidays: ")
display(holidays_events_df[holidays_events_df["transferred"] == True].head())

Now we can draw a few observations:
 - `train` & `test`: We will plot the sales from April 16 2016 (the earthquake) onwards to see how the data changes over that period. We may want to exclude that period from our dataset for some ML models.
 - `stores`: `store_nbr` is an arbitrary id, so it is not a useful feature as-is for machine learning models. We will need to find a way to encode each store so that our models can differentiate them even if they belong to the same city/state pair.
 - `items`: There are 4100 different items and `item_nbr` is not a useful feature for machine learning models.
 - `transactions`: The transactions are only available for the training set.
 - `oil`: A few dates are missing. We may want to fill the gaps by averaging the price index of the dates before and after the missing one.
 - `holidays_events`: The day when the holiday was originally scheduled is irrelevant; we need to reshape this dataset so that we keep only the effective date when the event happened, as holidays may be correlated with higher sales. We also need to be careful about which region the holiday applies to.
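
For the `oil` observation, one possible way to fill the gaps is linear interpolation over a full daily calendar, which amounts to averaging between the known prices before and after each hole. This is only a sketch on made-up prices; the helper name `fill_oil_gaps` is an assumption.

```python
import pandas as pd

def fill_oil_gaps(oil, price_col="dcoilwtico"):
    """Return a daily oil price table with missing values interpolated."""
    prices = oil.set_index("date")[price_col]
    # Reindex on every calendar day so absent dates become explicit NaN
    full_range = pd.date_range(prices.index.min(), prices.index.max(), freq="D")
    prices = prices.reindex(full_range)
    # Interpolate linearly in time, then fill any gap left at the edges
    prices = prices.interpolate(method="time").bfill().ffill()
    return prices.rename_axis("date").reset_index()

# Made-up prices: one NaN and one missing weekend
toy_oil = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-02", "2013-01-03", "2013-01-04", "2013-01-07"]),
    "dcoilwtico": [93.0, None, 95.0, 98.0],
})
filled_oil = fill_oil_gaps(toy_oil)
print(filled_oil)
```

Reindexing first matters because the price file may skip some dates entirely, so interpolating only the existing rows would leave those days without a price.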

As we said before, the shape of the data leads us to believe that these datasets come from a relational database (linked together via their `*_nbr` fields). It may be a good idea to load them back into one (SQLite), so we can query and combine the data easily with SQL without loading everything in memory at once.
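
As a rough sketch of that idea, the snippet below loads two tiny made-up tables into an in-memory SQLite database with `DataFrame.to_sql` and joins them on `store_nbr`. The table names and toy values are assumptions; for the real files one would likely use a database file on disk and append `train.csv` chunk by chunk.

```python
import sqlite3
import pandas as pd

# In-memory database for the demo; a file path would be used for the real data
conn = sqlite3.connect(":memory:")

stores_toy = pd.DataFrame({"store_nbr": [1, 2], "city": ["Quito", "Guayaquil"]})
sales_toy = pd.DataFrame({
    "store_nbr": [1, 1, 2],
    "item_nbr": [10, 11, 10],
    "unit_sales": [3.0, 1.5, 4.0],
})
stores_toy.to_sql("stores", conn, index=False)
sales_toy.to_sql("train", conn, index=False)

# Combine the tables with SQL instead of an in-memory pandas merge
query = """
SELECT s.city, SUM(t.unit_sales) AS total_sales
FROM train AS t
JOIN stores AS s ON s.store_nbr = t.store_nbr
GROUP BY s.city
ORDER BY s.city
"""
sales_by_city = pd.read_sql_query(query, conn)
conn.close()
print(sales_by_city)
```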

Before going further, let's merge the train and test sets so that all operations can be applied to a single DataFrame. We can split them back later, as we know the train set ranges from **2013-01-01 to 2017-08-15** (on the whole train set) and the test set from **2017-08-16 to 2017-08-31**. We will also replace the missing values of `onpromotion` with `False`.

In [None]:
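train_df['onpromotion'] = train_df['onpromotion'].fillna(False)
test_df['onpromotion'] = test_df['onpromotion'].fillna(False)

# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
final_df = pd.concat([train_df, test_df])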
assert len(final_df) == len(train_df) + len(test_df)
final_df['onpromotion'] = final_df['onpromotion'].astype(bool)
print("Final df size: {}".format(len(final_df)))
#assert len(final_df) == len(final_df["item_nbr"].unique())
final_df.head()
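
Splitting the merged frame back later only needs the date boundary mentioned above. A minimal sketch, assuming a datetime `date` column; the helper name `split_train_test` and the toy rows are introduced for illustration.

```python
import pandas as pd

TEST_START = pd.Timestamp("2017-08-16")  # first day of the test period

def split_train_test(df, test_start=TEST_START):
    """Split a merged frame back into its train and test parts by date."""
    is_test = df["date"] >= test_start
    return df[~is_test].copy(), df[is_test].copy()

# Toy rows: test rows have no unit_sales, as in the merged frame
toy_merged = pd.DataFrame({
    "date": pd.to_datetime(["2017-08-14", "2017-08-15", "2017-08-16", "2017-08-31"]),
    "unit_sales": [2.0, 1.0, None, None],
})
train_part, test_part = split_train_test(toy_merged)
print(len(train_part), len(test_part))
```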

## 2. Features visualization

Now we will plot a few variables to get a sense of their individual distributions.

### 2.1 Train & Test

The distribution of the `onpromotion` field:

In [None]:
# id	date	store_nbr	item_nbr	unit_sales	onpromotion
sns.countplot(x="onpromotion", data=final_df);

In [None]:
unit_df = train_df.groupby("date").sum().reset_index()

source = ColumnDataSource(unit_df)
hover = HoverTool(
    tooltips=[
        ("date", "@date{%F}"),
        ("unit_sales", "@unit_sales{0.00 a}"),
    ], 
# Kaggle docker image is not up to date for bokeh
#     formatters={
#         'date': 'datetime'
#     },
)


p = figure(x_axis_type="datetime", tools=[hover, 'pan', 'box_zoom', 'wheel_zoom', 'reset'], 
           title="Unit sales by date", plot_width=900, plot_height=400)
p.xgrid.grid_line_color=None
p.ygrid.grid_line_alpha=0.5
p.xaxis.axis_label = 'Time'
p.yaxis.axis_label = 'Value'

p.line(x="date", y="unit_sales", line_color="gray", source=source)

show(p)

Unit sales on promotion vs not on promotion

In [None]:
on_prom_df = train_df[train_df["onpromotion"] == True].groupby("date").sum()
on_prom_df.sort_values("onpromotion", inplace=True)
ax = sns.lmplot(x="onpromotion", y="unit_sales", data=on_prom_df)
ax.set(xlabel='onpromotion count by date', ylabel='unit_sales count');

In [None]:
train_df['weekday'] = pd.DatetimeIndex(train_df['date']).weekday
train_df['month'] = pd.DatetimeIndex(train_df['date']).month
r = train_df.pivot_table(values='unit_sales', index='weekday', columns='month', aggfunc=np.mean)
r.columns = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'June', 'July', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
r.index = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
f, ax = plt.subplots(figsize=(15, 6))
sns.heatmap(r, linewidths=.5, ax=ax, cmap='rainbow')

As we can see, very few items are on promotion, but the daily unit sales of promoted items rise with the number of promotions. Sales also increase as we approach the end of the year. We can see from our interactive graph that sales drop sharply on 2014-01-01, which marks the end of the holiday season, then quickly go back up. We didn't use all our data, but this is probably a pattern occurring every year.

We can also see a jump in sales between **2014-02-27 and 2014-03-01**, a 2-day interval. We may want to check whether that pattern repeats on the whole dataset and why it happens.

We can also observe that average unit sales are highest on weekends.
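
The weekday pattern can also be checked numerically rather than read off the heatmap. The sketch below sums sales per day and then averages those daily totals per weekday; the helper name `mean_daily_sales_by_weekday` and the toy figures are assumptions.

```python
import pandas as pd

def mean_daily_sales_by_weekday(df):
    """Average of the daily total unit_sales for each day of the week."""
    daily = df.groupby("date")["unit_sales"].sum()
    # Group the daily totals by the name of their weekday
    return daily.groupby(daily.index.day_name()).mean()

# Made-up sales over two Fridays, two Saturdays and one Sunday
toy_sales = pd.DataFrame({
    "date": pd.to_datetime([
        "2017-08-11", "2017-08-11", "2017-08-12", "2017-08-13",
        "2017-08-18", "2017-08-19", "2017-08-19",
    ]),
    "unit_sales": [2.0, 3.0, 10.0, 8.0, 7.0, 6.0, 6.0],
})
by_day = mean_daily_sales_by_weekday(toy_sales)
print(by_day)
```

Averaging daily totals rather than raw rows keeps a day with many small sales from looking weaker than a day with a few large ones.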

### 2.2 Stores

In [None]:
# store_nbr	city	state	type	cluster
f, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(14, 12))

sns.countplot(x="city", data=stores_df, ax=ax1)
sns.countplot(x="state", data=stores_df, ax=ax2)
sns.countplot(x="cluster", data=stores_df, ax=ax3)

plt.setp(ax1.xaxis.get_majorticklabels(), rotation=70)
plt.setp(ax2.xaxis.get_majorticklabels(), rotation=70)
plt.tight_layout()

Most of the stores are in Quito (in the Pichincha region) and Guayaquil (in the Guayas region), Quito being the capital of Ecuador. As for the clusters, they group stores of a similar "type".

### 2.3 Items

In [None]:
perish_items = final_df.merge(items_df, on='item_nbr')

f, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))
ax1.set_title("Perishable items relative to all different items")
ax2.set_title("Perishable items relative to all items in train/test set")
sns.countplot(x="perishable", data=items_df, ax=ax1)
sns.countplot(x="perishable", data=perish_items, ax=ax2);

In [None]:
f, ax = plt.subplots(figsize=(16, 8))
sns.countplot(x="family", data=items_df, ax=ax)
plt.setp(ax.xaxis.get_majorticklabels(), rotation=80);

We can see that ~1/3 of the items are marked as perishable, and roughly the same proportion holds for the rows of our train/test sets. The most frequent families are GROCERY I and BEVERAGES, which is to be expected.

### 2.4 Transactions

In [None]:
trans_per_date = transactions_df.groupby(["date"], as_index=False).sum()

source = ColumnDataSource(trans_per_date)
hover = HoverTool(
    tooltips=[
        ("date", "@date{%F}"),
        ("transactions", "@transactions{0.00 a}"),
    ], 
# Kaggle docker image is not up to date for bokeh
#     formatters={
#         'date': 'datetime'
#     },
)


p = figure(x_axis_type="datetime", tools=[hover, 'pan', 'box_zoom', 'wheel_zoom', 'reset'], 
           title="Transactions", plot_width=900, plot_height=400)
p.xgrid.grid_line_color=None
p.ygrid.grid_line_alpha=0.5
p.xaxis.axis_label = 'Time'
p.yaxis.axis_label = 'Transactions'

p.line(x="date", y="transactions", line_color="gray", source=source)

show(p)

We can see that the earthquake on April 16 2016 didn't visibly affect the overall number of transactions. But if we look at the individual items or cities that were most affected by the earthquake, we may find more relevant information.
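
One way to look at cities individually is to compare the mean transaction count in a window before and after the earthquake. This is only a sketch on made-up numbers; the helper name `quake_ratio_by_city` and the 28-day window are assumptions, not values from the competition.

```python
import pandas as pd

QUAKE = pd.Timestamp("2016-04-16")

def quake_ratio_by_city(transactions, stores, window_days=28):
    """Ratio of mean transactions after vs before the earthquake, per city."""
    merged = transactions.merge(stores[["store_nbr", "city"]], on="store_nbr")
    window = pd.Timedelta(days=window_days)
    before = merged[(merged["date"] >= QUAKE - window) & (merged["date"] < QUAKE)]
    after = merged[(merged["date"] >= QUAKE) & (merged["date"] < QUAKE + window)]
    mean_before = before.groupby("city")["transactions"].mean()
    mean_after = after.groupby("city")["transactions"].mean()
    # A ratio well above or below 1 flags a city whose activity changed
    return (mean_after / mean_before).sort_values(ascending=False)

# Made-up stores and transaction counts
toy_stores = pd.DataFrame({"store_nbr": [1, 2], "city": ["Quito", "Guayaquil"]})
toy_trans = pd.DataFrame({
    "date": pd.to_datetime(["2016-04-10", "2016-04-20", "2016-04-10", "2016-04-20"]),
    "store_nbr": [1, 1, 2, 2],
    "transactions": [100, 110, 50, 100],
})
ratio = quake_ratio_by_city(toy_trans, toy_stores)
print(ratio)
```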

### 2.5 Oil

Here we will display the change in oil price over time and see if it has any correlation with the sales over the same period, as the competition states: "Ecuador is an oil-dependent country and its economical health is highly vulnerable to shocks in oil prices".

In [None]:
source = ColumnDataSource(oil_df)
hover = HoverTool(
    tooltips=[
        ("date", "@date{%F}"),
        ("dcoilwtico", "@dcoilwtico{0.00 a}"),
    ], 
# Kaggle docker image is not up to date for bokeh
#     formatters={
#         'date': 'datetime'
#     },
)


p = figure(x_axis_type="datetime", tools=[hover, 'pan', 'box_zoom', 'wheel_zoom', 'reset'], 
           title="Oil price index", plot_width=900, plot_height=400)
p.xgrid.grid_line_color=None
p.ygrid.grid_line_alpha=0.5
p.xaxis.axis_label = 'Time'
p.yaxis.axis_label = 'Index'

p.line(x="date", y="dcoilwtico", line_color="gray", source=source)

show(p)

In [None]:
transac = transactions_df.groupby("date").sum().reset_index()
transac = transac.merge(oil_df, on='date')
sns.jointplot(x='dcoilwtico', y='transactions', data=transac);

The oil price doesn't seem to be strongly correlated with the number of transactions. Maybe it somehow affects the grocery prices, but that information is not relevant to us.
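
To put a number on the visual impression from the joint plot, the Pearson correlation between the daily transaction total and the oil price can be computed directly. A minimal sketch on made-up values; the helper name `oil_transactions_corr` is an assumption.

```python
import pandas as pd

def oil_transactions_corr(transactions, oil):
    """Pearson correlation between daily total transactions and oil price."""
    daily = transactions.groupby("date", as_index=False)["transactions"].sum()
    # Keep only the dates that have a known oil price
    merged = daily.merge(oil, on="date").dropna(subset=["dcoilwtico"])
    return merged["transactions"].corr(merged["dcoilwtico"])

# Made-up values, built to be perfectly anti-correlated
dates = pd.to_datetime(["2016-01-04", "2016-01-05", "2016-01-06", "2016-01-07", "2016-01-08"])
toy_transactions = pd.DataFrame({
    "date": dates,
    "store_nbr": [1, 1, 1, 1, 1],
    "transactions": [100, 200, 300, 400, 500],
})
toy_oil_prices = pd.DataFrame({"date": dates, "dcoilwtico": [40.0, 30.0, 20.0, 10.0, None]})
r = oil_transactions_corr(toy_transactions, toy_oil_prices)
print(round(r, 3))
```

A value close to 0 on the real data would support the reading above, while a value close to -1 or 1 would contradict it.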

### 2.6 Holidays events

In [None]:
holidays_events_df.head()

In [None]:
f, ax = plt.subplots(1, 3, figsize=(16, 4))
ax = ax.ravel()

sns.countplot(x="type", data=holidays_events_df, ax=ax[0])
sns.countplot(x="locale", data=holidays_events_df, ax=ax[1])
sns.countplot(x="transferred", data=holidays_events_df, ax=ax[2])
plt.tight_layout()

f, axf = plt.subplots(figsize=(14, 8))
sns.countplot(x="locale_name", data=holidays_events_df, ax=axf)
plt.setp(axf.xaxis.get_majorticklabels(), rotation=80)
plt.tight_layout()

Most events are of type "Holiday", followed by "Additional" and "Event", and most of them apply to the whole of Ecuador (which explains why we see so few regional events). "Additional" denotes a day added to a regular calendar holiday, for example Christmas Eve.