# Introduction

This competition is hosted by Corporacion Favorita, a large grocery company in Ecuador. The aim of the game is to accurately forecast the unit sales of items sold at Favorita supermarkets across Ecuador. Apart from the usual training and test data files, five supplementary data files are also provided to us.

This notebook takes a deep dive into each of the files provided in this competition to investigate what insights or observations can be derived from each. The structure of this analysis is as follows:

1. Data loading and inspection 
2. Supplementary Data exploration 
3. Training data exploration 
4. Feature ranking with learning models

In [None]:
# Importing the relevant libraries
import pandas as pd
import seaborn as sns
%matplotlib inline
import missingno as msno
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
import numpy as np
from scipy.fftpack import fft
from matplotlib import pyplot as plt

# 1. Data loading and inspection checks

To start off, let us load the various supplementary comma-separated value files with the Pandas package via the read_csv function as follows. Borrowing from Inversion's very helpful kernel on creating a dataframe with all [Date-Store_Item combinations](https://www.kaggle.com/inversion/dataframe-with-all-date-store-item-combinations) (please do check it out), I will load in the training data using some of his methods.

In [None]:
items = pd.read_csv("../input/items.csv")
holiday_events = pd.read_csv("../input/holidays_events.csv")
stores = pd.read_csv("../input/stores.csv")
oil = pd.read_csv("../input/oil.csv")
transactions = pd.read_csv("../input/transactions.csv",parse_dates=['date'])
# I read in the full training data just to get prior information and here is the output:
# Output: "125,497,040 rows | 6 columns"
train = pd.read_csv("../input/train.csv", nrows=6000000)

The training data contains a whopping 125,497,040 rows (and 6 columns). I will therefore load only the first 6 million rows (approx. 5% of the data) to get a rough idea of what is in store for us.
Let us take a quick peek at it to see what kinds of datatypes and columns are there.

In [None]:
train.head()
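Since the full file is so large, one thing that might help is telling read_csv which dtypes to use instead of letting it default to 64-bit types. The following is only a minimal sketch on a tiny in-memory stand-in for train.csv: the sample values are made up, and the chosen dtypes are my own assumption about what fits the real columns, so do check them against the actual data.

```python
import io
import pandas as pd

# Tiny stand-in for train.csv; the values are invented for illustration.
csv_text = """id,date,store_nbr,item_nbr,unit_sales,onpromotion
0,2013-01-01,25,103665,7.0,
1,2013-01-01,25,105574,1.0,
2,2013-01-02,1,105575,2.0,
"""

# Assumed dtypes: narrower integer/float types than the 64-bit defaults.
dtypes = {
    "id": "int64",
    "store_nbr": "int8",
    "item_nbr": "int32",
    "unit_sales": "float32",
}

sample = pd.read_csv(
    io.StringIO(csv_text),
    dtype=dtypes,
    parse_dates=["date"],
    nrows=2,  # same idea as nrows=6000000 in the cell above
)
print(sample.dtypes)
print(sample.memory_usage(deep=True).sum(), "bytes")
```

The same dtype dictionary could be passed when reading the real file, which should noticeably shrink the memory footprint of the 6 million rows.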

Note that the training data consists of only 6 columns despite having approx. 125 million rows, so on its own it gives our learning model very few features to train on. That is where the supplementary files come into play: we will most definitely have to join on the fields "store_nbr" (store number) and "item_nbr", as well as on dates (to bring in daily oil prices). So there is actually quite a lot of potential for feature enhancement and engineering.
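To make the joins described above concrete, here is a rough sketch using toy dataframes. Only the key columns ("store_nbr", "item_nbr", "date") and "dcoilwtico" mirror the real files; the values are invented and the "family" column is a hypothetical example of item metadata.

```python
import pandas as pd

# Toy frames; values are invented, only the key columns mirror the real files.
train_demo = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-01", "2013-01-02", "2013-01-02"]),
    "store_nbr": [1, 1, 2],
    "item_nbr": [100, 100, 200],
    "unit_sales": [3.0, 5.0, 1.0],
})
stores_demo = pd.DataFrame({"store_nbr": [1, 2], "cluster": [13, 8]})
# "family" is a hypothetical metadata column used for illustration.
items_demo = pd.DataFrame({"item_nbr": [100, 200], "family": ["GROCERY I", "DAIRY"]})
oil_demo = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-01", "2013-01-02"]),
    "dcoilwtico": [93.1, 93.0],
})

# Left joins keep every training row even when a lookup value is missing.
enriched = (
    train_demo
    .merge(stores_demo, on="store_nbr", how="left")
    .merge(items_demo, on="item_nbr", how="left")
    .merge(oil_demo, on="date", how="left")
)
print(enriched)
```

One caveat when applying this to the real files: the "date" columns need the same dtype on both sides of the merge, so they would have to be parsed consistently (for example with parse_dates) before joining.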

### NULL or missing values check

One standard check I like to carry out is to inspect all our data for any null or missing values. If there are any, we might have to think of strategies to handle them (e.g. imputation or removal of nulls). As an aside, the "missingno" package is a convenient library for visualising missing values.

In [None]:
print("Nulls in Oil columns: {0} => {1}".format(oil.columns.values,oil.isnull().any().values))
print("="*70)
print("Nulls in holiday_events columns: {0} => {1}".format(holiday_events.columns.values,holiday_events.isnull().any().values))
print("="*70)
print("Nulls in stores columns: {0} => {1}".format(stores.columns.values,stores.isnull().any().values))
print("="*70)
print("Nulls in transactions columns: {0} => {1}".format(transactions.columns.values,transactions.isnull().any().values))

As we can see, among the four files checked, the only missing data occurs in the oil data file, which provides the historical daily price of oil.
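As one possible imputation strategy for the oil prices, the gaps could be filled by interpolation. The sketch below runs on a toy series and assumes (this is my assumption, not something checked above) that the file may also skip some calendar days entirely, which is why it first reindexes onto a full daily date range.

```python
import numpy as np
import pandas as pd

# Toy oil series with missing prices and one missing calendar day (2013-01-05).
oil_demo = pd.DataFrame({
    "date": pd.to_datetime(
        ["2013-01-01", "2013-01-02", "2013-01-03", "2013-01-04", "2013-01-06"]
    ),
    "dcoilwtico": [np.nan, 93.0, np.nan, 95.0, 96.0],
})

# Put the series on a complete daily calendar so skipped days become NaN rows.
daily = (
    oil_demo.set_index("date")
    .reindex(pd.date_range("2013-01-01", "2013-01-06", freq="D"))
)

# Linear interpolation fills interior gaps; bfill handles the leading NaN.
daily["dcoilwtico"] = daily["dcoilwtico"].interpolate(method="linear").bfill()
print(daily)
```

Whether interpolation is appropriate depends on how the price is later used as a feature; forward-filling the last known price would be an equally simple alternative.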

# 2. Supplementary Data Exploration

## 2a. Oil data

First up we can take a look at the "oil.csv" provided to us. As alluded to in section 1, this file contains daily oil prices within a time range that covers both the train and test data timeframe so this is something to note should one's learning model take into account the trend in these oil prices. This supplementary oil data seems to be a very simple two column table with one column being the date and the other the daily oil price "dcoilwtico" which seems to be the abbreviation for [Crude oil prices: West Texas Intermediate - Cushing, Oklahoma](https://fred.stlouisfed.org/series/DCOILWTICO). 



In [None]:
oil.isnull().any(axis=0).values

**Interactive Visualisations with Plotly**

Let us take a look at the underlying data by plotting the daily oil prices as a time series via the interactive Python visualisation library Plot.ly. Here we invoke the Plot.ly scatter plot function by calling "Scatter"; it is a simple matter of providing the dates for the x-axis and the corresponding daily oil prices for the y-axis. I have also dropped the null prices by calling dropna().

In [None]:
# Drop rows with a missing price so that x and y stay aligned
oil_clean = oil.dropna(subset=['dcoilwtico'])
trace = go.Scatter(
    name='Oil prices',
    x=oil_clean['date'],
    y=oil_clean['dcoilwtico'],
    mode='lines',
    line=dict(color='rgba(220, 150, 0, 0.8)'),
    fillcolor='rgba(230, 200, 6, 0.3)',
    fill='tonexty')

data = [trace]

layout = go.Layout(
    yaxis=dict(title='Daily Oil price'),
    title='Daily oil prices from Jan 2013 till July 2017',
    showlegend = False)
fig = go.Figure(data=data, layout=layout)

py.iplot(fig, filename='pandas-time-series-error-bars')

#### *[THE ABOVE PLOT IS INTERACTIVE SO YOU CAN DRAG AND ZOOM ON IT. DOUBLE-CLICK TO GET BACK TO THE ORIGINAL SIZE]*

**Takeaway from the plots**

This plot shows that the daily oil price is on a general downward trend from Jan 2013 till July 2017. The price started 2013 by increasing and even broke the 100 dollar mark for a good few months in 2013 and 2014, but from the middle of 2014 it dropped drastically. Via some quick open-source research (i.e. Googling), this trend checks out: oil prices were fairly stable from 2010 till mid-2014, after which they fell sharply (due to a confluence of reasons such as weak demand from poor economic growth and surging alternative sources of crude oil from shale/tar sands).


## 2b. Stores data

With regards to the "stores.csv" file, the data dictionary on the Kaggle competition simply states that it contains metadata on the city, state, the store type and a column termed "cluster". This cluster column is a grouping of stores that are similar to each other and, as we will see in the later analysis, there are a total of 17 distinct clusters. There are 54 stores in total (based on the unique list of store_nbr), so I presume that all the unit sales and transactions come from data collected at these 54 stores.

In [None]:
stores.head(3)

**Inspecting the allocation of clusters to store numbers**

The first plot that we are going to generate shows our store numbers ordered by their respective store clusters, so that we can observe any apparent trends or relationships in the data. To do so, I will take our stores dataframe and group it on the columns "store_nbr" and "cluster" via the **groupby** statement. After that, I will unstack the grouping, which pivots the inner-most index level ("cluster") into columns and returns a DataFrame with one row per store_nbr and one column per cluster. This technique is commonly used for producing stacked barplots in Python, but since each store_nbr is unique, we simply get a barplot of store numbers ordered by their clusters.

In [None]:
# Unhide to see the sorted zip order
neworder = [23, 24, 26, 36, 41, 15, 29, 31, 32, 34, 39, 53, 4, 37, 40, 43, 8, 10, 19, 20, 33, 38, 13, 21, 2, 6, 7, 3, 22, 25, 27, 28, 30, 35, 42, 44, 48, 51, 16, 0, 1, 5, 52, 45, 46, 47, 49, 9, 11, 12, 14, 18, 17, 50]

In [None]:
nbr_cluster = stores.groupby(['store_nbr','cluster']).size()
nbr_cluster.unstack().iloc[neworder].plot(kind='bar',stacked=True, colormap= 'tab20', figsize=(13,11),  grid=False)
plt.title('Store numbers and the clusters they are assigned to')
plt.ylabel('')
plt.xlabel('Store number')
plt.show()
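The hard-coded neworder list above appears to hold the row positions of the stores sorted by cluster. As a sketch of how such an ordering could be derived rather than typed out, here is the idea on a toy dataframe (the values are invented, and I am assuming the stores frame has a default 0..n-1 index and is already sorted by store_nbr, so that positions line up with the unstacked table):

```python
import pandas as pd

# Toy stand-in for stores.csv; values are invented.
stores_demo = pd.DataFrame({
    "store_nbr": [1, 2, 3, 4, 5],
    "cluster": [13, 8, 13, 3, 8],
})

# Sort by cluster, then by store number; with a default RangeIndex the
# resulting index gives the row positions in cluster order.
order = stores_demo.sort_values(["cluster", "store_nbr"]).index.tolist()
print(order)

# Same groupby/unstack table as in the cell above, reordered by position.
table = stores_demo.groupby(["store_nbr", "cluster"]).size().unstack()
print(table.iloc[order])
```

Deriving the order this way keeps the plot correct if the stores file ever changes, at the cost of relying on the index assumption stated above.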

**Takeaways from the plot**

From visualising the store numbers side-by-side based on the clustering, we can identify certain patterns. For example, clusters 3, 6, 10 and 15 are the most common store clusters, since more store_nbrs are attributed to them than to the others. We can also identify outlier stores based on their assignment.

**Stacked Barplots of Types against clusters**

Here it might be informative to look at the distribution of clusters based on the store type to see if we can identify any apparent relationship between types and the way the company has decided to cluster the particular store. Again we apply the groupby operation but this time on type and on cluster. This time when we pivot based off this grouped operation, we are able to get counts of each distinct cluster distributed and stacked on top of other clusters per store type as follows:

In [None]:
type_cluster = stores.groupby(['type','cluster']).size()
type_cluster.unstack().plot(kind='bar',stacked=True, colormap= 'viridis_r', figsize=(13,11),  grid=False)
plt.title('Stacked Barplot of Store types and their cluster distribution')
plt.ylabel('Count of clusters in a particular store type')
plt.show()
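For anyone who wants to look at the numbers behind a stacked barplot like this, the grouped-and-unstacked table is just a contingency table of counts. A small sketch on invented data, showing that pd.crosstab produces the same counts as the groupby/unstack route (with fill_value=0 so that empty combinations show as zero rather than NaN):

```python
import pandas as pd

# Toy stand-in for the type and cluster columns; values are invented.
stores_demo = pd.DataFrame({
    "type": ["A", "A", "B", "D", "D", "D"],
    "cluster": [14, 17, 6, 13, 13, 8],
})

# Route 1: group, count, then pivot the cluster level into columns.
by_groupby = stores_demo.groupby(["type", "cluster"]).size().unstack(fill_value=0)

# Route 2: pandas' built-in contingency table.
by_crosstab = pd.crosstab(stores_demo["type"], stores_demo["cluster"])

print(by_groupby)
print((by_groupby.values == by_crosstab.values).all())
```

Calling .plot(kind='bar', stacked=True) on either table should give the same kind of chart as above.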

**Takeaway from the plots**

Most of the store types seem to contain a mix of the clusters from both

**Stacked barplot of types of stores across the different cities**

Another interesting distribution to observe would be the types of stores that Corporacion Favorita has decided to open in each city in Ecuador. 

In [None]:
city_cluster = stores.groupby(['city','type']).size()
city_cluster.unstack().plot(kind='bar',stacked=True, colormap= 'PuBu', figsize=(13,11),  grid=False)
plt.title('Stacked Barplot of Store types distributed across cities')
plt.ylabel('Count of stores in a particular city')
plt.show()

**Takeaways from the plot**: 

As observed from the stacked barplots, two cities stand out in terms of the number and variety of store types they offer: Guayaquil and Quito. This should come as no surprise, as [Quito](https://en.wikipedia.org/wiki/Quito) is the capital city of Ecuador while [Guayaquil](https://en.wikipedia.org/wiki/Guayaquil) is the largest and most populous city. It is therefore logical to expect Corporacion Favorita to target the major cities with the most diverse store types and with more stores, as evinced by the numerous store_nbrs attributed to those two cities.

## 2c. Holiday Events data

Trudging on, we can inspect the "holiday_events.csv" file.

In [None]:
holiday_events.head(3)

In [None]:
holiday_local_type = holiday_events.groupby(['locale_name', 'type']).size()
holiday_local_type.unstack().plot(kind='bar',stacked=True, colormap= 'inferno', figsize=(12,10),  grid=False)
plt.title('Stacked Barplot of locale name against event type')
plt.ylabel('Count of entries')
plt.show()

In [None]:
x = holiday_events.groupby(['type', 'description']).size()
x.unstack().plot(kind='bar',stacked=True, colormap= 'inferno', figsize=(12,10),  grid=False)
plt.title('Stacked Barplot of event type against description')
plt.show()

## 2d. Transactions data



**End-of-Year PERIODICITY IN TRANSACTION PATTERN**

In [None]:
print(transactions.head(3))
print("="*60)
print(transactions.shape)

In [None]:
transactions.iloc[33700]

In [None]:
plt.figure(figsize=(13,11))
plt.plot(transactions.date.values, transactions.transactions.values)
# Mark 23 December of each year; Timestamps match the datetime x-axis
for year in range(2013, 2017):
    plt.axvline(x=pd.Timestamp(year, 12, 23), color='red', alpha=0.2)
plt.ylim(-50, 10000)
plt.ylabel('transactions per day')
plt.xlabel('Date')
plt.show()
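The red lines above mark 23 December of each year. To check such an end-of-year peak numerically rather than by eye, one could sum the transactions over all stores per day and then look up the busiest day of each year. A minimal sketch on synthetic data (the counts are invented, with a bump deliberately injected on 23 December):

```python
import numpy as np
import pandas as pd

# Synthetic per-store daily counts with a bump injected on 23 December.
dates = pd.date_range("2013-01-01", "2014-12-31", freq="D")
demo = pd.DataFrame({
    "date": np.tile(dates.to_numpy(), 2),
    "store_nbr": np.repeat([1, 2], len(dates)),
    "transactions": 1000,
})
is_dec_23 = (demo["date"].dt.month == 12) & (demo["date"].dt.day == 23)
demo.loc[is_dec_23, "transactions"] = 3000

# Sum over stores, then find the busiest day within each calendar year.
daily_total = demo.groupby("date")["transactions"].sum()
peaks = daily_total.groupby(daily_total.index.year).idxmax()
print(peaks)
```

Applied to the real transactions dataframe, the same two groupby steps would show whether the yearly maximum really falls just before Christmas.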

In [None]:
transactions.head()

## 2e. Items data

In [None]:
items.head()

# 3. Training Data exploration

In [None]:
import sklearn
from sklearn import linear_model
from sklearn import model_selection
ridge = linear_model.Ridge()

In [None]:
data = train
(train, test) = model_selection.train_test_split(data, train_size=0.75)

In [None]:
ridge.fit(train[['store_nbr','item_nbr']], train['unit_sales'])

In [None]:
print(ridge.score(train[['store_nbr','item_nbr']], train['unit_sales']))
print(ridge.score(test[['store_nbr','item_nbr']], test['unit_sales']))

In [None]:
test_data = pd.read_csv('../input/test.csv')
test_data.head()

In [None]:
sample_submission = pd.read_csv('../input/sample_submission.csv')
sample_submission.head()

In [None]:
predictions = ridge.predict(test_data[['store_nbr','item_nbr']])
print(predictions)

In [None]:
sample_submission['unit_sales'] = predictions

In [None]:
sample_submission.to_csv('submission11.csv', index=False)
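The assignment above relies on sample_submission.csv listing its rows in the same order as test.csv. A slightly more defensive sketch would build the submission directly from the test ids. The frames below are toy stand-ins with invented values, and clipping negative predictions to zero is my own assumption about what a sensible unit_sales submission looks like, not something established in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for test.csv rows and model output; values are invented.
test_demo = pd.DataFrame({
    "id": [10, 11, 12],
    "store_nbr": [1, 1, 2],
    "item_nbr": [100, 200, 100],
})
preds_demo = np.array([4.2, -0.3, 1.7])

# Build the submission from the test ids and clip negatives to zero
# (assumption: negative sales forecasts are not meaningful here).
submission_demo = pd.DataFrame({
    "id": test_demo["id"].to_numpy(),
    "unit_sales": np.clip(preds_demo, 0, None),
})

# Sanity check: one prediction per test row.
assert len(submission_demo) == len(test_demo)
print(submission_demo)
```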

In [None]:
sub = pd.read_csv('submission11.csv')
sub.head()

In [None]:
%ls