# Keeping it simple but no simpler

Let's start by revisiting the great kernel by XYZT, which shows that EDA and some thinking can go a long way without black-box algorithms:
https://www.kaggle.com/thexyzt/keeping-it-simple-by-xyzt

A few points to highlight in the tour below:
1. We implement a simple SMAPE function to evaluate the quality of our predictions on the training set. There is no need to rely blindly on the upcoming submission score to get an idea of whether our predictor has a decent chance of making the cut.
2. We start with an even more basic model than XYZT's, predicting constant sales equal to the average of the item sales per store, to get an idea of the variability of the data.
3. We show a few mistakes or glitches that we found easy to make when attempting to follow the spirit and not the letter of XYZT's kernel. Indeed, for this competition's dataset, larger averages tend to perform better - for instance the "month factor" model is more precise if computed across all stores and items rather than for every store and item pair.
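
To make point 1 concrete, here is a minimal NumPy-only sketch of SMAPE on made-up numbers. The name `smape_np` and the toy arrays are assumptions for illustration only; the `smape` function actually used in this notebook is defined further down and works on pandas Series.

```python
import numpy as np

def smape_np(y_true, y_pred):
    """Symmetric mean absolute percentage error, in percent (between 0 and 200)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    num = np.abs(y_pred - y_true)
    den = np.abs(y_pred) + np.abs(y_true)
    # Convention assumed here: a 0/0 term (both values are zero) counts as zero error.
    ratio = np.divide(num, den, out=np.zeros_like(num), where=den != 0)
    return 200.0 * ratio.mean()

# Hypothetical toy values, not taken from the competition data.
toy_true = np.array([10, 20, 0, 5])
toy_pred = np.array([12, 20, 0, 0])
print('Toy SMAPE: {:,.4f}'.format(smape_np(toy_true, toy_pred)))
```

Note that predicting 0 when the true value is non-zero costs the maximum per-sample penalty, which is why a single bad row can dominate a small sample.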

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.pyplot import rcParams
from IPython.display import display

rcParams['figure.figsize'] = 15, 7

train = pd.read_csv(r'../input/train.csv', parse_dates=['date'])
test = pd.read_csv(r'../input/test.csv', parse_dates=['date'])


# Getting a feel of the dataset

In [None]:
print('Training set head:')
display(train.head())
print('Test set head:')
display(test.head())
n_stores = train['store'].nunique()
print('# of stores: {:,}'.format(n_stores))
n_items = train['item'].nunique()
print('# of items: {:,}'.format(n_items))
n_samples = len(train)/(n_stores*n_items)
print('# of training samples per item and store:')
display(train.groupby(['store','item'])[['date']].nunique().describe().style.format('{:,.0f}'))
print('# of test samples per item and store:')
display(test.groupby(['store','item'])[['date']].nunique().describe().style.format('{:,.0f}'))
# Recent pandas versions no longer report 'first'/'last' in describe() for datetimes, so we take min/max directly
print('First and last date of training set:')
display(train['date'].agg(['min','max']).dt.date.to_frame())
print('First and last date of test set:')
display(test['date'].agg(['min','max']).dt.date.to_frame())

We now add the usual suspect fields when working with time series (year, month, day, dayofweek) and keep getting acquainted with the dataset.

In [None]:
def expand(df,date_col):
    df['year'] = df[date_col].dt.year
    df['month'] = df[date_col].dt.month
    df['day'] = df[date_col].dt.day
    df['dayofweek'] = df[date_col].dt.dayofweek

In [None]:
expand(train,'date')
expand(test,'date')

In [None]:
print('Training set details by year and month:')
display(train.drop_duplicates(subset=['date']).groupby(['year','month'])[['day']].count().unstack(1))

In [None]:
print('Test set details by year and month:')
display(test.drop_duplicates(subset=['date']).groupby(['year','month'])[['day']].count().unstack(1))

In [None]:
item_number = 1
store_number = 1
fig = plt.figure(figsize=(15,7))
ax = fig.gca()
train.loc[(train['item'] == item_number) & (train['store'] == store_number)].plot(x='date',y='sales',ax=ax)
ax.legend().remove()
ax.set_xlabel('')
ax.set_title('Sales for item #{} and store #{}'.format(item_number,store_number))
ax.grid(axis='y',ls='--')

# First models of the dataset

We now turn to implementing a few simple models of the dataset to better understand where the sales variability comes from.

In [None]:
n_items = 4
n_stores = 4
min_df = train.loc[(train['item'] < n_items) & (train['store'] < n_stores)].groupby(['item','store','year'])[['sales']].min().rename(columns = {'sales':'min sales'}).unstack(1)
max_df = train.loc[(train['item'] < n_items) & (train['store'] < n_stores)].groupby(['item','store','year'])[['sales']].max().rename(columns = {'sales':'max sales'}).unstack(1)
mean_df = train.loc[(train['item'] < n_items) & (train['store'] < n_stores)].groupby(['item','store','year'])[['sales']].mean().rename(columns = {'sales':'mean sales'}).unstack(1)
std_df = train.loc[(train['item'] < n_items) & (train['store'] < n_stores)].groupby(['item','store','year'])[['sales']].std().rename(columns = {'sales':'stddev sales'}).unstack(1)
print('A few descriptive statistics by item and store:')
display(pd.concat([min_df,max_df,mean_df,std_df],axis=1).style.format('{:,.0f}'))

In [None]:
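def smape(y_true, y_pred):
    # Symmetric MAPE in percent; expects pandas Series. A 0/0 term (both values zero) gives NaN and is counted as zero error.
    return 200.0*np.mean((np.abs(y_pred - y_true) / (np.abs(y_pred) + np.abs(y_true))).fillna(0))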

In [None]:
model_train_smape_dict = {}
model_test_smape_dict = {}

## The Most Basic Model
The most basic model is to predict the average sales per item and per store
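
As a rough sketch of the idea on a tiny made-up frame (the `toy` data and variable names below are hypothetical, not the competition data): each store and item pair gets one constant prediction, which is then broadcast back onto every row with a merge.

```python
import pandas as pd

# Hypothetical toy data: 2 stores x 1 item, 3 days each.
toy = pd.DataFrame({
    'store': [1, 1, 1, 2, 2, 2],
    'item':  [1, 1, 1, 1, 1, 1],
    'sales': [10, 12, 14, 30, 30, 33],
})

# One constant prediction per (store, item) pair: the mean of its sales.
toy_model = (toy.groupby(['store', 'item'], as_index=False)['sales']
                .mean()
                .rename(columns={'sales': 'salespred'}))

# Broadcast the per-pair mean back onto every row of the original frame.
toy_cv = toy.merge(toy_model, on=['store', 'item'], how='left')
print(toy_cv)
```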

In [None]:
most_basic_model = train.groupby(['store','item'])[['sales']].mean().reset_index()
most_basic_model['sales'] = np.round(most_basic_model['sales']).astype(int)
display(most_basic_model.head())
cross_val = pd.merge(train,most_basic_model.rename(columns={'sales':'salespred'}),on=['store','item'],how='left')
model_smape = smape(cross_val['sales'],cross_val['salespred'])
model_train_smape_dict['Most Basic Model'] = model_smape
print('SMAPE on train set is {:,.4f}'.format(model_smape))

sub = pd.merge(test,most_basic_model,on=['store','item'],how='left')
sub = sub[['id','sales']].copy()
sub.to_csv('submission-most_basic_model.csv',index=False)
model_test_smape_dict['Most Basic Model'] = 28.34071 #given by kaggle

If we illustrate graphically what this most basic model does:

In [None]:
item_number = 1
store_number = 1
fig = plt.figure(figsize=(15,7))
ax = fig.gca()
cross_val.loc[(cross_val['item'] == item_number) & (cross_val['store'] == store_number)].plot(x='date',y='sales',ax=ax)
cross_val.loc[(cross_val['item'] == item_number) & (cross_val['store'] == store_number)].plot(x='date',y='salespred',ax=ax,c='r',ls='--')
ax.legend().remove()
ax.set_xlabel('')
ax.set_title('Sales and Predicted Sales for item #{} and store #{}'.format(item_number,store_number))
ax.grid(axis='y',ls='--')

We can look at the prediction error and absolute error:

In [None]:
cross_val['error'] = cross_val['sales'] - cross_val['salespred']
cross_val['abserror'] = np.abs(cross_val['error'])

sma_window = 90
cross_val['error_sma'] = cross_val['error'].rolling(sma_window).mean()
cross_val['abserror_sma'] = cross_val['abserror'].rolling(sma_window).mean()

fig = plt.figure(figsize=(15,7))
ax = fig.gca()
cross_val.loc[(cross_val['item'] == item_number) & (cross_val['store'] == store_number)].plot(x='date',y='error',ax=ax)
cross_val.loc[(cross_val['item'] == item_number) & (cross_val['store'] == store_number)].plot(x='date',y='error_sma',ax=ax,c='r',ls='--')
ax.legend().remove()
ax.set_xlabel('')
ax.set_title('Sales prediction error for item #{} and store #{}'.format(item_number,store_number))
ax.grid(axis='y',ls='--')

In [None]:
fig = plt.figure(figsize=(15,7))
ax = fig.gca()
cross_val.loc[(cross_val['item'] == item_number) & (cross_val['store'] == store_number)].plot(x='date',y='abserror',ax=ax)
cross_val.loc[(cross_val['item'] == item_number) & (cross_val['store'] == store_number)].plot(x='date',y='abserror_sma',ax=ax,c='r',ls='--')
ax.legend().remove()
ax.set_xlabel('')
ax.set_title('Sales prediction abs error for item #{} and store #{}'.format(item_number,store_number))
ax.grid(axis='y',ls='--')

# Still basic models
We can refine the previous most basic model into still-basic models that attempt to predict a more accurate and relevant average than the store & item sales average over the whole training set.


## Year models

We could start by trying to add the year information to the window used to compute the prediction average. However, here comes the first glitch: such a model involves predicting the average for the test set year, which requires some more modelling that we shall tackle later on.

For now, let's pretend that we have an oracle that for each past year of the training set allowed us to predict with 100% accuracy the average sales for the coming year.

### Year only

In [None]:
year_basic_model = train.groupby(['store','item','year'])[['sales']].mean().reset_index()
year_basic_model['sales'] = np.round(year_basic_model['sales']).astype(int)
display(year_basic_model.head())

In [None]:
cross_val = pd.merge(train,year_basic_model.rename(columns={'sales':'salespred'}),on=['store','item','year'],how='left')
model_smape = smape(cross_val['sales'],cross_val['salespred'])
model_train_smape_dict['Year Model'] = model_smape
print('SMAPE on train set is {:,.4f}'.format(model_smape))


### Year and Month

We keep trying to refine our still basic model and check whether adding the month improves our results

In [None]:
year_month_model = train.groupby(['store','item','year','month'])[['sales']].mean().reset_index()
year_month_model['sales'] = np.round(year_month_model['sales']).astype(int)
display(year_month_model.head())

In [None]:
cross_val = pd.merge(train,year_month_model.rename(columns={'sales':'salespred'}),on=['store','item','year','month'],how='left')
model_smape = smape(cross_val['sales'],cross_val['salespred'])
model_train_smape_dict['Year Month Model'] = model_smape
print('SMAPE on train set is {:,.4f}'.format(model_smape))

### Year, Month & Day of Week (DOW)

In [None]:
year_month_dow_model = train.groupby(['store','item','year','month','dayofweek'])[['sales']].mean().reset_index()
year_month_dow_model['sales'] = np.round(year_month_dow_model['sales']).astype(int)
display(year_month_dow_model.head())

cross_val = pd.merge(train,year_month_dow_model.rename(columns={'sales':'salespred'}),on=['store','item','year','month','dayofweek'],how='left')
model_smape = smape(cross_val['sales'],cross_val['salespred'])
model_train_smape_dict['Year Month Dow Model'] = model_smape
print('SMAPE on train set is {:,.4f}'.format(model_smape))

Let us note that it is meaningless to try the Year, Month & Day model: the triplet Year, Month & Day always uniquely identifies a single day, so each averaging window would contain nothing but the very sample being predicted.
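
A quick way to convince oneself, sketched on a plain calendar rather than on the training set (the 2013-01-01 to 2017-12-31 range is an assumption standing in for the five training years):

```python
import pandas as pd

# Assumed five-year daily calendar standing in for the training dates.
dates = pd.DataFrame({'date': pd.date_range('2013-01-01', '2017-12-31', freq='D')})
dates['year'] = dates['date'].dt.year
dates['month'] = dates['date'].dt.month
dates['day'] = dates['date'].dt.day

# Number of calendar days falling into each (year, month, day) window.
ymd_sizes = dates.groupby(['year', 'month', 'day']).size()
print('Largest (year, month, day) window: {} day(s)'.format(ymd_sizes.max()))
```

Every window holds a single day, so such a model would simply memorise the training set.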

## Month Models

Since for now we cannot submit the previous models for lack of sales prediction for the test year 2018, we can look into removing the year from the previous models and look at some month models

### Month Only

In [None]:
month_model = train.groupby(['store','item','month'])[['sales']].mean().reset_index()
month_model['sales'] = np.round(month_model['sales']).astype(int)
display(month_model.head())

In [None]:
cross_val = pd.merge(train,month_model.rename(columns={'sales':'salespred'}),on=['store','item','month'],how='left')
model_smape = smape(cross_val['sales'],cross_val['salespred'])
model_train_smape_dict['Month Model'] = model_smape
print('SMAPE on train set is {:,.4f}'.format(model_smape))

In [None]:
sub = pd.merge(test,month_model,on=['store','item','month'],how='left')
sub = sub[['id','sales']].copy()
sub.to_csv('submission-month_model.csv',index=False)
model_test_smape_dict['Month Model'] = 20.25930 #given by kaggle 

### Month and Day of Week

In [None]:
month_dow_model = train.groupby(['store','item','month','dayofweek'])[['sales']].mean().reset_index()
month_dow_model['sales'] = np.round(month_dow_model['sales']).astype(int)
display(month_dow_model.head())

In [None]:
cross_val = pd.merge(train,month_dow_model.rename(columns={'sales':'salespred'}),on=['store','item','month','dayofweek'],how='left')
model_smape = smape(cross_val['sales'],cross_val['salespred'])
model_train_smape_dict['Month Dow Model'] = model_smape
print('SMAPE on train set is {:,.4f}'.format(model_smape))

In [None]:
sub = pd.merge(test,month_dow_model,on=['store','item','month','dayofweek'],how='left')
sub = sub[['id','sales']].copy()
sub.to_csv('submission-month_dow_model.csv',index=False)
model_test_smape_dict['Month Dow Model'] = 18.95844 #given by kaggle

## Using the Day instead of the Day of the Week
Now we come to XYZT's first model:
"Find the average of sales of an item at a store on the day and month of sales and use that as the prediction. This effectively gives us a sample size of 5 (since the training set is five years long) to find the mean. This is clearly a sub-optimal solution because almost no thought goes into it."

Please note that our implementation takes only a few seconds to run.

In [None]:
month_day_model = train.groupby(['store','item','month','day'])[['sales']].mean().reset_index()
month_day_model['sales'] = np.round(month_day_model['sales']).astype(int)
display(month_day_model.head())

cross_val = pd.merge(train,month_day_model.rename(columns={'sales':'salespred'}),on=['store','item','month','day'],how='left')
model_smape = smape(cross_val['sales'],cross_val['salespred'])
model_train_smape_dict['Month Day Model'] = model_smape
print('SMAPE on train set is {:,.4f}'.format(model_smape))

sub = pd.merge(test,month_day_model,on=['store','item','month','day'],how='left')
sub = sub[['id','sales']].copy()

sub.to_csv('submission-month_day_model.csv',index=False)
model_test_smape_dict['Month Day Model'] = 22.13108 #given by kaggle

## Summary of the evaluated models so far

In [None]:
models_list = ['Most Basic Model',
               'Year Model','Year Month Model','Year Month Dow Model',
               'Month Model','Month Dow Model','Month Day Model']

display(pd.merge(
    pd.DataFrame(index=model_train_smape_dict.keys(),data=list(model_train_smape_dict.values()),columns=['SMAPE train']),
    pd.DataFrame(index=model_test_smape_dict.keys(),data=list(model_test_smape_dict.values()),columns=['SMAPE test']),
    right_index = True,left_index=True,how='outer').loc[models_list].style.format('{:,.1f}'))
    

With hardly any technical machinery we were able to build a predictor with a test SMAPE of 19.0.
Let's take this model (Month Dow Model) and its score (18.95844 to be precise) as our benchmark.

Let's also notice that the Month Day model has worse predictive power than the Month Dow or even the Month model. The apparent precision (using the exact same calendar days in the past) does not compensate in our dataset for the loss of averaging: only 5 days are used to predict each test sample, vs. approx. 20 to 25 (4 or 5 occurrences of each day of week per month x 5 years) or approx. 150 (30 days in a month x 5 years) for the Month Dow and the Month models respectively.

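
The window sizes quoted above can be checked with a short sketch on a plain calendar. The date range below is an assumption standing in for the five training years; the counts are per store and item pair.

```python
import pandas as pd

# Assumed five-year daily calendar standing in for the training dates.
cal = pd.DataFrame({'date': pd.date_range('2013-01-01', '2017-12-31', freq='D')})
cal['month'] = cal['date'].dt.month
cal['day'] = cal['date'].dt.day
cal['dayofweek'] = cal['date'].dt.dayofweek

# Average number of days that feed each prediction, per model.
window_sizes = pd.Series({
    'Month Day Model': cal.groupby(['month', 'day']).size().mean(),
    'Month Dow Model': cal.groupby(['month', 'dayofweek']).size().mean(),
    'Month Model': cal.groupby('month').size().mean(),
})
print(window_sizes.round(1))
```
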
# Predicting the sales per year
We now focus on trying to predict the sales per year so as to be able to submit something for our Year Model family, of which the Year Month Dow model looks promising based on the SMAPE train of 11.0

## Forecasting the sales growth

We start by looking at the sales growth per store for each year before zooming in on the sales per store and per item for each year.

In [None]:
sales_year = train.groupby(['store','year'])[['sales']].mean()
sales_year = sales_year.unstack(0).pct_change().dropna()
display(sales_year.head())

In [None]:
sales_year.plot(kind='bar')
plt.legend(loc='upper left',bbox_to_anchor=[1.0,0.5]);
plt.grid(axis='y',ls='--')
plt.title('Sales YoY growth per store');

While the year-on-year (YoY) sales % growth seems to be approximately the same for all stores, there is unfortunately no obvious trend to this growth. We can attempt to fit a growth factor, but there wouldn't be any intuition behind the resulting 2018 growth.

In [None]:
sales_growth = sales_year.mean(1)

p1 = np.poly1d(np.polyfit(sales_growth.index, sales_growth.values, 1))
p2 = np.poly1d(np.polyfit(sales_growth.index, sales_growth.values, 2))
p3 = np.poly1d(np.polyfit(sales_growth.index, sales_growth.values, 3))

sales_growth = pd.DataFrame(np.c_[sales_growth.values,p1(sales_growth.index),p2(sales_growth.index),p3(sales_growth.index)],index=sales_growth.index,columns=['Actual','Linear','Quadratic','Cubic'])
display(sales_growth.head())

display(pd.DataFrame(np.abs((sales_growth.subtract(sales_growth['Actual'],axis='index')).drop('Actual',axis=1)).mean()).rename(columns={0:'MAE'}))

In [None]:
# The cubic fit is of course exact (4 points, 4 coefficients) but we can see the overfitting:
year_to_forecast = 2018
for i,p in enumerate([p1,p2,p3]):
    print('Degree {} prediction of the YoY growth %: {:,.3f}'.format(i+1,p(year_to_forecast)))

The dispersion of the above forecasts is unfortunately not surprising given the very small number of data points (namely 4). In particular, this shows the danger of overfitting a degree-three polynomial. Since forecasting the YoY growth, which is a derivative of the yearly sales, does not work (small sample and maybe too much noise), we can attempt to forecast the yearly sales themselves.
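
The effect is easy to reproduce on made-up numbers. In the sketch below the four growth rates are hypothetical (they are not the values computed above), and the years are centred before fitting, which is an extra step assumed here to keep the fit well conditioned.

```python
import numpy as np

# Hypothetical YoY growth rates for four years (not the notebook's values).
years = np.array([2014, 2015, 2016, 2017])
growth = np.array([0.15, 0.04, 0.09, 0.04])

# Centre the years so the polynomial fit is numerically well conditioned.
x = years - years.mean()
x_next = 2018 - years.mean()

forecasts = {}
residuals = {}
for degree in (1, 2, 3):
    coeffs = np.polyfit(x, growth, degree)
    # Extrapolated growth for 2018 and worst in-sample error for this degree.
    forecasts[degree] = np.polyval(coeffs, x_next)
    residuals[degree] = np.abs(np.polyval(coeffs, x) - growth).max()
    print('Degree {}: 2018 forecast {:,.3f}, max in-sample error {:.1e}'.format(
        degree, forecasts[degree], residuals[degree]))
```

The cubic passes through all four points, so its in-sample error is essentially zero, yet its extrapolation lands far from the other two fits.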

## Forecasting the sales

In [None]:
sales_year = train.groupby(['store','year'])[['sales']].sum()
sales_year = sales_year.groupby(['store'])[['sales']].transform(lambda x: x/x.sum())
display(sales_year.head())

sales_year.unstack(0).plot(kind='bar')
plt.legend(loc='right',bbox_to_anchor=[1.4,0.5])
plt.title('Sales per year per store as a % of total store sales over the period');

If we average across all stores:

In [None]:
sales_year = train.groupby('year')[['sales']].mean() / train.groupby('year')[['sales']].mean().mean()
display(sales_year)

In [None]:
x = sales_year.index
y = sales_year.values.ravel()

p1 = np.poly1d(np.polyfit(x, y, 1))
p2 = np.poly1d(np.polyfit(x, y, 2))

plt.scatter(x,y,c='k')
plt.plot(x,p1(x),c='b')
plt.plot(x,p2(x),c='orange')
plt.grid(axis='y',ls='--')
plt.title('Linear and Quadratic Fit of Sales per Year');

In [None]:
# If we choose the quadratic fit:
adj_factor = p2(2018)
print('Adjustment factor: {:,.4f}'.format(p2(2018)))

To cross-check our results, we look into an analogous computation in which we skip the intermediate step of computing the sales per store as a % of total sales for this store and then averaging. We directly compute the yearly sales for all stores divided by the total sales for all stores over the period:

In [None]:
sales_year_alt = train['year'].nunique() * train.groupby('year')[['sales']].sum() / train['sales'].sum()
display(pd.concat([sales_year.rename(columns = {'sales' : 'sales per year per store averaged'}),
                   sales_year_alt.rename(columns = {'sales' : 'sales per year averaged'})],
                  axis = 1))

In [None]:
x = sales_year_alt.index
y = sales_year_alt.values.ravel()

p1_alt = np.poly1d(np.polyfit(x, y, 1))
p2_alt = np.poly1d(np.polyfit(x, y, 2))

plt.scatter(x,y,c='k')
plt.plot(x,p1_alt(x),c='b')
plt.plot(x,p2_alt(x),c='orange')
plt.grid(axis='y',ls='--')
plt.title('Linear and Quadratic Fit of Sales per Year');

In [None]:
# If we choose the quadratic fit:
adj_factor_alt = p2_alt(2018)
print('Adjustment factor alt: {:,.4f}'.format(p2_alt(2018)))

The difference in sales forecast is not huge between the two forecasting methods but it does exist nonetheless.
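
One possible source of such a gap, sketched on synthetic data: a factor built from yearly means and a factor built from yearly sums coincide only when every year has the same number of rows, and a leap year breaks that. The toy series and variable names below are assumptions for illustration, not the competition data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series: one sales value per day, with a mild upward trend.
days = pd.date_range('2013-01-01', '2017-12-31', freq='D')
toy = pd.DataFrame({'year': np.asarray(days.year),
                    'sales': np.linspace(40.0, 60.0, len(days))})

def yearly_factors(df):
    """Return the (mean-based, sum-based) yearly factors for a frame."""
    yearly_mean = df.groupby('year')['sales'].mean()
    yearly_sum = df.groupby('year')['sales'].sum()
    from_means = yearly_mean / yearly_mean.mean()
    from_sums = df['year'].nunique() * yearly_sum / df['sales'].sum()
    return from_means, from_sums

factor_from_means, factor_from_sums = yearly_factors(toy)
print((factor_from_sums - factor_from_means).round(4))

# Dropping 29 February gives every year 365 rows: the two factors then agree.
is_feb29 = np.asarray((days.month == 2) & (days.day == 29))
eq_means, eq_sums = yearly_factors(toy.loc[~is_feb29])
print(np.allclose(eq_means, eq_sums))
```
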
# Using this sales prediction to complete the year models

We should still check whether the item mix per store stays constant from year to year, but for now let's dive directly into completing our year models.

## Year Model Completed

In [None]:
year_basic_model = train.groupby(['store','item','year'])[['sales']].mean().reset_index()
year_basic_model['sales'] = np.round(year_basic_model['sales']).astype(int)

year_basic_model = pd.merge(year_basic_model,
                            train.groupby(['store','item'])[['sales']].mean().reset_index().rename(columns={'sales' : 'mean_sales'}),
                            on = ['store','item'])

adj_factor_model = pd.DataFrame(index=np.arange(2013,2018),data=[p2(x) for x in np.arange(2013,2018)],columns=['sales_adj_factor'])
adj_factor_model = adj_factor_model.reset_index().rename(columns={'index' : 'year'})

year_basic_model = pd.merge(year_basic_model,adj_factor_model,on='year',how='left')
year_basic_model['salespred'] = np.round(year_basic_model['mean_sales'] * year_basic_model['sales_adj_factor']).astype(int)

display(year_basic_model.head())

In [None]:
model_smape = smape(year_basic_model['sales'],year_basic_model['salespred'])
print('SMAPE between Year Model & Year Model Completed set is {:,.4f}'.format(model_smape))

In [None]:
cross_val = pd.merge(train,year_basic_model[['store','item','year','salespred']],on=['store','item','year'],how='left')
model_smape = smape(cross_val['sales'],cross_val['salespred'])
model_train_smape_dict['Year Model Completed'] = model_smape
print('SMAPE on train set of the Year Model Completed is {:,.4f}'.format(model_smape))
print('SMAPE on train set of the Year Model was {:,.4f}'.format(model_train_smape_dict['Year Model']))

In [None]:
year_basic_model_aux = train.groupby(['store','item'])[['sales']].mean().reset_index()
year_basic_model_aux['year'] = 2018
year_basic_model_aux.rename(columns={'sales' : 'mean_sales'},inplace=True)
year_basic_model_aux['sales'] = np.nan 
year_basic_model_aux['sales_adj_factor'] = adj_factor
year_basic_model_aux['salespred'] = np.round(year_basic_model_aux['mean_sales'] * year_basic_model_aux['sales_adj_factor']).astype(int)
year_basic_model_aux = year_basic_model_aux[year_basic_model.columns]
display(year_basic_model_aux.head())

In [None]:
year_basic_model = pd.concat([year_basic_model, year_basic_model_aux],axis=0)

sub = pd.merge(test,year_basic_model.loc[year_basic_model['year'] == 2018],on=['store','item'],how='left')
sub = sub[['id','salespred']].copy()
sub.rename(columns = {'salespred' : 'sales'}, inplace=True)
sub.to_csv('submission-year_basic_model.csv',index=False)
model_test_smape_dict['Year Model'] = 40.30474 #given by kaggle

So our adjustment to forecast sales growth makes the year model much worse than the most basic model. What is going on here?

## Year and Month Completed

In [None]:
year_month_model = pd.merge(year_month_model,
                            train.groupby(['store','item','month'])[['sales']].mean().reset_index().rename(columns={'sales' : 'mean_sales'}),
                            on = ['store','item','month'])

year_month_model = pd.merge(year_month_model,adj_factor_model,on='year',how='left')
year_month_model['salespred'] = np.round(year_month_model['mean_sales'] * year_month_model['sales_adj_factor']).astype(int)

display(year_month_model.head())

In [None]:
year_month_model_aux = train.groupby(['store','item','month'])[['sales']].mean().reset_index()
year_month_model_aux['year'] = 2018
year_month_model_aux.rename(columns={'sales' : 'mean_sales'},inplace=True)
year_month_model_aux['sales'] = np.nan 
year_month_model_aux['sales_adj_factor'] = adj_factor
year_month_model_aux['salespred'] = np.round(year_month_model_aux['mean_sales'] * year_month_model_aux['sales_adj_factor']).astype(int)
year_month_model_aux = year_month_model_aux[year_month_model.columns]
display(year_month_model_aux.head())

In [None]:
year_month_model = pd.concat([year_month_model, year_month_model_aux],axis=0)

sub = pd.merge(test,year_month_model.loc[year_month_model['year'] == 2018],on=['store','item','month'],how='left')
sub = sub[['id','salespred']].copy()
sub.rename(columns = {'salespred' : 'sales'}, inplace=True)
sub.to_csv('submission-year_month_model.csv',index=False)
model_test_smape_dict['Year Month Model'] = 17.43060 #given by kaggle

## Year, Month & Day of Week (DOW) Completed

In [None]:
year_month_dow_model = pd.merge(year_month_dow_model,
                            train.groupby(['store','item','month','dayofweek'])[['sales']].mean().reset_index().rename(columns={'sales' : 'mean_sales'}),
                            on = ['store','item','month','dayofweek'])

In [None]:
year_month_dow_model = pd.merge(year_month_dow_model,adj_factor_model,on='year',how='left')
year_month_dow_model['salespred'] = np.round(year_month_dow_model['mean_sales'] * year_month_dow_model['sales_adj_factor']).astype(int)

display(year_month_dow_model.head())

In [None]:
year_month_dow_model_aux = train.groupby(['store','item','month','dayofweek'])[['sales']].mean().reset_index()
year_month_dow_model_aux['year'] = 2018
year_month_dow_model_aux.rename(columns={'sales' : 'mean_sales'},inplace=True)
year_month_dow_model_aux['sales'] = np.nan 
year_month_dow_model_aux['sales_adj_factor'] = adj_factor
year_month_dow_model_aux['salespred'] = np.round(year_month_dow_model_aux['mean_sales'] * year_month_dow_model_aux['sales_adj_factor']).astype(int)
year_month_dow_model_aux = year_month_dow_model_aux[year_month_dow_model.columns]
display(year_month_dow_model_aux.head())

In [None]:
year_month_dow_model = pd.concat([year_month_dow_model, year_month_dow_model_aux],axis=0)

In [None]:
sub = pd.merge(test,year_month_dow_model.loc[year_month_dow_model['year'] == 2018],on=['store','item','month','dayofweek'],how='left')
sub = sub[['id','salespred']].copy()
sub.rename(columns = {'salespred' : 'sales'}, inplace=True)
sub.to_csv('submission-year_month_dow_model.csv',index=False)
model_test_smape_dict['Year Month Dow Model'] = 14.25263 #given by kaggle

## Summary of the evaluated models so far

In [None]:
models_list = ['Most Basic Model',
               'Year Model','Year Month Model','Year Month Dow Model',
               'Month Model','Month Dow Model','Month Day Model']

display(pd.merge(
    pd.DataFrame(index=model_train_smape_dict.keys(),data=list(model_train_smape_dict.values()),columns=['SMAPE train']),
    pd.DataFrame(index=model_test_smape_dict.keys(),data=list(model_test_smape_dict.values()),columns=['SMAPE test']),
    right_index = True,left_index=True,how='outer').loc[models_list].style.format('{:,.1f}'))

We notice, however, that the Year Month Dow Model SMAPE test score is 14.3 (14.25263 to be precise) and not the 13.87573 mentioned by XYZT in the Keeping it Simple kernel.

## Keeping it Simple (Kiss) Model

To get to the 13.87573 score, the devil is actually in the details: instead of computing the average for a store & item on a particular month and dayofweek and adjusting for the yearly sales growth, the monthly and dayofweek factors are computed by averaging over all stores and items. The broader averaging seems to be keeping the signal while filtering more noise.
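
The structure of such a multiplicative model can be sketched on synthetic data as follows. Everything in this block (the `toy` frame, the Poisson generator, the variable names) is hypothetical, and the yearly adjustment factor is left out to keep the sketch short.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 2 stores x 2 items, daily sales over two years.
rng = np.random.default_rng(0)
days = pd.date_range('2016-01-01', '2017-12-31', freq='D')
toy = pd.DataFrame([(s, i, d) for s in (1, 2) for i in (1, 2) for d in days],
                   columns=['store', 'item', 'date'])
toy['month'] = toy['date'].dt.month
toy['dayofweek'] = toy['date'].dt.dayofweek
lam = (20 + 5 * toy['store'] + toy['month'] + toy['dayofweek']).to_numpy()
toy['sales'] = rng.poisson(lam)

overall_mean = toy['sales'].mean()

# Level: one average per (store, item) pair.
level = (toy.groupby(['store', 'item'])['sales'].mean()
            .rename('storeitem_sales').reset_index())
# Seasonal factors: pooled over ALL stores and items, relative to the overall mean.
month_factor = ((toy.groupby('month')['sales'].mean() / overall_mean)
                .rename('monthfactor').reset_index())
dow_factor = ((toy.groupby('dayofweek')['sales'].mean() / overall_mean)
              .rename('dayofweekfactor').reset_index())

pred = (toy.merge(level, on=['store', 'item'], how='left')
           .merge(month_factor, on='month', how='left')
           .merge(dow_factor, on='dayofweek', how='left'))
pred['salespred'] = (pred['storeitem_sales']
                     * pred['monthfactor']
                     * pred['dayofweekfactor'])
print(pred[['store', 'item', 'date', 'sales', 'salespred']].head())
```

Because each factor is estimated from every row of the frame rather than from a single store and item pair, it is averaged over far more samples, which is the noise-filtering effect described above.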

In [None]:
store_item_model = train.groupby(['store','item'])[['sales']].mean().reset_index()

month_model = train.groupby(['month'])[['sales']].mean().reset_index() 
month_model['sales'] = month_model['sales'] / train['sales'].mean()

dayofweek_model = train.groupby(['dayofweek'])[['sales']].mean().reset_index() 
dayofweek_model['sales'] = dayofweek_model['sales'] / train['sales'].mean()

In [None]:
kiss_model = pd.merge(train,store_item_model.rename(columns={'sales':'storeitem_sales'}),on=['store','item'],how='left')
kiss_model = pd.merge(kiss_model,month_model.rename(columns={'sales':'monthfactor'}),on=['month'],how='left')
kiss_model = pd.merge(kiss_model,dayofweek_model.rename(columns={'sales':'dayofweekfactor'}),on=['dayofweek'],how='left')
kiss_model = pd.merge(kiss_model,adj_factor_model,on='year',how='left')

kiss_model['salespred'] = kiss_model['storeitem_sales']*kiss_model['sales_adj_factor']*kiss_model['monthfactor']*kiss_model['dayofweekfactor']
kiss_model['salespred'] = np.round(kiss_model['salespred']).astype(int)
display(kiss_model.head())

In [None]:
model_smape = smape(kiss_model['sales'],kiss_model['salespred'])
model_train_smape_dict['Kiss Model'] = model_smape
print('SMAPE on train set is {:,.4f}'.format(model_smape))

In [None]:
kiss_model_aux = pd.merge(test,store_item_model,on=['store','item'],how='left')
kiss_model_aux = pd.merge(kiss_model_aux,month_model.rename(columns={'sales':'monthfactor'}),on=['month'],how='left')
kiss_model_aux = pd.merge(kiss_model_aux,dayofweek_model.rename(columns={'sales':'dayofweekfactor'}),on=['dayofweek'],how='left')
kiss_model_aux['sales'] = kiss_model_aux['sales']*adj_factor*kiss_model_aux['monthfactor']*kiss_model_aux['dayofweekfactor']
kiss_model_aux['sales'] = np.round(kiss_model_aux['sales']).astype(int)

In [None]:
sub = kiss_model_aux[['id','sales']].copy()
sub.to_csv('submission-kiss.csv',index=False)
model_test_smape_dict['Kiss Model'] = 13.87596 #given by kaggle 

## Summary of the evaluated models so far

In [None]:
models_list = ['Most Basic Model',
               'Year Model','Year Month Model','Year Month Dow Model',
               'Month Model','Month Dow Model','Month Day Model',
              'Kiss Model']

display(pd.merge(
    pd.DataFrame(index=model_train_smape_dict.keys(),data=list(model_train_smape_dict.values()),columns=['SMAPE train']),
    pd.DataFrame(index=model_test_smape_dict.keys(),data=list(model_test_smape_dict.values()),columns=['SMAPE test']),
    right_index = True,left_index=True,how='outer').loc[models_list].style.format('{:,.1f}'))

Coming up next, more efficient models to break the 13.9 barrier.