So, we are dealing with sales time series, which of course show some periodicities (I am not going to show them in this notebook; just plot the time series to see what I am talking about!):
* within a year, there is a clear monthly pattern (with a peak in summer), which can also be analyzed week by week,
* within a week, there is generally a peak during the weekend.
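
If you want to check these periodicities numerically rather than by eye, a quick group-by is enough. The sketch below runs on made-up synthetic data (a summer peak plus a weekend bump, both assumptions for illustration), not on the competition data:

```python
import numpy as np
import pandas as pd

# Synthetic daily sales (made-up values): a summer peak plus a weekend bump.
dates = pd.date_range("2016-01-01", "2016-12-31", freq="D")
doy = dates.dayofyear.to_numpy()
dow = dates.dayofweek.to_numpy()

summer = 10 * np.sin(np.pi * (doy - 1) / 365)   # smooth bump peaking mid-year
weekend = np.where(dow >= 5, 5.0, 0.0)           # extra sales on Sat/Sun
sales = pd.Series(50 + summer + weekend, index=dates)

# Average sales per calendar month and per weekday (0 = Monday).
by_month = sales.groupby(sales.index.month).mean()
by_weekday = sales.groupby(sales.index.dayofweek).mean()

print("Month with the highest average sales:", by_month.idxmax())
print("Weekday with the highest average sales:", by_weekday.idxmax())
```

The same two `groupby` calls work on the real `sales` column once the `date` column has been parsed.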

Since ARIMA is pretty slow, I thought I might try something "manually". Let's give it a shot.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the data
df_train = pd.read_csv("../input/train.csv", sep=",")
df_test = pd.read_csv("../input/test.csv", sep=",")
ids = df_test.pop("id")
df_train['date'] = pd.to_datetime(df_train['date'])
df_test['date'] = pd.to_datetime(df_test['date'])
df_train['isTrain'] = True
df_test['isTrain'] = False
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
df = pd.concat([df_train, df_test], ignore_index=True, sort=True)

# Create some additional columns
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
# ISO week number (1..53); Series.dt.week was removed in pandas 2.0
df['week'] = df['date'].dt.isocalendar().week.astype(int)

df['day'] = df['date'].dt.day
df['weekday'] = df['date'].dt.dayofweek
df['dayofyear'] = df['date'].dt.dayofyear
df['weekend'] = (df['weekday'] >= 5).astype(int)

# Some weird columns which help a bit
for div in np.arange(2, 6, 1):
    df['dayb%i' % div] = df['dayofyear'] % div

Now we normalize the data with respect to the store-specific and item-specific scales (subtract the mean and divide by the std of each store/item pair), and visualize the normalized sales afterwards.
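
As a side note, the same per-pair normalization can be written with `groupby(...).transform`, which does not rely on store and item ids being contiguous and starting at 1. This is only a sketch on a toy frame with made-up numbers, not the code used in this notebook:

```python
import pandas as pd

# Toy data: two (store, item) pairs with very different sales scales (made-up).
toy = pd.DataFrame({
    "store": [1, 1, 1, 2, 2, 2],
    "item":  [1, 1, 1, 1, 1, 1],
    "sales": [10.0, 12.0, 14.0, 100.0, 120.0, 140.0],
})

grouped = toy.groupby(["store", "item"])["sales"]
# transform() returns one value per row, aligned with the original index,
# so no positional lookup into a store x item table is needed.
toy["group_mean"] = grouped.transform("mean")
toy["group_std"] = grouped.transform("std")
toy["normed"] = (toy["sales"] - toy["group_mean"]) / toy["group_std"]

# Inverting the transform recovers the original sales.
toy["restored"] = toy["normed"] * toy["group_std"] + toy["group_mean"]
print(toy[["store", "item", "sales", "normed", "restored"]])
```

After this step both pairs live on the same scale, even though their raw sales differ by a factor of ten.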

In [None]:
# Mean and std of sales for every (store, item) pair, as store x item tables
# (the NaN sales of the test rows are ignored)
all_means = df.groupby(['store', 'item'])['sales'].mean().unstack()
all_stds = df.groupby(['store', 'item'])['sales'].std().unstack()

# Positional lookup: assumes store and item ids are contiguous and start at 1
df['normed'] = df['sales'] - all_means.values[df['store'].values - 1, df['item'].values - 1]
df['normed'] = df['normed'] / all_stds.values[df['store'].values - 1, df['item'].values - 1]

df_train = df[df['isTrain']].copy()
df_test = df[~df['isTrain']].copy()

In [None]:
# Note: sns.factorplot was renamed catplot and `size` became `height` (seaborn >= 0.9)
print('Factor plot - year vs normed')
sns.catplot(x="year", y="normed", data=df_train, kind="box", hue="store", height=10)

print('Factor plot - weekday vs normed')
sns.catplot(x="weekday", y="normed", data=df_train, kind="box", height=10)

df_train_ = df_train.copy()
df_train_['jfm'] = df_train_['month'] < 4  # January, February, March only
print('Joint plot - dayofyear vs normed')
sns.jointplot(x="dayofyear", y="normed", data=df_train_.groupby('jfm').get_group(True), height=10)

del df_train_

Sales vary significantly from one year to the next, so we need to account for that when predicting the sales in 2018. This is where the "homemade ARIMA" starts. First we compute the annual mean and std of the normalized sales and try to recognize a pattern.

In [None]:
annual_means = df_train.groupby('year')['normed'].mean()
annual_means.name = "Annual mean normalized sales"
annual_stds = df_train.groupby('year')['normed'].std()
annual_stds.name = "Annual std normalized sales"

fig, axs = plt.subplots(figsize=(10, 10), nrows=2, sharex=True)
annual_means.plot(ax=axs[0], title="Annual mean normalized sales")
annual_stds.plot(ax=axs[1], title="Annual std normalized sales")

We only have 5 data points here, but from these it looks like a linear regression on the raw values would probably over-estimate the annual mean and std in 2018. So we are going to fit the linear regression on the consecutive (year-over-year) differences instead.

In [None]:
annual_mean_diffs = annual_means.diff().dropna()
annual_std_diffs = annual_stds.diff().dropna()

fig, axs = plt.subplots(figsize=(10, 10), nrows=2, sharex=True)
annual_mean_diffs.plot(ax=axs[0], title="Annual mean normalized sales variations")
annual_std_diffs.plot(ax=axs[1], title="Annual std normalized sales variations")

It is difficult to tell whether a linear regression is appropriate, because we now only have 4 data points, but let's keep going and see what we get in the end.
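
To make the idea concrete, here is a minimal sketch with made-up numbers (not the actual annual means) comparing a straight-line fit on the values with a straight-line fit on the differences. `np.polyfit` is used here only to keep the example short; it is an assumption of the sketch, not what the notebook uses:

```python
import numpy as np

# Made-up annual values whose growth slows down over time (illustration only).
years = np.array([2013, 2014, 2015, 2016, 2017])
values = np.array([0.0, 4.0, 7.0, 9.0, 10.0])
t = years - years[0]          # 0..4, keeps the fit well conditioned
t_next = 2018 - years[0]      # the year we want to forecast

# Option A: fit a straight line to the values themselves and extrapolate.
slope, intercept = np.polyfit(t, values, deg=1)
direct_2018 = slope * t_next + intercept

# Option B: fit a line to the year-over-year differences, predict the 2018
# difference, and add it to the last observed value.
diffs = np.diff(values)       # [4, 3, 2, 1], attached to years 2014..2017
d_slope, d_intercept = np.polyfit(t[1:], diffs, deg=1)
diff_2018 = d_slope * t_next + d_intercept
from_diffs_2018 = values[-1] + diff_2018

print(f"direct fit: {direct_2018:.2f}, via differences: {from_diffs_2018:.2f}")
```

When growth is slowing down, as in this toy series, the direct fit overshoots while the fit on differences follows the flattening.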

In [None]:
# Linear model for inter-annual variability
from sklearn.linear_model import LinearRegression

fig, axs = plt.subplots(figsize=(10, 10), nrows=2, sharex=True)
annual_means.plot(ax=axs[0])
annual_stds.plot(ax=axs[0])
axs[0].legend(['Means', 'Stds'])  # attach the legend to the top panel explicitly

annual_mean_diffs.plot(ax=axs[1])
annual_std_diffs.plot(ax=axs[1])

lr_mean_diffs = LinearRegression().fit(annual_mean_diffs.index.values.reshape(-1, 1),
                                       annual_mean_diffs.values)
lr_std_diffs = LinearRegression().fit(annual_std_diffs.index.values.reshape(-1, 1),
                                      annual_std_diffs.values)

mean_diff_2018 = lr_mean_diffs.predict([[2018]])[0]
mean_2018 = annual_means[2017] + mean_diff_2018

std_diff_2018 = lr_std_diffs.predict([[2018]])[0]
std_2018 = annual_stds[2017] + std_diff_2018

plt.sca(axs[0])
plt.plot(2018, mean_2018, 'ko')
plt.plot(2018, std_2018, 'rd')
plt.plot([2017, 2018], [annual_means[2017], mean_2018], 'k--')
plt.plot([2017, 2018], [annual_stds[2017], std_2018], 'r--')

plt.sca(axs[1])
plt.plot(2018, mean_diff_2018, "ko")
plt.plot(2018, std_diff_2018, "ro")
plt.plot([2017, 2018], [annual_mean_diffs[2017], mean_diff_2018], 'k--')
plt.plot([2017, 2018], [annual_std_diffs[2017], std_diff_2018], 'r--')


I usually forget to annotate graphs, so for the record: the top panel shows the annual mean (blue) and std (orange), and the bottom panel shows their year-over-year variations (same color code). The forecasted values are shown in black (mean) and red (std). It kind of makes sense!

In [None]:
annual_means[2018] = mean_2018
annual_stds[2018] = std_2018
df['residuals_year'] = df['normed'] - annual_means.values[df['year'].values - 2013]
df['residuals_year'] = df['residuals_year'] / annual_stds.values[df['year'].values - 2013]
df['residuals_year'].hist(bins=100, figsize=(10, 5))
plt.title("Distribution of residuals (year variations removed)")
pd.concat([annual_means, annual_stds], axis=1)

Now we remove the sales variations within a year, because we want the ML model to focus on the variations at other time scales, which we cannot anticipate (apart from the business day / weekend variations). We do this per week rather than per month, although per month would probably give similar results.
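
Here is a minimal sketch of that per-week standardization on synthetic data (a made-up yearly sine wave plus noise). It uses `Series.map` to look up each row's weekly statistics by label, which is an alternative to positional indexing and an assumption of this sketch; `dt.isocalendar()` needs pandas 1.1 or later:

```python
import numpy as np
import pandas as pd

# Synthetic daily series (made-up values): one seasonal wave per year plus noise.
dates = pd.date_range("2013-01-01", "2017-12-31", freq="D")
rng = np.random.default_rng(0)
toy = pd.DataFrame({"date": dates})
toy["week"] = toy["date"].dt.isocalendar().week.astype(int)
seasonal = np.sin(2 * np.pi * toy["date"].dt.dayofyear / 365.25)
toy["value"] = seasonal + rng.normal(0, 0.1, len(toy))

# Per-week statistics, indexed by ISO week number (1..53).
weekly_mean = toy.groupby("week")["value"].mean()
weekly_std = toy.groupby("week")["value"].std()

# Series.map looks each row's week up by label: a week missing from the
# statistics would give NaN instead of silently picking the wrong entry.
toy["residual"] = (toy["value"] - toy["week"].map(weekly_mean)) / toy["week"].map(weekly_std)

# After the transform, every week has mean ~0 and std ~1.
check = toy.groupby("week")["residual"].agg(["mean", "std"])
print(check.describe())
```

Multiplying by the weekly std and adding the weekly mean back undoes the transform, which is what we need to turn predicted residuals into sales again.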

In [None]:
df_train = df.groupby('isTrain').get_group(True).copy()
weekly_means = df_train.groupby('week')['residuals_year'].mean()
weekly_stds = df_train.groupby('week')['residuals_year'].std()

weekly_means.plot()
weekly_stds.plot()
plt.xlabel('Week')
plt.legend(['Means', 'Stds'])

df['residuals'] = df['residuals_year'] - weekly_means.values[df['week'].values - 1]
df['residuals'] = df['residuals'] / weekly_stds.values[df['week'].values - 1]

df_train = df.groupby('isTrain').get_group(True).copy().drop('isTrain', axis=1)
df_test = df.groupby('isTrain').get_group(False).copy().drop('isTrain', axis=1)

So now we have residuals which:
* have the same scale for all items and stores,
* do not vary from one year to another (in mean and std),
* do not vary from one week to another (in mean and std).

We are going to use a LightGBM model to predict the remaining variations.

In [None]:
predictors = [c for c in df_train.columns
              if c not in ['sales',
                           'normed',
                           'residuals_year',
                           'residuals',
                           'year',
                           'date',
                           'isTrain']]
categories = [c for c in predictors
              if c not in ['month', 'week', 'dayofyear']]
print('Categorical predictors:', categories)
print('Predictors:', predictors)


In [None]:
from sklearn.model_selection import KFold
from lightgbm import Dataset, train

df_train['jfm'] = df_train['month'] < 4
df_train = df_train.groupby('jfm').get_group(True).copy().drop('jfm', axis=1)

X_train = df_train[predictors].values
X_test = df_test[predictors].values
y_train = df_train['residuals'].values  # we predict the residuals (after removing the yearly and weekly trends)

nfolds = 10
folds = KFold(n_splits=nfolds, shuffle=True)

params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': {'mae'},
    'num_leaves': 45,
    'learning_rate': 0.02,
    'feature_fraction': 0.9,
    'max_depth': 6,
    'verbose': 0,
    'num_boost_round': 15000,
    'early_stopping_rounds': 100,
    'nthread': -1}

residuals = []
scores = []
ft_imp_split = []
ft_imp_gain = []

print('\tRunning %i K-folds...' % nfolds)
for ifold, (trn_idx, val_idx) in enumerate(folds.split(X_train, y_train)):

    lgb_train = Dataset(
        data=X_train[trn_idx, :],
        label=y_train[trn_idx],
        feature_name=predictors)
    lgb_val = Dataset(
        data=X_train[val_idx, :],
        label=y_train[val_idx],
        feature_name=predictors)

    model = train(
        params,
        lgb_train,
        num_boost_round=15000,
        early_stopping_rounds=100,
        valid_sets=[lgb_train, lgb_val],
        verbose_eval=20,
        categorical_feature=categories,
    )

    y_pred = model.predict(X_train[val_idx, :], num_iteration=model.best_iteration)
    score = model.best_score['valid_1']['l1']
    print('\toof best score is: {:6.4f} after {:6d} iterations'.format(score, model.best_iteration))
    
    ft_imp_split.append(model.feature_importance(importance_type="split", iteration=model.best_iteration))
    ft_imp_gain.append(model.feature_importance(importance_type="gain", iteration=model.best_iteration))

    residuals.append(model.predict(X_test, num_iteration=model.best_iteration))
    scores.append(score)

residuals = np.average(residuals, axis=0, weights=(1./np.array(scores))**2)

In [None]:
df_test['residuals'] = residuals

df_test['residuals_year'] = df_test['residuals'] * weekly_stds.values[df_test['week'].values - 1]
df_test['residuals_year'] += weekly_means.values[df_test['week'].values - 1]

df_test['normed'] = df_test['residuals_year'] * annual_stds[2018] + annual_means[2018]

df_test['sales'] = df_test['normed'] * all_stds.values[df_test['store'].values - 1, df_test['item'].values - 1]
df_test['sales'] += all_means.values[df_test['store'].values - 1, df_test['item'].values - 1]

Let's visualize the results and see if we managed to capture significant variations.

* Fig. 1: On average, across all items and stores, the predictions seem to be in a sensible range, but the amplitude of the variations seems strongly under-estimated.
* Fig. 2: Focus on item 1; it seems we have a bias here.
* Fig. 3: Focus on item 1 sold in store 1; same conclusions as for Fig. 1 and Fig. 2.

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
df = pd.concat([df_train, df_test], ignore_index=True, sort=True)

# sns.factorplot was renamed catplot and `size` became `height` (seaborn >= 0.9)
sns.catplot(x="weekday", y="residuals", hue="year", data=df, kind="box", height=8)
plt.title("All items & stores")

sns.catplot(x="weekday", y="residuals", hue="year", data=df.groupby('item').get_group(1), kind="box", height=8)
plt.title("Item 1, all stores")

sns.catplot(x="weekday", y="residuals", hue="year", data=df.groupby(['store', 'item']).get_group((1, 1)), kind="box", height=8)
plt.title("Item 1, store 1")

Let's visualize how important our features were (log-scaled).

In [None]:
# Show feature importance
sns.barplot(x=np.log1p(np.mean(ft_imp_split, axis=0)),
            y=predictors)
plt.title('Feature importance (log) by split')
plt.figure()
sns.barplot(x=np.log1p(np.mean(ft_imp_gain, axis=0)),
            y=predictors)
plt.title('Feature importance (log) by gain')

Et voilà! This model is quite simple and doesn't perform too poorly. Before tuning the ML model hyperparameters, we can investigate other ways of dealing with the periodicities, and try to find solutions to the bias / amplitude problem we saw in Figs. 1, 2 and 3.
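
One possible way to put numbers on that bias / amplitude problem is to compare predictions with known values, for instance on the validation folds. The helper below, `bias_and_amplitude`, is a hypothetical name introduced for this sketch, and the data is made up:

```python
import numpy as np

def bias_and_amplitude(y_true, y_pred):
    """Return (mean error, ratio of predicted to observed standard deviation)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    bias = float(np.mean(y_pred - y_true))                     # > 0: over-prediction
    amplitude_ratio = float(np.std(y_pred) / np.std(y_true))   # < 1: too flat
    return bias, amplitude_ratio

# Made-up example: predictions that are shifted upwards and too flat.
observed = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
predicted = 0.5 * observed + 0.3
bias, ratio = bias_and_amplitude(observed, predicted)
print(f"bias: {bias:.2f}, amplitude ratio: {ratio:.2f}")
```

In the cross-validation loop, this could be called with `y_train[val_idx]` and the out-of-fold `y_pred` to track both quantities per fold.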

**What do you think?**

In [None]:
df_test['ID'] = ids.values
sub = df_test[['ID', 'sales']]
print(sub.head())
sub.to_csv('homemade-arima.csv', sep=',', index=False)