# Contents

In this kernel, I will predict the total monthly sales using ARMA models. This is my first contribution to Kaggle kernels, and any advice, questions, and comments are welcome.

# Data Exploration

In [None]:
import numpy as np
import pandas as pd
import datetime
import matplotlib.pyplot as plt
%matplotlib inline

train = pd.read_csv('../input/sales_train.csv')
test = pd.read_csv('../input/test.csv')
submission = pd.read_csv('../input/sample_submission.csv')
items = pd.read_csv('../input/items.csv')
item_cats = pd.read_csv('../input/item_categories.csv')
shops = pd.read_csv('../input/shops.csv')
print("train:", train.shape)
print(train.head())
print("test:", test.shape)
print(test.head())
print("submission:",submission.shape)
print(submission.head())
print("items:",items.shape)
#print(items.head)
print("item_cats:",item_cats.shape)
#print(item_cats)
print("shops:",shops.shape)
#print(shops)


In this competition, we are required to predict total sales for every product and store in the next month. The first step is to analyze and visualize the shop- and item-dependence of the sales.

In [None]:
### Sales for each shop ###

SaleEachShop = train.groupby(["shop_id"],as_index=False)["item_cnt_day"].sum()
SaleEachShop = SaleEachShop.sort_values(by='item_cnt_day', ascending = False)
print(SaleEachShop.head())
print(SaleEachShop.tail())
ax = SaleEachShop.plot(y=["item_cnt_day"], bins=10, alpha=0.5, figsize=(16,4), kind='hist')

The data appear to consist of a few dominant shops and many smaller ones, separated by a threshold of roughly 100,000 total items sold. This suggests that we may need at least two different models to predict future sales. Next, we look at the item dependence.
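
As a minimal sketch of how the shops could be split into the two groups mentioned above, the snippet below applies a cut-off to per-shop totals. The totals are made-up stand-ins for `SaleEachShop`, and `THRESHOLD` is an assumed value read off the histogram, not a tuned parameter.

```python
import pandas as pd

# Hypothetical per-shop totals standing in for SaleEachShop (values are made up)
sale_each_shop = pd.DataFrame({
    "shop_id": [31, 25, 54, 28, 36, 11],
    "item_cnt_day": [310000.0, 240000.0, 185000.0, 98000.0, 330.0, 570.0],
})

THRESHOLD = 100000  # assumed cut-off between dominant and small shops

# Boolean mask: True for shops whose total sales reach the threshold
is_dominant = sale_each_shop["item_cnt_day"] >= THRESHOLD
dominant_shops = sale_each_shop.loc[is_dominant, "shop_id"].tolist()
small_shops = sale_each_shop.loc[~is_dominant, "shop_id"].tolist()

print("dominant:", dominant_shops)
print("small:", small_shops)
```

Each group could then be passed to its own model, if separate models turn out to be worthwhile.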

In [None]:
### Sales for each item ###

SaleEachItem = train.groupby(["item_id"],as_index=False)["item_cnt_day"].sum()
SaleEachItem = SaleEachItem.sort_values(by='item_cnt_day', ascending = False)
print(SaleEachItem["item_cnt_day"].describe())
print(SaleEachItem.head())
MaxSale = SaleEachItem["item_cnt_day"].max()
Q95 = SaleEachItem["item_cnt_day"].quantile(.95)
print("quantile 0.95 = {0}".format(Q95))
SaleEachItem[SaleEachItem["item_cnt_day"]< Q95].plot(y=["item_cnt_day"], bins=30, alpha=0.5, figsize=(16,4), title="~95%",kind='hist')


A quick look at the result shows that the top-selling item sold an order of magnitude more units than the second. In contrast, 95% of the items sold fewer than 653 units each, and the histogram has a long tail. I'm not familiar with techniques for incorporating this information into the model, so I will set this issue aside for the moment.

# Time Series Data Visualization
We would like to look at the time series to gain further insight into the data. Here are the graphs of the monthly sales for each shop.

In [None]:
Monthly_sale_shop = train.groupby(["date_block_num","shop_id"],as_index=False)["item_cnt_day"].sum()
MaxSpan = Monthly_sale_shop["date_block_num"].max()
for i in range(len(shops)):
    Monthly_sale_shop[Monthly_sale_shop["shop_id"] == i].plot(x="date_block_num",y="item_cnt_day",xlim=[0, MaxSpan],
                                                              title="shop {0}".format(i))


These graphs offer useful insights for predicting the sales.
1. Some shops were opened recently, and others have already closed. In particular, identifying the closed shops may be very helpful because their sales in the next month are obviously 0.
2. If we take a glance at the shops that stayed open during the whole span (e.g. shop 59), there seem to be periodic peaks at date_block_num=11 and 23, which correspond to December. Oh, I see. Christmas.

These observations suggest a promising strategy for predicting the sales of each shop: first, examine the general tendency of the monthly sales, and then incorporate the shop features.
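
The first observation (closed and recently opened shops) could be checked programmatically rather than by eye. The sketch below is one possible approach on a toy table shaped like `Monthly_sale_shop`. The data are made up, and `MIN_GAP` is a hypothetical choice for how many silent months count as "closed".

```python
import pandas as pd

# Toy stand-in for Monthly_sale_shop (values are made up)
monthly = pd.DataFrame({
    "date_block_num": [0, 1, 2, 3, 0, 1, 2, 3],
    "shop_id":        [0, 0, 1, 1, 2, 2, 2, 2],
    "item_cnt_day":   [5.0, 7.0, 4.0, 6.0, 9.0, 8.0, 7.0, 6.0],
})

last_block = monthly["date_block_num"].max()

# First and last month in which each shop recorded any sales
span = monthly.groupby("shop_id")["date_block_num"].agg(["min", "max"])

MIN_GAP = 2  # hypothetical: months without sales before a shop is called closed

closed = span[span["max"] <= last_block - MIN_GAP].index.tolist()
opened_late = span[span["min"] > 0].index.tolist()

print("closed shops:", closed)
print("shops opened after month 0:", opened_late)
```

For the shops flagged as closed, the next-month prediction could simply be set to 0.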

The next step we should take is to analyze the monthly sales summed over the shops.

In [None]:
Monthly_sale_total = train.groupby(["date_block_num"],as_index=False)["item_cnt_day"].sum()
Monthly_sale_total.plot(x="date_block_num",y="item_cnt_day")


We can visualize the (partial) autocorrelation and the seasonal effect of the time series data as follows.

In [None]:
import statsmodels.api as sm

### The index should be given in Datetime format to use seasonal_decompose
Monthly_sale_total.index = pd.date_range(start='2013-01', periods=len(Monthly_sale_total), freq='M')

fig = plt.figure(figsize=(12,8))
ax1 = fig.add_subplot(211)
fig = sm.graphics.tsa.plot_acf(Monthly_sale_total["item_cnt_day"], lags=20, ax=ax1)
ax2 = fig.add_subplot(212)
fig = sm.graphics.tsa.plot_pacf(Monthly_sale_total["item_cnt_day"], lags=20, ax=ax2)

res = sm.tsa.seasonal_decompose(Monthly_sale_total["item_cnt_day"], freq=3)
res.plot()


The data show an overall decreasing trend. Moreover, there are clear peaks with period 12, which indicate a seasonal effect. To confirm the nonstationarity, we will plot the rolling mean and standard deviation. Let's define a function to test stationarity for later use. A nice function has already been proposed here: https://www.analyticsvidhya.com/blog/2016/02/time-series-forecasting-codes-python/.

In [None]:
from statsmodels.tsa.stattools import adfuller

def test_stationarity(timeseries):

    #Plot rolling statistics:
    NWindow = 5
    plt.plot(timeseries, color='blue',label='Original')
    plt.plot(timeseries.rolling(window=NWindow).mean(), color='red', label='Rolling Mean')
    plt.plot(timeseries.rolling(window=NWindow).std(), color='black', label = 'Rolling Std')
    plt.legend(loc='best')
    plt.show(block=False)
    
    #Dickey-Fuller test:
    print("Results of Dickey-Fuller Test:")
    dftest = adfuller(timeseries, autolag='AIC')
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
    for key,value in dftest[4].items():
        dfoutput['Critical Value (%s)'%key] = value
    print(dfoutput)

test_stationarity(Monthly_sale_total["item_cnt_day"])


Both the rolling mean and the rolling standard deviation have peaks near date_block_num=11 and 23. Besides, the Dickey-Fuller test cannot reject the null hypothesis of nonstationarity. Sounds reasonable.
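
For reference, here is a small sketch of how the Dickey-Fuller output can be turned into a decision. It assumes the tuple layout returned by `adfuller` with `autolag='AIC'` (test statistic, p-value, lags used, number of observations, critical values, information criterion). The helper name `reject_unit_root` and the numbers in the two tuples are hypothetical and are not results from this data.

```python
def reject_unit_root(adf_result, level="5%"):
    """Return True when the ADF statistic lies below the critical value,
    i.e. when the null hypothesis of a unit root can be rejected."""
    test_stat = adf_result[0]
    critical = adf_result[4][level]
    return test_stat < critical

# Hypothetical result tuples (illustrative numbers only)
nonstationary_like = (-1.2, 0.67, 0, 33,
                      {"1%": -3.65, "5%": -2.95, "10%": -2.62}, 400.0)
stationary_like = (-3.9, 0.002, 0, 21,
                   {"1%": -3.79, "5%": -3.01, "10%": -2.65}, 250.0)

print(reject_unit_root(nonstationary_like))  # statistic above the critical value
print(reject_unit_root(stationary_like))     # statistic below the critical value
```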

# Data processing
We have found that the original time series is not stationary. The next step is to remove the trend and the seasonal effect so that statistical models can be applied to the time series. We will try to remove both by subtracting from the data a copy shifted by 12 months, which is the period found above.
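
As a quick check of why this works, the sketch below applies the same subtraction to a synthetic series made of a linear decline plus a spike every twelfth month. The series is an assumption chosen for illustration, not the competition data. The difference with lag 12 cancels the repeating spike and turns the linear trend into a constant.

```python
import numpy as np
import pandas as pd

months = np.arange(36)
trend = 100.0 - 1.5 * months                        # linear decline
seasonal = np.where(months % 12 == 11, 40.0, 0.0)   # spike every December
series = pd.Series(trend + seasonal)

# Subtract the series shifted by one seasonal period (12 months)
diff12 = (series - series.shift(periods=12)).dropna()

print(diff12.unique())                              # a single constant value remains
print(np.allclose(diff12, series.diff(12).dropna()))  # same as Series.diff(12)
```

The first 12 months are lost because they have no counterpart one year earlier, which is why `dropna()` is needed.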


In [None]:
ModData = Monthly_sale_total - Monthly_sale_total.shift(periods=12)
ModData = ModData.dropna()
test_stationarity(ModData["item_cnt_day"])


In [None]:
import statsmodels.api as sm

### The index should be given in Datetime format to use seasonal_decompose
ModData.index = pd.date_range(start='2014-01', periods=len(ModData), freq='M')

fig = plt.figure(figsize=(12,8))
ax1 = fig.add_subplot(211)
fig = sm.graphics.tsa.plot_acf(ModData["item_cnt_day"], lags=20, ax=ax1)
ax2 = fig.add_subplot(212)
fig = sm.graphics.tsa.plot_pacf(ModData["item_cnt_day"], lags=20, ax=ax2)

res = sm.tsa.seasonal_decompose(ModData["item_cnt_day"], freq=3)
res.plot()


# Time Series Analysis: ARMA Model
Now both the overall trend and the seasonal effect appear to be removed, and the p-value is indeed less than 0.05. We are ready to use ARMA models, which are among the most popular and widely used statistical methods for time series prediction. First, we will determine the best hyperparameters of the ARMA model using a function offered by the statsmodels library.

In [None]:
res = sm.tsa.arma_order_select_ic(ModData["item_cnt_day"], ic='aic', trend='nc')
res

Akaike's Information Criterion is minimized at order (p, q) = (3, 1). Let's build the model and examine the fit.
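
Before fitting, here is a minimal sketch of what the order selection does conceptually: compute AIC = 2k - 2 ln(L) for each candidate order and keep the smallest. The log-likelihoods are hypothetical numbers, not values fitted on this data, and counting p + q + 1 parameters is a simplifying assumption. In statsmodels, the selected order is available as `res.aic_min_order`.

```python
def aic(log_likelihood, n_params):
    """Akaike's Information Criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical log-likelihoods for a few (p, q) orders; not fitted on this data
log_likelihoods = {
    (1, 0): -262.0,
    (2, 1): -259.5,
    (3, 1): -255.0,
    (4, 2): -254.6,
}

# Assumption: p AR terms, q MA terms, and the noise variance are the free parameters
aic_table = {order: aic(ll, sum(order) + 1) for order, ll in log_likelihoods.items()}

# The preferred order is the one with the smallest AIC
best_order = min(aic_table, key=aic_table.get)
print(aic_table)
print("best order:", best_order)
```

Note how (4, 2) has a slightly higher likelihood than (3, 1) but loses because of the penalty on extra parameters.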

In [None]:
ARMA_3_1 = sm.tsa.ARMA(ModData["item_cnt_day"], (3,1)).fit()
print(ARMA_3_1.summary())

In [None]:
### Autocorrelation function of the residual ###

resid = ARMA_3_1.resid
fig = plt.figure(figsize=(12,8))
ax1 = fig.add_subplot(211)
fig = sm.graphics.tsa.plot_acf(resid.values.squeeze(), lags=20, ax=ax1)
ax2 = fig.add_subplot(212)
fig = sm.graphics.tsa.plot_pacf(resid, lags=20, ax=ax2)

The fit converged well, and the autocorrelation of the residuals seems to be weak. Finally, we will predict the future total sales.

In [None]:
### prediction with ARMA model

pred = ARMA_3_1.predict('2015-1-31', '2016-10-31')
 
fig = plt.figure(figsize=(12,8))
ax = fig.add_subplot(211)
ax.set_xlim([datetime.date(2013, 1, 31), datetime.date(2016, 10, 31)])
Shifted = Monthly_sale_total["item_cnt_day"].copy()
Shifted.index = pd.date_range('2014-01-01', periods = len(Shifted), freq = 'M')
ax.plot(Monthly_sale_total["item_cnt_day"],label="orig")
ax.plot(pred+Shifted, "r",label="pred")
plt.legend()

Good. The reconstructed forecast (the ARMA prediction plus the sales from 12 months earlier) reproduces both the prominent peak with period 12 and the overall decreasing trend.
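
The reconstruction step used in the plot (adding the predicted difference to the sales from 12 months earlier) can be sketched on toy data as follows. An integer month index is used instead of dates to keep the example small. The series and the predicted differences are hypothetical.

```python
import pandas as pd

# Toy "observed" monthly totals for months 0..23 (values are made up)
original = pd.Series([float(v) for v in range(100, 124)], index=range(24))

# Move each value 12 months forward: the value of month t-12 now sits at month t
shifted = original.copy()
shifted.index = original.index + 12

# Hypothetical predicted seasonal differences for the next 12 months
pred_diff = pd.Series(-5.0, index=range(24, 36))

# Undo the differencing: forecast(t) = predicted difference(t) + observed(t - 12)
forecast = (pred_diff + shifted).dropna()
print(forecast)
```

With this approach the forecast horizon is limited to 12 months past the last observation, since each forecast needs an observed value from one year earlier.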

# Perspective
It is important not to forget our final goal. Although we focused only on the total amount of sales, we need to think seriously about the shop- and item-dependence in order to correctly predict total sales for every product and store. Moreover, we have completely ignored prices and categories, which may carry quite useful information. Still, I believe that sharing this kernel may benefit both of us because the basic procedure above can serve as a template for more sophisticated and detailed analysis. I would appreciate it if this contribution helps you perform a nice analysis. Thank you for reading my article.