We now consider making lag features. Consider Lag Features for the model.

In [1]:
import pandas as pd
import numpy as np

In [2]:
CA_train = pd.read_pickle('CA_train.pkl')

Based on an understanding of this question, it is important to note that we are dealing with a Time Series Prediction problem, and therefore, should be applying time series validation methods.


Since our aim is to predict the next 28 days from now, it is important to note that clearly, the 'window' in which we do our validation should be of size 28 as well. (validation set has to be of 28 days)

What this means is that because we have 28 days for our validation set, and since we have a validation set, we cannot use our validation set otherwise there would be data leakage, 
the last day of our test set would be 28+28 days = 56 days from the last day of the training set. 

**Essentially, it means that no matter what, our lags cannot be of less than 28 days.**

We should work with a smaller data set to create the lagged features first.

In [3]:
window_days = 28

In [4]:
CA_train_processing = CA_train[['store_id','item_id','date','value']].copy()

In [5]:
for i in range(window_days, window_days+14):
    CA_train_processing['lag'+str(i)+'day'] = CA_train_processing.groupby(['store_id', 
                                                                           'item_id'])['value'].shift(i)

Having created a lag from 1 days to 14 days for prediction, we can now start working on creating lag variables for other terms

From previous EDA, we found out that there is a seasonality that happens every 7 days.

We also noticed that there is a cyclic trend that happens every month due to SNAP purchases

As such, we will do a differencing for evvery 7 days and for every 4 weeks to account of seasonality/cyclic nature of these

In [6]:
CA_train_processing['differencing_weekly'] = CA_train_processing.groupby(['store_id','item_id'])['value'].shift(window_days).diff(7)
CA_train_processing['differencing_monthly'] = CA_train_processing.groupby(['store_id','item_id'])['value'].shift(window_days).diff(28)

We now create rolling window features.

Consider now, How big should our window be?

In such a situation, we should consider a rolling windows of size 7(for 1 week), size 14(2 weeks), 28(for a month), 90(Quarterly). Furthermore, for these rolling windows, we find the mean and standard deviation of the window.

Why Standard Deviation? - The standard deviation is helpful because it helps to identify how big the variance is throughout the window - it could possibly identify certain trends (there might be interaction with events etc). 

In [7]:
windows = [7,14,28,90]
for i in windows:
    CA_train_processing['rolling_mean_'+ str(i)]= CA_train_processing.groupby(['store_id',
                                                                               'item_id'])['value'].shift(window_days).rolling(i).mean()
    CA_train_processing['rolling_sd_'+str(i)] = CA_train_processing.groupby(['store_id',
                                                                             'item_id'])['value'].shift(window_days).rolling(i).std()

In [8]:
CA_train_processing

Unnamed: 0,store_id,item_id,date,value,lag28day,lag29day,lag30day,lag31day,lag32day,lag33day,...,differencing_weekly,differencing_monthly,rolling_mean_7,rolling_sd_7,rolling_mean_14,rolling_sd_14,rolling_mean_28,rolling_sd_28,rolling_mean_90,rolling_sd_90
0,CA_1,HOBBIES_1_008,2011-01-29,12.0,,,,,,,...,,,,,,,,,,
1,CA_1,HOBBIES_1_008,2011-01-30,15.0,,,,,,,...,,,,,,,,,,
2,CA_1,HOBBIES_1_008,2011-01-31,0.0,,,,,,,...,,,,,,,,,,
3,CA_1,HOBBIES_1_008,2011-02-01,0.0,,,,,,,...,,,,,,,,,,
4,CA_1,HOBBIES_1_008,2011-02-02,0.0,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
46845084,CA_4,FOODS_3_825,2016-05-22,,0.0,2.0,0.0,1.0,1.0,3.0,...,-1.0,-2.0,1.142857,1.214986,0.928571,1.141139,1.321429,1.564740,1.822222,2.920589
46845085,CA_4,FOODS_3_826,2016-05-21,,0.0,7.0,1.0,5.0,4.0,1.0,...,-2.0,-4.0,0.857143,1.214986,0.928571,1.141139,1.178571,1.492042,1.822222,2.920589
46845086,CA_4,FOODS_3_826,2016-05-22,,4.0,0.0,7.0,1.0,5.0,4.0,...,1.0,3.0,1.000000,1.527525,1.214286,1.368805,1.285714,1.583647,1.855556,2.928209
46845087,CA_4,FOODS_3_827,2016-05-21,,4.0,4.0,4.0,0.0,2.0,0.0,...,4.0,4.0,1.571429,1.812654,1.428571,1.554858,1.428571,1.642685,1.855556,2.928209


In [9]:
CA_train_processing[CA_train_processing.columns.difference(['store_id','item_id','date'])]

Unnamed: 0,differencing_monthly,differencing_weekly,lag28day,lag29day,lag30day,lag31day,lag32day,lag33day,lag34day,lag35day,...,lag41day,rolling_mean_14,rolling_mean_28,rolling_mean_7,rolling_mean_90,rolling_sd_14,rolling_sd_28,rolling_sd_7,rolling_sd_90,value
0,,,,,,,,,,,...,,,,,,,,,,12.0
1,,,,,,,,,,,...,,,,,,,,,,15.0
2,,,,,,,,,,,...,,,,,,,,,,0.0
3,,,,,,,,,,,...,,,,,,,,,,0.0
4,,,,,,,,,,,...,,,,,,,,,,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
46845084,-2.0,-1.0,0.0,2.0,0.0,1.0,1.0,3.0,1.0,0.0,...,0.0,0.928571,1.321429,1.142857,1.822222,1.141139,1.564740,1.214986,2.920589,
46845085,-4.0,-2.0,0.0,7.0,1.0,5.0,4.0,1.0,2.0,2.0,...,2.0,0.928571,1.178571,0.857143,1.822222,1.141139,1.492042,1.214986,2.920589,
46845086,3.0,1.0,4.0,0.0,7.0,1.0,5.0,4.0,1.0,2.0,...,4.0,1.214286,1.285714,1.000000,1.855556,1.368805,1.583647,1.527525,2.928209,
46845087,4.0,4.0,4.0,4.0,4.0,0.0,2.0,0.0,0.0,0.0,...,0.0,1.428571,1.428571,1.571429,1.855556,1.554858,1.642685,1.812654,2.928209,


In [10]:
#using difference function returns the part of CA_train processing that contains only the numerical features
#the point of this is to change the selected columns to float16 to reduce the amount of mem usage
CA_train_processing[CA_train_processing.columns.difference(['store_id','item_id','date'])] = CA_train_processing[
    CA_train_processing.columns.difference(['store_id','item_id','date'])].astype(np.float16)

In [11]:
CA_train_full = CA_train.merge(CA_train_processing, how='inner', on=['store_id','item_id','date','value'])

In [12]:
x_train = CA_train_full[CA_train_full['date'] <= '2016-04-24']
y_train = x_train['value']
test = CA_train[CA_train['date'] >= '2016-04-25']

In [13]:
del CA_train, CA_train_processing

In [15]:
names = {'CA_1':'x_train_CA_1.pkl', 'CA_2':'x_train_CA_2.pkl', 
         'CA_3':'x_train_CA_3.pkl', 'CA_4': 'x_train_CA_4.pkl'}

def create_pickles(training_data, order):
    for i in training_data.store_id.unique():
        training_data[training_data['store_id']==i].to_pickle(names[i])
    return

create_pickles(x_train, names)