# Store Sales - Multiple Features Forecasting

https://www.kaggle.com/competitions/store-sales-time-series-forecasting

***In this project, forecasting also takes the 'onpromotion' feature into account.<br>
Forecasts are produced per product family.***

The evaluation metric for this competition is Root Mean Squared Logarithmic Error.

The RMSLE is calculated as:
$\sqrt{ \frac{1}{n} \sum_{i=1}^n \left(\log (1 + \hat{y}_i) - \log (1 + y_i)\right)^2}$
where:

$n$ is the total number of instances,<br>
$\hat{y}_i$ is the predicted value of the target for instance $i$,<br>
$y_i$ is the actual value of the target for instance $i$, and<br>
log is the natural logarithm.
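
As a quick sanity check of the formula, here is a minimal sketch that evaluates RMSLE on three made-up values. The helper name `rmsle` and the toy numbers are assumptions for illustration only; `np.log1p(x)` computes $\log(1 + x)$ directly.

```python
import numpy as np

def rmsle(y_pred, y_true):
    """Root Mean Squared Logarithmic Error for non-negative inputs."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    # log1p(x) computes log(1 + x), so zero sales need no special handling
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Toy values (made up): only the third prediction is wrong
y_true = np.array([0.0, 9.0, 99.0])
y_pred = np.array([0.0, 9.0, 9.0])
score = rmsle(y_pred, y_true)
print(score)
```

Only the third term contributes: $\log(100) - \log(10) = \log(10)$, so the score is $\log(10)/\sqrt{3}$. Because the metric works on log differences, it penalizes relative error rather than absolute error.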

The training data:<br>
***store_nbr*** identifies the store at which the products are sold.<br>
***family*** identifies the type of product sold.<br>
***sales*** gives the total sales for a product family at a particular store at a given date.
Fractional values are possible since products can be sold in fractional units (1.5 kg of cheese, for instance, as opposed to 1 bag of chips).<br>
***onpromotion*** gives the total number of items in a product family that were being promoted at a store at a given date.

## Blueprint

1. Investigate the dataset (unique values, data types, etc.).
2. Decide how to encode the *family* feature numerically.
3. Decide how to convert *date* into time features.
4. Split the *train* dataset into *ourtrain* and *ourtest* for pre-validation.
5. Apply various ML models (Trend, Periodogram, Cycles, Hybrid).
6. Choose the best model and apply it to our test set.
7. Generate the CSV file for submission.


## Preprocessing

In [12]:
# Import packages
from pathlib import Path
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.linear_model import LinearRegression
import datetime
import math
from statsmodels.tsa.deterministic import CalendarFourier, DeterministicProcess
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

# Ignore Future Warning
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [113]:
# Load dataset
train = pd.read_csv('train.csv', parse_dates=["date"])
test = pd.read_csv('test.csv', parse_dates=["date"])

In [14]:
train.head()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
0,0,2013-01-01,1,AUTOMOTIVE,0.0,0
1,1,2013-01-01,1,BABY CARE,0.0,0
2,2,2013-01-01,1,BEAUTY,0.0,0
3,3,2013-01-01,1,BEVERAGES,0.0,0
4,4,2013-01-01,1,BOOKS,0.0,0


#### - Error function : 
$\sqrt{ \frac{1}{n} \sum_{i=1}^n \left(\log (1 + \hat{y}_i) - \log (1 + y_i)\right)^2}$

$n$ is the total number of instances,<br>
$\hat{y}_i$ is the predicted value of the target for instance $i$,<br>
$y_i$ is the actual value of the target for instance $i$

In [15]:
# Error Function (RMSLE)
def error(y_p, y_t):
    # log1p(x) = log(1 + x), applied element-wise to predictions and actuals
    pred_log = np.log1p(np.asarray(y_p, dtype=float))
    act_log = np.log1p(np.asarray(y_t, dtype=float))
    mean_sq_error = np.mean((pred_log - act_log) ** 2)
    return round(float(np.sqrt(mean_sq_error)), 4)

## 1. Data investigation

#### - train dataset
* shape : 3000888 × 6
* null : none
<br><br>
* *date* : timestamp. 2013-01-01 ~ 2017-08-15
* *store_nbr* : int. 1 ~ 54
* *family* : str. ['AUTOMOTIVE', 'BABY CARE', 'BEAUTY', 'BEVERAGES', 'BOOKS',
       'BREAD/BAKERY', 'CELEBRATION', 'CLEANING', 'DAIRY', 'DELI', 'EGGS',
       'FROZEN FOODS', 'GROCERY I', 'GROCERY II', 'HARDWARE',
       'HOME AND KITCHEN I', 'HOME AND KITCHEN II', 'HOME APPLIANCES',
       'HOME CARE', 'LADIESWEAR', 'LAWN AND GARDEN', 'LINGERIE',
       'LIQUOR,WINE,BEER', 'MAGAZINES', 'MEATS', 'PERSONAL CARE',
       'PET SUPPLIES', 'PLAYERS AND ELECTRONICS', 'POULTRY',
       'PREPARED FOODS', 'PRODUCE', 'SCHOOL AND OFFICE SUPPLIES',
       'SEAFOOD']
* *sales* : float. 0 ~ 124717
* *onpromotion* : int. 0 ~ 741


#### - Correlation

In [16]:
# Check correlation
train.corr()

Unnamed: 0,id,store_nbr,sales,onpromotion
id,1.0,0.000301,0.085784,0.20626
store_nbr,0.000301,1.0,0.041196,0.007286
sales,0.085784,0.041196,1.0,0.427923
onpromotion,0.20626,0.007286,0.427923,1.0


## 2. Generate subsets

In [122]:
whole_train = train.copy()
whole_test = test.copy()

# Generate a subset for each store (rows are filtered on 'store_nbr')
def generate_subsets(dataset, key):
    subset = dataset.loc[dataset["store_nbr"]==key, :]
    return subset

# Save the subsets in dictionary
train_subsets = {}
for key in train["store_nbr"].unique():
    train_subsets.update({key : generate_subsets(train, key)})

train_subsets

{1:               id       date  store_nbr                      family  \
 0              0 2013-01-01          1                  AUTOMOTIVE   
 1              1 2013-01-01          1                   BABY CARE   
 2              2 2013-01-01          1                      BEAUTY   
 3              3 2013-01-01          1                   BEVERAGES   
 4              4 2013-01-01          1                       BOOKS   
 ...          ...        ...        ...                         ...   
 2999134  2999134 2017-08-15          1                     POULTRY   
 2999135  2999135 2017-08-15          1              PREPARED FOODS   
 2999136  2999136 2017-08-15          1                     PRODUCE   
 2999137  2999137 2017-08-15          1  SCHOOL AND OFFICE SUPPLIES   
 2999138  2999138 2017-08-15          1                     SEAFOOD   
 
                sales  onpromotion  
 0           0.000000            0  
 1           0.000000            0  
 2           0.000000           
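
The same dictionary of subsets can also be built with `groupby`, which makes it easy to switch the key from store to product family. This is a minimal sketch on a made-up toy frame; the names `toy`, `store_subsets` and `family_subsets` are assumptions, not part of the notebook.

```python
import pandas as pd

# Toy frame with the same column names as train.csv (values are made up)
toy = pd.DataFrame({
    "store_nbr": [1, 1, 2, 2],
    "family": ["AUTOMOTIVE", "BEAUTY", "AUTOMOTIVE", "BEAUTY"],
    "sales": [0.0, 2.0, 5.0, 1.5],
})

# Iterating a groupby yields (key, sub-frame) pairs, so each dictionary
# is built in a single pass over the data
store_subsets = {key: group for key, group in toy.groupby("store_nbr")}
family_subsets = {key: group for key, group in toy.groupby("family")}

print(sorted(store_subsets), sorted(family_subsets))
```

Grouping by `"family"` instead of `"store_nbr"` would give one subset per product family, which is closer to the per-family forecasting goal stated at the top.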

## 3. One Hot Encode *'family'* features

In [123]:
# Integer encode
label_encoderF = LabelEncoder()
integer_encodedF = label_encoderF.fit_transform(whole_train['family'])

# Binary encode
onehot_encoderF = OneHotEncoder(sparse=False)
integer_encodedF = integer_encodedF.reshape(len(integer_encodedF), 1)
onehot_encodedF = onehot_encoderF.fit_transform(integer_encodedF)
# print(onehot_encodedF.shape, '\n', onehot_encodedF)

In [124]:
# Rename
onehot_encodedF = pd.DataFrame(onehot_encodedF)
onehot_encodedF = onehot_encodedF.rename(columns = {0:'AUTOMOTIVE', 1:'BABY CARE', 2:'BEAUTY', 3:'BEVERAGES', 
        4:'BOOKS', 5:'BREAD/BAKERY', 6:'CELEBRATION', 7:'CLEANING', 8:'DAIRY', 9:'DELI', 10:'EGGS',
        11:'FROZEN FOODS', 12:'GROCERY I', 13:'GROCERY II', 14:'HARDWARE',
        15:'HOME AND KITCHEN I', 16:'HOME AND KITCHEN II', 17:'HOME APPLIANCES',
        18:'HOME CARE', 19:'LADIESWEAR', 20:'LAWN AND GARDEN', 21:'LINGERIE',
        22:'LIQUOR,WINE,BEER', 23:'MAGAZINES', 24:'MEATS', 25:'PERSONAL CARE',
        26:'PET SUPPLIES', 27:'PLAYERS AND ELECTRONICS', 28:'POULTRY',
        29:'PREPARED FOODS', 30:'PRODUCE', 31:'SCHOOL AND OFFICE SUPPLIES',
        32:'SEAFOOD'})

In [125]:
# Add to train dataset
whole_train = pd.concat([whole_train, onehot_encodedF], axis=1) # Combine
whole_train = whole_train.drop(columns=['family'])              # Drop string values
print(whole_train.shape)
whole_train.head()

(3000888, 38)


Unnamed: 0,id,date,store_nbr,sales,onpromotion,AUTOMOTIVE,BABY CARE,BEAUTY,BEVERAGES,BOOKS,...,MAGAZINES,MEATS,PERSONAL CARE,PET SUPPLIES,PLAYERS AND ELECTRONICS,POULTRY,PREPARED FOODS,PRODUCE,SCHOOL AND OFFICE SUPPLIES,SEAFOOD
0,0,2013-01-01,1,0.0,0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,2013-01-01,1,0.0,0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2,2013-01-01,1,0.0,0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,3,2013-01-01,1,0.0,0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,4,2013-01-01,1,0.0,0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
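
For comparison, `pd.get_dummies` produces the same kind of indicator columns and names them after the categories, so the manual rename step is not needed. A minimal sketch on a made-up toy frame (the names `toy`, `family_dummies` and `encoded` are assumptions):

```python
import pandas as pd

# Toy frame (values are made up)
toy = pd.DataFrame({
    "store_nbr": [1, 1, 2],
    "family": ["AUTOMOTIVE", "BABY CARE", "AUTOMOTIVE"],
    "sales": [0.0, 3.0, 1.0],
})

# One indicator column per category, named after the category itself
family_dummies = pd.get_dummies(toy["family"], dtype=float)

# Combine with the remaining columns and drop the original string column
encoded = pd.concat([toy.drop(columns=["family"]), family_dummies], axis=1)
print(encoded.columns.tolist())
```

Columns come out in sorted category order, which matches the order that `LabelEncoder` assigns to its integer codes.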


## 4. One Hot Encode *'store_nbr'* features

In [126]:
# Convert to numpy
integer_encodedS = np.array(whole_train['store_nbr'])

# Binary encode
onehot_encoderS = OneHotEncoder(sparse=False)
integer_encodedS = integer_encodedS.reshape(len(integer_encodedS), 1)
onehot_encodedS = onehot_encoderS.fit_transform(integer_encodedS)
# print(onehot_encodedS.shape, '\n', onehot_encodedS)

In [127]:
# Add to train dataset
onehot_encodedS = pd.DataFrame(onehot_encodedS)
whole_train = pd.concat([whole_train, onehot_encodedS], axis=1) # Combine
whole_train = whole_train.drop(columns=['store_nbr'])           # Drop the original integer column
print(whole_train.shape)
whole_train.head()

(3000888, 91)


Unnamed: 0,id,date,sales,onpromotion,AUTOMOTIVE,BABY CARE,BEAUTY,BEVERAGES,BOOKS,BREAD/BAKERY,...,44,45,46,47,48,49,50,51,52,53
0,0,2013-01-01,0.0,0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,2013-01-01,0.0,0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2,2013-01-01,0.0,0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,3,2013-01-01,0.0,0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,4,2013-01-01,0.0,0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## 5. Convert *'date'* to time features

In [128]:
# Make 'date' as an index
whole_train = pd.DataFrame(whole_train)
whole_train = whole_train.set_index('date')
whole_train = whole_train.to_period('D')
print(whole_train.shape, whole_train.head())

(3000888, 90)             id  sales  onpromotion  AUTOMOTIVE  BABY CARE  BEAUTY  BEVERAGES  \
date                                                                           
2013-01-01   0    0.0            0         1.0        0.0     0.0        0.0   
2013-01-01   1    0.0            0         0.0        1.0     0.0        0.0   
2013-01-01   2    0.0            0         0.0        0.0     1.0        0.0   
2013-01-01   3    0.0            0         0.0        0.0     0.0        1.0   
2013-01-01   4    0.0            0         0.0        0.0     0.0        0.0   

            BOOKS  BREAD/BAKERY  CELEBRATION  ...   44   45   46   47   48  \
date                                          ...                            
2013-01-01    0.0           0.0          0.0  ...  0.0  0.0  0.0  0.0  0.0   
2013-01-01    0.0           0.0          0.0  ...  0.0  0.0  0.0  0.0  0.0   
2013-01-01    0.0           0.0          0.0  ...  0.0  0.0  0.0  0.0  0.0   
2013-01-01    0.0           0.0    

In [129]:
# Split 'date' into detailed features
whole_train["day"] = whole_train.index.dayofweek
whole_train["week"] = whole_train.index.week
whole_train["dayofyear"] = whole_train.index.dayofyear
whole_train["year"] = whole_train.index.year

print(whole_train.shape)
whole_train.head()

(3000888, 94)


date,id,sales,onpromotion,AUTOMOTIVE,BABY CARE,BEAUTY,BEVERAGES,BOOKS,BREAD/BAKERY,CELEBRATION,...,48,49,50,51,52,53,day,week,dayofyear,year
2013-01-01,0,0.0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1,1,1,2013
2013-01-01,1,0.0,0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1,1,1,2013
2013-01-01,2,0.0,0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1,1,1,2013
2013-01-01,3,0.0,0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1,1,1,2013
2013-01-01,4,0.0,0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1,1,1,2013
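
To make the encoding of these calendar features concrete, here is a small sketch on three dates taken from the outputs in this notebook. The frame name `features` is an assumption. Note that `dayofweek` counts from Monday = 0 and `week` follows ISO week numbering, which is why 2017-01-01 (a Sunday) falls in week 52.

```python
import pandas as pd

# Three dates that appear in the train and test outputs
dates = pd.PeriodIndex(["2013-01-01", "2017-01-01", "2017-08-16"], freq="D")

features = pd.DataFrame(
    {
        "day": dates.dayofweek,        # Monday = 0 ... Sunday = 6
        "week": dates.week,            # week of the year
        "dayofyear": dates.dayofyear,  # 1 ... 365/366
        "year": dates.year,
    },
    index=dates,
)
print(features)
```

Because `week` and `dayofyear` are plain integers, a linear model treats week 52 and week 1 as far apart even though they are adjacent in time.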


## 6. Split *train* dataset into *ourtrain* and *ourtest*

In [130]:
ourtrain = whole_train[whole_train.index < '2017-01-01']    # 2013-01-01 ~ 2016-12-31
ourtest = whole_train[whole_train.index > '2016-12-31']     # 2017-01-01 ~ 2017-08-15

print(len(ourtrain)+len(ourtest))

3000888


In [131]:
ourtrain.head()

date,id,sales,onpromotion,AUTOMOTIVE,BABY CARE,BEAUTY,BEVERAGES,BOOKS,BREAD/BAKERY,CELEBRATION,...,48,49,50,51,52,53,day,week,dayofyear,year
2013-01-01,0,0.0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1,1,1,2013
2013-01-01,1,0.0,0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1,1,1,2013
2013-01-01,2,0.0,0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1,1,1,2013
2013-01-01,3,0.0,0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1,1,1,2013
2013-01-01,4,0.0,0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1,1,1,2013


In [132]:
ourtest.head()

date,id,sales,onpromotion,AUTOMOTIVE,BABY CARE,BEAUTY,BEVERAGES,BOOKS,BREAD/BAKERY,CELEBRATION,...,48,49,50,51,52,53,day,week,dayofyear,year
2017-01-01,2596374,0.0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,6,52,1,2017
2017-01-01,2596375,0.0,0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,6,52,1,2017
2017-01-01,2596376,0.0,0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,6,52,1,2017
2017-01-01,2596377,0.0,0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,6,52,1,2017
2017-01-01,2596378,0.0,0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,6,52,1,2017


## 7. Apply various ML models

### 1) Trend

In [133]:
# Fit data to trend model
def trend(ourtrain, ourtest):  

    # Targets
    y_train = ourtrain['sales']
    y_test = ourtest['sales']
    
    # Create features
    trend_dp = DeterministicProcess(
        index=ourtrain.index,  # dates from the training data
        constant=True,         # dummy feature for the bias (y-intercept)
        order=2,               # quadratic time trend (order 1 would be linear)
        drop=True,             # drop terms if necessary to avoid collinearity
    )

    # `in_sample` creates features for the dates given in the `index` argument
    X_train = trend_dp.in_sample()

    # Fit model
    model = LinearRegression(fit_intercept=False)
    model.fit(X_train, y_train)

    # Out of Sample 
    X_oos = trend_dp.out_of_sample(steps=len(ourtest.index))
    y_fore = pd.Series(model.predict(X_oos), index=X_oos.index)
    y_fore = y_fore.reindex(ourtest.index)

    return y_fore, y_test

In [134]:
# Compute errors
trend_fore, trend_test = trend(ourtrain, ourtest)
trend_error = error(trend_fore, trend_test)
print("Total Error from Trend model: ", round(trend_error, 4))

Total Error from Trend model:  3.6426


### 2) Periodogram

In [135]:
train_subsets[1]

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
0,0,2013-01-01,1,AUTOMOTIVE,0.000000,0
1,1,2013-01-01,1,BABY CARE,0.000000,0
2,2,2013-01-01,1,BEAUTY,0.000000,0
3,3,2013-01-01,1,BEVERAGES,0.000000,0
4,4,2013-01-01,1,BOOKS,0.000000,0
...,...,...,...,...,...,...
2999134,2999134,2017-08-15,1,POULTRY,234.892000,0
2999135,2999135,2017-08-15,1,PREPARED FOODS,42.822998,0
2999136,2999136,2017-08-15,1,PRODUCE,2240.230000,7
2999137,2999137,2017-08-15,1,SCHOOL AND OFFICE SUPPLIES,0.000000,0


In [55]:
def seasonal(ourtrain, ourtest, error):

    y_test = ourtest['sales']

    # 12 sin/cos pairs for "A"nnual seasonality
    fourier = CalendarFourier(freq="A", order=12)
    
    season_dp = DeterministicProcess(
        index=ourtrain.index,
        constant=True,               # dummy feature for bias (y-intercept)
        order=1,                     # trend (order 1 means linear)
        seasonal=True,               # weekly seasonality (indicators)
        additional_terms=[fourier],  # annual seasonality (fourier)
        drop=True,                   # drop terms to avoid collinearity
    )

    X = season_dp.in_sample()  # create features for dates in ourtrain.index
    y = ourtrain["sales"]

    season_model = LinearRegression(fit_intercept=False)
    _ = season_model.fit(X, y)

    # # sales values computed from season_model
    # y_season_pred = pd.Series(season_model.predict(X), index=y.index) 

    # Forecasting for 2017-01-01 ~ 2017-08-15
    X_fore = season_dp.out_of_sample(steps=len(y_test))
    y_fore = pd.Series(season_model.predict(X_fore), index=X_fore.index)

    return error(y_fore, y_test)

In [56]:
# pd.to_datetime(date_col_to_force, errors = 'coerce')
seasonal_error = seasonal(ourtrain, ourtest, error)
print("Total Error from Seasonal model: ", round(seasonal_error, 4))

OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 2262-04-12 00:00:00
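
A likely cause of this error is that `steps=len(y_test)` counts rows, not days. `ourtest` holds one row per store and family for every date, so requesting that many daily steps pushes the forecast index past the largest date pandas can represent as a nanosecond timestamp (2262-04-11). The sketch below, on a made-up toy frame, shows the difference between the two counts and one possible way to get a single observation per date. The names `toy_test`, `n_rows`, `n_days` and `daily_sales` are assumptions.

```python
import pandas as pd

# Toy stand-in for `ourtest`: 3 days x 4 series = 12 rows (values are made up)
days = pd.period_range("2017-01-01", periods=3, freq="D")
toy_test = pd.DataFrame({"sales": range(12)}, index=days.repeat(4))

n_rows = len(toy_test)             # what `steps=len(y_test)` would pass
n_days = toy_test.index.nunique()  # the actual forecast horizon in days

# Daily totals give one observation per date, the shape a daily
# trend or seasonal model expects
daily_sales = toy_test.groupby(level=0)["sales"].sum()

print(n_rows, n_days)
print(daily_sales)
```

With one row per date, `out_of_sample(steps=n_days)` would stay within the intended horizon. Whether to aggregate across all series or fit one model per store and family is a modeling choice this sketch does not settle.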

### 3) Cycles

In [65]:
def lagplot(x, y=None, lag=1, standardize=False, ax=None, **kwargs):
    from matplotlib.offsetbox import AnchoredText
    x_ = x.shift(lag)
    if standardize:
        x_ = (x_ - x_.mean()) / x_.std()
    if y is not None:
        y_ = (y - y.mean()) / y.std() if standardize else y
    else:
        y_ = x
    corr = y_.corr(x_)
    if ax is None:
        fig, ax = plt.subplots()
    scatter_kws = dict(
        alpha=0.75,
        s=3,
    )
    line_kws = dict(color='C3', )
    ax = sns.regplot(x=x_,
                     y=y_,
                     scatter_kws=scatter_kws,
                     line_kws=line_kws,
                     lowess=True,
                     ax=ax,
                     **kwargs)
    at = AnchoredText(
        f"{corr:.2f}",
        prop=dict(size="large"),
        frameon=True,
        loc="upper left",
    )
    at.patch.set_boxstyle("square, pad=0.0")
    ax.add_artist(at)
    ax.set(title=f"Lag {lag}", xlabel=x_.name, ylabel=y_.name)
    return ax
    

In [66]:
def plot_lags(x, y=None, lags=6, nrows=1, lagplot_kwargs={}, **kwargs):
    import math
    kwargs.setdefault('nrows', nrows)
    kwargs.setdefault('ncols', math.ceil(lags / nrows))
    kwargs.setdefault('figsize', (kwargs['ncols'] * 2, nrows * 2 + 0.5))
    fig, axs = plt.subplots(sharex=True, sharey=True, squeeze=False, **kwargs)
    for ax, k in zip(fig.get_axes(), range(kwargs['nrows'] * kwargs['ncols'])):
        if k + 1 <= lags:
            ax = lagplot(x, y, lag=k + 1, ax=ax, **lagplot_kwargs)
            ax.set_title(f"Lag {k + 1}", fontdict=dict(fontsize=14))
            ax.set(xlabel="", ylabel="")
        else:
            ax.axis('off')
    plt.setp(axs[-1, :], xlabel=x.name)
    plt.setp(axs[:, 0], ylabel=y.name if y is not None else x.name)
    fig.tight_layout(w_pad=0.1, h_pad=0.1)
    return fig

In [None]:
# Partial Autocorrelation
from statsmodels.graphics.tsaplots import plot_pacf

plot_lags(ourtrain.sales, lags=12, nrows=2)
_ = plot_pacf(ourtrain.sales, lags=12)

In [None]:
def make_lags(ts, lags):
    return pd.concat(
        {
            f'y_lag_{i}': ts.shift(i)
            for i in range(1, lags + 1)
        },
        axis=1)
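
A small sketch of what the lag features look like, using made-up values. It also illustrates a caveat: `shift` works on row position, so on a frame that stacks several series, the previous row is not necessarily the previous day of the same series. The toy names (`ts`, `lagged`, `stacked`) and the per-family variant are assumptions for illustration.

```python
import pandas as pd

def make_lags(ts, lags):
    # Same construction as above: column y_lag_i is the series shifted by i rows
    return pd.concat({f"y_lag_{i}": ts.shift(i) for i in range(1, lags + 1)}, axis=1)

# Single series: each lag column is the series moved down by i rows
ts = pd.Series([10.0, 20.0, 30.0, 40.0], name="sales")
lagged = make_lags(ts, lags=2)
print(lagged)

# Stacked series: two families interleaved, one row per family per day
stacked = pd.DataFrame({
    "family": ["A", "B", "A", "B"],
    "sales": [1.0, 100.0, 2.0, 200.0],
})
stacked["lag_1_naive"] = stacked["sales"].shift(1)                            # mixes families
stacked["lag_1_per_family"] = stacked.groupby("family")["sales"].shift(1)    # stays within family
print(stacked)
```

The leading rows of each lag column are `NaN` because no earlier value exists; the notebook fills these with 0.0 before fitting.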

In [None]:
from sklearn.model_selection import train_test_split

def cycle(subset, error):
    
    X = make_lags(subset.sales, lags=7)
    X = X.fillna(0.0)
    y = subset.sales

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=int(len(X.index)*0.3), shuffle=False)

    ts_model = LinearRegression()  # `fit_intercept=True` since we didn't use DeterministicProcess
    ts_model.fit(X_train, y_train)
    y_fore = pd.Series(ts_model.predict(X_test), index=X_test.index)
    y_fore[y_fore < 0] = 0

    return error(y_fore, y_test)

In [None]:
# pd.to_datetime(date_col_to_force, errors = 'coerce')
# `cycle` takes one frame and makes its own chronological split
cycle_error = cycle(ourtrain, error)
print("Total Error from Cycle model: ", round(cycle_error, 4))

### 4) Hybrid

## 8. Modify test dataset

### - One Hot Encode 'family' features

In [69]:
# Integer encode
label_encoderTEST = LabelEncoder() 
integer_encodedTEST = label_encoderTEST.fit_transform(test['family'])

# Binary encode
onehot_encoderTEST = OneHotEncoder(sparse=False)   
integer_encodedTEST = integer_encodedTEST.reshape(len(integer_encodedTEST), 1)
onehot_encodedTEST = onehot_encoderTEST.fit_transform(integer_encodedTEST)

onehot_encodedTEST = pd.DataFrame(onehot_encodedTEST)
onehot_encodedTEST = onehot_encodedTEST.rename(columns = {0:'AUTOMOTIVE', 1:'BABY CARE', 2:'BEAUTY', 3:'BEVERAGES', 
        4:'BOOKS', 5:'BREAD/BAKERY', 6:'CELEBRATION', 7:'CLEANING', 8:'DAIRY', 9:'DELI', 10:'EGGS',
        11:'FROZEN FOODS', 12:'GROCERY I', 13:'GROCERY II', 14:'HARDWARE',
        15:'HOME AND KITCHEN I', 16:'HOME AND KITCHEN II', 17:'HOME APPLIANCES',
        18:'HOME CARE', 19:'LADIESWEAR', 20:'LAWN AND GARDEN', 21:'LINGERIE',
        22:'LIQUOR,WINE,BEER', 23:'MAGAZINES', 24:'MEATS', 25:'PERSONAL CARE',
        26:'PET SUPPLIES', 27:'PLAYERS AND ELECTRONICS', 28:'POULTRY',
        29:'PREPARED FOODS', 30:'PRODUCE', 31:'SCHOOL AND OFFICE SUPPLIES',
        32:'SEAFOOD'})  # Rename

# Add to test dataset
test = pd.concat([test, onehot_encodedTEST], axis=1) # Combine
test = test.drop(columns=['family'])              # Drop string values
print(test.shape)
test.head()

(28512, 37)


Unnamed: 0,id,date,store_nbr,onpromotion,AUTOMOTIVE,BABY CARE,BEAUTY,BEVERAGES,BOOKS,BREAD/BAKERY,...,MAGAZINES,MEATS,PERSONAL CARE,PET SUPPLIES,PLAYERS AND ELECTRONICS,POULTRY,PREPARED FOODS,PRODUCE,SCHOOL AND OFFICE SUPPLIES,SEAFOOD
0,3000888,2017-08-16,1,0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,3000889,2017-08-16,1,0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3000890,2017-08-16,1,2,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,3000891,2017-08-16,1,20,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,3000892,2017-08-16,1,0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
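
Fitting a fresh encoder on the test set works here because test contains the same 33 families as train. If a category were missing from test, the columns would no longer line up. A minimal sketch of one way to guard against that, on made-up toy frames (all names are assumptions), is to reindex the test indicators to the training columns:

```python
import pandas as pd

# Toy frames (made up): BEAUTY appears in train but not in test
train_toy = pd.DataFrame({"family": ["AUTOMOTIVE", "BEAUTY", "BOOKS"]})
test_toy = pd.DataFrame({"family": ["BOOKS", "AUTOMOTIVE"]})

train_dummies = pd.get_dummies(train_toy["family"], dtype=float)

# Reindex to the training columns so a category missing from test
# still gets an (all-zero) column in the same position
test_dummies = pd.get_dummies(test_toy["family"], dtype=float).reindex(
    columns=train_dummies.columns, fill_value=0.0
)
print(test_dummies)
```

The same idea applies to the `store_nbr` indicators. With scikit-learn, the equivalent is to call `transform` on the encoder that was fitted on the training data rather than fitting a new one.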


### - One Hot Encode 'store_nbr' features

In [70]:
# Convert to numpy
integer_encodedTEST2 = np.array(test['store_nbr']) 

# Binary encode
onehot_encoderTEST2 = OneHotEncoder(sparse=False)
integer_encodedTEST2 = integer_encodedTEST2.reshape(len(integer_encodedTEST2), 1)
onehot_encodedTEST2 = onehot_encoderTEST2.fit_transform(integer_encodedTEST2)

# Add to test dataset
onehot_encodedTEST2 = pd.DataFrame(onehot_encodedTEST2)
test = pd.concat([test, onehot_encodedTEST2], axis=1) # Combine
test = test.drop(columns=['store_nbr'])           # Drop the original integer column
print(test.shape)
test.head()

(28512, 90)


Unnamed: 0,id,date,onpromotion,AUTOMOTIVE,BABY CARE,BEAUTY,BEVERAGES,BOOKS,BREAD/BAKERY,CELEBRATION,...,44,45,46,47,48,49,50,51,52,53
0,3000888,2017-08-16,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,3000889,2017-08-16,0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3000890,2017-08-16,2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,3000891,2017-08-16,20,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,3000892,2017-08-16,0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [72]:
# Make 'date' as an index
test = pd.DataFrame(test)
test = test.set_index('date')
test = test.to_period('D')

# Split 'date' into detailed features
test["day"] = test.index.dayofweek
test["week"] = test.index.week
test["dayofyear"] = test.index.dayofyear
test["year"] = test.index.year

print(test.shape)
test.head()

(28512, 93)


date,id,onpromotion,AUTOMOTIVE,BABY CARE,BEAUTY,BEVERAGES,BOOKS,BREAD/BAKERY,CELEBRATION,CLEANING,...,48,49,50,51,52,53,day,week,dayofyear,year
2017-08-16,3000888,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,2,33,228,2017
2017-08-16,3000889,0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,2,33,228,2017
2017-08-16,3000890,2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,2,33,228,2017
2017-08-16,3000891,20,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,2,33,228,2017
2017-08-16,3000892,0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,2,33,228,2017


## 9. Forecast with the best model

In [87]:
# Modify the trend model
def trend_modified(train, test):  

    # Targets
    y_train = train['sales']
    
    # Create features
    trend_dp = DeterministicProcess(
        index=train.index,   # dates from the training data
        constant=True,       # dummy feature for the bias (y-intercept)
        order=2,             # quadratic time trend (order 1 would be linear)
        drop=True,           # drop terms if necessary to avoid collinearity
    )

    # `in_sample` creates features for the dates given in the `index` argument
    X_train = trend_dp.in_sample()

    # Fit model
    model = LinearRegression(fit_intercept=False)
    model.fit(X_train, y_train)

    # Out of Sample 
    X_oos = trend_dp.out_of_sample(steps=len(test.index))
    y_fore = pd.Series(model.predict(X_oos), index=X_oos.index)

    return y_fore

In [88]:
# train : 2013-01-01 ~ 2017-08-15
# test : 2017-08-16 ~ 2017-08-31

y_fore = trend_modified(train, test)
print(y_fore.shape, type(y_fore))
y_fore

(28512,) <class 'pandas.core.series.Series'>


2017-08-16    497.712021
2017-08-17    497.712083
2017-08-18    497.712145
2017-08-19    497.712206
2017-08-20    497.712268
                 ...    
2095-09-03    499.453420
2095-09-04    499.453480
2095-09-05    499.453541
2095-09-06    499.453602
2095-09-07    499.453662
Freq: D, Length: 28512, dtype: float64
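
The forecast index above runs to 2095 because `steps=len(test.index)` requests 28512 daily steps, one per test row. The test period covers only 16 days (2017-08-16 ~ 2017-08-31), and 28512 = 16 days × 54 stores × 33 families. A minimal sketch of mapping a per-day forecast back onto the test rows, using a made-up toy frame and hypothetical forecast values (all names are assumptions):

```python
import pandas as pd

# Toy stand-in for the date-indexed `test`: 2 days x 3 rows per day
days = pd.period_range("2017-08-16", periods=2, freq="D")
toy_test = pd.DataFrame({"id": range(6)}, index=days.repeat(3))

# Hypothetical per-day forecast: one value per unique date
daily_fore = pd.Series([497.0, 498.0], index=days)

# Look up each row's date in the daily forecast, so that every
# test row receives the value forecast for its own date
row_fore = daily_fore.reindex(toy_test.index)
print(row_fore)
```

This keeps one forecast per test row while the model itself only steps forward by the number of unique dates.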

In [89]:
print(test.shape, type(test))

(28512, 93) <class 'pandas.core.frame.DataFrame'>


## 10. Generate csv file
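
A minimal sketch of this step, assuming the submission file expects two columns, `id` and `sales`, with one row per test `id`. The ids and forecast values below are made up, and the buffer stands in for a file path such as `"submission.csv"`.

```python
import io
import pandas as pd

# Hypothetical ids and forecasts (made up for illustration)
ids = [3000888, 3000889, 3000890]
preds = pd.Series([497.7, -0.3, 12.0])

submission = pd.DataFrame({
    "id": ids,
    # Clip at zero: sales cannot be negative, and log(1 + y) breaks for y <= -1
    "sales": preds.clip(lower=0).to_numpy(),
})

buffer = io.StringIO()  # swap for a file path to write to disk
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing `index=False` keeps the row index out of the file, so it contains only the two expected columns.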