# Walmart Sales Prediction (M5 Competition)

Prophet, Linear Regression, Gradient Boosting, Random Forest, Multi

For large retailers such as Walmart, accurate forecasts of future sales are crucial for keeping enough stock to meet consumer demand. This study focuses on the demand for a subcategory of hobby products in one Walmart store in California, USA. Specifically, we forecast the daily demand for 149 hobby products over 28 consecutive days.

Along with previous demand for the products, we have access to data about the sales prices of the products and special events on the calendar for the sampled time series data. We hope to give insight into the accuracy of models in product sales forecasting by comparing the performance of traditional time series forecasting and machine learning methods.


In [62]:
# Import relevant packages

import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import os
from prophet import Prophet
import time
import warnings
from itertools import cycle
from sklearn.svm import SVR
import statsmodels.api as sm
from pmdarima import auto_arima
from datetime import datetime, timedelta
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor

from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.tsa.holtwinters import SimpleExpSmoothing, ExponentialSmoothing

In [20]:
%matplotlib inline
plt.style.use('bmh')
sns.set_style("darkgrid")
plt.rc('xtick', labelsize=15)
plt.rc('ytick', labelsize=15)
warnings.filterwarnings("ignore")
pd.set_option('max_colwidth', 100)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
color_pal = plt.rcParams['axes.prop_cycle'].by_key()['color']
color_cycle = cycle(plt.rcParams['axes.prop_cycle'].by_key()['color'])

In [50]:
# Read in data
os.chdir("../data") 

merged = pd.read_csv('merged.csv')
merged.date = pd.to_datetime(merged.date)
merged.head()

Unnamed: 0,id,day,demand,date,wm_yr_wk_x,weekday,wday,month,year,event_name_1,event_type_1,event_name_2,event_type_2,snap_CA,item_id,sell_price
0,HOBBIES_2_002_CA_3_validation,d_1,0,2011-01-29,11101,Saturday,1,1,2011,,,,,0,HOBBIES_2_002,1.97
1,HOBBIES_2_002_CA_3_validation,d_2,0,2011-01-30,11101,Sunday,2,1,2011,,,,,0,HOBBIES_2_002,1.97
2,HOBBIES_2_002_CA_3_validation,d_3,0,2011-01-31,11101,Monday,3,1,2011,,,,,0,HOBBIES_2_002,1.97
3,HOBBIES_2_002_CA_3_validation,d_4,1,2011-02-01,11101,Tuesday,4,2,2011,,,,,1,HOBBIES_2_002,1.97
4,HOBBIES_2_002_CA_3_validation,d_5,0,2011-02-02,11101,Wednesday,5,2,2011,,,,,1,HOBBIES_2_002,1.97


## 3.1 Forecast with Prophet

Prophet is a forecasting package released by Facebook that predicts time series data with an additive model. In this study we give it only the historical demand of each item, with no price or event features. It fits non-linear trends together with several seasonal components and can also include holiday effects. It is designed to stay robust when the series contains missing values and outliers.
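For reference, the additive decomposition behind Prophet is usually written as follows, where $g(t)$ is the trend, $s(t)$ the periodic seasonality, $h(t)$ the holiday effects, and $\varepsilon_t$ the residual noise. The notation follows the Prophet documentation, not this notebook.

```latex
y(t) = g(t) + s(t) + h(t) + \varepsilon_t
```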

In [53]:
# Extract unique id names

id_list = list(merged.id.unique())
len(id_list)

149

In [59]:
# Fit one Prophet model per item and forecast the next 28 days

forecasts = []

for item in id_list:
    # Prophet expects the date column to be named 'ds' and the target 'y'
    mini = merged.loc[merged.id == item, ["date", "demand"]].rename(columns={'date': 'ds', 'demand': 'y'})

    m = Prophet(daily_seasonality=True, yearly_seasonality=True)
    m.fit(mini)
    future = m.make_future_dataframe(periods=28, include_history=False)
    forecast = m.predict(future)[['ds', 'yhat']]
    forecast["id"] = item
    forecasts.append(forecast)

# Concatenate once at the end instead of growing a DataFrame inside the loop
result = pd.concat(forecasts)
result.head()

Unnamed: 0,ds,yhat,id
0,2016-06-20,0.036227,HOBBIES_2_002_CA_3_validation
1,2016-06-21,0.04266,HOBBIES_2_002_CA_3_validation
2,2016-06-22,0.063581,HOBBIES_2_002_CA_3_validation
3,2016-06-23,0.070406,HOBBIES_2_002_CA_3_validation
4,2016-06-24,0.0986,HOBBIES_2_002_CA_3_validation


In [60]:
# Reframe it to the submission format

wide_format= result.pivot(index="id", columns="ds", values="yhat").reset_index()
wide_format = wide_format.rename_axis(None, axis=1)
wide_format.head()

#wide_format.to_csv("sub_prophet.csv")

Unnamed: 0,id,2016-06-20 00:00:00,2016-06-21 00:00:00,2016-06-22 00:00:00,2016-06-23 00:00:00,2016-06-24 00:00:00,2016-06-25 00:00:00,2016-06-26 00:00:00,2016-06-27 00:00:00,2016-06-28 00:00:00,2016-06-29 00:00:00,2016-06-30 00:00:00,2016-07-01 00:00:00,2016-07-02 00:00:00,2016-07-03 00:00:00,2016-07-04 00:00:00,2016-07-05 00:00:00,2016-07-06 00:00:00,2016-07-07 00:00:00,2016-07-08 00:00:00,2016-07-09 00:00:00,2016-07-10 00:00:00,2016-07-11 00:00:00,2016-07-12 00:00:00,2016-07-13 00:00:00,2016-07-14 00:00:00,2016-07-15 00:00:00,2016-07-16 00:00:00,2016-07-17 00:00:00
0,HOBBIES_2_001_CA_3_validation,0.159759,0.137535,0.129654,0.136289,0.143165,0.106754,0.154174,0.134505,0.115026,0.110943,0.122331,0.134789,0.104636,0.158816,0.146218,0.133921,0.136926,0.155104,0.173857,0.14932,0.208272,0.199459,0.189845,0.194345,0.212778,0.230529,0.203747,0.259261
1,HOBBIES_2_002_CA_3_validation,0.036227,0.04266,0.063581,0.070406,0.0986,0.078797,0.099433,0.08078,0.086297,0.105361,0.10937,0.133792,0.109293,0.124367,0.099369,0.097864,0.109359,0.105401,0.121627,0.08888,0.095841,0.063058,0.054284,0.059211,0.049556,0.061113,0.024859,0.029586
2,HOBBIES_2_003_CA_3_validation,0.606129,0.545477,0.705462,0.674986,0.59571,0.818531,0.910145,0.63631,0.564901,0.713713,0.671966,0.581647,0.793971,0.875938,0.593587,0.515048,0.658326,0.612813,0.52062,0.733029,0.817057,0.538705,0.466006,0.616821,0.580346,0.498454,0.72215,0.818145
3,HOBBIES_2_004_CA_3_validation,-0.047937,-0.084226,0.056323,0.040529,-0.013401,0.047323,0.053136,-0.075786,-0.103722,0.044781,0.036417,-0.010707,0.056115,0.067259,-0.057139,-0.081371,0.07002,0.063755,0.017988,0.085486,0.096701,-0.028141,-0.053237,0.09697,0.089307,0.042025,0.10799,0.117743
4,HOBBIES_2_005_CA_3_validation,0.102767,0.137736,0.141471,0.119157,0.133265,0.126196,0.129707,0.092165,0.126663,0.129731,0.10656,0.119638,0.111393,0.113613,0.074709,0.107822,0.109538,0.085107,0.097079,0.087943,0.089549,0.050364,0.083582,0.085833,0.062401,0.075862,0.068715,0.072811
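The table above contains negative forecasts (for example HOBBIES_2_004), which cannot occur for unit sales. The following is a minimal sketch on toy data of one way to post-process such output, assuming negative values are clipped to zero (as the LSTM section does) and the date columns are renamed to `F1`, `F2`, ... labels. The names `long_df`, `wide_df`, `item_A` and `item_B` are hypothetical.

```python
import pandas as pd

# Toy long-format forecasts: one row per (id, date); the values are made up
long_df = pd.DataFrame({
    "id": ["item_A", "item_A", "item_B", "item_B"],
    "ds": pd.to_datetime(["2016-06-20", "2016-06-21", "2016-06-20", "2016-06-21"]),
    "yhat": [0.2, -0.1, 1.4, 0.9],
})

# Clip negative forecasts: unit sales cannot be below zero
long_df["yhat"] = long_df["yhat"].clip(lower=0)

# One row per item, one column per forecast day
wide_df = long_df.pivot(index="id", columns="ds", values="yhat")

# Rename the date columns to F1, F2, ... in chronological order
wide_df.columns = [f"F{i}" for i in range(1, wide_df.shape[1] + 1)]
wide_df = wide_df.reset_index()
```

Clipping is a simple choice rather than the only one. Rounding or a count-based model would be alternatives.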


## 3.2 Forecast with Multiple Machine Learning models & feature engineering

Adding informative variables increases a model's capacity and can reduce underfitting. Feature engineering, the process of selecting relevant features and transforming them, is therefore critical to building a robust predictive model.

For this competition, two groups of time-series features are added to enrich the dataset. First, the data visualization suggests that the demand for each item is correlated with its demand seven days earlier, so we add a seven-day lag of demand together with a seven-day rolling mean of that lag. Second, we assume that similar sales patterns recur on both an annual and a weekly basis, so we use a groupby to add per-item descriptive statistics for each month and each day of the week. The resulting features are ['lag\_7', 'rmean\_7\_7', 'demand\_month\_mean', 'demand\_month\_max', 'demand\_month\_max\_to\_min\_diff', 'demand\_dayofweek\_mean', 'demand\_dayofweek\_median', 'demand\_dayofweek\_max'].
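To make the two lag-based features concrete, here is a small sketch on a toy series, assuming a single hypothetical item (`item_A`) whose demand on day k is simply k. It shows which past days each feature draws on and from which row it is first defined.

```python
import pandas as pd

# Toy demand series for one hypothetical item: day k sells k units
toy = pd.DataFrame({"id": ["item_A"] * 20, "demand": range(20)})

# lag_7: the demand observed seven days earlier, computed per item
toy["lag_7"] = toy.groupby("id")["demand"].shift(7)

# rmean_7_7: mean of the last seven lag_7 values, i.e. demand from t-13 to t-7
toy["rmean_7_7"] = toy.groupby("id")["lag_7"].transform(lambda s: s.rolling(7).mean())

# First row where each feature is defined
first_lag = toy["lag_7"].first_valid_index()
first_rmean = toy["rmean_7_7"].first_valid_index()
print(first_lag, first_rmean)
```

Because `rmean_7_7` needs 13 earlier days, any window passed to the feature functions at prediction time has to contain at least that much history.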

Categorical features must be encoded before they can be used in regression models. We experimented with several approaches: one-hot encoding, label encoding, mean encoding, and group-by encoding. In the final version, the categorical features (event names and event types) are processed with a label encoder, as it outperformed the others in our experiments.
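As an illustration of what label encoding does to an event column, the sketch below reproduces the idea in plain Python on made-up values. It assumes the same convention as scikit-learn's `LabelEncoder`, which numbers the categories in sorted order. The variable names are hypothetical, and one-hot encoding is shown only for comparison.

```python
# Toy event column; None marks days without an event
events = ["SuperBowl", None, "Easter", None, "SuperBowl"]

# Replace missing values with an explicit category first
filled = [e if e is not None else "No" for e in events]

# Label encoding: sort the distinct categories and map each one to its rank
classes = sorted(set(filled))
to_code = {name: code for code, name in enumerate(classes)}
encoded = [to_code[name] for name in filled]

# One-hot encoding of the same column: one 0/1 column per category
one_hot = [[int(name == c) for c in classes] for name in filled]
```

Label encoding keeps a single column per feature but imposes an arbitrary order on the categories. Tree-based models usually tolerate this better than linear ones.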

In [58]:
def lags_windows(df):
    lags = [7]
    lag_cols = ["lag_{}".format(lag) for lag in lags ]
    for lag, lag_col in zip(lags, lag_cols):
        df[lag_col] = df[["id","demand"]].groupby("id")["demand"].shift(lag)
        
    wins = [7]
    for win in wins :
        for lag,lag_col in zip(lags, lag_cols):
            df["rmean_{}_{}".format(lag,win)] = df[["id", lag_col]].groupby("id")[lag_col].transform(lambda x : x.rolling(win).mean())  
    return df

def per_timeframe_stats(df, col):
    # Per-item statistics of `col` within each calendar month
    by_month = df.groupby(['id', 'month'])[col]
    df[col + '_month_mean'] = by_month.transform('mean').astype("float32")
    df[col + '_month_max'] = by_month.transform('max').astype("float32")
    df[col + '_month_min'] = by_month.transform('min').astype("float32")
    df[col + '_month_max_to_min_diff'] = (df[col + '_month_max'] - df[col + '_month_min']).astype("float32")

    # Per-item statistics of `col` for each day of the week
    by_wday = df.groupby(['id', 'wday'])[col]
    df[col + '_dayofweek_mean'] = by_wday.transform('mean').astype("float32")
    df[col + '_dayofweek_median'] = by_wday.transform('median').astype("float32")
    df[col + '_dayofweek_max'] = by_wday.transform('max').astype("float32")

    return df

def feat_eng(df):
    df = lags_windows(df)
    df = per_timeframe_stats(df,'demand')
    
    return df

In [81]:
def preprocess_item(dataframe, item_id):

    # Work on a copy so the per-item edits do not touch the full dataframe
    this_item = dataframe[dataframe.id == item_id].copy()
    # 'd_123' -> 123
    this_item['day'] = this_item['day'].apply(lambda x: x.split('_')[1]).astype(int)
    this_item = this_item.drop(['weekday', 'item_id'], axis=1)
    # Days without an event become the explicit category 'No'
    this_item = this_item.fillna('No')

    for c in ['event_name_1', 'event_type_1', 'event_name_2', 'event_type_2']:
        this_item[c] = LabelEncoder().fit_transform(this_item[c])

    # Train up to 2016-05-22. The test frame starts 16 days before the first
    # forecast day (2016-05-23) so that the lag features can be computed.
    this_train = this_item[this_item['date'] <= '2016-05-22'].copy()
    this_test = this_item[(this_item['date'] > '2016-05-06') & (this_item['date'] <= '2016-06-19')].copy()
    this_train = feat_eng(this_train)
    this_train = this_train.dropna()

    return this_train, this_test

def predict_sales(test, model, train_cols, pred_name, tst, day):
        
    tst_X = tst.loc[tst.date == day , train_cols].copy()
    tst_X = tst_X.fillna(0) 
    test.loc[test.date == day, pred_name] = model.predict(tst_X)

    return test

def prepare_submission(dataframe, pred_name):
    result = dataframe[['id','date', pred_name]]
    result= result.pivot(index="id", columns="date", values=pred_name).reset_index()
    result = result.rename_axis(None, axis=1)
    #result.to_csv(f'sub_{pred_name}.csv')

    return result

In [78]:
sub_final = pd.DataFrame()

for item_id in id_list:
    
    # Split the data
    this_train, this_test = preprocess_item(merged, item_id)
    
    predictions = pd.DataFrame()
    predictions['date'] = this_test['date']
    
    # Choose features to use
    useless_cols = ['id','item_id','demand','date','weekday','demand_month_min', 'day']
    linreg_train_cols = ['sell_price','year','month','wday','lag_7','rmean_7_7']
    
    train_cols = this_train.columns[~this_train.columns.isin(useless_cols)]
    X_train = this_train[train_cols].copy()
    y_train = this_train["demand"]

    # Fit in models
    m_linreg = LinearRegression().fit(X_train[linreg_train_cols], y_train)
    m_rf = RandomForestRegressor(n_estimators=50, max_depth=5, random_state=26, n_jobs=-1).fit(X_train, y_train)
    m_gb = GradientBoostingRegressor().fit(X_train, y_train)
    m_mlp = MLPRegressor(hidden_layer_sizes=80, activation='relu', solver='adam', alpha=0.0001).fit(X_train, y_train)

    # Make predictions
    fday = datetime(2016, 5, 23) 
    max_lags = 15
    for tdelta in range(0, 28):
        day = fday + timedelta(days=tdelta)
        tst = this_test[(this_test.date >= day - timedelta(days=max_lags)) & (this_test.date <= day)].copy()
        tst = feat_eng(tst)
        tst = tst.fillna(0)
        
        this_test = predict_sales(this_test, m_linreg, linreg_train_cols, 'preds_LinearReg', tst, day)
        this_test = predict_sales(this_test, m_rf, train_cols, 'preds_RandomForest', tst, day)
        this_test = predict_sales(this_test, m_gb, train_cols, 'preds_GradientBoosting', tst, day)
        this_test = predict_sales(this_test, m_mlp, train_cols, 'preds_MultiLayerPerceptron', tst, day)
        

    test_final = this_test.loc[this_test.date >= fday]
    sub_final = pd.concat([sub_final, test_final])

In [83]:
prepare_submission(sub_final, 'preds_MultiLayerPerceptron')

Unnamed: 0,id,2016-05-23 00:00:00,2016-05-24 00:00:00,2016-05-25 00:00:00,2016-05-26 00:00:00,2016-05-27 00:00:00,2016-05-28 00:00:00,2016-05-29 00:00:00,2016-05-30 00:00:00,2016-05-31 00:00:00,2016-06-01 00:00:00,2016-06-02 00:00:00,2016-06-03 00:00:00,2016-06-04 00:00:00,2016-06-05 00:00:00,2016-06-06 00:00:00,2016-06-07 00:00:00,2016-06-08 00:00:00,2016-06-09 00:00:00,2016-06-10 00:00:00,2016-06-11 00:00:00,2016-06-12 00:00:00,2016-06-13 00:00:00,2016-06-14 00:00:00,2016-06-15 00:00:00,2016-06-16 00:00:00,2016-06-17 00:00:00,2016-06-18 00:00:00,2016-06-19 00:00:00
0,HOBBIES_2_001_CA_3_validation,7.196698,7.19415,7.192375,7.1906,7.188826,7.218223,7.152712,6.786114,7.151219,7.128001,6.973691,7.124452,7.153849,7.43553,7.21198,7.650944,7.208431,7.206656,7.204882,7.247433,7.245659,7.243884,7.242109,7.240335,7.23856,7.236786,7.266183,6.80793
1,HOBBIES_2_002_CA_3_validation,-1.774186,-1.849356,-1.759418,-1.819013,-1.806848,-1.725971,-1.812504,-1.763707,-1.807296,-1.725832,-1.602239,-1.720624,-1.742657,-1.740053,-1.737449,-1.798819,-1.73224,-1.729636,-1.727032,-1.792345,-1.789741,-1.787137,-1.784532,-1.781928,-1.779324,-1.77672,-1.798753,-1.784211
2,HOBBIES_2_003_CA_3_validation,1.084144,1.602157,1.348943,1.224725,1.434819,0.806337,1.452331,1.067802,1.168894,2.043831,1.511376,1.918004,2.272065,2.563367,2.285617,2.615453,2.310706,2.32325,2.335794,2.230365,2.242909,2.255454,2.267998,2.280542,2.293087,2.305631,2.235787,1.910743
3,HOBBIES_2_004_CA_3_validation,0.150972,0.152136,0.153299,0.154462,0.155625,0.364569,0.127386,-0.760469,-0.227681,-0.187248,0.056461,-0.184922,-0.192823,-0.19166,-0.190497,0.410683,-0.18817,-0.187007,-0.185844,-0.23207,-0.230907,-0.229744,-0.228581,-0.227417,-0.226254,-0.225091,-0.232992,0.107825
4,HOBBIES_2_005_CA_3_validation,1.276323,1.409725,1.543126,1.676528,1.809929,1.017445,1.150847,0.596537,1.41765,1.573253,1.719765,1.840056,1.047572,1.180974,1.314375,2.249014,1.581178,1.71458,1.847981,1.002432,1.135834,1.269235,1.402637,1.536038,1.66944,1.802841,1.010357,0.893032
5,HOBBIES_2_006_CA_3_validation,-0.217615,-0.353565,-0.168716,-0.223613,-0.11811,-0.276375,-0.250218,-0.119711,-0.260594,-0.256928,-0.24854,-0.202906,-0.361171,-0.335014,-0.308856,-0.405572,-0.256542,-0.230385,-0.204228,-0.424365,-0.398208,-0.372051,-0.345894,-0.319737,-0.29358,-0.267423,-0.425688,-0.263559
6,HOBBIES_2_007_CA_3_validation,1.542233,1.776834,1.861055,1.948244,1.911513,1.828727,1.645164,1.838444,1.593052,1.850486,1.846527,2.071848,1.808884,1.650008,1.610309,1.942707,1.831671,1.942352,2.053033,2.027588,1.674561,1.613031,1.723712,1.834393,1.945073,2.055754,2.06274,1.987085
7,HOBBIES_2_008_CA_3_validation,-2.474754,-2.394107,-2.139345,-2.538268,-2.457621,-2.503133,-2.138125,-3.744522,-2.534369,-2.78653,-2.635676,-2.94508,-2.862459,-2.781812,-2.925154,-1.53529,-2.955571,-2.970779,-2.985988,-3.01607,-3.031279,-3.046488,-3.061696,-3.076905,-3.092113,-3.107322,-3.024701,-3.210713
8,HOBBIES_2_009_CA_3_validation,-5.104229,-4.830117,-4.556005,-4.281893,-4.00778,-5.664094,-5.389981,-6.769191,-4.841757,-4.582615,-4.250081,-4.03439,-5.690703,-5.416591,-5.142479,-2.94436,-4.594254,-4.320142,-4.04603,-5.602616,-5.328504,-5.054392,-4.780279,-4.506167,-4.232055,-3.957942,-5.614256,-5.810479
9,HOBBIES_2_010_CA_3_validation,-1.222477,-1.272601,-1.322725,-1.372849,-1.422974,-1.129152,-1.179276,-0.721715,-1.279524,-1.464961,-1.670157,-1.565209,-1.271388,-1.321512,-1.371636,-2.001137,-1.471884,-1.522009,-1.572133,-1.322106,-1.372231,-1.422355,-1.472479,-1.522603,-1.572727,-1.622851,-1.32903,-1.444245
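The notebook imports `mean_squared_error` but does not show the scoring step. As a hedged sketch of how the candidate models could be compared, the block below computes the root mean squared error (RMSE) on made-up numbers. RMSE is an assumption here, not a metric confirmed by this notebook, and `rmse`, `forecast_a` and `forecast_b` are hypothetical names.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error over all items and days."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Made-up numbers: 2 items x 3 days of true demand and two candidate forecasts
actual = [[0, 1, 0], [2, 0, 1]]
forecast_a = [[0.5, 0.5, 0.5], [1.0, 1.0, 1.0]]
forecast_b = [[-1.0, 2.0, 0.0], [4.0, 0.0, 1.0]]

# Score each candidate and keep the one with the lowest error
scores = {"forecast_a": rmse(actual, forecast_a), "forecast_b": rmse(actual, forecast_b)}
best = min(scores, key=scores.get)
```

The same function could be applied to each `preds_*` column once the true demand for the forecast window is available.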


## 3.3 Forecast with LSTM 

In this project, we built a vanilla LSTM with the GPU acceleration provided by Kaggle. Although LSTMs are widely regarded as strong time-series forecasters, our model was outperformed by the other machine learning models.

In [2]:
import os
import re
import time
import warnings
import numpy as np
from tqdm import tqdm
import pandas as pd
from numpy import array
from numpy import newaxis
from sklearn import preprocessing, metrics
from keras.layers import Dense, Activation, Dropout, LSTM
from keras.layers import RepeatVector, TimeDistributed
from keras.models import Sequential

#os.environ['TF_CPP_MIN_LOG_LEVEL']='3'
warnings.filterwarnings('ignore')

In [10]:
cal = pd.read_csv("../data/calendar_afcs2020.csv")
sellp = pd.read_csv("../data/sell_prices_afcs2020.csv")
stv = pd.read_csv("../data/sales_train_validation_afcs2020.csv")
#ss = pd.read_csv("../data/sample_submission_afcs2020.csv")

In [34]:
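def build_model(n_features, n_out_seq_length, num_y):
    # Start the timer before building so the printed time is meaningful
    start = time.time()

    model = Sequential()

    # Encoder: compress the 28-step input window into a single vector
    model.add(LSTM(128, activation='relu', input_shape=(28, n_features), return_sequences=False))
    # Repeat that vector once per output step
    model.add(RepeatVector(n_out_seq_length))

    # Decoder: produce one hidden state per output step
    model.add(LSTM(32, activation='relu', return_sequences=True))

    # Map each decoder state to num_y outputs
    model.add(TimeDistributed(Dense(num_y)))
    model.compile(optimizer='adam', loss='mse')

    print('Build time (s):', time.time() - start)

    return model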

def Normalize(this_list):
    this_list = np.array(this_list)

    low, high = np.percentile(this_list, [0, 100])
    delta = high - low
    if delta != 0:
        for i in range(0, len(this_list)):
            this_list[i] = (this_list[i]-low)/delta
            
    return  this_list, low, high

def FNoramlize(this_list, low, high):
    delta = high - low
    if delta != 0:
        for i in range(0, len(this_list)):
            this_list[i] = this_list[i]*delta + low
    return this_list

def transform(data):
    nan_features = ['event_name_1', 'event_type_1', 'event_name_2', 'event_type_2']
    for feature in nan_features:
        # Assign the result back; in-place fillna on a selected column may not update `data`
        data[feature] = data[feature].fillna('unknown')
    cat = ['event_name_1','event_type_1','event_name_2','event_type_2','snap_CA']
    for feature in cat:
        encoder = preprocessing.LabelEncoder()
        data[feature] = encoder.fit_transform(data[feature])
    
    return data
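`Normalize` and `FNoramlize` above implement min-max scaling and its inverse with explicit loops, since the 0th and 100th percentiles are the minimum and maximum. The following is a vectorised sketch of the same idea, assuming one global minimum and maximum shared by all features as in the notebook. The names `minmax_scale` and `minmax_unscale` are hypothetical.

```python
import numpy as np

def minmax_scale(values):
    """Scale an array to [0, 1] using its global minimum and maximum."""
    values = np.asarray(values, dtype=np.float32)
    low, high = values.min(), values.max()
    delta = high - low
    # A constant array is returned unchanged to avoid dividing by zero
    scaled = (values - low) / delta if delta != 0 else values.copy()
    return scaled, low, high

def minmax_unscale(scaled, low, high):
    """Invert minmax_scale given the stored minimum and maximum."""
    return np.asarray(scaled) * (high - low) + low

demand = np.array([[0.0, 2.0], [4.0, 8.0]])
scaled, low, high = minmax_scale(demand)
restored = minmax_unscale(scaled, low, high)
```

Sharing one range across demand, prices and event codes is simple, but it lets the feature with the largest values dominate the scale. Scaling each feature separately is a common alternative.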

In [11]:
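price_fea = cal[['wm_yr_wk','date']].merge(sellp, on = ['wm_yr_wk'], how = 'left')
price_fea['id'] = price_fea['item_id']+'_'+price_fea['store_id']+'_validation'
# Note: 'date' is still a string here, so the pivoted columns are sorted
# lexicographically (see the output below), not chronologically
df = price_fea.pivot(index='id', columns='date', values='sell_price')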

price_df = stv.merge(df,on=['id'],how= 'left').iloc[:,-145:]
price_df.index = stv.id
price_df.head()

id,9/10/2011,9/10/2012,9/10/2013,9/10/2014,9/10/2015,9/11/2011,9/11/2012,9/11/2013,9/11/2014,9/11/2015,...,9/8/2011,9/8/2012,9/8/2013,9/8/2014,9/8/2015,9/9/2011,9/9/2012,9/9/2013,9/9/2014,9/9/2015
HOBBIES_2_001_CA_3_validation,5.47,5.97,5.97,5.47,5.47,5.47,5.97,5.97,5.47,5.47,...,5.47,5.97,5.97,5.47,5.47,5.47,5.97,5.97,5.47,5.47
HOBBIES_2_002_CA_3_validation,1.97,1.97,1.97,1.97,1.47,1.97,1.97,1.97,1.97,1.47,...,1.97,1.97,1.97,1.97,1.47,1.97,1.97,1.97,1.97,1.47
HOBBIES_2_003_CA_3_validation,1.97,1.97,1.97,1.97,1.97,1.97,1.97,1.97,1.97,1.97,...,1.97,1.97,1.97,1.97,1.97,1.97,1.97,1.97,1.97,1.97
HOBBIES_2_004_CA_3_validation,2.47,2.47,2.47,2.47,2.47,2.47,2.47,2.47,2.47,2.47,...,2.47,2.47,2.47,2.47,2.47,2.47,2.47,2.47,2.47,2.47
HOBBIES_2_005_CA_3_validation,,,4.47,4.47,4.47,,,4.47,4.47,4.47,...,,,4.47,4.47,4.47,,,4.47,4.47,4.47


In [13]:
days_val = range(1, 1914)

time_series_columns = [f'd_{i}' for i in days_val]
time_series_data = stv[time_series_columns]
time_series_data.head()

Unnamed: 0,d_1,d_2,d_3,d_4,d_5,d_6,d_7,d_8,d_9,d_10,...,d_1904,d_1905,d_1906,d_1907,d_1908,d_1909,d_1910,d_1911,d_1912,d_1913
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,1,0,1,0,0,0,0,...,0,0,1,0,0,0,1,1,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,1,3
3,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0


In [14]:
days_cal = range(1, 1970)
x_label = ['event_name_1','event_type_1','event_name_2','event_type_2','snap_CA','wday_bool']

cal.loc[cal['wday'] < 3, 'wday_bool'] = 1
cal.loc[cal['wday'] >= 3, 'wday_bool'] = 0
cal.head()

Unnamed: 0,date,wm_yr_wk,weekday,wday,month,year,d,event_name_1,event_type_1,event_name_2,event_type_2,snap_CA,wday_bool
0,1/29/2011,11101,Saturday,1,1,2011,d_1,,,,,0,1.0
1,1/30/2011,11101,Sunday,2,1,2011,d_2,,,,,0,1.0
2,1/31/2011,11101,Monday,3,1,2011,d_3,,,,,0,0.0
3,2/1/2011,11101,Tuesday,4,2,2011,d_4,,,,,1,0.0
4,2/2/2011,11101,Wednesday,5,2,2011,d_5,,,,,1,0.0


In [15]:
time_series_columns = [f'd_{i}' for i in days_cal]
transfer_cal = pd.DataFrame(cal[x_label].values.T, index=x_label, columns= time_series_columns)
transfer_cal = transfer_cal.fillna(0)

cal['date'] = pd.to_datetime(cal['date'])
cal = cal[cal['date']>= '2016-1-27']

In [18]:
cal= transform(cal)

transfer_cal = pd.DataFrame(cal[x_label].values.T,index=x_label)
transfer_cal.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,135,136,137,138,139,140,141,142,143,144
event_name_1,16.0,16.0,16.0,16.0,16.0,16.0,16.0,16.0,16.0,16.0,...,16.0,16.0,16.0,16.0,16.0,16.0,16.0,16.0,16.0,6.0
event_type_1,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,...,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,3.0
event_name_2,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0
event_type_2,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0
snap_CA,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [21]:
data_len, step_len = 100, 28

X_data = []

for i in tqdm(range(time_series_data.shape[0])):
    X_data.append([list(t) for t in zip(time_series_data.iloc[i][-data_len:],
                                   transfer_cal.loc['event_type_1'][-(data_len+step_len):-(step_len)],
                                   transfer_cal.loc['event_type_2'][-(data_len+step_len):-(step_len)],
                                   transfer_cal.loc['snap_CA'][-(data_len+step_len):-(step_len)],
                                   transfer_cal.loc['wday_bool'][-(data_len+step_len):-(step_len)],
                                   price_df.iloc[i][-(data_len+step_len):-(step_len)])]) 
X_data = np.asarray(X_data, dtype=np.float32)

100%|██████████| 149/149 [00:00<00:00, 1733.27it/s]


In [22]:
where_are_NaNs = np.isnan(X_data)
X_data[where_are_NaNs] = 0

In [27]:
X_data.shape

(149, 100, 6)

In [35]:
n_steps = 28
n_features = 6  
n_out_seq_length =28
num_y = 1
n_items = 149  # number of training series (one per item)

train_n, train_low, train_high = Normalize(X_data[:, -(n_steps*2):,:])

X_train = train_n[:,-28*2:-28,:]
X_train = X_train.reshape((X_train.shape[0], X_train.shape[1], n_features))

y = train_n[:,-28:,0] 
y = y.reshape((y.shape[0], y.shape[1], 1))
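The split above turns a 56-day window into a 28-day input and a 28-day target. The sketch below reproduces the shapes on random data, assuming a hypothetical array `window` that stands in for the normalised data. It also shows which slice would serve as input for forecasting the 28 days after the window.

```python
import numpy as np

n_items, n_steps, n_features = 3, 28, 6

# Hypothetical stand-in for the normalised array: the last 56 days of each item
rng = np.random.default_rng(0)
window = rng.random((n_items, 2 * n_steps, n_features)).astype(np.float32)

# Encoder input: days 1-28 of the window, all features
X = window[:, :n_steps, :]

# Decoder target: demand only (feature 0) for days 29-56
y = window[:, n_steps:, 0:1]

# Input for an out-of-sample forecast: the most recent 28 days, all features
X_next = window[:, -n_steps:, :]
```

With one window per item this yields only 149 training samples. Sliding the window over the earlier history would give the network more examples.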

In [None]:
model = build_model(n_features, n_out_seq_length, num_y)
history = model.fit(X_train, y, epochs=500, batch_size=5)

In [None]:
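# Use the most recent 28 normalised days as input so that the forecast covers
# the 28 days after the training window (X_train would only reproduce y)
x_input = array(train_n[:, -n_steps:, :])
x_input = x_input.reshape((n_items, n_steps, n_features))

y_predict = model.predict(x_input, verbose=0)

# Append the predicted demand to the input demand channel: (n_items, 56, 1)
history = x_input[:, :, 0]
predicted = y_predict.astype(np.float32).reshape(n_items, n_out_seq_length)
x_input = np.concatenate((history, predicted), axis=1).reshape((n_items, n_steps + n_out_seq_length, 1))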


x_input = FNoramlize(x_input,train_low,train_high)
x_input = np.rint(x_input)

forecast = pd.DataFrame(x_input.reshape(x_input.shape[0],x_input.shape[1])).iloc[:,-28:]
forecast.columns = [f'F{i}' for i in range(1, forecast.shape[1] + 1)]
forecast[forecast < 0] =0

validation_ids = stv['id'].values
evaluation_ids = [i.replace('validation', 'evaluation') for i in validation_ids]
ids = np.concatenate([validation_ids, evaluation_ids])

predictions = pd.DataFrame(ids, columns=['id'])
forecast = pd.concat([forecast]*2).reset_index(drop=True)
predictions = pd.concat([predictions, forecast], axis=1)
#predictions.to_csv('submission.csv', index=False)  #Generate the csv file.