## XGBoost Model

In [87]:
import pandas as pd
import numpy as np
import math
from random import sample
import random
import itertools
import matplotlib.pyplot as plt
%matplotlib inline
from dtw import *
import statsmodels.api as sm
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.graphics.tsaplots import plot_pacf
from scipy import stats
from sklearn.metrics import mean_squared_error
from math import sqrt
from scipy.special import boxcox, inv_boxcox
from datetime import datetime
import xgboost as xgb

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

## Loading the Data

In [88]:
train = pd.read_csv("train.csv")
calendar = pd.read_csv("calendar.csv")
prices = pd.read_csv("prices.csv")

## Cleaning the Data for Forecasting

In [89]:
#deleting the 'd' in the column titles 
_cols = list(train.columns)
train.columns = pd.Index(_cols[:6] + [int(c.replace("d_","")) for c in _cols[6:]])
train.columns = train.columns.astype(str)

In [90]:
#pivoting the train df longer 
df_melt = pd.melt(train, id_vars = [i for i in train.columns if i.find("id") != -1],
                          value_vars = [i for i in train.columns if i.isnumeric()], var_name = 'd', value_name = 'sales')
df_melt.head()

Unnamed: 0,id,item_id,subcat_id,category_id,store_id,region_id,d,sales
0,Beauty_1_001_East_1,Beauty_1_001,Beauty_1,Beauty,East_1,East,1,0
1,Beauty_1_002_East_1,Beauty_1_002,Beauty_1,Beauty,East_1,East,1,0
2,Beauty_1_003_East_1,Beauty_1_003,Beauty_1,Beauty,East_1,East,1,0
3,Beauty_1_004_East_1,Beauty_1_004,Beauty_1,Beauty,East_1,East,1,0
4,Beauty_1_005_East_1,Beauty_1_005,Beauty_1,Beauty,East_1,East,1,0


In [91]:
#remove d_ in calendar
calendar['d'] = calendar['d'].str[2:]
calendar['date'] = pd.to_datetime(calendar['date'], format='%Y-%m-%d')
calendar

Unnamed: 0,date,wm_yr_wk,weekday,wday,month,year,d
0,2011-01-29,11101,Saturday,1,1,2011,1
1,2011-01-30,11101,Sunday,2,1,2011,2
2,2011-01-31,11101,Monday,3,1,2011,3
3,2011-02-01,11101,Tuesday,4,2,2011,4
4,2011-02-02,11101,Wednesday,5,2,2011,5
...,...,...,...,...,...,...,...
1964,2016-06-15,11620,Wednesday,5,6,2016,1965
1965,2016-06-16,11620,Thursday,6,6,2016,1966
1966,2016-06-17,11620,Friday,7,6,2016,1967
1967,2016-06-18,11621,Saturday,1,6,2016,1968


In [92]:
#merge pivoted train with calendar
traincal = pd.merge(df_melt, calendar, on = 'd', how = 'left')
traincal['date'] = pd.to_datetime(traincal['date'], format='%Y-%m-%d')
traincal.d = traincal.d.astype('int')
traincal.head()

Unnamed: 0,id,item_id,subcat_id,category_id,store_id,region_id,d,sales,date,wm_yr_wk,weekday,wday,month,year
0,Beauty_1_001_East_1,Beauty_1_001,Beauty_1,Beauty,East_1,East,1,0,2011-01-29,11101,Saturday,1,1,2011
1,Beauty_1_002_East_1,Beauty_1_002,Beauty_1,Beauty,East_1,East,1,0,2011-01-29,11101,Saturday,1,1,2011
2,Beauty_1_003_East_1,Beauty_1_003,Beauty_1,Beauty,East_1,East,1,0,2011-01-29,11101,Saturday,1,1,2011
3,Beauty_1_004_East_1,Beauty_1_004,Beauty_1,Beauty,East_1,East,1,0,2011-01-29,11101,Saturday,1,1,2011
4,Beauty_1_005_East_1,Beauty_1_005,Beauty_1,Beauty,East_1,East,1,0,2011-01-29,11101,Saturday,1,1,2011


In [93]:
# Create calendar features from the datetime index
def create_features(df, label=None):
    """
    Creates time series features from a DataFrame with a datetime index.
    Returns X, or (X, y) when a label column name is given.
    """
    df['date'] = df.index
    df['dayofweek'] = df['date'].dt.dayofweek
    df['quarter'] = df['date'].dt.quarter
    df['month'] = df['date'].dt.month
    df['year'] = df['date'].dt.year
    df['dayofyear'] = df['date'].dt.dayofyear
    df['dayofmonth'] = df['date'].dt.day
    # Series.dt.weekofyear is deprecated; isocalendar().week is its replacement
    df['weekofyear'] = df['date'].dt.isocalendar().week.astype(int)
    
    X = df[['dayofweek','quarter','month','year',
           'dayofyear','dayofmonth','weekofyear']]
    if label is not None:
        y = df[label]
        return X, y
    return X

### XGBoost Modelling

Our forecasting process trains one XGBoost model per cluster of items rather than one per item, because modelling thousands of item-level series individually is computationally expensive. We chose XGBoost for its ability to handle non-linear relationships and feature interactions. To account for seasonality and non-stationarity, we used time-series feature engineering; the features built in this notebook are calendar fields derived from each date, such as day of week, month, quarter and week of year. We then used cross-validation to tune hyperparameters such as the learning rate and the number of trees in the ensemble. Our main error metric was the mean absolute error (MAE), which is scale-dependent and easy to interpret because it is expressed in the same units as sales. We also computed the mean absolute percentage error (MAPE) as a scale-independent measure of model performance, along with the root mean squared error (RMSE).
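The notebook does not show how the cross-validation folds were built. For time series, folds are usually kept in chronological order so that every validation block comes after the data it is validated against. The sketch below illustrates that idea only; `expanding_window_splits` and its parameters are hypothetical names, not part of the original workflow.

```python
import numpy as np

def expanding_window_splits(n_samples, n_folds, val_size):
    """Yield (train_idx, val_idx) pairs in chronological order.

    Each validation block has val_size rows and the training block is
    every row that comes before it, so no future rows leak into training.
    """
    for k in range(n_folds):
        val_end = n_samples - (n_folds - 1 - k) * val_size
        val_start = val_end - val_size
        if val_start <= 0:
            raise ValueError("not enough samples for the requested folds")
        yield np.arange(0, val_start), np.arange(val_start, val_end)

# Example: 100 daily observations, three folds of 21 days each.
splits = list(expanding_window_splits(n_samples=100, n_folds=3, val_size=21))
for train_idx, val_idx in splits:
    print(f"train 0..{train_idx[-1]}, validate {val_idx[0]}..{val_idx[-1]}")
```

Each candidate hyperparameter setting would be fitted on the training indices and scored on the validation indices of every fold, and the scores averaged.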

We used two distance measures to disaggregate our time series data for our predictions: static weekly proportions and weekly proportions using a sliding window. In the static distance measure, the distance from an item to its cluster centroid is the proportion of its sales compared to the total sales made by the cluster. We fix our proportions based on the total sales of the last week in our training dataset. In the sliding window approach, we update our proportion value every time a prediction is made such that our new proportion value is inclusive of our last prediction. Each window represents the week before the day in question, and the window moves after each forecast.
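As a small illustration of the static weekly proportion, the sketch below uses made-up numbers for three hypothetical items in one hypothetical cluster (`store_1`). It computes each item's share of the cluster's sales over one week and uses that fixed share to split a cluster-level forecast.

```python
import pandas as pd

# Toy last-week sales for three items in one cluster (hypothetical numbers).
last_week = pd.DataFrame({
    "id": ["item_a"] * 7 + ["item_b"] * 7 + ["item_c"] * 7,
    "cluster": ["store_1"] * 21,
    "sales": [1] * 7 + [2] * 7 + [7] * 7,
})

# Weekly sales per item, then each item's share of its cluster's weekly sales.
weekly = last_week.groupby(["cluster", "id"], as_index=False)["sales"].sum()
weekly["prop"] = weekly["sales"] / weekly.groupby("cluster")["sales"].transform("sum")

# Split a cluster-level forecast for one day using the fixed proportions.
cluster_forecast = pd.Series({"store_1": 200.0})
weekly["forecast"] = weekly["prop"] * weekly["cluster"].map(cluster_forecast)
print(weekly)
```

Because the proportions within a cluster sum to one, the item forecasts add back up to the cluster forecast.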

In [94]:
### functions for natural clusters (store_id, subcat_id) ###
# fit an XGBoost model on the daily sales of a single cluster
def xgboost_cluster(df, cluster):
    """
    df has the columns: 'cluster' (or 'store_id'), 'date', 'total_sales_in_cluster'
    cluster is the cluster of interest
    """
    if 'store_id' in df:
        temp = df[df['store_id'] == cluster]
    elif 'cluster' in df:
        temp = df[df['cluster'] == cluster]
    else:
        raise KeyError("df needs a 'store_id' or 'cluster' column")
    # numeric_only=True keeps the string cluster column out of the mean
    temp = temp.groupby('date').mean(numeric_only=True)
        
    split_date = '22-Oct-2015' 
    temp_train = temp.loc[temp.index <= split_date].copy()
    temp_test = temp.loc[temp.index > split_date].copy()

    X_train, y_train = create_features(temp_train, label='total_sales_in_cluster')
    X_test, y_test = create_features(temp_test, label='total_sales_in_cluster')

    model = xgb.XGBRegressor(base_score=0.5, booster='gbtree',    
                             n_estimators=3000,
                             objective='reg:squarederror',  # replaces the deprecated 'reg:linear'
                             max_depth=5,
                             learning_rate=0.005,
                             reg_lambda=0.5,
                             reg_alpha=0.5,
                             early_stopping_rounds=50)  # constructor argument in xgboost >= 1.6
    # early stopping monitors the last eval_set entry, here the test split
    model.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_test, y_test)],
              verbose=False)
    
    test_pred = model.predict(X_test) # For obtaining metrics values
    rmse = sqrt(mean_squared_error(y_test, test_pred))
    mae = abs(y_test - test_pred).mean()
    # MAPE is inf (or nan) when any day in y_test has zero sales
    mape = abs((y_test - test_pred)/y_test).mean()
    # NOTE: these are the last 21 rows of the test split, not dates after the data ends
    prediction = model.predict(X_test[-21:])
    return prediction, rmse, mae, mape

# loop over every unique cluster, fit a model, and collect forecasts and metrics
def forecast_cluster(df, csvdest):
    """
    df has the columns: 'cluster', 'date', 'total_sales_in_cluster'
    csvdest is the file name (without extension) for the prediction csv
    """
    prediction = pd.DataFrame()
    clusters = df['cluster'].unique()
    rmses = []
    maes = []
    mapes = []
    for cluster in clusters:
        predict, rmse, mae, mape = xgboost_cluster(df, cluster)
        rmses.append(rmse)
        maes.append(mae)
        mapes.append(mape)
        prediction[cluster] = predict
        # rewritten on every iteration so partial results survive an interruption
        prediction.to_csv(f"{csvdest}.csv")
        print(f"Cluster: {cluster} is predicted")
    return rmses, maes, mapes

# disaggregate using proportions obtained from the last week of the training dataset
def distance1(df, days=range(1920, 1941)):
    """
    df has id, cluster, prop, and the prediction days as the columns
    scales each prediction-day column by 'prop' in place
    """
    for d in days:
        df[d] = df[d] * df['prop']  # column-wise multiplication, left unrounded
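As written, `model.predict(X_test[-21:])` scores the last 21 rows of the held-out split, whose dates fall inside the observed data. To predict days after the data ends, the same calendar features can be built directly from future dates. The sketch below is one way to do that; `calendar_features` is a hypothetical helper, and the last observed date of 2016-04-30 (day 1919 in the calendar) is an assumption inferred from the calendar table.

```python
import pandas as pd

def calendar_features(dates):
    """Build the seven calendar features used by create_features for any dates."""
    s = pd.Series(pd.DatetimeIndex(dates))
    feats = pd.DataFrame({
        "dayofweek": s.dt.dayofweek,
        "quarter": s.dt.quarter,
        "month": s.dt.month,
        "year": s.dt.year,
        "dayofyear": s.dt.dayofyear,
        "dayofmonth": s.dt.day,
        "weekofyear": s.dt.isocalendar().week.astype("int64"),
    })
    feats.index = pd.DatetimeIndex(dates)
    return feats

# Assumed last observed date; the 21 forecast days start the day after.
last_observed = pd.Timestamp("2016-04-30")
future_dates = pd.date_range(last_observed + pd.Timedelta(days=1), periods=21, freq="D")
X_future = calendar_features(future_dates)
print(X_future.head())

# With a fitted model, the forecast would then be:
# prediction = model.predict(X_future)
```

The column order matches the feature matrix returned by `create_features`, which matters because the fitted model expects the same columns it was trained on.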

In [95]:
#assuming the prediction_df has been cleaned
#train_df should be traincal
def rolling_window(train_df, prediction_df):
    """
    train_df is traincal with the natural cluster of interest column name renamed to be 'cluster'
    prediction_df has cluster name and prediction days as the columns
    """
    prediction = pd.DataFrame()
    # .copy() avoids SettingWithCopyWarning when 'd' is reassigned below
    train_temp = train_df[['id', 'd', 'cluster', 'sales']].copy()
    train_temp['d'] = train_temp['d'].astype(int)
    
    for day in range(1920, 1941):
        # the window is the seven days before the day being forecast
        temp = train_temp[train_temp['d'] >= day-7]
        
        # sum only the sales column so the string 'id' column is not aggregated
        total_cluster_sales = (temp.groupby(["cluster", "d"])['sales'].sum()
                               .reset_index().rename(columns={'sales': 'total_cluster_sales'}))
    
        merged = pd.merge(temp, total_cluster_sales, on = ['cluster', 'd'], how = 'left')
        #get weekly sums for item and stores
        merged['item_weekly_sales'] = merged.groupby('id')['sales'].transform('sum')
        merged['cluster_weekly_sales'] = merged.groupby(['cluster', 'id'])['total_cluster_sales'].transform('sum')
        #set distance
        merged['prop'] = merged['item_weekly_sales'] / merged['cluster_weekly_sales']
        
        merged_temp_subset = merged[['id', 'cluster', 'prop']]
        
        forecast_df = pd.merge(merged_temp_subset, prediction_df, on = 'cluster', how ='left')
        forecast_df[day] = round(forecast_df['prop'] * forecast_df[day])
        preds_for_day = forecast_df[['id', 'cluster', day]].drop_duplicates().rename(columns={day:'sales'})
        preds_for_day['d'] = day  # scalar broadcasts to every row
        
        # DataFrame.append is deprecated; pd.concat is its replacement
        train_temp = pd.concat([train_temp, preds_for_day])
        prediction[day] = preds_for_day['sales']
        print(f"Prediction made for day: {day}")
        
    return prediction 

### Natural Cluster: Store ID
#### Data preparation and forecast

In [96]:
#subsetting the df so it loads faster
temp_df = traincal[["date", "sales", "store_id"]]

In [97]:
#getting the daily total sales for each cluster
total_sales_by_store_id = temp_df.groupby(["store_id","date"]).sum().reset_index()
total_sales_by_store_id = total_sales_by_store_id.rename(columns={'sales': 'total_sales_in_cluster',
                                                               'store_id':'cluster'})

total_sales_by_store_id

Unnamed: 0,cluster,date,total_sales_in_cluster
0,Central_1,2011-01-29,2556
1,Central_1,2011-01-30,2687
2,Central_1,2011-01-31,1822
3,Central_1,2011-02-01,2258
4,Central_1,2011-02-02,1694
...,...,...,...
19185,West_3,2016-04-26,3200
19186,West_3,2016-04-27,2962
19187,West_3,2016-04-28,2870
19188,West_3,2016-04-29,3692


In [98]:
#forecasting
forecast_cluster(total_sales_by_store_id, "predictions_by_store")

Cluster: Central_1 is predicted
Cluster: Central_2 is predicted
Cluster: Central_3 is predicted
Cluster: East_1 is predicted
Cluster: East_2 is predicted
Cluster: East_3 is predicted
Cluster: East_4 is predicted
Cluster: West_1 is predicted
Cluster: West_2 is predicted
Cluster: West_3 is predicted

([282.0132456880723,
  459.3269503130358,
  413.34165344385974,
  423.36700312798536,
  1570.9575061408607,
  589.3551413476852,
  263.35196867867967,
  510.54680623722675,
  1004.0778666690289,
  713.8111500884767],
 [184.73296767629253,
  350.7563080313318,
  298.3765709362729,
  291.36915979834754,
  1311.5399150748528,
  399.59305592482,
  205.3791030963678,
  381.79921542661975,
  806.1147058297203,
  576.0118107820681],
 [11.191474912056437,
  inf,
  5.019192166869943,
  inf,
  9.392174200393304,
  5.085590914334453,
  inf,
  6.070354872828162,
  11.384805372968732,
  13.96874651987916])
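The `inf` entries in the third list (the MAPE values) are what the formula `abs((y - pred) / y).mean()` produces when at least one day in the test split has an actual total of zero and a non-zero prediction. One common workaround, sketched here with a hypothetical `mape_ignore_zeros` helper and made-up numbers, is to average only over days with non-zero actuals. This changes the definition of the metric, so it should be reported as such.

```python
import numpy as np

def mape_ignore_zeros(y_true, y_pred):
    """MAPE computed only over observations whose actual value is non-zero."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y_true != 0
    if not mask.any():
        return float("nan")  # undefined when every actual value is zero
    return float(np.mean(np.abs((y_true[mask] - y_pred[mask]) / y_true[mask])))

# Toy data with one zero-sales day (hypothetical numbers).
y_true = np.array([100.0, 0.0, 50.0])
y_pred = np.array([110.0, 5.0, 45.0])

# The unguarded formula divides by zero on the second day.
with np.errstate(divide="ignore", invalid="ignore"):
    naive = float(np.mean(np.abs((y_true - y_pred) / y_true)))

masked = mape_ignore_zeros(y_true, y_pred)
print(naive, masked)
```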

To forecast item-level sales with the XGBoost model, we used the static weekly proportion distance measure to disaggregate cluster predictions into item predictions. First, we subset the train_final_store data to the last week of the training dataset. Then we set the distance measure to the ratio of an item's weekly sales to the weekly sales of its store_id. Finally, we pivoted the prediction dataframe and multiplied each store-level forecast by this ratio, which gives a forecast for every item scaled by the 'distance' between the item and its natural 'cluster', the store_id.

In [99]:
#subsetting traincal so it loads faster when we merge
traincal_temp = traincal[["id", "store_id", "date", "sales", "d"]].rename(columns={"store_id":"cluster"})
#merging the subset with the df that has the daily total sales by cluster
train_final_store = pd.merge(traincal_temp, total_sales_by_store_id, on = ["cluster", "date"])
train_final_store.head()

Unnamed: 0,id,cluster,date,sales,d,total_sales_in_cluster
0,Beauty_1_001_East_1,East_1,2011-01-29,0,1,4337
1,Beauty_1_002_East_1,East_1,2011-01-29,0,1,4337
2,Beauty_1_003_East_1,East_1,2011-01-29,0,1,4337
3,Beauty_1_004_East_1,East_1,2011-01-29,0,1,4337
4,Beauty_1_005_East_1,East_1,2011-01-29,0,1,4337


In [100]:
#subsetting for the week before our first prediction day (1920), i.e. days 1913 to 1919
train_final_store['d'] = train_final_store['d'].astype(int)
# .copy() makes the subset an independent DataFrame, which avoids SettingWithCopyWarning
subset_final_store = train_final_store[train_final_store['d'] >= 1913].copy()
subset_final_store = subset_final_store.drop(columns = 'date')
subset_final_store.head()



Unnamed: 0,id,cluster,sales,d,total_sales_in_cluster
58296880,Beauty_1_001_East_1,East_1,1,1913,6113
58296881,Beauty_1_002_East_1,East_1,0,1913,6113
58296882,Beauty_1_003_East_1,East_1,1,1913,6113
58296883,Beauty_1_004_East_1,East_1,2,1913,6113
58296884,Beauty_1_005_East_1,East_1,4,1913,6113


In [101]:
#get weekly sums for item and stores
subset_final_store['item_weekly_sales'] = (
    subset_final_store.groupby('id')['sales'].transform('sum'))

subset_final_store['store_weekly_sales'] = (
    subset_final_store.groupby(['cluster', 'id'])['total_sales_in_cluster'].transform('sum'))
#set distance
subset_final_store['prop'] = subset_final_store['item_weekly_sales'] / subset_final_store['store_weekly_sales']


In [102]:
#every item has the same store_weekly_sales so there are duplicate rows, which we drop
subset_final_store = pd.DataFrame(subset_final_store[['id', 'cluster', 'prop']].drop_duplicates()).reset_index(drop = True)
subset_final_store

Unnamed: 0,id,cluster,prop
0,Beauty_1_001_East_1,East_1,0.000187
1,Beauty_1_002_East_1,East_1,0.000031
2,Beauty_1_003_East_1,East_1,0.000156
3,Beauty_1_004_East_1,East_1,0.000312
4,Beauty_1_005_East_1,East_1,0.000343
...,...,...,...
30485,Food_3_823_West_3,West_3,0.000201
30486,Food_3_824_West_3,West_3,0.000120
30487,Food_3_825_West_3,West_3,0.000161
30488,Food_3_826_West_3,West_3,0.000441


In [103]:
#loading the forecasts, which we saved as a csv
prediction_df_store = pd.read_csv('predictions_by_store.csv')
prediction_df_store = prediction_df_store.drop(columns= 'Unnamed: 0')
prediction_df_store.head()

Unnamed: 0,Central_1,Central_2,Central_3,East_1,East_2,East_3,East_4,West_1,West_2,West_3
0,4108.26,4632.636,4397.961,5708.0703,3180.079,7706.003,2738.3342,4217.224,4292.7065,3775.0298
1,3169.3892,3829.7668,4175.2744,4346.137,2145.6538,6432.1772,2407.1462,3035.1704,4337.0947,3423.1404
2,3050.8254,3478.1702,3740.1633,3835.921,2068.626,5845.055,2235.8623,2948.0254,4660.7637,3416.5283
3,3092.0632,3579.483,3701.0881,3727.0962,2046.8002,5658.973,2197.0237,2972.0818,3940.0676,2936.9277
4,2806.79,3369.973,3405.4111,3723.1924,2069.303,5475.958,2183.1443,2963.4153,4356.265,3254.6042


In [104]:
#editing our prediction df so we have 'clusters' and the prediction days as the column headers
prediction_df_store['d'] = np.array(range(1920, 1941))
prediction_df_store.set_index('d', inplace = True)
prediction_df_store = prediction_df_store.T
prediction_df_store = prediction_df_store.reset_index().rename(columns = {'index' : 'cluster'})
prediction_df_store

d,cluster,1920,1921,1922,1923,1924,1925,1926,1927,1928,...,1931,1932,1933,1934,1935,1936,1937,1938,1939,1940
0,Central_1,4108.26,3169.3892,3050.8254,3092.0632,2806.79,3288.3584,3853.123,4122.2607,2994.3772,...,2678.7915,2874.3218,3684.677,3914.571,2787.146,2545.0078,2536.8647,2519.3325,2766.9963,3395.573
1,Central_2,4632.636,3829.7668,3478.1702,3579.483,3369.973,3836.7952,4508.9287,4673.978,3518.4521,...,3177.7192,3579.7778,4375.1436,4552.0054,3419.748,3022.369,3038.5083,3092.2734,3538.6138,4191.1875
2,Central_3,4397.961,4175.2744,3740.1633,3701.0881,3405.4111,4069.6458,4418.1978,4492.108,3719.8027,...,3186.4888,3549.4182,4180.603,4148.402,3543.1135,3206.1433,3161.849,3154.0574,3495.371,4050.2937
3,East_1,5708.0703,4346.137,3835.921,3727.0962,3723.1924,4475.7427,5522.5615,5652.493,4199.8633,...,3666.6548,4425.0986,5463.602,5601.0103,4038.6213,3633.462,3590.065,3593.5864,4246.143,5506.7954
4,East_2,3180.079,2145.6538,2068.626,2046.8002,2069.303,2375.9575,3255.3564,3126.8752,2124.7786,...,2061.3838,2369.6,3252.7368,3124.059,2122.1794,2012.4425,2012.4425,2026.981,2433.5261,3258.9065
5,East_3,7706.003,6432.1772,5845.055,5658.973,5475.958,5740.7285,6980.3823,7542.9,5997.013,...,5336.037,5654.7373,6878.7334,7384.164,5775.0933,5368.428,5185.4805,5168.9727,5549.3286,6810.3057
6,East_4,2738.3342,2407.1462,2235.8623,2197.0237,2183.1443,2342.3513,2627.663,2693.9038,2381.5398,...,2173.9482,2335.5947,2593.755,2691.934,2400.6526,2215.343,2157.232,2162.8716,2333.4666,2640.1006
7,West_1,4217.224,3035.1704,2948.0254,2972.0818,2963.4153,3687.6506,4786.969,4277.7666,2837.199,...,2670.3992,3423.3923,4403.868,4092.1577,2774.966,2634.28,2682.6223,2716.8882,3557.2551,4671.9043
8,West_2,4292.7065,4337.0947,4660.7637,3940.0676,4356.265,4689.3867,4696.893,3880.6318,3359.7678,...,3150.03,3541.083,3923.1548,3556.7256,3083.2656,3058.639,3062.7993,3061.4214,3737.4739,4117.6987
9,West_3,3775.0298,3423.1404,3416.5283,2936.9277,3254.6042,3644.1055,3851.3225,3699.3118,2737.0286,...,2393.5579,2941.6343,3339.9727,3243.242,2547.205,2373.1992,2268.8071,2289.81,3034.1206,3334.574


In [105]:
#merging our df that has the proportions for each item, with the predictions for each cluster
forecast_df = pd.merge(subset_final_store, prediction_df_store, on = 'cluster', how ='left' )
distance1(forecast_df)
forecast_df

Unnamed: 0,id,cluster,prop,1920,1921,1922,1923,1924,1925,1926,...,1931,1932,1933,1934,1935,1936,1937,1938,1939,1940
0,Beauty_1_001_East_1,East_1,0.000187,1.067195,0.812565,0.717173,0.696827,0.696097,0.836796,1.032512,...,0.685527,0.827327,1.021489,1.047179,0.755071,0.679321,0.671207,0.671866,0.793869,1.029564
1,Beauty_1_002_East_1,East_1,0.000031,0.177866,0.135427,0.119529,0.116138,0.116016,0.139466,0.172085,...,0.114254,0.137888,0.170248,0.174530,0.125845,0.113220,0.111868,0.111978,0.132312,0.171594
2,Beauty_1_003_East_1,East_1,0.000156,0.889329,0.677137,0.597644,0.580689,0.580081,0.697330,0.860427,...,0.571272,0.689440,0.851240,0.872649,0.629226,0.566101,0.559340,0.559888,0.661558,0.857970
3,Beauty_1_004_East_1,East_1,0.000312,1.778658,1.354274,1.195289,1.161379,1.160162,1.394660,1.720853,...,1.142545,1.378879,1.702481,1.745298,1.258451,1.132202,1.118679,1.119776,1.323116,1.715940
4,Beauty_1_005_East_1,East_1,0.000343,1.956524,1.489702,1.314818,1.277516,1.276178,1.534126,1.892938,...,1.256799,1.516767,1.872729,1.919828,1.384296,1.245422,1.230547,1.231754,1.455427,1.887534
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30485,Food_3_823_West_3,West_3,0.000201,0.757521,0.686909,0.685582,0.589342,0.653089,0.731249,0.772830,...,0.480306,0.590287,0.670220,0.650809,0.511138,0.476221,0.455273,0.459487,0.608845,0.669136
30486,Food_3_824_West_3,West_3,0.000120,0.454513,0.412145,0.411349,0.353605,0.391853,0.438749,0.463698,...,0.288184,0.354172,0.402132,0.390485,0.306683,0.285733,0.273164,0.275692,0.365307,0.401482
30487,Food_3_825_West_3,West_3,0.000161,0.606017,0.549527,0.548465,0.471474,0.522471,0.584999,0.618264,...,0.384245,0.472229,0.536176,0.520647,0.408910,0.380977,0.364218,0.367590,0.487076,0.535309
30488,Food_3_826_West_3,West_3,0.000441,1.666546,1.511199,1.508280,1.296553,1.436796,1.608747,1.700227,...,1.056674,1.298631,1.474483,1.431780,1.124504,1.047686,1.001600,1.010872,1.339460,1.472100


In [106]:
#cleaning the forecast_df so we can submit on kaggle
final = forecast_df.drop(columns=['cluster', 'prop'])
final.set_index('id', inplace = True)
columns = [f"d_{d}" for d in range(1920, 1941)]
final.columns = columns
final

id,d_1920,d_1921,d_1922,d_1923,d_1924,d_1925,d_1926,d_1927,d_1928,d_1929,...,d_1931,d_1932,d_1933,d_1934,d_1935,d_1936,d_1937,d_1938,d_1939,d_1940
Beauty_1_001_East_1,1.067195,0.812565,0.717173,0.696827,0.696097,0.836796,1.032512,1.056804,0.785217,0.690334,...,0.685527,0.827327,1.021489,1.047179,0.755071,0.679321,0.671207,0.671866,0.793869,1.029564
Beauty_1_002_East_1,0.177866,0.135427,0.119529,0.116138,0.116016,0.139466,0.172085,0.176134,0.130869,0.115056,...,0.114254,0.137888,0.170248,0.174530,0.125845,0.113220,0.111868,0.111978,0.132312,0.171594
Beauty_1_003_East_1,0.889329,0.677137,0.597644,0.580689,0.580081,0.697330,0.860427,0.880670,0.654347,0.575278,...,0.571272,0.689440,0.851240,0.872649,0.629226,0.566101,0.559340,0.559888,0.661558,0.857970
Beauty_1_004_East_1,1.778658,1.354274,1.195289,1.161379,1.160162,1.394660,1.720853,1.761340,1.308695,1.150557,...,1.142545,1.378879,1.702481,1.745298,1.258451,1.132202,1.118679,1.119776,1.323116,1.715940
Beauty_1_005_East_1,1.956524,1.489702,1.314818,1.277516,1.276178,1.534126,1.892938,1.937474,1.439564,1.265613,...,1.256799,1.516767,1.872729,1.919828,1.384296,1.245422,1.230547,1.231754,1.455427,1.887534
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Food_3_823_West_3,0.757521,0.686909,0.685582,0.589342,0.653089,0.731249,0.772830,0.742327,0.549229,0.500174,...,0.480306,0.590287,0.670220,0.650809,0.511138,0.476221,0.455273,0.459487,0.608845,0.669136
Food_3_824_West_3,0.454513,0.412145,0.411349,0.353605,0.391853,0.438749,0.463698,0.445396,0.329537,0.300105,...,0.288184,0.354172,0.402132,0.390485,0.306683,0.285733,0.273164,0.275692,0.365307,0.401482
Food_3_825_West_3,0.606017,0.549527,0.548465,0.471474,0.522471,0.584999,0.618264,0.593862,0.439383,0.400139,...,0.384245,0.472229,0.536176,0.520647,0.408910,0.380977,0.364218,0.367590,0.487076,0.535309
Food_3_826_West_3,1.666546,1.511199,1.508280,1.296553,1.436796,1.608747,1.700227,1.633119,1.208304,1.100384,...,1.056674,1.298631,1.474483,1.431780,1.124504,1.047686,1.001600,1.010872,1.339460,1.472100


In [107]:
#saving the predictions to csv
final.to_csv('stores_prediction.csv')

#### Distance measure: 'sliding window'
We update our proportion value every time a prediction is made such that our new proportion value is inclusive of our prediction. Each 'window' is the week before the day in question and the 'window' moves after we forecast one day.
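To make the moving window concrete, the sketch below runs two forecast days for two hypothetical items with made-up history. For each day it recomputes every item's share from the previous seven days, splits the cluster forecast with that share, and appends the result so the next window includes it. Unlike `rolling_window`, this toy version does not round the item forecasts.

```python
import pandas as pd

# Seven days of toy history for two items in one cluster (hypothetical numbers).
history = pd.DataFrame(
    {"item_a": [1, 1, 1, 1, 1, 1, 1], "item_b": [3, 3, 3, 3, 3, 3, 3]},
    index=range(1913, 1920),
).astype(float)

# Cluster-level forecasts for the next two days (hypothetical numbers).
cluster_forecast = {1920: 8.0, 1921: 12.0}

for day, cluster_total in cluster_forecast.items():
    # The window is the seven days before the day being forecast.
    window = history.loc[day - 7: day - 1]
    prop = window.sum() / window.sum().sum()
    # Append the item forecasts so the next window includes them.
    history.loc[day] = prop * cluster_total

print(history.tail(3))
```

From the second forecast day onward the window contains forecasts rather than observed sales, which is how forecast error can feed back into the proportions.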

In [108]:
#forecasting using the 'rolling window' distance measure
traincal_temp = traincal.copy()
traincal_temp.rename(columns={"store_id":"cluster"}, inplace=True)
prediction_w_rolling_prop = rolling_window(traincal_temp, prediction_df_store)

Prediction made for day: 1920
Prediction made for day: 1921
Prediction made for day: 1922
Prediction made for day: 1923
Prediction made for day: 1924
Prediction made for day: 1925
Prediction made for day: 1926
Prediction made for day: 1927
Prediction made for day: 1928
Prediction made for day: 1929
Prediction made for day: 1930
Prediction made for day: 1931
Prediction made for day: 1932
Prediction made for day: 1933
Prediction made for day: 1934
Prediction made for day: 1935
Prediction made for day: 1936
Prediction made for day: 1937
Prediction made for day: 1938
Prediction made for day: 1939
Prediction made for day: 1940

In [109]:
#cleaning the formatting of the predictions so we can submit on kaggle
columns = [f"d_{d}" for d in range(1920, 1941)]
prediction_w_rolling_prop.columns = columns
prediction_w_rolling_prop = prediction_w_rolling_prop.set_index(train['id'])

In [110]:
#saving the rolling window predictions to csv
prediction_w_rolling_prop.to_csv('store_pred_v2.csv')

### Analysis:
As expected, the sliding window forecasts from the XGBoost model were less accurate than the fixed proportion forecasts. The sliding window recomputes each proportion from a week that includes our own earlier forecasts, which are not perfectly accurate, so their errors distort the distance from an item to its cluster. The sliding window measure still produced usable item-level forecasts, but the noise added by the earlier forecasts led to slightly lower accuracy than the static weekly proportion approach.

### Natural Cluster: Subcategory ID
#### Data preparation and forecast

In [111]:
#subsetting traincal so it loads faster when we merge
temp_df = traincal[["date", "sales", "subcat_id"]]
#getting the daily total sales grouped by cluster
total_sales_by_subcat_id = temp_df.groupby(["subcat_id","date"]).sum().reset_index()
total_sales_by_subcat_id = total_sales_by_subcat_id.rename(columns={'sales': 'total_sales_in_cluster', 
                                                                    'subcat_id': 'cluster'})

In [112]:
total_sales_by_subcat_id.head()

Unnamed: 0,cluster,date,total_sales_in_cluster
0,Beauty_1,2011-01-29,3610
1,Beauty_1,2011-01-30,3172
2,Beauty_1,2011-01-31,2497
3,Beauty_1,2011-02-01,2531
4,Beauty_1,2011-02-02,1714


In [113]:
#forecasting each subcategory cluster; the function saves the predictions to 'prediction_by_cluster.csv'
forecast_cluster(total_sales_by_subcat_id, 'prediction_by_cluster')

Cluster: Beauty_1 is predicted
Cluster: Beauty_2 is predicted
Cluster: Cleaning_1 is predicted
Cluster: Cleaning_2 is predicted
Cluster: Food_1 is predicted
Cluster: Food_2 is predicted
Cluster: Food_3 is predicted


([433.6731274214666,
  66.79109341074913,
  1009.2095135850996,
  219.3815792066758,
  489.4863832767539,
  1238.7218414531021,
  2397.842243052403],
 [300.0437177887762,
  50.63746227643877,
  766.2851460242147,
  163.15955165044176,
  358.5820197459915,
  1056.46952076857,
  1898.4929646596859],
 [inf,
  inf,
  inf,
  7.655372767862929,
  13.113652221162582,
  inf,
  5.143461213730206])

#### Distance measure: Weekly proportion (static)

In [114]:
#subsetting traincal so it loads faster when we merge
traincal_temp = traincal[["id", "subcat_id", "date", "sales", "d"]]

In [115]:
train_final_subcat = pd.merge(traincal_temp, total_sales_by_subcat_id, 
                             left_on = ["subcat_id", "date"], 
                             right_on = ["cluster", "date"])

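#dropping the columns we no longer need and casting 'd' to int
train_final_subcat.drop(columns = ['date', 'subcat_id'], inplace = True)
train_final_subcat['d'] = train_final_subcat['d'].astype(int)
#subsetting for days from 1913 onward, i.e. the week before our first prediction day, which is 1920
#(.copy() so that later column assignments do not trigger SettingWithCopyWarning)
subset_final_subcat = train_final_subcat[train_final_subcat['d'] >= 1913].copy()
train_final_subcat.head()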

Unnamed: 0,id,sales,d,cluster,total_sales_in_cluster
0,Beauty_1_001_East_1,0,1,Beauty_1,3610
1,Beauty_1_002_East_1,0,1,Beauty_1,3610
2,Beauty_1_003_East_1,0,1,Beauty_1,3610
3,Beauty_1_004_East_1,0,1,Beauty_1,3610
4,Beauty_1_005_East_1,0,1,Beauty_1,3610


In [116]:
#get weekly sums for each item and for its subcat cluster
subset_final_subcat['item_weekly_sales'] = subset_final_subcat.groupby('id')['sales'].transform('sum')

subset_final_subcat['subcat_weekly_sales'] = (subset_final_subcat.groupby(['cluster', 'id'])
                                              ['total_sales_in_cluster'].transform('sum'))
#set distance: the item's share of its cluster's weekly sales
subset_final_subcat['prop'] = subset_final_subcat['item_weekly_sales'] / subset_final_subcat['subcat_weekly_sales']

#keep one row per item
subset_final_subcat = subset_final_subcat[['id', 'cluster', 'prop']].drop_duplicates().reset_index(drop = True)
subset_final_subcat


Unnamed: 0,id,cluster,prop
0,Beauty_1_001_East_1,Beauty_1,0.000234
1,Beauty_1_002_East_1,Beauty_1,0.000039
2,Beauty_1_003_East_1,Beauty_1,0.000195
3,Beauty_1_004_East_1,Beauty_1,0.000391
4,Beauty_1_005_East_1,Beauty_1,0.000430
...,...,...,...
30485,Food_3_823_West_3,Food_3,0.000038
30486,Food_3_824_West_3,Food_3,0.000023
30487,Food_3_825_West_3,Food_3,0.000031
30488,Food_3_826_West_3,Food_3,0.000085


In [117]:
#loading the predictions, which our function saved as a csv
#formatting the prediction df so we have 'cluster' and the prediction days as the column headers
prediction_df_subcat = pd.read_csv('prediction_by_cluster.csv')
prediction_df_subcat = prediction_df_subcat.drop(columns= 'Unnamed: 0')
prediction_df_subcat['d'] = np.array(range(1920, 1941))
prediction_df_subcat.set_index('d', inplace = True)
prediction_df_subcat = prediction_df_subcat.T
prediction_df_subcat = prediction_df_subcat.reset_index().rename(columns = {'index' : 'cluster'})
prediction_df_subcat

d,cluster,1920,1921,1922,1923,1924,1925,1926,1927,1928,...,1931,1932,1933,1934,1935,1936,1937,1938,1939,1940
0,Beauty_1,3795.334,3094.0776,2960.9004,2961.8152,2961.8152,3352.688,4008.417,3799.799,3069.9575,...,2921.6199,3402.8652,3986.6538,3798.8523,3085.7883,2903.047,2883.3872,2853.5261,3371.9287,4044.8179
1,Beauty_2,397.3793,348.68668,348.68668,351.41174,350.45663,360.17426,396.25546,396.84528,346.9131,...,357.62552,367.34314,404.24283,404.24283,354.90045,354.90045,356.59647,356.59647,366.3141,402.88965
2,Cleaning_1,9190.854,7144.938,6389.1367,6313.137,6263.9863,7128.967,9040.249,9005.049,6932.4053,...,6087.717,6917.046,8777.253,8519.082,6575.412,5912.7363,5819.7256,5857.062,7024.5347,9363.347
3,Cleaning_2,2308.7134,1706.2168,1651.5903,1651.1829,1664.9545,1923.0497,2354.6406,2308.757,1717.061,...,1645.8977,1890.7136,2337.5303,2294.8203,1711.2574,1599.6023,1599.4038,1622.5026,1891.9487,2378.3577
4,Food_1,3202.8977,2404.5603,2414.199,2522.8528,2576.3066,2938.1667,3218.4644,2877.0002,2305.862,...,2399.0566,2919.679,3214.6226,3120.9622,2428.4368,2432.0603,2427.4824,2482.5005,3031.1787,3292.873
5,Food_2,5586.4355,4975.806,4679.5645,4177.0854,4020.7415,4282.466,4764.4946,5095.8086,4364.145,...,3508.016,3710.027,4602.992,4849.1123,4005.5051,3601.076,3462.948,3372.2583,3572.4841,4608.704
6,Food_3,20955.17,17308.332,16031.606,15735.501,15743.161,17157.691,20444.56,20157.07,15434.326,...,13643.076,15693.227,18848.025,19136.547,14860.057,13392.662,13427.355,13429.416,15332.698,18596.068


In [118]:
#merging so we have the distance/proportion and the cluster predictions for each item
forecast_df = pd.merge(subset_final_subcat, prediction_df_subcat, on = 'cluster', how ='left')
#taking into account the 'distance' between an item and its natural cluster
distance1(forecast_df)
forecast_df

Unnamed: 0,id,cluster,prop,1920,1921,1922,1923,1924,1925,1926,...,1931,1932,1933,1934,1935,1936,1937,1938,1939,1940
0,Beauty_1_001_East_1,Beauty_1,0.000234,0.889636,0.725259,0.694042,0.694257,0.694257,0.785878,0.939583,...,0.684835,0.797640,0.934481,0.890460,0.723316,0.680481,0.675873,0.668874,0.790388,0.948115
1,Beauty_1_002_East_1,Beauty_1,0.000039,0.148273,0.120877,0.115674,0.115709,0.115709,0.130980,0.156597,...,0.114139,0.132940,0.155747,0.148410,0.120553,0.113414,0.112646,0.111479,0.131731,0.158019
2,Beauty_1_003_East_1,Beauty_1,0.000195,0.741363,0.604383,0.578369,0.578547,0.578547,0.654899,0.782986,...,0.570696,0.664700,0.778735,0.742050,0.602764,0.567068,0.563228,0.557395,0.658657,0.790096
3,Beauty_1_004_East_1,Beauty_1,0.000391,1.482726,1.208766,1.156737,1.157095,1.157095,1.309797,1.565971,...,1.141392,1.329400,1.557469,1.484101,1.205527,1.134136,1.126455,1.114789,1.317314,1.580192
4,Beauty_1_005_East_1,Beauty_1,0.000430,1.630999,1.329642,1.272411,1.272804,1.272804,1.440777,1.722569,...,1.255531,1.462340,1.713216,1.632511,1.326080,1.247549,1.239101,1.226268,1.449045,1.738211
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30485,Food_3_823_West_3,Food_3,0.000038,0.804891,0.664815,0.615776,0.604403,0.604697,0.659029,0.785278,...,0.524032,0.602779,0.723955,0.735037,0.570777,0.514414,0.515746,0.515826,0.588931,0.714277
30486,Food_3_824_West_3,Food_3,0.000023,0.482934,0.398889,0.369466,0.362642,0.362818,0.395417,0.471167,...,0.314419,0.361667,0.434373,0.441022,0.342466,0.308648,0.309448,0.309495,0.353359,0.428566
30487,Food_3_825_West_3,Food_3,0.000031,0.643913,0.531852,0.492621,0.483522,0.483757,0.527223,0.628223,...,0.419226,0.482223,0.579164,0.588030,0.456621,0.411531,0.412597,0.412660,0.471145,0.571422
30488,Food_3_826_West_3,Food_3,0.000085,1.770760,1.462594,1.354707,1.329686,1.330333,1.449864,1.727612,...,1.152871,1.326113,1.592701,1.617082,1.255709,1.131710,1.134642,1.134816,1.295648,1.571410


In [119]:
#preparing our forecast_df so we can submit on kaggle
final = forecast_df.drop(columns=['cluster', 'prop'])
final.set_index('id', inplace = True)
final.columns = columns
final

id,d_1920,d_1921,d_1922,d_1923,d_1924,d_1925,d_1926,d_1927,d_1928,d_1929,...,d_1931,d_1932,d_1933,d_1934,d_1935,d_1936,d_1937,d_1938,d_1939,d_1940
Beauty_1_001_East_1,0.889636,0.725259,0.694042,0.694257,0.694257,0.785878,0.939583,0.890682,0.719606,0.673474,...,0.684835,0.797640,0.934481,0.890460,0.723316,0.680481,0.675873,0.668874,0.790388,0.948115
Beauty_1_002_East_1,0.148273,0.120877,0.115674,0.115709,0.115709,0.130980,0.156597,0.148447,0.119934,0.112246,...,0.114139,0.132940,0.155747,0.148410,0.120553,0.113414,0.112646,0.111479,0.131731,0.158019
Beauty_1_003_East_1,0.741363,0.604383,0.578369,0.578547,0.578547,0.654899,0.782986,0.742235,0.599671,0.561229,...,0.570696,0.664700,0.778735,0.742050,0.602764,0.567068,0.563228,0.557395,0.658657,0.790096
Beauty_1_004_East_1,1.482726,1.208766,1.156737,1.157095,1.157095,1.309797,1.565971,1.484470,1.199343,1.122457,...,1.141392,1.329400,1.557469,1.484101,1.205527,1.134136,1.126455,1.114789,1.317314,1.580192
Beauty_1_005_East_1,1.630999,1.329642,1.272411,1.272804,1.272804,1.440777,1.722569,1.632917,1.319277,1.234703,...,1.255531,1.462340,1.713216,1.632511,1.326080,1.247549,1.239101,1.226268,1.449045,1.738211
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Food_3_823_West_3,0.804891,0.664815,0.615776,0.604403,0.604697,0.659029,0.785278,0.774236,0.592834,0.525513,...,0.524032,0.602779,0.723955,0.735037,0.570777,0.514414,0.515746,0.515826,0.588931,0.714277
Food_3_824_West_3,0.482934,0.398889,0.369466,0.362642,0.362818,0.395417,0.471167,0.464541,0.355701,0.315308,...,0.314419,0.361667,0.434373,0.441022,0.342466,0.308648,0.309448,0.309495,0.353359,0.428566
Food_3_825_West_3,0.643913,0.531852,0.492621,0.483522,0.483757,0.527223,0.628223,0.619389,0.474268,0.420411,...,0.419226,0.482223,0.579164,0.588030,0.456621,0.411531,0.412597,0.412660,0.471145,0.571422
Food_3_826_West_3,1.770760,1.462594,1.354707,1.329686,1.330333,1.449864,1.727612,1.703318,1.304236,1.156129,...,1.152871,1.326113,1.592701,1.617082,1.255709,1.131710,1.134642,1.134816,1.295648,1.571410


In [120]:
#saving predictions to csv
final.to_csv('subcat.csv')

#### Distance measure: rolling window

In [121]:
#forecasting using the 'rolling window' distance measure
traincal_temp = traincal.copy()
traincal_temp.rename(columns={"subcat_id":"cluster"}, inplace=True)
prediction_w_rolling_prop_subcat = rolling_window(traincal_temp, prediction_df_subcat)

Prediction made for day: 1920
Prediction made for day: 1921
Prediction made for day: 1922
Prediction made for day: 1923
Prediction made for day: 1924
Prediction made for day: 1925
Prediction made for day: 1926
Prediction made for day: 1927
Prediction made for day: 1928
Prediction made for day: 1929
Prediction made for day: 1930
Prediction made for day: 1931
Prediction made for day: 1932
Prediction made for day: 1933
Prediction made for day: 1934
Prediction made for day: 1935
Prediction made for day: 1936
Prediction made for day: 1937
Prediction made for day: 1938
Prediction made for day: 1939
Prediction made for day: 1940


In [122]:
#cleaning the formatting of the predictions so we can submit on kaggle
columns = [f'd_{day}' for day in range(1920, 1941)]
prediction_w_rolling_prop_subcat.columns = columns
prediction_w_rolling_prop_subcat = prediction_w_rolling_prop_subcat.set_index(train['id'])

In [123]:
#saving the rolling window predictions to csv
prediction_w_rolling_prop_subcat.to_csv("subcat_rolling_predictions.csv")

## Summary

In summary, during our data preparation stage we did not apply transformations such as Box-Cox, as we found that they decreased the performance of our models significantly. We trained two models, one for each natural clustering (subcategory id and store id), on the respective cluster centroids, defined as each cluster's total sales per day. To obtain the predictions for each item, we used a deallocation method based on a distance metric that measures the proportion of an item's weekly sales to its cluster's weekly sales. We applied both the static weekly proportion and the rolling window distance measures to both clusterings. The model trained on the subcategory id clusters using the rolling window distance measure performed the best. We attribute this to sales patterns being more similar within a subcategory than within a store, where they are more closely tied to the specific location in which the item is sold.
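
As a compact illustration of the static weekly proportion deallocation described above, the following sketch allocates cluster-level forecasts to items on toy data. The item ids, cluster names, sales figures, and forecast values are all made up for illustration, and the code is a simplified stand-in for the notebook's `distance1` step rather than a copy of it.

```python
import pandas as pd

# Toy data (assumed): last week of sales for three items in two clusters.
sales = pd.DataFrame({
    "id":      ["a", "a", "b", "b", "c", "c"],
    "cluster": ["X", "X", "X", "X", "Y", "Y"],
    "sales":   [1, 3, 2, 2, 5, 5],
})

# Assumed cluster-level forecasts for two future days.
cluster_forecast = pd.DataFrame({
    "cluster": ["X", "Y"],
    "d_1920":  [16.0, 20.0],
    "d_1921":  [8.0, 10.0],
})

# Weekly sales per item, and the weekly total of the cluster each item belongs to.
item_week = sales.groupby(["cluster", "id"], as_index=False)["sales"].sum()
cluster_week = item_week.groupby("cluster")["sales"].transform("sum")

# The "distance": each item's share of its cluster's weekly sales.
item_week["prop"] = item_week["sales"] / cluster_week

# Attach the cluster forecast to every item and scale it by the item's share.
forecast = item_week[["id", "cluster", "prop"]].merge(cluster_forecast, on="cluster", how="left")
day_cols = ["d_1920", "d_1921"]
forecast[day_cols] = forecast[day_cols].mul(forecast["prop"], axis=0)
```

Because the proportions within a cluster sum to one, the item forecasts for each day add back up to the cluster forecast.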


### Future Direction

We explored using dynamic time warping (DTW) clustering to group items based on their time-series similarity, but ultimately opted to use the subcategory and store clusters with XGBoost for our models due to computational limitations. With more resources, DTW clustering could potentially improve our results. In addition, fitting a separate XGBoost model to each item_id could further improve the accuracy of our forecasts. We also recognize the potential value of incorporating insights on anomalies within the general trend, such as spikes and plateaus, for more accurate forecasting.
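
For reference, the DTW distance between two series can be sketched with the textbook dynamic-programming recurrence. This is a hand-written illustration, not the `dtw` package imported at the top of the notebook; the function name `dtw_distance` and the absolute-difference cost are assumptions.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic time warping distance with absolute-difference cost.

    Hypothetical helper for illustration; runs in O(len(a) * len(b)) time.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    # cost[i, j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i, j] = step + min(cost[i - 1, j],      # repeat b[j-1]
                                    cost[i, j - 1],      # repeat a[i-1]
                                    cost[i - 1, j - 1])  # advance both
    return cost[n, m]

same = dtw_distance([1, 2, 3], [1, 2, 3])
shifted = dtw_distance([0, 1, 2], [0, 0, 1, 2])
offset = dtw_distance([0, 0], [1, 1])
```

Each comparison is quadratic in the series length, and clustering requires many pairwise comparisons, so the cost grows quickly with tens of thousands of item-level series. This is consistent with the computational limitations noted above.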