In [3]:
# importing the necessary packages

import pandas as pd
import numpy as np
from datetime import datetime
import math
from sklearn.metrics import mean_squared_error
from math import sqrt
from statsmodels.tsa.holtwinters import ExponentialSmoothing

In [4]:
#loading the needed datasets

train = pd.read_csv("train_ds.csv")
calendar = pd.read_csv("calendar.csv")
#prices = pd.read_csv("prices.csv")

We aim to produce reliable sales forecasts for individual item ids. We do that with the Holt-Winters exponential smoothing (ETS) model.

Forecasting each id directly is neither practical nor sensible. There are 30,490 unique ids, the sales series of an individual id is very noisy, and fitting an exponential smoothing model once per id would take a very long time. Hence, we turn to clustering. The dataset already provides natural clusters, e.g. store_ids and subcat_ids, and we can also cluster the series with other techniques such as dynamic time warping (DTW). We then group the train data by cluster category and aggregate the sales of each category by day. We fit the ETS model to each aggregated series, one category at a time, and obtain a 21-day sales forecast per category. Finally, we obtain the forecast for each individual id by computing that id's weight (its share of sales) within its category and multiplying the weight by the category forecast.
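
The last step, turning a category forecast into per-id forecasts, can be sketched on toy data. The item ids, weights, and forecast values below are made up for illustration; only the mechanics (weight times category forecast) mirror the approach described above.

```python
import pandas as pd

# Hypothetical weights: share of each item in its store's sales
# (rows: items, columns: stores).
weights = pd.DataFrame(
    {"East_1": [0.75, 0.25, 0.0], "West_1": [0.0, 0.0, 1.0]},
    index=["item_a_East_1", "item_b_East_1", "item_a_West_1"],
)

# Hypothetical store-level forecasts (rows: stores, columns: forecast days).
store_forecast = pd.DataFrame(
    [[100.0, 120.0], [40.0, 50.0]],
    index=["East_1", "West_1"],
    columns=["day_1", "day_2"],
)

# Each item forecast is its weight times the forecast of its own store;
# the zero weights make the other store's forecast drop out.
item_forecast = weights.dot(store_forecast)
print(item_forecast)
```

Because the weights within a store sum to one, the item forecasts add back up to the store forecast.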

# Preparing the dataset for forecasting

Our goal is to first transform the dataset so that it is easy to group the train data by a specific category, e.g. item id, and retrieve the corresponding sales from day 1 to day 1919. So, first, we melt the day columns into a single 'sales' column, transforming the dataset from wide to long format.

In [5]:
# stripping the 'd_' prefix from the day column titles
_cols = list(train.columns)
train.columns = pd.Index(_cols[:6] + [int(c.replace("d_","")) for c in _cols[6:]])

In [6]:
# melting train sales from wide to long format
train.columns = train.columns.astype(str)
df_melt = pd.melt(train, id_vars = [i for i in train.columns if i.find("id") != -1],
                          value_vars = [i for i in train.columns if i.isnumeric()], var_name = 'd', value_name = 'sales')
df_melt.head()

Unnamed: 0,id,item_id,subcat_id,category_id,store_id,region_id,d,sales
0,Beauty_1_001_East_1,Beauty_1_001,Beauty_1,Beauty,East_1,East,1,0
1,Beauty_1_002_East_1,Beauty_1_002,Beauty_1,Beauty,East_1,East,1,0
2,Beauty_1_003_East_1,Beauty_1_003,Beauty_1,Beauty,East_1,East,1,0
3,Beauty_1_004_East_1,Beauty_1_004,Beauty_1,Beauty,East_1,East,1,0
4,Beauty_1_005_East_1,Beauty_1_005,Beauty_1,Beauty,East_1,East,1,0


In [7]:
#remove d_ in calendar
calendar['d'] = calendar['d'].str[2:]

calendar['date'] = pd.to_datetime(calendar['date'], format='%Y-%m-%d')

calendar

Unnamed: 0,date,wm_yr_wk,weekday,wday,month,year,d
0,2011-01-29,11101,Saturday,1,1,2011,1
1,2011-01-30,11101,Sunday,2,1,2011,2
2,2011-01-31,11101,Monday,3,1,2011,3
3,2011-02-01,11101,Tuesday,4,2,2011,4
4,2011-02-02,11101,Wednesday,5,2,2011,5
...,...,...,...,...,...,...,...
1964,2016-06-15,11620,Wednesday,5,6,2016,1965
1965,2016-06-16,11620,Thursday,6,6,2016,1966
1966,2016-06-17,11620,Friday,7,6,2016,1967
1967,2016-06-18,11621,Saturday,1,6,2016,1968


We need a date-typed index for the sales data to train the ETS model. Hence, we merge the calendar and train datasets.
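
A left merge silently leaves missing dates if the key values or dtypes differ between the two frames. A minimal sketch of a check, on toy frames that stand in for `df_melt` and `calendar` (the data and the names `sales_long` and `cal` are made up; `indicator` and `validate` are standard `pd.merge` parameters):

```python
import pandas as pd

# Toy stand-ins for the long sales table and the calendar; 'd' is a string key in both.
sales_long = pd.DataFrame(
    {"id": ["a", "b", "a", "b"], "d": ["1", "1", "2", "2"], "sales": [0, 3, 1, 2]}
)
cal = pd.DataFrame(
    {"d": ["1", "2"], "date": pd.to_datetime(["2011-01-29", "2011-01-30"])}
)

# validate raises if 'd' is not unique in the calendar;
# indicator adds a '_merge' column recording where each row was found.
merged = pd.merge(
    sales_long, cal, on="d", how="left", validate="many_to_one", indicator=True
)

unmatched = (merged["_merge"] != "both").sum()
print(f"rows without a calendar match: {unmatched}")
```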

In [8]:
#merge pivoted train with calendar
traincal = pd.merge(df_melt, calendar, on = 'd', how = 'left')

traincal.head()

Unnamed: 0,id,item_id,subcat_id,category_id,store_id,region_id,d,sales,date,wm_yr_wk,weekday,wday,month,year
0,Beauty_1_001_East_1,Beauty_1_001,Beauty_1,Beauty,East_1,East,1,0,2011-01-29,11101,Saturday,1,1,2011
1,Beauty_1_002_East_1,Beauty_1_002,Beauty_1,Beauty,East_1,East,1,0,2011-01-29,11101,Saturday,1,1,2011
2,Beauty_1_003_East_1,Beauty_1_003,Beauty_1,Beauty,East_1,East,1,0,2011-01-29,11101,Saturday,1,1,2011
3,Beauty_1_004_East_1,Beauty_1_004,Beauty_1,Beauty,East_1,East,1,0,2011-01-29,11101,Saturday,1,1,2011
4,Beauty_1_005_East_1,Beauty_1_005,Beauty_1,Beauty,East_1,East,1,0,2011-01-29,11101,Saturday,1,1,2011


In [9]:
traincal['date'] = pd.to_datetime(traincal['date'], format='%Y-%m-%d')

# Clustering by Store ID

First, we consider the natural store_id clusters. As shown in the EDA, there are 10 unique store_ids, and the sales data aggregated by store is much less noisy. Since we aggregate by store_id, we need the corresponding weight of each id within its store.

## Obtaining item weights for each item based on the last week of train data

In [9]:
# subsetting for the last 7 days only
traincal['d'] = traincal['d'].astype(int)
traincal_lastweek = traincal[traincal['d'] > 1912] 
traincal_lastweek = traincal_lastweek.drop(columns = ['date', 'd'])

In [10]:
# get last week sales sums for item and stores
traincal_lastweek['item_lastweek_sales'] = pd.DataFrame(traincal_lastweek.groupby('id')['sales'].transform('sum'))

traincal_lastweek['store_lastweek_sales'] = pd.DataFrame(traincal_lastweek.groupby('store_id')['sales'].transform('sum'))

# calc weights
traincal_lastweek['w'] = traincal_lastweek['item_lastweek_sales'] / traincal_lastweek['store_lastweek_sales']

In [11]:
# removing duplicated weights
traincal_lastweek = traincal_lastweek[['id','store_id','w']]
traincal_lastweek = pd.DataFrame(traincal_lastweek[['id', 'store_id', 'w']].drop_duplicates()).reset_index(drop = True)

In [12]:
# Pivot sales dataframe to get item-sales mapping
item_sales_df = traincal_lastweek.pivot(index='id', columns='store_id', values='w')
item_sales_df = item_sales_df.fillna(0)
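
As a sanity check on the weights, each store's weights should sum to one, and an item with no sales in the window gets weight zero (so its forecast will be zero as well). A sketch on made-up data, using the same `groupby(...).transform('sum')` pattern as the cells above:

```python
import pandas as pd

# Made-up sales window: two rows (days) per item.
last_week = pd.DataFrame({
    "id": ["a_East_1", "a_East_1", "b_East_1", "b_East_1",
           "c_East_1", "c_East_1", "a_West_1", "a_West_1"],
    "store_id": ["East_1"] * 6 + ["West_1"] * 2,
    "sales": [3, 1, 2, 2, 0, 0, 4, 1],
})

# Item total and store total over the window, broadcast back to every row.
item_total = last_week.groupby("id")["sales"].transform("sum")
store_total = last_week.groupby("store_id")["sales"].transform("sum")
last_week["w"] = item_total / store_total

# One weight per item, then check that weights sum to one within each store.
weights = last_week[["id", "store_id", "w"]].drop_duplicates()
store_sums = weights.groupby("store_id")["w"].sum()
print(store_sums)
```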

## Splitting the dataset into testing and training datasets

In [13]:
# selecting the 'date', 'sales', and 'store_id' columns
temp_df = traincal[["date", "sales", "store_id"]]
temp_df.head()

Unnamed: 0,date,sales,store_id
0,2011-01-29,0,East_1
1,2011-01-29,0,East_1
2,2011-01-29,0,East_1
3,2011-01-29,0,East_1
4,2011-01-29,0,East_1


In [42]:
# grouping the dataset by store and date then finding the total sales of each store per day
total_sales_by_store_id = temp_df.groupby(["store_id","date"]).sum().reset_index()
total_sales_by_store_id = total_sales_by_store_id.rename(columns={'sales': 'total_sales_in_store'})
total_sales_by_store_id.head()

Unnamed: 0,store_id,date,total_sales_in_store
0,Central_1,2011-01-29,2556
1,Central_1,2011-01-30,2687
2,Central_1,2011-01-31,1822
3,Central_1,2011-02-01,2258
4,Central_1,2011-02-02,1694


Now we define some functions that train our models given the input sales data. Going back to the EDA, the sales data has a seasonal component with a period of 7, corresponding to the sales week, so we include seasonality in the model. We also include a trend component, because some of the sales series show a trend. We model both trend and seasonality as additive, because we do not see non-linear changes in the sales data.
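
To make the additive trend and seasonality concrete, here is a plain-Python sketch of the additive Holt-Winters recursions with fixed smoothing parameters. It is illustrative only: the function name, the parameter values, and the simple initialisation are assumptions of this example, whereas the `ExponentialSmoothing(...).fit()` calls used in this notebook estimate the smoothing parameters from the data.

```python
def holt_winters_additive(y, m, alpha, beta, gamma, horizon):
    """Additive-trend, additive-seasonal Holt-Winters with fixed parameters."""
    if len(y) < 2 * m:
        raise ValueError("need at least two full seasons of data")

    # Simple heuristic initialisation from the first two seasons.
    first = sum(y[:m]) / m
    second = sum(y[m:2 * m]) / m
    level = first
    trend = (second - first) / m
    season = [y[i] - first for i in range(m)]  # one index per position in the cycle

    for t in range(m, len(y)):
        prev_level, prev_trend = level, trend
        s_old = season[t % m]
        # Level: deseasonalised observation blended with the previous projection.
        level = alpha * (y[t] - s_old) + (1 - alpha) * (prev_level + prev_trend)
        # Trend: change in level blended with the previous trend.
        trend = beta * (level - prev_level) + (1 - beta) * prev_trend
        # Seasonal index: detrended observation blended with the previous index.
        season[t % m] = gamma * (y[t] - prev_level - prev_trend) + (1 - gamma) * s_old

    n = len(y)
    # h-step forecast: extrapolated level plus the latest index for that position.
    return [level + h * trend + season[(n + h - 1) % m] for h in range(1, horizon + 1)]


# Toy weekly pattern with no trend (values are made up).
pattern = [-3, -2, -1, 0, 1, 2, 3]
y = [100 + pattern[t % 7] for t in range(28)]
forecast = holt_winters_additive(y, m=7, alpha=0.3, beta=0.1, gamma=0.2, horizon=21)
print([round(v, 2) for v in forecast[:7]])
```

With `m=7` the seasonal indices play the role of day-of-week effects, which is the weekly period described above.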

In [16]:
# defining a function that splits the data for one store into training and test data sets
def forecast_store_with_split(df):
    # set_index returns a new frame, so the caller's DataFrame is left unchanged
    df = df.set_index('date')
    groups = df.groupby('store_id')
    forecasts = {}
    real_forecasts = {}
    metrics = {}
    
    # forecast for each store using ETS; the smoothing parameters are estimated by fit()
    for name, group in groups:
        id_data = group['total_sales_in_store']
        size = int(len(id_data) * 0.90)
        train, test = id_data[0:size], id_data[size:len(id_data)]

        ets_model = ExponentialSmoothing(train, trend = 'add', seasonal = 'add', seasonal_periods = 7, freq = 'D')
        ets_fit = ets_model.fit()
        forecast = ets_fit.forecast(len(test))
        real_forecast = ets_fit.forecast(21)
        forecasts.update({name:forecast})
        real_forecasts.update({name:real_forecast})
        
        #calculating metrics
        rmse = sqrt(mean_squared_error(test, forecast))
        mae = abs(test - forecast).mean()
        mape = abs((test - forecast)/test).mean()
        metrics.update({name:[rmse, mae, mape]})
    
    predicted = pd.DataFrame(real_forecasts)
    metrics = pd.DataFrame(metrics)
    
    return predicted, metrics

In [43]:
results = forecast_store_with_split(total_sales_by_store_id)
predicted = results[0]
metrics = results[1]



In [18]:
predicted.head()

Unnamed: 0,Central_1,Central_2,Central_3,East_1,East_2,East_3,East_4,West_1,West_2,West_3
2015-10-22,2618.019045,3397.998214,3340.686999,3710.945626,3437.969981,5282.933238,2343.400983,3143.869373,3338.650585,2938.792121
2015-10-23,2914.468235,3811.553853,3636.99518,4434.648598,4255.154853,5767.135474,2458.968766,3783.321163,3713.913456,3521.853375
2015-10-24,3585.186209,4491.676404,4218.092068,5420.853604,5690.777512,7013.848344,2780.835662,4544.121477,4047.967266,3942.577652
2015-10-25,3858.449446,4648.776004,4355.1899,5742.453549,5764.481306,7441.107488,2869.898963,4245.433596,3791.362109,3899.64699
2015-10-26,2956.519584,3770.137778,3740.800416,4292.789162,3606.727443,6126.721215,2594.39583,3130.432846,3342.348962,3126.984839


In [44]:
metrics

Unnamed: 0,Central_1,Central_2,Central_3,East_1,East_2,East_3,East_4,West_1,West_2,West_3
0,575.550081,803.789551,532.69978,1160.767219,630.528572,1350.091336,314.625578,577.019259,3039.322056,745.853927
1,463.194153,647.83205,388.749761,939.533626,351.149794,1092.172194,207.459722,387.160598,2640.955261,555.952677
2,17.104148,inf,6.201884,inf,11.320082,6.872864,inf,10.042474,14.874789,18.343166


In [45]:
# replace inf values with NaN
metrics.replace([np.inf, -np.inf], np.nan, inplace=True)
metrics['sum'] = metrics.apply(lambda row: row[row.notnull()].sum(), axis=1)

print(metrics['sum'])

0    9730.247360
1    7674.159835
2      84.759407
Name: sum, dtype: float64


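Summing over the 10 stores, we get a total root mean square error (RMSE) of about 9730 and a total mean absolute error (MAE) of about 7674. The MAPE total (84.76) excludes the stores whose MAPE was infinite.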
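
The `inf` entries in the MAPE row are what the expression `abs((test - forecast)/test).mean()` produces when a test day has zero sales and a non-zero forecast. One option, sketched below, is to compute MAPE only over days with non-zero actuals and to report how many days were skipped. The function name `mape_nonzero` and the sample values are hypothetical and not part of the notebook.

```python
def mape_nonzero(actual, forecast):
    """Mean absolute percentage error over entries whose actual value is non-zero.

    Returns (mape, skipped), where skipped counts the zero actuals left out.
    """
    pairs = [(a, f) for a, f in zip(actual, forecast) if a != 0]
    skipped = len(actual) - len(pairs)
    if not pairs:
        return float("nan"), skipped
    mape = sum(abs((a - f) / a) for a, f in pairs) / len(pairs)
    return mape, skipped


# Made-up values: the second day has zero sales and is skipped.
actual = [100, 0, 200, 50]
forecast = [110, 5, 180, 50]
mape, skipped = mape_nonzero(actual, forecast)
print(mape, skipped)
```

Skipping days changes what the metric measures, so the number of skipped days should be reported alongside it.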

### Creating the final prediction dataset to be submitted on Kaggle

In [20]:
predicted = predicted.T

# matrix mult of the item weights and store forecasts to obtain item forecast
# note: `predicted` starts right after the 90% training split (2015-10-22 in the output above),
# so the d_1920..d_1940 labels assigned below are positional and do not match those dates
item_sales = item_sales_df.dot(predicted)
columns = [f'd_{i}' for i in range(1920, 1941)]
item_sales.columns = columns
item_sales.head()

id,d_1920,d_1921,d_1922,d_1923,d_1924,d_1925,d_1926,d_1927,d_1928,d_1929,...,d_1931,d_1932,d_1933,d_1934,d_1935,d_1936,d_1937,d_1938,d_1939,d_1940
Beauty_1_001_Central_1,0.448502,0.499288,0.614191,0.661005,0.506492,0.459459,0.459752,0.455011,0.505796,0.620699,...,0.513,0.465967,0.46626,0.461519,0.512305,0.627208,0.674021,0.519509,0.472475,0.472769
Beauty_1_001_Central_2,0.378888,0.425001,0.500837,0.518355,0.420383,0.360545,0.366046,0.375143,0.421256,0.497092,...,0.416638,0.3568,0.3623,0.371398,0.417511,0.493347,0.510864,0.412892,0.353054,0.358555
Beauty_1_001_Central_3,0.500309,0.544685,0.631711,0.652243,0.560231,0.498344,0.486552,0.498321,0.542697,0.629724,...,0.558243,0.496356,0.484565,0.496334,0.54071,0.627736,0.648268,0.556256,0.494369,0.482577
Beauty_1_001_East_1,0.693808,0.829113,1.013496,1.073623,0.802591,0.690174,0.673551,0.681171,0.816476,1.000859,...,0.789953,0.677537,0.660914,0.668533,0.803839,0.988222,1.048349,0.777316,0.6649,0.648277
Beauty_1_001_East_2,0.653917,0.809349,1.082411,1.09643,0.686016,0.627758,0.613063,0.655192,0.810624,1.083686,...,0.68729,0.629032,0.614337,0.656466,0.811898,1.08496,1.098979,0.688564,0.630307,0.615612


In [None]:
item_sales.to_csv('sales_pred_split.csv')

## Training on the entire dataset

In [22]:
# defining a function that trains the model on the whole dataset instead of splitting it into training and testing data
def forecast_store_complete(df):
    # set_index returns a new frame, so the caller's DataFrame is left unchanged
    df = df.set_index('date')
    groups = df.groupby('store_id')
    forecasts = {}
    
    # forecast for each store using ETS; the smoothing parameters are estimated by fit()
    for name, group in groups:
        id_data = group['total_sales_in_store']
        ets_model = ExponentialSmoothing(id_data, trend = 'add', seasonal = 'add', seasonal_periods = 7, freq = 'D')
        ets_fit = ets_model.fit()
        forecast = ets_fit.forecast(21)
        forecasts.update({name:forecast})
    
    sales_pred_df = pd.DataFrame(forecasts)
    sales_pred_df.to_csv('sales_pred_df.csv')
    
    return sales_pred_df

In [23]:
# grouping the dataset by store and date then finding the total sales of each store per day
total_sales_by_store_id = temp_df.groupby(["store_id","date"]).sum().reset_index()
total_sales_by_store_id = total_sales_by_store_id.rename(columns={'sales': 'total_sales_in_store'})
total_sales_by_store_id.head()

Unnamed: 0,store_id,date,total_sales_in_store
0,Central_1,2011-01-29,2556
1,Central_1,2011-01-30,2687
2,Central_1,2011-01-31,1822
3,Central_1,2011-02-01,2258
4,Central_1,2011-02-02,1694


In [24]:
sales_pred_df = forecast_store_complete(total_sales_by_store_id)



In [25]:
sales_pred_df.head()

Unnamed: 0,Central_1,Central_2,Central_3,East_1,East_2,East_3,East_4,West_1,West_2,West_3
2016-05-01,4214.118195,4711.870499,4469.894595,5818.359159,5837.410227,7460.825367,3055.147598,4815.405533,5119.604312,4144.684098
2016-05-02,3315.502221,3778.735281,3871.150115,4369.562931,3920.085438,6008.756317,2746.600779,3252.012908,4581.395624,3146.696505
2016-05-03,3111.133887,3374.505071,3595.406654,3860.875824,3723.389579,5476.101904,2543.744346,3107.687319,4508.211118,3114.811765
2016-05-04,3083.700151,3373.280073,3513.52405,3794.594981,3682.916384,5282.672931,2493.815149,3205.708429,4529.290391,2998.987622
2016-05-05,3083.485841,3480.349706,3592.580059,3800.93051,3708.587992,5207.942246,2506.301497,3265.659963,4617.970485,3159.074417


### Creating the final prediction dataset to be submitted on Kaggle

In [155]:
sales_pred_df = sales_pred_df.T

# matrix mult of the item weights and store forecasts to obtain item forecast
item_sales_pred = item_sales_df.dot(sales_pred_df)
columns = [f'd_{i}' for i in range(1920, 1941)]
item_sales_pred.columns = columns
item_sales_pred.head()

id,d_1920,d_1921,d_1922,d_1923,d_1924,d_1925,d_1926,d_1927,d_1928,d_1929,...,d_1931,d_1932,d_1933,d_1934,d_1935,d_1936,d_1937,d_1938,d_1939,d_1940
Beauty_1_001_Central_1,0.721936,0.56799,0.532979,0.52828,0.528243,0.563976,0.682679,0.728369,0.574424,0.539413,...,0.534676,0.570409,0.689113,0.734803,0.580858,0.545847,0.541147,0.54111,0.576843,0.695546
Beauty_1_001_Central_2,0.52539,0.421342,0.376269,0.376132,0.388071,0.414186,0.500306,0.521728,0.41768,0.372607,...,0.384409,0.410524,0.496644,0.518065,0.414018,0.368944,0.368808,0.380747,0.406862,0.492982
Beauty_1_001_Central_3,0.669421,0.579752,0.538456,0.526193,0.538033,0.561548,0.652034,0.671065,0.581395,0.540099,...,0.539676,0.563191,0.653677,0.672708,0.583038,0.541742,0.529479,0.541319,0.564834,0.65532
Beauty_1_001_East_1,1.087815,0.816944,0.721839,0.709447,0.710631,0.807668,1.059036,1.075415,0.804544,0.709439,...,0.698231,0.795268,1.046636,1.063015,0.792144,0.697039,0.684647,0.685832,0.782868,1.034236
Beauty_1_001_East_2,1.110302,0.745618,0.708205,0.700507,0.70539,0.839566,1.117065,1.111051,0.746368,0.708955,...,0.70614,0.840316,1.117815,1.111801,0.747118,0.709705,0.702007,0.70689,0.841066,1.118565


In [None]:
item_sales_pred.to_csv('sales_pred.csv')

# Clustering by subcategory

## Obtaining item weights by subcat for the last week

In [26]:
# subsetting for the last 7 days only
traincal['d'] = traincal['d'].astype(int)
traincal_lastweek_subcat = traincal[traincal['d'] > 1912] 
traincal_lastweek_subcat = traincal_lastweek_subcat.drop(columns = ['date', 'd'])

In [27]:
# get last week sales sums for item and subcategory
traincal_lastweek_subcat['item_lastweek_sales'] = pd.DataFrame(traincal_lastweek_subcat.groupby('id')['sales'].transform('sum'))

traincal_lastweek_subcat['subcat_lastweek_sales'] = pd.DataFrame(traincal_lastweek_subcat.groupby('subcat_id')['sales'].transform('sum'))

# calc weights
traincal_lastweek_subcat['w'] = traincal_lastweek_subcat['item_lastweek_sales'] / traincal_lastweek_subcat['subcat_lastweek_sales']

In [28]:
# removing duplicated weights
traincal_lastweek_subcat = traincal_lastweek_subcat[['id','subcat_id','w']]
traincal_lastweek_subcat = pd.DataFrame(traincal_lastweek_subcat[['id', 'subcat_id', 'w']].drop_duplicates()).reset_index(drop = True)

In [29]:
# Pivot sales dataframe to get item-sales mapping
item_sales_df_subcat = traincal_lastweek_subcat.pivot(index='id', columns='subcat_id', values='w')
item_sales_df_subcat = item_sales_df_subcat.fillna(0)
item_sales_df_subcat.head()

id,Beauty_1,Beauty_2,Cleaning_1,Cleaning_2,Food_1,Food_2,Food_3
Beauty_1_001_Central_1,0.000156,0.0,0.0,0.0,0.0,0.0,0.0
Beauty_1_001_Central_2,0.000117,0.0,0.0,0.0,0.0,0.0,0.0
Beauty_1_001_Central_3,0.000156,0.0,0.0,0.0,0.0,0.0,0.0
Beauty_1_001_East_1,0.000234,0.0,0.0,0.0,0.0,0.0,0.0
Beauty_1_001_East_2,0.000234,0.0,0.0,0.0,0.0,0.0,0.0


## Data to be used for forecasting

In [30]:
# selecting the 'date', 'sales', and 'subcat_id' columns
temp_df_subcat = traincal[["date", "sales", "subcat_id"]]
temp_df_subcat.head()

Unnamed: 0,date,sales,subcat_id
0,2011-01-29,0,Beauty_1
1,2011-01-29,0,Beauty_1
2,2011-01-29,0,Beauty_1
3,2011-01-29,0,Beauty_1
4,2011-01-29,0,Beauty_1


In [31]:
# grouping the dataset by subcat and date then finding the total sales of each subcat per day
total_sales_by_subcat_id = temp_df_subcat.groupby(["subcat_id","date"]).sum().reset_index()
total_sales_by_subcat_id = total_sales_by_subcat_id.rename(columns={'sales': 'total_sales_in_subcat'})
total_sales_by_subcat_id.head()

Unnamed: 0,subcat_id,date,total_sales_in_subcat
0,Beauty_1,2011-01-29,3610
1,Beauty_1,2011-01-30,3172
2,Beauty_1,2011-01-31,2497
3,Beauty_1,2011-02-01,2531
4,Beauty_1,2011-02-02,1714


## Splitting the data into training and testing datasets

In [32]:
# defining a function that splits the data for one subcategory into training and test data sets
def forecast_subcat_with_split(df):
    # set_index returns a new frame, so the caller's DataFrame is left unchanged
    df = df.set_index('date')
    groups = df.groupby('subcat_id')
    forecasts = {}
    real_forecasts = {}
    metrics = {}
    
    # forecast for each subcategory using ETS; the smoothing parameters are estimated by fit()
    for name, group in groups:
        id_data = group['total_sales_in_subcat']
        size = int(len(id_data) * 0.90)
        train, test = id_data[0:size], id_data[size:len(id_data)]

        ets_model = ExponentialSmoothing(train, trend = 'add', seasonal = 'add', seasonal_periods = 7, freq = 'D')
        ets_fit = ets_model.fit()
        forecast = ets_fit.forecast(len(test))
        real_forecast = ets_fit.forecast(21)
        forecasts.update({name:forecast})
        real_forecasts.update({name:real_forecast})
        
        #calculating metrics
        rmse = sqrt(mean_squared_error(test, forecast))
        mae = abs(test - forecast).mean()
        mape = abs((test - forecast)/test).mean()
        metrics.update({name:[rmse, mae, mape]})
    
    predicted = pd.DataFrame(real_forecasts)
    metrics = pd.DataFrame(metrics)
    
    return predicted, metrics

In [33]:
results_subcat = forecast_subcat_with_split(total_sales_by_subcat_id)
predicted_subcat = results_subcat[0]
metrics_subcat = results_subcat[1]



In [162]:
predicted_subcat.head()

Unnamed: 0,Beauty_1,Beauty_2,Cleaning_1,Cleaning_2,Food_1,Food_2,Food_3
2015-10-22,3168.238094,593.025143,6535.149695,1604.967296,2671.877276,4309.728949,14575.53685
2015-10-23,3660.975032,614.887287,7627.050439,1849.236764,3051.974935,4572.365703,16540.179853
2015-10-24,4169.821447,643.459793,9409.084177,2261.807712,3323.368362,5451.192647,19701.938832
2015-10-25,3948.39676,644.383723,9260.868206,2207.265202,2988.373474,5901.885095,20678.894178
2015-10-26,3256.24657,576.078746,7175.741268,1664.908721,2449.047753,5077.827784,15853.639218


In [34]:
metrics_subcat

Unnamed: 0,Beauty_1,Beauty_2,Cleaning_1,Cleaning_2,Food_1,Food_2,Food_3
0,408.719443,261.766133,936.091888,291.170871,1036.990755,1832.044451,10453.723088
1,246.384409,239.627256,632.091063,217.00133,873.514344,1463.059634,8694.725617
2,inf,inf,inf,9.506797,14.237543,inf,5.48925


In [46]:
# replace inf values with NaN
metrics_subcat.replace([np.inf, -np.inf], np.nan, inplace=True)
metrics_subcat['sum'] = metrics_subcat.apply(lambda row: row[row.notnull()].sum(), axis=1)

# obtaining the rmse, mae, mape
print(metrics_subcat['sum'])

0    15220.506628
1    12366.403654
2       29.233590
Name: sum, dtype: float64


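Summing over the 7 subcategories, we get a total root mean square error (RMSE) of about 15220.51 and a total mean absolute error (MAE) of about 12366.40. The MAPE total (29.23) excludes the subcategories whose MAPE was infinite.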

In [47]:
predicted_subcat = predicted_subcat.T
predicted_subcat.head()

Unnamed: 0,2015-10-22,2015-10-23,2015-10-24,2015-10-25,2015-10-26,2015-10-27,2015-10-28,2015-10-29,2015-10-30,2015-10-31,...,2015-11-02,2015-11-03,2015-11-04,2015-11-05,2015-11-06,2015-11-07,2015-11-08,2015-11-09,2015-11-10,2015-11-11
Beauty_1,3168.238094,3660.975032,4169.821447,3948.39676,3256.24657,3073.690829,3198.02986,3170.145711,3662.882649,4171.729064,...,3258.154187,3075.598446,3199.937477,3172.053328,3664.790266,4173.63668,3952.211993,3260.061804,3077.506063,3201.845094
Beauty_2,592.992837,614.804534,643.417121,644.298538,576.055376,581.599107,590.388623,595.310553,617.122249,645.734836,...,578.373092,583.916822,592.706338,597.628268,619.439965,648.052551,648.933968,580.690807,586.234537,595.024053
Cleaning_1,6535.149695,7627.050439,9409.084177,9260.868206,7175.741268,6480.425416,6421.143509,6567.542447,7659.443191,9441.476929,...,7208.13402,6512.818168,6453.536261,6599.935199,7691.835943,9473.869681,9325.65371,7240.526772,6545.21092,6485.929013
Cleaning_2,1604.967296,1849.236764,2261.807712,2207.265202,1664.908721,1537.928735,1535.442528,1600.090501,1844.359969,2256.930917,...,1660.031926,1533.05194,1530.565733,1595.213706,1839.483175,2252.054122,2197.511613,1655.155131,1528.175145,1525.688938
Food_1,2671.877276,3051.974935,3323.368362,2988.373474,2449.047753,2481.915655,2491.483521,2630.856643,3010.954301,3282.347728,...,2408.027119,2440.895022,2450.462887,2589.836009,2969.933668,3241.327095,2906.332207,2367.006486,2399.874388,2409.442254


In [48]:
# matrix mult of the item weights and subcat forecasts to obtain item forecast
# note: `predicted_subcat` starts right after the 90% training split (2015-10-22 in the output above),
# so the d_1920..d_1940 labels assigned below are positional and do not match those dates
item_sales_subcat = item_sales_df_subcat.dot(predicted_subcat)
columns = [f'd_{i}' for i in range(1920, 1941)]
item_sales_subcat.columns = columns
item_sales_subcat.head()

id,d_1920,d_1921,d_1922,d_1923,d_1924,d_1925,d_1926,d_1927,d_1928,d_1929,...,d_1931,d_1932,d_1933,d_1934,d_1935,d_1936,d_1937,d_1938,d_1939,d_1940
Beauty_1_001_Central_1,0.495095,0.572094,0.651611,0.617009,0.508848,0.48032,0.499751,0.495393,0.572392,0.651909,...,0.509146,0.480619,0.500049,0.495691,0.572691,0.652207,0.617605,0.509444,0.480917,0.500347
Beauty_1_001_Central_2,0.371321,0.429071,0.488708,0.462757,0.381636,0.36024,0.374813,0.371545,0.429294,0.488932,...,0.38186,0.360464,0.375037,0.371769,0.429518,0.489155,0.463204,0.382083,0.360688,0.37526
Beauty_1_001_Central_3,0.495095,0.572094,0.651611,0.617009,0.508848,0.48032,0.499751,0.495393,0.572392,0.651909,...,0.509146,0.480619,0.500049,0.495691,0.572691,0.652207,0.617605,0.509444,0.480917,0.500347
Beauty_1_001_East_1,0.742643,0.858142,0.977416,0.925514,0.763272,0.720481,0.749626,0.74309,0.858589,0.977864,...,0.763719,0.720928,0.750073,0.743537,0.859036,0.978311,0.926408,0.764167,0.721375,0.75052
Beauty_1_001_East_2,0.742643,0.858142,0.977416,0.925514,0.763272,0.720481,0.749626,0.74309,0.858589,0.977864,...,0.763719,0.720928,0.750073,0.743537,0.859036,0.978311,0.926408,0.764167,0.721375,0.75052


In [None]:
item_sales_subcat.to_csv('sales_pred_subcat_train.csv')

## Training on the entire dataset

In [186]:
# grouping the dataset by subcategory and date then finding the total sales of each subcategory per day
total_sales_by_subcat_id = temp_df_subcat.groupby(["subcat_id","date"]).sum().reset_index()
total_sales_by_subcat_id = total_sales_by_subcat_id.rename(columns={'sales': 'total_sales_in_subcat'})
total_sales_by_subcat_id.head()

Unnamed: 0,subcat_id,date,total_sales_in_subcat
0,Beauty_1,2011-01-29,3610
1,Beauty_1,2011-01-30,3172
2,Beauty_1,2011-01-31,2497
3,Beauty_1,2011-02-01,2531
4,Beauty_1,2011-02-02,1714


In [184]:
# defining a function that trains the model on the whole dataset instead of splitting it into training and testing data
def forecast_subcat_complete(df):
    # set_index returns a new frame, so the caller's DataFrame is left unchanged
    df = df.set_index('date')
    groups = df.groupby('subcat_id')
    forecasts = {}
    
    # forecast for each subcategory using ETS; the smoothing parameters are estimated by fit()
    for name, group in groups:
        id_data = group['total_sales_in_subcat']
        ets_model = ExponentialSmoothing(id_data, trend = 'add', seasonal = 'add', seasonal_periods = 7, freq = 'D')
        ets_fit = ets_model.fit()
        forecast = ets_fit.forecast(21)
        forecasts.update({name:forecast})
    
    sales_pred_df_subcat = pd.DataFrame(forecasts)
    
    return sales_pred_df_subcat

In [187]:
results_subcat_complete = forecast_subcat_complete(total_sales_by_subcat_id)




In [188]:
results_subcat_complete.head()

Unnamed: 0,Beauty_1,Beauty_2,Cleaning_1,Cleaning_2,Food_1,Food_2,Food_3
2016-05-01,4157.412106,539.247897,9704.828377,2549.66482,3785.941591,6187.654168,22629.363601
2016-05-02,3356.966167,458.954425,7455.376701,1919.841772,3175.798209,5076.817887,17354.643441
2016-05-03,3252.69987,464.597071,6799.760371,1797.172607,3264.216426,4717.72554,16177.925111
2016-05-04,3283.05023,491.905024,6710.487083,1804.320875,3300.794968,4486.452174,15909.393834
2016-05-05,3313.907781,484.306625,6930.284774,1859.314063,3486.920487,4437.454322,16214.3223


In [189]:
# Creating the final prediction dataset to be submitted on Kaggle
results_subcat_complete = results_subcat_complete.T

# matrix mult of the item weights and subcategory forecasts to obtain item forecast
item_sales_subcat_complete = item_sales_df_subcat.dot(results_subcat_complete)
columns = [f'd_{i}' for i in range(1920, 1941)]
item_sales_subcat_complete.columns = columns
item_sales_subcat_complete.head()

id,d_1920,d_1921,d_1922,d_1923,d_1924,d_1925,d_1926,d_1927,d_1928,d_1929,...,d_1931,d_1932,d_1933,d_1934,d_1935,d_1936,d_1937,d_1938,d_1939,d_1940
Beauty_1_001_Central_1,0.649672,0.524587,0.508294,0.513037,0.517859,0.573278,0.696157,0.651504,0.52642,0.510126,...,0.519691,0.575111,0.69799,0.653337,0.528253,0.511959,0.516702,0.521524,0.576944,0.699822
Beauty_1_001_Central_2,0.487254,0.393441,0.38122,0.384778,0.388394,0.429959,0.522118,0.488628,0.394815,0.382595,...,0.389769,0.431333,0.523492,0.490003,0.396189,0.383969,0.387526,0.391143,0.432708,0.524867
Beauty_1_001_Central_3,0.649672,0.524587,0.508294,0.513037,0.517859,0.573278,0.696157,0.651504,0.52642,0.510126,...,0.519691,0.575111,0.69799,0.653337,0.528253,0.511959,0.516702,0.521524,0.576944,0.699822
Beauty_1_001_East_1,0.974508,0.786881,0.762441,0.769555,0.776788,0.859918,1.044236,0.977257,0.78963,0.76519,...,0.779537,0.862667,1.046985,0.980005,0.792379,0.767939,0.775053,0.782286,0.865415,1.049734
Beauty_1_001_East_2,0.974508,0.786881,0.762441,0.769555,0.776788,0.859918,1.044236,0.977257,0.78963,0.76519,...,0.779537,0.862667,1.046985,0.980005,0.792379,0.767939,0.775053,0.782286,0.865415,1.049734


In [None]:
item_sales_subcat_complete.to_csv('sales_pred_subcat_complete.csv')

# Clustering by both store and subcategory

Now we consider another way to cluster the data: by both subcategory and store. We do this by creating a new variable 'subcat_store_id', which is the concatenation of 'subcat_id' and 'store_id'.

In [10]:
#Creating a new category subcategory id + store id
traincal['subcat_store_id'] = traincal['subcat_id'] + traincal['store_id']
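
A small sketch of building the combined key on toy data. The notebook concatenates the two columns directly; the second variant below adds an underscore separator purely for readability. That separator and the column name `subcat_store_id_sep` are assumptions of this example, and using them in the notebook would change the column labels shown in the outputs that follow.

```python
import pandas as pd

# Toy rows with made-up subcategory/store combinations.
toy = pd.DataFrame({
    "subcat_id": ["Beauty_1", "Beauty_1", "Food_3"],
    "store_id": ["East_1", "West_2", "East_1"],
})

# Direct concatenation, as in the notebook.
toy["subcat_store_id"] = toy["subcat_id"] + toy["store_id"]

# Same key with a separator (hypothetical column, for illustration only).
toy["subcat_store_id_sep"] = toy["subcat_id"].str.cat(toy["store_id"], sep="_")

print(toy)
```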

## Obtaining item weights based on subcat_store_id sales

In [11]:
# subsetting for the last 7 days only
traincal['d'] = traincal['d'].astype(int)
traincal_lastweek = traincal[traincal['d'] > 1912] 
traincal_lastweek = traincal_lastweek.drop(columns = ['date', 'd'])

In [12]:
# get last week sales sums for item and subcat_store_id
traincal_lastweek['item_lastweek_sales'] = pd.DataFrame(traincal_lastweek.groupby('id')['sales'].transform('sum'))

traincal_lastweek['subcat_store_lastweek_sales'] = pd.DataFrame(traincal_lastweek.groupby('subcat_store_id')['sales'].transform('sum'))

# calc weights
traincal_lastweek['w'] = traincal_lastweek['item_lastweek_sales'] / traincal_lastweek['subcat_store_lastweek_sales']

In [13]:
# removing duplicated weights
traincal_lastweek = traincal_lastweek[['id','subcat_store_id','w']]
traincal_lastweek = pd.DataFrame(traincal_lastweek[['id', 'subcat_store_id', 'w']].drop_duplicates()).reset_index(drop = True)

In [15]:
# Pivot sales dataframe to get item-sales mapping
item_sales_subcat_store_df = traincal_lastweek.pivot(index='id', columns='subcat_store_id', values='w')
item_sales_subcat_store_df = item_sales_subcat_store_df.fillna(0)
item_sales_subcat_store_df.head()

id,Beauty_1Central_1,Beauty_1Central_2,Beauty_1Central_3,Beauty_1East_1,Beauty_1East_2,Beauty_1East_3,Beauty_1East_4,Beauty_1West_1,Beauty_1West_2,Beauty_1West_3,...,Food_3Central_1,Food_3Central_2,Food_3Central_3,Food_3East_1,Food_3East_2,Food_3East_3,Food_3East_4,Food_3West_1,Food_3West_2,Food_3West_3
Beauty_1_001_Central_1,0.002169,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Beauty_1_001_Central_2,0.0,0.001207,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Beauty_1_001_Central_3,0.0,0.0,0.001569,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Beauty_1_001_East_1,0.0,0.0,0.0,0.00171,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Beauty_1_001_East_2,0.0,0.0,0.0,0.0,0.002301,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Preparing data to train the model

In [16]:
# selecting the 'date', 'sales', and 'subcat_store_id' columns
temp_df_subcat_store = traincal[["date", "sales", "subcat_store_id"]]
temp_df_subcat_store.head()

Unnamed: 0,date,sales,subcat_store_id
0,2011-01-29,0,Beauty_1East_1
1,2011-01-29,0,Beauty_1East_1
2,2011-01-29,0,Beauty_1East_1
3,2011-01-29,0,Beauty_1East_1
4,2011-01-29,0,Beauty_1East_1


In [17]:
# grouping the dataset by subcat_store_id and date, then summing to get the total sales of each group per day
total_sales_by_subcat_store_id = temp_df_subcat_store.groupby(["subcat_store_id","date"]).sum().reset_index()
total_sales_by_subcat_store_id = total_sales_by_subcat_store_id.rename(columns={'sales': 'total_sales_in_subcat_store'})
total_sales_by_subcat_store_id.head()

Unnamed: 0,subcat_store_id,date,total_sales_in_subcat_store
0,Beauty_1Central_1,2011-01-29,241
1,Beauty_1Central_1,2011-01-30,246
2,Beauty_1Central_1,2011-01-31,96
3,Beauty_1Central_1,2011-02-01,229
4,Beauty_1Central_1,2011-02-02,91


## Splitting into train and test

In [18]:
# defining a function that, for each subcat_store series, fits ETS on the first 90% of days
# and evaluates the forecast on the remaining 10%
def forecast_subcat_store_with_split(df):
    df = df.set_index('date')  # returns a new frame, so the caller's dataframe is left unchanged
    groups = df.groupby('subcat_store_id')
    real_forecasts = {}
    metrics = {}
    
    # forecast for each subcat_store using ETS with smoothing parameters optimized by fit()
    for name, group in groups:
        id_data = group['total_sales_in_subcat_store']
        size = int(len(id_data) * 0.90)
        train, test = id_data[:size], id_data[size:]

        ets_model = ExponentialSmoothing(train, trend = 'add', seasonal = 'add', seasonal_periods = 7, freq = 'D')
        ets_fit = ets_model.fit()
        forecast = ets_fit.forecast(len(test))   # compared against the test set below
        real_forecast = ets_fit.forecast(21)     # the 21 days following the end of the training window
        real_forecasts.update({name:real_forecast})
        
        # calculating metrics; mape becomes inf when the test set contains zero-sales days
        rmse = sqrt(mean_squared_error(test, forecast))
        mae = abs(test - forecast).mean()
        mape = abs((test - forecast)/test).mean()
        metrics.update({name:[rmse, mae, mape]})
    
    predicted = pd.DataFrame(real_forecasts)
    metrics = pd.DataFrame(metrics)  # rows: 0 = rmse, 1 = mae, 2 = mape
    
    return predicted, metrics

In [19]:
results_subcat_store = forecast_subcat_store_with_split(total_sales_by_subcat_store_id)
predicted_subcat_store = results_subcat_store[0]
metrics_subcat_store = results_subcat_store[1]



In [20]:
metrics_subcat_store.head()

Unnamed: 0,Beauty_1Central_1,Beauty_1Central_2,Beauty_1Central_3,Beauty_1East_1,Beauty_1East_2,Beauty_1East_3,Beauty_1East_4,Beauty_1West_1,Beauty_1West_2,Beauty_1West_3,...,Food_3Central_1,Food_3Central_2,Food_3Central_3,Food_3East_1,Food_3East_2,Food_3East_3,Food_3East_4,Food_3West_1,Food_3West_2,Food_3West_3
0,78.12056,58.665242,85.685194,90.118298,84.678003,93.77835,80.776972,86.763636,52.585049,51.327531,...,372.042524,330.066194,332.744963,387.92262,355.595101,788.386997,175.866213,280.847137,1519.964088,1978.556631
1,60.185202,44.03616,69.358015,65.516532,64.5904,71.060402,61.109531,65.792146,39.730583,39.488348,...,320.152838,261.24437,266.90578,304.989357,251.814687,645.499101,130.151521,199.077734,1305.342717,1702.456929
2,inf,inf,inf,inf,inf,inf,inf,inf,inf,inf,...,7.75764,inf,3.848337,inf,5.260167,3.141835,inf,8.33918,5.850228,4.650352


In [35]:
# replace inf values with NaN so that they are skipped when summing
metrics_subcat_store.replace([np.inf, -np.inf], np.nan, inplace=True)
metrics_subcat_store['sum'] = metrics_subcat_store.sum(axis=1)

# rmse, mae, and mape (rows 0, 1, 2) summed over all subcat_store groups
print(metrics_subcat_store['sum'])

0    24594.036296
1    19837.816533
2       83.126192
Name: sum, dtype: float64


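Summing over all subcategory-store groups, we get a root mean square error (rmse) of 24594.04 and a mean absolute error (mae) of 19837.82. The mape total of 83.13 covers only the groups whose mape is finite, since the inf values were replaced with NaN before summing.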
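The `inf` entries in the mape row appear because mape divides by the actual sales, which are zero on some test days. One possible way to get a finite percentage-style error, shown here only as a sketch, is to skip zero-sales days or to use a weighted variant (sum of absolute errors divided by sum of actuals, often called WAPE). The function names and sample numbers below are hypothetical and are not part of the notebook's pipeline.

```python
def mape_skip_zeros(actual, predicted):
    """Mean absolute percentage error over days with non-zero actual sales."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    if not pairs:
        return float('nan')  # undefined when every actual value is zero
    return sum(abs(a - p) / abs(a) for a, p in pairs) / len(pairs)


def wape(actual, predicted):
    """Sum of absolute errors divided by the sum of absolute actuals."""
    total_actual = sum(abs(a) for a in actual)
    if total_actual == 0:
        return float('nan')  # undefined when there were no sales at all
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / total_actual


# made-up test window containing one zero-sales day
actual = [10, 0, 20, 40]
predicted = [12, 3, 15, 40]

print(mape_skip_zeros(actual, predicted))
print(wape(actual, predicted))
```

Skipping zero days ignores the error made on exactly those days, whereas the weighted variant still counts it in the numerator, so the two numbers answer slightly different questions.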

In [22]:
predicted_subcat_store = predicted_subcat_store.T
predicted_subcat_store.head()

Unnamed: 0,2015-10-22,2015-10-23,2015-10-24,2015-10-25,2015-10-26,2015-10-27,2015-10-28,2015-10-29,2015-10-30,2015-10-31,...,2015-11-02,2015-11-03,2015-11-04,2015-11-05,2015-11-06,2015-11-07,2015-11-08,2015-11-09,2015-11-10,2015-11-11
Beauty_1Central_1,196.744174,231.667139,255.042145,267.65056,199.756232,198.502971,189.030388,194.985401,229.908365,253.283371,...,197.997459,196.744197,187.271615,193.226627,228.149591,251.524598,264.133012,196.238685,194.985424,185.512841
Beauty_1Central_2,298.113301,343.669152,373.066917,387.867493,323.194782,275.539866,300.390107,298.465384,344.021235,373.419,...,323.546864,275.891949,300.742189,298.817466,344.373317,373.771082,388.571658,323.898947,276.244031,301.094271
Beauty_1Central_3,304.884808,355.380699,399.64537,399.895057,338.145563,314.44794,299.721329,308.940721,359.436613,403.701284,...,342.201476,318.503854,303.777243,312.996635,363.492526,407.757198,408.006885,346.25739,322.559768,307.833157
Beauty_1East_1,459.356297,518.855985,579.763352,556.596033,474.674004,435.868182,464.500105,458.881687,518.381376,579.288743,...,474.199394,435.393572,464.025495,458.407078,517.906766,578.814133,555.646814,473.724785,434.918962,463.550886
Beauty_1East_2,327.816908,383.857445,492.935344,456.313334,337.608666,309.004191,322.576737,330.156774,386.19731,495.27521,...,339.948531,311.344057,324.916603,332.49664,388.537176,497.615076,460.993065,342.288397,313.683922,327.256469


In [63]:
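# matrix mult of the item weights and subcat_store forecasts to obtain item forecast
# note: these forecasts come from the 90% training split, so they cover 2015-10-22 to 2015-11-11;
# the d_1920..d_1940 labels below are just relabelled column names, not the actual forecast dates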
item_sales_subcat_store = item_sales_subcat_store_df.dot(predicted_subcat_store)
columns = ['d_1920', 'd_1921','d_1922','d_1923', 'd_1924', 'd_1925','d_1926','d_1927','d_1928','d_1929','d_1930',
             'd_1931','d_1932','d_1933','d_1934','d_1935','d_1936','d_1937','d_1938','d_1939','d_1940']
item_sales_subcat_store.columns = columns
item_sales_subcat_store.head()

id,d_1920,d_1921,d_1922,d_1923,d_1924,d_1925,d_1926,d_1927,d_1928,d_1929,...,d_1931,d_1932,d_1933,d_1934,d_1935,d_1936,d_1937,d_1938,d_1939,d_1940
Beauty_1_001_Central_1,0.426777,0.502532,0.553237,0.580587,0.433311,0.430592,0.410044,0.422962,0.498717,0.549422,...,0.429496,0.426777,0.406229,0.419147,0.494901,0.545607,0.572957,0.42568,0.422962,0.402414
Beauty_1_001_Central_2,0.359751,0.414725,0.450201,0.468062,0.390018,0.33251,0.362498,0.360175,0.41515,0.450626,...,0.390443,0.332935,0.362923,0.3606,0.415575,0.451051,0.468912,0.390868,0.33336,0.363348
Beauty_1_001_Central_3,0.478251,0.55746,0.626895,0.627286,0.530424,0.493252,0.470151,0.484613,0.563822,0.633257,...,0.536787,0.499614,0.476513,0.490975,0.570184,0.639619,0.640011,0.543149,0.505976,0.482876
Beauty_1_001_East_1,0.785448,0.887186,0.991331,0.951717,0.81164,0.745286,0.794244,0.784637,0.886375,0.990519,...,0.810828,0.744475,0.793432,0.783825,0.885563,0.989708,0.950094,0.810017,0.743663,0.79262
Beauty_1_001_East_2,0.755433,0.883929,1.115274,1.030867,0.78091,0.719076,0.748935,0.760904,0.889399,1.120744,...,0.786381,0.724547,0.754406,0.766375,0.89487,1.126215,1.041809,0.791851,0.730017,0.759877
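
The matrix multiplication above works because each item has a non-zero weight in exactly one group column, so the dot product reduces to "item weight times its group's forecast". A minimal sketch with made-up weights and forecasts (all names and numbers here are hypothetical):

```python
import pandas as pd

# hypothetical weight matrix: rows are items, columns are groups
weights = pd.DataFrame(
    {'A': [0.5, 0.5, 0.0], 'B': [0.0, 0.0, 1.0]},
    index=['item1', 'item2', 'item3'],
)

# hypothetical group forecasts: rows are groups, columns are forecast days
group_forecast = pd.DataFrame(
    {'day1': [100.0, 40.0], 'day2': [120.0, 30.0]},
    index=['A', 'B'],
)

# (items x groups) . (groups x days) -> (items x days)
item_forecast = weights.dot(group_forecast)
print(item_forecast)

# the item forecasts within group 'A' add back up to the group forecast
group_a_total = item_forecast.loc[['item1', 'item2']].sum()
print(group_a_total)
```

`DataFrame.dot` aligns the columns of the left frame with the index of the right frame, which is why the group forecasts are transposed so that groups are on the rows before multiplying.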


In [65]:
item_sales_subcat_store.to_csv('sales_pred_subcat_store_train.csv')

## Training on the entire dataset

In [69]:
# grouping the dataset by subcat_store_id and date, then summing to get the total sales of each group per day
total_sales_by_subcat_store_id = temp_df_subcat_store.groupby(["subcat_store_id","date"]).sum().reset_index()
total_sales_by_subcat_store_id = total_sales_by_subcat_store_id.rename(columns={'sales': 'total_sales_in_subcat_store'})
total_sales_by_subcat_store_id.head()

Unnamed: 0,subcat_store_id,date,total_sales_in_subcat_store
0,Beauty_1Central_1,2011-01-29,241
1,Beauty_1Central_1,2011-01-30,246
2,Beauty_1Central_1,2011-01-31,96
3,Beauty_1Central_1,2011-02-01,229
4,Beauty_1Central_1,2011-02-02,91


In [70]:
# defining a function that trains the model on the whole dataset instead of splitting it into training and testing data
def forecast_subcat_store_complete(df):
    df = df.set_index('date')  # returns a new frame, so the caller's dataframe is left unchanged
    groups = df.groupby('subcat_store_id')
    forecasts = {}
    
    # forecast for each subcat_store using ETS with smoothing parameters optimized by fit()
    for name, group in groups:
        id_data = group['total_sales_in_subcat_store']
        ets_model = ExponentialSmoothing(id_data, trend = 'add', seasonal = 'add', seasonal_periods = 7, freq = 'D')
        ets_fit = ets_model.fit()
        forecast = ets_fit.forecast(21)
        forecasts.update({name:forecast})
    
    sales_pred_df_subcat_store = pd.DataFrame(forecasts)
    
    return sales_pred_df_subcat_store
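
To make the model behind `ExponentialSmoothing(..., trend = 'add', seasonal = 'add', seasonal_periods = 7)` more concrete, here is a minimal pure-Python sketch of the additive Holt-Winters recursions. It is only an illustration: the smoothing parameters are fixed by hand and the initial state is a simple heuristic, both assumptions of this sketch, whereas statsmodels estimates the parameters when `fit()` is called and uses its own initialization. The function name and the sample series are hypothetical.

```python
def holt_winters_additive(y, m, alpha, beta, gamma, horizon):
    """Additive Holt-Winters with fixed smoothing parameters and a simple initial state."""
    if len(y) < 2 * m:
        raise ValueError("need at least two full seasons of data")

    # simple initial state taken from the first two seasons (an assumption of this sketch)
    first_mean = sum(y[:m]) / m
    second_mean = sum(y[m:2 * m]) / m
    level = first_mean
    trend = (second_mean - first_mean) / m
    season = [y[i] - first_mean for i in range(m)]  # one offset per position in the cycle

    # update level, trend, and seasonal offsets one observation at a time
    for t in range(m, len(y)):
        prev_level, prev_trend = level, trend
        s_old = season[t % m]
        level = alpha * (y[t] - s_old) + (1 - alpha) * (prev_level + prev_trend)
        trend = beta * (level - prev_level) + (1 - beta) * prev_trend
        season[t % m] = gamma * (y[t] - prev_level - prev_trend) + (1 - gamma) * s_old

    # h-step-ahead forecast: last level + h * trend + the matching seasonal offset
    n = len(y)
    return [level + h * trend + season[(n + h - 1) % m] for h in range(1, horizon + 1)]


# made-up series: the same weekly pattern repeated four times, no trend
weekly = [10, 12, 14, 16, 18, 20, 22]
series = weekly * 4
forecast = holt_winters_additive(series, m=7, alpha=0.3, beta=0.1, gamma=0.2, horizon=7)
print(forecast)
```

On this perfectly periodic toy series the forecast simply repeats the weekly pattern; on real sales data the level, trend, and seasonal terms all move as new observations arrive.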

In [71]:
results_subcat_store_complete = forecast_subcat_store_complete(total_sales_by_subcat_store_id)
results_subcat_store_complete.head()

Unnamed: 0,Beauty_1Central_1,Beauty_1Central_2,Beauty_1Central_3,Beauty_1East_1,Beauty_1East_2,Beauty_1East_3,Beauty_1East_4,Beauty_1West_1,Beauty_1West_2,Beauty_1West_3,...,Food_3Central_1,Food_3Central_2,Food_3Central_3,Food_3East_1,Food_3East_2,Food_3East_3,Food_3East_4,Food_3West_1,Food_3West_2,Food_3West_3
2016-05-01,322.686998,391.931505,417.599803,561.82638,445.834532,638.047503,387.946501,450.425195,262.517992,235.925432,...,1988.179074,2368.097069,2081.765833,2895.074517,2557.160232,3305.173923,1336.941022,2161.763143,2108.73905,2096.656583
2016-05-02,226.948915,344.472234,338.49718,484.468366,315.130435,476.553639,362.350119,328.832287,249.615522,209.542303,...,1585.21212,1853.179578,1774.724994,2112.073085,1736.138032,2672.37946,1200.874747,1429.231635,1807.669289,1551.47949
2016-05-03,223.523346,315.977594,321.806077,437.462063,319.795143,469.710297,363.074175,337.873695,239.875051,203.31654,...,1485.366652,1665.896125,1624.611257,1893.659718,1673.543306,2399.774386,1082.30027,1369.650538,1753.224223,1519.718755
2016-05-04,229.083819,313.374687,323.890884,431.772499,323.089033,462.008189,356.532578,344.830356,249.951488,210.292306,...,1471.080108,1652.175971,1577.499427,1887.47557,1634.886844,2305.672256,1061.410673,1438.508087,1768.319595,1472.880525
2016-05-05,223.563124,325.214676,323.722902,433.682017,316.963048,461.58797,357.415739,343.647832,259.837892,216.778275,...,1471.655572,1703.005035,1611.922106,1887.594383,1682.089068,2279.605644,1083.547307,1473.656286,1796.247336,1554.175718


In [76]:
# Creating the final prediction dataset to be submitted on Kaggle
results_subcat_store_complete = results_subcat_store_complete.T

# matrix mult of the item weights and subcat_store_id forecasts to obtain item forecast
item_sales_subcat_store_complete = item_sales_subcat_store_df.dot(results_subcat_store_complete)
columns = ['d_1920', 'd_1921','d_1922','d_1923', 'd_1924', 'd_1925','d_1926','d_1927','d_1928','d_1929','d_1930',
             'd_1931','d_1932','d_1933','d_1934','d_1935','d_1936','d_1937','d_1938','d_1939','d_1940']
item_sales_subcat_store_complete.columns = columns
item_sales_subcat_store_complete.head()

id,d_1920,d_1921,d_1922,d_1923,d_1924,d_1925,d_1926,d_1927,d_1928,d_1929,...,d_1931,d_1932,d_1933,d_1934,d_1935,d_1936,d_1937,d_1938,d_1939,d_1940
Beauty_1_001_Central_1,0.699972,0.492297,0.484866,0.496928,0.484953,0.529484,0.66454,0.70104,0.493365,0.485935,...,0.486021,0.530552,0.665608,0.702109,0.494434,0.487003,0.499065,0.487089,0.531621,0.666676
Beauty_1_001_Central_2,0.472966,0.415695,0.381308,0.378167,0.392455,0.410512,0.463673,0.476131,0.41886,0.384473,...,0.39562,0.413677,0.466838,0.479296,0.422025,0.387638,0.384497,0.398785,0.416842,0.470003
Beauty_1_001_Central_3,0.655059,0.530976,0.504794,0.508064,0.507801,0.56422,0.651476,0.65854,0.534457,0.508275,...,0.511282,0.567701,0.654957,0.662021,0.537938,0.511756,0.515026,0.514763,0.571182,0.658438
Beauty_1_001_East_1,0.960661,0.828387,0.748012,0.738283,0.741548,0.822775,1.026446,0.960567,0.828293,0.747918,...,0.741454,0.822681,1.026352,0.960473,0.828199,0.747824,0.738095,0.74136,0.822587,1.026258
Beauty_1_001_East_2,1.026086,0.725271,0.736007,0.743588,0.729489,0.894991,1.124892,1.027697,0.726882,0.737618,...,0.731099,0.896601,1.126503,1.029307,0.728492,0.739228,0.746809,0.73271,0.898212,1.128113


In [77]:
item_sales_subcat_store_complete.to_csv('sales_pred_subcat_store_complete.csv')

# Clustering with DTW clusters

Finally, we consider clustering based on Dynamic Time Warping (DTW), where clusters are assigned based on the similarity of the time series themselves. DTW clustering uses a measure called the "DTW distance", which compares two time series after aligning them in time, so two series with a similar shape count as close even when their peaks are slightly shifted. Each time series is compared with every other time series in the dataset, and a matrix of pairwise DTW distances is constructed. This matrix is then used as input to a clustering algorithm, such as k-means or hierarchical clustering, which groups the time series into clusters. In our case, each subcategory-store pair is assigned a cluster label, which we load from clusters.csv.
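
To illustrate the DTW distance and the pairwise matrix described above, here is a minimal pure-Python sketch of the classic dynamic-programming formulation, using the absolute difference as the local cost. It is not the code that produced clusters.csv; the function names and the toy series are hypothetical, and a dedicated library would be the practical choice for thousands of long series.

```python
def dtw_distance(a, b):
    """DTW distance between two sequences, with absolute difference as the local cost."""
    n, m = len(a), len(b)
    inf = float('inf')
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # advance in a only
                cost[i][j - 1],      # advance in b only
                cost[i - 1][j - 1],  # advance in both
            )
    return cost[n][m]


def dtw_distance_matrix(series_list):
    """Symmetric matrix of pairwise DTW distances."""
    k = len(series_list)
    matrix = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            d = dtw_distance(series_list[i], series_list[j])
            matrix[i][j] = d
            matrix[j][i] = d
    return matrix


# toy series: a peak, the same peak shifted by one step, and a flat line
peak = [0, 1, 2, 1, 0]
shifted_peak = [0, 0, 1, 2, 1, 0]
flat = [2, 2, 2, 2, 2]

dist = dtw_distance_matrix([peak, shifted_peak, flat])
for row in dist:
    print(row)
```

The shifted peak ends up at distance zero from the original peak, which is the behaviour that distinguishes DTW from a point-by-point comparison; the resulting matrix could then be handed to a clustering routine that accepts precomputed distances.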

## Obtaining item weights by cluster

In [23]:
clusters = pd.read_csv('clusters.csv')
clusters.drop(columns='item1', inplace=True)
clusters.head()

Unnamed: 0,cluster,subcat_id,store_id
0,6,Beauty_1,Central_1
1,14,Beauty_1,Central_2
2,4,Beauty_1,Central_3
3,16,Beauty_1,East_1
4,14,Beauty_1,East_2


In [25]:
train_w_clusters = pd.merge(traincal, clusters, on = ["subcat_id", "store_id"])

# finding total sales per cluster for each date
temp_df = train_w_clusters[["date", "sales", "cluster"]]

total_sales_by_cluster = temp_df.groupby(["cluster","date"]).sum().reset_index()

total_sales_by_cluster = total_sales_by_cluster.rename(columns={'sales': 'total_sales_in_cluster'})

total_sales_by_cluster.head()

Unnamed: 0,cluster,date,total_sales_in_cluster
0,1,2011-01-29,154
1,1,2011-01-30,185
2,1,2011-01-31,185
3,1,2011-02-01,138
4,1,2011-02-02,100


In [26]:
# subsetting to the last 7 days only (days 1913-1919)
train_w_clusters['d'] = train_w_clusters['d'].astype(int)
traincal_lastweek_clusters = train_w_clusters[train_w_clusters['d'] > 1912]
traincal_lastweek_clusters = traincal_lastweek_clusters.drop(columns = ['date', 'd'])

# get last-week sales sums per item and per cluster
traincal_lastweek_clusters['item_lastweek_sales'] = traincal_lastweek_clusters.groupby('id')['sales'].transform('sum')

traincal_lastweek_clusters['cluster_lastweek_sales'] = traincal_lastweek_clusters.groupby('cluster')['sales'].transform('sum')

# calculate each item's weight: its share of its cluster's last-week sales
traincal_lastweek_clusters['w'] = traincal_lastweek_clusters['item_lastweek_sales'] / traincal_lastweek_clusters['cluster_lastweek_sales']

In [27]:
# keeping one row per item: its cluster and its weight
traincal_lastweek_clusters = traincal_lastweek_clusters[['id', 'cluster', 'w']].drop_duplicates().reset_index(drop = True)

In [28]:
# Pivot sales dataframe to get item-sales mapping
item_sales_df_clusters = traincal_lastweek_clusters.pivot(index='id', columns='cluster', values='w')
item_sales_df_clusters = item_sales_df_clusters.fillna(0)
item_sales_df_clusters.head()

id,1,2,3,4,5,6,7,8,9,10,...,13,14,15,16,17,18,19,20,21,22
Beauty_1_001_Central_1,0.0,0.0,0.0,0.0,0.0,0.000744,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Beauty_1_001_Central_2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000307,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Beauty_1_001_Central_3,0.0,0.0,0.0,0.000842,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Beauty_1_001_East_1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000785,0.0,0.0,0.0,0.0,0.0,0.0
Beauty_1_001_East_2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000614,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Training with split

In [29]:
# defining a function that, for each cluster series, fits ETS on the first 90% of days
# and evaluates the forecast on the remaining 10%
def forecast_clusters_with_split(df):
    df = df.set_index('date')  # returns a new frame, so the caller's dataframe is left unchanged
    groups = df.groupby('cluster')
    forecasts = {}
    real_forecasts = {}
    metrics = {}
    
    # forecast for each cluster using ETS with smoothing parameters optimized by fit()
    for name, group in groups:
        id_data = group['total_sales_in_cluster']
        size = int(len(id_data) * 0.90)
        train, test = id_data[0:size], id_data[size:len(id_data)]

        ets_model = ExponentialSmoothing(train, trend = 'add', seasonal = 'add', seasonal_periods = 7, freq = 'D')
        ets_fit = ets_model.fit()
        forecast = ets_fit.forecast(len(test))
        real_forecast = ets_fit.forecast(21)
        forecasts.update({name:forecast})
        real_forecasts.update({name:real_forecast})
        
        #calculating metrics
        rmse = sqrt(mean_squared_error(test, forecast))
        mae = abs(test - forecast).mean()
        mape = abs((test - forecast)/test).mean()
        metrics.update({name:[rmse, mae, mape]})
    
    predicted = pd.DataFrame(real_forecasts)
    metrics = pd.DataFrame(metrics)
    
    return predicted, metrics

In [30]:
results_clusters = forecast_clusters_with_split(total_sales_by_cluster)
predicted_clusters = results_clusters[0]
metrics_clusters = results_clusters[1]



In [31]:
predicted_clusters.head()

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,13,14,15,16,17,18,19,20,21,22
2015-10-22,593.025143,707.488664,1178.572035,569.185174,494.306379,631.41377,1906.245785,1346.105324,1241.366202,701.555328,...,1283.047171,1138.184007,633.263074,946.119619,950.300593,2465.310534,4973.904874,3757.123018,4675.826939,2367.044718
2015-10-23,614.887287,824.142833,1379.274711,667.335468,590.805531,737.023031,2182.377278,1518.843176,1525.356071,710.569611,...,1309.355163,1329.04027,734.037005,1134.750824,1068.633848,2610.350771,5758.232174,4114.60307,5535.193449,2601.5011
2015-10-24,643.459793,966.399303,1564.921544,820.198175,582.488057,792.150168,2487.147446,1884.074753,2094.103226,789.981587,...,1503.505345,1559.334677,926.360376,1325.880679,1261.138692,3102.629651,7069.536187,4916.825606,6618.12596,3123.199738
2015-10-25,644.383723,917.248538,1586.981758,811.375501,524.883172,773.667411,2399.964534,1860.945952,2217.163239,813.280103,...,1498.297662,1504.394775,840.91324,1211.409875,1337.808972,3169.350333,7303.934848,5214.007759,6994.205062,3378.32602
2015-10-26,576.078746,716.192323,1248.311327,614.453813,475.417854,612.510472,2042.237977,1463.916735,1470.3454,775.951922,...,1382.251637,1204.253044,653.816271,944.712544,1084.646402,2693.182734,5408.53565,4192.37719,5300.200382,2765.416736


In [32]:
metrics_clusters

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,13,14,15,16,17,18,19,20,21,22
0,262.524168,223.303179,218.551879,113.467008,85.225022,151.120726,967.586183,415.971535,216.329376,479.202314,...,174.828242,200.906536,123.504678,172.763376,170.301846,579.210109,760.893914,3776.621277,1319.939992,788.386997
1,240.34979,179.140183,160.390348,86.385822,61.655479,112.114626,853.747853,347.281458,153.266276,387.779667,...,114.168251,149.074904,92.554892,121.144589,124.936869,454.262218,514.760015,3206.730158,1062.944081,645.499101
2,inf,4.260728,inf,inf,3.245814,inf,inf,inf,inf,inf,...,inf,inf,inf,inf,inf,inf,10.102791,3.81165,26.851589,3.141835


In [33]:
# replace inf values with NaN so that they are skipped when summing
metrics_clusters.replace([np.inf, -np.inf], np.nan, inplace=True)
metrics_clusters['sum'] = metrics_clusters.sum(axis=1)

# rmse, mae, and mape (rows 0, 1, 2) summed over all clusters
print(metrics_clusters['sum'])

0    11607.491263
1     9394.917257
2       51.414408
Name: sum, dtype: float64


Summing over all clusters, we get a root mean square error (rmse) of 11607.49 and a mean absolute error (mae) of 9394.92.

In [34]:
# Creating the final prediction dataset to be submitted on Kaggle
predicted_clusters = predicted_clusters.T

# matrix mult of the item weights and cluster forecasts to obtain item forecast
item_sales_clusters = item_sales_df_clusters.dot(predicted_clusters)
columns = ['d_1920', 'd_1921','d_1922','d_1923', 'd_1924', 'd_1925','d_1926','d_1927','d_1928','d_1929','d_1930',
             'd_1931','d_1932','d_1933','d_1934','d_1935','d_1936','d_1937','d_1938','d_1939','d_1940']
item_sales_clusters.columns = columns
item_sales_clusters.head()

id,d_1920,d_1921,d_1922,d_1923,d_1924,d_1925,d_1926,d_1927,d_1928,d_1929,...,d_1931,d_1932,d_1933,d_1934,d_1935,d_1936,d_1937,d_1938,d_1939,d_1940
Beauty_1_001_Central_1,0.469802,0.54838,0.589397,0.575645,0.455737,0.482121,0.46367,0.46716,0.545738,0.586756,...,0.453095,0.479479,0.461028,0.464518,0.543097,0.584114,0.570362,0.450453,0.476838,0.458386
Beauty_1_001_Central_2,0.349279,0.407848,0.478519,0.46166,0.369554,0.333885,0.344036,0.347469,0.406038,0.476709,...,0.367744,0.332075,0.342226,0.345659,0.404228,0.474899,0.458039,0.365934,0.330264,0.340416
Beauty_1_001_Central_3,0.479112,0.56173,0.690403,0.682976,0.517217,0.482479,0.471594,0.48188,0.564498,0.69317,...,0.519984,0.485247,0.474361,0.484647,0.567265,0.695937,0.688511,0.522752,0.488014,0.477128
Beauty_1_001_East_1,0.742929,0.891049,1.041131,0.951245,0.741824,0.689564,0.718986,0.741082,0.889202,1.039284,...,0.739977,0.687717,0.717139,0.739235,0.887355,1.037437,0.94755,0.73813,0.68587,0.715292
Beauty_1_001_East_2,0.698558,0.815696,0.957038,0.923319,0.739108,0.667769,0.688072,0.694938,0.812075,0.953418,...,0.735488,0.664149,0.684452,0.691318,0.808455,0.949798,0.916079,0.731867,0.660529,0.680832


In [None]:
item_sales_clusters.to_csv('item_sales_clusters.csv')

## Training on entire dataset

In [203]:
total_sales_by_cluster = temp_df.groupby(["cluster","date"]).sum().reset_index()

total_sales_by_cluster = total_sales_by_cluster.rename(columns={'sales': 'total_sales_in_cluster'})

total_sales_by_cluster.head()

Unnamed: 0,cluster,date,total_sales_in_cluster
0,1,2011-01-29,154
1,1,2011-01-30,185
2,1,2011-01-31,185
3,1,2011-02-01,138
4,1,2011-02-02,100


In [206]:
# defining a function that trains the model on the whole dataset instead of splitting it into training and testing data
def forecast_cluster_complete(df):
    df = df.set_index('date')  # returns a new frame, so the caller's dataframe is left unchanged
    groups = df.groupby('cluster')
    forecasts = {}
    
    # forecast for each cluster using ETS with smoothing parameters optimized by fit()
    for name, group in groups:
        id_data = group['total_sales_in_cluster']
        # freq = 'D' declares the daily frequency explicitly, as in the other forecasting functions
        ets_model = ExponentialSmoothing(id_data, trend = 'add', seasonal = 'add', seasonal_periods = 7, freq = 'D')
        ets_fit = ets_model.fit()
        forecast = ets_fit.forecast(21)
        forecasts.update({name:forecast})
    
    sales_pred_df_cluster = pd.DataFrame(forecasts)
    
    return sales_pred_df_cluster

In [208]:
results_clusters_complete = forecast_cluster_complete(total_sales_by_cluster)




In [209]:
results_clusters_complete.head()

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,13,14,15,16,17,18,19,20,21,22
2016-05-01,539.247897,1159.119249,1715.225607,840.544498,523.664628,894.344095,2810.821336,1861.161872,2159.645831,893.180465,...,1734.909142,1621.411541,860.452416,1286.835441,1291.337255,3348.083133,7772.562433,6173.916739,7368.66725,3305.173923
2016-05-02,458.954425,905.620586,1329.723971,617.953472,482.833353,677.802144,2389.461924,1524.091509,1468.178825,841.198683,...,1562.610816,1308.76765,655.321864,961.643241,989.259766,2660.497844,5778.033867,5149.731334,5481.165852,2672.37946
2016-05-03,464.597071,877.25212,1238.428292,581.750433,487.558414,723.548769,2220.043157,1411.112292,1274.206522,843.564101,...,1492.650746,1217.272925,639.096206,876.380523,920.412589,2488.618162,5383.754269,4811.46653,5079.928181,2399.774386
2016-05-04,491.905024,877.449794,1249.852556,590.726349,500.806861,710.454822,2242.407934,1362.471938,1254.441697,799.827316,...,1464.138443,1226.993913,622.752944,911.046922,883.973494,2409.050548,5350.377032,4731.392325,4982.954337,2305.672256
2016-05-05,484.306625,904.998294,1268.523503,590.287186,524.778938,722.444916,2279.305636,1411.985505,1242.619527,804.619868,...,1458.867033,1256.337461,617.154065,925.015373,860.554877,2490.213536,5469.891937,4758.949306,5124.47634,2279.605644


In [210]:
# Creating the final prediction dataset to be submitted on Kaggle
results_clusters_complete = results_clusters_complete.T

# matrix mult of the item weights and cluster forecasts to obtain item forecast
item_sales_clusters_complete = item_sales_df_clusters.dot(results_clusters_complete)
columns = ['d_1920', 'd_1921','d_1922','d_1923', 'd_1924', 'd_1925','d_1926','d_1927','d_1928','d_1929','d_1930',
             'd_1931','d_1932','d_1933','d_1934','d_1935','d_1936','d_1937','d_1938','d_1939','d_1940']
item_sales_clusters_complete.columns = columns
item_sales_clusters_complete.head()

id,d_1920,d_1921,d_1922,d_1923,d_1924,d_1925,d_1926,d_1927,d_1928,d_1929,...,d_1931,d_1932,d_1933,d_1934,d_1935,d_1936,d_1937,d_1938,d_1939,d_1940
Beauty_1_001_Central_1,0.665435,0.504317,0.538355,0.528612,0.537533,0.599636,0.703719,0.671365,0.510247,0.544285,...,0.543463,0.605566,0.709649,0.677295,0.516177,0.550215,0.540472,0.549394,0.611496,0.715579
Beauty_1_001_Central_2,0.497569,0.401627,0.373549,0.376533,0.385537,0.432363,0.520489,0.500275,0.404333,0.376255,...,0.388243,0.435069,0.523195,0.502981,0.407039,0.378961,0.381945,0.390949,0.437775,0.525901
Beauty_1_001_Central_3,0.707529,0.520163,0.489689,0.497244,0.496875,0.56397,0.718514,0.710672,0.523305,0.492831,...,0.500017,0.567113,0.721656,0.713814,0.526448,0.495974,0.503529,0.50316,0.570255,0.724799
Beauty_1_001_East_1,1.010471,0.755118,0.688167,0.715388,0.726357,0.881183,1.137274,1.010663,0.755309,0.688358,...,0.726548,0.881374,1.137465,1.010854,0.7555,0.688549,0.71577,0.726739,0.881565,1.137656
Beauty_1_001_East_2,0.995138,0.803253,0.747099,0.753065,0.771075,0.864727,1.040978,1.00055,0.808666,0.752511,...,0.776487,0.870139,1.04639,1.005962,0.814078,0.757923,0.763889,0.781899,0.875551,1.051803


In [211]:
item_sales_clusters_complete.to_csv('sales_pred_clusters_complete.csv')

# Conclusion

We trained the ETS model on sales aggregated under several clustering methods, including store_id, subcat_id, subcategory and store combined, and DTW clusters. The model trained on store_id aggregates provided the best performance, as indicated by a higher Kaggle score and better evaluation metrics compared to the other clustering methods. We therefore conclude that using store_id as the clustering method is an effective way to improve the accuracy of the ETS model on this sales data. This finding has the following practical implications for the company.

Firstly, accurate sales forecasting is critical for effective inventory management, especially in retail companies where inventory turnover can have a significant impact on profits. By accurately forecasting sales for each store, the company can optimize its inventory levels, reducing the risk of stockouts or overstocking. This can help to reduce storage costs and increase profits.

Secondly, accurate sales forecasting can also help the company to improve its production planning and supply chain management. By knowing the demand for each store, the company can plan its production and distribution strategies accordingly, ensuring that each store has sufficient stock to meet customer demand. This can help to minimize transportation costs and reduce lead times, improving overall efficiency.

Finally, accurate sales forecasting can help the company to make informed decisions about pricing and promotions. By understanding the demand for each store, the company can tailor its pricing and promotion strategies accordingly, improving its competitiveness in the market and increasing sales revenue.

In summary, the finding that the ETS model trained on store_id clustering provided the best performance in forecasting sales data can have significant implications for the company, improving inventory management, production planning, supply chain management, and pricing and promotion strategies.