# STORE SALES TIMESERIES FORECAST - KAGGLE

## TABLE OF CONTENTS

A. **Pace: Plan stage**
    1. **Introduction: Definition of forecasting task with identification of features and target**

B. **pAce: Analyze stage**
    2. **Data Preparation**
    3. **Data Visualization**
    4. **Multistep forecasting strategies: multioutput, direct, recursive, DiRec**

C. **paCe: Construct stage**
    5. **Model Construction: Boosted Hybrid vs Stacked hybrid**

D. **pacE: Execute stage**
    6. **Hyperparameter tuning**
    7. **Conclusion**

# A. Plan Stage

## 1. Introduction: Definition of forecasting task with identification of features and target
# In this section, we will define the forecasting task and identify the relevant features and target variable.

# Introduction
"""
This project aims to forecast store sales using a time series dataset from Kaggle. The primary goal is to predict future sales for each product family sold at Favorita stores located in Ecuador. Accurate forecasts can help in inventory management, staffing, and overall business strategy.
"""

# Identification of Features and Target
"""
Features:
- Store ID
- Date
- Promotion
- Holiday
- Day of the week
- Season
- Weather

Target:
- Sales
"""

In [4]:
%%time

## Imports

# Installing select libraries

# dask provides the parallel DataFrame used to read and merge the CSV files below
!pip install dask

!pip install catboost
!pip install colorama
!pip install category_encoders
!pip install optuna
!pip install xgboost
!pip install lightgbm
!pip install seaborn

# General library imports
import dask.dataframe as dd

from gc import collect
from warnings import filterwarnings
filterwarnings('ignore')
from IPython.display import clear_output

import xgboost as xgb
import lightgbm as lgb
import catboost as cb
import sklearn as sk
import pandas as pd

print(f"---> XGBoost = {xgb.__version__} | LightGBM = {lgb.__version__} | Catboost = {cb.__version__}")
print(f"---> Sklearn = {sk.__version__} | Pandas = {pd.__version__}\n\n")
collect()

# Data manipulation and visualization
import numpy as np
import joblib
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Model and pipeline specifics
from category_encoders import OrdinalEncoder, OneHotEncoder
from sklearn.preprocessing import RobustScaler, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Time series specific
from sklearn.metrics import mean_squared_log_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

# Ensemble and tuning
import optuna
from optuna import create_study
optuna.logging.set_verbosity(optuna.logging.ERROR)

---> XGBoost = 2.1.0 | LightGBM = 4.5.0 | Catboost = 1.2.5
---> Sklearn = 1.5.1 | Pandas = 2.2.2


CPU times: total: 4.78 s
Wall time: 1min 37s


In [5]:
# Setting rc parameters in seaborn for plots and graphs
# (passed through `rc=` so they are applied on top of the default seaborn style)
sns.set(rc={"axes.facecolor": "#f7f9fc",
            "figure.facecolor": "#f7f9fc",
            "axes.edgecolor": "#000000",
            "grid.color": "#EBEBE7",
            "font.family": "serif",
            "axes.labelcolor": "#000000",
            "xtick.color": "#000000",
            "ytick.color": "#000000",
            "grid.alpha": 0.4,
            "grid.linewidth": 0.75,
            "grid.linestyle": "--",
            "axes.titlecolor": "#0099e6",
            "axes.titlesize": 8.5,
            "axes.labelweight": "bold",
            "legend.fontsize": 7.0,
            "legend.title_fontsize": 7.0,
            "font.size": 7.5,
            "xtick.labelsize": 7.5,
            "ytick.labelsize": 7.5,
            })

# Making sklearn pipeline outputs as dataframe
from sklearn import set_config
set_config(transform_output = "pandas")
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 50)
pd.options.display.float_format = '{:,.2f}'.format

print()
collect()




4

In [6]:
favorita_train = dd.read_csv('data/train.csv')
favorita_test = dd.read_csv('data/test.csv')

stores = dd.read_csv('data/stores.csv')

transactions = dd.read_csv('data/transactions.csv')

oil = dd.read_csv('data/oil.csv')

holidays = dd.read_csv('data/holidays_events.csv')

In [7]:
favorita_train.head()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
0,0,2013-01-01,1,AUTOMOTIVE,0.0,0
1,1,2013-01-01,1,BABY CARE,0.0,0
2,2,2013-01-01,1,BEAUTY,0.0,0
3,3,2013-01-01,1,BEVERAGES,0.0,0
4,4,2013-01-01,1,BOOKS,0.0,0


In [8]:
favorita_test.head()

Unnamed: 0,id,date,store_nbr,family,onpromotion
0,3000888,2017-08-16,1,AUTOMOTIVE,0
1,3000889,2017-08-16,1,BABY CARE,0
2,3000890,2017-08-16,1,BEAUTY,2
3,3000891,2017-08-16,1,BEVERAGES,20
4,3000892,2017-08-16,1,BOOKS,0


In [9]:
stores.head()

Unnamed: 0,store_nbr,city,state,type,cluster
0,1,Quito,Pichincha,D,13
1,2,Quito,Pichincha,D,13
2,3,Quito,Pichincha,D,8
3,4,Quito,Pichincha,D,9
4,5,Santo Domingo,Santo Domingo de los Tsachilas,D,4


In [10]:
transactions.head()

Unnamed: 0,date,store_nbr,transactions
0,2013-01-01,25,770
1,2013-01-02,1,2111
2,2013-01-02,2,2358
3,2013-01-02,3,3487
4,2013-01-02,4,1922


In [11]:
oil.head()

Unnamed: 0,date,dcoilwtico
0,2013-01-01,
1,2013-01-02,93.14
2,2013-01-03,92.97
3,2013-01-04,93.12
4,2013-01-07,93.2


In [12]:
holidays.head()

Unnamed: 0,date,type,locale,locale_name,description,transferred
0,2012-03-02,Holiday,Local,Manta,Fundacion de Manta,False
1,2012-04-01,Holiday,Regional,Cotopaxi,Provincializacion de Cotopaxi,False
2,2012-04-12,Holiday,Local,Cuenca,Fundacion de Cuenca,False
3,2012-04-14,Holiday,Local,Libertad,Cantonizacion de Libertad,False
4,2012-04-21,Holiday,Local,Riobamba,Cantonizacion de Riobamba,False


In [13]:
# Merge favorita_train and stores on 'store_nbr'
first_join = favorita_train.merge(stores, on='store_nbr', how='inner')

# Merge the result with transactions on 'store_nbr' and 'date'
second_join = first_join.merge(transactions, on=['store_nbr', 'date'], how='inner')

# Merge the result with oil on 'date'
third_join = second_join.merge(oil, on='date', how='inner')

# Merge the result with holidays on 'date'
final_join_train = third_join.merge(holidays, on='date', how='inner')

In [14]:
# Merge favorita_test and stores on 'store_nbr'
first_join = favorita_test.merge(stores, on='store_nbr', how='inner')

# Merge the result with transactions on 'store_nbr' and 'date'
second_join = first_join.merge(transactions, on=['store_nbr', 'date'], how='inner')

# Merge the result with oil on 'date'
third_join = second_join.merge(oil, on='date', how='inner')

# Merge the result with holidays on 'date'
final_join_test = third_join.merge(holidays, on='date', how='inner')

In [15]:
# Compute the final result
final_join_train = final_join_train.compute()
final_join_test = final_join_test.compute()

In [16]:
# Display the head of the final DataFrame
final_join_train.head()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,city,state,type_x,cluster,transactions,dcoilwtico,type_y,locale,locale_name,description,transferred
0,561,2013-01-01,25,AUTOMOTIVE,0.0,0,Salinas,Santa Elena,D,1,770,,Holiday,National,Ecuador,Primer dia del ano,False
1,562,2013-01-01,25,BABY CARE,0.0,0,Salinas,Santa Elena,D,1,770,,Holiday,National,Ecuador,Primer dia del ano,False
2,563,2013-01-01,25,BEAUTY,2.0,0,Salinas,Santa Elena,D,1,770,,Holiday,National,Ecuador,Primer dia del ano,False
3,564,2013-01-01,25,BEVERAGES,810.0,0,Salinas,Santa Elena,D,1,770,,Holiday,National,Ecuador,Primer dia del ano,False
4,565,2013-01-01,25,BOOKS,0.0,0,Salinas,Santa Elena,D,1,770,,Holiday,National,Ecuador,Primer dia del ano,False


In [17]:
final_join_test.head()

Unnamed: 0,id,date,store_nbr,family,onpromotion,city,state,type_x,cluster,transactions,dcoilwtico,type_y,locale,locale_name,description,transferred


In [18]:
final_join_train.shape

(322047, 17)

In [19]:
final_join_test.shape

(0, 16)
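
The shapes above, (322047, 17) for train and (0, 16) for test, are what a chain of inner joins produces when a lookup table lacks matching keys: an inner join keeps only rows that match on both sides. Presumably the transactions table has no rows for the test dates, and the holidays table only covers dates on which an event occurs. The sketch below uses tiny made-up frames (not the Favorita data) to show how `how='left'` keeps every sales row and leaves unmatched columns as NaN.

```python
import pandas as pd

# Tiny made-up frames that mimic the structure of the Favorita tables
sales = pd.DataFrame({
    "date": ["2017-08-15", "2017-08-16", "2017-08-17"],
    "store_nbr": [1, 1, 1],
    "onpromotion": [0, 2, 0],
})
transactions = pd.DataFrame({
    "date": ["2017-08-15"], "store_nbr": [1], "transactions": [1693],
})
holidays = pd.DataFrame({
    "date": ["2017-08-16"], "description": ["Example holiday"],
})

# Inner joins: a row survives only if it matches in every table
inner = (sales.merge(transactions, on=["store_nbr", "date"], how="inner")
              .merge(holidays, on="date", how="inner"))

# Left joins: every sales row survives; missing matches become NaN
left = (sales.merge(transactions, on=["store_nbr", "date"], how="left")
             .merge(holidays, on="date", how="left"))

print("inner:", inner.shape)
print("left: ", left.shape)
```

With the real data, the holidays table may list more than one event for the same date, so it may need de-duplicating before a left join to avoid repeating sales rows.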

In [20]:
# Export the DataFrame to a CSV file
final_join_train.to_csv('store_train.csv', index=False)

In [21]:
final_join_test.to_csv('store_test.csv', index=False)

Features:
- Product family
- City
- Percent oil
- Locale
- Date (day of the week)
- Date (quarter)
- Date (month)
- Date (year)
- Type_y
- Promotion
- Store number
- Cluster
- Description
- Transferred holidays


Target:
- Sales

In [22]:
# B. Analyze Stage

## 2. Data Preparation
# Missing values per column (a cell displays only its last expression)
final_join_test.isna().sum()
final_join_train.isna().sum()



id                  0
date                0
store_nbr           0
family              0
sales               0
onpromotion         0
city                0
state               0
type_x              0
cluster             0
transactions        0
dcoilwtico      22044
type_y              0
locale              0
locale_name         0
description         0
transferred         0
dtype: int64
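
The only column with gaps is `dcoilwtico`, and `oil.head()` earlier shows why: the first day has no price and weekend dates are absent (the rows jump from 2013-01-04 to 2013-01-07). One possible treatment, sketched below on the five rows from `oil.head()`, is to reindex the oil table onto a full daily calendar, interpolate linearly, and back-fill the leading gap before merging. The choice of linear interpolation is an assumption, not something the notebook prescribes.

```python
import pandas as pd

# The five rows shown by oil.head() above
oil = pd.DataFrame({
    "date": ["2013-01-01", "2013-01-02", "2013-01-03", "2013-01-04", "2013-01-07"],
    "dcoilwtico": [float("nan"), 93.14, 92.97, 93.12, 93.20],
})
oil["date"] = pd.to_datetime(oil["date"])

# Put the prices on a complete daily calendar so weekends get a row
calendar = pd.date_range(oil["date"].min(), oil["date"].max(), freq="D")
daily = oil.set_index("date").reindex(calendar).rename_axis("date")

# Interpolate interior gaps, then back-fill the leading missing value
daily["dcoilwtico"] = daily["dcoilwtico"].interpolate(method="linear").bfill()
daily = daily.reset_index()

print(daily)
```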

In [None]:


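# Assumption: `data` is the merged training frame, with columns renamed to the
# names the cells below expect (Date, Sales, Store)
data = final_join_train.rename(columns={'date': 'Date', 'sales': 'Sales', 'store_nbr': 'Store'})

# Display the first few rows of the dataset
data.head()

# Data Cleaning
# Parse dates, sort chronologically, then forward-fill missing values
# (a gap at the very start of the series has nothing to fill from and stays NaN)
data['Date'] = pd.to_datetime(data['Date'])
data = data.sort_values('Date').ffill()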


In [None]:

# Feature Engineering
# Create new features, encode categorical variables, etc.
data['DayOfWeek'] = data['Date'].dt.dayofweek
data['Month'] = data['Date'].dt.month
data['Year'] = data['Date'].dt.year

# Display the cleaned data
data.head()

## 3. Data Visualization
# In this section, we will visualize the data to identify trends, seasonality, and other patterns.

# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Sales over time
plt.figure(figsize=(10, 6))
sns.lineplot(x='Date', y='Sales', data=data)
plt.title('Sales Over Time')
plt.xlabel('Date')
plt.ylabel('Sales')
plt.show()

# Additional visualizations
# Sales distribution by store
plt.figure(figsize=(10, 6))
sns.boxplot(x='Store', y='Sales', data=data)
plt.title('Sales Distribution by Store')
plt.xlabel('Store')
plt.ylabel('Sales')
plt.show()

## 4. Multistep forecasting strategies: multioutput, direct, recursive, DiRec
# This section covers different strategies for multi-step forecasting.

# Multistep Forecasting Strategies
"""
1. Multioutput: Predicting all future steps simultaneously.
2. Direct: Building separate models for each forecast step.
3. Recursive: Using the forecast from the previous step as an input for the next step.
4. DiRec: Combining direct and recursive approaches.
"""

# Example implementation of each strategy

# Multioutput Strategy
from sklearn.multioutput import MultiOutputRegressor
from sklearn.ensemble import RandomForestRegressor

# Define the base regressor
base_regressor = RandomForestRegressor()

# Multioutput regressor
multioutput_model = MultiOutputRegressor(base_regressor)

# Assuming X_train and y_train are prepared with the appropriate shape
# y_train should have multiple columns, one for each future step

# Fit the model
multioutput_model.fit(X_train, y_train)

# Predict
y_pred_multioutput = multioutput_model.predict(X_test)

# Direct Strategy
from sklearn.model_selection import train_test_split

# Create a dictionary to hold models for each step
direct_models = {}
steps = 5  # Number of future steps to predict

for step in range(1, steps + 1):
    model = RandomForestRegressor()
    y_train_step = y_train[:, step-1]  # Training data for this specific step
    model.fit(X_train, y_train_step)
    direct_models[step] = model

# Predict
y_pred_direct = np.column_stack([direct_models[step].predict(X_test) for step in range(1, steps + 1)])

# Recursive Strategy
# Assumption: the last column of X_train / X_test holds the most recent observed
# target value (lag 1), so each prediction can be fed back into that column
recursive_model = RandomForestRegressor()

# Fit a single one-step-ahead model
recursive_model.fit(X_train, y_train[:, 0])

# Initialize predictions
y_pred_recursive = np.zeros((X_test.shape[0], steps))

# Predict recursively: the feature count never changes, only the lag value does
X_step = np.array(X_test, dtype=float)  # working copy of the test features
for step in range(steps):
    y_pred_recursive[:, step] = recursive_model.predict(X_step)
    X_step[:, -1] = y_pred_recursive[:, step]  # prediction becomes the new lag 1

# DiRec Strategy
# Direct steps
direct_steps = 3  # Number of steps to predict directly
total_steps = 5   # Total number of steps to predict

# Create a dictionary to hold models for direct steps
direct_models = {}

for step in range(1, direct_steps + 1):
    model = RandomForestRegressor()
    y_train_step = y_train[:, step-1]
    model.fit(X_train, y_train_step)
    direct_models[step] = model

# Initialize predictions
y_pred_direc = np.zeros((X_test.shape[0], total_steps))

# Predict directly for initial steps
for step in range(1, direct_steps + 1):
    y_pred_direc[:, step-1] = direct_models[step].predict(X_test)

# Remaining steps: one model per step, each taking the earlier steps as extra features
# (trained on the true earlier targets, predicted with the earlier predictions),
# so the feature count at prediction time always matches the one used in training
for step in range(direct_steps + 1, total_steps + 1):
    step_model = RandomForestRegressor()
    X_train_step = np.hstack([X_train, y_train[:, :step-1]])
    step_model.fit(X_train_step, y_train[:, step-1])
    X_test_step = np.hstack([X_test, y_pred_direc[:, :step-1]])
    y_pred_direc[:, step-1] = step_model.predict(X_test_step)

# C. Construct Stage

## 5. Model Construction: Boosted Hybrid vs Stacked hybrid
# This section involves constructing and comparing different forecasting models.

# Model Construction
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

# Define models
boosted_model = GradientBoostingRegressor()
stacked_model = LinearRegression()

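# Train-test split
# Keep numeric columns only (the estimators below cannot take raw dates or strings)
X = data.drop('Sales', axis=1).select_dtypes(include='number')
y = data['Sales']
# shuffle=False keeps the split chronological (assumes rows are sorted by date),
# so the test set is the most recent 20% of the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)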

# Train models
boosted_model.fit(X_train, y_train)
stacked_model.fit(X_train, y_train)

# Evaluate models
from sklearn.metrics import mean_squared_log_error

def rmsle(y_true, y_pred):
    # mean_squared_log_error raises on negative values, so clip predictions at zero
    return np.sqrt(mean_squared_log_error(y_true, np.clip(y_pred, 0, None)))

y_pred_boosted = boosted_model.predict(X_test)
y_pred_stacked = stacked_model.predict(X_test)

print("Boosted Model RMSLE:", rmsle(y_test, y_pred_boosted))
print("Stacked Model RMSLE:", rmsle(y_test, y_pred_stacked))
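
The comparison above fits a gradient boosting model and a linear model separately, which does not yet combine them. As a hedged sketch of what the two hybrids in the section title usually mean: a boosted hybrid fits a second model on the residuals of the first and sums the two predictions, while a stacked hybrid feeds the first model's predictions to the second model as an extra feature. The synthetic series, the feature choices, and all variable names below are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_log_error

# Synthetic daily series: linear trend + weekend bump + noise
rng = np.random.default_rng(0)
n = 200
t = np.arange(n)
dow = t % 7
y = 50 + 0.5 * t + 10 * (dow >= 5) + rng.normal(0, 1, n)

X_trend = t.reshape(-1, 1)    # time index, used by the linear model
X_other = dow.reshape(-1, 1)  # day of week, used by the tree model
split = 160                   # chronological split: first 160 days train, rest test

# First model: linear trend
trend_model = LinearRegression().fit(X_trend[:split], y[:split])
trend_train = trend_model.predict(X_trend[:split])
trend_test = trend_model.predict(X_trend[split:])

# Boosted hybrid: second model learns the residuals, predictions are summed
resid_model = GradientBoostingRegressor(random_state=0)
resid_model.fit(X_other[:split], y[:split] - trend_train)
pred_boosted = trend_test + resid_model.predict(X_other[split:])

# Stacked hybrid: second model sees the first model's prediction as a feature
stack_model = GradientBoostingRegressor(random_state=0)
stack_model.fit(np.column_stack([X_other[:split], trend_train]), y[:split])
pred_stacked = stack_model.predict(np.column_stack([X_other[split:], trend_test]))

def rmsle(y_true, y_pred):
    return np.sqrt(mean_squared_log_error(y_true, np.clip(y_pred, 0, None)))

rmsle_boosted = rmsle(y[split:], pred_boosted)
rmsle_stacked = rmsle(y[split:], pred_stacked)
print("Boosted hybrid RMSLE:", rmsle_boosted)
print("Stacked hybrid RMSLE:", rmsle_stacked)
```

On a trending series like this one, the tree-based second stage of the stacked hybrid cannot predict beyond the range of targets seen in training, which is one reason to let a linear model carry the trend. Whether the same ordering holds on the Favorita data would need to be checked there.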

# D. Execute Stage

## 6. Hyperparameter tuning
# This section focuses on tuning the hyperparameters of the models to improve performance.

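# Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Define hyperparameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2]
}

# Grid search for boosted model
# TimeSeriesSplit keeps every validation fold later in time than its training fold
# (assumes the rows of X_train are in chronological order)
grid_search = GridSearchCV(boosted_model, param_grid, cv=TimeSeriesSplit(n_splits=3),
                           scoring='neg_mean_squared_log_error')
grid_search.fit(X_train, y_train)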

# Best parameters
best_params = grid_search.best_params_
print("Best parameters for boosted model:", best_params)

# Update boosted model with best parameters
boosted_model = GradientBoostingRegressor(**best_params)
boosted_model.fit(X_train, y_train)

## 7. Conclusion
# Summarize findings and discuss the next steps.

# Conclusion
"""
In this project, we explored various techniques for forecasting store sales using time series data. We prepared and visualized the data, implemented different multi-step forecasting strategies, and compared the performance of boosted and stacked hybrid models. Further improvements can be made through advanced hyperparameter tuning and incorporating additional features.
"""

# Save the best model
import joblib

joblib.dump(grid_search.best_estimator_, 'best_model.pkl')


In [None]:

# B. Analyze Stage

## 2. Data Preparation
# This section involves cleaning and preparing the data for analysis.

# Data Preparation
```python
import pandas as pd

# Load the dataset
data = pd.read_csv('path_to_your_dataset.csv')

# Display the first few rows of the dataset
data.head()

# Data Cleaning
# Handle missing values, convert data types, etc.
# ...

# Feature Engineering
# Create new features, encode categorical variables, etc.
# ...
