<a href="https://colab.research.google.com/github/ikoghoemmanuell/Grocery-Store-Forecasting-Challenge-For-Azubian/blob/main/dev/notebooks/sales.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Title

# Description

In [None]:
pip install category_encoders



In [None]:
!pip install pmdarima

# Importation

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

import matplotlib.dates as mdates
%matplotlib inline
from itertools import product

from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.stattools import kpss
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from category_encoders.binary import BinaryEncoder
from sklearn.preprocessing import StandardScaler

from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_squared_log_error

from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.api import AutoReg
from pmdarima import auto_arima
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX

import warnings
import os
warnings.filterwarnings("ignore")
from google.colab import drive

# Data Loading

In [None]:
# Mount Google Drive
drive.mount('/content/drive')

# Define the folder path in Google Drive where your CSV files are located
folder_path = "/content/drive/MyDrive/Colab Notebooks/datasets/grocery store azubian"

# Load the CSV files into DataFrames
data = {}

# Iterate over the files in the folder
for file_name in os.listdir(folder_path):
    if file_name.endswith(".csv"):
        # Remove the file extension to get the variable name
        variable_name = file_name.replace(".csv", "")

        # Construct the file path
        file_path = os.path.join(folder_path, file_name)

        # Read the CSV file content into a DataFrame
        data[variable_name] = pd.read_csv(file_path)

# Access the data using dictionary keys
holidays = data["holidays"]
dates = data["dates"]
sample = data["SampleSubmission"]
stores = data["stores"]
test = data["test"]
train = data["train"]

In [None]:
# train = pd.read_csv("C:/Users/LENOVO/Music/Grocery-Store-Forecasting-Challenge-For-Azubian/assets/grocery store azubian/train.csv")
# test = pd.read_csv("C:/Users/LENOVO/Music/Grocery-Store-Forecasting-Challenge-For-Azubian/assets/grocery store azubian/test.csv")
# stores = pd.read_csv("C:/Users/LENOVO/Music/Grocery-Store-Forecasting-Challenge-For-Azubian/assets/grocery store azubian/stores.csv")
# sample = pd.read_csv("C:/Users/LENOVO/Music/Grocery-Store-Forecasting-Challenge-For-Azubian/assets/grocery store azubian/SampleSubmission.csv")
# dates = pd.read_csv("C:/Users/LENOVO/Music/Grocery-Store-Forecasting-Challenge-For-Azubian/assets/grocery store azubian/dates.csv")
# holidays = pd.read_csv("C:/Users/LENOVO/Music/Grocery-Store-Forecasting-Challenge-For-Azubian/assets/grocery store azubian/holidays.csv")

# Dataset overview

In [None]:
train.head()

The train date column is stored as a number. We'll have to convert it to datetime format later.

In [None]:
test.head()

In [None]:
test.info()

In [None]:
stores.head()

In [None]:
stores.info()

city, type & cluster are categorical variables, so they should not be stored as numbers (int64).

Furthermore, cities do not have an ordinal relationship with one another. Ordinal variables have a natural order, like "good-better-best" or "positive-neutral-negative". Nominal variables don't.

We don't want our machine learning models to think that city-0 comes before city-1, which comes before city-3.

Therefore, we'll have to change the datatypes to object or string to make them more descriptive, for example: 'London', 'Tokyo', 'Rome' and so on.

Same goes for type and cluster.
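
To make the nominal idea concrete, here is a minimal sketch on a made-up three-store table (the `toy_stores` name and its values are hypothetical, not taken from the dataset). Prefixing the integer codes turns them into labels, and one-hot encoding is one possible way to give each city its own column with no implied order.

```python
import pandas as pd

# Hypothetical toy table; the real stores data has more rows and columns
toy_stores = pd.DataFrame({"store_id": ["s1", "s2", "s3"], "city": [0, 1, 3]})

# Turn the integer codes into string labels so they are treated as nominal
toy_stores["city"] = "city_" + toy_stores["city"].astype(str)

# One-hot encode: one indicator column per city, no ordering implied
one_hot = pd.get_dummies(toy_stores["city"])
print(one_hot.columns.tolist())
```

One-hot encoding is only used here for illustration; any encoder that does not impose an order on the labels serves the same purpose.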

In [None]:
dates['date'].unique()

This dataset contains dates and the features that have already been extracted from it

In [None]:
dates.info()

in this case, these categories have an ordinal relationship with one another, meaning one date naturally comes before the other,

so we can leave them as they are.

In [None]:
dates.dayofyear.unique()

A normal year has 365 days; a leap year has 366. This is a problem for us. Let me explain why.

**Problem**:
In a leap year, New Year's Eve falls on day 366.

Otherwise, it falls on day 365. So from March onwards, the same calendar day gets a different number depending on the year.

**Solution**:
We will later create two new columns called "sin(dayofyear)" & "cos(dayofyear)". These new columns will help our machine learning models understand the cyclic nature of a year.

Cyclic means that the end of one year runs straight into the start of the next, so day 365 and day 1 should be treated as neighbours.
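
A minimal sketch of why the sine/cosine pair helps, assuming the day number is first scaled to an angle so that one year is one full turn. The `encode_day_of_year` helper and the period of 365.25 days are assumptions made for this illustration.

```python
import numpy as np

def encode_day_of_year(day, period=365.25):
    """Map a day number onto the unit circle so the year wraps around."""
    angle = 2 * np.pi * np.asarray(day, dtype=float) / period
    return np.sin(angle), np.cos(angle)

# Day 1 and day 365 are far apart as raw numbers but close on the circle
sin_vals, cos_vals = encode_day_of_year([1, 183, 365])
points = np.column_stack([sin_vals, cos_vals])
dist_new_year = np.linalg.norm(points[0] - points[2])  # Jan 1 vs Dec 31
dist_mid_year = np.linalg.norm(points[0] - points[1])  # Jan 1 vs early July
print(dist_new_year, dist_mid_year)
```

Both columns are needed: the sine alone gives the same value to two different days of the year, and the cosine tells them apart.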

In [None]:
holidays.head()

In [None]:
holidays.info()

In [None]:
holidays.type.unique()

The type column is a categoric variable, and each type of holiday does not have an ordinal relationship, since a holiday like new year is not higher or better than Christmas for example.

so we'll later convert them to string type to make it more descriptive

In [None]:
train.describe()

train dates range from **365** to **1626**

In [None]:
test.describe()

test dates range from **1627** to **1682**

this is a continuation from train. This makes sense since we are to predict future transactions based on past data

**note**: we will not be using transaction data to train our models, since transaction data was not provided for our test data.

In [None]:
dates.describe()

dates are from **365** till **1684** which covers the train and test dates

so, we'll be able to add the features from here to both the train and test data based on the date

In [None]:
holidays.describe()

In [None]:
# count the number of dates in the holidays dataset
holidays.date.nunique()

notice that the dates in the holiday dataset are not complete

so, we will later create a column for holidays in our train and test dataset based on the following logic:

if a date is in the holidays table, then its a holiday, else that date is not a holiday
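
The logic above can be sketched on toy data as follows. The `toy_sales` and `toy_holidays` tables and their values are made up for illustration; only the membership test matters.

```python
import pandas as pd

# Hypothetical toy data: dates are integer day numbers, as in the raw files
toy_sales = pd.DataFrame({"date": [365, 366, 367, 368],
                          "target": [10.0, 12.5, 9.0, 30.0]})
toy_holidays = pd.DataFrame({"date": [366, 368], "type": [1, 2]})

# A date is a holiday exactly when it appears in the holidays table
toy_sales["is_holiday"] = toy_sales["date"].isin(toy_holidays["date"])
print(toy_sales)
```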

## Hypothesis
**H0**: holidays have a big effect on sales, hence the sales data is seasonal.

**H1**: holidays don't affect sales, hence sales data is stationary.

## Questions

1. Is the train data complete?
2. Do we have seasonality in our sales?
3. Are there outliers in our dataset?
4. What is the difference between RMSLE, RMSE and MSE?

| Issues | How we intend to solve them |
|---|---|
| 1. City, type & cluster in our stores dataset are numerical | Convert to string and make the categories more descriptive. |
| 2. The dayofyear column in our dates dataset ranges from 1 to 366. This will make some days fall on the wrong number | Find the sine and cosine of this column to represent the cyclic nature of a year. We can also include weather conditions, holidays and events to this. |


# Data Cleaning

Here, we will prepare our data for Univariate and Bivariate analysis.

## Fixing our issues

1. City, type & cluster in our stores dataset are numerical

Solution: convert to string and make the categories more descriptive.

**city**

In [None]:
stores.city.unique()

In [None]:
# turn each city number into a string label such as 'city_0'
# so that it is treated as a category, not as a quantity
stores.city = stores.city.apply(lambda x: 'city_'+ str(x))

In [None]:
stores.city.unique()

**type**

In [None]:
stores.type.unique()

In [None]:
# turn each store type number into a string label such as 'store_0'
stores.type = stores.type.apply(lambda x: 'store_'+ str(x))

In [None]:
stores.type.unique()

**cluster**

In [None]:
stores.cluster.unique()

In [None]:
# turn each cluster number into a string label such as 'cluster_0'
stores.cluster = stores.cluster.apply(lambda x: 'cluster_'+ str(x))

In [None]:
stores.cluster.unique()

In [None]:
holidays.type.unique()

2. The dayofyear column in our dates dataset ranges from 1 to 366. This will make some days fall on the wrong number

Solution: find the sine and cosine of this column to represent the cyclic nature of a year. We can also include weather conditions, holidays and events to this.

In [None]:
dates.info()

In [None]:
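# create new columns to represent the cyclic nature of a year
# (scale the day number to an angle in radians so that one year is one full turn)
dates["sin(dayofyear)"] = np.sin(2 * np.pi * dates["dayofyear"] / 365.25)
dates["cos(dayofyear)"] = np.cos(2 * np.pi * dates["dayofyear"] / 365.25)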

In [None]:
def get_datetime(df):
  # Create a new column combining the year, month, and day of the month as 'YYYY-MM-DD'
  # (2000 is added to the year column to get a four-digit year)
  df['date_extracted'] = (
      df['year'].astype(int).add(2000).astype(str) + '-' +
      df['month'].astype(str).str.zfill(2) + '-' +
      df['dayofmonth'].astype(str).str.zfill(2)
  )

get_datetime(dates)

### merging our data

In [None]:
stores.rename(  # rename type to store_type to make it more descriptive
      columns={'type': 'store_type'},
      inplace=True)
holidays.rename(  # rename type to holiday_type to make it more descriptive
      columns={'type': 'holiday_type'},
      inplace=True)
# make each holiday type a string
holidays['holiday_type'] = holidays['holiday_type'].apply(lambda x: 'holiday_' + str(x))

In [None]:
#merging train and test with stores dataset

def merge(df1, df2):
    merged_df = df1.merge(df2, how='left', on='date')

    return merged_df

def merge_stores(df1, df2):
    merged_df = df1.merge(df2, how='left', on='store_id')

    return merged_df

In [None]:
def get_is_holiday_column(df):
  # dates with no match in the holidays table are ordinary working days
  df['holiday_type'] = df['holiday_type'].fillna('Workday')

  # create column to show if it's a holiday or not (non-holidays are False)
  df['is_holiday'] = df['holiday_type'] != 'Workday'

Dates that are not in the holidays table get the holiday type 'Workday'.

Now we must merge holidays with the merged data.

We don't want our ML models to think that non-holidays have an ordinal relationship with the other holiday types (1, 2, 3, 4),

in other words, that non-holidays always come before holidays.

So we also create a separate is_holiday column that simply shows whether or not a date is a holiday.

In [None]:
train_merged = merge_stores(train, stores)
train_merged1 = merge(train_merged, holidays)
get_is_holiday_column(train_merged1)
train_merged2 = merge(train_merged1, dates)

test_merged = merge_stores(test, stores)
test_merged1 = merge(test_merged, holidays)
get_is_holiday_column(test_merged1)
test_merged2 = merge(test_merged1, dates)

In [None]:
train_merged2['holiday_type'].unique()

In [None]:
# Convert the column to datetime with errors='coerce'
train_merged2['date_ext'] = pd.to_datetime(train_merged2['date_extracted'], errors='coerce')

# Filter rows with NaT values and convert them to a list
invalid_dates = train_merged2.loc[train_merged2['date_ext'].isna(), 'date_extracted'].tolist()

print(invalid_dates) #get a list of invalid dates
print(list(set(invalid_dates))) #unique invalid dates
train_merged2.drop('date_ext', axis=1, inplace=True)

The only invalid date is 2003-02-29. It cannot be parsed as a calendar date,

so when converting to datetime we set invalid dates to NaT with errors='coerce'.

A datetime column cannot hold 2003-02-29, so those rows keep a NaT date; below we count how many there are.

In [None]:
train_merged2['date_extracted'] = pd.to_datetime(train_merged2['date_extracted'], errors='coerce')
test_merged2['date_extracted'] = pd.to_datetime(test_merged2['date_extracted'])
# count the rows whose date could not be parsed (the 2003-02-29 rows)
train_merged2['date_extracted'].isna().sum()

In [None]:
def set_index(df):
  df.drop('date', inplace=True, axis=1)
  df.set_index('date_extracted', inplace=True)
set_index(train_merged2)
set_index(test_merged2)

## Drop Duplicates

In [None]:
train_merged2.drop_duplicates(inplace=True)
test_merged2.drop_duplicates(inplace=True)

In [None]:
train = train_merged2
test = test_merged2

## Impute Missing Values

In [None]:
print(train.isnull().sum())
print(test.isnull().sum())

# Exploratory Data Analysis: EDA

## Hypothesis Validation
**H0**: holidays have a big effect on sales, hence the sales data is seasonal.

**H1**: holidays don't affect sales, hence sales data is stationary.

In [None]:
# Bar chart of sales by holiday type
train.groupby('holiday_type')['target'].sum().plot(kind='bar')
plt.xlabel('Holiday Type')
plt.ylabel('Sales')
plt.title('Total Sales by Holiday Type')
plt.show()

In [None]:
# Box plot of sales during holidays vs non-holidays
train.boxplot(column='target', by='is_holiday', figsize=(8, 6))
plt.xlabel('is_it_a_Holiday')
plt.ylabel('Sales')
plt.title('Sales During Holidays vs Non-Holidays')
plt.suptitle('')
plt.show()

## Answering Questions

1. Is the train data complete?

No. The output below shows that our train data is incomplete.

In [None]:
# create a function to check for missing extracted dates
def get_missing_dates(df):
  col = df.index
  # every calendar day between the first and last date, minus the days we actually have
  missing_dates = pd.date_range(start=col.min(), end=col.max()).difference(col)
  print(f"we have {len(missing_dates)} dates missing out of {len(col)}")
  return missing_dates

In [None]:
get_missing_dates(train)

In [None]:
get_missing_dates(test)
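
Here is a small sketch of the same check on a made-up index, so the result is easy to verify by eye. The `toy_index` dates are hypothetical; one day is deliberately left out.

```python
import pandas as pd

# Hypothetical index with 2001-01-03 deliberately left out
toy_index = pd.DatetimeIndex(["2001-01-01", "2001-01-02",
                              "2001-01-04", "2001-01-05"])

# Every calendar day between the first and last observed date
full_range = pd.date_range(start=toy_index.min(), end=toy_index.max())

# Days in the full range that never appear in the data
missing = full_range.difference(toy_index)
print(missing)
```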

2. Do we have seasonality in our sales?

In [None]:
# Assuming your time series data is stored in the variable 'sales_data'
sales_data = train['target']

In [None]:
# Perform KPSS test
kpss_result = kpss(sales_data)
kpss_statistic = kpss_result[0]
kpss_pvalue = kpss_result[1]
kpss_critical_values = kpss_result[3]

In [None]:
print("\nKPSS Test:")
print("KPSS Statistic:", kpss_statistic)
print("p-value:", kpss_pvalue)

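The null hypothesis of the KPSS test is that the series is stationary, so the series is stationary if p-value > 0.05.

Here the series is not stationary, since 0.01 < 0.05.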

In [None]:
def check_stationarity(df, date_col, target_col, window=12):
    # Calculate rolling statistics
    rolling_std = df[target_col].rolling(window=window).std()
    rolling_mean = df[target_col].rolling(window=window).mean()

    # Plot original series and rolling statistics
    plt.figure(figsize=(10, 6))
    plt.plot(df.index, df[target_col], color='blue', label='Original Series')
    plt.plot(df.index, rolling_std, color='green', label='Rolling Std')
    plt.plot(df.index, rolling_mean, color='red', label='Rolling Mean')
    plt.legend()
    plt.title('Rolling Statistics')
    plt.xlabel('Date')
    plt.ylabel('Target(sales)')
    plt.tight_layout()  # Adjusts plot spacing
    plt.show()

# Example usage
check_stationarity(train, 'date_extracted', 'target')

### Checking for Stationarity of the Train Dataset

In [None]:
# Perform seasonal decomposition
result = seasonal_decompose(train['target'], model='additive', period=12)  # Adjust the period as needed

# Plot the decomposed components
fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, 1, figsize=(10, 8))
result.observed.plot(ax=ax1)
ax1.set_ylabel('Observed')
result.trend.plot(ax=ax2)
ax2.set_ylabel('Trend')
result.seasonal.plot(ax=ax3)
ax3.set_ylabel('Seasonal')
result.resid.plot(ax=ax4)
ax4.set_ylabel('Residual')
plt.tight_layout()
plt.show()

Observed values: These are the actual values of the time series. They represent the data points that are observed or recorded over a period of time. In the context of sales data, the observed values would be the actual sales figures recorded at different time intervals.

Trend: The trend component represents the long-term pattern or direction of the time series. It captures the underlying growth or decline in the data over an extended period. The trend component helps identify whether the series is increasing, decreasing, or remaining relatively stable over time.

Seasonal: The seasonal component represents the periodic patterns or fluctuations that occur within a time series. It captures the regular and repetitive variations that happen within specific time periods, such as daily, weekly, monthly, or yearly cycles. In sales data, seasonal patterns may include higher sales during holiday seasons or lower sales during certain months of the year.

Residual: The residual component, also known as the irregular or random component, represents the remaining variation in the time series after removing the trend and seasonal components. It includes any unpredictable or random fluctuations that are not accounted for by the trend or seasonal patterns. The residual component is often assumed to be noise or measurement error.
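
To show how the four components fit together, here is a hand-rolled sketch of an additive decomposition on a synthetic series. The weekly period of 7, the `pattern` values and the linear trend are all assumptions chosen for this toy example; `seasonal_decompose` handles edge cases (such as even periods) that this sketch ignores.

```python
import numpy as np
import pandas as pd

period = 7  # assumed weekly cycle for this toy series
n = 10 * period
t = np.arange(n)
pattern = np.array([3.0, 1.0, -2.0, -1.0, 0.0, -3.0, 2.0])  # sums to zero
series = pd.Series(0.5 * t + np.tile(pattern, n // period))

# Trend: centered moving average over one full cycle
trend = series.rolling(window=period, center=True).mean()

# Seasonal: average detrended value at each position in the cycle
detrended = series - trend
seasonal_means = detrended.groupby(t % period).mean()
seasonal = pd.Series(seasonal_means.loc[t % period].to_numpy(), index=series.index)

# Residual: whatever trend and seasonal do not explain
resid = series - trend - seasonal
print(seasonal_means.round(2).tolist())
```

Because the toy series has no noise, the residual is zero wherever the moving average is defined; on real sales data it holds everything the other two components miss.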

In [None]:
def time_plot(data, y_col, title):
    fig, ax = plt.subplots(figsize=(15,5))
    data.resample('M')[y_col].sum().plot(ax=ax, color='mediumblue', label='Total Sales')
    data.resample('M')[y_col].mean().plot(ax=ax, color='red', label='Mean Sales')

    ax.set(xlabel="Date",
           ylabel="Sales",
           title=title)

    ax.legend()
    sns.despine()

# Example usage with your specific details
time_plot(train, 'target', 'Monthly Sales Over the Years')

3. Are there outliers in our dataset?

4. What is the difference between RMSLE, RMSE and MSE?
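
A short sketch of question 4, using plain NumPy and two made-up predictions. MSE is the average squared error, RMSE is its square root (so it is in the same units as sales), and RMSLE is the RMSE of log(1 + value), which measures relative rather than absolute error. The function names and sample values below are illustrative only.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average of squared differences."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean((y_true - y_pred) ** 2)

def rmse(y_true, y_pred):
    """Root mean squared error: same units as the target."""
    return np.sqrt(mse(y_true, y_pred))

def rmsle(y_true, y_pred):
    """RMSE computed on log(1 + value); needs non-negative inputs."""
    return rmse(np.log1p(y_true), np.log1p(y_pred))

# Both predictions are off by 10, but one is off by 100% and the other by 1%
y_true = [10.0, 1000.0]
y_pred = [20.0, 1010.0]
print(mse(y_true, y_pred), rmse(y_true, y_pred), rmsle(y_true, y_pred))
```

MSE and RMSE treat the two errors identically, while RMSLE is dominated by the small-sales row, where the prediction is off by a factor of two.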

## Univariate Analysis

## Bivariate Analysis

In [None]:
# Calculate the correlation matrix (numeric and boolean columns only)
correlation_matrix = train.corr(numeric_only=True)

# Find the moderately correlated variables (absolute correlation between 0.5 and 0.8)
abs_corr = correlation_matrix.abs()
moderate_correlation = (abs_corr > 0.5) & (abs_corr < 0.8)

# Get the variable pairs with moderate correlation
moderate_correlation_pairs = [(i, j) for i in moderate_correlation.columns for j in moderate_correlation.columns if moderate_correlation.loc[i, j]]

# Print the moderately correlated variables
for pair in moderate_correlation_pairs:
    var1, var2 = pair
    correlation_value = correlation_matrix.loc[var1, var2]
    print(f"{var1} and {var2} are moderately correlated (correlation value: {correlation_value})")

These columns are all boolean, so let's look at others

In [None]:
# Set the threshold for high correlation
threshold = 0.8

# Find the highly correlated variables
high_correlation = (correlation_matrix.abs() > threshold) & (correlation_matrix != 1)

# Get the variable pairs with high correlation
high_correlation_pairs = [(i, j) for i in high_correlation.columns for j in high_correlation.columns if high_correlation.loc[i, j]]

# Print the highly correlated variables
for pair in high_correlation_pairs:
    var1, var2 = pair
    correlation_value = correlation_matrix.loc[var1, var2]
    print(f"{var1} and {var2} are highly correlated (correlation value: {correlation_value})")

In [None]:
# Specify the column pairs and their correlation values
column_pairs = [('year', 'year_weekofyear', 0.9884229388238451),
                ('month', 'dayofyear', 0.9964919406599103),
                ('month', 'weekofyear', 0.9658303008707717),
                ('month', 'quarter', 0.9713815220940318),
                ('dayofyear', 'month', 0.9964919406599103),
                ('dayofyear', 'weekofyear', 0.9669203951023091),
                ('dayofyear', 'quarter', 0.9685365398989686),
                ('weekofyear', 'month', 0.9658303008707717),
                ('weekofyear', 'dayofyear', 0.9669203951023091),
                ('weekofyear', 'quarter', 0.9426460215490194),
                ('quarter', 'month', 0.9713815220940318),
                ('quarter', 'dayofyear', 0.9685365398989686),
                ('quarter', 'weekofyear', 0.9426460215490194),
                ('year_weekofyear', 'year', 0.9884229388238451)]

# Create a grid layout for the scatter plots
num_pairs = len(column_pairs)
num_cols = 3  # Number of columns in the grid layout
num_rows = (num_pairs + num_cols - 1) // num_cols  # Number of rows in the grid layout

fig, axes = plt.subplots(num_rows, num_cols, figsize=(12, 12))

# Create scatter plots for each column pair
for i, pair in enumerate(column_pairs):
    x_col, y_col, correlation_value = pair
    row = i // num_cols
    col = i % num_cols

    # Select the appropriate subplot for the scatter plot
    ax = axes[row, col] if num_rows > 1 else axes[col]

    # Create the scatter plot
    ax.scatter(train[x_col], train[y_col], alpha=0.5)
    ax.set_xlabel(x_col)
    ax.set_ylabel(y_col)
    ax.set_title(f"Scatter plot: {x_col} vs {y_col}\nCorrelation value: {correlation_value:.4f}")

# Adjust the spacing between subplots
fig.tight_layout()

# Display the grid of scatter plots
plt.show()

Let's keep these columns, because more date features should help the accuracy of our ML models in this case.

In [None]:
# First format how figures appear in the notebook
pd.options.display.float_format = '{:.2f}'.format

**Summary of Our Sales and Number of Transactions**

In [None]:
# Calculate summary statistics
summary_stats = train[['target', 'nbr_of_transactions']].describe()
print(summary_stats)

**Histogram of Sales**

In [None]:
# Histogram of 'target'
train['target'].plot(kind='hist')
plt.xlabel('Target')
plt.ylabel('Frequency')
plt.title('Distribution of Target')
plt.show()

**Correlation Between Sales and number of Transactions**

In [None]:
# Correlation matrix
corr_matrix = train[['target', 'nbr_of_transactions']].corr()
print(corr_matrix)

# Scatter plot
plt.scatter(train['target'], train['nbr_of_transactions'])
plt.xlabel('Target')
plt.ylabel('Number of Transactions')
plt.title('Scatter Plot of Target vs Number of Transactions')
plt.show()

The correlation coefficient between 'target' and 'nbr_of_transactions' is 0.24. This is a weak positive correlation: the two variables tend to increase together, but the relationship is not strong.

**Observing Sales over time**

In [None]:
# Line plot of sales over time
train['target'].plot()
plt.xlabel('Date')
plt.ylabel('Sales')
plt.title('Sales Over Time')
plt.show()

In [None]:
# Resample the data by day and calculate the total sales for each day
sales_daily = train['target'].resample('D').sum()

# Create a line plot of the "sales" column
plt.plot(sales_daily.index, sales_daily)

# Set the title and axis labels
plt.title("Total Sales by Day")
plt.xlabel("Date")
plt.ylabel("Total Sales")

# Display the plot
plt.show()

**Holiday Impact on Sales**

In [None]:
# Bar chart of sales by holiday type
train.groupby('holiday_type')['target'].sum().plot(kind='bar')
plt.xlabel('Holiday Type')
plt.ylabel('Sales')
plt.title('Total Sales by Holiday Type')
plt.show()

In [None]:
# Box plot of sales during holidays vs non-holidays
train.boxplot(column='target', by='is_holiday', figsize=(8, 6))
plt.xlabel('is_it_a_Holiday')
plt.ylabel('Sales')
plt.title('Sales During Holidays vs Non-Holidays')
plt.suptitle('')
plt.show()

**Stores Performance**

In [None]:
# Bar chart of sales by store
train.groupby('store_id')['target'].sum().plot(kind='bar')
plt.xlabel('Store ID')
plt.ylabel('Sales')
plt.title('Total Sales by Store')
plt.show()

In [None]:
# Bar chart of sales by category
train.groupby('category_id')['target'].sum().plot(kind='bar')
plt.xlabel('Category ID')
plt.ylabel('Sales')
plt.title('Total Sales by Category')
plt.show()

**Promotion Analysis**

In [None]:
# Separate data for promotion and non-promotion
promotion_data = train[train['onpromotion'] == 1]
non_promotion_data = train[train['onpromotion'] == 0]

# Calculate average sales per day for promotion and non-promotion
promotion_avg_sales = promotion_data.groupby(promotion_data.index)['target'].mean()
non_promotion_avg_sales = non_promotion_data.groupby(non_promotion_data.index)['target'].mean()

# Line plot of average sales with and without promotion
plt.plot(promotion_avg_sales.index, promotion_avg_sales, label='Promotion', color='blue')
plt.plot(non_promotion_avg_sales.index, non_promotion_avg_sales, label='No Promotion', color='red')
plt.xlabel('Date')
plt.ylabel('Average Sales')
plt.title('Average Sales with and without Promotion Over Time')
plt.legend()
plt.show()

**Monthly Statistics**

In [None]:
# Filter data for the period from 2001 to 2003
# sales_2001_to_2003 = train['2001':'2003']

# Group by month and calculate sum of sales
monthly_sales = train.groupby(train.index.month)['target'].sum()

# Find the month with the highest sales
highest_sales_month = monthly_sales.idxmax()

# Print the month with the highest sales
print("The month with the highest sales is:", highest_sales_month)

In [None]:
# Filter data for the period from 2001 to 2003
# sales_2001_to_2003 = train['2001':'2003']

# Group by month and calculate sum of sales
monthly_sales = train.groupby(train.index.month)['target'].sum()

# Find the month with the lowest sales
lowest_sales_month = monthly_sales.idxmin()

# Print the month with the lowest sales
print("The month with the lowest sales is:", lowest_sales_month)

In [None]:
# Resample to monthly frequency and calculate sum of sales
monthly_sales = train['target'].resample('M').sum()

# Plot the monthly sales data
monthly_sales.plot()
plt.xlabel('Date')
plt.ylabel('Sales')
plt.title('Monthly Sales Over the Years')
plt.show()

## Multivariate Analysis

# Feature Engineering

## Creating New Features

In [None]:
def getDateFeatures(df):

    df["is_weekend"] = df["dayofweek"] > 4

    # Define the criteria for each season
    seasons = {'Winter': [12, 1, 2], 'Spring': [3, 4, 5], 'Summer': [6, 7, 8], 'Autumn': [9, 10, 11]}

    # Create the 'season' column based on the 'month' column
    df['season'] = df["month"].map({month: season for season, months in seasons.items() for month in months})

    return df

In [None]:
getDateFeatures(train)
getDateFeatures(test)

In [None]:
# Aggregate to one row per day, store and category (freq='D' groups by day, despite the weekly_sum names)
first_cols = ['city', 'store_type', 'cluster', 'holiday_type', 'is_holiday', 'year', 'month', 'dayofmonth', 'dayofweek', 'dayofyear', 'weekofyear', 'quarter', 'is_month_start', 'is_month_end', 'is_quarter_start', 'is_quarter_end', 'is_year_start', 'is_year_end', 'year_weekofyear', 'sin(dayofyear)', 'cos(dayofyear)', 'is_weekend', 'season']

# Sum the quantities and keep the first value of every descriptive column
train_agg = {'target': 'sum', 'onpromotion': 'sum', 'nbr_of_transactions': 'sum', **{col: 'first' for col in first_cols}}
weekly_sum = train.groupby([pd.Grouper(freq='D'), 'store_id', 'category_id']).agg(train_agg).reset_index().set_index('date_extracted')
train = weekly_sum

test_agg = {'onpromotion': 'sum', **{col: 'first' for col in first_cols}}
weekly_sum1 = test.groupby([pd.Grouper(freq='D'), 'store_id', 'category_id']).agg(test_agg).reset_index().set_index('date_extracted')
test = weekly_sum1

In [None]:
# Selecting relevant columns and creating ID column
weekly_sum1['ID'] = 'year_week_' + weekly_sum1['year_weekofyear'].astype(str) + '_' + weekly_sum1['store_id'].astype(str) + '_' + weekly_sum1['category_id'].astype(str)

## Features Encoding & scaling

In [None]:
numeric_columns = train.select_dtypes(include=['int64', 'float64']).columns.tolist()
categoric_columns = [col for col in train.columns if col not in numeric_columns]
categoric_columns

In [None]:
numeric_columns.remove('target')
numeric_columns.remove('nbr_of_transactions')
# categoric_columns.remove('ID')
print(numeric_columns)

In [None]:
encoder = BinaryEncoder(drop_invariant=False, return_df=True,)
encoder.fit(train[categoric_columns])

In [None]:
scaler = StandardScaler()
scaler.set_output(transform="pandas")
scaler.fit(train[numeric_columns])

In [None]:
# import pickle

# with open('encoder.pkl', 'wb') as f:
#     pickle.dump(encoder, f)

# with open('scaler.pkl', 'wb') as f:
#     pickle.dump(scaler, f)

In [None]:
scaled_num = scaler.transform(train[numeric_columns])
scaled_num_test = scaler.transform(test[numeric_columns])

In [None]:
encoded_cat = encoder.transform(train[categoric_columns])
encoded_cat_test = encoder.transform(test[categoric_columns])

In [None]:
train = pd.concat([scaled_num, encoded_cat, train['target']], axis=1)
test = pd.concat([scaled_num_test, encoded_cat_test], axis=1)

## Resampling

In [None]:
# resampled = train.resample('W').mean()
# resampled_test = test.resample('W').mean()
# train = resampled
# test = resampled_test

**dataframe for the traditional time series models**

In [None]:
train1 = train[['target']].copy()

In [None]:
train1.head()

In [None]:
# Split data into parts
x = train.drop(['target'], axis = 1)
y = train['target']

In [None]:
len(train)-len(test)

In [None]:
# Split data chronologically: the earliest rows train the models,
# the most recent len(test) rows are held out for evaluation
split_point = len(train) - len(test)
X_train, X_test = x[:split_point], x[split_point:]
y_train, y_test = y[:split_point], y[split_point:]
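
For time series, the usual convention is to train on the past and validate on the most recent rows. Here is a minimal sketch on made-up data, assuming the rows are already sorted by date; `toy`, `holdout_size` and the other names are hypothetical.

```python
import pandas as pd

# Hypothetical daily series, already sorted by date
toy = pd.DataFrame(
    {"feature": range(10), "target": range(100, 110)},
    index=pd.date_range("2001-01-01", periods=10, freq="D"),
)
holdout_size = 3  # assumed number of most recent rows kept for validation

# Earlier rows train the model, the most recent rows validate it
cutoff = len(toy) - holdout_size
toy_train, toy_valid = toy.iloc[:cutoff], toy.iloc[cutoff:]
print(toy_train.index.max(), toy_valid.index.min())
```

Splitting this way keeps future information out of the training set, which a random shuffle would not.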

# Machine Learning Modeling

# Non-Traditional Time Series Models

### DecisionTreeRegressor

In [None]:
from sklearn.tree import DecisionTreeRegressor
tree = DecisionTreeRegressor()
model_tree = tree.fit(X_train, y_train)

# Make prediction on X_test
tree_pred = model_tree.predict(X_test)

In [None]:
# feature importance for decision tree
plt.figure(figsize=(12,7))
plt.barh(X_train.columns, model_tree.feature_importances_)

In [None]:
plt.figure(figsize=(8,4))
plt.plot(y_test, label='Actual Sales')
# plot predictions against the same dates as the actual values
plt.plot(y_test.index, tree_pred, label='DecisionTreeRegressor')
plt.legend(loc='best')
plt.title('DecisionTreeRegressor Prediction')
plt.show()

In [None]:
mse = mean_squared_error(y_test, tree_pred )
rmse = np.sqrt(mean_squared_error(y_test, tree_pred )).round(2)
rmsle = np.sqrt(mean_squared_log_error(y_test, tree_pred)).round(2)
msle = mean_squared_log_error(y_test, tree_pred).round(2)


results = pd.DataFrame([['DecisionTree', mse, msle, rmse, rmsle]], columns = ['Model', 'MSE', 'MSLE', 'RMSE', 'RMSLE'])
results

### KNN

In [None]:
from sklearn.neighbors import KNeighborsRegressor
neigh = KNeighborsRegressor(n_neighbors=1)
# fit model on training data
neigh.fit(X_train, y_train)

# make predictions for test data
neigh_pred = neigh.predict(X_test)

In [None]:
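# KNN has no feature_importances_ attribute, so use permutation importance instead
from sklearn.inspection import permutation_importance
perm = permutation_importance(neigh, X_test, y_test, n_repeats=5, random_state=0)
plt.figure(figsize=(12,7))
plt.barh(X_train.columns, perm.importances_mean)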

In [None]:
plt.figure(figsize=(8,4))
plt.plot(y_test, label='Actual Sales')
# plot predictions against the same dates as the actual values
plt.plot(y_test.index, neigh_pred, label='KNeighborsRegressor')
plt.legend(loc='best')
plt.title('KNeighborsRegressor Prediction')
plt.show()

In [None]:
mse = mean_squared_error(y_test, neigh_pred )
msle = mean_squared_log_error(y_test, neigh_pred)
rmse = np.sqrt(mean_squared_error(y_test, neigh_pred )).round(2)
rmsle = np.sqrt(mean_squared_log_error(y_test, neigh_pred)).round(5)

# model_results = pd.DataFrame([['lightGBM', mse, rmse]], columns = ['Model', 'MSE', 'RMSE'])
model_results = pd.DataFrame([['KNN', mse, msle, rmse, rmsle]], columns = ['Model', 'MSE', 'MSLE', 'RMSE', 'RMSLE'])
# DataFrame.append was removed in pandas 2.0; pd.concat works in old and new versions
results = pd.concat([results, model_results], ignore_index=True)
results

### RandomForestRegressor

In [None]:
from sklearn.ensemble import RandomForestRegressor
# Initialize and fit the Random Forest Regressor
forest = RandomForestRegressor()
model_forest = forest.fit(X_train, y_train)

# Make predictions on X_test
forest_pred = model_forest.predict(X_test)

In [None]:
plt.figure(figsize=(8, 4))
plt.plot(y_test, label='Actual Sales')
# plot predictions against the same dates as the actual values
plt.plot(y_test.index, forest_pred, label='RandomForestRegressor')
plt.legend(loc='best')
plt.title('RandomForestRegressor Prediction')
plt.show()

In [None]:
mse = mean_squared_error(y_test, forest_pred)
msle = mean_squared_log_error(y_test, forest_pred)
rmse = np.sqrt(mse).round(2)
rmsle = np.sqrt(msle).round(5)

# Append the results to the DataFrame
model_results = pd.DataFrame([['Random Forest', mse, msle, rmse, rmsle]],
                             columns=['Model', 'MSE', 'MSLE', 'RMSE', 'RMSLE'])
results = pd.concat([results, model_results], ignore_index=True)
results

### Support Vector Regression (SVR)

In [None]:
from sklearn.svm import SVR

# Initialize and fit the SVR model
svr = SVR()
model_svr = svr.fit(X_train, y_train)

# Make predictions on X_test
svr_pred = model_svr.predict(X_test)

In [None]:
plt.figure(figsize=(8, 4))
plt.plot(y_test, label='Actual Sales')
plt.plot(svr_pred, label='Support Vector Regression')
plt.legend(loc='best')
plt.title('Support Vector Regression Prediction')
plt.show()

In [None]:
# Append the results to the DataFrame
mse = mean_squared_error(y_test, svr_pred)
msle = mean_squared_log_error(y_test, svr_pred)
rmse = np.sqrt(mse).round(2)
rmsle = np.sqrt(msle).round(5)

model_results = pd.DataFrame([['SVR', mse, msle, rmse, rmsle]],
                             columns=['Model', 'MSE', 'MSLE', 'RMSE', 'RMSLE'])
results = pd.concat([results, model_results], ignore_index=True)

### Gradient Boosting

In [None]:
from sklearn.ensemble import GradientBoostingRegressor

# Initialize and fit the Gradient Boosting model
gbr = GradientBoostingRegressor()
model_gbr = gbr.fit(X_train, y_train)

# Make predictions on X_test
gbr_pred = model_gbr.predict(X_test)

In [None]:
plt.figure(figsize=(8, 4))
plt.plot(y_test, label='Actual Sales')
plt.plot(gbr_pred, label='Gradient Boosting')
plt.legend(loc='best')
plt.title('Gradient Boosting Prediction')
plt.show()

In [None]:
# Append the results to the DataFrame
mse = mean_squared_error(y_test, gbr_pred)
msle = mean_squared_log_error(y_test, gbr_pred)
rmse = np.sqrt(mse).round(2)
rmsle = np.sqrt(msle).round(5)

model_results = pd.DataFrame([['Gradient Boosting', mse, msle, rmse, rmsle]],
                             columns=['Model', 'MSE', 'MSLE', 'RMSE', 'RMSLE'])
results = pd.concat([results, model_results], ignore_index=True)

### XGBoost

In [None]:
import xgboost as xgb

# Initialize and fit the XGBoost model
xgboost = xgb.XGBRegressor()
model_xgboost = xgboost.fit(X_train, y_train)

# Make predictions on X_test
xgboost_pred = model_xgboost.predict(X_test)

# Append the results to the DataFrame
mse = mean_squared_error(y_test, xgboost_pred)
msle = mean_squared_log_error(y_test, xgboost_pred)
rmse = np.sqrt(mse).round(2)
rmsle = np.sqrt(msle).round(5)

model_results = pd.DataFrame([['XGBoost', mse, msle, rmse, rmsle]],
                             columns=['Model', 'MSE', 'MSLE', 'RMSE', 'RMSLE'])
results = pd.concat([results, model_results], ignore_index=True)

### Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression

# Initialize and fit the Linear Regression model
linear_reg = LinearRegression()
model_linear_reg = linear_reg.fit(X_train, y_train)

# Make predictions on X_test
linear_reg_pred = model_linear_reg.predict(X_test)

# Append the results to the DataFrame
mse = mean_squared_error(y_test, linear_reg_pred)
msle = mean_squared_log_error(y_test, linear_reg_pred)
rmse = np.sqrt(mse).round(2)
rmsle = np.sqrt(msle).round(5)

model_results = pd.DataFrame([['Linear Regression', mse, msle, rmse, rmsle]],
                             columns=['Model', 'MSE', 'MSLE', 'RMSE', 'RMSLE'])
results = pd.concat([results, model_results], ignore_index=True)
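
Unconstrained models such as linear regression or SVR can predict negative sales, and `mean_squared_log_error` can raise a `ValueError` on such inputs (the exact threshold depends on the scikit-learn version). One possible guard, assuming sales can never fall below zero, is to clip predictions before scoring. The arrays below are made-up values for illustration only.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

y_true_demo = np.array([10.0, 0.0, 4.0])   # made-up actual sales
y_pred_demo = np.array([8.0, -1.5, 5.0])   # made-up predictions, one negative

try:
    mean_squared_log_error(y_true_demo, y_pred_demo)
    raised = False
except ValueError:
    raised = True  # log(1 + y) is undefined for this prediction

# Assumption: negative sales are impossible, so floor predictions at zero
y_pred_clipped = np.clip(y_pred_demo, 0, None)
rmsle_demo = np.sqrt(mean_squared_log_error(y_true_demo, y_pred_clipped))
print(raised, rmsle_demo)
```

Clipping changes the predictions themselves, so if it is applied it should be applied consistently to every model being compared.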

## Models Comparison

In [None]:
results
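
To read the comparison at a glance, the table can be ordered by the metric of interest. The sketch below assumes RMSLE is the ranking metric (the notebook does not state which one is decisive) and uses placeholder numbers, not results from this notebook.

```python
import pandas as pd

# Placeholder scores for illustration only; not results from this notebook
demo_results = pd.DataFrame({
    'Model': ['A', 'B', 'C'],
    'RMSE': [12.0, 9.5, 10.1],
    'RMSLE': [0.40, 0.35, 0.38],
})

# Lower is better for both metrics, so sort ascending
ranked = demo_results.sort_values('RMSLE').reset_index(drop=True)
best_model = ranked.loc[0, 'Model']
print(ranked)
```

The same call on the notebook's own table would be `results.sort_values('RMSLE')`.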

## Model Evaluation (Backtests)

In [None]:
# Backtests with KNN
scores = {}

_df = train.reset_index()  # reset the index once instead of on every comparison

for period in backtests:
    start, end = backtests[period][0], backtests[period][1]
    _train = _df[_df['date_extracted'] < start]
    _test = _df[(_df['date_extracted'] >= start) & (_df['date_extracted'] <= end)]

    Xtrain, ytrain = _train.set_index(['date_extracted']).drop(columns=['target']), _train.target
    Xtest, ytest = _test.set_index(['date_extracted']).drop(columns=['target']), _test.target

    knn_model = KNeighborsRegressor(n_neighbors=1).fit(Xtrain, ytrain)

    ypred = knn_model.predict(Xtest)

    scores[period] = np.sqrt(mean_squared_error(ytest, ypred))

print(scores)
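
The loop above expects `backtests` to map a label to a pair of start and end dates, and it trains only on rows dated before each window. A self-contained sketch of that expanding-window scheme follows; the window dates, the synthetic frame, and the `feature` column are assumptions made up for illustration, since the notebook's own `backtests` definition lies outside this section.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

# Synthetic weekly data standing in for the notebook's `train` frame
rng = np.random.default_rng(0)
dates = pd.date_range('2022-01-02', periods=40, freq='W')
demo = pd.DataFrame({'date_extracted': dates,
                     'feature': np.arange(40, dtype=float),
                     'target': np.arange(40) + rng.normal(0, 1, 40)})

# Hypothetical windows: label -> (first test date, last test date)
demo_backtests = {
    1: ('2022-07-03', '2022-07-31'),
    2: ('2022-08-07', '2022-09-04'),
}

demo_scores = {}
for period, (start, end) in demo_backtests.items():
    # Train on rows strictly before the window, test on the window itself
    past = demo[demo['date_extracted'] < start]
    window = demo[demo['date_extracted'].between(start, end)]  # inclusive bounds
    model = KNeighborsRegressor(n_neighbors=1).fit(past[['feature']], past['target'])
    pred = model.predict(window[['feature']])
    demo_scores[period] = np.sqrt(mean_squared_error(window['target'], pred))

print(demo_scores)
```

Because each later window trains on more history than the one before it, a large spread between the per-window RMSE values may suggest the model is sensitive to the training period.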

# Hyperparameter Tuning

### Predicting sales on the test set

In [None]:
test_pred = neigh.predict(test)
test_pred

In [None]:
weekly_sum1['target'] = test_pred
sub = weekly_sum1[['ID', 'target']]
sub

In [None]:
# Save the submission file
sub.to_csv('submission.csv', index=False)