# Kaggle Getting Started Prediction Competition: Store Sales - Time Series Forecasting

In this [competition](https://www.kaggle.com/competitions/store-sales-time-series-forecasting), we use time-series forecasting to predict store sales using data from Corporación Favorita, a large grocery retailer based in Ecuador. The notebook builds on the hands-on exercises from the Kaggle Learn [Time Series course](https://www.kaggle.com/learn/time-series), which teaches how to leverage periodic trends for forecasting and how to combine models such as linear regression and XGBoost to improve forecasts. This tutorial focuses on forecasting with periodic trends.

## Install necessary packages

We can install the necessary packages either by running `pip install --user <package_name>` for each one, or by listing everything in a `requirements.txt` file and running `pip install --user -r requirements.txt`. The dependencies are already listed in `requirements.txt`, so we use the latter method.

Restart the kernel after installation.

In [None]:
!pip install --user -r requirements.txt

## Download Data from Kaggle

Download the competition data from Kaggle by running the code cell below. Follow the initial steps described in the GitHub README.md to obtain the Kaggle username and key needed to authenticate with the Kaggle Public API; both values are stored in the `kaggle.json` file. No secret needs to be created for this step. Run this cell before starting the Kale pipeline from the Kale deployment panel. Run the cell only once so that you don't create nested directories; if you need to run it again, restart the kernel first.

In [None]:
import os

# Get the Kaggle username and key from the kaggle.json file
# and paste them in place of KAGGLE_USERNAME and KAGGLE_KEY on the right-hand side

os.environ['KAGGLE_USERNAME'] = "KAGGLE_USERNAME"
os.environ['KAGGLE_KEY'] = "KAGGLE_KEY"

path = "data/"

print("Current directory: %s" % os.getcwd())
# Create the data directory if it does not already exist
os.makedirs(path, exist_ok=True)
os.chdir(path)

import kaggle
from kaggle.api.kaggle_api_extended import KaggleApi
api = KaggleApi()
api.authenticate()

# Getting the files list from Kaggle using Kaggle api
file_list = api.competition_list_files('store-sales-time-series-forecasting')
# print(file_list)

# Download the entire dataset   
api.competition_download_files('store-sales-time-series-forecasting')

print("Unzipping the files ...")

# Get the path of the directory where the files are downloaded
path_dir = os.getcwd()
print("Data file path: %s" % path_dir)

from zipfile import ZipFile

# Extract all files from the competition zip file; the context manager
# closes the archive automatically
with ZipFile(os.path.join(path_dir, 'store-sales-time-series-forecasting.zip'), 'r') as archive:
    archive.extractall()

print("Checking that the files were extracted properly ...")
print("For this dataset, we only check the larger files, since those take the longest to extract")

for entry in os.listdir(path_dir):
    filename = os.fsdecode(entry)
    if filename.endswith(".csv"):
        file_size = os.path.getsize(os.path.join(path_dir, filename))
        # Format the size as an "<N>MB" or "<N>GB" string so it can be
        # compared with the size reported in the Kaggle file listing
        if file_size < 1e9:
            file_size = str(round(file_size / (1024 * 1024))) + "MB"
        else:
            file_size = str(round(file_size / (1024 * 1024 * 1024))) + "GB"
        # Print only the files whose extracted size matches the listed size
        for remote_file in file_list:
            if remote_file.name == filename and remote_file.size == file_size:
                print(remote_file.name, remote_file.size, file_size)

print("All files are downloaded and unzipped inside the data directory. Please move on to next step")
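
As an alternative to pasting the credentials into the cell above by hand, the values can be read from `kaggle.json` directly. The following is a minimal sketch, not part of the original notebook: the helper name `load_kaggle_credentials` is hypothetical, and it assumes the file contains the usual `username` and `key` fields and sits in the default `~/.kaggle/` location. It would need to run before `import kaggle`.

```python
import json
import os
from pathlib import Path


def load_kaggle_credentials(json_path):
    """Read username/key from a kaggle.json file and export them as env vars.

    Hypothetical helper for illustration; returns the username that was loaded.
    """
    with open(json_path, encoding="utf-8") as fh:
        creds = json.load(fh)
    os.environ["KAGGLE_USERNAME"] = creds["username"]
    os.environ["KAGGLE_KEY"] = creds["key"]
    return creds["username"]


# Assumed default location; adjust to wherever your kaggle.json lives.
default_path = Path.home() / ".kaggle" / "kaggle.json"
if default_path.exists():
    print("Loaded credentials for", load_kaggle_credentials(default_path))
```

Reading the file keeps the key out of the notebook itself, which matters if the notebook is shared or committed to version control.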

## Imports

In [None]:
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
from sklearn.linear_model import LinearRegression
from statsmodels.tsa.deterministic import CalendarFourier, DeterministicProcess
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

## Load the data

In [None]:
path = 'data'
# path = os.getcwd()

train_data_filepath = path + "/train.csv"
test_data_filepath = path + "/test.csv"
holidays_filepath = path + "/holidays_events.csv"

# Read the csv files into dataframes
# Training data
train_sales = pd.read_csv(train_data_filepath,
    usecols=['store_nbr', 'family', 'date', 'sales'],
    dtype={
        'store_nbr': 'category',
        'family': 'category',
        'sales': 'float32',
    },
    parse_dates=['date'],
    infer_datetime_format=True,
)
train_sales['date'] = train_sales.date.dt.to_period('D')
train_sales = train_sales.set_index(['store_nbr', 'family', 'date']).sort_index()

# Holiday features dataset
holidays_events = pd.read_csv(
    holidays_filepath,
    dtype={
        'type': 'category',
        'locale': 'category',
        'locale_name': 'category',
        'description': 'category',
        'transferred': 'bool',
    },
    parse_dates=['date'],
    infer_datetime_format=True,
)
holidays_events = holidays_events.set_index('date').to_period('D')


# Test data; the id column is required for submitting the forecast sales
df_test = pd.read_csv(
    test_data_filepath,
    dtype={
        'store_nbr': 'category',
        'family': 'category',
        'onpromotion': 'uint32',
    },
    parse_dates=['date'],
    infer_datetime_format=True,
)
df_test['date'] = df_test.date.dt.to_period('D')
df_test = df_test.set_index(['store_nbr', 'family', 'date']).sort_index()

In [None]:
train_sales.head()

In [None]:
holidays_events.head()

In [None]:
df_test.head()

## Create features

1. Indicators for weekly seasons
2. Fourier features of order 4 for monthly seasons
3. Holiday features provided in the Store Sales dataset
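
To make the second item concrete, the sketch below shows roughly what monthly Fourier features of order 4 look like: four sine/cosine pairs evaluated at the fraction of the month that has elapsed. It uses only the standard library, and the function name and column names (`sin_1`, `cos_1`, ...) are hypothetical. It is an illustration of the idea, under the assumption that the position within the month is measured from the first day, and is not the `statsmodels` implementation.

```python
import calendar
import math
from datetime import date


def monthly_fourier_features(day, order=4):
    """Return sin/cos pairs describing where `day` falls within its month."""
    days_in_month = calendar.monthrange(day.year, day.month)[1]
    # Fraction of the month elapsed: 0.0 on the 1st, approaching 1.0 at month end.
    position = (day.day - 1) / days_in_month
    features = {}
    for k in range(1, order + 1):
        angle = 2 * math.pi * k * position
        features[f"sin_{k}"] = math.sin(angle)
        features[f"cos_{k}"] = math.cos(angle)
    return features


feats = monthly_fourier_features(date(2017, 8, 1))
print(len(feats))       # 8 columns: 4 sine + 4 cosine
print(feats["cos_1"])   # 1.0 on the first day of the month
```

Higher values of `k` capture faster oscillations within the month, so a larger order lets a linear model fit a more detailed monthly pattern at the cost of more columns.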

In [None]:
# National and regional holidays of Ecuador in the training set
# Holiday features
holidays = (
    holidays_events
    .query("locale in ['National', 'Regional']")
    .loc['2017':'2017-08-15', ['description']]
    .assign(description=lambda x: x.description.cat.remove_unused_categories())
)

In [None]:
print(holidays)

In [None]:
# Create training data features

y = train_sales.unstack(['store_nbr', 'family']).loc["2017"]

# Using CalendarFourier to create fourier features 
fourier = CalendarFourier(freq='M', order=4)

# Using DeterministicProcess to create indicators for both 
# weekly and monthly seasons
dp = DeterministicProcess(
    index=y.index,
    constant=True,
    order=1,
    seasonal=True,               # weekly seasonality (indicators)
    additional_terms=[fourier],  # monthly seasonality (fourier)
    drop=True,
)

# `in_sample` creates features for the dates given in the `index` argument
X = dp.in_sample()

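# Scikit-learn solution: one-hot encode the holiday descriptions
# (on scikit-learn >= 1.2, pass `sparse_output=False` instead of `sparse=False`)
ohe = OneHotEncoder(sparse=False)

X_holidays = pd.DataFrame(
    ohe.fit_transform(holidays),
    index=holidays.index,
    columns=ohe.categories_[0],  # column order matches the encoder output
)

# Pandas solution: an equivalent encoding that overrides the frame above
X_holidays = pd.get_dummies(holidays, dtype=float)

# Join holiday features to training data; dates without a holiday get 0.0
X_2 = X.join(X_holidays, on='date').fillna(0.0)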

## Train and Evaluate the Model

In [None]:
# Split the data to train and valid datasets
X_train, X_valid, y_train, y_valid = train_test_split(X_2, y, test_size=0.1, shuffle=False)

# Train the model
model = LinearRegression(fit_intercept=False)
model.fit(X_train, y_train)

# Get the training and valid data predictions
y_train_pred = pd.DataFrame(model.predict(X_train), index=X_train.index, columns=y.columns)
y_valid_pred = pd.DataFrame(model.predict(X_valid), index=X_valid.index, columns=y.columns)
# Evaluate the model using mean absolute error on the validation set
print(mean_absolute_error(y_valid, y_valid_pred))
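
The cell above reports mean absolute error. The competition leaderboard is scored with root mean squared logarithmic error (RMSLE), so it can be useful to track that as well. The following is a minimal standard-library sketch of the metric; the function name `rmsle` and the choice to clip negative predictions to zero are assumptions made for illustration, since a linear model can produce negative forecasts and the logarithm is undefined for them.

```python
import math


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error over two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    total = 0.0
    for actual, predicted in zip(y_true, y_pred):
        # Assumption: clip negative forecasts to zero, since sales cannot be
        # negative and log1p is undefined at or below -1.
        predicted = max(predicted, 0.0)
        total += (math.log1p(predicted) - math.log1p(actual)) ** 2
    return math.sqrt(total / len(y_true))


print(rmsle([0.0, 10.0, 100.0], [0.0, 10.0, 100.0]))  # 0.0 for a perfect forecast
```

In the notebook itself, the same quantity could be obtained by clipping `y_valid_pred` at zero and taking the square root of scikit-learn's `mean_squared_log_error`.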

## Forecast sales

In [None]:
# Create features for test set
# "out of sample" refers to times outside of the observation period of the training data.
# We forecast the 16 days that follow the last date in the training data
test = dp.out_of_sample(steps=16)
test.index.name = 'date'
X_test = test.join(X_holidays, on='date').fillna(0.0)
y_forecast = pd.DataFrame(model.predict(X_test), index=X_test.index, columns=y.columns)
y_forecast
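
As a quick sanity check on the forecast horizon, the sketch below lists the dates that a 16-step forecast covers. It assumes the training data ends on 2017-08-15, which matches the date slice used when building the holiday features; the variable names are illustrative only.

```python
from datetime import date, timedelta

# Assumption: training data ends on 2017-08-15, matching the slice used above.
last_train_day = date(2017, 8, 15)
horizon = 16  # same value as `steps` in dp.out_of_sample

# One forecast date per step, starting the day after the training data ends.
forecast_days = [last_train_day + timedelta(days=i) for i in range(1, horizon + 1)]
print(forecast_days[0], forecast_days[-1])  # 2017-08-16 2017-08-31
```

If the index of `X_test` does not span the same dates, the join with `df_test` in the submission step will leave rows unmatched.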

## Submission

In [None]:
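# Reshape the wide forecast back to long format, one row per (date, store_nbr, family)
y_submit = y_forecast.stack(['store_nbr', 'family'])
# Attach the test ids and keep only the columns needed for the submission file
y_submit = y_submit.join(df_test.id).reindex(columns=['id', 'sales'])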
y_submit.to_csv('submission.csv', index=False)
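
Before uploading, it may be worth checking the file that was written. The sketch below validates CSV text against the layout produced above (an `id` column and a `sales` column) using only the standard library. The helper name `check_submission`, the sample rows, and the rule that sales must be non-negative are assumptions made for illustration.

```python
import csv
import io


def check_submission(csv_text, expected_rows):
    """Validate the header, row count, and non-negative numeric sales values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != ["id", "sales"]:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    rows = list(reader)
    if len(rows) != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, found {len(rows)}")
    for row in rows:
        # float() raises ValueError for blank or non-numeric entries
        if float(row["sales"]) < 0:
            raise ValueError(f"negative sales for id {row['id']}")
    return True


# Made-up sample rows, for illustration only.
sample = "id,sales\n1,4.5\n2,0.0\n"
print(check_submission(sample, expected_rows=2))  # True
```

For the real file, one option is to pass the contents of `submission.csv` together with `expected_rows=len(df_test)`.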