# Store Sales - Multiple Features Forecasting

https://www.kaggle.com/competitions/store-sales-time-series-forecasting

***In this project, forecasting also takes the 'onpromotion' feature into account.<br>
Forecasts are also produced separately for each product family.***

The evaluation metric for this competition is Root Mean Squared Logarithmic Error.

The RMSLE is calculated as:
$\sqrt{ \frac{1}{n} \sum_{i=1}^n \left(\log (1 + \hat{y}_i) - \log (1 + y_i)\right)^2}$
where:

$n$ is the total number of instances,<br>
$\hat{y}_i$ is the predicted value of the target for instance $i$,<br>
$y_i$ is the actual value of the target for instance $i$, and<br>
$\log$ is the natural logarithm.

The training data:<br>
***store_nbr*** identifies the store at which the products are sold.<br>
***family*** identifies the type of product sold.<br>
***sales*** gives the total sales for a product family at a particular store at a given date.
Fractional values are possible since products can be sold in fractional units (1.5 kg of cheese, for instance, as opposed to 1 bag of chips).<br>
***onpromotion*** gives the total number of items in a product family that were being promoted at a store at a given date.

## Blueprint

1. Investigate the dataset (unique values, data types, etc.).
2. How should the *store_nbr* and *family* features be encoded numerically?
3. How should *date* be converted to time features?
4. Split the *train* dataset into *ourtrain* and *ourtest* for pre-validation.
5. Apply various ML models (Trend, Periodogram, Cycles, Hybrid).
6. Choose the best model and apply it to our test set.
7. Generate the CSV file for submission.


## Preprocessing

In [1]:
# Import packages
from pathlib import Path
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.linear_model import LinearRegression
import datetime
import math
from statsmodels.tsa.deterministic import CalendarFourier, DeterministicProcess

# Ignore Future Warning
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [2]:
# Load dataset
train = pd.read_csv('train.csv', parse_dates=["date"])
test = pd.read_csv('test.csv', parse_dates=["date"])

In [3]:
train

| | id | date | store_nbr | family | sales | onpromotion |
|---|---|---|---|---|---|---|
| 0 | 0 | 2013-01-01 | 1 | AUTOMOTIVE | 0.000 | 0 |
| 1 | 1 | 2013-01-01 | 1 | BABY CARE | 0.000 | 0 |
| 2 | 2 | 2013-01-01 | 1 | BEAUTY | 0.000 | 0 |
| 3 | 3 | 2013-01-01 | 1 | BEVERAGES | 0.000 | 0 |
| 4 | 4 | 2013-01-01 | 1 | BOOKS | 0.000 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 3000883 | 3000883 | 2017-08-15 | 9 | POULTRY | 438.133 | 0 |
| 3000884 | 3000884 | 2017-08-15 | 9 | PREPARED FOODS | 154.553 | 1 |
| 3000885 | 3000885 | 2017-08-15 | 9 | PRODUCE | 2419.729 | 148 |
| 3000886 | 3000886 | 2017-08-15 | 9 | SCHOOL AND OFFICE SUPPLIES | 121.000 | 8 |


#### - Error function:
$\sqrt{ \frac{1}{n} \sum_{i=1}^n \left(\log (1 + \hat{y}_i) - \log (1 + y_i)\right)^2}$

$n$ is the total number of instances,<br>
$\hat{y}_i$ is the predicted value of the target for instance $i$,<br>
$y_i$ is the actual value of the target for instance $i$

In [35]:
# Error function (RMSLE)
def error(y_p, y_t):
    # y_p: predicted values, y_t: actual values (both non-negative)
    # np.log1p(x) computes log(1 + x) element-wise
    pred_log = np.log1p(np.asarray(y_p, dtype=float))
    act_log = np.log1p(np.asarray(y_t, dtype=float))
    # Mean of the squared log differences, then its square root
    mean_sq_error = np.mean((pred_log - act_log) ** 2)
    return round(float(np.sqrt(mean_sq_error)), 4)
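
As a quick sanity check of the metric, the sketch below recomputes RMSLE on a few made-up values. The function name `rmsle` and the sample arrays are illustrative only and are not part of the notebook.

```python
import numpy as np

def rmsle(y_pred, y_true):
    """Root Mean Squared Logarithmic Error for non-negative inputs."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    # log1p(x) computes log(1 + x) element-wise
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

y_true = np.array([0.0, 10.0, 100.0])   # made-up actual sales

# A perfect forecast scores exactly 0
perfect = rmsle(y_true, y_true)

# The error is taken on the log scale, so missing by 10 units matters
# far more for a small value than for a large one
miss_small = rmsle([0.0, 20.0, 100.0], y_true)   # 10 predicted as 20
miss_large = rmsle([0.0, 10.0, 110.0], y_true)   # 100 predicted as 110

print(perfect, miss_small, miss_large)
```

The two misses differ by the same absolute amount, yet the one on the small value is penalized more heavily, which is why RMSLE suits series whose sales span several orders of magnitude.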

## 1. Data investigation

#### - train dataset
* shape : 3000888 × 6
* null : none
<br><br>
* *date* : timestamp. 2013-01-01 ~ 2017-08-15
* *store_nbr* : integer. 1 ~ 54
* *family* : str. ['AUTOMOTIVE', 'BABY CARE', 'BEAUTY', 'BEVERAGES', 'BOOKS',
       'BREAD/BAKERY', 'CELEBRATION', 'CLEANING', 'DAIRY', 'DELI', 'EGGS',
       'FROZEN FOODS', 'GROCERY I', 'GROCERY II', 'HARDWARE',
       'HOME AND KITCHEN I', 'HOME AND KITCHEN II', 'HOME APPLIANCES',
       'HOME CARE', 'LADIESWEAR', 'LAWN AND GARDEN', 'LINGERIE',
       'LIQUOR,WINE,BEER', 'MAGAZINES', 'MEATS', 'PERSONAL CARE',
       'PET SUPPLIES', 'PLAYERS AND ELECTRONICS', 'POULTRY',
       'PREPARED FOODS', 'PRODUCE', 'SCHOOL AND OFFICE SUPPLIES',
       'SEAFOOD']
* *sales* : float. 0 ~ 124717
* *onpromotion* : integer. 0 ~ 741
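
A summary like the one above can be collected with a handful of pandas calls. The sketch below runs them on a small made-up frame (`toy`, a hypothetical stand-in with the same columns as `train`), so the values it prints are not those of the real dataset.

```python
import pandas as pd

# Toy frame with the same columns as train.csv (values are made up)
toy = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "date": pd.to_datetime(["2013-01-01", "2013-01-01", "2013-01-02", "2013-01-02"]),
    "store_nbr": [1, 2, 1, 2],
    "family": ["AUTOMOTIVE", "BOOKS", "AUTOMOTIVE", "BOOKS"],
    "sales": [0.0, 2.5, 1.0, 0.0],
    "onpromotion": [0, 1, 0, 0],
})

shape = toy.shape                       # (rows, columns)
null_counts = toy.isna().sum()          # missing values per column
dtypes = toy.dtypes                     # storage type of each column
n_unique = toy.nunique()                # distinct values per column
date_range = (toy["date"].min(), toy["date"].max())
families = sorted(toy["family"].unique())
value_ranges = toy[["store_nbr", "sales", "onpromotion"]].agg(["min", "max"])

print(shape, date_range, families, sep="\n")
print(null_counts, dtypes, n_unique, value_ranges, sep="\n\n")
```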


#### - Correlation

In [34]:
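# Check correlation between the numeric columns
# numeric_only=True (pandas >= 1.5) skips the string column 'family',
# which would otherwise raise an error in pandas >= 2.0
train.corr(numeric_only=True)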

| | id | store_nbr | sales | onpromotion |
|---|---|---|---|---|
| id | 1.0 | 0.000301 | 0.085784 | 0.20626 |
| store_nbr | 0.000301 | 1.0 | 0.041196 | 0.007286 |
| sales | 0.085784 | 0.041196 | 1.0 | 0.427923 |
| onpromotion | 0.20626 | 0.007286 | 0.427923 | 1.0 |


## 2. Encode *'store_nbr'* and *'family'* features numerically
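
One possible way to approach this step, sketched on a made-up frame: integer category codes or one-hot columns for *family*, and one-hot columns for *store_nbr*. Which encoding the project ends up using is an open choice; the frame `toy` and the column prefixes are assumptions made for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "store_nbr": [1, 1, 2, 2],
    "family": ["BOOKS", "AUTOMOTIVE", "BOOKS", "DAIRY"],
})

# Option A: integer codes, assigned in alphabetical order of the family names
toy["family_code"] = toy["family"].astype("category").cat.codes

# Option B: one-hot columns, which avoid implying an order between families
family_dummies = pd.get_dummies(toy["family"], prefix="family", dtype=int)

# store_nbr is already an integer, but it is a label rather than a quantity,
# so it can be one-hot encoded the same way
store_dummies = pd.get_dummies(toy["store_nbr"], prefix="store", dtype=int)

encoded = pd.concat([toy, family_dummies, store_dummies], axis=1)
print(encoded)
```

Integer codes keep the frame narrow but suggest an ordering that a linear model would treat as meaningful; one-hot columns avoid that at the cost of one extra column per category.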

## 3. Convert *'date'* to time features
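
A minimal sketch of what time features could look like, built with plain pandas and NumPy rather than the `DeterministicProcess` helper imported earlier. The feature names and the choice of a single weekly Fourier pair are assumptions for illustration, not the notebook's final feature set.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2017-08-01", periods=14, freq="D")
X = pd.DataFrame(index=dates)

# Linear trend: 0, 1, 2, ... counts days since the first observation
X["trend"] = np.arange(len(dates))

# Calendar features read directly from the timestamp
X["dayofweek"] = dates.dayofweek        # Monday=0 ... Sunday=6
X["day"] = dates.day
X["month"] = dates.month
X["year"] = dates.year

# One Fourier pair for the weekly cycle (period of 7 days)
angle = 2 * np.pi * X["trend"] / 7
X["sin_week"] = np.sin(angle)
X["cos_week"] = np.cos(angle)

print(X.head())
```

The Fourier pair repeats every 7 rows, so a linear model can use it to capture a smooth weekly pattern without needing one indicator column per weekday.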

## 4. Split *train* dataset to *ourtrain* and *ourtest*
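
For time series, the hold-out set is usually the most recent block of dates rather than a random sample. The sketch below shows such a split on a made-up daily frame; the 16-day `horizon` is an assumed value chosen for illustration.

```python
import pandas as pd

dates = pd.date_range("2017-07-01", "2017-08-15", freq="D")
toy = pd.DataFrame({"date": dates, "sales": range(len(dates))})

# Hold out the final days instead of sampling at random, so the model is
# always validated on dates later than those it was fitted on
horizon = 16   # assumed number of hold-out days
cutoff = toy["date"].max() - pd.Timedelta(days=horizon)

ourtrain = toy[toy["date"] <= cutoff]
ourtest = toy[toy["date"] > cutoff]

print(ourtrain["date"].max(), ourtest["date"].min(), len(ourtest))
```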

## 5. Apply various ML models
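
As a starting point, the simplest of these models can be sketched as one linear fit per product family on a trend term plus *onpromotion*. The example uses `np.linalg.lstsq` on synthetic data to stay dependency-light; it solves the same ordinary least squares problem that `LinearRegression` would, and the helper `fit_family` and all numbers are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic data: each family follows sales = a + b * trend + c * onpromotion
n = 10
trend = np.arange(n)
promo = np.array([0, 1, 0, 2, 0, 1, 3, 0, 2, 1])
toy = pd.concat([
    pd.DataFrame({"family": "BOOKS", "trend": trend, "onpromotion": promo,
                  "sales": 5.0 + 2.0 * trend + 3.0 * promo}),
    pd.DataFrame({"family": "DAIRY", "trend": trend, "onpromotion": promo,
                  "sales": 50.0 + 0.5 * trend + 10.0 * promo}),
], ignore_index=True)

def fit_family(group):
    """Ordinary least squares of sales on [1, trend, onpromotion]."""
    X = np.column_stack([np.ones(len(group)), group["trend"], group["onpromotion"]])
    coef, *_ = np.linalg.lstsq(X, group["sales"].to_numpy(), rcond=None)
    return coef   # [intercept, trend slope, promotion effect]

# One independent model per product family
models = {family: fit_family(group) for family, group in toy.groupby("family")}

# Forecast the next day for BOOKS with 2 items on promotion
next_x = np.array([1.0, n, 2.0])
books_forecast = float(next_x @ models["BOOKS"])

print(models, books_forecast)
```

Because the synthetic data contains no noise, the fitted coefficients recover the values used to generate it; on real sales the fit would only approximate them, and seasonal or cyclic terms would be added as further columns.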

## 6. Proceed forecasting with the best model

## 7. Generate csv file
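
A minimal sketch of writing a submission file, assuming the competition expects the two columns `id` and `sales`. The ids and predictions below are made-up stand-ins for the test set and the model output.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Stand-ins for the test ids and the model output (values are made up)
test_ids = pd.Series([3000888, 3000889, 3000890], name="id")
predictions = np.array([4.2, -0.3, 120.0])

submission = pd.DataFrame({
    "id": test_ids,
    # Sales cannot be negative, and log(1 + x) is undefined for x <= -1,
    # so negative predictions are clipped to 0
    "sales": np.clip(predictions, 0, None),
})

# Written to a temporary folder here; in the notebook the path could simply
# be something like "submission.csv"
out_path = os.path.join(tempfile.mkdtemp(), "submission.csv")
submission.to_csv(out_path, index=False)   # index=False keeps only id,sales

print(pd.read_csv(out_path))
```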