# Intro
Welcome to the [Store Sales - Time Series Forecasting](https://www.kaggle.com/c/store-sales-time-series-forecasting/data) competition.
![](https://storage.googleapis.com/kaggle-competitions/kaggle/29781/logos/header.png)

In this competition, we have to predict sales for the thousands of product families sold at Favorita stores located in [Ecuador](https://en.wikipedia.org/wiki/Ecuador).

<span style="color: royalblue;">Please vote the notebook up if it helps you. Feel free to leave a comment above the notebook. Thank you. </span>

# Libraries

In [1]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn import preprocessing
from xgboost import XGBRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_log_error
from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings("ignore")

# Path

In [2]:
path = './'
os.listdir(path)

['storesales-ts-starter.ipynb',
 'test.csv',
 '__init__.py',
 'README.md',
 'storesales-ts-starter_original.ipynb',
 '.ipynb_checkpoints',
 'train_sampled.csv']

# Load Data

In [5]:
train_data = pd.read_csv(path+'train.csv', index_col=0)
test_data = pd.read_csv(path+'test.csv', index_col=0)


# Overview

In [6]:
print('Number of train samples: ', len(train_data.index))
print('Number of test samples: ', len(test_data.index))
print('Number of features: ', len(train_data.columns))

Number of train samples:  3000888
Number of test samples:  28512
Number of features:  5


In [7]:
train_data.head()

| id | date | store_nbr | family | sales | onpromotion |
|---|---|---|---|---|---|
| 0 | 2013-01-01 | 1 | AUTOMOTIVE | 0.0 | 0 |
| 1 | 2013-01-01 | 1 | BABY CARE | 0.0 | 0 |
| 2 | 2013-01-01 | 1 | BEAUTY | 0.0 | 0 |
| 3 | 2013-01-01 | 1 | BEVERAGES | 0.0 | 0 |
| 4 | 2013-01-01 | 1 | BOOKS | 0.0 | 0 |


In [8]:
test_data.head()

| id | date | store_nbr | family | onpromotion |
|---|---|---|---|---|
| 3000888 | 2017-08-16 | 1 | AUTOMOTIVE | 0 |
| 3000889 | 2017-08-16 | 1 | BABY CARE | 0 |
| 3000890 | 2017-08-16 | 1 | BEAUTY | 2 |
| 3000891 | 2017-08-16 | 1 | BEVERAGES | 20 |
| 3000892 | 2017-08-16 | 1 | BOOKS | 0 |


# Exploratory Data Analysis

## Feature family
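The `family` feature has 33 categorical values, which we have to encode later. The values are evenly distributed: each family appears in 90,936 rows, and 33 × 90,936 = 3,000,888, which is the full training set.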

In [9]:
train_data['family'].value_counts()[0:3]

AUTOMOTIVE                    90936
HOME APPLIANCES               90936
SCHOOL AND OFFICE SUPPLIES    90936
Name: family, dtype: int64
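
The balance claim can be checked programmatically instead of by reading the first few counts. The following is a minimal sketch on a toy frame (`toy` is a hypothetical stand-in for `train_data`, not the competition data): if every category has the same row count, `value_counts()` contains a single distinct value.

```python
import pandas as pd

# Toy stand-in for train_data (hypothetical values, not the competition data).
toy = pd.DataFrame({
    "family": ["AUTOMOTIVE", "BEAUTY", "BOOKS"] * 4,
    "sales": range(12),
})

counts = toy["family"].value_counts()
n_families = counts.size             # number of distinct categories
is_balanced = counts.nunique() == 1  # True when all categories have equal row counts

print(n_families, is_balanced)
```

Applied to the real data, the same two lines would use `train_data['family']` in place of `toy["family"]`.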

# Feature Engineering

In [15]:
features = ['store_nbr', 'family', 'onpromotion']
target = 'sales'

## Create Feature Weekday
Based on the `date` feature, we can create the features weekday, month, and year.

In [16]:
def extract_weekday(s):
    return s.dayofweek

def extract_month(s):
    return s.month

def extract_year(s):
    return s.year

In [17]:
train_data['date'] = pd.to_datetime(train_data['date'])
train_data['weekday'] = train_data['date'].apply(extract_weekday)
train_data['year'] = train_data['date'].apply(extract_year)
train_data['month'] = train_data['date'].apply(extract_month)

test_data['date'] = pd.to_datetime(test_data['date'])
test_data['weekday'] = test_data['date'].apply(extract_weekday)
test_data['year'] = test_data['date'].apply(extract_year)
test_data['month'] = test_data['date'].apply(extract_month)
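
As an alternative to calling a Python function per row with `apply`, pandas exposes the same attributes through the vectorized `.dt` accessor, which is usually much faster on millions of rows. This is a small sketch on a toy frame (`toy` is hypothetical); it should produce the same columns as the cell above.

```python
import pandas as pd

# Toy frame with two dates taken from the head() outputs above.
toy = pd.DataFrame({"date": ["2013-01-01", "2017-08-16"]})
toy["date"] = pd.to_datetime(toy["date"])

# The .dt accessor works on the whole column at once.
toy["weekday"] = toy["date"].dt.dayofweek  # Monday=0 ... Sunday=6
toy["year"] = toy["date"].dt.year
toy["month"] = toy["date"].dt.month

print(toy)
```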

In [18]:
features.append('weekday')
features.append('year')
features.append('month')

## Encode Categorical Labels

In [19]:
enc = preprocessing.LabelEncoder()
enc.fit(train_data['family'])

LabelEncoder()

In [20]:
train_data['family'] = enc.transform(train_data['family'])
test_data['family'] = enc.transform(test_data['family'])
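
To make the encoding step concrete, here is a short sketch of what `LabelEncoder` does, using a few family names as toy input (`enc_demo` is a hypothetical name, separate from the `enc` fitted above). The encoder sorts the unique labels and assigns each its position as an integer code, so the codes follow alphabetical order rather than any property of the product families.

```python
from sklearn import preprocessing

enc_demo = preprocessing.LabelEncoder()
enc_demo.fit(["BEAUTY", "AUTOMOTIVE", "BOOKS", "AUTOMOTIVE"])

classes = list(enc_demo.classes_)                             # sorted unique labels
codes = enc_demo.transform(["BOOKS", "AUTOMOTIVE"]).tolist()  # label -> integer code
decoded = enc_demo.inverse_transform([1]).tolist()            # integer code -> label

print(classes, codes, decoded)
```

Note that `transform` raises an error for labels that were not seen during `fit`, so the encoder has to be fitted on data that covers every family in the test set.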

# Define Train, Val And Test Data

In [21]:
X_train = train_data[features]
y_train = train_data[target]
X_test = test_data[features]

In [22]:
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size = 0.33, random_state=2021)
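
The random split above mixes earlier and later dates in both sets. For a forecasting task, a chronological hold-out is a common alternative, because it validates on dates that come after the training period. The sketch below is not what this notebook does; it illustrates the idea on a toy frame, and `toy` and `cutoff` are hypothetical.

```python
import pandas as pd

# Toy stand-in for train_data with a parsed date column (hypothetical values).
toy = pd.DataFrame({
    "date": pd.to_datetime(["2017-08-01", "2017-08-02", "2017-08-03", "2017-08-04"] * 2),
    "store_nbr": [1] * 4 + [2] * 4,
    "sales": [10.0, 12.0, 11.0, 13.0, 5.0, 6.0, 7.0, 8.0],
})

cutoff = pd.Timestamp("2017-08-03")  # hypothetical cutoff date

# Everything before the cutoff is used for training, the rest for validation.
train_part = toy[toy["date"] < cutoff]
val_part = toy[toy["date"] >= cutoff]

print(len(train_part), len(val_part))
```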

# Simple Model
First we start with a simple model based on the features in the train and test data.

XGB Regression:

In [23]:
model = XGBRegressor(objective='reg:squaredlogerror', n_estimators=200)
model.fit(X_train, y_train)
y_val_pred = model.predict(X_val)
y_val_pred = np.where(y_val_pred<0, 0, y_val_pred)
print('Root Mean Squared Logarithmic Error:', np.sqrt(mean_squared_log_error(y_val, y_val_pred)))

Root Mean Squared Logarithmic Error: 1.5719910314401742


Linear Regression:

In [24]:
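# The `normalize` argument was removed in scikit-learn 1.2; for ordinary least squares
# the predictions do not depend on it, so it is dropped here.
reg = LinearRegression().fit(X_train, y_train)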
y_val_pred = reg.predict(X_val)
y_val_pred = np.where(y_val_pred<0, 0, y_val_pred)
print('Root Mean Squared Logarithmic Error:', np.sqrt(mean_squared_log_error(y_val, y_val_pred)))

Root Mean Squared Logarithmic Error: 3.6752759148942804
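
For reference, the metric printed in both cells can be written out by hand: take `log(1 + x)` of predictions and targets, compute the mean squared difference, and take the square root. The sketch below uses made-up values (`y_true_demo`, `y_pred_demo`, and the helper `rmsle` are hypothetical) and compares the manual result with scikit-learn's `mean_squared_log_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error for non-negative inputs."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) = log(1 + x), so zero sales are handled without special cases.
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Made-up values for illustration only.
y_true_demo = [0.0, 3.0, 10.0]
y_pred_demo = [0.0, 2.0, 12.0]

manual = rmsle(y_true_demo, y_pred_demo)
reference = float(np.sqrt(mean_squared_log_error(y_true_demo, y_pred_demo)))

print(manual, reference)
```

Because the logarithm is undefined for values at or below -1 and the metric rejects negative inputs, the cells above clip negative predictions to 0 before scoring.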
