# Intro
Welcome to the [Store Sales - Time Series Forecasting](https://www.kaggle.com/c/store-sales-time-series-forecasting/data) competition.
![](https://storage.googleapis.com/kaggle-competitions/kaggle/29781/logos/header.png)

In this competition, we have to predict sales for the thousands of product families sold at Favorita stores located in [Ecuador](https://en.wikipedia.org/wiki/Ecuador).

<span style="color: royalblue;">Please upvote the notebook if it helps you, and feel free to leave a comment. Thank you.</span>

# Libraries

In [1]:
import os
import sys

module_path = os.path.abspath(os.path.join('..'))
sys.path.append(module_path)

from layersdk import Layer, dataset, model

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn import preprocessing
from xgboost import XGBRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_log_error
from sklearn.model_selection import train_test_split


import warnings
warnings.filterwarnings("ignore")

# Path

In [2]:
path = './'
os.listdir(path)

['storesales-ts-starter.ipynb',
 'test.csv',
 '__init__.py',
 'README.md',
 'storesales-ts-starter_original.ipynb',
 '.ipynb_checkpoints',
 'train_sampled.csv']

# Load Data

In [3]:
layer = Layer("store-sales")

In [15]:
@dataset("train")
def build_train_data():
    return pd.read_csv(path+'train_sampled.csv', index_col=0)

@dataset("test")
def build_test_data():
    return pd.read_csv(path+'test.csv', index_col=0)

# layer.run([build_train_data, build_test_data])
# train_data = layer.get_dataset("train").to_pandas()
# test_data = layer.get_dataset("test").to_pandas()

build_train_data()

* Building train...


id,date,store_nbr,family,sales,onpromotion
2890663,2017-06-15,16,POULTRY,48.267002,0
590393,2013-11-28,24,MAGAZINES,0.000000,0
663785,2014-01-09,33,MAGAZINES,0.000000,0
2772047,2017-04-09,38,HARDWARE,0.000000,0
817057,2014-04-05,34,EGGS,51.000000,0
...,...,...,...,...,...
177723,2013-04-10,45,HOME CARE,0.000000,0
923059,2014-06-03,9,HOME AND KITCHEN II,0.000000,0
1811584,2015-10-16,39,HOME AND KITCHEN II,29.000000,0
2204880,2016-05-25,24,HOME CARE,240.000000,0
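
The loaders above read `date` as a plain string, and the notebook converts it later with `pd.to_datetime`. As an alternative, the column can be parsed while reading. The sketch below is only illustrative: it uses a small in-memory CSV (three rows copied from the output above) instead of `train_sampled.csv`, and the names `csv_text` and `sample` are made up for the example.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for train_sampled.csv (rows copied from the output above).
csv_text = """id,date,store_nbr,family,sales,onpromotion
2890663,2017-06-15,16,POULTRY,48.267002,0
590393,2013-11-28,24,MAGAZINES,0.0,0
817057,2014-04-05,34,EGGS,51.0,0
"""

# index_col=0 uses the id column as the index, as in the notebook;
# parse_dates converts the date column to datetime64 while reading.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=["date"])
print(sample.dtypes)
```

With the dates parsed at load time, the later `pd.to_datetime` call becomes a no-op rather than a required step.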


# Overview

In [5]:


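# Materialize the datasets defined above; calling the decorated builders returns DataFrames (as in In [15])
train_data = build_train_data()
test_data = build_test_data()
print('Number of train samples: ', len(train_data.index))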
print('Number of test samples: ', len(test_data.index))
print('Number of features: ', len(train_data.columns))

Number of train samples:  50000
Number of test samples:  28512
Number of features:  5


In [6]:
train_data.head()

id,date,store_nbr,family,sales,onpromotion
2890663,2017-06-15,16,POULTRY,48.267002,0
590393,2013-11-28,24,MAGAZINES,0.0,0
663785,2014-01-09,33,MAGAZINES,0.0,0
2772047,2017-04-09,38,HARDWARE,0.0,0
817057,2014-04-05,34,EGGS,51.0,0


In [7]:
test_data.head()

id,date,store_nbr,family,onpromotion
3000888,2017-08-16,1,AUTOMOTIVE,0
3000889,2017-08-16,1,BABY CARE,0
3000890,2017-08-16,1,BEAUTY,2
3000891,2017-08-16,1,BEVERAGES,20
3000892,2017-08-16,1,BOOKS,0


# Exploratory Data Analysis

## Feature family
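The `family` feature has 33 categorical values, which we encode later. In the 50,000-row sample the values are roughly evenly distributed: an even split would give about 1,515 rows per family, and the three most frequent families have between 1,553 and 1,615 rows.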

In [8]:
train_data['family'].value_counts()[0:3]

HOME AND KITCHEN II    1615
PERSONAL CARE          1565
FROZEN FOODS           1553
Name: family, dtype: int64
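
One way to put a number on "evenly distributed" is to compare each category's share of rows with the share it would have under a perfectly even split. The sketch below uses a toy series in place of `train_data['family']`; the values and the variable names are illustrative assumptions, not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for train_data['family']; the values are illustrative only.
family = pd.Series(
    ["EGGS", "EGGS", "POULTRY", "POULTRY", "MAGAZINES", "MAGAZINES"],
    name="family",
)

n_categories = family.nunique()
shares = family.value_counts(normalize=True)  # fraction of rows per category
uniform_share = 1 / n_categories              # share under a perfectly even split

# Largest absolute deviation from the uniform share across all categories.
max_deviation = (shares - uniform_share).abs().max()
print(n_categories, max_deviation)
```

Run against the real column, a small `max_deviation` relative to `uniform_share` would support the claim that the families are balanced.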

# Feature Engineering

In [9]:
features = ['store_nbr', 'family', 'onpromotion']
target = 'sales'

## Create Feature Weekday
From the `date` feature we can derive three new features: `weekday`, `month` and `year`.

In [10]:
def extract_weekday(s):
    return s.dayofweek

def extract_month(s):
    return s.month

def extract_year(s):
    return s.year
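
The helpers above are applied row by row with `Series.apply`. The same features can also be read directly from a datetime column through the pandas `.dt` accessor, which works on the whole column at once. This is a small sketch on three dates taken from the training sample; the frame name `dates` is made up for the example.

```python
import pandas as pd

# Three dates copied from the training sample shown earlier.
dates = pd.DataFrame(
    {"date": pd.to_datetime(["2017-06-15", "2013-11-28", "2014-04-05"])}
)

# Vectorized equivalents of extract_weekday / extract_year / extract_month.
dates["weekday"] = dates["date"].dt.dayofweek  # Monday=0 ... Sunday=6
dates["year"] = dates["date"].dt.year
dates["month"] = dates["date"].dt.month
print(dates)
```

Both approaches produce the same values; the accessor version simply avoids a Python-level function call per row.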

In [11]:
enc = preprocessing.LabelEncoder()
enc.fit(train_data['family'])

@dataset("train_features")
def build_train_features():
    train_data = layer.get_dataset("train").to_pandas()
    print(train_data.head())
    train_data['date'] = pd.to_datetime(train_data['date'])
    train_data['weekday'] = train_data['date'].apply(extract_weekday)
    train_data['year'] = train_data['date'].apply(extract_year)
    train_data['month'] = train_data['date'].apply(extract_month)
    train_data['family'] = enc.transform(train_data['family'])
    return train_data

@dataset("test_features")
def build_test_features():
    test_data = layer.get_dataset("test").to_pandas()
    test_data['date'] = pd.to_datetime(test_data['date'])
    test_data['weekday'] = test_data['date'].apply(extract_weekday)
    test_data['year'] = test_data['date'].apply(extract_year)
    test_data['month'] = test_data['date'].apply(extract_month)
    test_data['family'] = enc.transform(test_data['family'])
    return test_data
    
layer.run([build_test_features, build_train_features])
# build_test_features().head()

--- Layer Infra: Running Project: store-sales ---
* Building test_features...
* Building train_features...
               date  store_nbr     family      sales  onpromotion
id                                                               
2890663  2017-06-15         16    POULTRY  48.267002            0
590393   2013-11-28         24  MAGAZINES   0.000000            0
663785   2014-01-09         33  MAGAZINES   0.000000            0
2772047  2017-04-09         38   HARDWARE   0.000000            0
817057   2014-04-05         34       EGGS  51.000000            0
--- Layer Infra: Run Complete! ---
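
The encoder is fitted on the sampled training data only, and `LabelEncoder.transform` raises a `ValueError` for any label it did not see during `fit`. If the sample happened to miss a family that appears in the test set, `build_test_features` would fail. A minimal sketch of checking for that up front, using toy series (the family `"BOOKS"` is deliberately left out of the toy training data, and `enc_demo` is a hypothetical name to avoid clashing with `enc`):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

train_family = pd.Series(["EGGS", "POULTRY", "MAGAZINES", "EGGS"])
test_family = pd.Series(["POULTRY", "BOOKS"])  # "BOOKS" is absent from train

enc_demo = LabelEncoder().fit(train_family)

# Categories present in test but unseen during fit.
unseen = sorted(set(test_family) - set(enc_demo.classes_))
print("unseen categories:", unseen)

# Encode only the rows whose category the encoder knows.
known_mask = test_family.isin(enc_demo.classes_)
encoded_known = enc_demo.transform(test_family[known_mask])
print(encoded_known)
```

If `unseen` is non-empty on the real data, one option is to fit the encoder on the union of the train and test categories instead.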


In [12]:
features.append('weekday')
features.append('year')
features.append('month')

# Simple Model
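We start with simple models that use only the features available in both the train and test data.

The cell below trains an XGBoost regressor and a linear regression baseline, and scores both on a random 33% validation split: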

In [16]:
@model("model")
def train():
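    # Train on the training features; the test features have no 'sales' column
    train_data = layer.get_dataset("train_features").to_pandas()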
    test_data = layer.get_dataset("test_features").to_pandas()
    X_train = train_data[features]
    y_train = train_data[target]
    X_test = test_data[features]
    X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size = 0.33, random_state=2021)

    # XGB Regression
    model = XGBRegressor(objective='reg:squaredlogerror', n_estimators=200)
    model.fit(X_train, y_train)
    y_val_pred = model.predict(X_val)
    y_val_pred = np.where(y_val_pred<0, 0, y_val_pred)
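    # Logged under '*_rmse', but this value (and the linear one below) is RMSLE:
    # the square root of the mean squared logarithmic error
    layer.log_metric('XGB_rmse', np.sqrt(mean_squared_log_error(y_val, y_val_pred)))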
    
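    # Linear Regression (the `normalize` argument was removed in scikit-learn 1.2;
    # it does not change ordinary least squares predictions, so it is dropped here)
    reg = LinearRegression().fit(X_train, y_train)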
    y_val_pred = reg.predict(X_val)
    y_val_pred = np.where(y_val_pred<0, 0, y_val_pred)
    layer.log_metric('Linear_rmse', np.sqrt(mean_squared_log_error(y_val, y_val_pred)))

# layer.run(train)
train()

* Training model...
	model > Metric >XGB_rmse:1.7053408067392832
	model > Metric >Linear_rmse:3.6693205218294525
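
Both logged values are root mean squared logarithmic errors (RMSLE), since the code takes the square root of `mean_squared_log_error`:

$$\mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(\log(1+\hat{y}_i) - \log(1+y_i)\bigr)^2}$$

The following NumPy sketch computes the same quantity by hand, including the clipping of negative predictions that the training cell does with `np.where`. The function name `rmsle` is an assumption made for this example and is not part of the notebook.

```python
import numpy as np


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error; negative predictions are clipped to 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), 0, None)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))


# A perfect prediction gives an error of zero.
print(rmsle([0.0, 51.0], [0.0, 51.0]))
```

Because the error is measured on a log scale, an RMSLE of about 1.7 means predictions are typically off by a multiplicative factor, not by 1.7 units of sales.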
