# Intro
Hi Kagglers! I was inspired to write this article by the outstanding work of <a href='https://www.kaggle.com/code/kelde9/darts-ensemble-stores-sales-forecasting'>Tom Keldenich</a> and <a href="https://www.kaggle.com/code/ferdinandberr/darts-forecasting-deep-learning-global-models">Ferdinand Berr</a>. I picked up interesting ideas from their notebooks, and developing them further gave a good score: 0.3804.

The key difference lies in how the model predictions are combined. While the earlier work used blending, I opted for a different approach: stacking. I also used a different type of encoding and different hyperparameters for the models.

I will focus on a detailed description of the approach itself. If you need more information about EDA and data preparation, I suggest reading the works of the other authors, who described these steps comprehensively, so there is no need to dwell on them again.

But first things first. Before using Darts with stacking, I tried other techniques, and I'm happy to share the results. Perhaps this will save you time by helping you avoid going in the wrong direction.

<b>Here are some of their scores</b>:
- Prophet, 1782 time series, with exogenous seasonality by year and covariates: 0.59676
- SARIMAX, 1782 time series, with exogenous seasonality by year: 0.46409
- Same as the previous, plus a log1p transformation of the target variable: 0.49116
- Hybrid model, with SARIMAX predictions on the main data and LightGBM trained on the residuals: 0.48831
- CatBoost, 54 time series grouped by family: 0.53593
- LightGBM, 54 time series grouped by family: 0.47366
- Same as the previous, plus a log1p transformation of the target variable: 0.42178

Darts:
- LightGBM + One hot + Blending 4 lags: 0.38127
- LightGBM + Ordinal: 0.3839
- LightGBM + Ordinal + without Scaler: 0.38697
- LightGBM + Ordinal + Blending 4 lags: 0.38107
- LightGBM + Ordinal + Blending 6 lags: 0.38055
- LightGBM + Ordinal + Blending 6 lags + Robust Scaler: 0.39995

## Setup

In [1]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/store-sales-time-series-forecasting/oil.csv
/kaggle/input/store-sales-time-series-forecasting/sample_submission.csv
/kaggle/input/store-sales-time-series-forecasting/holidays_events.csv
/kaggle/input/store-sales-time-series-forecasting/stores.csv
/kaggle/input/store-sales-time-series-forecasting/train.csv
/kaggle/input/store-sales-time-series-forecasting/test.csv
/kaggle/input/store-sales-time-series-forecasting/transactions.csv


In [2]:
!pip install darts &> /dev/null

In [3]:
import darts
print(darts.__version__)

0.27.1


In [4]:
import warnings
warnings.filterwarnings('ignore')

# Import Data
Our objective is to forecast 16 days of sales for 54 stores in Ecuador, based on data from January 1, 2013 to August 15, 2017.

### train

In [5]:
import pandas as pd
import numpy as np

In [6]:
df_train = pd.read_csv('/kaggle/input/store-sales-time-series-forecasting/train.csv')
display(df_train.head())

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
0,0,2013-01-01,1,AUTOMOTIVE,0.0,0
1,1,2013-01-01,1,BABY CARE,0.0,0
2,2,2013-01-01,1,BEAUTY,0.0,0
3,3,2013-01-01,1,BEVERAGES,0.0,0
4,4,2013-01-01,1,BOOKS,0.0,0


Note that the training data contains no sales records for December 25 of any year, so we will need to fill in the missing dates:

In [7]:
df_train.iloc[np.where(df_train['date'].str.contains('12-25'))]

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
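
To make the gap concrete, here is a minimal sketch on made-up numbers (the toy frame and the zero fill are my assumptions for illustration; the notebook itself later relies on Darts with `fill_missing_dates=True` and a `MissingValuesFiller`). Reindexing to a strict daily frequency turns the missing day into an explicit NaN row, which can then be filled:

```python
import pandas as pd

# Toy daily sales with 25 December missing, mimicking the gap in train.csv.
toy = pd.DataFrame(
    {
        "date": ["2013-12-23", "2013-12-24", "2013-12-26", "2013-12-27"],
        "sales": [5.0, 7.0, 4.0, 6.0],
    }
)
toy["date"] = pd.to_datetime(toy["date"])

# Reindex to a strict daily frequency: the missing day shows up as NaN.
daily = toy.set_index("date").asfreq("D")

# One possible fill strategy (an assumption for this sketch): treat the day as zero sales.
filled = daily.fillna(0.0)

print(daily)
print(filled)
```

Whether zero or an interpolated value is the better fill depends on the model; the point here is only that the date has to exist in the index before anything can be filled.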


### holidays_events

In [8]:
df_holidays_events = pd.read_csv('/kaggle/input/store-sales-time-series-forecasting/holidays_events.csv')
display(df_holidays_events.head())

Unnamed: 0,date,type,locale,locale_name,description,transferred
0,2012-03-02,Holiday,Local,Manta,Fundacion de Manta,False
1,2012-04-01,Holiday,Regional,Cotopaxi,Provincializacion de Cotopaxi,False
2,2012-04-12,Holiday,Local,Cuenca,Fundacion de Cuenca,False
3,2012-04-14,Holiday,Local,Libertad,Cantonizacion de Libertad,False
4,2012-04-21,Holiday,Local,Riobamba,Cantonizacion de Riobamba,False


In [9]:
df_holidays_events.groupby(['type', 'locale'])['locale_name'].count()

type        locale  
Additional  Local        11
            National     40
Bridge      National      5
Event       National     56
Holiday     Local       137
            National     60
            Regional     24
Transfer    Local         4
            National      8
Work Day    National      5
Name: locale_name, dtype: int64

### oil

In [10]:
df_oil = pd.read_csv('/kaggle/input/store-sales-time-series-forecasting/oil.csv')
display(df_oil.head())

Unnamed: 0,date,dcoilwtico
0,2013-01-01,
1,2013-01-02,93.14
2,2013-01-03,92.97
3,2013-01-04,93.12
4,2013-01-07,93.2


### store

In [11]:
df_stores = pd.read_csv('/kaggle/input/store-sales-time-series-forecasting/stores.csv')
display(df_stores.head())

Unnamed: 0,store_nbr,city,state,type,cluster
0,1,Quito,Pichincha,D,13
1,2,Quito,Pichincha,D,13
2,3,Quito,Pichincha,D,8
3,4,Quito,Pichincha,D,9
4,5,Santo Domingo,Santo Domingo de los Tsachilas,D,4


### transactions

In [12]:
df_transactions = pd.read_csv('/kaggle/input/store-sales-time-series-forecasting/transactions.csv')
display(df_transactions.head())

Unnamed: 0,date,store_nbr,transactions
0,2013-01-01,25,770
1,2013-01-02,1,2111
2,2013-01-02,2,2358
3,2013-01-02,3,3487
4,2013-01-02,4,1922


Note: a transaction is a receipt created after a customer’s purchase.

### test and sample_submission

In [13]:
df_test = pd.read_csv('/kaggle/input/store-sales-time-series-forecasting/test.csv')
df_sample_submission = pd.read_csv('/kaggle/input/store-sales-time-series-forecasting/sample_submission.csv')
display(df_test.head())
display(df_sample_submission.head())

Unnamed: 0,id,date,store_nbr,family,onpromotion
0,3000888,2017-08-16,1,AUTOMOTIVE,0
1,3000889,2017-08-16,1,BABY CARE,0
2,3000890,2017-08-16,1,BEAUTY,2
3,3000891,2017-08-16,1,BEVERAGES,20
4,3000892,2017-08-16,1,BOOKS,0


Unnamed: 0,id,sales
0,3000888,0.0
1,3000889,0.0
2,3000890,0.0
3,3000891,0.0
4,3000892,0.0


# Preprocessing

In [14]:
family_list = df_train['family'].unique()
store_list = df_stores['store_nbr'].unique()
display(family_list)
display(store_list)

array(['AUTOMOTIVE', 'BABY CARE', 'BEAUTY', 'BEVERAGES', 'BOOKS',
       'BREAD/BAKERY', 'CELEBRATION', 'CLEANING', 'DAIRY', 'DELI', 'EGGS',
       'FROZEN FOODS', 'GROCERY I', 'GROCERY II', 'HARDWARE',
       'HOME AND KITCHEN I', 'HOME AND KITCHEN II', 'HOME APPLIANCES',
       'HOME CARE', 'LADIESWEAR', 'LAWN AND GARDEN', 'LINGERIE',
       'LIQUOR,WINE,BEER', 'MAGAZINES', 'MEATS', 'PERSONAL CARE',
       'PET SUPPLIES', 'PLAYERS AND ELECTRONICS', 'POULTRY',
       'PREPARED FOODS', 'PRODUCE', 'SCHOOL AND OFFICE SUPPLIES',
       'SEAFOOD'], dtype=object)

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
       35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51,
       52, 53, 54])

Let's merge the training data with the store metadata, so that every row carries the store attributes (city, state, type, cluster) that we will use as static covariates.

In [15]:
train_merged = pd.merge(df_train, df_stores, on ='store_nbr')
train_merged = train_merged.sort_values(["store_nbr","family","date"])
train_merged = train_merged.astype({"store_nbr":'str', "family":'str', "city":'str',
                          "state":'str', "type":'str', "cluster":'str'})

display(train_merged.head())

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,city,state,type,cluster
0,0,2013-01-01,1,AUTOMOTIVE,0.0,0,Quito,Pichincha,D,13
33,1782,2013-01-02,1,AUTOMOTIVE,2.0,0,Quito,Pichincha,D,13
66,3564,2013-01-03,1,AUTOMOTIVE,3.0,0,Quito,Pichincha,D,13
99,5346,2013-01-04,1,AUTOMOTIVE,3.0,0,Quito,Pichincha,D,13
132,7128,2013-01-05,1,AUTOMOTIVE,5.0,0,Quito,Pichincha,D,13


## TimeSeries

In [16]:
from darts import TimeSeries
from tqdm import tqdm

In [17]:
family_TS_dict = {}

for family in tqdm(family_list):
    df_family = train_merged.loc[train_merged['family'] == family]

    list_of_TS_family = TimeSeries.from_group_dataframe(
                                df_family,
                                time_col="date",
                                group_cols=["store_nbr","family"], # columns for grouping time series
                                static_cols=["city","state","type","cluster"], # static covariates
                                value_cols="sales", # target
                                fill_missing_dates=True, # filling missing dates, remember Dec 25th
                                freq='D' # days
                                )
    # astype returns a new TimeSeries, so the cast result has to be kept
    list_of_TS_family = [ts.astype(np.float32) for ts in list_of_TS_family]

    list_of_TS_family = sorted(list_of_TS_family, key=lambda ts: int(ts.static_covariates_values()[0,0]))
    family_TS_dict[family] = list_of_TS_family

100%|██████████| 33/33 [00:44<00:00,  1.35s/it]


Let's talk more about TimeSeries, the main object of the Darts library. Understanding its specifics will make the subsequent code much easier to follow. We have created a dictionary that maps each product family to a list of 54 univariate TimeSeries, one per store.

Dictionary structure:

In [18]:
family_TS_dict.keys()

dict_keys(['AUTOMOTIVE', 'BABY CARE', 'BEAUTY', 'BEVERAGES', 'BOOKS', 'BREAD/BAKERY', 'CELEBRATION', 'CLEANING', 'DAIRY', 'DELI', 'EGGS', 'FROZEN FOODS', 'GROCERY I', 'GROCERY II', 'HARDWARE', 'HOME AND KITCHEN I', 'HOME AND KITCHEN II', 'HOME APPLIANCES', 'HOME CARE', 'LADIESWEAR', 'LAWN AND GARDEN', 'LINGERIE', 'LIQUOR,WINE,BEER', 'MAGAZINES', 'MEATS', 'PERSONAL CARE', 'PET SUPPLIES', 'PLAYERS AND ELECTRONICS', 'POULTRY', 'PREPARED FOODS', 'PRODUCE', 'SCHOOL AND OFFICE SUPPLIES', 'SEAFOOD'])

The time series for family 'AUTOMOTIVE' and store_nbr 1 has two dimensions worth noting:

- date: the range of days covered by the time series.
- component: the data from the columns passed in the `value_cols` parameter. In our case, the number of daily sales.

In [19]:
family_TS_dict['AUTOMOTIVE'][0]

Static covariates, i.e. values that are constant for all observations in the group:

In [20]:
family_TS_dict['AUTOMOTIVE'][0].static_covariates

component,store_nbr,family,city,state,type,cluster
sales,1,AUTOMOTIVE,Quito,Pichincha,D,13


We can select specific dates with slicing:

In [21]:
family_TS_dict['AUTOMOTIVE'][0][:10]

### Normalizing time series
Let's create a pipeline to automate these steps.
We will use the following data preprocessing:
- Filling missing values.
- Encoding of static covariates. In my example, this is ordinal encoding. You can use any other encoder from sklearn, as long as it provides the methods fit(), transform() and inverse_transform().
- Target variable transformation. log1p significantly improves prediction accuracy.
- Scaling. The default is a min-max scaler.
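
A small sketch of the last two steps in plain NumPy may help build intuition. This is not the Darts pipeline itself, only the arithmetic I assume it performs: log1p followed by min-max scaling, and the inverse applied in reverse order when transforming forecasts back. The sales values are made up.

```python
import numpy as np

# Made-up daily sales with a heavy right tail.
sales = np.array([0.0, 3.0, 120.0, 4500.0])

# Forward: compress the tail with log1p, then min-max scale to [0, 1].
logged = np.log1p(sales)
lo, hi = logged.min(), logged.max()
scaled = (logged - lo) / (hi - lo)

# Backward: undo the steps in reverse order (unscale first, then expm1).
restored = np.expm1(scaled * (hi - lo) + lo)

print(scaled)
print(restored)
```

A plausible reason log1p helps is that the competition score is, as far as I know, a logarithmic error (RMSLE), so training in log space matches the way errors are counted.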

In [22]:
from darts.dataprocessing import Pipeline
from darts.dataprocessing.transformers import Scaler, StaticCovariatesTransformer, MissingValuesFiller, InvertibleMapper
from sklearn.preprocessing import OrdinalEncoder

In [23]:
family_pipeline_dict = {}
family_TS_transformed_dict = {}

for key in family_TS_dict:
    train_filler = MissingValuesFiller(verbose=False, n_jobs=-1, name="Fill NAs")
    static_cov_transformer = StaticCovariatesTransformer(verbose=False, transformer_cat = OrdinalEncoder(), name="Encoder")
    log_transformer = InvertibleMapper(np.log1p, np.expm1, verbose=False, n_jobs=-1, name="Log-Transform")   
    train_scaler = Scaler(verbose=False, n_jobs=-1, name="Scaling")

    train_pipeline = Pipeline([train_filler,
                             static_cov_transformer,
                             log_transformer,
                             train_scaler])

    training_transformed = train_pipeline.fit_transform(family_TS_dict[key])
    family_pipeline_dict[key] = train_pipeline
    family_TS_transformed_dict[key] = training_transformed

In [24]:
family_TS_transformed_dict['AUTOMOTIVE'][0][:10]

## Covariates

> **A covariate** is a variable that helps to predict a target variable.

A covariate can be tied to the target series, for example the type of store, `type`, where the sales are made. It can also be independent of it, for example the price of oil on the day a product is sold.

A covariate can be known in advance: in our dataset, for example, we have the price of oil from January 1, 2013 to August 31, 2017, which covers the forecast period. In this case, we speak of a **future covariate**.

There are also **past covariates**, which are not known in advance. In our dataset, for example, the transactions are only known from January 1, 2013 to August 15, 2017.
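
The practical consequence for past covariates can be checked with a few lines of date arithmetic. The sketch assumes that a lag of -k refers to the value k days before the day being predicted, which is my reading of how the lags are counted; the lag list is the one used for `lags_past_covariates` later in this notebook.

```python
import pandas as pd

last_known = pd.Timestamp("2017-08-15")           # last day with sales and transactions
horizon = 16                                      # number of days to forecast
past_lags = [-16, -17, -18, -19, -20, -21, -22]   # lags applied to the past covariates

forecast_days = pd.date_range(last_known + pd.Timedelta(days=1), periods=horizon, freq="D")

# For every forecast day, collect the dates whose covariate values would be needed.
needed = [day + pd.Timedelta(days=lag) for day in forecast_days for lag in past_lags]

latest_needed = max(needed)
print(latest_needed)
```

With a smallest lag of -16 and a 16-day horizon, the latest transaction value needed falls on 2017-08-15, the last day for which transactions exist. A lag closer to zero, such as -1, would ask for transactions that have not been observed yet.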

### Date

In [25]:
from darts.utils.timeseries_generation import datetime_attribute_timeseries

In [26]:
full_time_period = pd.date_range(start='2013-01-01', end='2017-08-31', freq='D')


year = datetime_attribute_timeseries(time_index = full_time_period, attribute="year")
month = datetime_attribute_timeseries(time_index = full_time_period, attribute="month")
day = datetime_attribute_timeseries(time_index = full_time_period, attribute="day")
dayofyear = datetime_attribute_timeseries(time_index = full_time_period, attribute="dayofyear")
weekday = datetime_attribute_timeseries(time_index = full_time_period, attribute="dayofweek")
weekofyear = datetime_attribute_timeseries(time_index = full_time_period, attribute="weekofyear")
timesteps = TimeSeries.from_times_and_values(times=full_time_period,
                                             values=np.arange(len(full_time_period)),
                                             columns=["linear_increase"])

time_cov = year.stack(month).stack(day).stack(dayofyear).stack(weekday).stack(weekofyear).stack(timesteps)
time_cov = time_cov.astype(np.float32)

In [27]:
print(time_cov.components.values)
display(time_cov[100])

['year' 'month' 'day' 'dayofyear' 'dayofweek' 'weekofyear'
 'linear_increase']

In [28]:
time_cov_scaler = Scaler(verbose=False, n_jobs=-1, name="Scaler")
time_cov_train, time_cov_val = time_cov.split_before(pd.Timestamp('20170816'))
time_cov_scaler.fit(time_cov_train)
time_cov_transformed = time_cov_scaler.transform(time_cov)
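
The cell above fits the scaler on the training period only and then applies it to the whole range. Here is a minimal sketch of what that means for the `linear_increase` component, using plain NumPy instead of the Darts `Scaler` (so this shows general min-max arithmetic, not the library internals):

```python
import numpy as np
import pandas as pd

days = pd.date_range("2013-01-01", "2017-08-31", freq="D")
linear_increase = np.arange(len(days), dtype=np.float32)

# Everything before the first forecast day counts as training data.
is_train = np.asarray(days < pd.Timestamp("2017-08-16"))

# Fit the min-max statistics on the training part only ...
lo = linear_increase[is_train].min()
hi = linear_increase[is_train].max()

# ... then apply them to the full range, forecast period included.
scaled = (linear_increase - lo) / (hi - lo)

print(scaled[is_train].max(), scaled.max())
```

Values in the forecast period end up slightly above 1. That is expected: the scaler never sees the future, so the statistics do not leak information from the test dates.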

### Oil

In [29]:
from darts.models.filtering.moving_average_filter import MovingAverageFilter

In [30]:
# Oil Price

oil = TimeSeries.from_dataframe(df_oil, 
                                time_col = 'date', 
                                value_cols = ['dcoilwtico'],
                                freq = 'D')

oil = oil.astype(np.float32)

# Transform
oil_filler = MissingValuesFiller(verbose=False, n_jobs=-1, name="Filler")
oil_scaler = Scaler(verbose=False, n_jobs=-1, name="Scaler")
oil_pipeline = Pipeline([oil_filler, oil_scaler])
oil_transformed = oil_pipeline.fit_transform(oil)

# Moving Averages for Oil Price
oil_moving_average_7 = MovingAverageFilter(window=7)
oil_moving_average_28 = MovingAverageFilter(window=28)

oil_moving_averages = []

ma_7 = oil_moving_average_7.filter(oil_transformed).astype(np.float32)
ma_7 = ma_7.with_columns_renamed(col_names=ma_7.components, col_names_new="oil_ma_7")
ma_28 = oil_moving_average_28.filter(oil_transformed).astype(np.float32)
ma_28 = ma_28.with_columns_renamed(col_names=ma_28.components, col_names_new="oil_ma_28")
oil_moving_averages = ma_7.stack(ma_28)

In [31]:
display(oil_moving_averages[100])

### Holidays

In [32]:
def holiday_list(df_stores):

    listofseries = []
    
    for i in range(0,len(df_stores)):        
            df_holiday_dummies = pd.DataFrame(columns=['date'])
            df_holiday_dummies["date"] = df_holidays_events["date"]
    
            df_holiday_dummies["national_holiday"] = np.where(((df_holidays_events["type"] == "Holiday") & (df_holidays_events["locale"] == "National")), 1, 0)

            df_holiday_dummies["earthquake_relief"] = np.where(df_holidays_events['description'].str.contains('Terremoto Manabi'), 1, 0)

            df_holiday_dummies["christmas"] = np.where(df_holidays_events['description'].str.contains('Navidad'), 1, 0)

            df_holiday_dummies["football_event"] = np.where(df_holidays_events['description'].str.contains('futbol'), 1, 0)

            df_holiday_dummies["national_event"] = np.where(((df_holidays_events["type"] == "Event") & (df_holidays_events["locale"] == "National") & (~df_holidays_events['description'].str.contains('Terremoto Manabi')) & (~df_holidays_events['description'].str.contains('futbol'))), 1, 0)

            df_holiday_dummies["work_day"] = np.where((df_holidays_events["type"] == "Work Day"), 1, 0)

            df_holiday_dummies["local_holiday"] = np.where(((df_holidays_events["type"] == "Holiday") & ((df_holidays_events["locale_name"] == df_stores['state'][i]) | (df_holidays_events["locale_name"] == df_stores['city'][i]))), 1, 0)
                     
            listofseries.append(df_holiday_dummies)

    return listofseries

In [33]:
def remove_0_and_duplicates(holiday_list):

    listofseries = []
    
    for i in range(0,len(holiday_list)):         
            df_holiday_per_store = holiday_list[i].set_index('date')

            df_holiday_per_store = df_holiday_per_store.loc[~(df_holiday_per_store==0).all(axis=1)]
            
            df_holiday_per_store = df_holiday_per_store.groupby('date').agg({'national_holiday':'max', 'earthquake_relief':'max', 
                                   'christmas':'max', 'football_event':'max', 
                                   'national_event':'max', 'work_day':'max', 
                                   'local_holiday':'max'}).reset_index()

            listofseries.append(df_holiday_per_store)

    return listofseries

In [34]:
def holiday_TS_list_54(holiday_list):
    listofseries = []
    
    for i in range(0,54):
            holidays_TS = TimeSeries.from_dataframe(holiday_list[i], 
                                        time_col = 'date',
                                        fill_missing_dates=True,
                                        fillna_value=0,
                                        freq='D')
            
            holidays_TS = holidays_TS.slice(pd.Timestamp('20130101'),pd.Timestamp('20170831'))
            holidays_TS = holidays_TS.astype(np.float32)
            listofseries.append(holidays_TS)

    return listofseries

In [35]:
list_of_holidays_per_store = holiday_list(df_stores)
list_of_holidays_per_store = remove_0_and_duplicates(list_of_holidays_per_store)   
list_of_holidays_store = holiday_TS_list_54(list_of_holidays_per_store)

holidays_filler = MissingValuesFiller(verbose=False, n_jobs=-1, name="Filler")
holidays_scaler = Scaler(verbose=False, n_jobs=-1, name="Scaler")

holidays_pipeline = Pipeline([holidays_filler, holidays_scaler])
holidays_transformed = holidays_pipeline.fit_transform(list_of_holidays_store)

In [36]:
display(len(holidays_transformed))
display(holidays_transformed[0].components.values)
display(holidays_transformed[0][100])

54

array(['national_holiday', 'earthquake_relief', 'christmas',
       'football_event', 'national_event', 'work_day', 'local_holiday'],
      dtype=object)
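
The helper functions above boil down to three steps: build 0/1 flags per event row, drop rows where no flag is set, and collapse duplicate dates by taking the maximum of each flag. Here is a toy sketch of that logic with only two of the flags; the event rows are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up events: two entries on the same date plus one row that sets no flag.
events = pd.DataFrame(
    {
        "date": ["2016-05-01", "2016-05-01", "2016-05-02"],
        "type": ["Holiday", "Event", "Work Day"],
        "locale": ["National", "National", "National"],
        "description": ["Dia del Trabajo", "Terremoto Manabi+15", "Recupero"],
    }
)

flags = pd.DataFrame({"date": events["date"]})
flags["national_holiday"] = np.where(
    (events["type"] == "Holiday") & (events["locale"] == "National"), 1, 0
)
flags["earthquake_relief"] = np.where(
    events["description"].str.contains("Terremoto Manabi"), 1, 0
)

# Drop rows where every flag is zero, then merge duplicate dates with max.
flags = flags.set_index("date")
flags = flags.loc[~(flags == 0).all(axis=1)]
daily = flags.groupby("date").max().reset_index()

print(daily)
```

Taking the maximum means a date is flagged as soon as any of its event rows sets the flag, so the two entries for 2016-05-01 merge into a single row with both flags on.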

### Promotion

In [37]:
from tqdm import tqdm

In [38]:
df_promotion = pd.concat([df_train, df_test], axis=0)
df_promotion = df_promotion.sort_values(["store_nbr","family","date"])
df_promotion.tail()

family_promotion_dict = {}

for family in tqdm(family_list):
    df_family = df_promotion.loc[df_promotion['family'] == family]

    list_of_TS_promo = TimeSeries.from_group_dataframe(
                                df_family,
                                time_col="date",
                                group_cols=["store_nbr","family"],
                                value_cols="onpromotion",
                                fill_missing_dates=True,
                                freq='D')

    # astype returns a new TimeSeries, so the cast result has to be kept
    list_of_TS_promo = [ts.astype(np.float32) for ts in list_of_TS_promo]

    family_promotion_dict[family] = list_of_TS_promo

100%|██████████| 33/33 [00:45<00:00,  1.37s/it]


In [39]:
display(family_promotion_dict['AUTOMOTIVE'][0])

In [40]:
promotion_transformed_dict = {}

for key in tqdm(family_promotion_dict):
    promo_filler = MissingValuesFiller(verbose=False, n_jobs=-1, name="Fill NAs")
    promo_scaler = Scaler(verbose=False, n_jobs=-1, name="Scaling")

    promo_pipeline = Pipeline([promo_filler,
                             promo_scaler])

    promotion_transformed = promo_pipeline.fit_transform(family_promotion_dict[key])

    # Moving Averages for Promotion Family Dictionaries
    promo_moving_average_7 = MovingAverageFilter(window=7)
    promo_moving_average_28 = MovingAverageFilter(window=28)

    promotion_covs = []

    for ts in promotion_transformed:
        ma_7 = promo_moving_average_7.filter(ts)
        ma_7 = TimeSeries.from_series(ma_7.pd_series())  
        ma_7 = ma_7.astype(np.float32)
        ma_7 = ma_7.with_columns_renamed(col_names=ma_7.components, col_names_new="promotion_ma_7")
        ma_28 = promo_moving_average_28.filter(ts)
        ma_28 = TimeSeries.from_series(ma_28.pd_series())  
        ma_28 = ma_28.astype(np.float32)
        ma_28 = ma_28.with_columns_renamed(col_names=ma_28.components, col_names_new="promotion_ma_28")
        promo_and_mas = ts.stack(ma_7).stack(ma_28)
        promotion_covs.append(promo_and_mas)

    promotion_transformed_dict[key] = promotion_covs

100%|██████████| 33/33 [01:30<00:00,  2.73s/it]


In [41]:
display(promotion_transformed_dict['AUTOMOTIVE'][0].components.values)
display(promotion_transformed_dict['AUTOMOTIVE'][0][1])

array(['onpromotion', 'promotion_ma_7', 'promotion_ma_28'], dtype=object)

### Grouping the covariates

In [42]:
general_covariates = time_cov_transformed.stack(oil_transformed).stack(oil_moving_averages)

In [43]:
store_covariates_future = []

for store in range(0,len(store_list)):
    stacked_covariates = holidays_transformed[store].stack(general_covariates)  
    store_covariates_future.append(stacked_covariates)

In [44]:
future_covariates_dict = {}

for key in tqdm(promotion_transformed_dict):
    promotion_family = promotion_transformed_dict[key]
    covariates_future = [promotion_family[i].stack(store_covariates_future[i]) for i in range(0,len(promotion_family))]
    future_covariates_dict[key] = covariates_future

100%|██████████| 33/33 [00:06<00:00,  4.90it/s]


In [45]:
display(future_covariates_dict['AUTOMOTIVE'][0].components)

Index(['onpromotion', 'promotion_ma_7', 'promotion_ma_28', 'national_holiday',
       'earthquake_relief', 'christmas', 'football_event', 'national_event',
       'work_day', 'local_holiday', 'year', 'month', 'day', 'dayofyear',
       'dayofweek', 'weekofyear', 'linear_increase', 'dcoilwtico', 'oil_ma_7',
       'oil_ma_28'],
      dtype='object', name='component')

### Transactions – Past Covariates

In [46]:
df_transactions.sort_values(["store_nbr","date"], inplace=True)

TS_transactions_list = TimeSeries.from_group_dataframe(
                                df_transactions,
                                time_col="date",
                                group_cols=["store_nbr"],
                                value_cols="transactions",
                                fill_missing_dates=True,
                                freq='D')

transactions_list = []

for ts in TS_transactions_list:
            series = TimeSeries.from_series(ts.pd_series())
            series = series.astype(np.float32)
            transactions_list.append(series)

transactions_list[24] = transactions_list[24].slice(start_ts=pd.Timestamp('20130102'), end_ts=pd.Timestamp('20170815'))

from datetime import datetime, timedelta

transactions_list_full = []

for ts in transactions_list:
    if ts.start_time() > pd.Timestamp('20130101'):
        end_time = (ts.start_time() - timedelta(days=1))
        delta = end_time - pd.Timestamp('20130101')
        zero_series = TimeSeries.from_times_and_values(
                                  times=pd.date_range(start=pd.Timestamp('20130101'), 
                                  end=end_time, freq="D"),
                                  values=np.zeros(delta.days+1))
        ts = zero_series.append(ts)
        ts = ts.with_columns_renamed(col_names=ts.components, col_names_new="transactions")
        transactions_list_full.append(ts)

transactions_filler = MissingValuesFiller(verbose=False, n_jobs=-1, name="Filler")
transactions_scaler = Scaler(verbose=False, n_jobs=-1, name="Scaler")

transactions_pipeline = Pipeline([transactions_filler, transactions_scaler])
transactions_transformed = transactions_pipeline.fit_transform(transactions_list_full)

In [47]:
display(transactions_transformed[0])
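
The zero padding in the cell above exists because not every store has transactions from the first day of the training period, while all covariate series need to start on the same date. Here is the same idea in plain pandas, on a made-up store that starts three days late (the values are hypothetical):

```python
import pandas as pd

# Toy transactions for a store whose records start on 2013-01-04.
tx = pd.Series(
    [2111.0, 2358.0, 1922.0],
    index=pd.date_range("2013-01-04", periods=3, freq="D"),
    name="transactions",
)

# Extend the index back to the common start date and fill the new days with zeros.
full_index = pd.date_range("2013-01-01", tx.index.max(), freq="D")
tx_full = tx.reindex(full_index, fill_value=0.0)

print(tx_full)
```

Zero is a natural fill value here, since a store with no records yet has no receipts.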

### Create dataframe with id and reduce memory usage

In [48]:
df_indexes = pd.concat([df_train, df_test])
df_indexes = df_indexes.drop(['onpromotion'], axis=1)
df_indexes = df_indexes.sort_values(by=['store_nbr','family',])
df_indexes.date = pd.to_datetime(df_indexes.date)
df_indexes.shape

(3029400, 5)

In [49]:
df_indexes = df_indexes.set_index('date')
df_indexes.head()

date,id,store_nbr,family,sales
2013-01-01,0,1,AUTOMOTIVE,0.0
2013-01-02,1782,1,AUTOMOTIVE,2.0
2013-01-03,3564,1,AUTOMOTIVE,3.0
2013-01-04,5346,1,AUTOMOTIVE,3.0
2013-01-05,7128,1,AUTOMOTIVE,5.0


In [50]:
date_range = pd.date_range(start=df_indexes.index.min(), end=df_indexes.index.max(), freq="D")
df_indexes_filled = pd.DataFrame(columns=df_indexes.columns)

for family in tqdm(family_list):
    for store in store_list:
        temp_df = df_indexes.iloc[np.where((df_indexes.family == family)&(df_indexes.store_nbr == store))]
        temp_df = temp_df.reindex(date_range).fillna({'id': np.nan, 'store_nbr': store, 'family':family, 'sales': np.nan})
        df_indexes_filled = pd.concat([df_indexes_filled, temp_df])

df_indexes_filled.head()

100%|██████████| 33/33 [15:41<00:00, 28.54s/it]


Unnamed: 0,id,store_nbr,family,sales
2013-01-01,0.0,1.0,AUTOMOTIVE,0.0
2013-01-02,1782.0,1.0,AUTOMOTIVE,2.0
2013-01-03,3564.0,1.0,AUTOMOTIVE,3.0
2013-01-04,5346.0,1.0,AUTOMOTIVE,3.0
2013-01-05,7128.0,1.0,AUTOMOTIVE,5.0
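
The loop above takes about 15 minutes, mostly because it concatenates 1782 small frames one by one. A possible faster alternative, sketched here on toy data and not benchmarked on the real dataset, is to build the complete (store, family, date) index once and reindex in a single step. Note one difference: unlike the loop, this keeps `store_nbr` as an integer.

```python
import pandas as pd

# Toy stand-in for df_indexes: two stores, one family, 2013-01-02 missing for store 2.
toy = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2013-01-01", "2013-01-02", "2013-01-03", "2013-01-01", "2013-01-03"]
        ),
        "id": [0, 2, 4, 1, 5],
        "store_nbr": [1, 1, 1, 2, 2],
        "family": ["AUTOMOTIVE"] * 5,
        "sales": [0.0, 2.0, 3.0, 1.0, 4.0],
    }
)

date_range = pd.date_range(toy["date"].min(), toy["date"].max(), freq="D")

# Every (store, family, date) combination, ordered by store, then family, then date.
full_index = pd.MultiIndex.from_product(
    [sorted(toy["store_nbr"].unique()), sorted(toy["family"].unique()), date_range],
    names=["store_nbr", "family", "date"],
)

# Missing days receive NaN in 'id' and 'sales'; the key columns stay filled.
filled = toy.set_index(["store_nbr", "family", "date"]).reindex(full_index).reset_index()

print(filled)
```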


In [51]:
df_indexes_filled.index.name = 'date'
df_indexes_filled = df_indexes_filled.reset_index()
df_indexes_filled = df_indexes_filled.sort_values(['store_nbr', 'family'])
df_indexes_filled.head()

Unnamed: 0,date,id,store_nbr,family,sales
0,2013-01-01,0.0,1.0,AUTOMOTIVE,0.0
1,2013-01-02,1782.0,1.0,AUTOMOTIVE,2.0
2,2013-01-03,3564.0,1.0,AUTOMOTIVE,3.0
3,2013-01-04,5346.0,1.0,AUTOMOTIVE,3.0
4,2013-01-05,7128.0,1.0,AUTOMOTIVE,5.0


In [52]:
last_train_date = pd.to_datetime(df_train.date.max())

In [53]:
import gc

In [54]:
del(df_train)
del(df_test)
del(df_stores)
del(df_holidays_events)
del(df_oil)
del(df_transactions)
del(df_indexes)
del(train_merged)

gc.collect()

178

# Model

In [55]:
from darts.models import LightGBMModel

We will use one general prediction function for both blending and stacking.

In [56]:
def lgbm_predictions(model_params, val_df_size = 0):
    '''
    Train one set of LightGBM models per entry in model_params and return their forecasts.

    model_params: list of dicts with model hyperparameters ("lags", "lags_future_covariates",
    "lags_past_covariates"); tuning them can improve prediction accuracy.
    val_df_size: number of days in the validation set. It determines the size of the
    validation sample for stacking. For blending, leave it at the default of zero.
    '''
    l_train_date = last_train_date - np.timedelta64(val_df_size, 'D')
    local_df_indexes = df_indexes_filled.iloc[np.where(df_indexes_filled.date > l_train_date)]
    
    submission_kaggle_list = []    
    cnt = 1
    
    for params in model_params:
        LGBM_Models_Submission = {}
        display("Training...")
            
        # Fit Model
        print(f'Start fit model {cnt}')
        for family in tqdm(family_list):        
            sales_family = family_TS_transformed_dict[family]
            # training_data: the sales series truncated to the training period
            # (1688 daily steps from 2013-01-01 to 2017-08-15, minus the validation days)
            training_data = [ts[:1688-val_df_size] for ts in sales_family]
            # TCN_covariates: the future covariates associated with the target product family
            # (despite the name, the model trained here is LightGBM)
            TCN_covariates = future_covariates_dict[family]
            # train_sliced: the training series restricted to the time span shared with the covariates.
            # slice_intersect ensures that both cover the same time interval;
            # combining series with different time intervals would raise an error.
            train_sliced = [training_data[i].slice_intersect(TCN_covariates[i]) for i in range(0,len(training_data))]
            

            LGBM_Model_Submission = LightGBMModel(lags = params["lags"],
                                                  lags_future_covariates = params["lags_future_covariates"],
                                                  lags_past_covariates = params["lags_past_covariates"],
                                                  output_chunk_length=1,
                                                  random_state=2022,
                                                  gpu_use_dp= "false")


            LGBM_Model_Submission.fit(series=train_sliced, 
                                  future_covariates=TCN_covariates,
                                  # transactions_transformed: the past covariates do not need to be indexed on the target 
                                  # family because there is only one global `TimeSeries` per store.
                                  past_covariates=transactions_transformed)

            LGBM_Models_Submission[family] = LGBM_Model_Submission

        display("Predictions...")
        LGBM_Forecasts_Families_Submission = {}

        # Predict
        print(f'Start predict model {cnt}')
        for family in tqdm(family_list):
            sales_family = family_TS_transformed_dict[family]
            training_data = [ts[:1688-val_df_size] for ts in sales_family]
            LGBM_covariates = future_covariates_dict[family]
            train_sliced = [training_data[i].slice_intersect(LGBM_covariates[i]) for i in range(0,len(training_data))]

            forecast_LGBM = LGBM_Models_Submission[family].predict(
                                                  n= 16 + val_df_size,
                                                  series=train_sliced,
                                                  future_covariates=LGBM_covariates,
                                                  past_covariates=transactions_transformed
                                                 )

            LGBM_Forecasts_Families_Submission[family] = forecast_LGBM

        # Transform Back
        print(f'Start transform Back {cnt}')
        LGBM_Forecasts_Families_back_Submission = {}

        for family in tqdm(family_list):
            LGBM_Forecasts_Families_back_Submission[family] = family_pipeline_dict[family].inverse_transform(LGBM_Forecasts_Families_Submission[family], partial=True)

        # Prepare Submission in Correct Format
        print(f'Start Prepare Submission {cnt}')
        for family in tqdm(LGBM_Forecasts_Families_back_Submission):
            for n in range(0,len(LGBM_Forecasts_Families_back_Submission[family])):
                if (family_TS_dict[family][n].univariate_values()[-21:] == 0).all():
                    LGBM_Forecasts_Families_back_Submission[family][n] = LGBM_Forecasts_Families_back_Submission[family][n].map(lambda x: x * 0)

        listofseries = []

        for store in tqdm(range(0,54)):
            for family in family_list:
                oneforecast = LGBM_Forecasts_Families_back_Submission[family][store].pd_dataframe()
                oneforecast.columns = ['y_pred']
                listofseries.append(oneforecast)

        df_forecasts = pd.concat(listofseries) 
        df_forecasts.reset_index(drop=True, inplace=True)

        # No Negative Forecasts
        print(f'Start No Negative Forecasts {cnt}')
        df_forecasts[df_forecasts < 0] = 0
        forecasts_kaggle = pd.concat([local_df_indexes['id'], df_forecasts.set_index(local_df_indexes.index)], axis=1)
        forecasts_kaggle = forecasts_kaggle.reset_index(drop=True)

        # Submission
        print(f'Start Submission {cnt}')
        submission_kaggle_list.append(forecasts_kaggle)
        cnt += 1
    
    return submission_kaggle_list, local_df_indexes

# Blending
Blending is simple and intuitive, yet it often gives good results.

If one model errs in one direction, another model can offset that error when its predictions are biased in the opposite direction.

As a result, the mean of the predictions can be closer to the true value than the predictions of the individual models.

Note, however, that averaging is effective only when the models are diverse and their errors are largely independent of each other.

If the models are highly correlated or make similar errors, averaging may not bring a meaningful improvement.
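
The effect can be illustrated with a minimal sketch on synthetic data. Everything here is made up for illustration (the "true" values, the noise levels, the names `pred_a`, `pred_b`, `pred_c`), and plain RMSE is used only to keep the example short; it is not the pipeline of this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.uniform(0, 100, size=1000)  # synthetic targets

# Two hypothetical models whose errors are independent of each other
pred_a = y_true + rng.normal(0, 5, size=y_true.size)
pred_b = y_true + rng.normal(0, 5, size=y_true.size)

# A third hypothetical model whose errors almost copy those of model A
pred_c = pred_a + rng.normal(0, 0.5, size=y_true.size)


def rmse(y, p):
    """Root mean squared error between targets y and predictions p."""
    return float(np.sqrt(np.mean((y - p) ** 2)))


blend_independent = (pred_a + pred_b) / 2  # errors partly cancel out
blend_correlated = (pred_a + pred_c) / 2   # errors are shared, little gain

print("model A:          ", rmse(y_true, pred_a))
print("model B:          ", rmse(y_true, pred_b))
print("blend A+B:        ", rmse(y_true, blend_independent))
print("blend A+C (corr.):", rmse(y_true, blend_correlated))
```

With independent errors the blend should score noticeably better than either model alone, while blending two nearly identical models leaves the error almost unchanged.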

To reduce the correlation between the errors, let's give the models substantially different hyperparameters (mainly the number of target lags):

In [57]:
model_params = [
    {"lags" : 63, "lags_future_covariates" : (14,1), "lags_past_covariates" : [-16,-17,-18,-19,-20,-21,-22]},
    {"lags" : 7, "lags_future_covariates" : (16,1), "lags_past_covariates" : [-16,-17,-18,-19,-20,-21,-22]},  
    {"lags" : 31, "lags_future_covariates" : (14,1), "lags_past_covariates" : [-16,-17,-18,-19,-20,-21,-22]},
    {"lags" : 365, "lags_future_covariates" : (14,1), "lags_past_covariates" : [-16,-17,-18,-19,-20,-21,-22]}, 
    {"lags" : 730, "lags_future_covariates" : (14,1), "lags_past_covariates" : [-16,-17,-18,-19,-20,-21,-22]}, 
    {"lags" : 1095, "lags_future_covariates" : (14,1), "lags_past_covariates" : [-16,-17,-18,-19,-20,-21,-22]}
]

In [58]:
submission_kaggle_list, clipped_indexes = lgbm_predictions(model_params)

'Training...'

Start fit model 1


100%|██████████| 33/33 [07:46<00:00, 14.14s/it]


'Predictions...'

Start predict model 1


100%|██████████| 33/33 [00:30<00:00,  1.09it/s]


Start transform Back 1


100%|██████████| 33/33 [00:33<00:00,  1.02s/it]


Start Prepare Submission 1


100%|██████████| 33/33 [00:00<00:00, 36.83it/s]
100%|██████████| 54/54 [00:01<00:00, 27.97it/s]


Start No Negative Forecasts 1
Start Submission 1


'Training...'

Start fit model 2


100%|██████████| 33/33 [07:04<00:00, 12.87s/it]


'Predictions...'

Start predict model 2


100%|██████████| 33/33 [00:28<00:00,  1.16it/s]


Start transform Back 2


100%|██████████| 33/33 [00:34<00:00,  1.03s/it]


Start Prepare Submission 2


100%|██████████| 33/33 [00:00<00:00, 36.16it/s]
100%|██████████| 54/54 [00:01<00:00, 48.45it/s]


Start No Negative Forecasts 2
Start Submission 2


'Training...'

Start fit model 3


100%|██████████| 33/33 [06:54<00:00, 12.57s/it]


'Predictions...'

Start predict model 3


100%|██████████| 33/33 [00:29<00:00,  1.12it/s]


Start transform Back 3


100%|██████████| 33/33 [00:33<00:00,  1.02s/it]


Start Prepare Submission 3


100%|██████████| 33/33 [00:00<00:00, 37.59it/s]
100%|██████████| 54/54 [00:02<00:00, 26.03it/s]


Start No Negative Forecasts 3
Start Submission 3


'Training...'

Start fit model 4


100%|██████████| 33/33 [15:16<00:00, 27.76s/it]


'Predictions...'

Start predict model 4


100%|██████████| 33/33 [00:30<00:00,  1.08it/s]


Start transform Back 4


100%|██████████| 33/33 [00:35<00:00,  1.08s/it]


Start Prepare Submission 4


100%|██████████| 33/33 [00:00<00:00, 35.22it/s]
100%|██████████| 54/54 [00:01<00:00, 51.11it/s]


Start No Negative Forecasts 4
Start Submission 4


'Training...'

Start fit model 5


100%|██████████| 33/33 [19:51<00:00, 36.10s/it]


'Predictions...'

Start predict model 5


100%|██████████| 33/33 [00:31<00:00,  1.06it/s]


Start transform Back 5


100%|██████████| 33/33 [00:34<00:00,  1.06s/it]


Start Prepare Submission 5


100%|██████████| 33/33 [00:00<00:00, 35.94it/s]
100%|██████████| 54/54 [00:01<00:00, 46.83it/s]


Start No Negative Forecasts 5
Start Submission 5


'Training...'

Start fit model 6


100%|██████████| 33/33 [19:42<00:00, 35.83s/it]


'Predictions...'

Start predict model 6


100%|██████████| 33/33 [00:32<00:00,  1.03it/s]


Start transform Back 6


100%|██████████| 33/33 [00:34<00:00,  1.06s/it]


Start Prepare Submission 6


100%|██████████| 33/33 [00:00<00:00, 36.66it/s]
100%|██████████| 54/54 [00:01<00:00, 50.91it/s]


Start No Negative Forecasts 6
Start Submission 6


Now we average the predictions of the six models:

In [59]:
submissions = submission_kaggle_list[0].copy()
submissions = submissions.rename(columns={'y_pred': 'y_pred_0'})

if len(submission_kaggle_list) > 1:
    for i in range(1, len(submission_kaggle_list)):
        y_pred = submission_kaggle_list[i]
        y_pred = y_pred.rename(columns={'y_pred': f'y_pred_{i}'})
        submissions = pd.concat([submissions, y_pred.drop(['id'], axis=1)], axis=1)

submissions['sales'] = submissions.loc[:, submissions.columns!='id'].mean(axis=1)
submissions.head()

Unnamed: 0,id,y_pred_0,y_pred_1,y_pred_2,y_pred_3,y_pred_4,y_pred_5,sales
0,3000888.0,3.466254,3.328092,3.50758,3.068898,3.731862,4.001414,3.51735
1,3002670.0,2.890935,2.603473,3.216886,3.253258,3.68185,4.065727,3.285355
2,3004452.0,3.972446,2.813861,4.004264,3.85236,3.61572,2.8426,3.516875
3,3006234.0,5.056782,3.124398,5.022578,4.502551,5.044515,4.717491,4.578052
4,3008016.0,1.801655,1.05393,1.46369,1.787588,2.311553,2.026005,1.740737


In [60]:
submission = submissions[['id', 'sales']]
submission = submission.sort_values('id')
submission.id = submission.id.astype('int32')
submission.head()

Unnamed: 0,id,sales
0,3000888,3.51735
16,3000889,0.0
32,3000890,4.315871
48,3000891,2288.981609
64,3000892,0.031684


In [61]:
submission.to_csv('/kaggle/working/submission.csv', index=False)

# Stacking

Stacking is also an ensemble learning method, but it takes a slightly different approach.
The idea is to use the predictions of weak, maximally diverse models as input features for a meta-model.

We split the training dataset into a training part and a validation part (the last `val_df_size` days).
The Darts models are trained on the training part and make predictions for both the validation and the test periods.
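
Before running the real pipeline, here is a minimal sketch of the idea on synthetic data. It is not the notebook's implementation: the two base forecasts `base_1` and `base_2` are simulated with made-up biases, and a linear meta-model fitted with `np.linalg.lstsq` stands in for the LightGBM meta-model used later.

```python
import numpy as np

rng = np.random.default_rng(42)
n_val, n_test = 100, 16  # validation window and forecast horizon
y = rng.uniform(10, 50, size=n_val + n_test)  # synthetic "true" sales

# Hypothetical base-model forecasts with different systematic biases
base_1 = 0.8 * y + rng.normal(0, 1, size=y.size)  # under-forecasts by 20%
base_2 = y + 10 + rng.normal(0, 1, size=y.size)   # constant positive offset

# Meta-features: base predictions plus an intercept column
X = np.column_stack([base_1, base_2, np.ones(y.size)])
X_val, X_test = X[:n_val], X[n_val:]
y_val, y_test = y[:n_val], y[n_val:]

# The meta-model is fitted on the validation window only
weights, *_ = np.linalg.lstsq(X_val, y_val, rcond=None)
stacked = X_test @ weights

# Plain blending of the same two forecasts, for comparison
blended = X_test[:, :2].mean(axis=1)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


print("blending RMSE:", rmse(y_test, blended))
print("stacking RMSE:", rmse(y_test, stacked))
```

Because the meta-model sees the true targets of the validation window, it can learn to undo systematic biases that a plain average only dilutes.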

In [62]:
val_df_size = 100
submission_kaggle_list, clipped_indexes = lgbm_predictions(model_params, val_df_size)

'Training...'

Start fit model 1


100%|██████████| 33/33 [07:22<00:00, 13.40s/it]


'Predictions...'

Start predict model 1


100%|██████████| 33/33 [00:41<00:00,  1.25s/it]


Start transform Back 1


100%|██████████| 33/33 [00:35<00:00,  1.07s/it]


Start Prepare Submission 1


100%|██████████| 33/33 [00:00<00:00, 38.40it/s]
100%|██████████| 54/54 [00:01<00:00, 50.48it/s]


Start No Negative Forecasts 1
Start Submission 1


'Training...'

Start fit model 2


100%|██████████| 33/33 [06:48<00:00, 12.39s/it]


'Predictions...'

Start predict model 2


100%|██████████| 33/33 [00:40<00:00,  1.24s/it]


Start transform Back 2


100%|██████████| 33/33 [00:34<00:00,  1.06s/it]


Start Prepare Submission 2


100%|██████████| 33/33 [00:00<00:00, 36.46it/s]
100%|██████████| 54/54 [00:01<00:00, 49.04it/s]


Start No Negative Forecasts 2
Start Submission 2


'Training...'

Start fit model 3


100%|██████████| 33/33 [06:32<00:00, 11.90s/it]


'Predictions...'

Start predict model 3


100%|██████████| 33/33 [00:42<00:00,  1.28s/it]


Start transform Back 3


100%|██████████| 33/33 [00:35<00:00,  1.08s/it]


Start Prepare Submission 3


100%|██████████| 33/33 [00:00<00:00, 34.61it/s]
100%|██████████| 54/54 [00:02<00:00, 20.03it/s]


Start No Negative Forecasts 3
Start Submission 3


'Training...'

Start fit model 4


100%|██████████| 33/33 [14:41<00:00, 26.73s/it]


'Predictions...'

Start predict model 4


100%|██████████| 33/33 [00:44<00:00,  1.33s/it]


Start transform Back 4


100%|██████████| 33/33 [00:33<00:00,  1.03s/it]


Start Prepare Submission 4


100%|██████████| 33/33 [00:00<00:00, 35.14it/s]
100%|██████████| 54/54 [00:01<00:00, 48.12it/s]


Start No Negative Forecasts 4
Start Submission 4


'Training...'

Start fit model 5


100%|██████████| 33/33 [18:35<00:00, 33.79s/it]


'Predictions...'

Start predict model 5


100%|██████████| 33/33 [00:44<00:00,  1.34s/it]


Start transform Back 5


100%|██████████| 33/33 [00:35<00:00,  1.09s/it]


Start Prepare Submission 5


100%|██████████| 33/33 [00:00<00:00, 34.77it/s]
100%|██████████| 54/54 [00:01<00:00, 44.83it/s]


Start No Negative Forecasts 5
Start Submission 5


'Training...'

Start fit model 6


100%|██████████| 33/33 [17:08<00:00, 31.17s/it]


'Predictions...'

Start predict model 6


100%|██████████| 33/33 [00:43<00:00,  1.33s/it]


Start transform Back 6


100%|██████████| 33/33 [00:35<00:00,  1.07s/it]


Start Prepare Submission 6


100%|██████████| 33/33 [00:00<00:00, 35.30it/s]
100%|██████████| 54/54 [00:01<00:00, 49.51it/s]


Start No Negative Forecasts 6
Start Submission 6


In [63]:
submissions = submission_kaggle_list[0].copy()
submissions = submissions.rename(columns={'y_pred': 'y_pred_0'})

if len(submission_kaggle_list) > 1:
    for i in range(1, len(submission_kaggle_list)):
        y_pred = submission_kaggle_list[i]
        y_pred = y_pred.rename(columns={'y_pred': f'y_pred_{i}'})
        submissions = pd.concat([submissions, y_pred.drop(['id'], axis=1)], axis=1)

submissions.head()

Unnamed: 0,id,y_pred_0,y_pred_1,y_pred_2,y_pred_3,y_pred_4,y_pred_5
0,2822688.0,3.676266,3.417199,3.199446,3.579452,3.115974,3.06506
1,2824470.0,3.232229,3.578848,2.721255,3.442423,3.270835,2.882628
2,2826252.0,3.599256,3.569002,3.272049,3.206762,3.518412,2.981064
3,2828034.0,4.008501,3.567155,2.867772,3.685296,3.150041,3.790595
4,2829816.0,3.639679,2.998901,2.978183,3.13212,3.369821,3.219728


In [64]:
clipped_indexes.head()

Unnamed: 0,date,id,store_nbr,family,sales
1588,2017-05-08,2822688.0,1.0,AUTOMOTIVE,5.0
1589,2017-05-09,2824470.0,1.0,AUTOMOTIVE,2.0
1590,2017-05-10,2826252.0,1.0,AUTOMOTIVE,2.0
1591,2017-05-11,2828034.0,1.0,AUTOMOTIVE,4.0
1592,2017-05-12,2829816.0,1.0,AUTOMOTIVE,4.0


In [65]:
submissions = pd.concat([submissions, clipped_indexes[['date', 'sales']].reset_index(drop=True)], axis=1)
submissions.head()

Unnamed: 0,id,y_pred_0,y_pred_1,y_pred_2,y_pred_3,y_pred_4,y_pred_5,date,sales
0,2822688.0,3.676266,3.417199,3.199446,3.579452,3.115974,3.06506,2017-05-08,5.0
1,2824470.0,3.232229,3.578848,2.721255,3.442423,3.270835,2.882628,2017-05-09,2.0
2,2826252.0,3.599256,3.569002,3.272049,3.206762,3.518412,2.981064,2017-05-10,2.0
3,2828034.0,4.008501,3.567155,2.867772,3.685296,3.150041,3.790595,2017-05-11,4.0
4,2829816.0,3.639679,2.998901,2.978183,3.13212,3.369821,3.219728,2017-05-12,4.0


In [66]:
del(submission_kaggle_list)
del(df_indexes_filled)
del(clipped_indexes)

gc.collect()

689

Extract the future covariates for the validation and test periods from the TimeSeries objects:

In [67]:
# 1688 is the number of days in the training period (2013-01-01 to 2017-08-15),
# so the slice below covers the validation window plus the test period.
start = 1688 - val_df_size
cov_columns = list(future_covariates_dict['AUTOMOTIVE'][0].columns)
frames = []

for store in tqdm(range(0, 54)):
    for family in list(family_TS_dict.keys()):
        cov_ts = future_covariates_dict[family][store]
        fut_cov_temp = pd.DataFrame(cov_ts[start:].values(),
                                    index=cov_ts.time_index[start:],
                                    columns=cov_ts.columns)
        fut_cov_temp['store_nbr'] = store + 1
        fut_cov_temp['family'] = family
        frames.append(fut_cov_temp)

# Concatenate once instead of growing the DataFrame inside the loop
future_covariates_df = pd.concat(frames)[['store_nbr', 'family'] + cov_columns]
        
future_covariates_df.head()

100%|██████████| 54/54 [00:19<00:00,  2.71it/s]


Unnamed: 0,store_nbr,family,onpromotion,promotion_ma_7,promotion_ma_28,national_holiday,earthquake_relief,christmas,football_event,national_event,...,year,month,day,dayofyear,dayofweek,weekofyear,linear_increase,dcoilwtico,oil_ma_7,oil_ma_28
2017-05-08,1,AUTOMOTIVE,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.363636,0.233333,0.347945,0.0,0.346154,0.941316,0.240081,0.241925,0.261599
2017-05-09,1,AUTOMOTIVE,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.363636,0.266667,0.350685,0.166667,0.346154,0.941909,0.232737,0.244632,0.262407
2017-05-10,1,AUTOMOTIVE,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.363636,0.3,0.353425,0.333333,0.346154,0.942502,0.249793,0.247791,0.26321
2017-05-11,1,AUTOMOTIVE,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.363636,0.333333,0.356164,0.5,0.346154,0.943094,0.25607,0.2514,0.263959
2017-05-12,1,AUTOMOTIVE,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.363636,0.366667,0.358904,0.666667,0.346154,0.943687,0.256307,0.255461,0.263794


Enrich model predictions with future covariate data:

In [68]:
data_with_preds = pd.concat([submissions, future_covariates_df.reset_index(drop=True)], axis=1)
data_with_preds = data_with_preds.set_index('date')
data_with_preds.head()

date,id,y_pred_0,y_pred_1,y_pred_2,y_pred_3,y_pred_4,y_pred_5,sales,store_nbr,family,...,year,month,day,dayofyear,dayofweek,weekofyear,linear_increase,dcoilwtico,oil_ma_7,oil_ma_28
2017-05-08,2822688.0,3.676266,3.417199,3.199446,3.579452,3.115974,3.06506,5.0,1,AUTOMOTIVE,...,1.0,0.363636,0.233333,0.347945,0.0,0.346154,0.941316,0.240081,0.241925,0.261599
2017-05-09,2824470.0,3.232229,3.578848,2.721255,3.442423,3.270835,2.882628,2.0,1,AUTOMOTIVE,...,1.0,0.363636,0.266667,0.350685,0.166667,0.346154,0.941909,0.232737,0.244632,0.262407
2017-05-10,2826252.0,3.599256,3.569002,3.272049,3.206762,3.518412,2.981064,2.0,1,AUTOMOTIVE,...,1.0,0.363636,0.3,0.353425,0.333333,0.346154,0.942502,0.249793,0.247791,0.26321
2017-05-11,2828034.0,4.008501,3.567155,2.867772,3.685296,3.150041,3.790595,4.0,1,AUTOMOTIVE,...,1.0,0.363636,0.333333,0.356164,0.5,0.346154,0.943094,0.25607,0.2514,0.263959
2017-05-12,2829816.0,3.639679,2.998901,2.978183,3.13212,3.369821,3.219728,4.0,1,AUTOMOTIVE,...,1.0,0.363636,0.366667,0.358904,0.666667,0.346154,0.943687,0.256307,0.255461,0.263794


In [69]:
data_with_preds.store_nbr = data_with_preds.store_nbr.astype('int32')

Let's split the data into a validation set and a test set.

The meta-model uses the base-model predictions, together with the covariates, as input features and learns the relationship between them and the original targets on the validation set.

In [70]:
val = data_with_preds.iloc[np.where(data_with_preds.index < '2017-08-16')]
test = data_with_preds.iloc[np.where(data_with_preds.index >= '2017-08-16')]
print(val.index.min(), val.index.max())
print(test.index.min(), test.index.max())

2017-05-08 00:00:00 2017-08-15 00:00:00
2017-08-16 00:00:00 2017-08-31 00:00:00


In [71]:
from lightgbm import LGBMRegressor

In [72]:
feature_cols = list(val.columns.drop(['id', 'sales', 'family']))

In [73]:
result = pd.DataFrame(columns=['id', 'sales'])

for family in tqdm(family_list):
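    # .copy() so that the log1p assignment below modifies an independent frame, not a slice of `val`
    temp_val = val.iloc[np.where(val.family.values == family)].copy()
    temp_test = test.iloc[np.where(test.family.values == family)]

    # Train the meta-model on log1p(sales); predictions are mapped back with expm1
    temp_val['sales'] = np.log1p(temp_val['sales'])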

    lgbm = LGBMRegressor()
    lgbm.fit(temp_val[feature_cols], temp_val['sales'])
    y_pred = lgbm.predict(temp_test[feature_cols])
    y_pred = np.expm1(y_pred)

    temp_test = pd.concat([temp_test['id'].reset_index(drop=True), pd.Series(y_pred, name='sales')], axis=1)
    result = pd.concat([result, temp_test])

# No negative forecasts (use .loc to avoid chained assignment)
result.loc[result['sales'] < 0, 'sales'] = 0

100%|██████████| 33/33 [00:07<00:00,  4.44it/s]
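
The meta-model above is fitted on `log1p(sales)` and its output is mapped back with `expm1`. A minimal sketch of why this pairing is convenient, assuming the competition is scored with RMSLE: an RMSE measured in log space is the same number as the RMSLE in the original space. The sales values and the errors below are made up for illustration.

```python
import numpy as np


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error."""
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))


sales = np.array([0.0, 3.0, 250.0, 2288.0])  # illustrative values only

# Forward and inverse transform are exact inverses (up to float precision)
log_target = np.log1p(sales)
restored = np.expm1(log_target)

# Pretend a model predicted the log target with small errors
log_pred = log_target + np.array([0.1, -0.2, 0.05, 0.0])
pred = np.expm1(log_pred)  # back to the original sales scale

# RMSE in log space equals RMSLE in the original space
rmse_log = float(np.sqrt(np.mean((log_pred - log_target) ** 2)))
print(rmse_log, rmsle(sales, pred))
```

So a regressor that minimizes squared error on the transformed target is optimizing the same quantity that an RMSLE metric measures.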


In [74]:
submission_stack = result.sort_values('id')
submission_stack.id = submission_stack.id.astype('int32')
submission_stack.head()

Unnamed: 0,id,sales
0,3000888,4.265958
0,3000889,0.0
0,3000890,5.824962
0,3000891,2268.181973
0,3000892,0.0


In [75]:
submission_stack.to_csv('/kaggle/working/submission_stack.csv', index=False)