<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#LGBM---Rolling-Windows" data-toc-modified-id="LGBM---Rolling-Windows-1">LGBM - Rolling Windows</a></span><ul class="toc-item"><li><span><a href="#Defining-metrics" data-toc-modified-id="Defining-metrics-1.1">Defining metrics</a></span></li><li><span><a href="#Building-our-dataset" data-toc-modified-id="Building-our-dataset-1.2">Building our dataset</a></span></li><li><span><a href="#Feature-building" data-toc-modified-id="Feature-building-1.3">Feature building</a></span></li><li><span><a href="#Maximum-error" data-toc-modified-id="Maximum-error-1.4">Maximum error</a></span></li><li><span><a href="#Dataset-Splitting-(Train-until-week-3-/-Val.-week-2/-Test-week-1)" data-toc-modified-id="Dataset-Splitting-(Train-until-week-3-/-Val.-week-2/-Test-week-1)-1.5">Dataset Splitting (Train until week 3 / Val. week 2/ Test week 1)</a></span></li><li><span><a href="#Dataset-Splitting-(Train-until-week-2-and-test-with-week-1)" data-toc-modified-id="Dataset-Splitting-(Train-until-week-2-and-test-with-week-1)-1.6">Dataset Splitting (Train until week 2 and test with week 1)</a></span><ul class="toc-item"><li><span><a href="#Utilities" data-toc-modified-id="Utilities-1.6.1">Utilities</a></span></li></ul></li></ul></li><li><span><a href="#Now-change-the-new-items-preds..." data-toc-modified-id="Now-change-the-new-items-preds...-2">Now change the new items preds...</a></span></li></ul></div>

# LGBM - Rolling Windows

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as mse
import sys
import xgboost as xgb
import lightgbm as lgb
from datetime import datetime

NUMBER_OF_LAGS = 4

sys.path.append("../main/datasets/")
!ls  ../main/datasets/

!cp ../dora/models/utils.py .

from utils import *

1.0v  1.0v.zip


<hr>

## Defining metrics

Baseline_score function

In [2]:
def baseline_score(prediction, target, simulatedPrice):
    prediction = prediction.astype(int)

    return np.sum((prediction - np.maximum(prediction - target, 0) * 1.6)  * simulatedPrice)
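
To make the asymmetry of this score concrete, here is a minimal worked example on two made-up items (the numbers are illustrative, not taken from the dataset). An under-prediction earns the predicted units times the price, while every unit predicted above the target subtracts 1.6 units' worth of price from the predicted amount.

```python
import numpy as np

def baseline_score(prediction, target, simulatedPrice):
    # Same formula as the cell above; predictions are truncated to integers first
    prediction = prediction.astype(int)
    return np.sum((prediction - np.maximum(prediction - target, 0) * 1.6) * simulatedPrice)

# Illustrative values only (assumptions, not real data)
pred = np.array([3.0, 5.0])     # item A is under-predicted, item B is over-predicted
target = np.array([5, 2])
price = np.array([10.0, 2.0])

# item A: 3 * 10 = 30
# item B: (5 - (5 - 2) * 1.6) * 2 = 0.2 * 2 = 0.4
score = baseline_score(pred, target, price)
print(round(score, 6))
```

Over-predicting item B by three units wipes out almost all of its contribution, which is why the score favours cautious predictions.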

Evaluation Metric

In [3]:
def feval(prediction, dtrain):
    
    prediction = prediction.astype(int)
    target = dtrain.get_label()

    simulatedPrice = dtrain.get_weight()
    
    return 'feval', np.sum((prediction - np.maximum(prediction - target, 0) * 1.6)  * simulatedPrice), True

Objective Metric

In [4]:
def gradient(predt, dtrain):
    y = dtrain.get_label()
    sp = dtrain.get_weight()
    return -2 * (predt - np.maximum(predt - y, 0) * 1.6) * (1 - (predt > y) * 1.6) * sp

def hessian(predt, dtrain):
    y = dtrain.get_label()
    sp = dtrain.get_weight() 
    return -2 * ((1 - (predt > y) * 1.6) ** 2) * sp

def objective(predt, dtrain):
    grad = gradient(predt, dtrain)
    hess = hessian(predt, dtrain)
    return grad, hess
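
The two functions above can be read as the first and second derivatives of a per-row loss `-(s ** 2) * sp`, where `s` is the per-row score term. The sketch below checks that reading with a finite-difference test on made-up values. `StubDataset` and `loss` are hypothetical helpers introduced only for this check; they are not part of LightGBM or of `utils.py`.

```python
import numpy as np

class StubDataset:
    """Hypothetical stand-in exposing the two accessors used by the cells above."""
    def __init__(self, label, weight):
        self._label = np.asarray(label, dtype=float)
        self._weight = np.asarray(weight, dtype=float)

    def get_label(self):
        return self._label

    def get_weight(self):
        return self._weight

def gradient(predt, dtrain):
    y = dtrain.get_label()
    sp = dtrain.get_weight()
    return -2 * (predt - np.maximum(predt - y, 0) * 1.6) * (1 - (predt > y) * 1.6) * sp

def hessian(predt, dtrain):
    y = dtrain.get_label()
    sp = dtrain.get_weight()
    return -2 * ((1 - (predt > y) * 1.6) ** 2) * sp

def loss(predt, dtrain):
    # Assumed per-row loss whose derivative the gradient above should match
    y = dtrain.get_label()
    sp = dtrain.get_weight()
    s = predt - np.maximum(predt - y, 0) * 1.6
    return -(s ** 2) * sp

# Illustrative values, chosen away from the kink at predt == y
data = StubDataset(label=[5.0, 2.0], weight=[10.0, 2.0])
predt = np.array([3.0, 5.0])

eps = 1e-6
numeric_grad = (loss(predt + eps, data) - loss(predt - eps, data)) / (2 * eps)
analytic_grad = gradient(predt, data)
analytic_hess = hessian(predt, data)

print(numeric_grad, analytic_grad, analytic_hess)
```

At these points the Hessian values are negative, which is worth keeping in mind before enabling the custom objective, since boosting libraries generally divide by the summed Hessian when computing leaf values.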

<hr>

## Building our dataset
This notebook makes this step cleaner than in previous versions, so it is tidier and shorter than before!

In [5]:
infos, items, orders = read_data("../main/datasets/")
print("Sanity checks...", infos.shape, items.shape, orders.shape)

Sanity checks... (10463, 3) (10463, 8) (2181955, 5)


In [6]:
# Changing our time signatures
process_time(orders)

In [7]:
df = dataset_builder(orders, items)

<hr>

## Feature building

In [8]:
# percentage_accum_cat_3 feature...
df = cumulative_sale_by_category(df)

In [9]:
# Encoding our weeks as a series of sines and cosines...
# This function will consider our period as a semester in a year,
# so we can try other types of time encoding later!
df = time_encoder(df, 'group_backwards', 26)
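
The body of `time_encoder` lives in `utils.py` and is not shown here. As a rough sketch of what a sine/cosine encoding with a period of 26 usually looks like, the hypothetical helper `cyclical_encode` below maps a periodic integer column onto the unit circle. The function name and the output column names are assumptions, not the actual implementation.

```python
import numpy as np
import pandas as pd

def cyclical_encode(frame, column, period):
    """Hypothetical helper: add sine/cosine columns for a periodic integer column."""
    angle = 2 * np.pi * frame[column] / period
    out = frame.copy()
    out[f"{column}_sin"] = np.sin(angle)
    out[f"{column}_cos"] = np.cos(angle)
    return out

# Toy values: 0 and 26 are one full period apart, 13 is half a period away
toy = pd.DataFrame({"group_backwards": [0, 13, 26]})
encoded = cyclical_encode(toy, "group_backwards", 26)
print(encoded)
```

With this kind of encoding, values one full period apart land on the same point, so the model sees the start and end of the cycle as neighbours.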

In [10]:
# This cell lags and diffs our feature 'orderSum'
shifting = df.copy()

for i in range(1, NUMBER_OF_LAGS + 1):
    # Carrying the data of week t-i (the value i rows earlier for the same item)
    shifting[f'orderSum_{i}'] = shifting.groupby('itemID')['orderSum'].shift(i)

    # Getting the difference between consecutive lagged order values (weeks t-i and t-i-1)
    shifting[f'orderSum_diff_{i}'] = shifting.groupby('itemID')[f'orderSum_{i}'].diff()
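
A small toy frame shows what the lag and diff columns contain for `i = 1`. The data is made up for illustration. Lags never cross item boundaries, and the first rows of each item stay `NaN`.

```python
import pandas as pd

# Two items with a few consecutive periods each (illustrative values)
toy = pd.DataFrame({
    "itemID":   [1, 1, 1, 2, 2],
    "orderSum": [10, 12, 15, 1, 4],
})

# Lag by one row within each item
toy["orderSum_1"] = toy.groupby("itemID")["orderSum"].shift(1)

# Difference between consecutive lagged values within each item
toy["orderSum_diff_1"] = toy.groupby("itemID")["orderSum_1"].diff()

print(toy)
```

Item 2 has only two rows, so its diff column is entirely `NaN`; these gaps are what the later `fillna(0)` cell fills in.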
    

In [11]:
%%time
# This cell creates rolling-window features based on 'orderSum' in our dataset!
item_group = shifting.groupby(["itemID", "group_backwards"]).agg({'orderSum':'sum'})

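# The groupby sorts "group_backwards" in ascending order (1 = most recent), so
# .shift(-1) pulls each row's value from the period right before it.
# Doing .shift(1) would pull from the period after it and cause a HUGE data leak.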
aux_shifting = item_group.groupby('itemID')[['orderSum']].shift(-1)

aux_shifting.sort_values(['itemID', 'group_backwards'], ascending=[True, False], inplace=True)

for i in range(3):
    rolled_window = aux_shifting.groupby(['itemID'], as_index=False)[['orderSum']].rolling(2 ** i).mean()
    rolled_window.rename(columns={'orderSum':f"orderSum_mean_rolled_{i}"}, inplace=True)
    shifting = pd.merge(shifting, rolled_window, left_on=['itemID', 'group_backwards'], right_on=['itemID', 'group_backwards'])
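
The direction of the shift is easy to get wrong, so here is a minimal sketch on a single made-up item. It uses `transform` with a rolling mean rather than the `merge` used above, purely to keep the example short, and the column names `prev_orderSum` and `prev_mean_2` are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "itemID": [1, 1, 1, 1],
    "group_backwards": [1, 2, 3, 4],   # 1 = most recent period, 4 = oldest
    "orderSum": [40, 30, 20, 10],
})

# Ascending order: the next row is the period before the current one in time
toy = toy.sort_values(["itemID", "group_backwards"])
toy["prev_orderSum"] = toy.groupby("itemID")["orderSum"].shift(-1)

# Chronological order (oldest first) before rolling
toy = toy.sort_values(["itemID", "group_backwards"], ascending=[True, False])
toy["prev_mean_2"] = (
    toy.groupby("itemID")["prev_orderSum"]
       .transform(lambda s: s.rolling(2).mean())
)

print(toy)
```

For `group_backwards == 1` the rolling feature averages the sales of periods 2 and 3 only, so the row's own `orderSum` never enters its own feature.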

In [12]:
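# According to its docs, LightGBM treats NaN as missing by default and only treats
# zeros as missing when `zero_as_missing=true`, so filling with 0 turns the gaps
# left by the lags and rolling windows into ordinary zero values...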
shifting.fillna(0, inplace=True)

<hr>

## Maximum error
The maximum error we should accept on this dataset comes from simply guessing the mean of our sales from weeks 1 to 12 for every item in week 13. The cell below computes that error as an RMSE.

In [13]:
worst_possible_prediction = shifting.loc[shifting.group_backwards < 13]['orderSum'].mean()
prediction = np.full(shifting.loc[shifting.group_backwards == 13]['orderSum'].shape, worst_possible_prediction) # Array filled with the mean...
target = shifting.loc[shifting.group_backwards == 13]['orderSum']
print("Guessing the mean of 'orderSum' for all items in target", mse(target, prediction) ** 0.5)

Guessing the mean of 'orderSum' for all items in target 90.29706562119341


<hr>

## Dataset Splitting (Train until week 3 / Val. week 2/ Test week 1)
All my experiments will use weeks 13 to 3 as a train set, week 2 as our validation set and week 1 as a test set.

In [14]:
train = shifting.loc[shifting.group_backwards >= 3]
val = shifting.loc[shifting.group_backwards == 2]
test = shifting.loc[shifting.group_backwards == 1]

weights = infos.set_index('itemID')['simulationPrice'].to_dict()

w_train = train['itemID'].map(weights)
w_val = val['itemID'].map(weights)

In [15]:
# I recommend that the other members of the team keep
# our datasets as Pandas DataFrames instead of NumPy arrays,
# since it will be easier to use boosting analysis frameworks
y_train = train['orderSum']
y_val = val['orderSum']
X_train = train.drop(columns=["orderSum"])
X_val = val.drop(columns=["orderSum"])

In [16]:
params = {
#           "objective" : "poisson",
          "objective" : "l1",
          "metric" :"rmse",
          "learning_rate" : 0.1,
          'verbosity': 1,
          'max_depth': 6,
          'num_leaves': 15,
          "min_data_in_leaf":2000,
         }

lgbtrain = lgb.Dataset(X_train, label = y_train, weight=w_train)
lgbvalid = lgb.Dataset(X_val, label = y_val, weight=w_val)

num_round = 1000
model = lgb.train(params,
                  lgbtrain,
                  num_round,
                  valid_sets = [lgbtrain, lgbvalid], 
                  verbose_eval=5,
                  early_stopping_rounds=5,
#                   fobj=objective,
                  feval=feval,
                 )

Training until validation scores don't improve for 5 rounds
[5]	training's rmse: 39.9271	training's feval: 73304.1	valid_1's rmse: 44.9012	valid_1's feval: 14943.3
[10]	training's rmse: 39.9159	training's feval: 147112	valid_1's rmse: 44.884	valid_1's feval: 25599.5
[15]	training's rmse: 39.8983	training's feval: 254284	valid_1's rmse: 44.8611	valid_1's feval: 45427.1
[20]	training's rmse: 39.8859	training's feval: 350212	valid_1's rmse: 44.8481	valid_1's feval: 58560.5
[25]	training's rmse: 39.8759	training's feval: 408437	valid_1's rmse: 44.836	valid_1's feval: 69395.5
[30]	training's rmse: 39.8711	training's feval: 429720	valid_1's rmse: 44.8281	valid_1's feval: 72359.6
[35]	training's rmse: 39.8663	training's feval: 454300	valid_1's rmse: 44.8238	valid_1's feval: 77421.1
[40]	training's rmse: 39.8629	training's feval: 480205	valid_1's rmse: 44.8202	valid_1's feval: 79878.7
[45]	training's rmse: 39.8611	training's feval: 493234	valid_1's rmse: 44.8178	valid_1's feval: 82474.9
[50]	t

<hr>

## Dataset Splitting (Train until week 2 and test with week 1)
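All my experiments here use weeks 13 to 2 as the train set and week 1 as the test set. Week 1 also doubles as the validation set in the cells below, so early stopping sees the test data.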

In [26]:
train = shifting.loc[shifting.group_backwards >= 2]
val = shifting.loc[shifting.group_backwards == 1]
test = shifting.loc[shifting.group_backwards == 1]

weights = infos.set_index('itemID')['simulationPrice'].to_dict()

w_train = train['itemID'].map(weights)
w_val = val['itemID'].map(weights)

In [27]:
# I recommend that the other members of the team keep
# our datasets as Pandas DataFrames instead of NumPy arrays,
# since it will be easier to use boosting analysis frameworks
y_train = train['orderSum']
y_val = val['orderSum']
X_train = train.drop(columns=["orderSum"])
X_val = val.drop(columns=["orderSum"])

In [28]:
params = {
#           "objective" : "poisson",
          "objective" : "l1",
          "metric" :"rmse",
          "learning_rate" : 0.5,
          'verbosity': 1,
          'max_depth': 6,
          'num_leaves': 32,
          "min_data_in_leaf":3000,
         }

lgbtrain = lgb.Dataset(X_train, label = y_train, weight=w_train)
lgbvalid = lgb.Dataset(X_val, label = y_val, weight=w_val)

num_round = 1000
model = lgb.train(params,
                  lgbtrain,
                  num_round,
                  valid_sets = [lgbtrain, lgbvalid], 
                  verbose_eval=5,
                  early_stopping_rounds=5,
#                   fobj=objective,
                  feval=feval,
                  
                 )

Training until validation scores don't improve for 5 rounds
[5]	training's rmse: 40.303	training's feval: 583152	valid_1's rmse: 43.6203	valid_1's feval: 73970.8
[10]	training's rmse: 40.2892	training's feval: 640370	valid_1's rmse: 43.5985	valid_1's feval: 84756
[15]	training's rmse: 40.2854	training's feval: 661741	valid_1's rmse: 43.5928	valid_1's feval: 85855.1
Early stopping, best iteration is:
[13]	training's rmse: 40.2859	training's feval: 662275	valid_1's rmse: 43.5931	valid_1's feval: 90997


<hr>

### Utilities

**Predicting at test time**

In [29]:
y_test = test['orderSum']
X_test = test.drop(columns=["orderSum"])
final_predictions = model.predict(X_test)

In [30]:
final_predictions

array([0.27124023, 1.        , 0.5       , ..., 0.        , 0.        ,
       0.        ])

In [31]:
final_predictions[final_predictions < 0] = 0

In [43]:
final_predictions

array([0.27124023, 1.        , 0.5       , ..., 0.        , 0.        ,
       0.        ])

# Now change the new items preds...

In [38]:
# "group_backwards" counts backwards in time, so sorting it in descending
# order puts each item's earliest fortnight first:
first_fortnight_item = orders.sort_values("group_backwards",
                                     ascending=False)\
                          .groupby(["itemID"])["group_backwards"].first()
first_fortnight_item.head()

itemID
1    12
2     9
3    13
4    12
5    13
Name: group_backwards, dtype: int64

In [53]:
new_items_value = 10  # The mode seems too high, so go with 10...
idx = X_test["itemID"].isin(first_fortnight_item[first_fortnight_item == 1].index)
final_predictions[idx] = new_items_value
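
The override above can be sketched end to end on made-up data. An item counts as "new" when the earliest period it appears in is period 1. Taking the maximum of `group_backwards` per item is used here as an equivalent of the sort-then-`first()` step; the toy frames and the name `first_seen` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy order log: item 2 only ever appears in the most recent period
orders = pd.DataFrame({
    "itemID":          [1, 1, 2, 3, 3],
    "group_backwards": [3, 1, 1, 2, 1],
})

# Earliest appearance = largest group_backwards, since it counts backwards
first_seen = orders.groupby("itemID")["group_backwards"].max()
new_items = first_seen[first_seen == 1].index

test_items = pd.Series([1, 2, 3], name="itemID")
preds = np.array([0.3, 0.0, 4.0])

# Boolean mask aligned positionally with the prediction array
mask = test_items.isin(new_items).to_numpy()
preds[mask] = 10   # constant fallback for items with no sales history

print(preds)
```

Only the prediction for item 2 is replaced, since the model has no history to learn from for that item.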

**Baseline calculation**

In [54]:
baseline_score(final_predictions, y_test.values, infos['simulationPrice'])

486009.164
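
The call above passes `infos['simulationPrice']` directly, which lines up with the predictions only if the test rows follow the same item order as `infos`. Should that ever not hold, one option is to look prices up by `itemID`, as the `weights` mapping does for training. Below is a minimal sketch on made-up frames; the name `aligned_prices` is illustrative.

```python
import pandas as pd

# Toy stand-ins for `infos` and `X_test` (illustrative values)
infos = pd.DataFrame({
    "itemID": [1, 2, 3],
    "simulationPrice": [10.0, 2.0, 5.0],
})
X_test = pd.DataFrame({"itemID": [3, 1, 2]})   # deliberately in a different order

# Look up each test row's price by itemID instead of relying on row position
price_by_item = infos.set_index("itemID")["simulationPrice"]
aligned_prices = X_test["itemID"].map(price_by_item).to_numpy()

print(aligned_prices)
```

Passing an array aligned this way to `baseline_score` makes the result independent of row ordering.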