# XGBoost Enhanced Features
This notebook was created after the Promotion feature was added to the dataset. Its main goal is to produce the predictions that we submit to our private Kaggle leaderboard.

In [36]:
import numpy as np
import pandas as pd
from utils import read_data, process_time, merge_data, promo_detector, promotionAggregation
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as mse
import sys
import xgboost as xgb
import lightgbm as lgb
from datetime import datetime

NUMBER_OF_LAGS = 4

sys.path.append("../../main/datasets/")
!ls  ../../main/datasets/

1.0v.zip


## Preparing our dataset
These steps were already covered in the `../pre-processing-features` notebooks.

In [37]:
infos, items, orders = read_data("../../main/datasets/")
print("Sanity checks...", infos.shape, items.shape, orders.shape)

Sanity checks... (10463, 3) (10463, 8) (2181955, 5)


In [38]:
# Changing our time signatures, 
# adding our promotion feature 
# and aggregating our data by weeks...
process_time(orders)
orders = promo_detector(orders)
df = promotionAggregation(orders, items)

In [39]:
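def prepareOrders(orders, items):
    """Add zero-sales rows to 'orders' for every (itemID, week) pair without sales.

    This covers both the weeks in which a sold item had no orders and the
    items that were never sold at all. THIS IS NOT MODULARIZED: the column
    names and the 13-week range are hard-coded, so change the code to
    suit the features of your own dataset.
    """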
    
    df = orders.copy()
    
    # Getting the IDs that were never sold
    not_sold_items = items[~items.itemID.isin(orders['itemID'].unique())]

    new_rows = []
    weeks_database = orders['group_backwards'].unique()

    for idd in df['itemID'].unique():
        orders_id = df[df.itemID == idd]
        example = orders_id.iloc[0]

        # finding weeks without itemID sales
        weeks_id = orders_id['group_backwards'].unique()
        weeks_without_id = np.setdiff1d(weeks_database, weeks_id)

        # creating new row
        for w in weeks_without_id:
            new_rows.append({'itemID': idd,
                             'group_backwards': w,
                             'salesPrice_mean': 0,
                             'customerRating': example['customerRating'],
                             'category1': example['category1'],
                             'category2': example['category2'],
                             'category3': example['category3'],
                             'recommendedRetailPrice': example['recommendedRetailPrice'],
                             'orderSum': 0,
                             'manufacturer': example['manufacturer'],
                             'brand': example['brand'],
                             'promotion_mean': 0
                             })
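    # Adding the zero-sales rows for the weeks each sold item is missing from.
    # DataFrame.append was removed in pandas 2.0, so pd.concat is used instead.
    df = pd.concat([df, pd.DataFrame(new_rows)])

    # Adding rows in every week with the IDs of the
    # items that were never sold.
    not_sold_orders = pd.DataFrame()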
    for i in range(1, 14):
        aux = not_sold_items.copy()
        aux['group_backwards'] = i
        aux['salesPrice_mean'] = 0
        aux['promotion_mean'] = 0
        aux['orderSum'] = 0
        not_sold_orders = pd.concat([not_sold_orders, aux], axis=0)
    df = pd.concat([df, not_sold_orders], axis=0).sort_values(
        ['group_backwards', 'itemID'], ascending=[False, True], ignore_index=True)
    return df

In [40]:
df = prepareOrders(df, items)
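
The loop in `prepareOrders` visits every item and appends one row per missing week. As a rough sketch of the same idea on toy data, the missing (itemID, week) pairs can also be produced by reindexing against the full item-by-week grid. The toy frame and the `fill_missing_weeks` helper are illustrative assumptions (they are not part of `utils`), and the helper only handles the `orderSum` column.

```python
import pandas as pd

def fill_missing_weeks(weekly, all_items, all_weeks):
    """Return one row per (itemID, week), with orderSum = 0 where no sale exists.

    Hypothetical helper for illustration only.
    """
    full_index = pd.MultiIndex.from_product(
        [all_items, all_weeks], names=["itemID", "group_backwards"])
    filled = (weekly.set_index(["itemID", "group_backwards"])
                    .reindex(full_index, fill_value=0)
                    .reset_index())
    # Oldest week first (largest group_backwards), then by itemID
    return filled.sort_values(["group_backwards", "itemID"],
                              ascending=[False, True], ignore_index=True)

toy = pd.DataFrame({
    "itemID": [1, 1, 2],
    "group_backwards": [3, 1, 2],
    "orderSum": [5, 7, 4],
})
# Item 3 was never sold, so it only appears in the item list
filled = fill_missing_weeks(toy, all_items=[1, 2, 3], all_weeks=[1, 2, 3])
print(filled)
```

The static columns (`brand`, `manufacturer`, and so on) would still need to be merged back from `items`, which the loop above does by copying them from an example row.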

In [41]:
# This cell lags and diffs our features 'orderSum' and 'promotion_mean'

shifting = df.copy()

for i in range(1, NUMBER_OF_LAGS + 1):
    # Carrying the data of week t-i over to week t
    shifting[f'orderSum_{i}'] = shifting.groupby('itemID')['orderSum'].shift(i)
    shifting[f'promotion_mean_{i}'] = shifting.groupby('itemID')['promotion_mean'].shift(i)
    
    # Getting the difference of the orders and promotions between weeks t-i and t-i-1...
    shifting[f'orderSum_diff_{i}'] = shifting.groupby('itemID')[f'orderSum_{i}'].diff()
    shifting[f'promotion_mean_diff_{i}'] = shifting.groupby('itemID')[f'promotion_mean_{i}'].diff()
shifting.fillna(0, inplace=True)
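
To make the lag and diff columns concrete, here is a minimal sketch on a toy frame with a single item, ordered from the oldest week to the newest as in `shifting`. The values are made up for illustration.

```python
import pandas as pd

# Toy data: one item, ordered from the oldest week (4) to the newest (1)
toy = pd.DataFrame({
    "itemID": [7, 7, 7, 7],
    "group_backwards": [4, 3, 2, 1],
    "orderSum": [10, 12, 9, 15],
})

# Sales of the previous week, computed separately for each item
toy["orderSum_1"] = toy.groupby("itemID")["orderSum"].shift(1)
# Week-over-week change of that lagged value
toy["orderSum_diff_1"] = toy.groupby("itemID")["orderSum_1"].diff()
# Rows without enough history get 0, as in the cell above
toy = toy.fillna(0)
print(toy)
```

Note that after `fillna(0)` a week with no history looks the same as a week with zero sales.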

In [42]:
# cat_1_one_hot = pd.get_dummies(shifting['category1']).rename(columns={i : f'category1_{i}' for i in shifting['category1'].unique()})
# cat_2_one_hot = pd.get_dummies(shifting['category2']).rename(columns={i : f'category2_{i}' for i in shifting['category2'].unique()})
# cat_3_one_hot = pd.get_dummies(shifting['category3']).rename(columns={i : f'category3_{i}' for i in shifting['category3'].unique()})
# shifting = pd.concat([shifting, cat_1_one_hot, cat_2_one_hot, cat_3_one_hot], axis=1).drop(columns=['category1', 'category2', 'category3'])
shifting.drop(columns=['category1', 'category2', 'category3'], inplace=True)
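
If the one-hot encoding commented out above is ever re-enabled, the `prefix` argument of `pd.get_dummies` could replace the manual `rename`. The sketch below uses a made-up toy frame, not our dataset.

```python
import pandas as pd

toy = pd.DataFrame({"itemID": [1, 2, 3], "category1": [4, 4, 8]})

# One column per category value, named '<prefix>_<value>';
# the original 'category1' column is dropped automatically
encoded = pd.get_dummies(toy, columns=["category1"], prefix="category1")
print(encoded.columns.tolist())
```

The prefix keeps the columns of `category1`, `category2` and `category3` apart when they share the same integer codes.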

## Maximum error
The largest error we should accept from a model is the one we get by simply guessing the mean of our sales from weeks 1 to 12 for every item in week 13. Any useful model should score below this RMSE, and that's what the cell below is computing.

In [43]:
worst_possible_prediction = shifting.loc[shifting.group_backwards < 13]['orderSum'].mean()
prediction = np.full(shifting.loc[shifting.group_backwards == 13]['orderSum'].shape, worst_possible_prediction) # Array filled with the mean...
target = shifting.loc[shifting.group_backwards == 13]['orderSum']
print("Guessing the mean of 'orderSum' for all items in target", mse(target, prediction) ** 0.5)

Guessing the mean of 'orderSum' for all items in target 90.29706562119341
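
The same computation on a few made-up numbers, as a sketch of what this RMSE measures. The `rmse` helper is defined here only for illustration and is equivalent to `mse(...) ** 0.5` in the cell above.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

history = np.array([0, 2, 4, 6])  # toy sales from the weeks used for the mean
target = np.array([1, 3, 8])      # toy sales from the week being predicted
baseline = np.full(target.shape, history.mean())  # constant prediction
print(rmse(target, baseline))
```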


## Dataset Splitting
All my experiments use weeks 13 to 3 as the training set, week 2 as the validation set and week 1 as the test set.

In [44]:
train = shifting.loc[shifting.group_backwards >= 3]
val = shifting.loc[shifting.group_backwards == 2]
test = shifting.loc[shifting.group_backwards == 1]

In [45]:
# I recommend that the other members of the team keep our
# datasets as pandas DataFrames instead of NumPy arrays,
# since that makes the boosting frameworks easier to use
y_train = train['orderSum']
y_val = val['orderSum']
X_train = train.drop(columns=["orderSum"])
X_val = val.drop(columns=["orderSum"])

In [46]:
dtrain = xgb.DMatrix(X_train, y_train, missing=np.inf)
dval = xgb.DMatrix(X_val, y_val, missing=np.inf)

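# Depth-6 trees with a small learning rate (eta) and a squared-error objective
param = {'max_depth': 6, 'eta': 0.01, 'objective': 'reg:squarederror'}
num_round = 1000  # upper bound on the number of boosting rounds
# Stop once 'val-rmse' (the last entry in evals) fails to improve for 5 rounds
bst = xgb.train(param, dtrain,
                num_round, early_stopping_rounds=5,
                evals=[(dtrain, 'train'), (dval, 'val')])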

[0]	train-rmse:104.39217	val-rmse:110.94049
Multiple eval metrics have been passed: 'val-rmse' will be used for early stopping.

Will train until val-rmse hasn't improved in 5 rounds.
[1]	train-rmse:104.14262	val-rmse:110.59388
[2]	train-rmse:103.89733	val-rmse:110.25309
[3]	train-rmse:103.65630	val-rmse:109.91911
[4]	train-rmse:103.41954	val-rmse:109.59083
[5]	train-rmse:103.18694	val-rmse:109.26918
[6]	train-rmse:102.95802	val-rmse:108.95335
[7]	train-rmse:102.72651	val-rmse:108.63429
[8]	train-rmse:102.50555	val-rmse:108.33076
[9]	train-rmse:102.28893	val-rmse:108.03239
[10]	train-rmse:102.06860	val-rmse:107.73109
[11]	train-rmse:101.85931	val-rmse:107.44031
[12]	train-rmse:101.65256	val-rmse:107.15924
[13]	train-rmse:101.45213	val-rmse:106.91387
[14]	train-rmse:101.25304	val-rmse:106.64259
[15]	train-rmse:101.03923	val-rmse:106.36432
[16]	train-rmse:100.84583	val-rmse:106.10020
[17]	train-rmse:100.62122	val-rmse:105.80709
[18]	train-rmse:100.44150	val-rmse:105.59830
[19]	train-rmse

[185]	train-rmse:85.22122	val-rmse:88.71352
[186]	train-rmse:85.13678	val-rmse:88.66602
[187]	train-rmse:85.05695	val-rmse:88.61488
[188]	train-rmse:84.97874	val-rmse:88.56102
[189]	train-rmse:84.90112	val-rmse:88.51645
[190]	train-rmse:84.82471	val-rmse:88.47242
[191]	train-rmse:84.74535	val-rmse:88.43655
[192]	train-rmse:84.66621	val-rmse:88.39513
[193]	train-rmse:84.59262	val-rmse:88.36338
[194]	train-rmse:84.51842	val-rmse:88.33091
[195]	train-rmse:84.44527	val-rmse:88.30372
[196]	train-rmse:84.37719	val-rmse:88.25838
[197]	train-rmse:84.30309	val-rmse:88.22696
[198]	train-rmse:84.23191	val-rmse:88.19476
[199]	train-rmse:84.15820	val-rmse:88.17103
[200]	train-rmse:84.12161	val-rmse:88.17120
[201]	train-rmse:84.05265	val-rmse:88.13444
[202]	train-rmse:83.97991	val-rmse:88.10317
[203]	train-rmse:83.90916	val-rmse:88.08021
[204]	train-rmse:83.84542	val-rmse:88.04211
[205]	train-rmse:83.77728	val-rmse:88.02424
[206]	train-rmse:83.70693	val-rmse:87.96723
[207]	train-rmse:83.63897	val-rm
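
The log shows the stopping rule at work: round 200 does not improve on round 199 (88.17120 vs. 88.17103), but round 201 does, so the counter resets. Below is a simplified sketch of that rule for a lower-is-better metric. It is not XGBoost's actual implementation, and `early_stopping_round` and the scores are made up for illustration.

```python
def early_stopping_round(val_scores, patience):
    """Return (last_round_run, best_round) for a lower-is-better metric.

    Training stops once `patience` consecutive rounds fail to improve
    on the best score seen so far.
    """
    best_score = float("inf")
    best_round = -1
    for rnd, score in enumerate(val_scores):
        if score < best_score:
            # New best: remember it and restart the patience window
            best_score, best_round = score, rnd
        elif rnd - best_round >= patience:
            return rnd, best_round
    # Never triggered: all rounds were run
    return len(val_scores) - 1, best_round

scores = [110.9, 110.6, 110.3, 110.4, 110.5, 110.35, 110.6, 110.7, 110.2]
stopped_at, best = early_stopping_round(scores, patience=5)
print(stopped_at, best)
```

In this toy run the final score (110.2) is never reached, because training has already stopped five rounds after the best one.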

### Utilities

**Predicting at test time**

In [12]:
y_test = test['orderSum']
X_test = xgb.DMatrix(test.drop(columns=["orderSum"]))
final_predictions = bst.predict(X_test)

In [15]:
final_predictions[final_predictions < 0].shape

(2025,)
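
Negative demand cannot occur, so these 2025 predictions are worth handling before submission. One possible treatment, shown here as a sketch on made-up values and not what this notebook does, is to clip at zero before casting to integers.

```python
import numpy as np

raw = np.array([-3.7, -0.4, 0.6, 2.9])  # toy predictions

# astype(int) truncates toward zero, so values <= -1 stay negative
truncated = raw.astype(int)
# Clipping first floors every negative prediction at 0
clipped = np.clip(raw, 0, None).astype(int)
print(truncated, clipped)
```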

**Creating our Kaggle CSV**

In [47]:
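# Start from a prediction of 0 for every itemID
# (assumes the IDs run from 1 to len(items))
final = pd.Series(0, index=np.arange(1, len(items)+1))
# Assumes the test rows are in the same order as items.itemID;
# astype(int) truncates the predictions toward zero
final[items.itemID] = final_predictions.astype(int)

final.to_csv("xgb_kaggle_df.csv", header=["demandPrediction"],
            index_label="itemID", sep="|")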
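
The assignment above relies on the predictions being in the same order as `items.itemID`. A more defensive variant, sketched here with made-up IDs and an in-memory buffer instead of a file, assigns each prediction by the itemID of its own row.

```python
import io
import numpy as np
import pandas as pd

item_ids = np.array([1, 2, 3, 4])  # toy: every item expected in the submission
pred_ids = np.array([4, 2, 1])     # toy: itemID of each predicted row, in row order
preds = np.array([9, 0, 5])

submission = pd.Series(0, index=item_ids)
# Label-based assignment keeps each prediction paired with its own itemID
submission.loc[pred_ids] = preds

buffer = io.StringIO()
submission.to_csv(buffer, header=["demandPrediction"],
                  index_label="itemID", sep="|")
print(buffer.getvalue())
```

In this notebook the row IDs would come from the `itemID` column of `test`.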

**Saving our model to disk**

In [50]:
now = datetime.now().strftime("%d-%m-%Y-%Hh%Mm%Ss")
modelName = 'xgb-' + now
bst.save_model(modelName)