
# CatBoost Baseline with price

Original notebook by Dora; I (Bruno) added the price weighting.


In [1]:
import sys
sys.path.append("../dora/models")  # For using Dora's utils files

In [2]:
import numpy as np
import pandas as pd
from utils import read_data, process_time, merge_data, promo_detector, promotionAggregation
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as mse
import sys
import xgboost as xgb
import lightgbm as lgb
from datetime import datetime
from catboost import CatBoostRegressor, Pool, cv

NUMBER_OF_LAGS = 4

# sys.path.append("../../main/datasets/")
!ls  ../main/datasets/

1.0v  1.0v.zip


## Preparing our dataset
These steps are already covered in the `../pre-processing-features` notebooks.

In [3]:
infos, items, orders = read_data("../main/datasets/")
print("Sanity checks...", infos.shape, items.shape, orders.shape)

Sanity checks... (10463, 3) (10463, 8) (2181955, 5)


In [4]:
# Changing our time signatures, 
# adding our promotion feature 
# and aggregating our data by weeks...
process_time(orders)
orders = promo_detector(orders)
df = promotionAggregation(orders, items)

In [5]:
def prepareOrders(orders, items):
    """This function is responsible for adding in our 'orders' dataframe
    the items that were not sold. THIS IS NOT MODULARIZED, THUS YOU
    SHOULD CHANGE THE CODE TO BETTER SUIT YOUR DATASET FEATURES
    """
    
    df = orders.copy()
    
    # Getting the IDs that were never sold
    not_sold_items = items[np.logical_not(
        items.itemID.isin(sorted(orders['itemID'].unique())))]

    new_rows = []
    weeks_database = orders['group_backwards'].unique()

    for idd in df['itemID'].unique():
        orders_id = df[df.itemID == idd]
        example = orders_id.iloc[0]

        # finding weeks without itemID sales
        weeks_id = orders_id['group_backwards'].unique()
        weeks_without_id = np.setdiff1d(weeks_database, weeks_id)

        # creating new row
        for w in weeks_without_id:
            new_rows.append({'itemID': idd,
                             'group_backwards': w,
                             'salesPrice_mean': 0,
                             'customerRating': example['customerRating'],
                             'category1': example['category1'],
                             'category2': example['category2'],
                             'category3': example['category3'],
                             'recommendedRetailPrice': example['recommendedRetailPrice'],
                             'orderSum': 0,
                             'manufacturer': example['manufacturer'],
                             'brand': example['brand'],
                             'promotion_mean': 0
                             })
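    # Adding the zero-sales rows built above
    # (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead).
    df = pd.concat([df, pd.DataFrame(new_rows)], ignore_index=True)
    # Adding rows in every week with the IDs of the
    # items that were never sold.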
    not_sold_orders = pd.DataFrame()
    for i in range(1, 14):
        aux = not_sold_items.copy()
        aux['group_backwards'] = i
        aux['salesPrice_mean'] = 0
        aux['promotion_mean'] = 0
        aux['orderSum'] = 0
        not_sold_orders = pd.concat([not_sold_orders, aux], axis=0)
    df = pd.concat([df, not_sold_orders], axis=0).sort_values(
        ['group_backwards', 'itemID'], ascending=[False, True], ignore_index=True)
    return df

In [6]:
df = prepareOrders(df, items)

In [7]:
# This cell lags and diffs our features 'orderSum' and "promotion"

shifting = df.copy()

for i in range(1, NUMBER_OF_LAGS + 1):
    # Carrying the data of week t-i
    shifting[f'orderSum_{i}'] = shifting.groupby('itemID')['orderSum'].shift(i)
    shifting[f'promotion_mean_{i}'] = shifting.groupby('itemID')['promotion_mean'].shift(i)
    
    # Getting the difference of the orders and promotions between weeks t-i and t-(i+1)...
    shifting[f'orderSum_diff_{i}'] = shifting.groupby('itemID')[f'orderSum_{i}'].diff()
    shifting[f'promotion_mean_diff_{i}'] = shifting.groupby('itemID')[f'promotion_mean_{i}'].diff()
shifting.fillna(0, inplace=True)
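To make the lag and diff columns concrete, here is a minimal sketch on a toy frame. The data is hypothetical and only a single lag is built; it assumes, as in the notebook, that rows are sorted from the oldest week to the newest.

```python
import pandas as pd

# Hypothetical toy data: two items, three weeks each, sorted from the
# oldest week (largest group_backwards) to the newest.
toy = pd.DataFrame({
    "itemID":          [1, 2, 1, 2, 1, 2],
    "group_backwards": [3, 3, 2, 2, 1, 1],
    "orderSum":        [5, 0, 7, 2, 4, 9],
})

# Lag 1: each row receives the orderSum of the same item one week earlier.
toy["orderSum_1"] = toy.groupby("itemID")["orderSum"].shift(1)

# Diff of the lag: change between one week earlier and two weeks earlier.
toy["orderSum_diff_1"] = toy.groupby("itemID")["orderSum_1"].diff()

# Rows without enough history get 0, as in the cell above.
toy = toy.fillna(0)

print(toy)
```

Note that `fillna(0)` makes a missing history indistinguishable from a week with zero sales, which may or may not matter for the model.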

## Maximum error
The largest error we should accept on this dataset is that of a naive baseline: predicting, for every item in week 13, the mean of `orderSum` over weeks 1 to 12. Any useful model should beat the RMSE of this guess, and that's what the cell below computes.

In [8]:
worst_possible_prediction = shifting.loc[shifting.group_backwards < 13]['orderSum'].mean()
prediction = np.full(shifting.loc[shifting.group_backwards == 13]['orderSum'].shape, worst_possible_prediction) # Array filled with the mean...
target = shifting.loc[shifting.group_backwards == 13]['orderSum']
print("Guessing the mean of 'orderSum' for all items in target", mse(target, prediction) ** 0.5)

Guessing the mean of 'orderSum' for all items in target 90.29706562119341


## Dataset Splitting
All my experiments use weeks 13 to 3 as the training set, week 2 as the validation set, and week 1 as the test set.

In [9]:
# CatBoost requires categorical features to be strings or integers,
# so every column after the first three is cast to str
stringColumns = shifting.columns[3:]
shifting[stringColumns] = shifting[stringColumns].astype(str)

In [10]:
# Split by week: 13 to 3 for training, 2 for validation, 1 for testing
train = shifting.loc[shifting.group_backwards >= 3]
val = shifting.loc[shifting.group_backwards == 2]
test = shifting.loc[shifting.group_backwards == 1]

---
<br>
<br>

# BRUNO'S CHANGES

Below is my added code:
- Change y_train to be price\*orderSum
- Pass that to the training itself
- For saving the prediction, divide by the price and round up
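The round trip described in the list above can be sketched on toy numbers. Everything here is hypothetical: the three items, their prices, and the `predicted_revenue` array standing in for a model's output.

```python
import numpy as np
import pandas as pd

# Hypothetical values, for illustration only.
toy = pd.DataFrame({
    "itemID":          [1, 2, 3],
    "orderSum":        [4, 0, 10],
    "simulationPrice": [2.5, 100.0, 0.5],
})

# Training target: revenue (price * units) instead of the unit count.
y_revenue = toy["orderSum"] * toy["simulationPrice"]

# Stand-in for model output: revenue predictions with some error.
predicted_revenue = np.array([9.0, -3.0, 5.2])

# Back to units: divide by the price, round up, and clip negatives to zero.
predicted_units = np.ceil(predicted_revenue / toy["simulationPrice"])
predicted_units = predicted_units.clip(lower=0).astype(int)

print(y_revenue.tolist())        # [10.0, 0.0, 5.0]
print(predicted_units.tolist())  # [4, 0, 11]
```

With an RMSE loss on revenue, a one-unit miss on an expensive item costs more than a one-unit miss on a cheap one, which is presumably the point of weighting by price.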

In [11]:
# To keep Dora's original code untouched, I add the price
#   by re-merging `infos` into the dataset
def add_weights(data):
    return pd.merge(data, infos[["itemID", "simulationPrice"]],
                    on="itemID", validate="m:1")

In [12]:
# I recommend that the other team members keep our datasets
# as pandas DataFrames instead of NumPy arrays,
# since that makes the boosting frameworks easier to use

train = add_weights(train)
val = add_weights(val)
test = add_weights(test)

y_train = train['orderSum']*train["simulationPrice"]
y_val = val['orderSum']*val["simulationPrice"]

X_train = train.drop(columns=["orderSum", "simulationPrice"])
X_val = val.drop(columns=["orderSum", "simulationPrice"])

// Bruno's changes

In [13]:
# initialize Pool
train_pool = Pool(
    X_train,
    y_train,
    cat_features=[8, 9, 10],
)

val_pool = Pool(
    X_val,
    y_val,
    cat_features=[8, 9, 10],
)
# test_pool = Pool(test_data.astype(str),
#                  cat_features=[8, 9, 10])

In [14]:
# specify the training parameters 
model = CatBoostRegressor(depth=6, 
                          learning_rate=0.1, 
                          loss_function='RMSE',
                          early_stopping_rounds=10)

model.fit(
    train_pool,
    eval_set=val_pool,
    logging_level='Verbose',  # prints per-iteration text output; comment this out to silence it
);

0:	learn: 3662.2517954	test: 3924.7121580	best: 3924.7121580 (0)	total: 89.4ms	remaining: 1m 29s
1:	learn: 3589.3676088	test: 3849.0611952	best: 3849.0611952 (1)	total: 133ms	remaining: 1m 6s
2:	learn: 3535.8020572	test: 3785.0352140	best: 3785.0352140 (2)	total: 166ms	remaining: 55.2s
3:	learn: 3491.2920082	test: 3747.3182486	best: 3747.3182486 (3)	total: 217ms	remaining: 53.9s
4:	learn: 3448.7563551	test: 3705.7734967	best: 3705.7734967 (4)	total: 251ms	remaining: 50s
5:	learn: 3414.4457897	test: 3676.3036774	best: 3676.3036774 (5)	total: 293ms	remaining: 48.5s
6:	learn: 3378.2463252	test: 3639.5573027	best: 3639.5573027 (6)	total: 351ms	remaining: 49.8s
7:	learn: 3352.6531462	test: 3621.3546012	best: 3621.3546012 (7)	total: 394ms	remaining: 48.8s
8:	learn: 3329.5756935	test: 3601.5816215	best: 3601.5816215 (8)	total: 431ms	remaining: 47.4s
9:	learn: 3310.0946176	test: 3591.0962627	best: 3591.0962627 (9)	total: 475ms	remaining: 47s
10:	learn: 3293.6683661	test: 3579.9222047	best: 357

# retrain with best results
(more Bruno changes)

In [15]:
full_train = shifting.loc[shifting.group_backwards >= 2]

full_train = add_weights(full_train)
full_y_train = full_train['orderSum']*full_train["simulationPrice"]
full_X_train = full_train.drop(columns=["orderSum", "simulationPrice"])

# initialize Pool
full_train_pool = Pool(full_X_train, 
                      full_y_train, 
                      cat_features=[8,9,10],
)

In [16]:
# specify the training parameters 
bst = CatBoostRegressor(
    depth=6, 
    learning_rate=0.1, 
    loss_function='RMSE',
    iterations=model.best_iteration_ + 1,  # best_iteration_ is a zero-based index
)

bst.fit(
    full_train_pool,
    #  logging_level='Verbose',  # you can uncomment this for text output
);

0:	learn: 3694.8035353	total: 28.5ms	remaining: 1.63s
1:	learn: 3625.9617382	total: 48.7ms	remaining: 1.36s
2:	learn: 3573.2949381	total: 71.5ms	remaining: 1.31s
3:	learn: 3519.1360698	total: 92.3ms	remaining: 1.25s
4:	learn: 3478.3399695	total: 114ms	remaining: 1.21s
5:	learn: 3435.7782636	total: 135ms	remaining: 1.17s
6:	learn: 3400.4612965	total: 154ms	remaining: 1.12s
7:	learn: 3374.2350480	total: 177ms	remaining: 1.1s
8:	learn: 3349.3459353	total: 197ms	remaining: 1.07s
9:	learn: 3333.0950520	total: 217ms	remaining: 1.04s
10:	learn: 3317.1712520	total: 238ms	remaining: 1.02s
11:	learn: 3299.2707795	total: 258ms	remaining: 991ms
12:	learn: 3291.6941872	total: 278ms	remaining: 962ms
13:	learn: 3276.1928162	total: 299ms	remaining: 940ms
14:	learn: 3267.4148188	total: 320ms	remaining: 918ms
15:	learn: 3258.0038131	total: 338ms	remaining: 887ms
16:	learn: 3240.2667213	total: 355ms	remaining: 856ms
17:	learn: 3230.0688984	total: 373ms	remaining: 828ms
18:	learn: 3214.6914945	total: 389m

### Utilities

**Predicting at test time**

In [17]:
# y_test = test['orderSum']
X_test = test.drop(columns=["orderSum", "simulationPrice"])
final_predictions = np.ceil(bst.predict(X_test) / test["simulationPrice"])

In [18]:
final_predictions[final_predictions <= 0].shape

(2859,)

In [19]:
# np.ceil turns small negative predictions into "-0";
# set those, and any other negative prediction, to 0
final_predictions.loc[final_predictions <= 0] = 0
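The "-0" values come from how `np.ceil` handles small negative numbers. A minimal sketch with hypothetical raw predictions shows the effect, and also that the same mask zeroes genuinely negative predictions:

```python
import numpy as np

# Hypothetical raw predictions, already divided by the price.
raw = np.array([-0.3, -2.4, 0.2, 3.0])

rounded = np.ceil(raw)
print(rounded)              # the first entry is negative zero
print(np.signbit(rounded))  # True where the sign bit is set

# Same mask as in the cell above: everything at or below zero becomes 0.
cleaned = rounded.copy()
cleaned[cleaned <= 0] = 0
print(cleaned)
```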

**Creating our Kaggle CSV**

In [20]:
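final = pd.Series(0, index=np.arange(1, len(items)+1))
# Key by the itemIDs of `test`, the frame the predictions were computed from,
# and assign raw values so nothing depends on index alignment
final[test["itemID"].to_numpy()] = final_predictions.astype(int).to_numpy()

final.to_csv("cat_with_price_kaggle_df.csv", header=["demandPrediction"],
             index_label="itemID", sep="|")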