<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#CatBoost-Baseline-with-price" data-toc-modified-id="CatBoost-Baseline-with-price-1">CatBoost Baseline with price</a></span><ul class="toc-item"><li><span><a href="#Preparing-our-dataset" data-toc-modified-id="Preparing-our-dataset-1.1">Preparing our dataset</a></span></li><li><span><a href="#Maximum-error" data-toc-modified-id="Maximum-error-1.2">Maximum error</a></span></li><li><span><a href="#Dataset-Splitting" data-toc-modified-id="Dataset-Splitting-1.3">Dataset Splitting</a></span></li></ul></li><li><span><a href="#BRUNO'S-CHANGES" data-toc-modified-id="BRUNO'S-CHANGES-2">BRUNO'S CHANGES</a></span></li><li><span><a href="#retrain-with-best-results" data-toc-modified-id="retrain-with-best-results-3">retrain with best results</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Utilities" data-toc-modified-id="Utilities-3.0.1">Utilities</a></span></li></ul></li></ul></li></ul></div>

# CatBoost Baseline with price

Original notebook by Dora. I (Bruno) added the price-based instance weights.


In [1]:
import sys
sys.path.append("../dora/models")  # For using Dora's utils files

In [2]:
import numpy as np
import pandas as pd
from utils import read_data, process_time, merge_data, promo_detector, promotionAggregation
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as mse
import xgboost as xgb
import lightgbm as lgb
from datetime import datetime
from catboost import CatBoostRegressor, Pool, cv

NUMBER_OF_LAGS = 4

# sys.path.append("../../main/datasets/")
!ls  ../main/datasets/

1.0v  1.0v.zip


## Preparing our dataset
These steps were already covered in the `../pre-processing-features` notebooks.

In [3]:
infos, items, orders = read_data("../main/datasets/")
print("Sanity checks...", infos.shape, items.shape, orders.shape)

Sanity checks... (10463, 3) (10463, 8) (2181955, 5)


In [4]:
# Changing our time signatures, 
# adding our promotion feature 
# and aggregating our data by weeks...
process_time(orders)
orders = promo_detector(orders)
df = promotionAggregation(orders, items)

In [5]:
def prepareOrders(orders, items):
    """This function is responsible for adding in our 'orders' dataframe
    the items that were not sold. THIS IS NOT MODULARIZED, THUS YOU
    SHOULD CHANGE THE CODE TO BETTER SUIT YOUR DATASET FEATURES
    """
    
    df = orders.copy()
    
    # Getting the IDs that were never sold
    not_sold_items = items[np.logical_not(
        items.itemID.isin(sorted(orders['itemID'].unique())))]

    new_rows = []
    weeks_database = orders['group_backwards'].unique()

    for idd in df['itemID'].unique():
        orders_id = df[df.itemID == idd]
        example = orders_id.iloc[0]

        # finding weeks without itemID sales
        weeks_id = orders_id['group_backwards'].unique()
        weeks_without_id = np.setdiff1d(weeks_database, weeks_id)

        # creating new row
        for w in weeks_without_id:
            new_rows.append({'itemID': idd,
                             'group_backwards': w,
                             'salesPrice_mean': 0,
                             'customerRating': example['customerRating'],
                             'category1': example['category1'],
                             'category2': example['category2'],
                             'category3': example['category3'],
                             'recommendedRetailPrice': example['recommendedRetailPrice'],
                             'orderSum': 0,
                             'manufacturer': example['manufacturer'],
                             'brand': example['brand'],
                             'promotion_mean': 0
                             })
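    # Appending the zero-sales rows created above
    # (DataFrame.append was removed in pandas 2.0, so we use pd.concat).
    df = pd.concat([df, pd.DataFrame(new_rows)])
    # Adding rows in every week with the IDs of the
    # items that were never sold.
    not_sold_orders = pd.DataFrame()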
    for i in range(1, 14):
        aux = not_sold_items.copy()
        aux['group_backwards'] = i
        aux['salesPrice_mean'] = 0
        aux['promotion_mean'] = 0
        aux['orderSum'] = 0
        not_sold_orders = pd.concat([not_sold_orders, aux], axis=0)
    df = pd.concat([df, not_sold_orders], axis=0).sort_values(
        ['group_backwards', 'itemID'], ascending=[False, True], ignore_index=True)
    return df

In [6]:
df = prepareOrders(df, items)
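
The idea behind `prepareOrders` (every item gets a row in every week, with zero sales where nothing was sold) can also be sketched without Python loops. This is only an illustration on toy data, not a drop-in replacement: the toy frame, `all_items` and `all_weeks` are made up, and the real function also carries over the item attributes.

```python
import pandas as pd

# Toy weekly sales: item 1 sold in weeks 1 and 2, item 2 only in week 2.
sales = pd.DataFrame({
    "itemID": [1, 1, 2],
    "group_backwards": [1, 2, 2],
    "orderSum": [5, 3, 7],
})
all_items = [1, 2, 3]  # hypothetical catalogue; item 3 was never sold
all_weeks = [1, 2]

# Build every (item, week) pair, then left-join the observed sales onto it.
grid = pd.MultiIndex.from_product(
    [all_items, all_weeks], names=["itemID", "group_backwards"]
).to_frame(index=False)
filled = grid.merge(sales, on=["itemID", "group_backwards"], how="left")

# Pairs without a sale get orderSum = 0, which is what prepareOrders does.
filled["orderSum"] = filled["orderSum"].fillna(0).astype(int)
print(filled)
```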

In [7]:
# This cell lags and diffs our features 'orderSum' and "promotion"

shifting = df.copy()

for i in range(1, NUMBER_OF_LAGS + 1):
    # Carrying the data of week t-i
    shifting[f'orderSum_{i}'] = shifting.groupby('itemID')['orderSum'].shift(i)
    shifting[f'promotion_mean_{i}'] = shifting.groupby('itemID')['promotion_mean'].shift(i)
    
    # Getting the difference of the lagged orders and promotions between consecutive weeks...
    shifting[f'orderSum_diff_{i}'] = shifting.groupby('itemID')[f'orderSum_{i}'].diff()
    shifting[f'promotion_mean_diff_{i}'] = shifting.groupby('itemID')[f'promotion_mean_{i}'].diff()
shifting.fillna(0, inplace=True)
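
To see what the lag and diff columns look like, here is a small sketch on made-up data for a single lag. It assumes, as in the cell above, that rows are ordered from the oldest week (largest `group_backwards`) to the most recent one within each item.

```python
import pandas as pd

toy = pd.DataFrame({
    "itemID": [1, 1, 1, 2, 2, 2],
    "group_backwards": [3, 2, 1, 3, 2, 1],  # 3 = oldest week, 1 = most recent
    "orderSum": [10, 12, 9, 0, 4, 4],
})

# Lag 1: each row receives the sales of the same item in the previous week.
toy["orderSum_1"] = toy.groupby("itemID")["orderSum"].shift(1)

# Diff of the lagged column: change between the two previous weeks.
toy["orderSum_diff_1"] = toy.groupby("itemID")["orderSum_1"].diff()

# The first weeks of each item have no history, so they become 0.
toy = toy.fillna(0)
print(toy)
```

Because both operations go through `groupby('itemID')`, values never leak from one item into the next.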

## Maximum error
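The "maximum" error we should accept on this dataset is the one we get by just guessing the mean of our sales from weeks 1 to 12 for every item. It is a naive baseline rather than a strict upper bound, and any useful model should score below it. That's what the cell below is computing, using week 13 as the target.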

In [8]:
worst_possible_prediction = shifting.loc[shifting.group_backwards < 13]['orderSum'].mean()
prediction = np.full(shifting.loc[shifting.group_backwards == 13]['orderSum'].shape, worst_possible_prediction) # Array filled with the mean...
target = shifting.loc[shifting.group_backwards == 13]['orderSum']
print("Guessing the mean of 'orderSum' for all items in target", mse(target, prediction) ** 0.5)

Guessing the mean of 'orderSum' for all items in target 90.29706562119341
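
The same baseline on a tiny made-up example, to make the arithmetic explicit (the numbers are illustrative only):

```python
import numpy as np

history = np.array([0.0, 2.0, 4.0, 6.0])  # sales seen in the earlier weeks
target = np.array([1.0, 5.0])             # sales in the held-out week

# Predict the historical mean for every item in the target week.
baseline = np.full(target.shape, history.mean())

# RMSE = square root of the mean squared error.
rmse = np.sqrt(np.mean((target - baseline) ** 2))
print(rmse)
```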


## Dataset Splitting
All my experiments will use weeks 13 to 3 as a train set, week 2 as our validation set and week 1 as a test set.

In [9]:
# Datatype conversion for CatBoost: cast every column from the fourth onward to str
stringColumns = shifting.columns[3:]
shifting[stringColumns] = shifting[stringColumns].astype(str)

In [10]:
# Splitting by week: train (weeks 13 to 3), validation (week 2) and test (week 1)
train = shifting.loc[shifting.group_backwards >= 3]
val = shifting.loc[shifting.group_backwards == 2]
test = shifting.loc[shifting.group_backwards == 1]

In [11]:
# I recommend that the other members of the team keep the
# datatypes of our datasets as Pandas DataFrames instead of NumPy arrays,
# since it will be easier to use boosting frameworks that way
y_train = train['orderSum']
y_val = val['orderSum']
X_train = train.drop(columns=["orderSum"])
X_val = val.drop(columns=["orderSum"])

---
<br>
<br>

# BRUNO'S CHANGES

Below is my added code:
- Create a "weight" vector for CatBoost so it weights each instance according to its price
- Pass that to the training itself
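
To get a feeling for what the weights do to the metric, here is a minimal sketch of a weighted RMSE. It assumes the usual form sqrt(sum(w * e^2) / sum(w)), which is how I understand CatBoost's weighted `RMSE`; `weighted_rmse` is a helper written for this illustration, not a CatBoost function.

```python
import numpy as np

def weighted_rmse(y_true, y_pred, weights):
    """Hypothetical helper: RMSE where each squared error counts `weights` times."""
    y_true, y_pred, weights = map(np.asarray, (y_true, y_pred, weights))
    squared_errors = (y_true - y_pred) ** 2
    return np.sqrt(np.sum(weights * squared_errors) / np.sum(weights))

y_true = [10.0, 0.0]
y_pred = [8.0, 4.0]
prices = [1.0, 3.0]  # the second item is three times as expensive

plain = weighted_rmse(y_true, y_pred, [1.0, 1.0])  # every item counts the same
weighted = weighted_rmse(y_true, y_pred, prices)   # expensive items count more
print(plain, weighted)
```

The error on the expensive item dominates the weighted score, so the model is pushed to get high-priced items right first.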

In [12]:
# Since I want to keep Dora's original code untouched, I add the price
#   by re-merging it from `infos` into the dataset
def recreate_weights(data):
    weights = pd.merge(data["itemID"], infos[["itemID", "simulationPrice"]], 
                       on="itemID", validate="m:1")
    return weights["simulationPrice"]
train_weights = recreate_weights(train)
val_weights = recreate_weights(val)
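
One thing to watch with this merge: the weights must come out in the same order, and with the same length, as the rows they belong to. A small sketch on made-up data, using a left join plus a length check as an extra safety net (the original cell relies on the default inner join, which would silently drop rows whose `itemID` is missing from `infos`):

```python
import pandas as pd

rows = pd.DataFrame({"itemID": [3, 1, 3, 2]})  # toy dataset rows
prices = pd.DataFrame({"itemID": [1, 2, 3],
                       "simulationPrice": [9.5, 20.0, 4.0]})

# how="left" keeps one output row per input row, in the input order;
# validate="m:1" raises if an itemID appears more than once in `prices`.
merged = pd.merge(rows[["itemID"]], prices,
                  on="itemID", how="left", validate="m:1")
weights = merged["simulationPrice"]

assert len(weights) == len(rows)  # nothing dropped or duplicated
print(weights.tolist())
```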

// Bruno's changes

In [13]:
# initialize Pool
train_pool = Pool(X_train, 
                  y_train, 
                  cat_features=[8,9,10],
                  weight=train_weights)

val_pool = Pool(X_val, 
                  y_val, 
                  cat_features=[8,9,10],
                  weight=val_weights)
# test_pool = Pool(test_data.astype(str), 
#                  cat_features=[8,9,10]) 
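
The positional indices `[8,9,10]` only stay correct while the column order does not change. A hedged sketch of deriving the indices from column names instead; the toy frame and the choice of `brand` and `manufacturer` are assumptions for illustration, not necessarily the columns at positions 8 to 10 in this notebook:

```python
import pandas as pd

X = pd.DataFrame({
    "itemID": [1, 2],
    "group_backwards": [3, 3],
    "brand": ["a", "b"],
    "manufacturer": ["m1", "m2"],
})

# Hypothetical list of categorical columns, given by name.
cat_names = ["brand", "manufacturer"]

# Look up their positions, so a reordered frame still yields the right indices.
cat_idx = [X.columns.get_loc(name) for name in cat_names]
print(cat_idx)
```

As far as I know, `Pool` also accepts the names directly in `cat_features` when the data is a DataFrame, which would make the lookup unnecessary.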

In [14]:
# specify the training parameters 
model = CatBoostRegressor(depth=6, 
                          learning_rate=0.1, 
                          loss_function='RMSE',
                          early_stopping_rounds=5)

model.fit(
    train_pool,
    eval_set=val_pool,
    logging_level='Verbose',  # prints one line per iteration
);

0:	learn: 88.4535170	test: 46.2494859	best: 46.2494859 (0)	total: 95.3ms	remaining: 1m 35s
1:	learn: 87.5974460	test: 45.7490051	best: 45.7490051 (1)	total: 150ms	remaining: 1m 15s
2:	learn: 86.5484200	test: 45.1574819	best: 45.1574819 (2)	total: 195ms	remaining: 1m 4s
3:	learn: 85.7688129	test: 44.9588048	best: 44.9588048 (3)	total: 248ms	remaining: 1m 1s
4:	learn: 85.3971066	test: 44.8754718	best: 44.8754718 (4)	total: 299ms	remaining: 59.6s
5:	learn: 84.5121910	test: 44.7072428	best: 44.7072428 (5)	total: 355ms	remaining: 58.9s
6:	learn: 84.1558061	test: 44.7293003	best: 44.7072428 (5)	total: 400ms	remaining: 56.7s
7:	learn: 83.8177497	test: 44.8495779	best: 44.7072428 (5)	total: 452ms	remaining: 56.1s
8:	learn: 83.2108046	test: 44.5491347	best: 44.5491347 (8)	total: 497ms	remaining: 54.7s
9:	learn: 82.5838846	test: 44.5633689	best: 44.5491347 (8)	total: 540ms	remaining: 53.4s
10:	learn: 82.4951823	test: 44.3769895	best: 44.3769895 (10)	total: 575ms	remaining: 51.7s
11:	learn: 82.43

# retrain with best results
(more Bruno changes)

In [15]:
full_train = shifting.loc[shifting.group_backwards >= 2]
full_train_weights = recreate_weights(full_train)

full_y_train = full_train['orderSum']
full_X_train = full_train.drop(columns=["orderSum"])

# initialize Pool
full_train_pool = Pool(full_X_train, 
                      full_y_train, 
                      cat_features=[8,9,10],
                      weight=full_train_weights)

In [16]:
# specify the training parameters 
bst = CatBoostRegressor(
    depth=6, 
    learning_rate=0.1, 
    loss_function='RMSE',
    # best_iteration_ is a zero-based index, so add 1 to get the number of trees
    iterations=model.best_iteration_ + 1,
)

bst.fit(
    full_train_pool,
    #  logging_level='Verbose',  # you can uncomment this for text output
);

0:	learn: 118.9781037	total: 30.5ms	remaining: 1.77s
1:	learn: 118.2637877	total: 60.4ms	remaining: 1.72s
2:	learn: 116.8795915	total: 89.7ms	remaining: 1.67s
3:	learn: 116.2336209	total: 113ms	remaining: 1.55s
4:	learn: 115.7210617	total: 146ms	remaining: 1.57s
5:	learn: 115.5718148	total: 181ms	remaining: 1.59s
6:	learn: 114.8061068	total: 214ms	remaining: 1.59s
7:	learn: 114.7202624	total: 250ms	remaining: 1.59s
8:	learn: 114.4158380	total: 271ms	remaining: 1.51s
9:	learn: 114.2614520	total: 293ms	remaining: 1.43s
10:	learn: 113.7582839	total: 320ms	remaining: 1.4s
11:	learn: 113.7582836	total: 329ms	remaining: 1.29s
12:	learn: 113.4377759	total: 352ms	remaining: 1.24s
13:	learn: 113.3189882	total: 374ms	remaining: 1.2s
14:	learn: 112.9825687	total: 401ms	remaining: 1.18s
15:	learn: 112.5769610	total: 431ms	remaining: 1.16s
16:	learn: 112.5013714	total: 455ms	remaining: 1.12s
17:	learn: 110.8347226	total: 481ms	remaining: 1.09s
18:	learn: 110.7557037	total: 504ms	remaining: 1.06s
19

### Utilities

**Predicting at test time**

In [17]:
y_test = test['orderSum']
X_test = test.drop(columns=["orderSum"])
final_predictions = bst.predict(X_test)

In [18]:
final_predictions[final_predictions < 0].shape

(1761,)
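
Demand cannot be negative, so these predictions deserve some care before submission. A minimal sketch of one option, clipping at zero before the integer cast (the values are made up; whether clipping is the right policy for the competition metric is an assumption):

```python
import numpy as np

raw = np.array([-3.7, -0.4, 0.6, 12.9])

# astype(int) truncates toward zero, so clearly negative values stay negative.
truncated = raw.astype(int)

# Clipping first guarantees a non-negative integer demand.
clipped = np.clip(raw, 0, None).astype(int)
print(truncated, clipped)
```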

**Creating our Kaggle CSV**

In [19]:
final = pd.Series(0, index=np.arange(1, len(items)+1))
final[items.itemID] = final_predictions.astype(int)

final.to_csv("cat_with_weights_kaggle_df.csv", header=["demandPrediction"],
            index_label="itemID", sep="|")
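
The cell above assumes that the rows of `X_test` are in the same order as `items.itemID`. A sketch of a more defensive variant that assigns each prediction through the item ID of its own test row; the IDs and predictions are toy values, and the in-memory buffer stands in for the real file:

```python
import io

import numpy as np
import pandas as pd

all_item_ids = np.array([1, 2, 3, 4])     # toy catalogue
test_item_ids = np.array([2, 4, 1])       # item ID of each test row, in row order
predictions = np.array([5.9, -1.2, 3.0])  # one prediction per test row

# Start from zero demand for every item, then fill in by label.
final = pd.Series(0, index=pd.Index(all_item_ids, name="itemID"))
final.loc[test_item_ids] = np.clip(predictions, 0, None).astype(int)

buffer = io.StringIO()
final.to_csv(buffer, header=["demandPrediction"], index_label="itemID", sep="|")
print(buffer.getvalue())
```

Items without a test row (item 3 here) keep the default of 0.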