# CatBoost Baseline
This notebook was created after the Promotion feature was added to the dataset. Its main goal is to produce predictions that we can submit to our private Kaggle leaderboard.

In [1]:
import numpy as np
import pandas as pd
from utils import read_data, process_time, merge_data, promo_detector, promotionAggregation
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as mse
import sys
import xgboost as xgb
import lightgbm as lgb
from datetime import datetime
from catboost import CatBoost, CatBoostRegressor, Pool, cv

NUMBER_OF_LAGS = 4

sys.path.append("../../main/datasets/")
!ls  ../../main/datasets/

1.0v.zip


## Preparing our dataset
These steps were already covered in the ```../pre-processing-features``` notebooks.

In [2]:
infos, items, orders = read_data("../../main/datasets/")
print("Sanity checks...", infos.shape, items.shape, orders.shape)

Sanity checks... (10463, 3) (10463, 8) (2181955, 5)


In [3]:
# Changing our time signatures, 
# adding our promotion feature 
# and aggregating our data by weeks...
process_time(orders)
orders = promo_detector(orders)
df = promotionAggregation(orders, items)

In [4]:
def prepareOrders(orders, items):
    """Add to the 'orders' dataframe a zero-sales row for every
    (item, week) pair without sales, including items that were never sold.
    THIS IS NOT MODULARIZED, SO YOU SHOULD CHANGE THE CODE
    TO BETTER SUIT YOUR DATASET FEATURES.
    """
    
    df = orders.copy()
    
    # Getting the IDs that were never sold
    not_sold_items = items[np.logical_not(
        items.itemID.isin(sorted(orders['itemID'].unique())))]

    new_rows = []
    weeks_database = orders['group_backwards'].unique()

    for idd in df['itemID'].unique():
        orders_id = df[df.itemID == idd]
        example = orders_id.iloc[0]

        # finding weeks without itemID sales
        weeks_id = orders_id['group_backwards'].unique()
        weeks_without_id = np.setdiff1d(weeks_database, weeks_id)

        # creating new row
        for w in weeks_without_id:
            new_rows.append({'itemID': idd,
                             'group_backwards': w,
                             'salesPrice_mean': 0,
                             'customerRating': example['customerRating'],
                             'category1': example['category1'],
                             'category2': example['category2'],
                             'category3': example['category3'],
                             'recommendedRetailPrice': example['recommendedRetailPrice'],
                             'orderSum': 0,
                             'manufacturer': example['manufacturer'],
                             'brand': example['brand'],
                             'promotion_mean': 0
                             })
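    # Append the zero-sales rows built above
    # (DataFrame.append was removed in pandas 2.0, so pd.concat is used).
    df = pd.concat([df, pd.DataFrame(new_rows)])

    # Add one row per week for the items that were never sold.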
    not_sold_orders = pd.DataFrame()
    for i in range(1, 14):
        aux = not_sold_items.copy()
        aux['group_backwards'] = i
        aux['salesPrice_mean'] = 0
        aux['promotion_mean'] = 0
        aux['orderSum'] = 0
        not_sold_orders = pd.concat([not_sold_orders, aux], axis=0)
    df = pd.concat([df, not_sold_orders], axis=0).sort_values(
        ['group_backwards', 'itemID'], ascending=[False, True], ignore_index=True)
    return df

In [5]:
df = prepareOrders(df, items)

In [6]:
# This cell lags and diffs our features 'orderSum' and 'promotion_mean'

shifting = df.copy()

for i in range(1, NUMBER_OF_LAGS + 1):
    # Carrying the data of week t-i into week t (per item)
    shifting[f'orderSum_{i}'] = shifting.groupby('itemID')['orderSum'].shift(i)
    shifting[f'promotion_mean_{i}'] = shifting.groupby('itemID')['promotion_mean'].shift(i)
    
    # Getting the difference of the lagged orders and promotions between weeks t-i and t-i-1...
    shifting[f'orderSum_diff_{i}'] = shifting.groupby('itemID')[f'orderSum_{i}'].diff()
    shifting[f'promotion_mean_diff_{i}'] = shifting.groupby('itemID')[f'promotion_mean_{i}'].diff()
shifting.fillna(0, inplace=True)
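
To make the lag and diff step concrete, here is a minimal sketch on a toy frame. The values are made up for illustration, and the frame is assumed to be already sorted chronologically within each item, as `prepareOrders` does for the real data.

```python
import pandas as pd

# Toy frame (hypothetical values): two items, three consecutive weeks each,
# already sorted by week within every item.
toy = pd.DataFrame({
    "itemID":   [1, 1, 1, 2, 2, 2],
    "orderSum": [5, 7, 4, 0, 3, 3],
})

# Lag 1: each row receives the previous row's value *within the same item*.
toy["orderSum_1"] = toy.groupby("itemID")["orderSum"].shift(1)

# Difference of the lagged column, again computed per item.
toy["orderSum_diff_1"] = toy.groupby("itemID")["orderSum_1"].diff()

# Rows without enough history come out as NaN and are zero-filled,
# mirroring the fillna(0) call in the cell above.
toy = toy.fillna(0)
print(toy)
```

Note that after zero-filling, "no history available" and "zero sales in the previous week" look identical to the model. That is a modelling choice of this baseline, not a requirement of the technique.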

## Maximum error
A naive reference point for this dataset is to predict, for every item, the mean of our sales over weeks 1 to 12. Any useful model should achieve a lower RMSE than this, and that's what the cell below is computing.

In [7]:
worst_possible_prediction = shifting.loc[shifting.group_backwards < 13]['orderSum'].mean()
prediction = np.full(shifting.loc[shifting.group_backwards == 13]['orderSum'].shape, worst_possible_prediction) # Array filled with the mean...
target = shifting.loc[shifting.group_backwards == 13]['orderSum']
print("Guessing the mean of 'orderSum' for all items in target", mse(target, prediction) ** 0.5)

Guessing the mean of 'orderSum' for all items in target 90.29706562119341
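
As a small worked example of the same computation, the sketch below evaluates the RMSE of a constant mean prediction on made-up numbers. Both arrays are hypothetical and only illustrate the arithmetic.

```python
import numpy as np

history = np.array([0, 0, 2, 10])  # hypothetical past sales, mean = 3.0
target = np.array([0, 1, 3])       # hypothetical sales in the held-out week

# Predict the historical mean for every item in the target week.
baseline = np.full(target.shape, history.mean())

# RMSE = sqrt(mean((target - prediction)^2)) = sqrt((9 + 4 + 0) / 3)
rmse = np.sqrt(np.mean((target - baseline) ** 2))
print(rmse)
```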


## Dataset Splitting
All my experiments will use weeks 13 to 3 as a train set, week 2 as our validation set and week 1 as a test set.

In [8]:
# CatBoost requires categorical features to be integers or strings,
# so every column from the fourth onward is cast to str here
stringColumns = shifting.columns[3:]
shifting[stringColumns] = shifting[stringColumns].astype(str)

In [9]:
# Split by week: train on weeks 13 to 3, validate on week 2, test on week 1
train = shifting.loc[shifting.group_backwards >= 3]
val = shifting.loc[shifting.group_backwards == 2]
test = shifting.loc[shifting.group_backwards == 1]

In [10]:
# I recommend that the other members of the team keep our
# datasets as Pandas DataFrames instead of Numpy arrays,
# since it will be easier to use boosting analysis frameworks
y_train = train['orderSum']
y_val = val['orderSum']
X_train = train.drop(columns=["orderSum"])
X_val = val.drop(columns=["orderSum"])

In [11]:
# initialize Pool
train_pool = Pool(X_train, 
                  y_train, 
                  cat_features=[8,9,10])

val_pool = Pool(X_val, 
                  y_val, 
                  cat_features=[8,9,10])
# test_pool = Pool(test_data.astype(str), 
#                  cat_features=[8,9,10]) 
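
The hard-coded positions `[8,9,10]` silently point at different columns if the feature order ever changes. One option is to derive the positions from column names, as in the sketch below. The frame and the names in `cat_names` are hypothetical, since the real column order is not shown in this notebook.

```python
import pandas as pd

# Hypothetical feature frame; the real X_train has more columns.
X = pd.DataFrame({
    "itemID": [1, 2],
    "group_backwards": [3, 3],
    "brand": ["10", "22"],
    "manufacturer": ["7", "7"],
})

# Assumed categorical columns (placeholder names for illustration).
cat_names = ["brand", "manufacturer"]

# Look up each name's integer position in the current column order.
cat_idx = [X.columns.get_loc(c) for c in cat_names]
print(cat_idx)
```

The resulting list can be passed as `cat_features` in place of the literal indices. As far as I know, `Pool` also accepts the column names directly when the data is a DataFrame, which would avoid the lookup altogether.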

In [12]:
# specify the training parameters 
model = CatBoostRegressor(depth=6, 
                          learning_rate=0.1, 
                          loss_function='RMSE',
                          early_stopping_rounds=5)

model.fit(
    train_pool,
    eval_set=val_pool,
    logging_level='Verbose',  # prints per-iteration metrics as text; remove for quieter output
);

0:	learn: 100.9859886	test: 105.3582307	best: 105.3582307 (0)	total: 193ms	remaining: 3m 12s
1:	learn: 99.1420325	test: 102.6691635	best: 102.6691635 (1)	total: 264ms	remaining: 2m 11s
2:	learn: 97.5669743	test: 99.9306748	best: 99.9306748 (2)	total: 326ms	remaining: 1m 48s
3:	learn: 96.3818764	test: 98.6021707	best: 98.6021707 (3)	total: 393ms	remaining: 1m 37s
4:	learn: 95.3057960	test: 97.1708837	best: 97.1708837 (4)	total: 507ms	remaining: 1m 40s
5:	learn: 94.1458137	test: 95.6902170	best: 95.6902170 (5)	total: 578ms	remaining: 1m 35s
6:	learn: 93.4752886	test: 94.9083519	best: 94.9083519 (6)	total: 634ms	remaining: 1m 30s
7:	learn: 92.9225411	test: 94.2611453	best: 94.2611453 (7)	total: 696ms	remaining: 1m 26s
8:	learn: 92.2531031	test: 93.4342248	best: 93.4342248 (8)	total: 752ms	remaining: 1m 22s
9:	learn: 91.8245055	test: 93.0000419	best: 93.0000419 (9)	total: 791ms	remaining: 1m 18s
10:	learn: 91.3963707	test: 92.6919112	best: 92.6919112 (10)	total: 853ms	remaining: 1m 16s
11:

### Utilities

**Predicting at test time**

In [21]:
y_test = test['orderSum']
X_test = test.drop(columns=["orderSum"])
final_predictions = model.predict(X_test)

In [22]:
final_predictions[final_predictions < 0].shape

(1810,)
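
Demand cannot be negative, so those 1810 predictions are worth handling before submission. The sketch below shows one possible treatment on made-up values: clip at zero, then convert to integers. It is an optional post-processing step, not something the notebook above applies.

```python
import numpy as np

raw = np.array([-3.2, 0.4, 7.9, -0.1])  # hypothetical model outputs

# astype(int) alone truncates toward zero, so -3.2 would become -3.
truncated_only = raw.astype(int)

# Clip negatives to zero first, then truncate (as the notebook's cast does)...
clipped = np.clip(raw, 0, None)
as_int = clipped.astype(int)

# ...or round to the nearest integer instead of truncating.
rounded = np.rint(clipped).astype(int)
print(truncated_only, as_int, rounded)
```

Whether truncation or rounding scores better depends on the competition metric, so it is worth checking both on the validation week.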

**Creating our Kaggle CSV**

In [17]:
final = pd.Series(0, index=np.arange(1, len(items)+1))
final[items.itemID] = final_predictions.astype(int)

final.to_csv("cat_kaggle_df.csv", header=["demandPrediction"],
            index_label="itemID", sep="|")
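
The assignment above relies on the test rows being in the same order as `items.itemID`. A slightly more defensive variant, sketched below on made-up data, writes each prediction under the `itemID` of the row it came from. It assumes the test features still carry an `itemID` column, and every name and value here is hypothetical.

```python
import numpy as np
import pandas as pd

item_ids = np.array([1, 2, 3, 4])              # hypothetical full catalogue
X_test = pd.DataFrame({"itemID": [3, 1, 4]})   # hypothetical, deliberately unsorted
preds = np.array([5.0, 2.0, 9.0])              # hypothetical predictions, one per test row

# Start from zero demand for every item in the catalogue.
final = pd.Series(0, index=item_ids, name="demandPrediction")

# Write each prediction under the itemID of its own test row.
final.loc[X_test["itemID"].to_numpy()] = preds.astype(int)
print(final)
```

Items missing from the test rows keep the default prediction of zero, which matches the behaviour of the cell above.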

**Saving our model in disk**

In [50]:
now = datetime.now().strftime("%d-%m-%Y-%Hh%Mm%Ss")
modelName = 'cat-' + now
model.save_model(modelName)  # 'model' is the CatBoostRegressor fitted above