# Accumulated Sum by Category
This notebook is a throwback to all my previous baselines. The main objective here is to be 100% sure that no part of my pipeline is leaking data.

In [1]:
import numpy as np
import pandas as pd
from utils import read_data, process_time, merge_data, promo_detector, promo_detector_fixed, promotionAggregation, dataset_builder
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as mse
import sys
import xgboost as xgb
import lightgbm as lgb
from datetime import datetime

NUMBER_OF_LAGS = 4

sys.path.append("../../main/datasets/")
!ls  ../../main/datasets/

1.0v.zip


<hr>

## Defining metrics

Baseline_score function

In [2]:
def baseline_score(prediction, target, simulatedPrice):
    prediction = prediction.astype(int)

    return np.sum((prediction - np.maximum(prediction - target, 0) * 1.6)  * simulatedPrice)
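
To get a feeling for how this score behaves, here is a minimal sketch on made-up numbers. The `score` helper just restates the formula above so the snippet runs on its own; the three items, their targets and their prices are hypothetical values, not taken from the dataset.

```python
import numpy as np

def score(prediction, target, simulated_price):
    # Same formula as baseline_score, restated so this snippet is self-contained.
    prediction = prediction.astype(int)
    overshoot = np.maximum(prediction - target, 0)  # units predicted above demand
    return np.sum((prediction - overshoot * 1.6) * simulated_price)

# Hypothetical toy values: three items, each with demand 5 and price 10.
target = np.array([5, 5, 5])
price = np.array([10.0, 10.0, 10.0])

under = score(np.array([3, 3, 3]), target, price)  # predicting below demand
exact = score(np.array([5, 5, 5]), target, price)  # predicting demand exactly
over = score(np.array([8, 8, 8]), target, price)   # predicting above demand
print(under, exact, over)
```

On these toy values the exact prediction scores highest. Each unit predicted above the target costs 0.6 of its price, while each unit predicted below the target just misses out on its price.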

Evaluation Metric

In [3]:
def feval(prediction, dtrain):
    
    prediction = prediction.astype(int)
    target = dtrain.get_label()

    simulatedPrice = dtrain.get_weight()
    
    return 'feval', np.sum((prediction - np.maximum(prediction - target, 0) * 1.6)  * simulatedPrice), True

Objective Metric

In [4]:
def gradient(predt, dtrain):
    y = dtrain.get_label()
    sp = dtrain.get_weight()
    return -2 * (predt - np.maximum(predt - y, 0) * 1.6) * (1 - (predt > y) * 1.6) * sp

def hessian(predt, dtrain):
    y = dtrain.get_label()
    sp = dtrain.get_weight() 
    return -2 * ((1 - (predt > y) * 1.6) ** 2) * sp

def objective(predt, dtrain):
    grad = gradient(predt, dtrain)
    hess = hessian(predt, dtrain)
    return grad, hess
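
As a sanity check, the sketch below compares `gradient` against a finite-difference derivative. It assumes (this is my reading of the code, it is not stated anywhere above) that the loss behind these formulas is `-sp * f(p)**2` with `f(p) = p - 1.6 * max(p - y, 0)`. `ToyData` is a hypothetical stand-in that only offers the two getters the functions call, and the labels, weights and predictions are made up.

```python
import numpy as np

class ToyData:
    """Hypothetical stand-in for the training matrix (only get_label / get_weight)."""
    def __init__(self, label, weight):
        self._label = np.asarray(label, dtype=float)
        self._weight = np.asarray(weight, dtype=float)

    def get_label(self):
        return self._label

    def get_weight(self):
        return self._weight

def implied_loss(predt, dtrain):
    # Assumption: the per-row loss whose derivative matches `gradient`.
    y = dtrain.get_label()
    sp = dtrain.get_weight()
    f = predt - np.maximum(predt - y, 0) * 1.6
    return -sp * f ** 2

def gradient(predt, dtrain):
    # Copied from the cell above so this sketch runs on its own.
    y = dtrain.get_label()
    sp = dtrain.get_weight()
    return -2 * (predt - np.maximum(predt - y, 0) * 1.6) * (1 - (predt > y) * 1.6) * sp

data = ToyData(label=[5.0, 5.0, 2.0], weight=[10.0, 3.0, 1.5])
predt = np.array([3.0, 8.0, 2.5])  # kept away from the kink at predt == y

# Central finite difference of the assumed loss, row by row.
eps = 1e-6
numeric = (implied_loss(predt + eps, data) - implied_loss(predt - eps, data)) / (2 * eps)
analytic = gradient(predt, data)
matches = np.allclose(numeric, analytic, rtol=1e-5)
print(matches)
```

If the two agree, `gradient` is the first derivative of that assumed loss. Under the same assumption, `hessian` is its second derivative away from the kink.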

<hr>

## Building our dataset
This notebook makes this step cleaner than the previous versions, so it'll be tidier and shorter than before!

In [5]:
infos, items, orders = read_data("../../main/datasets/")
print("Sanity checks...", infos.shape, items.shape, orders.shape)

Sanity checks... (10463, 3) (10463, 8) (2181955, 5)


In [6]:
# Changing our time signatures
process_time(orders)

In [7]:
df = dataset_builder(orders, items)

<hr>

## Feature building

The main objective here is to create a feature that represents each item's share (in percent) of the cumulative sales of its category over the previous weeks (I'll call it cumulative sale). Apparently, category 3 is the most important category for model evaluation, so this feature tries to tell our model how important a certain item is inside its group in category 3.

In [175]:
def cumulative_sale_by_category(df):
    """
    This function adds the percentage_accum_cat_3 column to our dataset, which tries
    to describe how important a certain item is inside its group in category 3.

    Parameters: df -> DataFrame returned by "dataset_builder" (after "process_time"
    was applied to the orders)

    Returns: our DataFrame with a new column (percentage_accum_cat_3)
    """
    acum = pd.DataFrame()
    for i in range(12, 0, -1):

        orders_per_item = df.loc[df.group_backwards > i].groupby(
            ['itemID', 'category3'], as_index=False).agg({'orderSum': 'sum'})
        orders_per_cat = df.loc[df.group_backwards > i].groupby(
            ['category3'], as_index=False).agg({'orderSum': 'sum'})

        # Mergin' the amount of sales by category
        # with the accumulated sales
        # of an item grouped by category
        # of the previous weeks
        cum_sum_mean = pd.merge(orders_per_item, orders_per_cat,
                                left_on='category3', right_on='category3', validate="m:1")

        # Calculating each item's percentage of the accumulated sales of its category...
        cum_sum_mean['percentage_accum_cat_3'] = cum_sum_mean['orderSum_x'] / \
            cum_sum_mean['orderSum_y'] * 100

        # These columns won't be useful anymore,
        # since they were used just to calculate our percentage
        cum_sum_mean.drop(columns=['orderSum_x', 'orderSum_y'], inplace=True)

        feature_merge = pd.merge(df.loc[df.group_backwards == i], cum_sum_mean.drop(
            columns=['category3']), left_on='itemID', right_on='itemID')
        acum = pd.concat([acum, feature_merge])

    week_13 = df.loc[df.group_backwards == 13].copy()
    week_13['percentage_accum_cat_3'] = 0
    acum = pd.concat([week_13, acum])

    assert (acum.loc[acum.group_backwards == 13]['percentage_accum_cat_3'].sum(
    ) == 0), ("The values on week 13 should all be zero. Verify your inputs")
    
    acum.reset_index(drop=True, inplace=True)

    return acum
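
To double-check the "no leak" idea on something small, here is a minimal sketch of one iteration of the loop (`i = 1`) on a toy frame. The column names follow the function above, but the rows are hypothetical: week 2 plays the role of the past, and the week 1 sales are set to an absurd 99 so that any leak would show up in the result.

```python
import pandas as pd

# Hypothetical toy frame: week 2 is the past, week 1 is the week getting the feature.
toy = pd.DataFrame({
    "itemID":          [1, 2, 3, 1, 2, 3],
    "category3":       [7, 7, 8, 7, 7, 8],
    "group_backwards": [2, 2, 2, 1, 1, 1],
    "orderSum":        [30, 10, 5, 99, 99, 99],
})

# Only weeks strictly before week 1 take part in the sums.
past = toy.loc[toy.group_backwards > 1]
per_item = past.groupby(["itemID", "category3"], as_index=False).agg({"orderSum": "sum"})
per_cat = past.groupby("category3", as_index=False).agg({"orderSum": "sum"})

# Item sales divided by the sales of its category, as a percentage.
share = per_item.merge(per_cat, on="category3",
                       suffixes=("_item", "_cat"), validate="m:1")
share["percentage_accum_cat_3"] = share["orderSum_item"] / share["orderSum_cat"] * 100

# Attach the feature to the week 1 rows.
week_1 = toy.loc[toy.group_backwards == 1].merge(
    share[["itemID", "percentage_accum_cat_3"]], on="itemID")
print(week_1[["itemID", "percentage_accum_cat_3"]])
```

Items 1 and 2 split category 7 as 75% and 25%, and item 3 holds 100% of category 8. The week 1 sales never enter the computation, which is the behaviour I want from the real function.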