So, regarding the method of ranking the products in order to maximize the total number of orders, I wa thinking about two approaches:
1. Weighted Rank System (rule-based aproach)
2. Learning to Rank Listwise method

At first, I wasn't aware of the Learning to Rank (LTR) family of algorithms. However, after reading articles on Medium and Toward Data Science, I concluded that for my specific problem, a listwise approach might be the most appropriate one. Consequently, I opted to use LightGBM, a library developed by Microsoft, due to its simplicity and efficiency. While there may be other methods available, I decided to proceed with this one.

### Import libraries

In [1]:
import lightgbm as lgb
import pandas as pd
import numpy as np
import logging
import optuna
import json

from sklearn.model_selection import train_test_split
from sklearn.metrics import ndcg_score

### Read dataset

In [2]:
# Load the dataset
df = pd.read_csv('data/dataset_v1.csv')

Let's proceed by filtering out the records where all the predictors, including the dependent variable, are 0.

In [3]:
filtered_df = df[(df.drop(columns=['offer_id', 'category_id', 'base_price_with_vat', 'product_novelty', 'promo_price_with_vat', 'date']) == 0).all(axis=1)]
df = df[~df.index.isin(filtered_df.index)].reset_index(drop=True)

### Train test split and defining groups

Further, I am going to split the dataset into a train and test set while keeping the dates unique in both splits./

In [4]:
df = df.sort_values(by='date')

max_date = df['date'].max()

unique_dates = df['date'].unique()
split_date_index = int(len(unique_dates) * 0.8)
split_date = unique_dates[split_date_index]

# Split the DataFrame into training and testing sets based on the split date
train_df = df[df['date'] < split_date]
test_df = df[df['date'] >= split_date]

# Separate features and target variable for training and testing sets
X_train = train_df[['category_id', 'date', 'add_to_wishlist', 'add_to_cart', 'count_reviews', 'pageviews', 'rating']]
y_train = train_df['orders']

X_test = test_df[['category_id', 'date', 'add_to_wishlist', 'add_to_cart', 'count_reviews', 'pageviews', 'rating']]
y_test = test_df['orders']

Given the initial requirement to split the groups by category, I also considered the significance of the temporal variable. Recognizing that patterns could vary daily, I decided to split the groups based on both category and date. This approach is aimed at capturing any distinct patterns that emerge on a day-to-day basis.

In [5]:
# Calculate the size of each group (category) for both training and testing sets
train_groups = X_train.groupby(['category_id', 'date']).size().values.tolist()
test_groups = X_test.groupby(['category_id', 'date']).size().values.tolist()

# Remove 'category_id' from X_train and X_test as it's used for grouping, not as a feature
X_train_features = X_train.drop(columns=['category_id', 'date'])
X_test_features = X_test.drop(columns=['category_id', 'date'])

# Creating LightGBM datasets with correct group information
train_data = lgb.Dataset(X_train_features, label=y_train, group=train_groups)
test_data = lgb.Dataset(X_test_features, label=y_test, group=test_groups, reference=train_data)

Let's check the model's performance by using the default parameters

In [6]:
# Parameters for the ListNet model
params = {
    'objective': 'lambdarank',
    'metric': 'ndcg',  # NDCG for evaluation
    'verbose': -1,
    'label_gain': list(range(max(y_train.max(), y_test.max()) + 1)),
    'random_state': 42
}

### Model training

In [7]:
# Train the model
bst = lgb.train(
    params, 
    train_data, 
    valid_sets=[test_data]
)

### Model testing

In [8]:
# Predict scores for the test set
predictions = bst.predict(X_test_features)

In [9]:
results_df = test_df.copy()
results_df['rank'] = predictions

In [10]:
orders = np.array(results_df['orders']).reshape(1, -1)
rank = np.array(results_df['rank']).reshape(1, -1)

# Calculate NDCG
ndcg = ndcg_score(orders, rank)

print(f"NDCG Score: {round(ndcg, 4)}")

NDCG Score: 0.9387


### Hypertuning

In [12]:
# Configure the logging
logging.basicConfig(level=logging.INFO, filename='hypertuning/optuna_logs_listwise.txt', filemode='a')
logger = logging.getLogger(__name__)

# Define the objective function for Optuna
def objective(trial):
    params = {
        'objective': 'lambdarank',
        'metric': 'ndcg',
        'verbose': -1,
        'random_state': 42,
        'label_gain': list(range(max(y_train.max(), y_test.max()) + 1)),
        'learning_rate': trial.suggest_float('learning_rate', 0.001, 1.0),
        'num_leaves': trial.suggest_int('num_leaves', 2, 256),
        'max_depth': trial.suggest_int('max_depth', 1, 20),
    }

    # Create LightGBM dataset with group information
    train_data = lgb.Dataset(X_train_features, label=y_train, group=train_groups)
    test_data = lgb.Dataset(X_test_features, label=y_test, group=test_groups, reference=train_data)

    # Train the model with the current hyperparameters
    bst = lgb.train(params, train_data, valid_sets=[test_data])

    # Predict scores for the test set
    predictions = bst.predict(X_test_features)
    results_df = test_df.copy()
    results_df['rank'] = predictions

    # Reshape inputs for ndcg_score
    orders = np.array(results_df['orders']).reshape(1, -1)
    rank = np.array(results_df['rank']).reshape(1, -1)

    # Calculate NDCG
    ndcg = ndcg_score(orders, rank)
    
    logger.info(f"Trial {trial.number} - NDCG: {ndcg}, Hyperparameters: {trial.params}")
    return -ndcg  # Optuna minimizes, so we negate NDCG to maximize it

# Create an Optuna study
optuna.logging.set_verbosity(optuna.logging.WARNING)

study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=60)

In [13]:
# Save the results
with open(f'hypertuning/best_params_listwise.json', 'w') as f:
    json.dump(study.best_params, f)
    
params = study.best_params
print(study.best_params)

{'learning_rate': 0.20815690907166637, 'num_leaves': 125, 'max_depth': 3}


### Let's check the results again

In [19]:
default_params = {
    'objective': 'lambdarank',
    'metric': 'ndcg',
    'verbose': -1,
    'random_state': 42,
    'label_gain': list(range(max(y_train.max(), y_test.max()) + 1)),
}

# Merge the additional hyperparameters into the params dictionary
params.update(default_params)

In [21]:
top_model = lgb.train(
    params, 
    train_data, 
    valid_sets=[test_data]
)

In [22]:
predictions = top_model.predict(X_test_features)
top_results_df = test_df.copy()
top_results_df['rank'] = predictions

In [23]:
orders = np.array(top_results_df['orders']).reshape(1, -1)
rank = np.array(top_results_df['rank']).reshape(1, -1)

# Calculate NDCG
ndcg = ndcg_score(orders, rank)

print(f"NDCG Score: {round(ndcg, 4)}")

NDCG Score: 0.9644


In [29]:
top_results_df.sort_values('rank', ascending=False).head(5)

Unnamed: 0,offer_id,category_id,add_to_wishlist,add_to_cart,base_price_with_vat,ordered_quantity,orders,product_novelty,count_reviews,pageviews,rating,promo_price_with_vat,date,rank
169848,58,3,506,1478,799.99,1038,1029,2016-10-26,911,40599,4.35,399.9901,2021-01-06,14.589817
176645,58,3,328,636,799.99,547,503,2016-10-26,913,19571,4.35,399.9901,2021-01-07,14.589817
183392,58,3,220,604,799.99,485,468,2016-10-26,918,13888,4.34,799.99,2021-01-08,14.305395
169873,348,3,313,425,949.99,339,315,2018-05-23,284,12138,4.36,549.99,2021-01-06,14.254103
166209,287,2,334,477,1099.9899,364,327,2018-03-16,260,20208,4.62,699.99,2021-01-06,13.175575
