# Version 1 - Using LGBM Ranker (Score 0.48477)

In this notebook, we introduce our first approach: training a LightGBM Ranker (`LGBMRanker`) for our recommendation system.

Our goal is to predict which items users are most likely to interact with next (via clicks, carts, or orders) based on their session behavior.

To simplify experimentation and improve reproducibility, we work with a preprocessed version of the OTTO dataset provided in parquet format, eliminating the need to parse raw JSONL files. This setup allows for faster iteration and local validation.

We extract behavioral patterns and session dynamics by engineering features such as:

* The reverse chronological position of each action in the session
* The total number of actions per session
* A log-scaled recency score based on both the order and timestamp of interactions
* A type-weighted recency score, reflecting the importance of clicks, carts, and orders

These features are designed to capture both recency and interaction strength, helping the model learn user intent over time. After feature generation, we train a LightGBM Ranker to prioritize the most relevant items per session.

Some features, such as total clicks, carts, and orders per session, session duration, session-level conversion rate, and aid interaction count, were tested but did not lead to measurable improvements in model performance. Their helper functions are kept in the notebook for reference, but they are not part of the feature pipeline used for training.


# Data Processing

In [1]:
!pip install polars==0.18.4
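# Pinned on purpose: the code below uses the polars 0.18 API (for example `groupby`, `cumcount`, and `apply`), which was renamed in later releases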

Collecting polars==0.18.4
  Downloading polars-0.18.4-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.9 MB)
Installing collected packages: polars
Successfully installed polars-0.18.4

In [2]:
import polars as pl

In [3]:
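# The model is fit on the local-validation test split, which is the split that comes with labels (test_labels.parquet)
train = pl.read_parquet('../input/otto-train-and-test-data-for-local-validation/test.parquet')
train_labels = pl.read_parquet('../input/otto-train-and-test-data-for-local-validation/test_labels.parquet')
# The local-validation train split is loaded under the name `testing`
testing = pl.read_parquet('../input/otto-train-and-test-data-for-local-validation/train.parquet')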

In [4]:
train.head()


session,aid,ts,type
i32,i32,i32,u8
11098528,11830,1661119200,0
11098529,1105029,1661119200,0
11098530,264500,1661119200,0
11098530,264500,1661119288,0
11098530,409236,1661119369,0


In [5]:
testing.head()

session,aid,ts,type
i32,i32,i32,u8
0,1517085,1659304800,0
0,1563459,1659304904,0
0,1309446,1659367439,0
0,16246,1659367719,0
0,1781822,1659367871,0


In [6]:
train_labels.head()

session,type,ground_truth
i64,str,list[i64]
11098528,"""clicks""",[1679529]
11098528,"""carts""",[1199737]
11098528,"""orders""","[990658, 950341, … 1033148]"
11098529,"""clicks""",[1105029]
11098530,"""orders""",[409236]


We compute the same scores that we used for creating co-visitation matrices. We know they carry signal, so let's provide this information to our `LGBM Ranker`!

In [7]:
# Adds a column indicating the reverse chronological position of each action in the session
def add_action_num_reverse_chrono(df):
    return df.select([
        pl.col('*'),
        pl.col('session').cumcount().reverse().over('session').alias('action_num_reverse_chrono')
    ])

# Adds a column indicating the total number of actions in each session
def add_session_length(df):
    return df.select([
        pl.col('*'),
        pl.col('session').count().over('session').alias('session_length')
    ])

# Adds a column with the total number of 'click' (type=0) actions in each session
def add_session_total_click(df):
    return df.with_columns(
        (pl.col('type') == 0).sum().over('session').alias('session_total_click')
    )

# Adds a column with the total number of 'cart' (type=1) actions in each session
def add_session_total_cart(df):
    return df.with_columns(
        (pl.col('type') == 1).sum().over('session').alias('session_total_cart')
    )

# Adds a column with the total number of 'order' (type=2) actions in each session
def add_session_total_order(df):
    return df.with_columns(
        (pl.col('type') == 2).sum().over('session').alias('session_total_order')
    )

# Adds an order-based recency score: 2**x - 1, where x rises linearly from 0.1 (oldest action) to 1.0 (latest action)
def add_log_recency_score(df):
    # Single-action sessions divide by zero here and produce NaN, which fill_nan(1) below turns into a score of 1
    linear_interpolation = 0.1 + ((1 - 0.1) / (df['session_length'] - 1)) * (df['session_length'] - df['action_num_reverse_chrono'] - 1)
    return df.with_columns(pl.Series(2**linear_interpolation - 1).alias('log_recency_score')).fill_nan(1)

# Adds a column that combines interaction type weights with the log recency score
def add_type_weighted_log_recency_score(df):
    type_weights = {0: 1, 1: 6, 2: 3}  # Click=1, Cart=6, Order=3
    # Divide the recency score by the type weight for normalization
    type_weighted_log_recency_score = pl.Series(df['log_recency_score'] / df['type'].apply(lambda x: type_weights[x]))
    return df.with_columns(type_weighted_log_recency_score.alias('type_weighted_log_recency_score'))

# Adds the session duration (max timestamp - min timestamp) for each session
def add_session_duration(df):
    return df.with_columns(
        (pl.col('ts').max().over('session') - pl.col('ts').min().over('session')).alias('session_duration')
    )

# Adds a timestamp-based score with exponential scaling: 1.0 for the session's last action,
# growing with the time gap between an action and that last action
def add_timestamp_log_recency_score(df):
    return df.with_columns(
        ((2 ** (1 + ((pl.col('ts').max().over('session') - pl.col('ts')) / 10000))) - 1)
        .alias('timestamp_log_recency_score')
    )

# Adds a column that combines type weights with the timestamp-based log recency score
def add_type_weighted_timestamp_log_recency_score(df):
    type_weights = {0: 1, 1: 6, 2: 3}
    type_weighted_timestamp_log_recency_score = pl.Series(df['timestamp_log_recency_score'] / df['type'].apply(lambda x: type_weights[x]))
    return df.with_columns(type_weighted_timestamp_log_recency_score.alias('type_weighted_timestamp_log_recency_score'))

# Adds a session-level conversion rate feature: orders / (clicks + carts)
def add_session_conversion_rate(df):
    # Calculate total clicks or carts in the session
    aids_clicked_or_cart = (
        (pl.col('type') == 0) | (pl.col('type') == 1)
    ).sum().over('session').alias('aids_clicked_or_cart')
    
    # Calculate total orders in the session
    aids_ordered = (pl.col('type') == 2).sum().over('session').alias('aids_ordered')
    
    # Compute conversion rate: orders divided by (clicks + carts)
    conversion_rate = (aids_ordered / aids_clicked_or_cart).fill_nan(0).alias('session_conversion_rate')
    
    return df.with_columns(conversion_rate)

# Adds a column counting how many times each aid appears in a session
def add_aid_interaction_count(df):
    aid_interaction_counts = (
        df.groupby(['session', 'aid'])
        .agg(pl.count().alias('aid_interaction_count'))
    )
    return df.join(aid_interaction_counts, on=['session', 'aid'])

# Utility to apply a list of feature functions sequentially to a dataframe
def apply(df, pipeline):
    for f in pipeline:
        df = f(df)
    return df


In [8]:
pipeline = [
    add_action_num_reverse_chrono,
    add_session_length,
    add_log_recency_score,
    add_type_weighted_log_recency_score,
    add_timestamp_log_recency_score,
]

In [9]:
train = apply(train, pipeline)

All done!

In [10]:
train.head()

session,aid,ts,type,action_num_reverse_chrono,session_length,log_recency_score,type_weighted_log_recency_score,timestamp_log_recency_score
i32,i32,i32,u8,u32,u32,f64,f64,f64
11098528,11830,1661119200,0,0,1,1.0,1.0,1.0
11098529,1105029,1661119200,0,0,1,1.0,1.0,1.0
11098530,264500,1661119200,0,5,6,0.071773,0.071773,1.193447
11098530,264500,1661119288,0,4,6,0.214195,0.214195,1.180109
11098530,409236,1661119369,0,3,6,0.375542,0.375542,1.167903
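
As a sanity check, the scores in the table above can be reproduced by hand. The sketch below restates the three formulas from the feature functions in plain Python for a single row; the function names are illustrative and not part of the notebook, which computes these values column-wise with polars.

```python
# Plain-Python restatement of the score formulas used in the feature functions.
# Function names are illustrative only.

TYPE_WEIGHTS = {0: 1, 1: 6, 2: 3}  # click, cart, order (same weights as the notebook)

def log_recency_score(session_length, action_num_reverse_chrono):
    """2**x - 1, where x runs linearly from 0.1 (oldest action) to 1.0 (latest)."""
    if session_length == 1:
        return 1.0  # the notebook reaches the same value through fill_nan(1)
    step = (1 - 0.1) / (session_length - 1)
    x = 0.1 + step * (session_length - action_num_reverse_chrono - 1)
    return 2 ** x - 1

def type_weighted_log_recency_score(score, action_type):
    """The notebook divides the recency score by the type weight."""
    return score / TYPE_WEIGHTS[action_type]

def timestamp_log_recency_score(ts, session_max_ts):
    """2**(1 + gap / 10000) - 1, where gap is the time to the session's last action."""
    return 2 ** (1 + (session_max_ts - ts) / 10000) - 1

# Session 11098530 from the tables in this notebook: 6 actions, last one at ts=1661120532.
oldest = log_recency_score(6, 5)
latest_cart = type_weighted_log_recency_score(log_recency_score(6, 0), 1)
oldest_ts_score = timestamp_log_recency_score(1661119200, 1661120532)
print(round(oldest, 6), round(latest_cart, 6), round(oldest_ts_score, 6))
```

Note the opposite directions of the two scores: the order-based score increases towards the latest action, whereas the timestamp-based score equals 1.0 for the latest action and increases for older ones.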


Now we need to process our labels a little bit and merge them onto our train set.

In [11]:
type2id = {"clicks": 0, "carts": 1, "orders": 2}

In [12]:
train_labels = train_labels.explode('ground_truth').with_columns([
    pl.col('ground_truth').alias('aid'),
    pl.col('type').apply(lambda x: type2id[x])
])[['session', 'type', 'aid']]

In [13]:
train_labels = train_labels.with_columns([
    pl.col('session').cast(pl.datatypes.Int32),
    pl.col('type').cast(pl.datatypes.UInt8),
    pl.col('aid').cast(pl.datatypes.Int32)
])

In [14]:
train_labels = train_labels.with_columns(pl.lit(1).alias('gt'))

In [15]:
train = train.join(train_labels, how='left', on=['session', 'type', 'aid']).with_columns(pl.col('gt').fill_null(0))

In [16]:
train.head(200)

session,aid,ts,type,action_num_reverse_chrono,session_length,log_recency_score,type_weighted_log_recency_score,timestamp_log_recency_score,gt
i32,i32,i32,u8,u32,u32,f64,f64,f64,i32
11098528,11830,1661119200,0,0,1,1.0,1.0,1.0,0
11098529,1105029,1661119200,0,0,1,1.0,1.0,1.0,1
11098530,264500,1661119200,0,5,6,0.071773,0.071773,1.193447,0
11098530,264500,1661119288,0,4,6,0.214195,0.214195,1.180109,0
11098530,409236,1661119369,0,3,6,0.375542,0.375542,1.167903,0
11098530,409236,1661119441,0,2,6,0.558329,0.558329,1.15711,0
11098530,409236,1661120165,0,1,6,0.765406,0.765406,1.05153,0
11098530,409236,1661120532,1,0,6,1.0,0.166667,1.0,0
11098531,452188,1661119200,0,23,24,0.071773,0.071773,1.077142,0
11098531,1239060,1661119227,0,22,24,0.101241,0.101241,1.073258,0
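
The `gt` column comes from a left join on `(session, type, aid)`, so a row is marked 1 only if that exact aid, with that exact action type, also appears in the session's ground truth. The sketch below mimics the join with a plain Python set, using two sessions from the tables above; `label_rows` is a hypothetical helper, not part of the notebook.

```python
TYPE2ID = {"clicks": 0, "carts": 1, "orders": 2}

def label_rows(rows, ground_truth):
    """Append gt=1/0 to (session, aid, type) rows, mimicking the left join + fill_null(0).

    rows: list of (session, aid, type_id) tuples, one per session action.
    ground_truth: list of (session, type_name, [aids]) tuples, as in train_labels.
    """
    # Equivalent of explode('ground_truth'): one key per (session, type, aid).
    positives = {
        (session, TYPE2ID[type_name], aid)
        for session, type_name, aids in ground_truth
        for aid in aids
    }
    return [
        (session, aid, type_id, int((session, type_id, aid) in positives))
        for session, aid, type_id in rows
    ]

rows = [
    (11098528, 11830, 0),    # clicked aid differs from the ground-truth click
    (11098529, 1105029, 0),  # clicked aid is also the ground-truth click
]
ground_truth = [
    (11098528, "clicks", [1679529]),
    (11098529, "clicks", [1105029]),
]
labelled = label_rows(rows, ground_truth)
print(labelled)
```

One consequence of this setup is that aids a session never interacted with cannot appear as training rows, and the same holds for candidates at prediction time.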


OK, so we now have our preprocessed dataset and a column with the ground truth. The only thing still missing for our Ranker is information on how to group individual rows into sessions.

In [17]:
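def get_session_lenghts(df):
    # Caution: LightGBM reads `group` as consecutive block sizes in row order, while
    # groupby only guarantees that order when called with maintain_order=True
    return df.groupby('session').agg([
        pl.col('session').count().alias('session_length')
    ])['session_length'].to_numpy()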

In [18]:
def get_sesssion_total_clicks(df):
    pass
    

In [19]:
session_lengths_train = get_session_lenghts(train)

In [20]:
print(session_lengths_train)
print(f"We have a total of {len(session_lengths_train)} sessions.")

[1 1 1 ... 7 2 2]
We have a total of 1801251 sessions.
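
LightGBM's ranker interprets `group` as consecutive block sizes: the first `group[0]` rows form the first query, the next `group[1]` rows the second, and so on. The sizes therefore need to be listed in the same order as the rows of the training frame. As a hedged sketch (plain Python, with a hypothetical helper name), group sizes can be derived directly from the row order by counting runs of identical session ids:

```python
from itertools import groupby

def group_sizes_in_row_order(session_ids):
    """Return the length of each run of equal consecutive session ids.

    Assumes the rows of one session are contiguous, as in the frames shown above.
    """
    return [sum(1 for _ in run) for _, run in groupby(session_ids)]

# Session column of the first rows of `train` shown earlier.
session_column = [11098528, 11098529] + [11098530] * 6
sizes = group_sizes_in_row_order(session_column)
print(sizes)  # [1, 1, 6]
```

With polars 0.18, passing `maintain_order=True` to `groupby` should give the same ordering. This is worth checking against the printed array above, which starts with `[1 1 1 ...]` although the third session in `train` has six rows.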


# Model training

In [21]:
from lightgbm.sklearn import LGBMRanker

In [22]:
ranker = LGBMRanker(
    objective="lambdarank",
    metric="ndcg",
    boosting_type="dart",
    n_estimators=20,
    importance_type='gain',
)
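
Since the ranker is configured with `metric="ndcg"`, it may help to see what that metric rewards. The sketch below computes NDCG for one session with binary labels, using the common definition (gain `2**rel - 1`, discount `log2(position + 1)` with 1-based positions). It is a plain-Python illustration under that assumption, not LightGBM's internal implementation, and the sample scores are made up.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of labels listed in ranked order."""
    return sum((2 ** rel - 1) / math.log2(pos + 2) for pos, rel in enumerate(relevances))

def ndcg(labels, scores):
    """NDCG of one group: DCG of the score-induced ranking over the ideal DCG."""
    ranked = [label for _, label in sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)]
    ideal = dcg(sorted(labels, reverse=True))
    # Convention chosen for this sketch: a group without positives scores 0.0
    return dcg(ranked) / ideal if ideal > 0 else 0.0

labels = [0, 0, 1]  # gt column for a three-row session

perfect = ndcg(labels, scores=[0.1, 0.2, 0.9])  # relevant row ranked first
worst = ndcg(labels, scores=[0.9, 0.2, 0.1])    # relevant row ranked last
print(perfect, worst)
```

Ranking the single relevant row first gives the maximum value, while pushing it to third place halves the score, which is the kind of ordering error the `lambdarank` objective is meant to penalize.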

In [23]:
train.columns

['session',
 'aid',
 'ts',
 'type',
 'action_num_reverse_chrono',
 'session_length',
 'log_recency_score',
 'type_weighted_log_recency_score',
 'timestamp_log_recency_score',
 'gt']

In [24]:
feature_cols = ['aid', 'type', 'action_num_reverse_chrono', 'session_length', 'log_recency_score', 'type_weighted_log_recency_score', 'timestamp_log_recency_score']
target = 'gt'

In [25]:
ranker = ranker.fit(
    train[feature_cols].to_pandas(),
    train[target].to_pandas(),
    group=session_lengths_train,
)

# Predict on test data

Let's load our test set, process it and predict on it.

In [26]:
test = pl.read_parquet('../input/otto-full-optimized-memory-footprint/test.parquet')
test = apply(test, pipeline)
print(test)

shape: (6_928_123, 9)
┌──────────┬─────────┬────────────┬──────┬───┬────────────┬────────────┬────────────┬──────────────┐
│ session  ┆ aid     ┆ ts         ┆ type ┆ … ┆ session_le ┆ log_recenc ┆ type_weigh ┆ timestamp_lo │
│ ---      ┆ ---     ┆ ---        ┆ ---  ┆   ┆ ngth       ┆ y_score    ┆ ted_log_re ┆ g_recency_sc │
│ i32      ┆ i32     ┆ i32        ┆ u8   ┆   ┆ ---        ┆ ---        ┆ cency_scor ┆ ore          │
│          ┆         ┆            ┆      ┆   ┆ u32        ┆ f64        ┆ e          ┆ ---          │
│          ┆         ┆            ┆      ┆   ┆            ┆            ┆ ---        ┆ f64          │
│          ┆         ┆            ┆      ┆   ┆            ┆            ┆ f64        ┆              │
╞══════════╪═════════╪════════════╪══════╪═══╪════════════╪════════════╪════════════╪══════════════╡
│ 12899779 ┆ 59625   ┆ 1661724000 ┆ 0    ┆ … ┆ 1          ┆ 1.0        ┆ 1.0        ┆ 1.0          │
│ 12899780 ┆ 1142000 ┆ 1661724000 ┆ 0    ┆ … ┆ 5          ┆ 0.071773 

In [27]:
scores = ranker.predict(test[feature_cols].to_pandas())
print(scores)

[ 0.23992961 -0.31695035 -0.12408719 ...  0.23897263  0.23992961
  0.23897263]


# Create submission

In [28]:
test = test.with_columns(pl.Series(name='score', values=scores))
test_predictions = test.sort(['session', 'score'], descending=True).groupby('session').agg([
    pl.col("aid").limit(20)
])

In [29]:
test.head()

session,aid,ts,type,action_num_reverse_chrono,session_length,log_recency_score,type_weighted_log_recency_score,timestamp_log_recency_score,score
i32,i32,i32,u8,u32,u32,f64,f64,f64,f64
12899779,59625,1661724000,0,0,1,1.0,1.0,1.0,0.23993
12899780,1142000,1661724000,0,4,5,0.071773,0.071773,1.021603,-0.31695
12899780,582732,1661724058,0,3,5,0.252664,0.252664,1.013492,-0.124087
12899780,973453,1661724109,0,2,5,0.464086,0.464086,1.006387,-0.242974
12899780,736515,1661724136,0,1,5,0.71119,0.71119,1.002636,0.200949


In [30]:
test_predictions.head()

session,aid
i32,list[i32]
14571581,[1100210]
14571580,[202353]
14571579,[739876]
14571578,[519105]
14571577,[1141710]
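
Because every action of a session is a separate row, the same aid can appear several times among a session's scored rows. The sketch below shows the top-20 selection in plain Python with one optional refinement that the notebook does not apply: keeping only the first (highest-scoring) occurrence of each aid. The helper name and the sample rows are hypothetical.

```python
from collections import defaultdict

def top_k_per_session(rows, k=20):
    """rows: iterable of (session, aid, score). Returns {session: [aid, ...]}.

    Aids are ordered by descending score; repeated aids keep only their best rank.
    """
    by_session = defaultdict(list)
    for session, aid, score in rows:
        by_session[session].append((score, aid))

    result = {}
    for session, scored in by_session.items():
        picked = []
        for _, aid in sorted(scored, key=lambda p: p[0], reverse=True):
            if aid not in picked:
                picked.append(aid)
            if len(picked) == k:
                break
        result[session] = picked
    return result

# Made-up scores for illustration only.
rows = [
    (1, 100, 0.2), (1, 200, 0.9), (1, 100, 0.5),
    (2, 300, -0.1),
]
predictions = top_k_per_session(rows, k=2)
print(predictions)  # {1: [200, 100], 2: [300]}
```

Whether deduplication helps the leaderboard score is not tested here; it only avoids spending several of the 20 slots on the same item.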


In [31]:
num_rows = test_predictions.shape[0]
print("Number of rows:", num_rows)

Number of rows: 1671803


In [32]:
session_types = []
labels = []

for session, preds in zip(test_predictions['session'].to_numpy(), test_predictions['aid'].to_numpy()):
    l = ' '.join(str(p) for p in preds)
    for session_type in ['clicks', 'carts', 'orders']:
        labels.append(l)
        session_types.append(f'{session}_{session_type}')

In [33]:
submission = pl.DataFrame({'session_type': session_types, 'labels': labels})
submission.write_csv('submission.csv')

In [34]:
submission.head(10)

session_type,labels
str,str
"""14571581_click…","""1100210"""
"""14571581_carts…","""1100210"""
"""14571581_order…","""1100210"""
"""14571580_click…","""202353"""
"""14571580_carts…","""202353"""
"""14571580_order…","""202353"""
"""14571579_click…","""739876"""
"""14571579_carts…","""739876"""
"""14571579_order…","""739876"""
"""14571578_click…","""519105"""
