<a href="https://colab.research.google.com/github/lapshinaaa/lapshinaaa/blob/main/RecSys4_NeuralRanking.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Neural Ranking

Source: https://github.com/yandexdataschool/recsys_course/tree/main

In [None]:
# !pip install -r requirements.txt

In [3]:
import typing as tp
import polars as pl
from tqdm import tqdm
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from sklearn.metrics import roc_auc_score

## 1. Downsampled Criteo Kaggle Dataset

Source:
https://www.kaggle.com/datasets/dogrose/downsampled-criteo-kaggle-dataset/data

The Criteo Dataset is a large-scale benchmark for Click-Through Rate (CTR) prediction in online advertising. It contains real ad impression logs and is widely used for training and evaluating models that estimate the probability of a user clicking on an advertisement.

Dataset Characteristics
Feature Type | Count | Description
------------|-------|------------
Target | 1 | Binary click indicator (clicked / not clicked)
Numerical Features | 13 (I1–I13) | Count-based, require log transformation
Categorical Features | 26 (C1–C26) | Anonymized, hashed, high cardinality

Additional properties:
- Contains millions of samples across 7 days (6-day train, 1-day test)
- High sparsity and many missing values
- The Downsampled version used here is 10× smaller than the original dataset

The original dataset requires substantial preprocessing; this is implemented in the provided code.

Preprocessing Techniques Used in Literature
- Log transformation of numerical features
- Count encoding or one-hot encoding for categorical features
- Feature hashing for high-cardinality variables
- Feature interactions (manual or learned)

Modeling Approaches Used by Top Kaggle Solutions
Category | Example Models | Notes
---------|---------------|------
Gradient Boosting | XGBoost, LightGBM, CatBoost | Strong baselines
Deep Learning | Wide & Deep, DNNs | Capture non-linear interactions
Factorization Machines | FM, FFM | Good for sparse categorical data
Ensembles | Stacking multiple models | Often leads to top performance

Top solutions combine multiple categories and emphasize feature engineering + handling high cardinality.

In [4]:
!mkdir ./data
!curl -L -o ./data/downsampled-criteo-kaggle-dataset.zip\
  https://www.kaggle.com/api/v1/datasets/download/dogrose/downsampled-criteo-kaggle-dataset
!unzip ./data/downsampled-criteo-kaggle-dataset.zip -d ./data

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  208M  100  208M    0     0   149M      0  0:00:01  0:00:01 --:--:--  233M
Archive:  ./data/downsampled-criteo-kaggle-dataset.zip
  inflating: ./data/criteo_test_1day_downsampled.parquet  
  inflating: ./data/criteo_train_6days_downsampled.parquet  


In [5]:
class CriteoDatasetUtils:
    INT_COLS = [f'I{i + 1}' for i in range(13)]
    CAT_COLS = [f'C{i + 1}' for i in range(26)]
    LABEL_COl = 'label'

    @classmethod
    def preprocess_dense_features(cls, lf: pl.LazyFrame) -> pl.LazyFrame:
        """
        Preprocess dense features:
        - Fill missing values with 0
        - Apply log transformation: log(x+1) or log(x+4)
        """
        expressions = []
        for col in cls.INT_COLS:
            expressions.append(
                pl.col(col).fill_null(0).add(1 if col != 'I2' else 4).log()
            )
        lf = lf.with_columns(expressions)
        return lf

    @classmethod
    def preprocess_categorical_features(cls, lf: pl.LazyFrame) -> pl.LazyFrame:
        """
        Preprocess categorical features:
        - Fill missing values with zero string ("00000000")
        - Convert from hex to Int64
        """
        expressions = []
        for col in cls.CAT_COLS:
            expressions.append(
                pl.col(col).fill_null("00000000").str.to_integer(base=16)
            )
        lf = lf.with_columns(expressions)
        return lf

    @classmethod
    def read_and_preprocess(cls, path: str) -> pl.DataFrame:
        lf = pl.scan_parquet(path)
        lf = cls.preprocess_categorical_features(lf)
        lf = cls.preprocess_dense_features(lf)
        return lf.collect()

In [6]:
# DATASETS_PATH = '/content'
DATASETS_PATH = './data'
train_df = CriteoDatasetUtils.read_and_preprocess(f'{DATASETS_PATH}/criteo_train_6days_downsampled.parquet')
test_df = CriteoDatasetUtils.read_and_preprocess(f'{DATASETS_PATH}/criteo_test_1day_downsampled.parquet')
train_df.head(5)

label,I1,I2,I3,I4,I5,I6,I7,I8,I9,I10,I11,I12,I13,C1,C2,C3,C4,C5,C6,C7,C8,C9,C10,C11,C12,C13,C14,C15,C16,C17,C18,C19,C20,C21,C22,C23,C24,C25,C26
i64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64
0,0.693147,1.609438,1.791759,0.0,7.23201,1.609438,2.772589,1.098612,5.204007,0.693147,1.098612,0.0,1.098612,1761418852,2162322587,4220739894,2068259780,633879704,2114768079,3732510136,529118562,2805916944,2832028932,2999688344,935969124,673490422,450684655,2343089050,2300273383,3854202482,4114618041,568184265,2972002973,129309004,0,974593739,3318023300,3904386055,2535972118
0,0.0,1.791759,6.45047,0.0,10.946781,0.0,0.0,1.791759,4.189655,0.0,0.0,0.0,1.098612,98275684,73979506,2062028047,2161661274,633879704,2114768079,69696505,185940084,2093428418,990438539,2116836373,3486017542,2443294225,2995026422,1478826667,342520061,2003624857,187896596,568184265,1480633834,3421256155,0,974593739,3470446969,3935970412,2589289724
1,0.0,3.931826,0.0,0.0,8.764053,3.663562,2.995732,2.397895,4.969813,0.0,2.397895,0.0,1.94591,342162125,950618017,574295574,3394569756,633879704,2114768079,1765007041,1530472565,2805916944,990438539,2248945707,359635439,812864628,450684655,242755870,1606371972,3854202482,429505257,0,0,3736690781,0,851920782,3829933919,0,0
1,0.0,5.638355,0.0,1.386294,8.898229,3.218876,1.94591,1.386294,4.59512,0.0,0.693147,0.0,1.386294,2364568165,2598325497,779549537,186309082,1291264903,4268462821,1977395914,185940084,2805916944,990438539,2326518504,2558959783,3989772238,131152527,716094755,864816393,3854202482,3353078997,0,0,1624430225,0,974593739,2427856751,0,0
0,0.0,3.73767,1.098612,1.609438,8.045588,5.010635,4.174387,3.89182,4.941642,0.0,1.94591,1.94591,1.609438,98275684,648577312,3492987491,3247169921,1291264903,4222442646,1062127239,529118562,2805916944,1919877373,3299735832,3753576955,2245779768,131152527,68076599,2223606570,2399067775,1465486885,0,0,1360682,0,851920782,991444060,0,0


In [7]:
assert train_df.null_count().pipe(sum).item() == 0
assert test_df.null_count().pipe(sum).item() == 0

Taking a look at the number of unique values of categorical features.

In [8]:
unique_counts = {col: train_df[col].n_unique() for col in CriteoDatasetUtils.CAT_COLS}
sorted_unique_counts = dict(
    sorted(unique_counts.items(), key=lambda item: item[1], reverse=True)
)
sorted_unique_counts

{'C3': 1203747,
 'C12': 1047892,
 'C21': 925198,
 'C16': 759151,
 'C4': 378887,
 'C24': 90428,
 'C26': 59490,
 'C10': 50127,
 'C7': 11950,
 'C15': 11775,
 'C11': 5215,
 'C18': 4756,
 'C13': 3164,
 'C19': 2036,
 'C1': 1452,
 'C8': 628,
 'C2': 556,
 'C5': 303,
 'C25': 94,
 'C14': 26,
 'C6': 18,
 'C22': 17,
 'C23': 15,
 'C17': 10,
 'C20': 4,
 'C9': 3}

How much GPU will be needed.

In [9]:
uniq_ids = sum(sorted_unique_counts.values())
embedding_dim = 256
bytes_in_float = 4
mult = 4 # params + grads + moment1 + moment2
bytes_in_gb = 1024 * 1024 * 1024
print(f'{uniq_ids * embedding_dim * bytes_in_float * mult / bytes_in_gb} GB')

17.38335418701172 GB


## 2. Multisize Unified Embeddings

To encode categorical features, we will use Multisize Unified encoding from Google DeepMind: [Unified Embedding: Battle-Tested Feature Representations for Web-Scale ML Systems](https://arxiv.org/abs/2305.12102).

### General task

Given $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_{|D|}, y_{|D|})\}$ containing samples with $T$ categorical features with vocabularies $\{V_1, V_2, \ldots, V_T\}$. Each sample is of the form $x = [v_1, v_2, \ldots, v_T]$, where $v_i \in V_i$.

- Embedding matrix $\mathbf{E} \in \mathbb{R}^{M \times d}$ maps a sample to an embedding $g(\mathbf{x}; \mathbf{E})$.
- Hash function $h(v) : V \rightarrow [M]$ assigns a feature value to a row index (used in $g(\mathbf{x}; \mathbf{E})$).
- Model function $f(\mathbf{e}; \boldsymbol{\theta})$ converts embeddings into predictions.

Training objective:

$$\arg \min_{\mathbf{E}, \boldsymbol{\theta}} \mathcal{L}_D(\mathbf{E}, \boldsymbol{\theta}), \quad \text{where} \quad \mathcal{L}_D(\mathbf{E}, \boldsymbol{\theta}) = \sum_{(\mathbf{x},y) \in D} \ell(f(g(\mathbf{x}; \mathbf{E}); \boldsymbol{\theta}), y).$$

We use $h_t(v)$ for each feature $t \in [T]$. Let $\mathbf{e}_m$ denote the $m$-th row of $\mathbf{E}$ and $\mathbb{1}_{u,v}$ be the indicator of hash collisions between $u$ and $v$.

### How it works

For simplicity, assume $|T| = 2$.

<div style="width:90%; margin: auto;">

![](https://i.ibb.co/GKMKcvm/unified-embeddings.png)

</div>

### Why This Works (Intuition)

Consider a special case: we are solving binary classification using logistic regression:

$$y_i \in \{0, 1\}$$  
$$ D_0 = \{(x_i, y_i) \in D : y_i = 0\} $$
$$ D_1 = \{(x_i, y_i) \in D : y_i = 1\} $$
$$ C_{u,v,0} = |\{([u, v], y) \in D : y = 0\}| $$
$$ \sigma_\theta(z) = \frac{1}{1 + \exp(-\langle z, \theta \rangle)} $$
$$ z = g(x; \mathbf{E}) = [e_{h_1(x_1)}, e_{h_2(x_2)}] $$
$$ \theta = [\theta_1, \theta_2],~\theta_t \in \mathbb{R}^M $$

Binary cross-entropy loss:

$$ \mathcal{L}_D(\mathbf{E}, \theta) = - \sum_{(x,y)\in D_0} \log \left( \frac{1}{1 + \exp(-\langle \theta, g(x; \mathbf{E}) \rangle)} \right) - \sum_{(x,y)\in D_1} \log \left( \frac{1}{1 + \exp(\langle \theta, g(x; \mathbf{E}) \rangle)} \right) $$

Rewrite the loss function using co-occurrence frequencies:

$$ e_{u,v} = [e_{h_1(u)}, e_{h_2(v)}] $$

$$ \mathcal{L}_D(\mathbf{E}, \theta) = - \sum_{u\in V_1} \sum_{v\in V_2} C_{u,v,0} \log \left( \frac{1}{1 + \exp(\theta^\top e_{u,v})} \right) + C_{u,v,1} \log \left( \frac{1}{1 + \exp(\theta^\top e_{u,v})} \right) $$

After combining sigmoid terms:

$$ \mathcal{L}_D(\mathbf{E}, \theta) = - \sum_{u\in V_1} \sum_{v\in V_2} C_{u,v,0} \log \exp(\theta^\top e_{u,v}) - (C_{u,v,0} + C_{u,v,1}) \log(1 + \exp(\theta^\top e_{u,v})) $$

Now assume we train using SGD. Compute gradients with respect to embeddings. The full gradient, accounting for intra- and inter-feature interactions:

$$ \nabla_{E_{h(u)}} \mathcal{L}_D(\mathbf{E}, \theta) = $$
$$ \theta_1 \sum_{v\in V_2} C_{u,v,0} - (C_{u,v,0} + C_{u,v,1})\sigma_\theta(e_{u,v}) \tag{1}$$
$$ + \theta_1 \sum_{w\in V_1, w\neq u} \mathbb{1}_{u,w} \sum_{v\in V_2} C_{w,v,0} - (C_{w,v,0} + C_{w,v,1})\sigma_\theta(e_{u,v}) \tag{2}$$
$$ + \theta_2 \sum_{v\in V_2} \mathbb{1}_{u,v} \sum_{w\in V_1} C_{w,v,0} - (C_{w,v,0} + C_{w,v,1})\sigma_\theta(e_{w,u}) \tag{3}$$

Interpretation:

- (1) collisionless component  
- (2) intra-feature component  
- (3) inter-feature component  
- Components (2) and (3) bias the true gradient
- Intra-feature bias aligns with the collisionless gradient, so the model *cannot* remove it
- Under SGD, inter-feature bias can be mitigated if $\theta_1 \perp \theta_2 $

Reason: during SGD, $( e_{h(u)} $) is a linear combination of updates, meaning it decomposes into components along $\theta_1$ and $\theta_2$. Since:

 $$\langle\theta_1, \theta_2\rangle = 0$$

the projection $\theta_1^\top e_{h(u)}$ filters out the inter-feature component.

<div style="width:70%; margin: auto;">

![](https://i.ibb.co/xt6nbrZ3/theory-unified.png)

</div>

### Conclusion

Not all collisions are equally harmful.  
- Inter-feature collisions can be mitigated by the model, since different features have distinct learned parameters.  
- Intra-feature collisions are persistent and must be reduced via multiple hash functions.

In [10]:
class MultihashTransform:
    """
    Applys transformation to training sample
    """
    def __init__(self, cardinality, seeds=None, name='sparse'):
        assert seeds is not None
        self._cardinality = cardinality
        self._name = name
        self._seeds = torch.tensor(seeds)

    def __call__(self, sample: dict[str, tp.Any]) -> dict[str, tp.Any]:
        sample[self._name] = (
            (sample[self._name].unsqueeze(1) + self._seeds) % self._cardinality
        ).long().reshape(-1)
        return sample

In [10]:
seeds = [
    [2342 + 13 * i, 7777 + 17 * i]
    for i in range(26)
]
transform = MultihashTransform(10, seeds)
input = {
    'label': torch.tensor(1),
    'dense': torch.randn(13),
    'sparse': torch.arange(26)
}
output = transform(input)
assert output['sparse'].shape == (2 * 26,)

In [11]:
print(output['sparse'])

tensor([2, 7, 6, 5, 0, 3, 4, 1, 8, 9, 2, 7, 6, 5, 0, 3, 4, 1, 8, 9, 2, 7, 6, 5,
        0, 3, 4, 1, 8, 9, 2, 7, 6, 5, 0, 3, 4, 1, 8, 9, 2, 7, 6, 5, 0, 3, 4, 1,
        8, 9, 2, 7])


In [11]:
class UnifiedEmbeddings(nn.Module):
    def __init__(self, cardinality, embedding_dim):
        super().__init__()
        self._cardinality = cardinality
        self._embedding_dim = embedding_dim
        self.embeddings = nn.Embedding(
            num_embeddings=cardinality, embedding_dim=embedding_dim
        )

    def forward(self, ids: torch.tensor):
        # ids shape: [batch_size, num_features]
        return self.embeddings(ids)

## 3. Picewise Linear Encoding

# Piecewise Linear Encoding (PLE) for Numerical Features

We use *Piecewise Linear Encoding (PLE)* from Yandex Research:
“On Embeddings for Numerical Features in Tabular Deep Learning”
https://arxiv.org/abs/2203.05556

GitHub implementation:
https://github.com/yandex-research/rtdl-num-embeddings

---

## 1. Overview

Numerical features are embedded using functions of the form:

    z_i = f_i(x_i^(num)) ∈ R^{d_i}

Where:
- f_i(x) — embedding function for the i-th numerical feature
- z_i — resulting embedding vector
- d_i — embedding dimension for that feature

### Key properties:
- Each numerical feature is embedded independently
- In MLP architectures, embeddings are concatenated
- In Transformer architectures, embeddings are used directly (as tokens)

---

## 2. Piecewise Linear Encoding (PLE)

PLE divides the numerical feature range into T bins:

    [b_0, b_1], [b_1, b_2], …, [b_{T−1}, b_T]

The encoding is:

    PLE(x) = [e_1, e_2, ..., e_T] ∈ R^T

Each element e_t is defined as:

                0,                           if x < b_{t−1} and t > 1
    e_t(x) =    1,                           if x ≥ b_t     and t < T
                (x − b_{t−1}) / (b_t − b_{t−1}), otherwise

### Important properties:
- When T = 1, PLE reduces to the raw scalar (identity transform)
- Unlike categorical embeddings, PLE respects ordering
- PLE behaves similarly to *learnable preprocessing*

---

## 3. Using PLE in Transformer Models

Transformers require fixed-dimensional vector tokens.

For each bin B_t:
- we learn an embedding vector v_t ∈ R^d

The final embedding for a numerical feature is:

    f_i(x) = v_0 + Σ_{t=1}^T e_t(x) · v_t
           = Linear(PLE(x))

This makes PLE directly compatible with attention-based models.

---

## 4. How to Choose the Bin Boundaries

The standard approach is quantile binning:

    b_t = q_t( {x_j^(num)} over training set )

Where q_t is the empirical quantile.

This ensures:
- bins contain similar numbers of samples
- continuous features are encoded smoothly
- rare extreme values do not collapse into a single region

---

## Summary

PLE offers:
- Smooth and expressive numerical embeddings
- Better performance than raw numeric inputs or normalization
- Easy integration into both MLPs and Transformers
- Learnable representation that respects feature ordering

In [12]:
class PiecewiseLinearEncodingTransform:
    """
    Applys transformation to training sample
    """
    @staticmethod
    def compute_bins(
        X: torch.Tensor,
        n_bins: int,
    ) -> list[torch.Tensor]:
        bins = [
            q.unique() # 1D tensor of quantile boundaries
            for q in torch.quantile(
                X, torch.linspace(0.0, 1.0, n_bins + 1).to(X), dim=0
            ).T
        ]
        return bins

    def __init__(self, dense_train_df, n_bins=32, name='dense'):
        self._name = name
        self._bins = PiecewiseLinearEncodingTransform.compute_bins(dense_train_df.to_torch(), n_bins)
        n_features = len(self._bins)
        self._n_bins = [len(x) - 1 for x in self._bins]
        max_n_bins = max(self._n_bins)

        self.weight = torch.zeros(n_features, max_n_bins)
        self.bias = torch.zeros(n_features, max_n_bins)

        for i, bin_edges in enumerate(self._bins):
            bin_width = bin_edges.diff()
            w = 1.0 / bin_width
            b = -bin_edges[:-1] / bin_width
            self.weight[i, -1] = w[-1]
            self.bias[i, -1] = b[-1]
            self.weight[i, :self._n_bins[i] - 1] = w[:-1]
            self.bias[i, :self._n_bins[i] - 1] = b[:-1]

    @property
    def n_bins(self):
        return self._n_bins

    def __call__(self, sample: dict[str, tp.Any]) -> dict[str, tp.Any]:
        x = sample[self._name].to(torch.float32).unsqueeze(0)
        x = torch.addcmul(self.bias, self.weight, x[..., None])
        x = torch.cat(
            [
                x[..., :1].clamp_max(1.0), # leftmost bin
                x[..., 1:-1].clamp(0.0, 1.0), # middle bins
                x[..., -1:].clamp_min(0.0) # rightmost bin
            ],
            dim=-1,
        )
        x = x.flatten(-2).squeeze(0)
        sample[self._name] = x

        return sample

Example:

**weight** =
\begin{bmatrix}
\frac{1}{b_1 - b_0} & \frac{1}{b_2 - b_1} & \frac{1}{b_3 - b_2} & \frac{1}{b_4 - b_3} \\
\frac{1}{c_1 - c_0} & \frac{1}{c_2 - c_1} & \frac{1}{c_3 - c_2} & \frac{1}{c_4 - c_3}
\end{bmatrix}

**bias** =
\begin{bmatrix}
-\frac{b_0}{b_1 - b_0} & -\frac{b_1}{b_2 - b_1} & -\frac{b_2}{b_3 - b_2} & -\frac{b_3}{b_4 - b_3} \\
-\frac{c_0}{c_1 - c_0} & -\frac{c_1}{c_2 - c_1} & -\frac{c_2}{c_3 - c_2} & -\frac{c_3}{c_4 - c_3}
\end{bmatrix}


In [13]:
transform = PiecewiseLinearEncodingTransform(train_df[CriteoDatasetUtils.INT_COLS], name='dense')
input = {
    'label': torch.tensor(1),
    'dense': torch.randn(13),
    'sparse': torch.arange(26)
}
output = transform(input)
assert output['dense'].shape == (403,)

In [14]:
class PiecewiseLinearEncoding(nn.Identity):
    pass

## 4. DCN v2 - deep cross network

For the aggregation of categorical and numerical features into one scalar, we're going to be using DCNv2 from Google DeepMind: [DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems](https://arxiv.org/abs/2008.13535).

### Approach

<div style="width:50%; margin: auto;">

![](https://i.ibb.co/ZqfF5yf/dcn-v2.png)
![](https://i.ibb.co/SDYWNSMy/dcn-v2-equation.png)

</div>

Stacked variant:

- First, a sequence of cross layers:
  $$
  x_{i+1} = x_0 \odot (W x_i + b) + x_i.
  $$

- Then, a sequence of deep (fully connected) layers:
  $$
  h_{l+1} = f(W_l h_l + b_l).
  $$


### Why it works and what it’s for

- We know that explicit feature crosses (feature–feature interactions) are important in many recommendation and ranking tasks.
- A plain DNN mostly learns implicit interactions and can approximate dot-products and higher-order crosses only with deep, wide networks, which are expensive at inference time.
- CrossNet adds explicit, low-rank feature interactions via the cross layers.  
  This reduces the need for very deep DNN towers, making the model:
  - easier to train,
  - cheaper to serve in production,
  - and better aligned with tasks where cross features matter a lot (e.g., user–item, user–context interactions).

<div style="width:50%; margin: auto;">

![](https://i.ibb.co/K34HqJH/dcn-theory.png)

</div>

In [15]:
class CrossLayer(torch.nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.linear = nn.Linear(input_dim, input_dim)

    def forward(self, x0, xl):
        return x0 * self.linear(xl) + xl


class CrossNetwork(torch.nn.Module):
    def __init__(self, input_dim, num_layers):
        super().__init__()
        self.layers = nn.ModuleList([CrossLayer(input_dim) for _ in range(num_layers)])

    def forward(self, x):
        xl = x
        for layer in self.layers:
            xl = layer(x, xl)
        return xl


class DeepNetwork(torch.nn.Module):
    def __init__(self, input_dim, hidden_units):
        super().__init__()
        layers = []
        for units in hidden_units:
            layers.append(nn.Linear(input_dim, units))
            layers.append(nn.ReLU())
            input_dim = units
        self.network = nn.Sequential(*layers)

    def forward(self, x):
        return self.network(x)


class DCNV2(nn.Module):
    def __init__(self, embedding_size, cross_layers, deep_units, input_size, cardinality=65536):
        super().__init__()
        self.sparse_encode_layer = UnifiedEmbeddings(cardinality, embedding_size)
        self.dense_encode_layer = PiecewiseLinearEncoding()
        self.cross_network = CrossNetwork(input_size, cross_layers)
        self.deep_network = DeepNetwork(input_size, deep_units)
        self.output_layer = nn.Linear(deep_units[-1], 1)

    def forward(self, dense_input, sparse_input):
        sparse_embeddings = self.sparse_encode_layer(sparse_input).view(sparse_input.size(0), -1)
        dense_embeddings = self.dense_encode_layer(dense_input)
        combined_input = torch.cat([dense_embeddings, sparse_embeddings], dim=-1)
        cross_output = self.cross_network(combined_input)
        deep_output = self.deep_network(cross_output)
        return self.output_layer(deep_output).squeeze(dim=-1)

## 5. Training Neural Ranking

In [16]:
class CriteoDataset(Dataset):
    def __init__(
            self,
            df: pl.DataFrame,
            transforms: list[tp.Callable[[tp.Any], tp.Any]] = None,
    ):
        self._labels = df[CriteoDatasetUtils.LABEL_COl].to_torch().to(torch.float32)
        self._dense = df[CriteoDatasetUtils.INT_COLS].to_torch()
        self._sparse = df[CriteoDatasetUtils.CAT_COLS].to_torch()
        self._transforms = (
            transforms if transforms is not None else []
        )

    def __len__(self):
        return self._labels.size(0)

    def __getitem__(self, idx):
        sample = {
            'label': self._labels[idx],
            'dense_features': self._dense[idx],
            'sparse_features': self._sparse[idx]
        }
        for transform in self._transforms:
            sample = transform(sample)
        return sample

In [17]:
batch_size = 4096
cardinality = 8 * 65536
seeds = [[2342 + 13 * i, 7777 + 17 * i, 131 + 833 * i] for i in range(len(CriteoDatasetUtils.CAT_COLS))]
num_hashes = 3
embedding_size = 64
n_bins = 39

In [18]:
dense_transform = PiecewiseLinearEncodingTransform(train_df[CriteoDatasetUtils.INT_COLS], n_bins, name='dense_features')
sparse_transform = MultihashTransform(cardinality, seeds, name='sparse_features')
transforms = [dense_transform, sparse_transform]

train_dataset = CriteoDataset(train_df, transforms)
val_dataset = CriteoDataset(test_df, transforms)

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=2)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False, num_workers=2)

In [19]:
def train_model(model, train_loader, val_loader, epochs=5, lr=0.001):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = model.to(device)

    criterion = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    for epoch in range(epochs):
        # Training
        model.train()
        train_loss = 0
        for batch in tqdm(train_loader):
            int_features, cat_features, labels = batch['dense_features'].to(device), batch['sparse_features'].to(device), batch['label'].to(device)

            optimizer.zero_grad()
            outputs = model(int_features, cat_features)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            train_loss += loss.item()

        # Validation
        model.eval()
        val_loss, correct, total = 0, 0, 0
        all_scores, all_labels = [], []
        with torch.no_grad():
            for batch in tqdm(val_loader):
                int_features, cat_features, labels = batch['dense_features'].to(device), batch['sparse_features'].to(device), batch['label'].to(device)

                outputs = model(int_features, cat_features)
                loss = criterion(outputs, labels)

                val_loss += loss.item()
                predicted = (outputs > 0.5).float()
                total += labels.size(0)
                correct += (predicted == labels).sum().item()

                all_scores.append(outputs.clone().cpu())
                all_labels.append(labels.clone().cpu())
            all_scores = torch.cat(all_scores, dim=-1)
            all_labels = torch.cat(all_labels, dim=-1)


        print(f'Epoch {epoch+1}/{epochs}, Train Loss: {train_loss/len(train_loader):.4f}, '
              f'Val Loss: {val_loss/len(val_loader):.4f}, Accuracy: {100*correct/total:.2f}%, '
              f'Val ROC AUC: {roc_auc_score(all_labels, all_scores)}')

In [20]:
input_size = max(dense_transform.n_bins) * len(CriteoDatasetUtils.INT_COLS) + num_hashes * embedding_size * len(CriteoDatasetUtils.CAT_COLS)
model = DCNV2(
    embedding_size=embedding_size,
    cross_layers=3,
    deep_units=[1024, 1024, 1024],
    input_size=input_size,
    cardinality=cardinality,
)
print(input_size)

5486


In [21]:
train_model(model, train_loader, val_loader, epochs=1, lr=1e-4)

100%|██████████| 960/960 [11:31<00:00,  1.39it/s]
100%|██████████| 160/160 [01:18<00:00,  2.03it/s]


Epoch 1/1, Train Loss: 0.4660, Val Loss: 0.4611, Accuracy: 77.69%, Val ROC AUC: 0.790797975158338


## 6. Comparing with CatBoost (same features)

In [1]:
! pip install catboost

Collecting catboost
  Downloading catboost-1.2.8-cp312-cp312-manylinux2014_x86_64.whl.metadata (1.2 kB)
Downloading catboost-1.2.8-cp312-cp312-manylinux2014_x86_64.whl (99.2 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m99.2/99.2 MB[0m [31m8.2 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: catboost
Successfully installed catboost-1.2.8


In [2]:
from catboost import CatBoostClassifier, Pool

In [None]:
X_train = train_df.drop('label').to_pandas()
y_train = train_df['label'].to_pandas()
X_test = test_df.drop('label').to_pandas()
y_test = test_df['label'].to_pandas()

train_pool = Pool(X_train, y_train, cat_features=CriteoDatasetUtils.CAT_COLS)
test_pool = Pool(X_test, y_test, cat_features=CriteoDatasetUtils.CAT_COLS)

model = CatBoostClassifier(
    iterations=1000,
    loss_function="Logloss",
    eval_metric="AUC",
    early_stopping_rounds=50,
    task_type="GPU"
)
model.fit(train_pool, eval_set=test_pool, use_best_model=True)

Learning rate set to 0.035238


Default metric period is 5 because AUC is/are not implemented for GPU


0:	test: 0.7171938	best: 0.7171938 (0)	total: 2.54s	remaining: 42m 20s
1:	total: 4.94s	remaining: 41m 7s
2:	total: 7.42s	remaining: 41m 7s
3:	total: 9.82s	remaining: 40m 45s
4:	total: 12.3s	remaining: 40m 38s
5:	test: 0.7310488	best: 0.7310488 (5)	total: 14.7s	remaining: 40m 40s
6:	total: 16.3s	remaining: 38m 26s
7:	total: 18.8s	remaining: 38m 53s
8:	total: 21.4s	remaining: 39m 14s
9:	total: 23.6s	remaining: 38m 54s
10:	test: 0.7424861	best: 0.7424861 (10)	total: 26s	remaining: 39m
11:	total: 28.5s	remaining: 39m 6s
12:	total: 31s	remaining: 39m 11s
13:	total: 33.4s	remaining: 39m 14s
14:	total: 35.9s	remaining: 39m 19s
15:	test: 0.7461343	best: 0.7461343 (15)	total: 37.6s	remaining: 38m 30s
16:	total: 39.7s	remaining: 38m 13s
17:	total: 42.1s	remaining: 38m 16s
18:	total: 44.3s	remaining: 38m 6s
19:	total: 46.4s	remaining: 37m 54s
20:	test: 0.7499114	best: 0.7499114 (20)	total: 49.1s	remaining: 38m 9s
21:	total: 51.3s	remaining: 37m 58s
22:	total: 53.4s	remaining: 37m 47s
23:	total: 5

From [DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems](https://arxiv.org/abs/2008.13535):
<div style="width:50%; margin: auto;">

![](https://i.ibb.co/HDHJ8Nzq/level-improvement.png)
![](https://i.ibb.co/fYpyrKBs/table.png)

</div>