# Introduction

This Jupyter Notebook demonstrates how to build a ranking model from scratch using the Kernelized Neural Ranking Model (KNRM). We will train the KNRM model on the Quora question pairs dataset, which contains pairs of questions from Quora and labels indicating whether the questions are duplicates of each other.

Ranking models are crucial in search engines, recommender systems, and other applications where items need to be ordered based on their relevance to a particular query. While it is possible to use general-purpose models for ranking, specialized models, such as KNRM, are designed to better capture the complex patterns and relationships in ranking tasks.

In this notebook, we will walk through the process of installing dependencies, loading and preparing data, building the model, and training it. Finally, we will show how to evaluate the model using the normalized discounted cumulative gain (NDCG) metric.

## Why separate models for ranking

Ranking tasks differ from ordinary classification or regression: what matters is the relative order of the candidates for a query, not the absolute score assigned to each one. General-purpose models are not built around this requirement. Ranking models such as KNRM are designed and evaluated with ordering in mind, which is why they tend to perform better in this domain.



# Install dependencies

To run this notebook, you will need to install the following dependencies using pip:

- numpy==1.19.2
- pandas
- torch==1.7.1
- tqdm (used for the progress bar in the training loop)

In [None]:
! pip install numpy==1.19.2 pandas torch==1.7.1 tqdm

In [13]:
import pandas as pd
import torch
import numpy as np

# some local modules where the complex code parts are implemented
import aux
import dataset
from knrm import KNRM

# Quora question pairs dataset

The Quora question pairs dataset contains over 400,000 pairs of questions from the Quora platform, with binary labels indicating whether the questions are duplicates of each other. This dataset is an excellent choice for training a search relevance model because it provides a large number of diverse question pairs and relevance labels.

To train our relevance ranking model, we will generate a new dataset that includes relevance levels for each pair. We will categorize the pairs into three groups:

- Positive pairs, representing highly relevant samples;
- Negative pairs, representing samples with low relevance;
- Auto-generated pairs between random questions in the dataset, representing non-relevant samples.

These three groups are then mixed together into a single dataset.
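A minimal sketch of how the three groups could be combined into relevance-labelled triples. The actual logic lives in the local `dataset` module and is not shown in this notebook, so the function name `build_samples` and the relevance values 2, 1, and 0 are assumptions made for illustration.

```python
import random

def build_samples(pairs, n_random_per_question=1, seed=0):
    """Turn labelled pairs into (left_id, right_id, relevance) triples.

    pairs: iterable of (id_left, id_right, label) with label in {0, 1}.
    Relevance levels (an assumption for this sketch):
    2 = duplicate pair, 1 = labelled non-duplicate, 0 = random pair.
    """
    rng = random.Random(seed)
    pairs = list(pairs)
    known = {(left, right) for left, right, _ in pairs}
    all_ids = sorted({i for left, right, _ in pairs for i in (left, right)})

    # Positive and negative pairs keep their labels as relevance 2 and 1.
    samples = [(left, right, 2 if label == 1 else 1) for left, right, label in pairs]

    # Add random pairs as non-relevant samples, skipping pairs that are already labelled.
    # Scanning all ids per question is fine for a toy example but slow for large data.
    for left in sorted({left for left, _, _ in pairs}):
        candidates = [i for i in all_ids if i != left and (left, i) not in known]
        for right in rng.sample(candidates, min(n_random_per_question, len(candidates))):
            samples.append((left, right, 0))
            known.add((left, right))

    rng.shuffle(samples)
    return samples

toy_pairs = [(1, 2, 1), (3, 4, 0), (5, 6, 1)]
samples = build_samples(toy_pairs)
```

Labelled non-duplicates get a higher relevance than random pairs because two questions that were worth comparing are usually on a related topic.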

In [None]:
! mkdir -p resources && \
  wget https://dl.fbaipublicfiles.com/glue/data/QQP-clean.zip -O resources/QQP-clean.zip && \
  unzip -o resources/QQP-clean.zip 'QQP/train.tsv' 'QQP/dev.tsv' -d resources

In [9]:
quora_dir = './resources/QQP'

col_names = ['id', 'id_left', 'id_right', 'text_left', 'text_right', 'label']
train_df = pd.read_csv(f"{quora_dir}/train.tsv", sep='\t', names=col_names, skiprows=1)
print("Train size:", len(train_df))
test_df = pd.read_csv(f"{quora_dir}/dev.tsv", sep='\t', names=col_names, skiprows=1)
print("Test size:", len(test_df))
print('Dataset sample:')
train_df.head(2)

Train size: 363846
Test size: 40430
Dataset sample:


| | id | id_left | id_right | text_left | text_right | label |
|---|---|---|---|---|---|---|
| 0 | 133273 | 213221 | 213222 | How is the life of a math student? Could you d... | Which level of prepration is enough for the ex... | 0 |
| 1 | 402555 | 536040 | 536041 | How do I control my horny emotions? | How do you control your horniness? | 1 |


To process the text data in the Quora question pairs dataset, we will first build a vocabulary containing all unique words in the dataset. This vocabulary will be used to convert the text data into numerical representations that can be fed into our model.
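As an illustration of the idea, here is a small sketch of building a frequency-sorted vocabulary and encoding a question as token indices. The real `aux.build_vocabulary` is a local helper whose code is not shown, so the tokenizer, the function names, and the `max_len` padding below are assumptions.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_vocabulary_sketch(texts, min_count=1):
    """Return tokens sorted by descending frequency, with PAD and OOV at indices 0 and 1."""
    counts = Counter(token for text in texts for token in tokenize(text))
    frequent = [tok for tok, n in counts.most_common() if n >= min_count]
    return ["PAD", "OOV"] + frequent

def encode(text, token_to_idx, max_len=30):
    """Map tokens to indices: OOV (1) for unknown tokens, PAD (0) to fill up to max_len."""
    ids = [token_to_idx.get(tok, 1) for tok in tokenize(text)][:max_len]
    return ids + [0] * (max_len - len(ids))

texts = ["What is the best way to learn Python?", "What is the capital of France?"]
vocab = build_vocabulary_sketch(texts)
token_to_idx = {tok: i for i, tok in enumerate(vocab)}
encoded = encode("What is Rust?", token_to_idx, max_len=5)
print(vocab[:5])  # ['PAD', 'OOV', 'what', 'is', 'the']
print(encoded)    # [2, 3, 1, 0, 0]
```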


In [11]:
vocabulary_list = aux.build_vocabulary(train_df)
print("Vocabulary len:", len(vocabulary_list))
print("10 first tokens:")
vocabulary_list[:10]

Vocabulary len: 82459
10 first tokens:


['PAD', 'OOV', 'the', 'what', 'is', 'a', 'i', 'to', 'in', 'how']

# Embeddings

In this example, we use the GloVe (Global Vectors for Word Representation) pre-trained embeddings. GloVe is an unsupervised learning algorithm that obtains vector representations for words. These embeddings capture semantic and syntactic similarities between words, which can be useful for our ranking model.

We will download the GloVe pre-trained embeddings and create a matrix containing the embeddings for each word in our vocabulary. This matrix will be used as the initial weights for our model's embedding layer.
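The sketch below shows one way such a matrix could be assembled. It is not the code of `aux.create_word_embeddings`, which is not shown here. The all-zero PAD row, the random range for words missing from GloVe, and the function name are assumptions.

```python
import io
import numpy as np

def create_embedding_matrix_sketch(glove_file, vocabulary, dim=50, seed=0):
    """Build a (len(vocabulary), dim) matrix whose rows follow the vocabulary order.

    glove_file: an open text file in GloVe format ("word v1 v2 ... v_dim" per line).
    Row 0 (PAD) stays all-zero; words missing from GloVe keep small random vectors.
    """
    rng = np.random.default_rng(seed)
    glove = {}
    for line in glove_file:
        word, *values = line.rstrip().split(" ")
        glove[word] = np.asarray(values, dtype=np.float32)

    # Start from random vectors so that unknown words still get a usable embedding.
    matrix = rng.uniform(-0.2, 0.2, size=(len(vocabulary), dim)).astype(np.float32)
    matrix[0] = 0.0  # PAD
    # Skip the two special tokens and copy pre-trained vectors where available.
    for idx, word in enumerate(vocabulary[2:], start=2):
        if word in glove:
            matrix[idx] = glove[word]
    return matrix

# Tiny in-memory stand-in for glove.6B.50d.txt with 3-dimensional vectors.
fake_glove = io.StringIO("the 0.1 0.2 0.3\nwhat 0.4 0.5 0.6\n")
vocab = ["PAD", "OOV", "the", "what", "quora"]
matrix = create_embedding_matrix_sketch(fake_glove, vocab, dim=3)
```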

In [None]:
! mkdir -p resources && \
  wget http://nlp.stanford.edu/data/glove.6B.zip -O resources/glove.6B.zip && \
  unzip -o resources/glove.6B.zip glove.6B.50d.txt -d resources

In [11]:
glove_path = './resources/glove.6B.50d.txt'
embeddings_matrix = aux.create_word_embeddings(glove_path, vocabulary_list)
print(embeddings_matrix.shape)
embeddings_matrix

(82461, 50)


array([[ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.11888963,  0.19735816, -0.19424166, ..., -0.04189473,
        -0.17693753,  0.16185388],
       [-0.05232376,  0.13807736,  0.14311486, ...,  0.01411284,
        -0.18726017,  0.11294341],
       ...,
       [-0.20304   , -0.42267   , -0.89636   , ...,  1.0393    ,
        -0.11743   , -0.60719   ],
       [-0.10024228, -0.09543935, -0.15954408, ...,  0.05721237,
        -0.18037817,  0.03473483],
       [ 0.11142   ,  0.54765   ,  0.45764   , ..., -0.95053   ,
        -0.66657   ,  0.72248   ]])

# KNRM model

The Kernelized Neural Ranking Model (KNRM) is a neural model designed specifically for ranking tasks. It compares the word embeddings of a query with those of a candidate text, summarizes the resulting word-to-word similarities with a set of kernels, and turns that summary into a single relevance score that can be used to order candidates for a given query.

![topology](https://raw.githubusercontent.com/AdeDZY/K-NRM/master/model_simplified-1.png)

The KNRM model consists of an embedding layer, a kernelized matching layer, and a fully connected output layer. In this notebook, we will create a KNRM model using the pre-trained GloVe embeddings and train it on the Quora question pairs dataset.

For details, see the original paper: [End-to-End Neural Ad-hoc Ranking with Kernel Pooling](https://arxiv.org/pdf/1706.06613.pdf)
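To make the kernelized matching layer more concrete, here is a NumPy sketch of kernel pooling for a single query/document pair, following the description in the paper. It is not the code of the local `knrm` module. The evenly spaced kernel centres, the shared `sigma`, and the absence of padding tokens are simplifying assumptions.

```python
import numpy as np

def kernel_pooling(query_emb, doc_emb, mus, sigma=0.1):
    """Compute soft-match features for one query/document pair.

    query_emb: (n_q, dim) word embeddings of the query.
    doc_emb:   (n_d, dim) word embeddings of the document.
    mus:       (K,) kernel centres in [-1, 1].
    Returns a (K,) feature vector that a small MLP can map to a score.
    """
    # Translation matrix: cosine similarity between every query and document word.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    sim = q @ d.T  # (n_q, n_d)

    # RBF kernels: each kernel softly counts similarities close to its centre.
    diff = sim[:, :, None] - mus[None, None, :]  # (n_q, n_d, K)
    kernels = np.exp(-diff ** 2 / (2 * sigma ** 2))

    # Sum over document words, then take the log and sum over query words.
    per_query_word = kernels.sum(axis=1)  # (n_q, K)
    return np.log(np.clip(per_query_word, 1e-10, None)).sum(axis=0)

rng = np.random.default_rng(0)
query = rng.normal(size=(3, 50))
doc = rng.normal(size=(7, 50))
mus = np.linspace(-1.0, 1.0, 21)  # 21 kernels, the same count as kernel_num=21 in this notebook
features = kernel_pooling(query, doc, mus)

# Sanity check: identical single words have similarity 1, so a kernel centred at 1 gives log(1) = 0.
exact = kernel_pooling(np.ones((1, 4)), np.ones((1, 4)), np.array([1.0]))
```

Kernels centred near 1 respond to exact or near-exact word matches, while kernels centred at lower values capture weaker, softer matches.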

In [12]:
model = KNRM(embeddings_matrix,
             freeze_embeddings=True,
             out_layers=[10,5],
             kernel_num=21)

KNRM embeddings is created
KNRM kernels is created
KNRM mlp is created


# Datasets preparation

To efficiently train and evaluate our model, we will create PyTorch DataLoaders for the training and test datasets. DataLoaders are useful because they handle batching, shuffling, and loading of data in parallel, making it easier to work with large datasets.



In [15]:
train_dataloader = dataset.make_train_dataloader(train_df, vocabulary_list)
test_dataloader = dataset.make_test_dataloader(test_df, vocabulary_list)

<torch.utils.data.dataloader.DataLoader at 0x7f8aa982be10>

# Model training


In this section, we will train our KNRM model using a training loop that iterates over the training dataset for multiple epochs. During each epoch, the model will be updated using the stochastic gradient descent (SGD) optimizer and the binary cross-entropy loss function. After each epoch, we will evaluate the model's performance on the test dataset using the NDCG metric.
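
For reference, and assuming the binary cross-entropy is averaged over the batch (the default of `torch.nn.BCELoss`) and plain SGD without momentum is used, one update step over a batch of $N$ samples can be written as follows. Here $\hat{y}_i$ is the model's predicted probability, $y_i$ is the target, $\theta$ denotes the trainable parameters, and $\eta$ is the learning rate:

$$
\mathcal{L}(\theta) = -\frac{1}{N}\sum_{i=1}^{N}\bigl[\,y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\,\bigr],
\qquad
\theta \leftarrow \theta - \eta\,\nabla_\theta \mathcal{L}(\theta)
$$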







## NDCG metric
Normalized Discounted Cumulative Gain (NDCG) is a widely used metric in ranking tasks. It evaluates the quality of a ranking by considering the relevance of each item in the ranked list and its position. NDCG is particularly suitable for ranking tasks because it takes into account both the order and relevance of items in the ranked list, making it more informative than other metrics like precision or recall.
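
A minimal pure-Python sketch of NDCG@k for one query. The `ndcg_k` function imported from the local `metrics` module is not shown in this notebook, so the exponential gain `2^rel - 1`, the default `k=10`, and the function names here are assumptions and may differ from that implementation.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k items, using the gain 2^rel - 1."""
    return sum((2 ** rel - 1) / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))

def ndcg_at_k(true_relevance, predicted_scores, k=10):
    """NDCG@k: DCG of the predicted ordering divided by the DCG of the ideal ordering."""
    # Order the true relevance labels by descending predicted score.
    order = sorted(range(len(predicted_scores)), key=lambda i: predicted_scores[i], reverse=True)
    ranked = [true_relevance[i] for i in order]
    ideal = sorted(true_relevance, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked, k) / ideal_dcg if ideal_dcg > 0 else 0.0

perfect = ndcg_at_k([2, 1, 0], [0.9, 0.5, 0.1])         # scores agree with the labels
reversed_order = ndcg_at_k([2, 1, 0], [0.1, 0.5, 0.9])  # most relevant item ranked last
print(round(perfect, 3), round(reversed_order, 3))  # 1.0 0.587
```

The logarithmic discount is what makes the metric position-aware: placing the most relevant item last costs far more than swapping two items near the bottom of the list.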



In [18]:
from metrics import ndcg_k

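def evaluate(model: torch.nn.Module, data: torch.utils.data.DataLoader) -> float:
    """Return the mean NDCG over all left-hand questions (queries) in `data`."""
    # Each sample is a (left_id, right_id, relevance) triple
    labels_and_groups = data.dataset.samples_list
    labels_and_groups = pd.DataFrame(labels_and_groups, columns=['left_id', 'right_id', 'rel'])

    # Score every batch in evaluation mode, without tracking gradients
    model.eval()
    with torch.no_grad():
        pred = [model.predict(batch).cpu().numpy() for batch, _ in data]
    pred = np.concatenate(pred, axis=0)
    labels_and_groups['pred'] = pred

    # Compute NDCG per query, then average over queries
    ndcg_list = [ndcg_k(df.rel.values, df.pred.values) for _, df in labels_and_groups.groupby('left_id')]
    mean_ndcg = float(np.mean(ndcg_list))
    return mean_ndcg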


## Training loop

The training loop iterates over the training dataset for a specified number of epochs. During each epoch, the model's parameters are updated using the optimizer and the loss function. After each epoch, we evaluate the model's performance on the test dataset using the NDCG metric to track its progress.



In [20]:
from tqdm.notebook import tqdm

n_epochs = 20
learning_rate = 0.01

In [None]:
opt = torch.optim.SGD(model.parameters(), lr=learning_rate)
loss_fn = torch.nn.BCELoss()

for epoch in tqdm(range(n_epochs)):
    model.train()
    for left_batch, right_batch, y_true in train_dataloader:
        opt.zero_grad()
        # Forward pass on the batch, then backpropagate the loss
        y_pred = model(left_batch, right_batch)
        batch_loss = loss_fn(y_pred, y_true)
        batch_loss.backward()
        opt.step()
    # Track ranking quality on the test set after every epoch
    ndcg_score = evaluate(model, test_dataloader)
    print(f"Epoch {epoch}. Test ndcg: {ndcg_score:.3f}")

# Final words

Once the model is trained, we can use it for various applications, such as predicting the relevance of items in a search engine or a recommender system. To use the trained model, we need to save its parameters and load them back when needed.



In [None]:
with open('resources/mlp_weights01.pkl', 'wb') as f:
    torch.save(model.mlp.state_dict(), f)

with open('resources/embeddings.pkl', 'wb') as f:
    torch.save(model.embeddings.state_dict(), f)

To load the saved model parameters, we can use PyTorch's torch.load() function and the load_state_dict() method:

In [None]:
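# Paths of the files written in the previous cell
embeddings_knrm_path = 'resources/embeddings.pkl'
mlp_path = 'resources/mlp_weights01.pkl'

with open(embeddings_knrm_path, 'rb') as f:
    emb_dict = torch.load(f, map_location=torch.device('cpu'))

with open(mlp_path, 'rb') as f:
    mlp_dict = torch.load(f, map_location=torch.device('cpu'))

# Rebuild the model with the same settings as above, then restore the saved weights
knrm_model = KNRM(embeddings_matrix, freeze_embeddings=True, out_layers=[10, 5], kernel_num=21)
knrm_model.embeddings.load_state_dict(emb_dict)
knrm_model.mlp.load_state_dict(mlp_dict)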

You can now use the trained model for ranking tasks by providing input data in the same format as the training data. The model outputs relevance scores, which can be used to order items by their relevance to a given query.



