<a href="https://colab.research.google.com/github/polinak1r/Dense-Retrieval-with-BERT/blob/main/Dense_Retrieval_with_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Dense Retrieval with BERT
This notebook implements a dense retrieval pipeline for semantic search using a pretrained RoBERTa model.  

1. **Data Preparation**  
   Load the documents, queries, and relevance judgments (qrels). For quick experiments, subsample the document collection.

2. **Model Initialization and Embedding Computation**  
   Load and explore RoBERTa from HuggingFace Transformers. Compute contextual embeddings for queries and documents via masked mean pooling over token-level outputs, followed by per-batch scaling of each feature by its standard deviation and L2 normalization.

3. **Similarity Scoring and Retrieval Evaluation**  
   Compute pairwise similarities between query and document embeddings using dot product. Rank documents per query and evaluate retrieval performance using the PFound metric, which reflects ranked relevance under user browsing behavior.

4. **Inference and Submission**  
   Run the pipeline on the test queries, score every query-document pair, and export the results in Kaggle submission format.


### Data loading

In [None]:
import json
import torch
from pathlib import Path
import random

import numpy as np
import pandas as pd
from tqdm import tqdm

data_dir = Path('/kaggle/input/nlp-nup-2025-hw2/')

In [None]:
docs = []
with open(data_dir / 'documents.jsonl') as fp:
    for line in tqdm(fp, total=367840):
        docs.append(json.loads(line))

with open(data_dir / 'queries_train.json') as fp:
    queries = json.load(fp)

with open(data_dir / 'qrels_train.json') as fp:
    qrels = json.load(fp)

100%|██████████| 367840/367840 [00:05<00:00, 63692.30it/s]


### Reduced sample (1:10 positive and negative) for quick experiments

In [None]:
import random

count = 10  # number of negative examples per positive example (currently 1:10)
seed = 42

pos_doc_ids = {
    rec['doc_id']
    for rec in qrels
    if rec.get('relevance', 0) > 0
}
all_doc_ids = {doc['id'] for doc in docs}

random.seed(seed)
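# Sort the candidates first: set iteration order is not guaranteed to be stable
# between runs, so sampling from an unsorted set would ignore the seed in practice.
neg_sample = set(random.sample(sorted(all_doc_ids - pos_doc_ids), count * len(pos_doc_ids)))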

docs = [doc for doc in docs if doc['id'] in pos_doc_ids or doc['id'] in neg_sample]
print(f'Filtered docs length {len(docs)}')

qrels = [rec for rec in qrels if rec['doc_id'] in pos_doc_ids or rec['doc_id'] in neg_sample]
print(f'Filtered qrels length {len(qrels)}')

valid_query_ids = {rec['query_id'] for rec in qrels}
queries = [q for q in queries if q['query_id'] in valid_query_ids]
print(f'Filtered queries length {len(queries)}')

Filtered docs length 26664
Filtered qrels length 2711
Filtered queries length 28
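
The sampling step can be sketched on toy data as follows. This is a minimal, dependency-free illustration, not the notebook's exact code: the helper name `sample_negatives` and the toy ids are invented here. It assumes that document ids may be strings, in which case the iteration order of a `set` can differ between interpreter runs, so the candidates are sorted before the seeded sample is drawn.

```python
import random

def sample_negatives(all_ids, positive_ids, per_positive, seed=42):
    """Pick `per_positive` negatives for every positive id, reproducibly."""
    # Sorting fixes the candidate order, so the seeded sample is repeatable.
    candidates = sorted(set(all_ids) - set(positive_ids))
    # Never ask for more negatives than there are candidates.
    k = min(per_positive * len(positive_ids), len(candidates))
    # A local Random instance avoids touching the global random state.
    return set(random.Random(seed).sample(candidates, k))

# Toy ids, invented for illustration only.
all_ids = [f"doc{i}" for i in range(100)]
positive_ids = {"doc3", "doc7"}
negatives = sample_negatives(all_ids, positive_ids, per_positive=10)
print(len(negatives))  # 20
```

Using a local `random.Random(seed)` rather than `random.seed` is a design choice of this sketch; it keeps the sample independent of any other code that draws random numbers.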


### BERT model

In [None]:
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('roberta-large')
tokenizer = AutoTokenizer.from_pretrained('roberta-large')


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



In [None]:
tokenizer

RobertaTokenizerFast(name_or_path='roberta-large', vocab_size=50265, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': '<mask>'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	0: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	1: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	3: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	50264: AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=False, special=True),
}
)

In [None]:
model

RobertaModel(
  (embeddings): RobertaEmbeddings(
    (word_embeddings): Embedding(50265, 1024, padding_idx=1)
    (position_embeddings): Embedding(514, 1024, padding_idx=1)
    (token_type_embeddings): Embedding(1, 1024)
    (LayerNorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): RobertaEncoder(
    (layer): ModuleList(
      (0-23): 24 x RobertaLayer(
        (attention): RobertaAttention(
          (self): RobertaSdpaSelfAttention(
            (query): Linear(in_features=1024, out_features=1024, bias=True)
            (key): Linear(in_features=1024, out_features=1024, bias=True)
            (value): Linear(in_features=1024, out_features=1024, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): RobertaSelfOutput(
            (dense): Linear(in_features=1024, out_features=1024, bias=True)
            (LayerNorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
  

In [None]:
example = queries[0]['query']
print(example)

Would the United Kingdom have been ready for WWII without the time gained through Appeasement?


In [None]:
input_ids = tokenizer.encode(example)
print(input_ids)
print(len(input_ids))
print()
print(tokenizer.added_tokens_decoder)

[0, 29042, 5, 315, 5752, 33, 57, 1227, 13, 29001, 396, 5, 86, 3491, 149, 3166, 29358, 6285, 116, 2]
20

{0: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True), 1: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True), 2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True), 3: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True), 50264: AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=False, special=True)}


In [None]:
decoded_example = tokenizer.decode(input_ids)
print(decoded_example)

<s>Would the United Kingdom have been ready for WWII without the time gained through Appeasement?</s>


In [None]:
inputs = tokenizer(example, return_tensors='pt')
input_ids = inputs.input_ids
print(inputs)
print()
print()
print(input_ids)
print(input_ids.shape)

{'input_ids': tensor([[    0, 29042,     5,   315,  5752,    33,    57,  1227,    13, 29001,
           396,     5,    86,  3491,   149,  3166, 29358,  6285,   116,     2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}


tensor([[    0, 29042,     5,   315,  5752,    33,    57,  1227,    13, 29001,
           396,     5,    86,  3491,   149,  3166, 29358,  6285,   116,     2]])
torch.Size([1, 20])


Note the padding tokens (`<pad>`, id 1) in the second sample and the matching zeros in its `attention_mask`.

In [None]:
inputs = tokenizer([example, example[:10]], return_tensors='pt', truncation=True, padding=True)
input_ids = inputs.input_ids
print(inputs)

{'input_ids': tensor([[    0, 29042,     5,   315,  5752,    33,    57,  1227,    13, 29001,
           396,     5,    86,  3491,   149,  3166, 29358,  6285,   116,     2],
        [    0, 29042,     5,  1437,     2,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])}
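
What the tokenizer does when `padding=True` can be illustrated without any library. The sketch below is an assumption-level re-creation of the idea, not the tokenizer's actual implementation: the helper `pad_batch` and the shortened toy id sequences are invented here, while the pad id 1 is taken from the tokenizer output above.

```python
PAD_ID = 1  # id of <pad> in the roberta-large tokenizer output shown above

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad token-id lists to the longest one and build the attention mask."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks a padded position the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return input_ids, attention_mask

# Toy sequences, invented for illustration (0 = <s>, 2 = </s>).
ids, mask = pad_batch([[0, 29042, 5, 315, 2], [0, 29042, 2]])
print(ids)   # [[0, 29042, 5, 315, 2], [0, 29042, 2, 1, 1]]
print(mask)  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```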


In [None]:
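# NOTE: attention_mask is omitted here, so the padded positions are attended to
# and the warning below is raised; in real use, call model(**inputs) instead.
model_result = model(input_ids)
print(model_result.keys())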

We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.


odict_keys(['last_hidden_state', 'pooler_output'])


In [None]:
print(model_result.last_hidden_state)
print()
print(model_result.last_hidden_state.shape)
print(model_result.last_hidden_state[0, :, :].shape)
print(model_result.last_hidden_state[:, 0, :].shape)

tensor([[[-0.1099,  0.0019, -0.0124,  ..., -0.0764,  0.0463,  0.1033],
         [ 0.0377,  0.2558, -0.6451,  ..., -0.0571, -0.1688, -0.3992],
         [-0.0200,  0.0879, -0.0725,  ...,  0.0603, -0.0499,  0.1073],
         ...,
         [-0.0410,  0.0792, -0.2168,  ..., -0.2132, -0.0974, -0.1989],
         [-0.3018, -0.4035, -0.3044,  ...,  0.1915, -0.1838,  0.3145],
         [-0.0610, -0.0411, -0.0086,  ..., -0.1118, -0.0074,  0.0444]],

        [[-0.0327, -0.0315, -0.0383,  ..., -0.0288,  0.1306,  0.0035],
         [-0.0035,  0.0217, -0.1948,  ...,  0.2103, -0.0253,  0.0205],
         [-0.0670,  0.1362, -0.1744,  ...,  0.1401,  0.1191,  0.0968],
         ...,
         [-0.0249,  0.1890, -0.1642,  ..., -0.1527,  0.0173,  0.1446],
         [-0.0249,  0.1890, -0.1642,  ..., -0.1527,  0.0173,  0.1446],
         [-0.0249,  0.1890, -0.1642,  ..., -0.1527,  0.0173,  0.1446]]],
       grad_fn=<NativeLayerNormBackward0>)

torch.Size([2, 20, 1024])
torch.Size([20, 1024])
torch.Size([2, 1024])


In [None]:
print(model_result.pooler_output)
print(model_result.pooler_output.shape)

tensor([[ 0.3217, -0.3042, -0.5720,  ..., -0.3693,  0.3369, -0.5884],
        [ 0.2873, -0.3008, -0.5220,  ..., -0.3009,  0.4818, -0.6207]],
       grad_fn=<TanhBackward0>)
torch.Size([2, 1024])


In [None]:
model.pooler

RobertaPooler(
  (dense): Linear(in_features=1024, out_features=1024, bias=True)
  (activation): Tanh()
)

In [None]:
pooled_output = model.pooler.dense(model_result.last_hidden_state[:, 0])
pooled_output = model.pooler.activation(pooled_output)
torch.equal(pooled_output, model_result.pooler_output)

True

## Preparing Dataset and DataLoader
The text preprocessing pipeline for neural networks is as follows:
- Tokenization. Special tokens are usually added automatically at this step; for RoBERTa these are `<s>` at the beginning and `</s>` at the end (the counterparts of BERT's `[CLS]` and `[SEP]`). In our implementation, the `Dataset` classes only truncate the raw text by character count, and the tokenizer is called inside the DataLoader's `collate_fn`.
- Batching. Multiple sequences of token indices are stacked into a matrix of shape `batch_size x longest_seq_len`. Sequences with `seq_len < longest_seq_len` are padded with the special `<pad>` token up to `longest_seq_len` (the tokenizer does this under the hood). The model should not attend to these tokens, so the tokenizer also generates an `attention_mask` of shape `batch_size x longest_seq_len`, which is passed to the model. This is the `DataLoader` part of our implementation.

In the current implementation, tokenization is performed on the fly while iterating over the DataLoader. An alternative is to tokenize the data once in the `Dataset` class. That has a potential advantage: tokenization is not repeated when the same data is processed multiple times, e.g. over multiple training epochs. However, the advantage can be minimal given enough CPU cores, since the DataLoader can prepare batches in parallel worker processes via its `num_workers` parameter.
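
The alternative of tokenizing once in the dataset could look roughly like the sketch below. It is a hypothetical illustration only: `PreTokenizedDataset`, `tokenize_fn`, and `toy_tokenize` are names invented here, and a whitespace split stands in for the real tokenizer so that the example runs without any dependencies. A map-style dataset only needs `__len__` and `__getitem__`, which is all this class provides.

```python
class PreTokenizedDataset:
    """Map-style dataset that tokenizes every text once, at construction time."""

    def __init__(self, texts, tokenize_fn, char_max_length=8192):
        # tokenize_fn is a stand-in for a real tokenizer call (assumption).
        self.items = [tokenize_fn(text[:char_max_length]) for text in texts]

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        # No tokenization happens here; the cached result is returned.
        return self.items[idx]


def toy_tokenize(text):
    # Whitespace split, used only to keep the sketch dependency-free.
    return text.lower().split()


dataset = PreTokenizedDataset(["Dense retrieval", "BERT embeddings for search"], toy_tokenize)
print(len(dataset), dataset[1])  # 2 ['bert', 'embeddings', 'for', 'search']
```

With this layout, padding to the longest sequence would still have to happen per batch in the collate function, because the longest sequence differs from batch to batch.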

In [None]:
from torch.utils.data import Dataset

class DocsDataset(Dataset):
    def __init__(self, docs, char_max_length=8192):
        '''
        char_max_length: int
            Maximum number of characters to keep from each document;
            Used to control the speed of tokenization
            (tokenization of the full document might be too time consuming)
        '''
        # self.docs_full = [doc['title'][:char_max_length] for doc in docs]
        self.docs_full = [(doc['title'] + ' ' + doc['contents'])[:char_max_length] for doc in docs]


    def __len__(self):
        return len(self.docs_full)

    def __getitem__(self, idx):
        return self.docs_full[idx]


class QueriesDataset(Dataset):
    def __init__(self, queries, char_max_length=8192):
        '''
        char_max_length: int
            Maximum number of characters to keep from each query;
            Used to control the speed of tokenization
            (tokenization of the full query might be too time consuming)
        '''
        self.queries_full = [query['query'][:char_max_length] for query in queries]
        # self.queries_full = [(query['query']  + ' ' + query['guidelines'])[:char_max_length] for query in queries]

    def __len__(self):
        return len(self.queries_full)

    def __getitem__(self, idx):
        return self.queries_full[idx]

In [None]:
docs_dataset = DocsDataset(docs, char_max_length=256)
queries_dataset = QueriesDataset(queries, char_max_length=128)

In [None]:
len(docs_dataset), len(queries_dataset)

(26664, 28)

In [None]:
def collate_fn(batch: list[str], token_max_length: int):
    tokenized_batch = tokenizer(
        batch,
        return_tensors="pt",
        max_length=token_max_length,
        truncation=True,
        padding=True
    )
    return tokenized_batch

Check the number of available CPU cores with a terminal command so that batching can run in multiple worker processes.

In [None]:
!nproc

4


In [None]:
from torch.utils.data import DataLoader
import multiprocessing
from functools import partial

batch_size = 512
num_workers = multiprocessing.cpu_count() # 4
token_max_length = 128

final_collate_fn = partial(collate_fn,
                           token_max_length=token_max_length)

docs_dataloader = DataLoader(
    docs_dataset,
    batch_size=batch_size,
    num_workers=num_workers,
    shuffle=False,
    collate_fn=final_collate_fn,
)
queries_dataloader = DataLoader(
    queries_dataset,
    batch_size=batch_size,
    num_workers=num_workers,
    shuffle=False,
    collate_fn=final_collate_fn,
)

In [None]:
next(iter(docs_dataloader))['input_ids'].shape

torch.Size([512, 98])

In [None]:
next(iter(docs_dataloader))

{'input_ids': tensor([[    0, 13365,   208,  ...,     1,     1,     1],
        [    0,   597,  2685,  ...,     1,     1,     1],
        [    0,  5320,  6368,  ...,     1,     1,     1],
        ...,
        [    0,  4741, 22471,  ...,     1,     1,     1],
        [    0,  8773,  5075,  ...,     1,     1,     1],
        [    0, 11773,    18,  ...,     1,     1,     1]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])}

### Calculating embeddings

In [None]:
import torch
import torch.nn.functional as F
from tqdm import tqdm
from torch.utils.data import DataLoader

def get_embeddings(model, dataloader: DataLoader, device='cuda') -> torch.Tensor:
    all_embeddings = []

    model.eval()
    with torch.no_grad():
        for batch in tqdm(dataloader, desc="Generating embeddings"):
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)

            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask
            )

            # masked mean pooling
            last_hidden = outputs.last_hidden_state
            mask = attention_mask.unsqueeze(-1)
            masked_hidden = last_hidden * mask
            sum_hidden = masked_hidden.sum(dim=1)
            lengths = mask.sum(dim=1)
            embeddings = sum_hidden / (lengths + 1e-6)

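            # Per-batch scaling: divide each feature by its std over the batch.
            # NOTE: the mean is not subtracted, so this is scaling rather than
            # full standardization, and the result depends on batch composition.
            std = embeddings.std(dim=0, keepdim=True)
            embeddings = embeddings / (std + 1e-6)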

            # L2 norm
            embeddings = F.normalize(embeddings, p=2, dim=1)

            all_embeddings.append(embeddings.cpu())

    return torch.cat(all_embeddings, dim=0)


In [None]:
device = "cuda"
model = model.to(device)

In [None]:
print("Generating document embeddings...")
doc_embeddings = get_embeddings(model, docs_dataloader)

Generating document embeddings...


Generating embeddings: 100%|██████████| 53/53 [09:29<00:00, 10.75s/it]


In [None]:
print("Generating query embeddings...")
query_embeddings = get_embeddings(model, queries_dataloader)

Generating query embeddings...


Generating embeddings: 100%|██████████| 1/1 [00:00<00:00,  1.84it/s]


In [None]:
doc_embeddings.shape, query_embeddings.shape

(torch.Size([26664, 1024]), torch.Size([28, 1024]))
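
The masked mean pooling and L2 normalization steps inside `get_embeddings` can be checked by hand on a tiny example. The sketch below uses plain Python lists instead of tensors so the arithmetic is visible; the helper names and the toy numbers are invented here, and the per-batch scaling step is deliberately left out.

```python
import math

def masked_mean_pool(hidden, mask):
    """Average token vectors where mask == 1, ignoring padded positions."""
    dim = len(hidden[0])
    total = [0.0] * dim
    for vec, m in zip(hidden, mask):
        for d in range(dim):
            total[d] += vec[d] * m  # padded rows are multiplied by 0
    n_tokens = sum(mask)
    return [t / n_tokens for t in total]

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]  # last row is a padded position
mask = [1, 1, 0]
pooled = masked_mean_pool(hidden, mask)
unit = l2_normalize(pooled)
print(pooled)  # [2.0, 3.0]
```

The padded row, however large its values, does not affect the result, which is the point of multiplying by the attention mask before summing.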

### Computing similarity

In [None]:
def similarity_euclidean(doc_embedding, query_embedding) -> float:
    return float(-torch.norm(doc_embedding - query_embedding).item()) # euclidean similarity

In [None]:
def similarity_L1(doc_embedding, query_embedding) -> float:
    return float(-torch.sum(torch.abs(doc_embedding - query_embedding)).item()) # L1 similarity

In [None]:
def similarity(doc_embedding, query_embedding) -> float:
    return float(torch.dot(doc_embedding, query_embedding))  # dot product similarity
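
Because `get_embeddings` L2-normalizes its output, the dot product of two embeddings equals their cosine similarity, and for unit vectors the squared Euclidean distance is `2 - 2 * dot`. Ranking by dot product and ranking by negative Euclidean distance therefore give the same order (the L1 variant carries no such guarantee). The sketch below checks this on toy vectors in plain Python; all names and numbers in it are invented for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def unit(v):
    """Scale a vector to unit Euclidean length."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

query = unit([1.0, 2.0, 2.0])
docs_toy = {
    "a": unit([1.0, 2.0, 1.0]),
    "b": unit([-1.0, 0.5, 0.0]),
    "c": unit([2.0, 4.0, 4.1]),
}

# Higher dot product is better; smaller distance is better.
by_dot = sorted(docs_toy, key=lambda k: dot(docs_toy[k], query), reverse=True)
by_dist = sorted(docs_toy, key=lambda k: euclidean(docs_toy[k], query))
print(by_dot == by_dist)  # True
```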

In [None]:
import torch
import torch.nn.functional as F
# this is a common way to import `functional` module
# however, one of the creators of the HW2 thinks that it
# is way more beautiful to write
# import torch.nn.functional as tofu

preds = []

for i, q in enumerate(query_embeddings):
    for j, d in enumerate(tqdm(doc_embeddings)):
        pred_sim = similarity(d, q)
        preds.append({
            'doc_id': docs[j]['id'],
            'query_id': queries[i]['query_id'],
            'score': pred_sim
        })

100%|██████████| 26664/26664 [00:00<00:00, 88270.33it/s]

In [None]:
def pfound_score(y_true: np.ndarray, y_score: np.ndarray, pbreak: float = .15) -> float:
    assert y_true.shape == y_score.shape

    indices = np.argsort(y_score)[::-1]

    y_max = max(y_true)

    pfound, plook = 0., 1.

    for rank, i in enumerate(indices):
        r = (2. ** y_true[i] - 1.) / (2. ** y_max)

        pfound += r * plook * pbreak ** rank

        plook *= 1. - r

    return pfound


def pfound(qrels_list: list[dict[str, str | int]],
           y_pred: list[dict[str, str | float]],
           pbreak: float = 0.15
          ) -> float:
    assert 0 < pbreak < 1
    zero_score_qrel = {'score': 0.0, 'relevance': 0.0}

    queries = set(qrel['query_id'] for qrel in qrels_list)
    p_found_list = []
    for cur_query in queries:
        cur_y_pred_dicts = [doc_ranked for doc_ranked in y_pred
                            if doc_ranked['query_id'] == cur_query]
        y = {qrel['doc_id']: qrel for qrel in qrels_list if qrel['query_id'] == cur_query}
        cur_y_pred = np.empty(len(cur_y_pred_dicts))
        cur_y_true = np.empty(len(cur_y_pred_dicts))
        for n, y_pred_dict in enumerate(cur_y_pred_dicts):
            cur_y_pred[n] = y_pred_dict['score']
            cur_y_true[n] = y.get(y_pred_dict['doc_id'], zero_score_qrel)['relevance']

        cur_pfound = pfound_score(np.array(cur_y_true), np.array(cur_y_pred))
        p_found_list.append(cur_pfound)
    return float(np.mean(p_found_list))

In [None]:
pfound(qrels, preds)

0.3733633557789204
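
To see how the metric reacts to ranking quality, the per-query computation can be traced on three documents. The function below is a pure-Python mirror of `pfound_score` written for this illustration (the name `pfound_toy` and the toy relevance labels are invented here); it assumes there are no tied scores, since tie handling may differ from `np.argsort`.

```python
def pfound_toy(y_true, y_score, pbreak=0.15):
    """Pure-Python mirror of pfound_score above, for a single query."""
    # Document indices ordered from highest to lowest score.
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    y_max = max(y_true)
    pfound, plook = 0.0, 1.0
    for rank, i in enumerate(order):
        r = (2.0 ** y_true[i] - 1.0) / (2.0 ** y_max)  # relevance mapped to [0, 1)
        pfound += r * plook * pbreak ** rank
        plook *= 1.0 - r
    return pfound

relevance = [2, 0, 1]
good = pfound_toy(relevance, [0.9, 0.8, 0.7])  # most relevant document ranked first
bad = pfound_toy(relevance, [0.1, 0.8, 0.9])   # most relevant document ranked last
print(good, bad)  # about 0.7514 and 0.2627
```

With `pbreak = 0.15`, the factor `pbreak ** rank` shrinks very quickly, so in this implementation almost all of the score comes from the document placed at the top of the ranking.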

In [None]:
scores = [p['score'] for p in preds]
print(min(scores), max(scores), sum(scores)/len(scores))

0.7843459248542786 0.9780784845352173 0.9310470349293283


## Submission
We reuse the same (subsampled) document collection and its embeddings, and load the test-set queries.

In [None]:
with open(data_dir / 'queries_test.json') as fp:
    qs_test = json.load(fp)

print(f'Number of queries: {len(qs_test)}')

Number of queries: 14


In [None]:
qs_test_dataset = QueriesDataset(qs_test)

In [None]:
batch_size = 256
num_workers = multiprocessing.cpu_count() # 4
token_max_length = 32

final_collate_fn = partial(collate_fn,
                           token_max_length=token_max_length)

qs_test_dataloader = DataLoader(
    qs_test_dataset,
    batch_size=batch_size,
    num_workers=num_workers,
    shuffle=False,
    collate_fn=final_collate_fn,
)

In [None]:
print("Generating query embeddings...")
qs_test_embeddings = get_embeddings(model, qs_test_dataloader)

Generating query embeddings...


Generating embeddings: 100%|██████████| 1/1 [00:00<00:00,  1.85it/s]


In [None]:
import torch
import torch.nn.functional as F

submission_items = []

for i, q in enumerate(qs_test_embeddings):
    for j, d in enumerate(tqdm(doc_embeddings)):
        q_id = qs_test[i]['query_id']
        doc_id = docs[j]['id']
        pred_sim = similarity(d, q)
        submission_items.append({
            'id': f'{q_id}_{doc_id}',
            'doc_id': doc_id,
            'query_id': q_id,
            'score': pred_sim
        })

100%|██████████| 26664/26664 [00:00<00:00, 75569.47it/s]


In [None]:
df = pd.DataFrame(submission_items)
df.set_index('id', inplace=True)
df.to_csv('submission.csv')