## Part 2 - Neural Re-Ranking **40 points**

Implement 2 neural architectures (KNRM, TK) based on the kernel-pooling paradigm to perform re-ranking in ``src/re_ranking.py``

- Implement: the 2 (KNRM, TK) model classes **20 points**
   - Show that you understand what happens by adding comments to the difficult parts of the model (what the tensor dimensions represent, what gets summed up, etc.)
- Implement: training process & result evaluation **10 points**
    - Including early stopping based on the validation set
	   - Use the **msmarco_tuples.validation.tsv** input to feed the neural models and **msmarco_qrels.txt** qrels to evaluate the output
- Evaluate: Compute a test set evaluation at the end  **10 points**
	- MS-MARCO sparse labels
	  - Use the **msmarco_tuples.test.tsv** input to feed the neural models and **msmarco_qrels.txt** qrels to evaluate the output
	- FiRA-2022 fine-grained labels on out-of-domain data
	  - Use the labels you created in part 1
	     - Use the **fira-2022.tuples.tsv** input to feed the neural models and your qrels from part 1 to evaluate the output
	  - Compare these results with our baseline label creation method
	     - Use the **fira-2022.tuples.tsv** input to feed the neural models and **fira-2022.baseline-qrels.tsv** qrels to evaluate the output
	  - Explore & describe the differences in metrics between the baseline and your label creation 
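
The kernel-pooling step named in the task can be sketched in plain Python as follows. This is only a minimal illustration under stated assumptions, not the `KNRM` class imported from `src.model_knrm`: the function name `kernel_pooling`, the kernel means and widths, and the toy similarity matrix are hypothetical, and padding masks as well as the final linear scoring layer are left out.

```python
import math


def kernel_pooling(cosine_matrix, mus, sigmas):
    """Turn a query-document cosine matrix into one feature per kernel.

    cosine_matrix: nested list of shape [query_len][doc_len]
    mus, sigmas:   one Gaussian mean / width per kernel (hypothetical values)
    """
    features = []
    for mu, sigma in zip(mus, sigmas):
        kernel_feature = 0.0
        for query_row in cosine_matrix:
            # Sum the Gaussian activations over the document dimension
            soft_tf = sum(
                math.exp(-((m - mu) ** 2) / (2 * sigma ** 2)) for m in query_row
            )
            # Log-scale the soft term frequency, then sum over the query dimension
            kernel_feature += math.log(max(soft_tf, 1e-10))
        features.append(kernel_feature)
    return features


# Toy example: 2 query terms x 2 document terms, 2 kernels
toy_matrix = [[1.0, 0.5], [0.5, 0.5]]
features = kernel_pooling(toy_matrix, mus=[1.0, 0.5], sigmas=[0.1, 0.1])
print(features)
```

In a tensor implementation the two loops become sums over the document and query axes of a `[batch, query_len, doc_len, n_kernels]` tensor, which is the kind of dimension bookkeeping the model comments are expected to explain.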

## Provided data:
* AllenNLP vocabulary (collection-specific, in two sizes: _5 and _10, where the suffix is the minimum number of occurrences in the collection; use _10 if you run into memory problems with _5)
* train triples
* evaluation tuples (validation & test) with 2,000 queries each and the top 40 BM25 results per query, plus relevance judgments (qrels, one file covering both validation & test)

In [11]:
# Set your base path, for example "/home/studio-lab-user/src/data_part2"
base_path = "../"

## Imports

In [12]:
from typing import Dict, Iterator, List

import torch
import torch.nn as nn
from torch.autograd import Variable

from allennlp.modules.text_field_embedders import TextFieldEmbedder

import sys
sys.path.append('..')

from src.data_loading import IrTripleDatasetReader
import numpy as np
import os

from allennlp.data.vocabulary import Vocabulary

from allennlp.modules.token_embedders import Embedding
from allennlp.modules.text_field_embedders import BasicTextFieldEmbedder

from src.data_loading import *
from allennlp.data.dataloader import PyTorchDataLoader
import pandas as pd
from src.model_knrm import *
from src.BatchWordEmbedder import *

In [13]:
config = {
    "vocab_directory": os.path.join(base_path, "data/Part-2/allen_vocab_lower_10"),
    "pre_trained_embedding": os.path.join(base_path, "data/Part-2/glove.42B.300d.txt"),
    "model": "knrm",
    "train_data": os.path.join(base_path, "data/Part-2/triples.train.tsv"),
    "validation_data": os.path.join(base_path, "data/Part-2/msmarco_tuples.validation.tsv"),
    "test_data": os.path.join(base_path, "data/Part-2/tuples.test.tsv"),
    "qrels": os.path.join(base_path, "data/Part-2/msmarco_qrels.txt"),
}

## Loading the data

In [14]:
vocab = Vocabulary.from_files(config["vocab_directory"])
tokens_embedder = Embedding(vocab=vocab,
                           pretrained_file= config["pre_trained_embedding"],
                           embedding_dim=300,
                           trainable=True,
                           padding_index=0,
                           )
word_embedder = BasicTextFieldEmbedder({"tokens": tokens_embedder})

_triple_reader = IrTripleDatasetReader(lazy=True, max_doc_length=180, max_query_length=30)
_triple_reader = _triple_reader.read(config["train_data"])
_triple_reader.index_with(vocab)
loader = PyTorchDataLoader(_triple_reader, batch_size=256)

0it [00:00, ?it/s]

In [16]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# device = 'cpu'
batch_embedder = BatchWordEmbedder(word_embedder, device)
model_knrm = KNRM(n_kernels=11).to(device)
optimizer = torch.optim.Adam(model_knrm.parameters(), lr=0.5 * 1e-3)
loss_criterion = torch.nn.MarginRankingLoss(margin=1, reduction='mean').to(device)
print(device)

cpu
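
To make the training objective concrete, the following plain-Python sketch reproduces what a margin ranking loss with target +1 computes for pairs of (relevant, non-relevant) scores. The helper name `margin_ranking_loss` and the example scores are illustrative assumptions; the notebook itself uses `torch.nn.MarginRankingLoss`.

```python
def margin_ranking_loss(pos_scores, neg_scores, margin=1.0):
    """Mean of max(0, margin - (pos - neg)) over all pairs (target y = +1)."""
    losses = [max(0.0, margin - (p - n)) for p, n in zip(pos_scores, neg_scores)]
    return sum(losses) / len(losses)


# Pair 1: relevant doc leads by 1.5 (>= margin), so it contributes 0
# Pair 2: relevant doc leads by only 0.2, so it contributes 1 - 0.2 = 0.8
loss = margin_ranking_loss([2.0, 0.3], [0.5, 0.1])
print(round(loss, 4))
```

A model that scores both documents of a pair almost identically yields a loss close to the margin, which is why an untrained model is expected to start near 1 with `margin=1`.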


## Start training

In [11]:
model_knrm_trained = knrm_training_loop(model_knrm, loader, optimizer, loss_criterion, batch_embedder, device, epochs=1)

reading instances: 0it [00:00, ?it/s]



0 Batch loss: 0.9861485958099365
50 Batch loss: 0.9711251854896545
100 Batch loss: 0.9416072964668274
150 Batch loss: 0.9400886297225952
200 Batch loss: 0.9375748634338379
250 Batch loss: 0.9009732007980347
300 Batch loss: 0.9000574350357056
350 Batch loss: 0.8765318393707275
400 Batch loss: 0.8214479684829712
450 Batch loss: 0.8288471102714539
500 Batch loss: 0.8129605650901794
550 Batch loss: 0.7374101877212524
600 Batch loss: 0.7331011295318604
650 Batch loss: 0.7029364109039307
700 Batch loss: 0.627629280090332
750 Batch loss: 0.7026369571685791
800 Batch loss: 0.624007523059845
850 Batch loss: 0.6411666870117188
900 Batch loss: 0.507813036441803
950 Batch loss: 0.6552830934524536
1000 Batch loss: 0.579552173614502
1050 Batch loss: 0.5931665897369385
1100 Batch loss: 0.5118998289108276
1150 Batch loss: 0.5565185546875
1200 Batch loss: 0.5134824514389038
1250 Batch loss: 0.5800173282623291
1300 Batch loss: 0.5316937565803528
1350 Batch loss: 0.542830228805542
1400 Batch loss: 0.4688
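
The task asks for early stopping based on the validation set. One possible way to track this is sketched below, assuming a validation metric where higher is better (for example MRR@10) is computed at regular intervals. The class name `EarlyStopping`, its `patience` and `min_delta` parameters, and the metric values are hypothetical and not part of `knrm_training_loop`.

```python
class EarlyStopping:
    """Signal a stop once the validation metric has not improved for `patience` checks."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = None
        self.bad_checks = 0

    def step(self, metric):
        """Record a new validation metric; return True when training should stop."""
        if self.best is None or metric > self.best + self.min_delta:
            self.best = metric
            self.bad_checks = 0
            return False
        self.bad_checks += 1
        return self.bad_checks >= self.patience


# Illustrative (made-up) validation scores, not results from this notebook
stopper = EarlyStopping(patience=2)
history = [0.15, 0.18, 0.17, 0.18, 0.19]
stopped_at = None
for check, score in enumerate(history):
    if stopper.step(score):
        stopped_at = check
        break
print(stopped_at, stopper.best)
```

In a real loop the model weights would typically be saved whenever `best` improves, so that the checkpoint used for the test set evaluation is the best one rather than the last one.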

**Load the model**

In [6]:
# Path of the saved model
model_path = os.path.join(base_path, "models/model_knrm_trained.pth")

# Define the configuration for KNRM model initialization
knrm_config = {
    "n_kernels": 11,
}

# Initialize the KNRM model
model_knrm = KNRM(**knrm_config).to(device)

# Load the model state_dict
model_knrm.load_state_dict(torch.load(model_path, map_location=device))

<All keys matched successfully>

**MS Marco Sparse**

In [7]:
# Read the qrels file: four whitespace-separated columns per line,
# of which the first is the query id and the third the document id
list_qrels = []
with open(config['qrels']) as f:
    for line in f:
        query_id, _, doc_id, _ = line.split()
        list_qrels.append([query_id, doc_id])
df_qrels = pd.DataFrame(list_qrels, columns=['query_id', 'doc_id'])
df_qrels['query_id'] = df_qrels['query_id'].astype(int)
df_qrels['doc_id'] = df_qrels['doc_id'].astype(int)

In [8]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# device = 'cpu'
batch_embedder_eval = BatchWordEmbedder(word_embedder, device, train=False)

In [9]:
_triple_reader_eval = IrLabeledTupleDatasetReader(lazy=True, max_doc_length=128, max_query_length=128)
_triple_reader_eval = _triple_reader_eval.read(config["validation_data"])
_triple_reader_eval.index_with(vocab)
loader_test = PyTorchDataLoader(_triple_reader_eval, batch_size=64)
# batch = next(iter(loader_test))
# batch_embedder_eval = BatchWordEmbedder(word_embedder, device, train=False)

In [10]:
def evaluate_model(model, df_qrels, loader, batch_embedder, device):
    model.eval()
    query_ids = []
    doc_ids = []
    preds = []
    with torch.no_grad():
        for idx, batch in enumerate(loader):
            query_emb, doc_pos_emb, _, query_pad_mask, document_pad_mask_pos, _ = batch_embedder(batch)
            pred = model(query_emb, doc_pos_emb, query_pad_mask, document_pad_mask_pos)
            query_ids.extend(batch['query_id'])  # Directly extend the list
            doc_ids.extend(batch['doc_id'])      # Directly extend the list
            preds.extend(pred.cpu().numpy().flatten())
            # if idx * 32 >= 10000:
            #     break
    df_eval = pd.DataFrame({
        'query_id': query_ids,
        'doc_id': doc_ids,
        'score': preds
    })
    
    # Assigning rank based on scores within each query_id group
    df_eval['rank'] = df_eval.groupby('query_id')['score'].rank(ascending=False, method='first').astype(int)
    
    # Sorting by query_id and rank
    df_eval = df_eval.sort_values(by=['query_id', 'rank'])
    
    return df_eval

In [11]:
df_eval = evaluate_model(model_knrm, df_qrels, loader_test, batch_embedder_eval, device)

reading instances: 0it [00:00, ?it/s]

In [12]:
# Define the specific paths relative to the base path
path_result = os.path.join(base_path, "data/results_part2/knrm_msmarco_ranking_final.tsv")
path_baseline = config['qrels']


# Write the DataFrame to a TSV file
df_eval.to_csv(path_result, sep='\t', header=False, index=False)

In [13]:
from src.core_metrics import calculate_metrics_plain,load_ranking,load_qrels

calculate_metrics_plain(load_ranking(path_result),load_qrels(path_baseline))

{'MRR@10': 0.1817690476190476,
 'Recall@10': 0.369375,
 'QueriesWithNoRelevant@10': 1245,
 'QueriesWithRelevant@10': 755,
 'AverageRankGoldLabel@10': 3.7072847682119203,
 'MedianRankGoldLabel@10': 3.0,
 'MRR@20': 0.18971659236875957,
 'Recall@20': 0.47979166666666667,
 'QueriesWithNoRelevant@20': 1019,
 'QueriesWithRelevant@20': 981,
 'AverageRankGoldLabel@20': 6.243628950050969,
 'MedianRankGoldLabel@20': 4.0,
 'MRR@1000': 0.19311101499504907,
 'Recall@1000': 0.574625,
 'QueriesWithNoRelevant@1000': 830,
 'QueriesWithRelevant@1000': 1170,
 'AverageRankGoldLabel@1000': 9.876068376068377,
 'MedianRankGoldLabel@1000': 6.0,
 'nDCG@3': 0.16749706819748705,
 'nDCG@5': 0.1919873251456052,
 'nDCG@10': 0.22463290914396003,
 'nDCG@20': 0.2531676692056566,
 'nDCG@1000': 0.272969811529321,
 'QueriesRanked': 2000,
 'MAP@1000': 0.18992419478055275}
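
For readers unfamiliar with the headline metric, here is a minimal sketch of how MRR@k can be computed from a ranking and a set of relevant documents. It is an illustration only: the function `mrr_at_k` and the toy ids are hypothetical, and the provided `src.core_metrics` module may differ in details such as how queries without judgments are handled.

```python
def mrr_at_k(ranking, qrels, k=10):
    """Mean reciprocal rank of the first relevant document within the top k.

    ranking: {query_id: [doc_id, ...]} with the best-scored document first
    qrels:   {query_id: set of relevant doc_ids}
    """
    total = 0.0
    for query_id, docs in ranking.items():
        relevant = qrels.get(query_id, set())
        for rank, doc_id in enumerate(docs[:k], start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranking)


# Toy data: query 1 hits at rank 2, query 2 at rank 1, query 3 has no hit
toy_ranking = {1: [10, 11, 12], 2: [20, 21], 3: [30]}
toy_qrels = {1: {11}, 2: {20}, 3: {99}}
mrr = mrr_at_k(toy_ranking, toy_qrels)
print(mrr)  # (1/2 + 1 + 0) / 3 = 0.5
```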

**fira-2022.tuples.tsv input to feed the neural models and fira-2022.baseline-qrels.tsv for evaluation**

In [14]:

config = {
    "vocab_directory": os.path.join(base_path, "data/Part-2/allen_vocab_lower_10"),
    "pre_trained_embedding": os.path.join(base_path, "data/Part-2/glove.42B.300d.txt"),
    "model": "knrm",
    "train_data": os.path.join(base_path, "data/Part-2/triples.train.tsv"),
    "validation_data": os.path.join(base_path, "data/Part-2/fira-22.tuples_mod.tsv"),
    "test_data": os.path.join(base_path, "data/Part-2/tuples.test.tsv"),
    "qrels": os.path.join(base_path, "data/Part-2/fira-22.baseline-qrels.tsv"),
}

In [15]:

# Collect [query_id, doc_id] pairs from the qrels file set in config['qrels']
list_qrels = []

# Read the file line by line and process each line
with open(config['qrels']) as f:
    for line in f:
        parts = line.strip().split()  # Split line on any whitespace
        if len(parts) >= 4:  # Check if we have at least 4 parts
            query_id = parts[0]  # First column
            doc_id = parts[2]    # Third column
            list_qrels.append([query_id, doc_id])  # Append [query_id, doc_id] to list_qrels
        else:
            print(f"Warning: Skipping line with unexpected format: {line}")
            print(f"Number of parts found: {len(parts)}")

# Create a DataFrame from list_qrels with columns 'query_id' and 'doc_id'
df_qrels = pd.DataFrame(list_qrels, columns=['query_id', 'doc_id'])

In [16]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# device = 'cpu'
batch_embedder_eval = BatchWordEmbedder(word_embedder, device, train=False)

In [17]:
_tuple_reader_eval = IrLabeledTupleDatasetReader(lazy=True, max_doc_length=512, max_query_length=512)
#Load modified tuples
_tuple_reader_eval = _tuple_reader_eval.read(config['validation_data'])
_tuple_reader_eval.index_with(vocab)
loader_test = PyTorchDataLoader(_tuple_reader_eval, batch_size=64)

In [18]:
def evaluate_model(model, df_qrels, loader, batch_embedder, device):
    model.eval()
    query_ids = []
    doc_ids = []
    preds = []
    with torch.no_grad():
        for idx, batch in enumerate(loader):
            try:
                query_emb, doc_pos_emb, _, query_pad_mask, document_pad_mask_pos, _ = batch_embedder(batch)

                pred = model(query_emb, doc_pos_emb, query_pad_mask, document_pad_mask_pos)
                query_ids.extend(batch['query_id'])  # Directly extend the list
                doc_ids.extend(batch['doc_id'])      # Directly extend the list
                preds.extend(pred.cpu().numpy().flatten())
            except Exception as e:
                print(f"Error processing batch {idx}: {e}")
                print("Skipping batch with query_id and doc_id:")
                print("query_id:", batch['query_id'])
                print("doc_id:", batch['doc_id'])
                continue
            # if idx * 32 >= 10000:
            #     break
    print("Finished!")
    df_eval = pd.DataFrame({
        'query_id': query_ids,
        'doc_id': doc_ids,
        'score': preds
    })
    
    # Assigning rank based on scores within each query_id group
    df_eval['rank'] = df_eval.groupby('query_id')['score'].rank(ascending=False, method='first').astype(int)
    
    # Sorting by query_id and rank
    df_eval = df_eval.sort_values(by=['query_id', 'rank'])
    
    return df_eval

In [19]:
df_eval = evaluate_model(model_knrm, df_qrels, loader_test, batch_embedder_eval, device)

reading instances: 0it [00:00, ?it/s]

Finished!


In [None]:
# Define the result and qrels paths
path_result = os.path.join(base_path, "data/results_part2/knrm_fira_baseline_final.tsv")
path_baseline = config['qrels']

# Write the DataFrame to a TSV file
df_eval.to_csv(path_result, sep='\t', header=False, index=False)


In [21]:
from src.core_metrics import calculate_metrics_plain,load_ranking,load_qrels

calculate_metrics_plain(load_ranking(path_result),load_qrels(path_baseline))

{'MRR@10': 0.9560344827586207,
 'Recall@10': 0.9508765283703707,
 'QueriesWithNoRelevant@10': 115,
 'QueriesWithRelevant@10': 4060,
 'AverageRankGoldLabel@10': 1.1029556650246306,
 'MedianRankGoldLabel@10': 1.0,
 'MRR@20': 0.9560344827586207,
 'Recall@20': 1.0000527797325827,
 'QueriesWithNoRelevant@20': 115,
 'QueriesWithRelevant@20': 4060,
 'AverageRankGoldLabel@20': 1.1029556650246306,
 'MedianRankGoldLabel@20': 1.0,
 'MRR@1000': 0.9560344827586207,
 'Recall@1000': 1.0000527797325827,
 'QueriesWithNoRelevant@1000': 115,
 'QueriesWithRelevant@1000': 4060,
 'AverageRankGoldLabel@1000': 1.1029556650246306,
 'MedianRankGoldLabel@1000': 1.0,
 'nDCG@3': 0.870678805114396,
 'nDCG@5': 0.8755932545053234,
 'nDCG@10': 0.9001940416552495,
 'nDCG@20': 0.9139540272527631,
 'nDCG@1000': 0.9139540272527631,
 'QueriesRanked': 4060,
 'MAP@1000': 0.9468693220351503}

**FiRA-2022: evaluation against our own judgements from Part 1**

In [22]:
path_baseline_own = os.path.join(base_path, "data/Part-1/fira-22.judgements-anonymized-aggregated_v1.tsv")
calculate_metrics_plain(load_ranking(path_result), load_qrels(path_baseline_own))

{'MRR@10': 0.9477214289621971,
 'Recall@10': 0.9517564585696048,
 'QueriesWithNoRelevant@10': 113,
 'QueriesWithRelevant@10': 4062,
 'AverageRankGoldLabel@10': 1.1233382570162482,
 'MedianRankGoldLabel@10': 1.0,
 'MRR@20': 0.9477214289621971,
 'Recall@20': 1.000052753745516,
 'QueriesWithNoRelevant@20': 113,
 'QueriesWithRelevant@20': 4062,
 'AverageRankGoldLabel@20': 1.1233382570162482,
 'MedianRankGoldLabel@20': 1.0,
 'MRR@1000': 0.9477214289621971,
 'Recall@1000': 1.000052753745516,
 'QueriesWithNoRelevant@1000': 113,
 'QueriesWithRelevant@1000': 4062,
 'AverageRankGoldLabel@1000': 1.1233382570162482,
 'MedianRankGoldLabel@1000': 1.0,
 'nDCG@3': 0.8614270150469169,
 'nDCG@5': 0.8688532389398897,
 'nDCG@10': 0.8945172252825958,
 'nDCG@20': 0.9077459409980746,
 'nDCG@1000': 0.9077459409980746,
 'QueriesRanked': 4062,
 'MAP@1000': 0.9362002840850584}
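
To support the comparison between the baseline labels and our own labels, the short sketch below computes the per-metric differences for three of the metrics reported above. The numbers are copied from the two result dictionaries; the variable names and the choice of metrics are only illustrative.

```python
# Values copied from the two FiRA evaluations above (KNRM ranking in both cases)
baseline = {
    "MRR@10": 0.9560344827586207,
    "nDCG@10": 0.9001940416552495,
    "MAP@1000": 0.9468693220351503,
}
own = {
    "MRR@10": 0.9477214289621971,
    "nDCG@10": 0.8945172252825958,
    "MAP@1000": 0.9362002840850584,
}

# Difference: own labels minus baseline labels
deltas = {name: own[name] - baseline[name] for name in baseline}
for name, delta in deltas.items():
    print(f"{name}: {delta:+.4f}")
```

With our own labels the same ranking scores slightly lower on all three metrics: roughly 0.008 for MRR@10, 0.006 for nDCG@10, and 0.011 for MAP@1000.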