<a href="https://colab.research.google.com/github/pvannies/information-retrieval/blob/main/modified_BEIR_for_dutch_evaluations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Evaluating a Dutch embedding model on a machine-translated Dutch BEIR dataset**

This notebook is adapted from the original BEIR notebook (https://github.com/beir-cellar/beir/wiki/Examples-and-tutorials), which was developed by Nandan Thakur, Researcher @ UKP Lab, TU Darmstadt.

The purpose of this notebook is to evaluate information retrieval with a sentence-transformer embedding model, followed by reranking with a cross-encoder, on a custom dataset: here, a Dutch SciFact dataset.

Specifically, we are interested in the Dutch embedding and cross-encoder models trained by the NFI (Netherlands Forensic Institute):
- [robbert-2022-dutch-sentence-transformers](https://huggingface.co/NetherlandsForensicInstitute/robbert-2022-dutch-sentence-transformers)
- [robbert-2023-dutch-base-cross-encoder](https://huggingface.co/NetherlandsForensicInstitute/robbert-2023-dutch-base-cross-encoder)

The scifact-nl dataset is machine-translated from the original SciFact dataset (a BEIR benchmark dataset) and will soon be available here: https://huggingface.co/datasets/DSAISopraSteriaNL/scifact-NL/tree/main

## How to
- Have your custom dataset ready in the BEIR format. In the `qrels` folder, only one .tsv file is required: the split you want to evaluate on.

  * scifact-nl/
      * corpus.jsonl
      * queries.jsonl
      * qrels/
          * train.tsv
          * dev.tsv
          * test.tsv


- Upload your files to the notebook session in a new `datasets` folder inside the `content` folder, which opens by default when you click the folder icon in the left menu bar of Colab.

- Define the (sentence transformer) embedding model and the cross-encoder model that you want to evaluate in the first code cell below.

- Run the cells and enjoy.
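
If your data is not yet in the BEIR layout shown above, the following is a minimal sketch of how the three files could be written with the standard library. The helper name `write_beir_dataset` and the toy documents are hypothetical and only illustrate the layout: `corpus.jsonl` with `_id`, `title` and `text`, `queries.jsonl` with `_id` and `text`, and a tab-separated qrels file with the header `query-id`, `corpus-id`, `score`.

```python
import csv
import json
from pathlib import Path


def write_beir_dataset(root, corpus, queries, qrels, split="test"):
    """Write a dataset in the BEIR folder layout (hypothetical helper).

    corpus:  {doc_id: {"title": str, "text": str}}
    queries: {query_id: str}
    qrels:   {query_id: {doc_id: int relevance score}}
    """
    root = Path(root)
    (root / "qrels").mkdir(parents=True, exist_ok=True)

    # corpus.jsonl: one JSON object per line with _id, title and text
    with open(root / "corpus.jsonl", "w", encoding="utf-8") as f:
        for doc_id, doc in corpus.items():
            row = {"_id": doc_id, "title": doc.get("title", ""), "text": doc["text"]}
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

    # queries.jsonl: one JSON object per line with _id and text
    with open(root / "queries.jsonl", "w", encoding="utf-8") as f:
        for query_id, text in queries.items():
            row = {"_id": query_id, "text": text}
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

    # qrels/<split>.tsv: tab-separated relevance judgements with a header row
    qrels_path = root / "qrels" / f"{split}.tsv"
    with open(qrels_path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerow(["query-id", "corpus-id", "score"])
        for query_id, docs in qrels.items():
            for doc_id, score in docs.items():
                writer.writerow([query_id, doc_id, score])
    return root


# Toy example data (made up for illustration only)
toy_corpus = {"d1": {"title": "Titel", "text": "Een korte Nederlandse tekst."}}
toy_queries = {"q1": "Een korte vraag."}
toy_qrels = {"q1": {"d1": 1}}
```

Calling `write_beir_dataset("datasets/my-dataset", toy_corpus, toy_queries, toy_qrels)` would then create a folder that can be used as `data_path`, with `qrel_split = "test"`.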

In [1]:
# define embedding model you want to evaluate
model_name = "NetherlandsForensicInstitute/robbert-2022-dutch-sentence-transformers"
# folder where the files in BEIR format are saved
data_path = "datasets/scifact-nl"
# split: the qrels .tsv file to evaluate on (it must be present in the qrels folder)
qrel_split = "test"
# define cross-encoder model for reranking
cross_encoder_model_name = 'NetherlandsForensicInstitute/robbert-2023-dutch-base-cross-encoder'

In [2]:
# check if your files and folders are indeed present
!ls datasets

scifact-nl


In [3]:
# Install the beir PyPI package, hide output
!nvidia-smi &> /dev/null
!pip install beir &> /dev/null

In [4]:
from beir import util, LoggingHandler

import logging
import pathlib, os

#### Just some code to print debug information to stdout
logging.basicConfig(format='%(asctime)s - %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S',
                    level=logging.INFO,
                    handlers=[LoggingHandler()])
#### /print debug information to stdout



# **Data Loading**

In [5]:
from beir.datasets.data_loader import GenericDataLoader

corpus, queries, qrels = GenericDataLoader(data_path).load(split=qrel_split) # or split = "train" or "dev"

  0%|          | 0/5183 [00:00<?, ?it/s]

# **Dense Retrieval using Exact Search**

## **Sentence-BERT**
We use the [``NetherlandsForensicInstitute/robbert-2022-dutch-sentence-transformers``](https://huggingface.co/NetherlandsForensicInstitute/robbert-2022-dutch-sentence-transformers) SBERT model in this example.

In [6]:
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval import models
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

#### Dense Retrieval using SBERT (Sentence-BERT) ####
#### Provide any pretrained sentence-transformers model
#### The model was fine-tuned using cosine-similarity.
#### Complete list - https://www.sbert.net/docs/pretrained_models.html

# model_name is defined in first cell

model = DRES(models.SentenceBERT(model_name), batch_size=128)
retriever = EvaluateRetrieval(model, score_function="cos_sim")

#### Retrieve dense results (format of results is identical to qrels)
results = retriever.retrieve(corpus, queries)

Batches:   0%|          | 0/3 [00:00<?, ?it/s]

Batches:   0%|          | 0/41 [00:00<?, ?it/s]
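
Conceptually, exact search scores every document against every query and keeps the best matches. The following is a minimal pure-Python sketch of that idea with cosine similarity, assuming toy 2-dimensional vectors in place of real sentence embeddings. It is not BEIR's implementation, and the function names are illustrative. The returned structure mirrors the `results` format used above: `{query_id: {doc_id: score}}`.

```python
import math


def cos_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_search(query_embeddings, doc_embeddings, top_k=2):
    """Brute-force search: compare each query with every document."""
    results = {}
    for query_id, q_vec in query_embeddings.items():
        scores = {
            doc_id: cos_sim(q_vec, d_vec)
            for doc_id, d_vec in doc_embeddings.items()
        }
        # keep the top_k highest-scoring documents for this query
        ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
        results[query_id] = dict(ranked[:top_k])
    return results


# Toy 2-dimensional "embeddings" (assumption: made up for illustration)
doc_embeddings = {"d1": [1.0, 0.0], "d2": [0.0, 1.0], "d3": [1.0, 1.0]}
query_embeddings = {"q1": [2.0, 0.0]}

toy_results = exact_search(query_embeddings, doc_embeddings, top_k=2)
print(toy_results)
```

Because cosine similarity ignores vector length, `d1` is a perfect match for `q1` even though the two vectors differ in magnitude.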

In [7]:
#### Evaluate your retrieval using NDCG@k, MAP@K ...

logging.info("Retriever evaluation for k in: {}".format(retriever.k_values))
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)


In [8]:
print(data_path)
print(model_name)
print(ndcg)
print(_map)
print(recall)
print(precision)

datasets/scifact-nl
NetherlandsForensicInstitute/robbert-2022-dutch-sentence-transformers
{'NDCG@1': 0.32, 'NDCG@3': 0.36713, 'NDCG@5': 0.39326, 'NDCG@10': 0.41025, 'NDCG@100': 0.46031, 'NDCG@1000': 0.48366}
{'MAP@1': 0.29917, 'MAP@3': 0.34491, 'MAP@5': 0.36095, 'MAP@10': 0.36861, 'MAP@100': 0.37906, 'MAP@1000': 0.37983}
{'Recall@1': 0.29917, 'Recall@3': 0.40339, 'Recall@5': 0.46611, 'Recall@10': 0.51744, 'Recall@100': 0.75478, 'Recall@1000': 0.941}
{'P@1': 0.32, 'P@3': 0.14778, 'P@5': 0.106, 'P@10': 0.05933, 'P@100': 0.00857, 'P@1000': 0.00107}


In [9]:
import random

#### Print top-k documents retrieved ####
top_k = 5

query_id, ranking_scores = random.choice(list(results.items()))
scores_sorted = sorted(ranking_scores.items(), key=lambda item: item[1], reverse=True)
print("Query : %s\n" % queries[query_id])

for rank in range(top_k):
    doc_id = scores_sorted[rank][0]
    # Format: Rank x: ID [Title] Body
    print("Rank %d: %s [%s] - %s\n" % (rank+1, doc_id, corpus[doc_id].get("title"), corpus[doc_id].get("text")))

Query : Peroxynitrite is vereist voor de filtratie van TCR/CD8.

Rank 1: 6784372 [Meerdere afbraakroutes regelen veelzijdige CIP/KIP CDK-remmers.] - De zoogdier CIP/KIP-familie van cycline-afhankelijke kinase (CDK) remmers (CKI's) bestaat uit drie eiwitten--p21(Cip1/WAF1), p27(Kip1), en p57(Kip2) -- die cycline-CDK complexen binden en remmen, die de belangrijkste regulators van de celcyclus zijn. CIP/KIP CKI's hebben extra onafhankelijke functies in het reguleren van transcriptie, apoptose en actin cytoskelet dynamica. Deze uiteenlopende functies worden uitgevoerd in verschillende celcompartimenten en dragen bij tot de schijnbaar tegenstrijdige observatie dat de CKI's zowel kanker kunnen onderdrukken als bevorderen. Meerdere ubiquitin ligases (E3s) sturen de proteasoom-gemedieerde afbraak van p21, p27 en p57. Deze beoordeling analyseert recente gegevens over ons huidige inzicht in hoe verschillende E3 routes subpopulaties van de CKI's reguleren om hun diverse functies te beheersen.

Ra

# **Reranking using Cross-Encoder**

In this example, we rerank the top-20 documents retrieved by the embedding model using the [robbert-2023-dutch-base-cross-encoder](https://huggingface.co/NetherlandsForensicInstitute/robbert-2023-dutch-base-cross-encoder) cross-encoder model.

In [10]:
from beir.reranking.models import CrossEncoder
from beir.reranking import Rerank

#### Reranking using Cross-Encoder models
cross_encoder_model = CrossEncoder(cross_encoder_model_name)
reranker = Rerank(cross_encoder_model, batch_size=128)

# Rerank top-20 results using the reranker provided
rerank_results = reranker.rerank(corpus, queries, results, top_k=20)

Batches:   0%|          | 0/47 [00:00<?, ?it/s]
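
The reranking step can be pictured as follows: keep only the `top_k` candidates from the first-stage retriever, score each (query, document) pair again with a second model, and use those new scores for the final ranking. The sketch below assumes a hypothetical `word_overlap` scorer as a stand-in for the cross-encoder, plus made-up toy data. It illustrates the control flow only and is not BEIR's `Rerank` implementation.

```python
def rerank_top_k(queries, corpus, results, score_pair, top_k=20):
    """Re-score the top_k first-stage candidates per query with score_pair."""
    reranked = {}
    for query_id, doc_scores in results.items():
        # candidates: the top_k documents according to the first-stage scores
        candidates = sorted(doc_scores, key=doc_scores.get, reverse=True)[:top_k]
        reranked[query_id] = {
            doc_id: score_pair(queries[query_id], corpus[doc_id]["text"])
            for doc_id in candidates
        }
    return reranked


def word_overlap(query, text):
    """Hypothetical stand-in for a cross-encoder: counts shared lowercase words."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))


# Toy data (assumption: made up for illustration)
toy_queries = {"q1": "rode fiets"}
toy_corpus = {
    "d1": {"text": "een blauwe auto"},
    "d2": {"text": "een rode fiets"},
    "d3": {"text": "een rode auto"},
}
first_stage = {"q1": {"d1": 0.9, "d2": 0.8, "d3": 0.1}}

toy_reranked = rerank_top_k(toy_queries, toy_corpus, first_stage, word_overlap, top_k=2)
print(toy_reranked)
```

In this sketch, `d3` is never re-scored because it falls outside the top 2. For the same reason, a reranker cannot recover relevant documents that the first-stage retriever ranked below `top_k`, so metrics at cutoffs larger than `top_k` stop increasing after reranking.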

In [11]:
#### Evaluate your retrieval using NDCG@k, MAP@K ...
ndcg_r, _map_r, recall_r, precision_r = EvaluateRetrieval.evaluate(qrels, rerank_results, retriever.k_values)

In [12]:
print(f'Without reranking, the information retrieval evaluation metrics on {data_path} by {model_name} are')
print(ndcg)
print(_map)
print(recall)
print(precision)

print(f'After Reranking results of {model_name} with crossencoder {cross_encoder_model_name}, the evaluation metrics are:')
print(ndcg_r)
print(_map_r)
print(recall_r)
print(precision_r)


Without reranking, the information retrieval evaluation metrics on datasets/scifact-nl by NetherlandsForensicInstitute/robbert-2022-dutch-sentence-transformers are
{'NDCG@1': 0.32, 'NDCG@3': 0.36713, 'NDCG@5': 0.39326, 'NDCG@10': 0.41025, 'NDCG@100': 0.46031, 'NDCG@1000': 0.48366}
{'MAP@1': 0.29917, 'MAP@3': 0.34491, 'MAP@5': 0.36095, 'MAP@10': 0.36861, 'MAP@100': 0.37906, 'MAP@1000': 0.37983}
{'Recall@1': 0.29917, 'Recall@3': 0.40339, 'Recall@5': 0.46611, 'Recall@10': 0.51744, 'Recall@100': 0.75478, 'Recall@1000': 0.941}
{'P@1': 0.32, 'P@3': 0.14778, 'P@5': 0.106, 'P@10': 0.05933, 'P@100': 0.00857, 'P@1000': 0.00107}
After Reranking results of NetherlandsForensicInstitute/robbert-2022-dutch-sentence-transformers with crossencoder NetherlandsForensicInstitute/robbert-2023-dutch-base-cross-encoder, the evaluation metrics are:
{'NDCG@1': 0.40667, 'NDCG@3': 0.45319, 'NDCG@5': 0.47512, 'NDCG@10': 0.48698, 'NDCG@100': 0.49127, 'NDCG@1000': 0.49127}
{'MAP@1': 0.3815, 'MAP@3': 0.43263, 'MAP@5

## Good Results!

On the scifact-nl dataset, reranking the results with the robbert-2023-dutch-base-cross-encoder cross-encoder clearly improves the evaluation metrics:

- NDCG@10: 0.41 -> 0.49
  - NDCG@10 is the main retrieval metric in [MTEB](https://huggingface.co/spaces/mteb/leaderboard) and also takes the rank position of the correct results into account.
- Recall@1: 0.30 -> 0.38
  - Recall@1 measures how often the most relevant document is placed at the top (in SciFact, most queries have only one relevant document).
- Recall@5: 0.47 -> 0.54
  - Recall@5 measures how often the correct search result is in the top 5. This is interesting for RAG because you can, for example, pass the top 5 search results to an LLM for further processing.
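
To make these metrics concrete, here is a small sketch of Recall@k and NDCG@k for a single query with binary relevance. It is a simplified illustration, not the evaluation code used by BEIR, which also averages the scores over all queries. The function names and the toy rankings are assumptions.

```python
import math


def recall_at_k(ranked_doc_ids, relevant, k):
    """Fraction of the relevant documents that appear in the top k."""
    hits = sum(1 for doc_id in ranked_doc_ids[:k] if doc_id in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked_doc_ids, relevant, k):
    """NDCG@k with binary relevance: rewards relevant documents near the top."""
    # discounted gain: a hit at 0-based position `rank` contributes 1 / log2(rank + 2)
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_doc_ids[:k])
        if doc_id in relevant
    )
    # ideal DCG: all relevant documents placed at the very top
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg


# Toy example: one relevant document, ranked third before and first after reranking
relevant = {"d2"}
before = ["d1", "d3", "d2"]
after = ["d2", "d1", "d3"]

print(recall_at_k(before, relevant, 1), recall_at_k(after, relevant, 1))
print(ndcg_at_k(before, relevant, 10), ndcg_at_k(after, relevant, 10))
```

In the toy example, Recall@1 only changes once the relevant document reaches the top position, whereas NDCG@10 already gives partial credit for rank 3. This is why NDCG is described above as taking the ranking into account.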


