# BEIR: A Heterogenous benchmark for Zero-shot Evaluation of Information Retrieval models

This notebook contains simple, easy-to-follow examples for evaluating retrieval models on our new benchmark.

## Introduction
The BEIR benchmark covers 9 diverse retrieval tasks across 17 datasets. We evaluate 9 state-of-the-art retriever models, all in a zero-shot setup. In this Colab notebook, we first show how to download and load the 14 open-sourced datasets with just three lines of code. We then load state-of-the-art dense retrievers (bi-encoders) such as SBERT, ANCE, and DPR, use them for retrieval, and evaluate them in a zero-shot setup.

Don't hesitate to send us an e-mail or report an issue if something is broken (it shouldn't be) or if you have further questions.

Developed by Nandan Thakur, Researcher @ UKP Lab, TU Darmstadt

(https://nthakur.xyz) (nandant@gmail.com)

In [1]:
import os
from dotenv import load_dotenv
load_dotenv()
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
# .get() avoids a KeyError when a variable is missing from the environment or .env file
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"), "HF_HOME:", os.environ.get("HF_HOME"))

CUDA_VISIBLE_DEVICES: 3 HF_HOME: /local1/mohsenfayyaz/.hfcache/


# Install BEIR

In [None]:
# ! pip install beir

In [2]:
from beir import util, LoggingHandler
import logging
import pathlib, os
#### Just some code to print debug information to stdout
logging.basicConfig(format='%(asctime)s - %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S',
                    level=logging.INFO,
                    handlers=[LoggingHandler()])
#### /print debug information to stdout



**BEIR Datasets**

BEIR contains 17 diverse datasets overall. You can browse all of them (14 downloadable) at the link below:

[``https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/``](https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/)

Please refer to the GitHub page to evaluate on the other 3 datasets.


We include the following datasets in BEIR:

| Dataset   | Website| BEIR-Name | Domain     | Relevancy| Queries  | Documents | Avg. Docs/Q | Download |
| -------- | -----| ---------| ----------- | ---------| ---------| --------- | ------| ------------|
| MSMARCO    | [``Homepage``](https://microsoft.github.io/msmarco/)| ``msmarco`` | Misc.       |  Binary  |  6,980   |  8.84M     |    1.1 | Yes |  
| TREC-COVID |  [``Homepage``](https://ir.nist.gov/covidSubmit/index.html)| ``trec-covid``| Bio-Medical |  3-level|50|  171K| 493.5 | Yes |
| NFCorpus   | [``Homepage``](https://www.cl.uni-heidelberg.de/statnlpgroup/nfcorpus/) | ``nfcorpus``  | Bio-Medical |  3-level |  323     |  3.6K     |  38.2 | Yes |
| BioASQ     | [``Homepage``](http://bioasq.org) | ``bioasq``| Bio-Medical |  Binary  |   500    |  14.91M    |  8.05 | No |
| NQ         | [``Homepage``](https://ai.google.com/research/NaturalQuestions) | ``nq``| Wikipedia   |  Binary  |  3,452   |  2.68M  |  1.2 | Yes |
| HotpotQA   | [``Homepage``](https://hotpotqa.github.io) | ``hotpotqa``| Wikipedia   |  Binary  |  7,405   |  5.23M  |  2.0 | Yes |
| FiQA-2018  | [``Homepage``](https://sites.google.com/view/fiqa/) | ``fiqa``    | Finance     |  Binary  |  648     |  57K    |  2.6 | Yes |
| Signal-1M (RT) | [``Homepage``](https://research.signal-ai.com/datasets/signal1m-tweetir.html)| ``signal1m`` | Twitter     |  3-level  |   97   |  2.86M  |  19.6 | No |
| TREC-NEWS  | [``Homepage``](https://trec.nist.gov/data/news2019.html) | ``trec-news``    | News     |  5-level  |   57    |  595K    |  19.6 | No |
| ArguAna    | [``Homepage``](http://argumentation.bplaced.net/arguana/data) | ``arguana`` | Misc.       |  Binary  |  1,406     |  8.67K    |  1.0 | Yes |
| Touche-2020| [``Homepage``](https://webis.de/events/touche-20/shared-task-1.html) | ``webis-touche2020``| Misc.       |  6-level  |  49     |  382K    |  49.2 |  Yes |
| CQADupstack| [``Homepage``](http://nlp.cis.unimelb.edu.au/resources/cqadupstack/) | ``cqadupstack``| StackEx.      |  Binary  |  13,145 |  457K  |  1.4 |  Yes |
| Quora| [``Homepage``](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs) | ``quora``| Quora  | Binary  |  10,000     |  523K    |  1.6 |  Yes |
| DBPedia | [``Homepage``](https://github.com/iai-group/DBpedia-Entity/) | ``dbpedia-entity``| Wikipedia |  3-level  |  400    |  4.63M    |  38.2 |  Yes |
| SCIDOCS| [``Homepage``](https://allenai.org/data/scidocs) | ``scidocs``| Scientific |  Binary  |  1,000     |  25K    |  4.9 |  Yes |
| FEVER| [``Homepage``](http://fever.ai) | ``fever``| Wikipedia     |  Binary  |  6,666     |  5.42M    |  1.2|  Yes |
| Climate-FEVER| [``Homepage``](http://climatefever.ai) | ``climate-fever``| Wikipedia |  Binary  |  1,535     |  5.42M |  3.0 |  Yes |
| SciFact| [``Homepage``](https://github.com/allenai/scifact) | ``scifact``| Scientific |  Binary  |  300     |  5K    |  1.1 |  Yes |
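
The `BEIR-Name` column is the identifier that goes into the download URL used later in this notebook. As a minimal sketch, the mapping could look like the following. The `dataset_url` helper and the `DOWNLOADABLE` list are hypothetical names introduced here for illustration; the list simply copies the rows marked "Yes" in the table, and whether every archive is actually present on the server is an assumption.

```python
# Base URL taken from the download cell of this notebook.
BASE_URL = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets"

# BEIR-Name values whose Download column says "Yes" in the table above.
DOWNLOADABLE = [
    "msmarco", "trec-covid", "nfcorpus", "nq", "hotpotqa", "fiqa",
    "arguana", "webis-touche2020", "cqadupstack", "quora",
    "dbpedia-entity", "scidocs", "fever", "climate-fever", "scifact",
]


def dataset_url(name: str) -> str:
    """Hypothetical helper: build the zip URL for a downloadable dataset."""
    if name not in DOWNLOADABLE:
        raise ValueError(f"{name!r} is not marked as downloadable in the table")
    return f"{BASE_URL}/{name}.zip"


print(dataset_url("scifact"))
# https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scifact.zip
```

Rejecting names such as `bioasq` up front gives a clearer error than a failed HTTP request.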


# Download Dataset

In [4]:
DATASET = "nq"

# ! git lfs install

# ! git clone https://huggingface.co/datasets/BeIR/nq
# ! mkdir --parents ./datasets/; 
# ! mv nq datasets/
# ! gzip -d datasets/nq/corpus.jsonl.gz
# ! gzip -d datasets/nq/queries.jsonl.gz

# ! git clone https://huggingface.co/datasets/BeIR/nq-qrels
# ! mv nq-qrels datasets/nq/qrels

### SLOW
import pathlib, os
from beir import util
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip".format(DATASET)
out_dir = os.path.join(os.getcwd(), "datasets")
data_path = util.download_and_unzip(url, out_dir)
print("Dataset downloaded here: {}".format(data_path))

Dataset downloaded here: /local1/mohsenfayyaz/projects/Retriever-Contextualization/src/notebooks/datasets/nq


In [5]:
from beir.datasets.data_loader import GenericDataLoader
from tqdm.auto import tqdm

data_path = f"datasets/{DATASET}"
corpus_raw, queries, qrels = GenericDataLoader(data_path).load(split="test") # or split = "train" or "dev"

gold_docs = set()
for test_k, test_v in tqdm(qrels.items()):
    for doc_k, doc_v in test_v.items():
        gold_docs.add(doc_k)
print({
    "#Corpus:": len(corpus_raw), 
    "#Gold_Corpus:": len(gold_docs),
    "#Queries&qrels:": len(queries)
})
corpus = {d: corpus_raw[d] for d in gold_docs}  # gold-only corpus; set corpus = corpus_raw to search all documents

2024-08-20 15:11:27 - Loading Corpus...


  0%|          | 0/2681468 [00:00<?, ?it/s]

2024-08-20 15:11:40 - Loaded 2681468 TEST Documents.
2024-08-20 15:11:40 - Doc Example: {'text': "In accounting, minority interest (or non-controlling interest) is the portion of a subsidiary corporation's stock that is not owned by the parent corporation. The magnitude of the minority interest in the subsidiary company is generally less than 50% of outstanding shares, or the corporation would generally cease to be a subsidiary of the parent.[1]", 'title': 'Minority interest'}
2024-08-20 15:11:40 - Loading Queries...
2024-08-20 15:11:40 - Loaded 3452 TEST Queries.
2024-08-20 15:11:40 - Query Example: what is non controlling interest on balance sheet


  0%|          | 0/3452 [00:00<?, ?it/s]

{'#Corpus:': 2681468, '#Gold_Corpus:': 4201, '#Queries&qrels:': 3452}
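
The three objects loaded above are plain nested dictionaries, which is why the gold subset can be built with ordinary dict operations. The toy data below is a sketch of those shapes: the document ids and texts are made up, and only the key layout (`title`/`text` per document, query id to text, query id to `{doc_id: relevance}`) mirrors what the log output above shows.

```python
# Toy stand-ins for the loader's three return values (contents are invented).
corpus_raw = {
    "doc0": {"title": "Minority interest", "text": "Toy text about minority interest."},
    "doc1": {"title": "Unrelated page", "text": "Toy text that no query refers to."},
    "doc2": {"title": "Balance sheet", "text": "Toy text about balance sheets."},
}
queries = {
    "test0": "what is non controlling interest on balance sheet",
    "test1": "what does a balance sheet show",
}
qrels = {
    "test0": {"doc0": 1},
    "test1": {"doc2": 1},
}

# Collect every document id that is judged for at least one query.
gold_docs = {doc_id for rels in qrels.values() for doc_id in rels}

# Restrict the corpus to those documents, as the cell above does.
corpus = {doc_id: corpus_raw[doc_id] for doc_id in gold_docs}

print(sorted(corpus))  # ['doc0', 'doc2']
```

In the real run this reduces 2,681,468 documents to 4,201, so retrieval over `corpus` is a much easier task than retrieval over `corpus_raw`.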


# **Dense Retrieval using Exact Search**

In [6]:
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval import models
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

#### Dense retrieval with a bi-encoder loaded through sentence-transformers ####
#### Provide any pretrained sentence-transformers model (or a Hugging Face checkpoint, as here)
#### Pick score_function to match how the model was trained ("cos_sim" or "dot")
#### Complete list - https://www.sbert.net/docs/pretrained_models.html

MODEL = "facebook/contriever-msmarco"  # "msmarco-distilbert-base-v3"
model = DRES(models.SentenceBERT(MODEL), batch_size=128)
retriever = EvaluateRetrieval(model, score_function="cos_sim")

#### Retrieve dense results (format of results is identical to qrels)
results = retriever.retrieve(corpus, queries)

2024-08-20 15:11:42 - PyTorch version 2.4.0 available.
2024-08-20 15:11:42 - Loading faiss with AVX2 support.
2024-08-20 15:11:42 - Successfully loaded faiss with AVX2 support.
2024-08-20 15:11:42 - Use pytorch device_name: cuda
2024-08-20 15:11:42 - Load pretrained SentenceTransformer: facebook/contriever-msmarco
2024-08-20 15:11:42 - No sentence-transformers model found with name facebook/contriever-msmarco. Creating a new one with mean pooling.




2024-08-20 15:11:44 - Encoding Queries...


Batches:   0%|          | 0/27 [00:00<?, ?it/s]

2024-08-20 15:11:45 - Sorting Corpus by document length (Longest first)...
2024-08-20 15:11:45 - Scoring Function: Cosine Similarity (cos_sim)
2024-08-20 15:11:45 - Encoding Batch 1/1...


Batches:   0%|          | 0/33 [00:00<?, ?it/s]
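
Conceptually, exact search scores every query embedding against every document embedding and keeps the highest-scoring documents per query. The following pure-Python sketch illustrates that idea only; it is not BEIR's `DenseRetrievalExactSearch` implementation, and the two-dimensional vectors, the `exact_search` helper, and the document ids are made up for illustration.

```python
import math


def cos_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def exact_search(query_vecs, doc_vecs, top_k=2):
    """Score each query against all documents and keep the top_k hits."""
    out = {}
    for qid, qvec in query_vecs.items():
        scores = {did: cos_sim(qvec, dvec) for did, dvec in doc_vecs.items()}
        best = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
        out[qid] = dict(best)  # same shape as qrels: {query_id: {doc_id: score}}
    return out


# Invented embeddings; a real model produces vectors with hundreds of dimensions.
query_vecs = {"test0": [1.0, 0.0]}
doc_vecs = {"doc0": [0.9, 0.1], "doc1": [0.0, 1.0], "doc2": [0.5, 0.5]}

toy_results = exact_search(query_vecs, doc_vecs, top_k=2)
print(list(toy_results["test0"]))  # ['doc0', 'doc2']
```

The output has the same `{query_id: {doc_id: score}}` layout as `results` above, which is what lets the evaluation step compare it directly against `qrels`.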

In [11]:
len(results["test0"])

1000

In [8]:
#### Evaluate your retrieval using NDCG@k, MAP@k, Recall@k, Precision@k

logging.info("Retriever evaluation for k in: {}".format(retriever.k_values))
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
recall

2024-08-20 15:22:54 - Retriever evaluation for k in: [1, 3, 5, 10, 100, 1000]
2024-08-20 15:22:54 - For evaluation, we ignore identical query and document ids (default), please explicitly set ``ignore_identical_ids=False`` to ignore this.
2024-08-20 15:22:55 - 

2024-08-20 15:22:55 - NDCG@1: 0.9076
2024-08-20 15:22:55 - NDCG@3: 0.9378
2024-08-20 15:22:55 - NDCG@5: 0.9453
2024-08-20 15:22:55 - NDCG@10: 0.9501
2024-08-20 15:22:55 - NDCG@100: 0.9518
2024-08-20 15:22:55 - NDCG@1000: 0.9522
2024-08-20 15:22:55 - 

2024-08-20 15:22:55 - MAP@1: 0.8138
2024-08-20 15:22:55 - MAP@3: 0.9266
2024-08-20 15:22:55 - MAP@5: 0.9322
2024-08-20 15:22:55 - MAP@10: 0.9346
2024-08-20 15:22:55 - MAP@100: 0.9351
2024-08-20 15:22:55 - MAP@1000: 0.9351
2024-08-20 15:22:55 - 

2024-08-20 15:22:55 - Recall@1: 0.8138
2024-08-20 15:22:55 - Recall@3: 0.9596
2024-08-20 15:22:55 - Recall@5: 0.9762
2024-08-20 15:22:55 - Recall@10: 0.9897
2024-08-20 15:22:55 - Recall@100: 0.9974
2024-08-20 15:22:55 - Recall@1000: 0.9999

{'Recall@1': 0.81383,
 'Recall@3': 0.95959,
 'Recall@5': 0.97625,
 'Recall@10': 0.98972,
 'Recall@100': 0.99744,
 'Recall@1000': 0.9999}
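
Keep in mind that these scores were computed against the gold-only corpus of 4,201 documents, not the full 2.68M-document collection, so they are not comparable to full-corpus numbers. To make the metrics themselves concrete, here is a small sketch of Recall@k and NDCG@k on toy data. BEIR delegates the real computation to `pytrec_eval`; the functions below are a simplified re-implementation that assumes linear gain and a log2 rank discount, and the ids and scores are invented.

```python
import math


def _ranked(hits):
    """Document ids sorted by score, highest first."""
    return [d for d, _ in sorted(hits.items(), key=lambda kv: kv[1], reverse=True)]


def recall_at_k(qrels, results, k):
    """Mean fraction of relevant documents found in the top k."""
    per_query = []
    for qid, rels in qrels.items():
        relevant = {d for d, r in rels.items() if r > 0}
        if not relevant:
            continue
        top = set(_ranked(results.get(qid, {}))[:k])
        per_query.append(len(relevant & top) / len(relevant))
    return sum(per_query) / len(per_query) if per_query else 0.0


def ndcg_at_k(qrels, results, k):
    """Mean NDCG@k with linear gain (assumption) and log2 discount."""
    per_query = []
    for qid, rels in qrels.items():
        ranked = _ranked(results.get(qid, {}))[:k]
        dcg = sum(rels.get(d, 0) / math.log2(i + 2) for i, d in enumerate(ranked))
        ideal = sorted(rels.values(), reverse=True)[:k]
        idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal))
        if idcg > 0:
            per_query.append(dcg / idcg)
    return sum(per_query) / len(per_query) if per_query else 0.0


toy_qrels = {"q1": {"d1": 1, "d2": 1}, "q2": {"d3": 1}}
toy_hits = {
    "q1": {"d1": 0.9, "d4": 0.8, "d2": 0.7},
    "q2": {"d5": 0.6, "d3": 0.5},
}

print(recall_at_k(toy_qrels, toy_hits, 1))  # 0.25
print(recall_at_k(toy_qrels, toy_hits, 3))  # 1.0
print(ndcg_at_k(toy_qrels, toy_hits, 1))    # 0.5
```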

# Save Results

In [None]:
import os
import pandas as pd
from dotenv import load_dotenv
from huggingface_hub import login

load_dotenv()
login(os.environ["HF_API_TOKEN"])  # expects HF_API_TOKEN in the environment or .env file
pd.set_option('display.max_colwidth', 200)

# df = pd.read_json("hf://datasets/mohsenfayyaz/misc/res_triviaqa_test_w_gs.jsonl", lines=True)
# df.to_json("./res_triviaqa_test_w_gs.jsonl", lines=True, orient="records")

# Sort each query's hits by score (highest first) so the top-5 slice below is meaningful
sorted_results = {k: dict(sorted(v.items(), key=lambda item: item[1], reverse=True)) for k, v in results.items()}
rows = []
for key, hits in tqdm(sorted_results.items()):
    rows.append({
        "key": key,
        "query": queries[key],
        "gold_docs": list(qrels[key]),
        "gold_docs_text": [corpus[k] for k in qrels[key]],
        "results": hits,
        "predicted_docs_text_5": [corpus[k] for k in list(hits)[:5]],
    })
df = pd.DataFrame(rows)
df.attrs['model'] = MODEL
df.attrs['dataset'] = DATASET
df.attrs['corpus_size'] = len(corpus)
df.attrs['eval'] = {"ndcg": ndcg, "map": _map, "recall": recall, "precision": precision}
print(df.attrs)
hf_path = f"hf://datasets/Retriever-Contextualization/datasets/{DATASET}/{MODEL.replace('/', '--')}_corpus{len(corpus)}.parquet"
df.to_parquet(hf_path)
print("UPLOADED:", hf_path)
df
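
The cell above relies on two small pieces of logic: taking the five highest-scoring documents per query, and turning the model name into a file name that is safe to use in a path. The sketch below isolates both. `top_k_docs` and `parquet_name` are hypothetical helper names, and `heapq.nlargest` is swapped in for the full sort used above, which gives the same top-k ordering without sorting every hit.

```python
import heapq


def top_k_docs(hits, k=5):
    """Return the ids of the k highest-scoring documents, best first."""
    best = heapq.nlargest(k, hits.items(), key=lambda kv: kv[1])
    return [doc_id for doc_id, _ in best]


def parquet_name(model, corpus_size):
    """Build the file name used above; '/' in model names becomes '--'."""
    return f"{model.replace('/', '--')}_corpus{corpus_size}.parquet"


toy_hits = {"doc3": 0.2, "doc1": 0.9, "doc2": 0.5}

print(top_k_docs(toy_hits, k=2))
# ['doc1', 'doc2']
print(parquet_name("facebook/contriever-msmarco", 4201))
# facebook--contriever-msmarco_corpus4201.parquet
```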

In [None]:
qrels["test0"]