# Evaluating SPLADE (Sparse Lexical and Expansion Model for First Stage Ranking) with BEIR

This notebook gives a minimal example of evaluating SPLADE on the BEIR benchmark (https://github.com/UKPLab/beir).

The table below reports BEIR results for our distilled model (`../weights/distilsplade_max`); see https://arxiv.org/abs/2109.10086 for details on training SPLADE with Margin-MSE distillation.

|      dataset      | NDCG@10 | Recall@100 (%) |
|:-----------------:|:-------:|:--------------:|
|      ArguAna      |  0.479  |   97.23   |
|   Climate-FEVER   |  0.235  |   52.43   |
|      DBPedia      |  0.435  |   57.52   |
|       FEVER       |  0.786  |   95.14   |
|    FiQA-2018      |  0.336  |   62.10   |
|      HotpotQA     |  0.684  |   82.03   |
|      NFCorpus     |  0.334  |   27.71   |
|         NQ        |  0.521  |   93.05   |
|       Quora       |  0.838  |   98.69   |
|      SCIDOCS      |  0.158  |   36.43   |
|      SciFact      |  0.693  |   92.03   |
|     TREC-COVID    |  0.710  |   54.98   |
|  Touché-2020 (v1) |  0.364  |   35.39   |
|    Avg. zero-shot |  0.506  |   66.89   |
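
As a reminder of what the NDCG@10 column measures, here is a simplified single-query sketch. It assumes linear gain and a `log2(rank + 1)` discount; BEIR delegates the actual computation to `pytrec_eval`, so conventions may differ in small ways. The function names and the toy relevance judgments are illustrative only.

```python
import math

def dcg(gains):
    # Discounted cumulative gain: each gain is divided by log2(rank + 1).
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg_at_k(ranked_doc_ids, relevance, k):
    # Gains of the top-k retrieved documents (unjudged documents count as 0).
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    # Ideal ranking: judged documents sorted by decreasing relevance.
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy judgments for one query (hypothetical data).
relevance = {"d1": 2, "d2": 1}
perfect = ndcg_at_k(["d1", "d2", "d3"], relevance, k=10)
swapped = ndcg_at_k(["d3", "d2", "d1"], relevance, k=10)
print(perfect, round(swapped, 4))
```

The reported score is the mean of this per-query value over all test queries.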

In [1]:
from models import Splade, BEIRSpladeModel
from transformers import AutoModelForMaskedLM, AutoTokenizer

In [2]:
# Set the directory containing the trained weights.
# NOTE: this version only supports max aggregation in SPLADE, so only the two directories below work as-is.
# To use the older weights ("../weights/flops_best" and "../weights/flops_efficient") with the BEIR benchmark,
# change the aggregation in the SPLADE forward pass in models.py.

model_type_or_dir = "../weights/distilsplade_max"
# model_type_or_dir = "../weights/splade_max"
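
The note above distinguishes max aggregation from the aggregation used by the older weights. As a rough illustration, the sketch below pools per-token vocabulary logits with `log(1 + ReLU(x))` followed by either a max or a sum over tokens, as described in the SPLADE papers. It is a pure-Python toy, not the code in `models.py`; the function name `splade_pool` and the example values are hypothetical.

```python
import math

def splade_pool(token_logits, attention_mask, agg="max"):
    """Pool per-token vocabulary logits into a single term-weight vector.

    token_logits: one row per input token, each row of vocabulary size.
    attention_mask: 1 for real tokens, 0 for padding (padding is skipped).
    """
    vocab_size = len(token_logits[0])
    pooled = [0.0] * vocab_size
    for row, keep in zip(token_logits, attention_mask):
        if not keep:
            continue
        for j, logit in enumerate(row):
            weight = math.log1p(max(logit, 0.0))  # log(1 + ReLU(logit))
            if agg == "max":
                pooled[j] = max(pooled[j], weight)
            else:  # "sum" aggregation
                pooled[j] += weight
    return pooled

# Two real tokens and one padding token over a 3-term vocabulary.
logits = [[2.0, -1.0, 0.0], [0.5, 3.0, -2.0], [9.0, 9.0, 9.0]]
mask = [1, 1, 0]
max_vec = splade_pool(logits, mask, agg="max")
sum_vec = splade_pool(logits, mask, agg="sum")
print(max_vec, sum_vec)
```

Because the two poolings produce weights on different scales, weights trained with one are not expected to evaluate well under the other without the change mentioned above.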

In [3]:
# loading model and tokenizer

model = Splade(model_type_or_dir)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_type_or_dir)
beir_splade = BEIRSpladeModel(model, tokenizer)
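
For orientation, BEIR's exact-search class calls two methods on the wrapped model, `encode_queries` and `encode_corpus`, where each corpus item is a dict with `title` and `text` fields. The toy class below mimics that interface with a bag-of-words encoder. It is a hypothetical stand-in, not the repository's `BEIRSpladeModel`, and it returns plain lists, whereas the real search class expects array or tensor outputs.

```python
class ToyBEIRModel:
    """Hypothetical stand-in that only illustrates the interface shape."""

    def __init__(self, vocab):
        self.index = {term: i for i, term in enumerate(vocab)}

    def _encode(self, text):
        # Count occurrences of each vocabulary term (a crude sparse vector).
        vec = [0.0] * len(self.index)
        for token in text.lower().split():
            if token in self.index:
                vec[self.index[token]] += 1.0
        return vec

    def encode_queries(self, queries, batch_size=16, **kwargs):
        return [self._encode(q) for q in queries]

    def encode_corpus(self, corpus, batch_size=16, **kwargs):
        return [self._encode(doc.get("title", "") + " " + doc["text"]) for doc in corpus]

def dot(u, v):
    # Dot-product relevance score between a query and a document vector.
    return sum(a * b for a, b in zip(u, v))

toy = ToyBEIRModel(["sparse", "retrieval", "vitamin", "diet"])
q_vecs = toy.encode_queries(["sparse retrieval"])
d_vecs = toy.encode_corpus([
    {"title": "SPLADE", "text": "sparse lexical retrieval"},
    {"title": "Nutrition", "text": "vitamin and diet"},
])
scores = [dot(q_vecs[0], d) for d in d_vecs]
print(scores)
```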

In [4]:
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval
from beir import util, LoggingHandler

dataset = "nfcorpus"

url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip".format(dataset)
out_dir = "dataset/{}".format(dataset)
data_path = util.download_and_unzip(url, out_dir)

#### Provide the data path where nfcorpus has been downloaded and unzipped to the data loader
# data folder would contain these files: 
# (1) nfcorpus/corpus.jsonl  (format: jsonlines)
# (2) nfcorpus/queries.jsonl (format: jsonlines)
# (3) nfcorpus/qrels/test.tsv (format: tsv ("\t"))

corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

In [5]:
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES
from beir.retrieval.evaluation import EvaluateRetrieval

# SPLADE vectors are scored with a dot product, so BEIR's exact-search class can be reused.
dres = DRES(beir_splade)
retriever = EvaluateRetrieval(dres, score_function="dot")
# results maps each query id to a dict of {doc_id: score}.
results = retriever.retrieve(corpus, queries)
# Standard metrics at several cutoffs, plus capped recall (R_cap).
ndcg, map_, recall, p = EvaluateRetrieval.evaluate(qrels, results, [1, 10, 100, 1000])
results2 = EvaluateRetrieval.evaluate_custom(qrels, results, [1, 10, 100, 1000], metric="r_cap")
res = {"NDCG@10": ndcg["NDCG@10"],
       "Recall@100": recall["Recall@100"],
       "R_cap@100": results2["R_cap@100"]}
print("res for {}:".format(dataset), res, flush=True)

res for nfcorpus: {'NDCG@10': 0.33409, 'Recall@100': 0.27705, 'R_cap@100': 0.29269}
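
The output reports both `Recall@100` and `R_cap@100`. The sketch below shows the difference for a single query, assuming binary relevance and a non-empty set of relevant documents: capped recall divides by `min(k, number of relevant documents)` rather than by the number of relevant documents. The function names and toy data are illustrative, not BEIR's implementation.

```python
def recall_at_k(ranked, relevant, k):
    # Fraction of all relevant documents that appear in the top k.
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)

def capped_recall_at_k(ranked, relevant, k):
    # Same numerator, but the denominator is capped at k, so a query with
    # more than k relevant documents can still reach a score of 1.
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / min(len(relevant), k)

# One query with five relevant documents, evaluated at k=2 (hypothetical data).
ranked = ["d1", "d9", "d2", "d8"]
relevant = {"d1", "d2", "d3", "d4", "d5"}
plain = recall_at_k(ranked, relevant, k=2)
capped = capped_recall_at_k(ranked, relevant, k=2)
print(plain, capped)
```

Capped recall can never be lower than plain recall, which is consistent with `R_cap@100` (0.29269) exceeding `Recall@100` (0.27705) above; the gap presumably comes from queries with more than 100 relevant documents.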
