# SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking

This notebook shows a minimal usage example for SPLADE: load a model, encode a document, and inspect its sparse bag-of-words representation.

* We provide models via Hugging Face (https://huggingface.co/naver)
* See [Naver Labs Europe website](https://europe.naverlabs.com/research/machine-learning-and-optimization/splade-models/) for other intermediate models.

| model | MRR@10 (MS MARCO dev) | recall@1000 (MS MARCO dev) | expected FLOPS | avg. query length (approx.) | avg. document length (approx.) |
| --- | --- | --- | --- | --- | --- |
| `naver/splade_v2_max` (**v2** [HF](https://huggingface.co/naver/splade_v2_max)) | 34.0 | 96.5 | 1.32 | 18 | 92 |
| `naver/splade_v2_distil` (**v2** [HF](https://huggingface.co/naver/splade_v2_distil)) | 36.8 | 97.9 | 3.82 | 25 | 232 |
| `naver/splade-cocondenser-selfdistil` (**v2bis**, [HF](https://huggingface.co/naver/splade-cocondenser-selfdistil))| 37.6 | 98.4 | 2.32 | 56 | 134 |
| `naver/splade-cocondenser-ensembledistil` (**v2bis**, [HF](https://huggingface.co/naver/splade-cocondenser-ensembledistil)) | 38.3 | 98.3  | 1.85 | 44 | 120 |

In [1]:
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer
from splade.models.transformer_rep import Splade

In [2]:
# Set the model name (Hugging Face) or the directory holding trained weights

##### v2
# model_type_or_dir = "naver/splade_v2_max"
# model_type_or_dir = "naver/splade_v2_distil"

### v2bis, downloaded directly from Hugging Face
model_type_or_dir = "naver/splade-cocondenser-selfdistil"

### local checkpoint; this assignment overrides the Hugging Face model above
model_type_or_dir = "/scratch/lamdo/splade_checkpoints/experiments_combined_references_v6/debug/checkpoint/model"
# model_type_or_dir = "/home/lamdo/keyphrase_informativeness_test/splade/experiments_unarxive_intro_relatedwork_1citationpersentence+scirepeval_search_v2/debug/checkpoint/model"

In [3]:
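# load the model and its tokenizer

model = Splade(model_type_or_dir, agg="max")  # max-pool term weights over input tokens
model.eval()  # inference mode (disables dropout)
tokenizer = AutoTokenizer.from_pretrained(model_type_or_dir)
# map vocabulary ids back to token strings
reverse_voc = {v: k for k, v in tokenizer.vocab.items()}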
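
The `agg="max"` option above refers to max pooling over input tokens. As a rough illustration of the idea described in the SPLADE v2 paper, each vocabulary term receives the weight max over tokens of log(1 + ReLU(logit)). The sketch below uses plain Python lists and toy numbers; `splade_max_pool`, `toy_logits`, and `toy_mask` are hypothetical names, not part of the `splade` package, which works on batched tensors instead.

```python
import math

def splade_max_pool(token_logits, attention_mask):
    """Max-pool log(1 + relu(logit)) over the non-padded tokens.

    token_logits: one list of vocabulary logits per input token.
    attention_mask: 1 for real tokens, 0 for padding.
    """
    vocab_size = len(token_logits[0])
    rep = [0.0] * vocab_size
    for logits, keep in zip(token_logits, attention_mask):
        if not keep:
            continue  # padded positions contribute nothing
        for j, logit in enumerate(logits):
            weight = math.log1p(max(logit, 0.0))  # log-saturated, non-negative
            rep[j] = max(rep[j], weight)
    return rep

# Toy input (hypothetical values): 3 tokens, the last one padding, 4-term vocabulary.
toy_logits = [
    [2.0, -1.0, 0.0, 0.5],
    [0.5, -3.0, 1.0, 0.0],
    [9.0, 9.0, 9.0, 9.0],
]
toy_mask = [1, 1, 0]
toy_rep = splade_max_pool(toy_logits, toy_mask)
print([round(w, 2) for w in toy_rep])
```

Terms whose logits are never positive keep a weight of exactly zero, which is where the sparsity of the representation comes from.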

In [20]:
# example document: title and abstract of a paper on metasearch systems

doc = """Study on the Structure of Index Data for Metasearch System
This paper proposes a new technique for Metasearch system, which is based on the grouping of
both keywords and URLs. This technique enables metasearch systems to share information and to
reflect the estimation of users’ preference. With this system, users can search not only by their own
keywords but by similarity of HTML documents. In this paper, we describe the principle of the
grouping technique as well as the summary of the existing search systems."""

In [21]:
# now compute the document representation
doc_tokens = tokenizer(doc, return_tensors="pt")
with torch.no_grad():
    doc_rep = model(d_kwargs=doc_tokens)["d_rep"].squeeze()  # (sparse) doc rep in voc space, shape (30522,)

# get the number of non-zero dimensions in the rep
# (squeeze(-1) keeps a 1-D result even when a single dimension is active)
col = torch.nonzero(doc_rep).squeeze(-1).cpu().tolist()
print("number of actual dimensions: ", len(col))

# now let's inspect the bow representation, sorted by decreasing weight:
weights = doc_rep[col].cpu().tolist()
ranked = sorted(zip(col, weights), key=lambda item: item[1], reverse=True)
bow_rep = [(reverse_voc[k], round(v, 2)) for k, v in ranked]
print("SPLADE BOW rep:\n", bow_rep)

number of actual dimensions:  19
SPLADE BOW rep:
 [('index', 1.58), ('grouping', 1.58), ('meta', 1.48), ('search', 1.44), ('##sea', 1.43), ('html', 1.22), ('key', 0.97), ('data', 0.79), ('structure', 0.65), ('study', 0.52), ('ur', 0.47), ('system', 0.4), ('##word', 0.38), ('share', 0.24), ('systems', 0.21), ('document', 0.16), ('##words', 0.15), ('word', 0.1), ('text', 0.08)]
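
For first-stage ranking, a sparse representation like the one printed above is typically scored against a query representation with a dot product, so only terms active on both sides contribute. The following is a minimal sketch under stated assumptions: `sparse_dot` is a hypothetical helper, the document weights are the top five entries copied from the output above, and the query weights are invented for illustration rather than produced by the model.

```python
def sparse_dot(query_rep, doc_rep):
    """Dot product of two sparse term -> weight dictionaries."""
    # Iterate over the smaller dictionary and look terms up in the larger one.
    if len(query_rep) > len(doc_rep):
        query_rep, doc_rep = doc_rep, query_rep
    return sum(w * doc_rep[t] for t, w in query_rep.items() if t in doc_rep)

# Top document weights copied from the printed BOW representation.
doc_weights = {"index": 1.58, "grouping": 1.58, "meta": 1.48, "search": 1.44, "html": 1.22}

# Hypothetical query weights, invented for illustration only.
query_weights = {"meta": 1.0, "search": 2.0, "engine": 0.5}

score = sparse_dot(query_weights, doc_weights)
print(round(score, 2))
```

Here only `meta` and `search` overlap, so `engine` adds nothing to the score. In practice this computation is usually delegated to an inverted index rather than done pairwise.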


In [22]:
# map the input token ids back to their token strings
doc_tokens_string = [reverse_voc[vid] for vid in doc_tokens.input_ids[0].tolist()]

In [23]:
# expansion terms: weighted terms that do not appear among the input tokens
[item for item in bow_rep if item[0] not in doc_tokens_string]

[('##word', 0.38), ('document', 0.16), ('word', 0.1), ('text', 0.08)]
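
The filter above can be wrapped in a small helper that returns both the matched and the expansion terms, which makes it easier to compare several documents. This is only a sketch: `split_expansion_terms` is a hypothetical name, and the sample tokens below are a hand-written subset rather than the real tokenizer output.

```python
def split_expansion_terms(bow_rep, input_tokens):
    """Split (term, weight) pairs into terms found in the input and expansion terms."""
    seen = set(input_tokens)  # set lookup avoids rescanning the token list
    matched = [(t, w) for t, w in bow_rep if t in seen]
    expanded = [(t, w) for t, w in bow_rep if t not in seen]
    return matched, expanded

# Hand-written sample (hypothetical tokens; weights taken from the output above).
sample_bow = [("search", 1.44), ("html", 1.22), ("document", 0.16), ("text", 0.08)]
sample_tokens = ["[CLS]", "search", "html", "documents", "[SEP]"]

matched, expanded = split_expansion_terms(sample_bow, sample_tokens)
print("matched: ", matched)
print("expanded:", expanded)
```

Because the comparison is on exact token strings, `document` counts as an expansion term even though `documents` occurs in the input.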