# MODERN INFORMATION RETRIEVAL - AITECH 2023

This notebook contains examples of how to use the Sentence Transformers and BEIR libraries.
After completing it you should be able to build an Information Retrieval pipeline with your own documents.

### HOW TO LOAD DATA FROM BEIR and BEIR-PL benchmarks
Datasets and models from the BEIR and BEIR-PL benchmarks are available on Hugging Face:
- BEIR-PL https://huggingface.co/clarin-knext 
- BEIR https://huggingface.co/BeIR 

For the BEIR benchmark it is also possible to download the datasets directly as zip archives.

Keep dataset sizes in mind: some require a few GB of space.

![BEIR_PL_DATASETS.png](images/BEIR_PL_DATASETS.png)

In [None]:
from beir import util


dataset = "scifact"
# Downloading data, unpacking and loading
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip".format(dataset)
out_dir = "data"
data_path = util.download_and_unzip(url, out_dir)

from beir.datasets.data_loader import GenericDataLoader
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

In [3]:
queries

{'1': '0-dimensional biomaterials show inductive properties.',
 '3': '1,000 genomes project enables mapping of genetic sequence variation consisting of rare variants with larger penetrance effects than common variants.',
 '5': '1/2000 in UK have abnormal PrP positivity.',
 '13': '5% of perinatal mortality is due to low birth weight.',
 '36': 'A deficiency of vitamin B12 increases blood levels of homocysteine.',
 '42': 'A high microerythrocyte count raises vulnerability to severe anemia in homozygous alpha (+)- thalassemia trait subjects.',
 '48': 'A total of 1,000 people in the UK are asymptomatic carriers of vCJD infection.',
 '49': 'ADAR1 binds to Dicer to cleave pre-miRNA.',
 '50': 'AIRE is expressed in some skin tumors.',
 '51': 'ALDH1 expression is associated with better breast cancer outcomes.',
 '53': 'ALDH1 expression is associated with poorer prognosis in breast cancer.',
 '54': 'AMP-activated protein kinase (AMPK) activation increases inflammation-related fibrosis in the lung

In [None]:
# Downloading dataset directly from huggingface
# Note that the formats are different from GenericDataLoader output
from beir.datasets.data_loader_hf import HFDataLoader
corpus, queries, qrels = HFDataLoader(hf_repo="clarin-knext/scifact-pl", streaming=False, keep_in_memory=False).load(split="test")
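# Conversion from datasets.Dataset objects to the dictionaries returned by GenericDataLoader
# (queries map an id to a plain string, corpus entries map an id to a title/text dictionary)
queries = {query['id']: query['text'] for query in queries}
corpus = {doc['id']: {'title': doc['title'], 'text': doc['text']} for doc in corpus}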
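
Both loaders are meant to end up with the same three dictionaries. The sketch below spells out that layout on made-up data; `check_beir_format` is a hypothetical helper written for this notebook, not part of BEIR.

```python
# Toy data in the layout returned by GenericDataLoader (all values are made up).
corpus = {
    "doc1": {"title": "Vitamin B12", "text": "B12 deficiency raises homocysteine levels."},
    "doc2": {"title": "ALDH1", "text": "ALDH1 expression and breast cancer prognosis."},
}
queries = {"q1": "A deficiency of vitamin B12 increases blood levels of homocysteine."}
qrels = {"q1": {"doc1": 1}}  # query id -> {document id: relevance label}


def check_beir_format(corpus, queries, qrels):
    """Raise AssertionError if the three dictionaries do not line up."""
    for doc_id, doc in corpus.items():
        assert isinstance(doc_id, str) and "text" in doc, f"bad corpus entry {doc_id!r}"
    for query_id, text in queries.items():
        assert isinstance(text, str), f"query {query_id!r} must map to a plain string"
    for query_id, judged in qrels.items():
        assert query_id in queries, f"qrels refer to unknown query {query_id!r}"
        assert all(doc_id in corpus for doc_id in judged), "qrels refer to unknown documents"
    return True


print(check_beir_format(corpus, queries, qrels))
```

Running the same check on the dictionaries built from `HFDataLoader` is a quick way to confirm the conversion before passing them to a retriever.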

### How to use Sentence Transformers

Visit https://www.sbert.net/ for more examples and the documentation of the framework.

Models are available on Hugging Face: https://huggingface.co/sentence-transformers

English biencoder models:
- sentence-transformers/all-MiniLM-L6-v2 - cos sim, mean pooling
- sentence-transformers/all-distilroberta-v1 - cos sim, mean pooling
- sentence-transformers/all-mpnet-base-v2 - cos sim, mean pooling
- sentence-transformers/msmarco-distilbert-base-tas-b - dot prod, cls pooling

Multilingual models that can be used with Sentence Transformers and work for Polish:
- sentence-transformers/distiluse-base-multilingual-cased-v1 - dot product, mean pooling
- nthakur/mcontriever-base-msmarco - dot product, mean pooling
- intfloat/multilingual-e5-base - cos sim, mean pooling, remember to add "query: " or "passage: " before the query or passage
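
The prefix requirement of the E5 model is easy to forget, so here is a minimal sketch of how it could be handled. `add_e5_prefixes` is a hypothetical helper name; only the two prefix strings come from the list above.

```python
def add_e5_prefixes(queries, passages):
    """Prepend the "query: " and "passage: " prefixes expected by E5 models."""
    prefixed_queries = [f"query: {q}" for q in queries]
    prefixed_passages = [f"passage: {p}" for p in passages]
    return prefixed_queries, prefixed_passages


prefixed_queries, prefixed_passages = add_e5_prefixes(
    ["Czy Ala ma psa?"], ["Ala ma kota."]
)
print(prefixed_queries)
print(prefixed_passages)
```

The prefixed strings can then be passed to `SentenceTransformer.encode` like any other text.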


In [33]:
from sentence_transformers import SentenceTransformer, util
import torch

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Corpus with example sentences
corpus = ['A man is eating food.',
          'A man is eating a piece of bread.',
          'The girl is carrying a baby.',
          'A man is riding a horse.',
          'A woman is playing violin.',
          'Two men pushed carts through the woods.',
          'A man is riding a white horse on an enclosed ground.',
          'A monkey is playing drums.',
          'A cheetah is running behind its prey.'
          ]
corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)

# Query sentences:
queries = ['A man is eating pasta.', 
           'Someone in a gorilla costume is playing a set of drums.', 
           'A cheetah chases prey on across a field.']



# Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
top_k = min(5, len(corpus))
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=True)

    # We use cosine-similarity and torch.topk to find the highest 5 scores
    cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
    top_results = torch.topk(cos_scores, k=top_k)

    print("\n======================\n")
    print("Query:", query)
    print("\nTop 5 most similar sentences in corpus:")

    for score, idx in zip(top_results[0], top_results[1]):
        print(corpus[idx], "(Score: {:.4f})".format(score))



Query: Man is playing piano?

Top 5 most similar sentences in corpus:
A monkey is playing drums. (Score: 0.3061)
A man is eating food. (Score: 0.2775)
A man is eating a piece of bread. (Score: 0.2480)
A woman is playing violin. (Score: 0.2098)
A man is riding a horse. (Score: 0.1839)


### You can get embeddings directly from the model and compute scores
Different models may produce embeddings of different lengths.

Try loading other models and check how much the similarity changes for the same sentences.
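
When comparing models, keep in mind that cosine similarity and dot product are different scores. The pure-Python sketch below (toy 2-dimensional vectors, no model involved) shows that the dot product grows with vector length while cosine similarity does not, and that the two coincide for length-normalized vectors.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def normalize(u):
    norm = math.sqrt(dot(u, u))
    return [a / norm for a in u]


u = [3.0, 4.0]
v = [6.0, 8.0]   # same direction as u, twice the length
w = [4.0, -3.0]  # orthogonal to u

print(dot(u, v), cosine(u, v))           # dot product reflects length, cosine only direction
print(dot(u, w), cosine(u, w))           # orthogonal vectors score zero under both
print(dot(normalize(u), normalize(v)))   # dot product of unit vectors equals the cosine
```

This is why a model trained with dot product should be scored with dot product: the length of its embeddings carries information that cosine similarity discards.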

In [1]:
from sentence_transformers import SentenceTransformer, util
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('nthakur/mcontriever-base-msmarco')
embeddings = model.encode(sentences)
print(embeddings)
print(embeddings.shape)

[[ 0.0390083   0.03461862  0.0430243  ...  0.02513455 -0.10736141
  -0.04870313]
 [-0.06797605 -0.05478052  0.1530795  ...  0.0206313  -0.12184491
   0.06304809]]
(2, 768)


In [2]:
util.cos_sim(embeddings[0], embeddings[1]), util.dot_score(embeddings[0], embeddings[1])

(tensor([[0.5552]]), tensor([[1.7575]]))

### Small Exercise
### Check with your own queries and with different models. Can you think of queries that confuse the model?

In [None]:
# YOUR CODE HERE

### Cross-encoders in Sentence Transformers

There are also cross-encoder models available.

A cross-encoder performs classification over a joint input sequence.
When we pass a query and a passage to the model together, it can assess whether they are relevant to each other.
Cross-encoders are more computationally expensive than bi-encoder retrievers because corpus embeddings cannot be precomputed and stored: every query-passage pair needs its own forward pass.
![Bi_vs_Cross-Encoder.png](images/Bi_vs_Cross-Encoder.png)

https://www.sbert.net/examples/applications/cross-encoder/README.html
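
To make the cost difference concrete, the sketch below counts model forward passes for three setups. The sizes are hypothetical and the functions are plain arithmetic written for illustration.

```python
def bi_encoder_passes(num_queries, num_docs):
    """Each text is encoded once; document embeddings can be cached and reused."""
    return num_docs + num_queries


def cross_encoder_passes(num_queries, num_docs):
    """Every (query, document) pair needs its own forward pass."""
    return num_queries * num_docs


def rerank_passes(num_queries, top_k):
    """Cross-encoder applied only to the top_k candidates of a first-stage retriever."""
    return num_queries * top_k


num_queries, num_docs, top_k = 300, 5000, 100  # hypothetical sizes
print("bi-encoder:", bi_encoder_passes(num_queries, num_docs))
print("cross-encoder over the full corpus:", cross_encoder_passes(num_queries, num_docs))
print("cross-encoder as reranker:", rerank_passes(num_queries, top_k))
```

This is the usual motivation for a two-stage pipeline: a cheap retriever selects candidates and the cross-encoder only reranks those.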

Example cross-encoders for English:
- cross-encoder/ms-marco-MiniLM-L-6-v2
- cross-encoder/ms-marco-TinyBERT-L-2

Multilingual and Polish:
- cross-encoder/mmarco-mMiniLMv2-L12-H384-v1
- unicamp-dl/mMiniLM-L6-v2-mmarco-v2
- clarin-knext/herbert-base-reranker-msmarco


In [42]:
from sentence_transformers.cross_encoder import CrossEncoder
import numpy as np

# Pre-trained cross encoder
model = CrossEncoder('cross-encoder/ms-marco-TinyBERT-L-2')

# We want to compute the similarity between the query sentence

query = 'What is a man eating?'

# With all sentences in the corpus
corpus = ['A man is eating food.',
          'A man is eating a piece of bread.',
          'The girl is carrying a baby.',
          'A man is riding a horse.',
          'A woman is playing violin.',
          'Two men pushed carts through the woods.',
          'A man is riding a white horse on an enclosed ground.',
          'A monkey is playing drums.',
          'A cheetah is running behind its prey.'
          ]

# So we create the respective sentence combinations
sentence_combinations = [[query, corpus_sentence] for corpus_sentence in corpus]

# Compute the similarity scores for these combinations
similarity_scores = model.predict(sentence_combinations)

# Sort the scores in decreasing order
sim_scores_argsort = reversed(np.argsort(similarity_scores))

# Print the scores
print("Query:", query)
for idx in sim_scores_argsort:
    print("{:.2f}\t{}".format(similarity_scores[idx], corpus[idx]))

Query: What is a man eating?
0.98	A man is eating food.
0.98	A man is eating a piece of bread.
0.07	A man is riding a horse.
0.05	A man is riding a white horse on an enclosed ground.
0.00	A woman is playing violin.
0.00	A cheetah is running behind its prey.
0.00	Two men pushed carts through the woods.
0.00	The girl is carrying a baby.
0.00	A monkey is playing drums.


### Small Exercise
### Try different cross-encoders and compare the results with bi-encoders. Are they really performing better?

In [None]:
# YOUR CODE HERE

### Evaluation with BEIR benchmark

The BEIR benchmark has many useful classes that help with loading datasets and models and with performing evaluation.
More examples can be found at: https://github.com/beir-cellar/beir/tree/main/examples/retrieval/evaluation




### Evaluate BM25

To run Elasticsearch in Docker on your local machine, run the command:
```
docker run -d --name elasticsearch_ir -p 9200:9200 -p 9300:9300 -e "ES_JAVA_OPTS=-Xms512m -Xmx512m" -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch-oss:7.9.2
```

In Google Colab:
```
%%bash

wget -q https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-oss-7.9.2-linux-x86_64.tar.gz
wget -q https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-oss-7.9.2-linux-x86_64.tar.gz.sha512
tar -xzf elasticsearch-oss-7.9.2-linux-x86_64.tar.gz
sudo chown -R daemon:daemon elasticsearch-7.9.2/
shasum -a 512 -c elasticsearch-oss-7.9.2-linux-x86_64.tar.gz.sha512 
```


```
%%bash --bg

sudo -H -u daemon elasticsearch-7.9.2/bin/elasticsearch
```

The first example shows how to evaluate BM25:
- Run Elasticsearch with Docker (or in Colab)
- Index your corpus from the BEIR benchmark in Elasticsearch
- Use the EvaluateRetrieval class to perform retrieval and evaluation
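
For intuition about what the BM25 score measures, here is a minimal pure-Python sketch of the textbook Okapi BM25 formula. It is an illustration only: the tokenizer, the function names and the parameter values `k1=1.2`, `b=0.75` are assumptions of this sketch, and Elasticsearch uses its own analyzer and scoring implementation, so its scores will differ.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Very simple tokenizer: lowercase and keep alphanumeric runs."""
    return re.findall(r"\w+", text.lower())


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in docs against query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


docs = ["A man is eating food.",
        "A woman is playing violin.",
        "A cheetah is running behind its prey."]
scores = bm25_scores("man eating pasta", docs)
for doc, score in zip(docs, scores):
    print(f"{score:.2f}\t{doc}")
```

Only documents that share at least one term with the query get a non-zero score, which is the main limitation that dense retrievers try to address.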


In [None]:
from beir import util as beir_util  # aliased so it does not clash with sentence_transformers.util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.search.lexical import BM25Search as BM25
from beir.retrieval.evaluation import EvaluateRetrieval

#### Provide parameters for Elasticsearch
hostname = "localhost"  # host where Elasticsearch is running
index_name = "your-index-name"  # e.g. "scifact"
initialize = True  # True deletes an existing index with this name and re-indexes; False reuses it

dataset = "scifact"
# Downloading data, unpacking and loading
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip".format(dataset)
out_dir = "data"
data_path = beir_util.download_and_unzip(url, out_dir)

corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")


model = BM25(index_name=index_name, hostname=hostname, initialize=initialize)
retriever = EvaluateRetrieval(model)

results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg)

### How to search with Elasticsearch

An example of how to search Elasticsearch with your own query. Provide your queries as a list of strings in `texts`, as in the example below:

In [5]:
bm25_results = model.es.lexical_multisearch(
                texts=["Example query"], 
                top_hits=10)
bm25_results

[{'meta': {'total': 135, 'took': 1, 'num_hits': 10},
  'hits': [('6826636', 11.450113),
   ('10029970', 10.169866),
   ('17447653', 10.14879),
   ('37297740', 9.742962),
   ('1256116', 9.382215),
   ('3367829', 7.67427),
   ('15617300', 7.67427),
   ('15048300', 7.4469895),
   ('6078882', 7.008301),
   ('15588516', 6.83874)]}]
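
The output above is a list with one entry per query. If you want to evaluate your own searches, it can be reshaped into the `{query_id: {doc_id: score}}` dictionary used for results elsewhere in this notebook. The sketch below assumes the output structure shown above; `to_beir_results` and the query id are hypothetical names.

```python
def to_beir_results(query_ids, multisearch_output):
    """Turn [{'meta': ..., 'hits': [(doc_id, score), ...]}, ...] into {query_id: {doc_id: score}}."""
    results = {}
    for query_id, response in zip(query_ids, multisearch_output):
        results[query_id] = {doc_id: score for doc_id, score in response["hits"]}
    return results


# Shortened copy of the output shown above
example_output = [{"meta": {"total": 135, "took": 1, "num_hits": 10},
                   "hits": [("6826636", 11.450113), ("10029970", 10.169866)]}]
results_for_my_query = to_beir_results(["my-query"], example_output)
print(results_for_my_query)
```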

In [7]:
# Get ID of the passage with the highest score
doc_id = bm25_results[0]['hits'][0][0]
print(doc_id)
print(corpus[doc_id])

6826636
{'text': 'GiardiaDB (http://GiardiaDB.org) and TrichDB (http://TrichDB.org) house the genome databases for Giardia lamblia and Trichomonas vaginalis, respectively, and represent the latest additions to the EuPathDB (http://EuPathDB.org) family of functional genomic databases. GiardiaDB and TrichDB employ the same framework as other EuPathDB sites (CryptoDB, PlasmoDB and ToxoDB), supporting fully integrated and searchable databases. Genomic-scale data available via these resources may be queried based on BLAST searches, annotation keywords and gene ID searches, GO terms, sequence motifs and other protein characteristics. Functional queries may also be formulated, based on transcript and protein expression data from a variety of platforms. Phylogenetic relationships may also be interrogated. The ability to combine the results from independent queries, and to store queries and query results for future use facilitates complex, genome-wide mining of functional genomic data.', 'title

In [None]:
model = BM25(index_name=index_name, hostname=hostname, initialize=initialize)

corpus = ['A man is eating food.',
          'A man is eating a piece of bread.',
          'The girl is carrying a baby.',
          'A man is riding a horse.',
          'A woman is playing violin.',
          'Two men pushed carts through the woods.',
          'A man is riding a white horse on an enclosed ground.',
          'A monkey is playing drums.',
          'A cheetah is running behind its prey.'
          ]

corpus = {str(key): {'text': val} for key,val in enumerate(corpus)}

model.index(corpus)

In [17]:
corpus

{'0': {'text': 'A man is eating food.'},
 '1': {'text': 'A man is eating a piece of bread.'},
 '2': {'text': 'The girl is carrying a baby.'},
 '3': {'text': 'A man is riding a horse.'},
 '4': {'text': 'A woman is playing violin.'},
 '5': {'text': 'Two men pushed carts through the woods.'},
 '6': {'text': 'A man is riding a white horse on an enclosed ground.'},
 '7': {'text': 'A monkey is playing drums.'},
 '8': {'text': 'A cheetah is running behind its prey.'}}

In [19]:
queries = ['A man is eating pasta.', 
           'Someone in a gorilla costume is playing a set of drums.', 
           'A cheetah chases prey on across a field.']

bm25_results = model.es.lexical_multisearch(
                texts=queries, 
                top_hits=5)

for query, bm25_result in zip(queries,bm25_results):
    print("\n======================\n")
    print("Query:", query)
    print("\nTop 5 most similar sentences in corpus:")
    for idx, score in bm25_result['hits']:
        print("{:.2f}\t{}".format(score, corpus[idx]))



Query: A man is eating pasta.

Top 5 most similar sentences in corpus:
2.43	{'text': 'A man is eating food.'}
2.18	{'text': 'A man is eating a piece of bread.'}
0.89	{'text': 'A man is riding a horse.'}
0.66	{'text': 'A man is riding a white horse on an enclosed ground.'}


Query: Someone in a gorilla costume is playing a set of drums.

Top 5 most similar sentences in corpus:
3.66	{'text': 'A monkey is playing drums.'}
1.54	{'text': 'A woman is playing violin.'}


Query: A cheetah chases prey on across a field.

Top 5 most similar sentences in corpus:
3.44	{'text': 'A cheetah is running behind its prey.'}


### Evaluation of dense models with BEIR benchmark 

In [6]:
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval

In [None]:
# Here define model name:
model_name = 'nthakur/mcontriever-base-msmarco' 
# model_name = 'intfloat/multilingual-e5-base'
# model_name = 'sentence-transformers/distiluse-base-multilingual-cased-v1'
model = DRES(models.SentenceBERT(model_name), batch_size=128, corpus_chunk_size=512)

In [None]:
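# Reload the SciFact test split: `corpus` and `queries` were overwritten by the toy examples above
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

# Use score_function="cos_sim" for models trained with cosine similarity
retriever = EvaluateRetrieval(model, score_function="dot")
results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
mrr = retriever.evaluate_custom(qrels, results, retriever.k_values, metric="mrr")

print(f"NDCG: {ndcg}")
print(f"Recall: {recall}")
print(f"MRR: {mrr}")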
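
To see what the reported numbers mean, here is a simplified pure-Python sketch of NDCG@k, Recall@k and MRR@k for a single query. It is not the code BEIR runs (BEIR relies on an external evaluation package and averages over all queries), and details such as tie-breaking may differ. All function names and the toy data are made up.

```python
import math


def top_k_docs(doc_scores, k):
    """Document ids sorted by descending score, cut at k."""
    ranked = sorted(doc_scores.items(), key=lambda item: item[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]


def ndcg_at_k(relevant, doc_scores, k):
    ranked = top_k_docs(doc_scores, k)
    dcg = sum(relevant.get(d, 0) / math.log2(rank + 2) for rank, d in enumerate(ranked))
    ideal = sorted(relevant.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


def recall_at_k(relevant, doc_scores, k):
    ranked = top_k_docs(doc_scores, k)
    num_relevant = sum(1 for rel in relevant.values() if rel > 0)
    hits = sum(1 for d in ranked if relevant.get(d, 0) > 0)
    return hits / num_relevant if num_relevant else 0.0


def mrr_at_k(relevant, doc_scores, k):
    for rank, d in enumerate(top_k_docs(doc_scores, k), start=1):
        if relevant.get(d, 0) > 0:
            return 1.0 / rank
    return 0.0


toy_qrels = {"q1": {"d1": 1, "d3": 1}}
toy_results = {"q1": {"d1": 0.9, "d2": 0.8, "d3": 0.7}}
for k in (1, 3):
    print(k,
          round(ndcg_at_k(toy_qrels["q1"], toy_results["q1"], k), 4),
          recall_at_k(toy_qrels["q1"], toy_results["q1"], k),
          mrr_at_k(toy_qrels["q1"], toy_results["q1"], k))
```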

### Evaluation and usage of MonoT5 Reranker models

Example of how to use sequence-to-sequence monoT5 models for reranking.
Remember that different models may predict different tokens as true and false!

For English language:
- castorini/monot5-base-msmarco (tokens: ['▁false', '▁true'])
- castorini/monot5-small-msmarco-100k (tokens: ['▁false', '▁true'])

For Polish language:
- clarin-knext/plt5-base-msmarco (tokens: ['▁fałsz'   , '▁prawda'])
- unicamp-dl/mt5-base-mmarco-v2 (tokens: ['▁no'   , '▁yes'])
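
The true/false tokens matter because the relevance score is derived from the logits of exactly those two tokens. The sketch below shows the general idea with made-up logits. The prompt template follows the original English monoT5 setup and is an assumption for other checkpoints; the function names are hypothetical and this is not BEIR's implementation.

```python
import math


def monot5_input(query, passage):
    """Prompt in the style of the original monoT5 model (assumed here for illustration)."""
    return f"Query: {query} Document: {passage} Relevant:"


def relevance_probability(logit_false, logit_true):
    """Softmax restricted to the two answer tokens; returns the probability of the true token."""
    m = max(logit_false, logit_true)  # subtract the maximum for numerical stability
    exp_false = math.exp(logit_false - m)
    exp_true = math.exp(logit_true - m)
    return exp_true / (exp_false + exp_true)


print(monot5_input("Czy Ala ma psa?", "Ala ma kota."))
# Hypothetical logits of the false and true tokens at the first decoding step
print(relevance_probability(logit_false=2.0, logit_true=-1.0))
```

If the wrong token pair is configured, the two logits read from the vocabulary are unrelated to relevance and the ranking becomes meaningless.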

In [22]:
from beir.reranking.models import MonoT5
from beir.reranking import Rerank
cross_encoder_name = "clarin-knext/plt5-base-msmarco"
# castorini/monot5-base-msmarco
# unicamp-dl/mt5-base-mmarco-v2
cross_encoder = MonoT5(cross_encoder_name, use_amp=False, token_true='▁prawda', token_false='▁fałsz')
reranker = Rerank(cross_encoder, batch_size=128)

rerank_results = reranker.rerank(corpus, queries, results, top_k=100)
ndcg, _map, recall, precision = EvaluateRetrieval.evaluate(qrels, rerank_results, retriever.k_values)
print(ndcg)

In [None]:
# How to predict on your query and passage:

cross_encoder.predict([("Czy Ala ma psa?", "Ala ma kota.")])

### SPLADE
Here is an example of how to use a sparse retrieval model, SPLADE.
It produces a representation of the query over the whole tokenizer vocabulary.

To better understand the model output, you can print which tokens are the most important for a given query.

Available models:
- naver/splade-cocondenser-ensembledistil
- naver/splade_v2_distil

In [None]:
from beir.retrieval.models import SPLADE
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

model_path = "naver/splade_v2_distil" #"naver/splade-cocondenser-ensembledistil"
model = SPLADE(model_path)
splade_tokenizer = model.tokenizer


In [59]:
# Reverse vocabulary index is created to print the tokens later 
reverse_voc = {v: k for k, v in splade_tokenizer.vocab.items()}

In [None]:
# Here the SPLADE model is accessed in order to represent a single query
example_query = ['Who is carrying a child?']
example_doc = ['The girl is carrying a baby.', 'A man is eating food.']

encoded_splade = model.model.encode_sentence_bert(splade_tokenizer, example_query, is_q=True, maxlen=model.max_length)
encoded_doc_splade = model.model.encode_sentence_bert(splade_tokenizer, example_doc, is_q=False, maxlen=model.max_length)



In [None]:
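# Imported under an alias so the cell works even if `util` currently refers to beir.util
from sentence_transformers import util as st_util

st_util.dot_score(encoded_splade, encoded_doc_splade)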

In [62]:
# Check the SPLADE model vector size
encoded_splade.shape

(1, 30522)

In [63]:
import numpy as np
col = np.nonzero(encoded_splade.reshape(-1))[0]
print("number of actual dimensions: ", len(col))

number of actual dimensions:  26


In [64]:
weights = encoded_splade[0][col].tolist()
threshold = 0.5
d = {k: v for k, v in zip(col, weights) if v > threshold}
# sorted_d = {k: v for k, v in sorted(d.items(), key=lambda item: item[1], reverse=True)}
bow_rep = []
for k, v in d.items():
    bow_rep.append((reverse_voc[k], round(v, 2)))
print("SPLADE BOW rep:\n", bow_rep)

SPLADE BOW rep:
 [('a', 0.84), ('who', 1.7), ('children', 2.32), ('police', 0.66), ('god', 0.86), ('person', 0.91), ('child', 2.87), ('baby', 0.75), ('carried', 1.26), ('carry', 2.66), ('carrying', 2.56)]
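
Because only a handful of vocabulary entries are non-zero, a SPLADE vector can be stored as a small `{token: weight}` dictionary, and the dot product only needs the tokens that query and document share. In the sketch below the query weights are copied from the output above, while the document weights are invented for illustration; `sparse_dot` is a hypothetical helper.

```python
def sparse_dot(query_weights, doc_weights):
    """Dot product of two sparse vectors stored as {token: weight}."""
    # Iterate over the shorter dictionary; only shared tokens contribute.
    if len(query_weights) > len(doc_weights):
        query_weights, doc_weights = doc_weights, query_weights
    return sum(w * doc_weights[t] for t, w in query_weights.items() if t in doc_weights)


# Query weights copied from the output above; document weights are invented.
query_bow = {"child": 2.87, "carry": 2.66, "carrying": 2.56, "children": 2.32, "who": 1.7}
doc_bow = {"girl": 2.1, "baby": 1.9, "carrying": 1.5, "child": 1.2}

print(round(sparse_dot(query_bow, doc_bow), 3))
```

Note that the query contains `child` and `children` even though the original question only used one of them: this term expansion is what lets a sparse model match documents with related wording.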


### SPLADE MODEL EVALUATION ON BEIR DATASET

In [None]:
from beir import util as beir_util  # aliased so it does not clash with sentence_transformers.util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval

model = DRES(SPLADE(model_path), batch_size=128)
retriever = EvaluateRetrieval(model, score_function="dot")

dataset = "scifact"
# Downloading data, unpacking and loading
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip".format(dataset)
out_dir = "data"
data_path = beir_util.download_and_unzip(url, out_dir)

corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg)

### HOW TO TRAIN A MODEL
Here is an example of training a model with the BEIR library.

For more examples check: https://github.com/beir-cellar/beir/tree/main/examples/retrieval/training

In [None]:
from sentence_transformers import losses, models, SentenceTransformer
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.train import TrainRetriever
import os

#### Download nfcorpus.zip dataset and unzip the dataset
dataset = "nfcorpus"

url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip".format(dataset)
out_dir = os.path.join("datasets")
data_path = util.download_and_unzip(url, out_dir)

#### Provide the data_path where nfcorpus has been downloaded and unzipped
corpus, queries, qrels = GenericDataLoader(data_path).load(split="train")
#### Please note that not all datasets contain a dev split; comment out the line below if that is the case
dev_corpus, dev_queries, dev_qrels = GenericDataLoader(data_path).load(split="dev")

#### Provide any sentence-transformers or HF model
model_name = "distilbert-base-uncased" 
word_embedding_model = models.Transformer(model_name, max_seq_length=350)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

#### Or provide pretrained sentence-transformer model
# model = SentenceTransformer("msmarco-distilbert-base-v3")

retriever = TrainRetriever(model=model, batch_size=16)

#### Prepare training samples
train_samples = retriever.load_train(corpus, queries, qrels)
train_dataloader = retriever.prepare_train(train_samples, shuffle=True)

#### Training SBERT with cosine similarity
train_loss = losses.MultipleNegativesRankingLoss(model=retriever.model)
#### Training SBERT with dot product (dot_score lives in sentence_transformers.util, not beir.util)
# from sentence_transformers import util as st_util
# train_loss = losses.MultipleNegativesRankingLoss(model=retriever.model, similarity_fct=st_util.dot_score)

#### Prepare dev evaluator
ir_evaluator = retriever.load_ir_evaluator(dev_corpus, dev_queries, dev_qrels)

#### If no dev set is present from above use dummy evaluator
# ir_evaluator = retriever.load_dummy_evaluator()

#### Provide model save path
model_save_path = os.path.join( "output", "{}-v1-{}".format(model_name, dataset))
os.makedirs(model_save_path, exist_ok=True)

#### Configure Train params
num_epochs = 1
evaluation_steps = 5000
warmup_steps = int(len(train_samples) * num_epochs / retriever.batch_size * 0.1)

retriever.fit(train_objectives=[(train_dataloader, train_loss)], 
                evaluator=ir_evaluator, 
                epochs=num_epochs,
                output_path=model_save_path,
                warmup_steps=warmup_steps,
                evaluation_steps=evaluation_steps,
                use_amp=True)
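
The loss used above treats the other passages in a batch as negatives. The pure-Python sketch below shows the idea on a toy similarity matrix: row i holds the similarities of query i to every passage in the batch, passage i is the positive, and the loss is the cross-entropy of picking it. The matrices are made up, and the scale of 20 is assumed to match the library default; this is an illustration, not the library code.

```python
import math


def multiple_negatives_ranking_loss(similarities, scale=20.0):
    """similarities[i][j] is the similarity of query i and passage j; passage i is the positive."""
    losses = []
    for i, row in enumerate(similarities):
        logits = [scale * s for s in row]
        m = max(logits)  # subtract the maximum for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_sum_exp - logits[i])  # cross-entropy with target class i
    return sum(losses) / len(losses)


# Positives on the diagonal score highest: low loss
good = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.0, 0.1, 0.7]]
# A wrong passage scores highest for every query: high loss
bad = [[0.1, 0.9, 0.0], [0.8, 0.2, 0.1], [0.0, 0.7, 0.1]]
print(multiple_negatives_ranking_loss(good))
print(multiple_negatives_ranking_loss(bad))
```

Larger batches provide more in-batch negatives per query, which is one reason the batch size matters for this loss.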

# EXERCISE

Try to build your own retrieval system with existing models. You can try BM25, retriever models and reranker models.

Build a retrieval system for Game of Thrones. A dataset is available on Hugging Face; to download it you need the `datasets` library.
Load the dataset: https://huggingface.co/datasets/Tuana/game-of-thrones


In [None]:
from datasets import load_dataset

dataset = load_dataset("Tuana/game-of-thrones")

In [5]:
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'content', 'content_type', 'meta', 'score', 'embedding'],
        num_rows: 2357
    })
})

In [None]:
# YOUR CODE HERE