# Module 3 - Dense Retrieval with SentenceTransformers

In this notebook, we will learn how to retrieve documents for a given query using a dense retrieval model. We will use the [SentenceTransformers](https://www.sbert.net/) library, which has a simple API and several dense models already fine-tuned on retrieval datasets. These models provide reasonable zero-shot performance on the BEIR benchmark.

In this example, we will use `BAAI/bge-base-en-v1.5` as the retrieval model. `bge` is short for BAAI general embedding.

# Installing required packages

For this example, we will install the following libraries:

**`sentence-transformers`**:

`sentence-transformers` is a library developed by UKPLab, which provides a wide range of pre-trained models for computing sentence embeddings. It also offers a simple interface for fine-tuning these models on custom datasets, which can be used to improve the performance of the models on specific tasks.

**`datasets`**:

`datasets` is a library developed by Hugging Face, which provides a wide range of datasets for NLP tasks. It also offers a simple interface for downloading and loading these datasets, which can be used to train and evaluate models on specific tasks.

**`hnswlib`**:

`hnswlib` is a library developed by Yury Malkov, which provides an efficient implementation of the Hierarchical Navigable Small World (HNSW) algorithm for approximate nearest neighbor search. It runs on the CPU and supports multithreaded index construction and querying, which allows it to perform large-scale similarity searches at high speed.

In [None]:
!pip install sentence-transformers
!pip install datasets  # For the FiQA dataset
!pip install hnswlib  # To perform Approximate Nearest Neighbor search

# Setting the device

In this example, we will use a GPU to speed up the processing of our model. GPUs (Graphics Processing Units) are specialized processors that are optimized for performing large-scale computations in parallel. By using a GPU, we can accelerate the training and inference of a machine learning model, which can significantly reduce the time required to complete these tasks.

Before we begin, we need to check whether a GPU is available and select it as the default device for our PyTorch operations. This is because PyTorch can use either a CPU or a GPU to perform computations, and by default, it will use the CPU.

To use a GPU in Google Colab:
1. Click on the "Runtime" menu at the top of the screen.
2. From the dropdown menu, click on "Change runtime type".
3. In the popup window that appears, select "GPU" as the hardware accelerator.
4. Click on the "Save" button.

That's it! Now you can use the GPU for faster computations in your notebook.

In [None]:
!nvidia-smi

In [None]:
# Check if GPU is available
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'device: {device}')

# Initializing the model

We will use the `BAAI/bge-base-en-v1.5` model for this example. For more information about this model, please refer to the [model card](https://huggingface.co/BAAI/bge-base-en-v1.5).

For loading the model, we will use the `SentenceTransformer` class from the `sentence-transformers` library. This class provides a simple interface for loading and using sentence embedding models. It also offers a wide range of pre-trained models, which can be used for various NLP tasks.

In [None]:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('BAAI/bge-base-en-v1.5')

# Send model to GPU, if available
model = model.to(device)

# Encoding a few sentences

Let's encode a few sentences using the model. We will use the `encode` method of the `SentenceTransformer` class to encode the sentences. This method takes a list of strings as input and, by default, returns a NumPy array with one embedding per sentence.

In [None]:
# Sentences we would like to encode.
sentences = ['This framework generates embeddings for each input sentence',
    'Sentences are passed as a list of string.',
    'The quick brown fox jumps over the lazy dog.']

embeddings = model.encode(sentences)

# Print the embeddings
for sentence, embedding in zip(sentences, embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")


# Encoding and searching over a tiny corpus

When the corpus is small (i.e., a few thousand documents), we can convert all the documents to embeddings and perform a **"brute-force"** search: compute the cosine similarity of each document embedding against the query embedding and return to the user the top k documents with the highest cosine similarity.

Before searching, we need to define the `query_instruction`. Because this model was fine-tuned to compute embeddings for several different tasks, this instruction is prepended to each query so that the query is embedded for the retrieval task.

In [None]:
query_instruction = "Represent this sentence for searching relevant passages: "
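
The prepending step can be wrapped in a small helper. The sketch below is only an illustration: `format_queries` is a hypothetical name, not part of `sentence-transformers`, and it assumes the instruction is applied to queries only, as in this notebook, where the corpus is encoded without it.

```python
# Same instruction string as in the cell above.
query_instruction = "Represent this sentence for searching relevant passages: "


def format_queries(queries, instruction=query_instruction):
    """Prepend the retrieval instruction to each query string.

    Hypothetical helper for illustration; documents are left untouched.
    """
    return [instruction + q for q in queries]


example_queries = ["A man is eating pasta.", "A cheetah chases prey."]
for text in format_queries(example_queries):
    print(text)
```

The returned list can be passed directly to `model.encode`, which accepts a list of strings.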

In [None]:
# Corpus with example sentences
corpus = ['A man is eating food.',
          'A man is eating a piece of bread.',
          'The girl is carrying a baby.',
          'A man is riding a horse.',
          'A woman is playing violin.',
          'Two men pushed carts through the woods.',
          'A man is riding a white horse on an enclosed ground.',
          'A monkey is playing drums.',
          'A cheetah is running behind its prey.'
          ]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# Query sentences:
queries = ['A man is eating pasta.', 'Someone in a gorilla costume is playing a set of drums.', 'A cheetah chases prey on across a field.']


# Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
top_k = min(5, len(corpus))
for query in queries:
    formated_query = query_instruction + query
    query_embedding = model.encode(formated_query, convert_to_tensor=True)

    # We use cosine-similarity and torch.topk to find the highest 5 scores
    cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
    top_results = torch.topk(cos_scores, k=top_k)

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 5 most similar sentences in corpus:")

    for score, idx in zip(top_results[0], top_results[1]):
        print(corpus[idx], "(Score: {:.4f})".format(score))
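
To make the arithmetic behind the brute-force search visible, here is a minimal sketch of the same idea using only the standard library. The 3-dimensional vectors are made up for illustration and are not model output, and the function names are hypothetical. In practice `util.cos_sim` and `torch.topk` do this work far more efficiently.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_top_k(query_vec, doc_vecs, k):
    """Score every document against the query and keep the k best.

    Returns a list of (document index, score) pairs, best first.
    """
    scored = [(cosine_similarity(query_vec, d), i) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(i, s) for s, i in scored[:k]]


# Toy "embeddings" (assumed values, chosen only to keep the numbers readable)
toy_docs = [
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query = [1.0, 0.1, 0.0]

for doc_id, score in brute_force_top_k(toy_query, toy_docs, k=2):
    print(doc_id, round(score, 4))
```

Every document is scored once per query, so the cost grows linearly with the corpus size. This is why the approach only suits small corpora.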

# Indexing and searching over a large corpus

When the corpus is large (e.g., with millions of documents), brute-force search will be too slow. Thus, we need an Approximate Nearest Neighbor (ANN) algorithm to efficiently find the most similar document embeddings to a query embedding. Results, however, are approximate, i.e., the most similar document might not be retrieved.
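
One common way to quantify how approximate the results are is to compare the ids returned by the ANN search with the ids returned by an exact (brute-force) search for the same query. The sketch below assumes both id lists are already available; `recall_at_k` is a hypothetical helper name and the ids are made up.

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k ids that the approximate search also returned."""
    exact = set(exact_ids)
    if not exact:
        return 0.0
    return len(exact & set(approx_ids)) / len(exact)


# Made-up ids for illustration: the ANN search missed document 3
exact_top5 = [12, 7, 42, 3, 19]
approx_top5 = [12, 42, 7, 19, 88]

print(recall_at_k(approx_top5, exact_top5))
```

A recall of 1.0 means the approximate search returned exactly the same set of documents as the exact search, regardless of their order.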

In this notebook, we will use the HNSW algorithm implemented in the `hnswlib` library as our ANN algorithm.

We will work with a smaller version of the FiQA corpus with 1,000 documents.


In [None]:
from datasets import load_dataset

dataset = load_dataset("BeIR/fiqa", 'corpus')

dataset.values()

## Converting the corpus to embeddings


We will convert the corpus to embeddings using the `encode` method of the `SentenceTransformer` class. This method takes a list of strings as input and, with `convert_to_numpy=True`, returns a NumPy array with one embedding per row.

In [None]:
corpus_texts = [item['text'] for item in dataset['corpus']]
corpus_texts = corpus_texts[:1000]

corpus_embeddings = model.encode(corpus_texts, show_progress_bar=True, convert_to_numpy=True)

## Indexing

We will use the `Index` class from the `hnswlib` library to create an HNSW index for our corpus. This class provides a simple interface for creating and using an index. Two parameters matter most at build time: `M`, the number of links created for each element in the graph, and `ef_construction`, which trades indexing time for index quality.

In [None]:
import hnswlib

index = hnswlib.Index(space='cosine', dim=corpus_embeddings.shape[-1])

index.init_index(max_elements=len(corpus_embeddings), ef_construction=400, M=64)

# Add the embeddings to the index; HNSW builds its graph as items are inserted (there is no separate training step)
index.add_items(corpus_embeddings, list(range(len(corpus_embeddings))))

# We can optionally save the index to disk so we can reuse it later.
index_path = "./hnswlib.index"
print("Saving index to:", index_path)
index.save_index(index_path)

## Searching

Now that we have indexed our corpus, we can use the `knn_query` method of the `Index` class to find the top k most similar document embeddings to a query embedding. This method takes a query embedding and the number of nearest neighbors to return, and it returns two arrays: the ids of the nearest documents and their distances to the query.

In [None]:
query = "What is considered a business expense on a business trip?"

# Prepend the retrieval instruction to the new query
formated_query = query_instruction + query

top_k_hits = 5

query_embedding = model.encode(formated_query)

# We use hnswlib knn_query method to find the top_k_hits
corpus_ids, distances = index.knn_query(query_embedding, k=top_k_hits)

# We extract corpus ids and convert cosine distances to similarity scores (similarity = 1 - distance)
hits = [{'corpus_id': corpus_id, 'score': 1 - distance} for corpus_id, distance in zip(corpus_ids[0], distances[0])]
hits = sorted(hits, key=lambda x: x['score'], reverse=True)

print("Input query:", query)
for hit in hits[0:top_k_hits]:
    print("\t{:.3f}\t{}".format(hit['score'], corpus_texts[hit['corpus_id']]))
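
The `1 - distance` step is needed because, in the `cosine` space, `hnswlib` reports a distance defined as one minus the cosine similarity, so smaller distances mean more similar documents. The conversion can be sketched on its own with made-up values; `distances_to_hits` is a hypothetical helper, not an `hnswlib` function.

```python
def distances_to_hits(ids, distances):
    """Turn parallel sequences of ids and cosine distances into hits sorted by similarity."""
    hits = [
        {"corpus_id": int(i), "score": 1.0 - float(d)}
        for i, d in zip(ids, distances)
    ]
    return sorted(hits, key=lambda h: h["score"], reverse=True)


# Made-up ids and distances, deliberately not in sorted order
toy_ids = [9, 4, 1]
toy_distances = [0.5, 0.25, 0.75]

for hit in distances_to_hits(toy_ids, toy_distances):
    print(hit)
```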