# Module 3 - Dense Retrieval with SentenceTransformers

In this notebook, we will learn how to retrieve documents for a given query using a dense retrieval model. We will use the [SentenceTransformers](https://www.sbert.net/) library, which has a simple API and offers multiple dense models already fine-tuned on retrieval datasets; these provide reasonable zero-shot performance on the BEIR benchmark.

In this example, we will use `BAAI/bge-base-en-v1.5` as the retrieval model. `bge` is short for BAAI general embedding.

# Installing required packages

For this example, we will install the following libraries:

**`sentence-transformers`**:

`sentence-transformers` is a library developed by UKPLab, which provides a wide range of pre-trained models for computing sentence embeddings. It also offers a simple interface for fine-tuning these models on custom datasets, which can be used to improve the performance of the models on specific tasks.

**`datasets`**:

`datasets` is a library developed by Hugging Face, which provides a wide range of datasets for NLP tasks. It also offers a simple interface for downloading and loading these datasets, which can be used to train and evaluate models on specific tasks.

**`hnswlib`**:

`hnswlib` is a library developed by Yury Malkov, which provides an efficient implementation of the Hierarchical Navigable Small World (HNSW) algorithm for approximate nearest neighbor search. It is a header-only C++ library with Python bindings that runs on the CPU, and it can perform large-scale similarity searches at high speed.

In [None]:
!pip install sentence-transformers
!pip install datasets  # For the FiQA dataset
!pip install hnswlib  # To perform Approximate Nearest Neighbor search

Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting sentencepiece (from sentence-transformers)
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Building wheels for collected packages: sentence-transformers
  Building wheel for sentence-transformers (setup.py) ... done
  Created wheel for sentence-transformers: filename=sentence_trans

# Setting the device

In this example, we will use a GPU to speed up the processing of our model. GPUs (Graphics Processing Units) are specialized processors that are optimized for performing large-scale computations in parallel. By using a GPU, we can accelerate the training and inference of a machine learning model, which can significantly reduce the time required to complete these tasks.

Before we begin, we need to check whether a GPU is available and select it as the default device for our PyTorch operations. This is because PyTorch can use either a CPU or a GPU to perform computations, and by default, it will use the CPU.

To use a GPU in Google Colab:
1. Click on the "Runtime" menu at the top of the screen.
2. From the dropdown menu, click on "Change runtime type".
3. In the popup window that appears, select "GPU" as the hardware accelerator.
4. Click on the "Save" button.

That's it! Now you can use the GPU for faster computations in your notebook.

In [None]:
!nvidia-smi

Mon Nov 27 19:07:24 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   64C    P8    11W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
# Check if GPU is available
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'device: {device}')

device: cuda


# Initializing the model

We will use the `BAAI/bge-base-en-v1.5` model for this example. For more information about this model, please refer to the [model card](https://huggingface.co/BAAI/bge-base-en-v1.5).

To load the model, we will use the `SentenceTransformer` class from the `sentence-transformers` library. This class provides a simple interface for loading and using sentence embedding models. It also offers a wide range of pre-trained models, which can be used for various NLP tasks.

In [None]:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('BAAI/bge-base-en-v1.5')

# Send model to GPU, if available
model = model.to(device)


# Encoding a few sentences

Let's encode a few sentences using the model. We will use the `encode` method of the `SentenceTransformer` class, which takes a list of strings as input and, by default, returns a NumPy array with one embedding per sentence.

In [None]:
# Sentences we would like to encode.
sentences = ['This framework generates embeddings for each input sentence',
    'Sentences are passed as a list of string.',
    'The quick brown fox jumps over the lazy dog.']

embeddings = model.encode(sentences)

# Print the embeddings
for sentence, embedding in zip(sentences, embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")


Sentence: This framework generates embeddings for each input sentence
Embedding: [ 9.53931920e-03 -4.07816917e-02  3.13987955e-02  3.53472568e-02
  5.94176836e-02  5.78968935e-02  1.58114545e-02 -5.86107932e-03
  1.54616702e-02  2.62350100e-03  9.07820743e-03  9.62177583e-04
 -5.71961924e-02 -8.88177101e-03 -1.36985872e-02  4.40546647e-02
 -3.89513816e-03  1.77075118e-02  2.30337493e-02  3.08191925e-02
 -1.17485188e-02  3.25982124e-02  5.03877327e-02  2.01352104e-03
  7.95669705e-02  3.68604832e-03  5.31924283e-03 -3.42049152e-02
 -4.85396497e-02  2.92565674e-02  8.49060743e-05 -1.32923191e-02
 -4.65372857e-03 -4.43957979e-03  1.35743534e-02 -4.08644341e-02
 -5.44175878e-02  1.78457573e-02  2.75585558e-02 -4.69652191e-02
 -2.27812678e-02 -7.81649444e-03  2.85488795e-02  1.90629289e-02
 -2.77508721e-02 -7.92344217e-04 -8.25617462e-02  1.64128374e-02
  3.62763624e-03 -6.19615763e-02 -8.42266008e-02  1.85632927e-03
  1.10487891e-02 -3.39604802e-02  5.08478750e-03  3.30551118e-02
  1.09231

# Encoding and searching over a tiny corpus

When the corpus is small (i.e., a few thousand documents), we can convert all the documents to embeddings and perform a **"brute-force"** search: compute the cosine similarity between the query embedding and every document embedding, then return the top k documents with the highest cosine similarity.
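
To make the brute-force idea concrete, here is a minimal pure-Python sketch. The function names and the 3-dimensional toy vectors are hypothetical and chosen only for illustration; the notebook itself relies on `util.cos_sim` and `torch.topk`, which do the same work on tensors.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def brute_force_search(query_vec, doc_vecs, k):
    """Score every document against the query and return the k best (index, score) pairs."""
    scored = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical toy "embeddings" (real ones have hundreds of dimensions)
toy_docs = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
toy_query = [1.0, 0.1, 0.0]

top_hits = brute_force_search(toy_query, toy_docs, k=2)
for doc_index, score in top_hits:
    print(f"doc {doc_index}: {score:.4f}")
```

Every query touches every document, so the cost grows linearly with the corpus size. That is acceptable for a few thousand documents but becomes a bottleneck for much larger collections.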

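Before searching, note that this model was fine-tuned to compute embeddings for different tasks, so we need to define a `query_instruction` that will be prepended to each query. The instruction is applied to queries only; the corpus documents are encoded as they are.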

In [None]:
query_instruction = "Represent this sentence for searching relevant passages: "
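
One way to keep the prefix handling in a single place is a small helper, sketched below. `format_query` and `format_inputs` are hypothetical names, not part of `sentence-transformers`; the sketch assumes, as the notebook does, that only queries receive the instruction and documents are left unchanged.

```python
# Same instruction string the notebook assigns to `query_instruction`
QUERY_INSTRUCTION = "Represent this sentence for searching relevant passages: "

def format_query(query, instruction=QUERY_INSTRUCTION):
    """Prepend the retrieval instruction to a single query (hypothetical helper)."""
    return instruction + query

def format_inputs(queries, passages):
    """Queries get the prefix; passages are returned untouched."""
    return [format_query(q) for q in queries], list(passages)

formatted_queries, formatted_passages = format_inputs(
    ["A man is eating pasta."],
    ["A man is eating food."],
)
print(formatted_queries[0])
print(formatted_passages[0])
```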

In [None]:
# Corpus with example sentences
corpus = ['A man is eating food.',
          'A man is eating a piece of bread.',
          'The girl is carrying a baby.',
          'A man is riding a horse.',
          'A woman is playing violin.',
          'Two men pushed carts through the woods.',
          'A man is riding a white horse on an enclosed ground.',
          'A monkey is playing drums.',
          'A cheetah is running behind its prey.'
          ]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# Query sentences:
queries = ['A man is eating pasta.', 'Someone in a gorilla costume is playing a set of drums.', 'A cheetah chases prey on across a field.']


# Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
top_k = min(5, len(corpus))
for query in queries:
    formated_query = query_instruction + query
    query_embedding = model.encode(formated_query, convert_to_tensor=True)

    # We use cosine-similarity and torch.topk to find the highest 5 scores
    cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
    top_results = torch.topk(cos_scores, k=top_k)

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 5 most similar sentences in corpus:")

    for score, idx in zip(top_results[0], top_results[1]):
        print(corpus[idx], "(Score: {:.4f})".format(score))





Query: A man is eating pasta.

Top 5 most similar sentences in corpus:
A man is eating food. (Score: 0.7022)
A man is eating a piece of bread. (Score: 0.6018)
A man is riding a horse. (Score: 0.4294)
A man is riding a white horse on an enclosed ground. (Score: 0.3766)
A woman is playing violin. (Score: 0.2995)




Query: Someone in a gorilla costume is playing a set of drums.

Top 5 most similar sentences in corpus:
A monkey is playing drums. (Score: 0.6605)
A woman is playing violin. (Score: 0.4115)
A cheetah is running behind its prey. (Score: 0.3617)
A man is eating food. (Score: 0.3537)
A man is eating a piece of bread. (Score: 0.3518)




Query: A cheetah chases prey on across a field.

Top 5 most similar sentences in corpus:
A cheetah is running behind its prey. (Score: 0.7275)
A monkey is playing drums. (Score: 0.3575)
A man is riding a horse. (Score: 0.3455)
A man is riding a white horse on an enclosed ground. (Score: 0.3253)
A man is eating food. (Score: 0.3219)


# Indexing and searching over a large corpus

When the corpus is large (e.g., with millions of documents), brute-force search will be too slow. Thus, we need an Approximate Nearest Neighbor (ANN) algorithm to efficiently find the most similar document embeddings to a query embedding. Results, however, are approximate, i.e., the most similar document might not be retrieved.
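
A common way to quantify how "approximate" the results are is to compare them with the exact brute-force neighbors and measure the overlap (recall@k). The sketch below uses hypothetical document ids; in practice the two lists would come from an exact search and an ANN search over the same embeddings.

```python
def recall_at_k(exact_ids, approx_ids):
    """Fraction of the exact top-k neighbors that the approximate search also returned."""
    if not exact_ids:
        return 0.0
    return len(set(exact_ids) & set(approx_ids)) / len(exact_ids)

# Hypothetical ids, for illustration only
exact_top5 = [12, 7, 33, 2, 91]     # from brute-force search
approx_top5 = [12, 33, 7, 91, 40]   # from an ANN index

recall = recall_at_k(exact_top5, approx_top5)
print(f"recall@5 = {recall:.2f}")
```

A recall below 1.0 means the index missed some true neighbors; index parameters can usually be tuned to trade speed for higher recall.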

In this notebook, we will use the HNSW algorithm implemented in the `hnswlib` library as our ANN algorithm.

To keep things fast, we will work with only the first 1,000 documents of the FiQA corpus.


In [None]:
from datasets import load_dataset

dataset = load_dataset("BeIR/fiqa", 'corpus')

dataset.values()

dict_values([Dataset({
    features: ['_id', 'title', 'text'],
    num_rows: 57638
})])

## Converting the corpus to embeddings


We will convert the corpus to embeddings using the `encode` method of the `SentenceTransformer` class. With `convert_to_numpy=True`, it takes a list of strings and returns a NumPy array with one embedding per document.
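
Internally, `encode` processes the input in batches (its `batch_size` argument defaults to 32), which is why the progress bar counts batches rather than documents. The sketch below only illustrates the batching arithmetic with hypothetical helper names and placeholder texts; it does not call the model.

```python
import math

def num_batches(num_texts, batch_size=32):
    """Number of batches needed to cover all texts."""
    return math.ceil(num_texts / batch_size)

def iter_batches(items, batch_size=32):
    """Yield consecutive slices of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Placeholder texts standing in for the 1,000 corpus documents
texts = [f"doc {i}" for i in range(1000)]
batches = list(iter_batches(texts))
print(len(batches), len(batches[-1]))
```

With 1,000 documents and a batch size of 32, this gives 32 batches, the last one holding the remaining 8 documents.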

In [None]:
corpus_texts = [item['text'] for item in dataset['corpus']]
corpus_texts = corpus_texts[:1000]

corpus_embeddings = model.encode(corpus_texts, show_progress_bar=True, convert_to_numpy=True)

Batches:   0%|          | 0/32 [00:00<?, ?it/s]

## Indexing

We will use the `Index` class from the `hnswlib` library to create an index for our corpus. This class provides a simple interface for building and querying an index. Its main construction parameters are `M` (the number of links created per element) and `ef_construction` (how thoroughly neighbors are searched while building); larger values generally improve recall at the cost of memory and build time.
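
To get a feel for how `M` affects memory, here is a rough back-of-the-envelope estimator. It is only a sketch under stated assumptions: float32 vectors (4 bytes per dimension) plus roughly 8 to 10 bytes per link per element, which is the rule of thumb given in the hnswlib parameter notes. The function name is hypothetical, and the 768-dimensional embedding size is assumed for `bge-base-en-v1.5`; actual usage will differ.

```python
def estimate_hnsw_memory_bytes(num_elements, dim, M, bytes_per_link=10, bytes_per_float=4):
    """Rough estimate: raw float32 vectors plus ~M * bytes_per_link of graph links per element."""
    vector_bytes = num_elements * dim * bytes_per_float
    link_bytes = num_elements * M * bytes_per_link
    return vector_bytes + link_bytes

# Assumed setting: 1,000 documents, 768-dimensional embeddings, M=64
estimate = estimate_hnsw_memory_bytes(num_elements=1000, dim=768, M=64)
print(f"~{estimate / 1e6:.2f} MB")
```

At this scale the vectors dominate the footprint; the link overhead becomes more noticeable with low-dimensional embeddings or very large `M`.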

In [None]:
import hnswlib

index = hnswlib.Index(space='cosine', dim=corpus_embeddings.shape[-1])

index.init_index(max_elements=len(corpus_embeddings), ef_construction=400, M=64)

# Add the embeddings to the index (this builds the HNSW graph; no separate training step is needed)
index.add_items(corpus_embeddings, list(range(len(corpus_embeddings))))

# We can optionally save the index to disk so we can reuse it later.
index_path = "./hnswlib.index"
print("Saving index to:", index_path)
index.save_index(index_path)

Saving index to: ./hnswlib.index


## Searching

Now that we have indexed our corpus, we can use the `knn_query` method of the `Index` class to find the top k most similar document embeddings to a query embedding. This method takes one or more query embeddings and the number of nearest neighbors `k`, and returns two arrays, the document ids and their distances, with one row per query. Because the index uses the `cosine` space, each distance is 1 minus the cosine similarity.
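
The post-processing of the search output can be sketched independently of the index. The example below uses plain nested lists with made-up ids and distances in place of the arrays returned by `knn_query`; `hits_from_knn` is a hypothetical helper that turns cosine distances back into similarity scores and sorts the hits.

```python
def hits_from_knn(labels_row, distances_row):
    """Convert one row of (ids, cosine distances) into hits sorted by similarity, best first."""
    hits = [
        {"corpus_id": int(label), "score": 1.0 - float(dist)}
        for label, dist in zip(labels_row, distances_row)
    ]
    hits.sort(key=lambda hit: hit["score"], reverse=True)
    return hits

# Made-up output for a single query: one row of ids and one row of distances
labels = [[42, 7, 19]]
distances = [[0.48, 0.25, 0.60]]

hits = hits_from_knn(labels[0], distances[0])
for hit in hits:
    print(f"{hit['score']:.3f}\t{hit['corpus_id']}")
```

A smaller cosine distance means a more similar document, so the hit with distance 0.25 is ranked first with a score of 0.75.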

In [None]:
query = "What is considered a business expense on a business trip?"

formated_query = query_instruction + query

top_k_hits = 5

query_embedding = model.encode(formated_query)

# We use hnswlib knn_query method to find the top_k_hits
corpus_ids, distances = index.knn_query(query_embedding, k=top_k_hits)

# hnswlib returns cosine distances (1 - cosine similarity), so we convert them back to similarity scores
hits = [{'corpus_id': corpus_id, 'score': 1 - distance} for corpus_id, distance in zip(corpus_ids[0], distances[0])]
hits = sorted(hits, key=lambda x: x['score'], reverse=True)

print("Input query:", query)
for hit in hits[0:top_k_hits]:
    print("\t{:.3f}\t{}".format(hit['score'], corpus_texts[hit['corpus_id']]))

Input query: What is considered a business expense on a business trip?
	0.518	"As long as the losing business is not considered ""passive activity"" or ""hobby"", then yes. Passive Activity is an activity where you do not have to actively do anything to generate income. For example - royalties or rentals. Hobby is an activity that doesn't generate profit. Generally, if your business doesn't consistently generate profit (the IRS looks at 3 out of the last 5 years), it may be characterized as hobby. For hobby, loss deduction is limited by the hobby income and the 2% AGI threshold."
	0.508	Based on the definitions I found on Investopedia, it depends on whether or not it is going against an asset or a liability.  I am not sure what type of accounting you are performing, but I know in my personal day-to-day dealings credits are money coming into my account and debits are money going out of my account. Definition: Credit, Definition: Debit
	0.501	There is no universal answer here; it depends