# To [semantic] search, or not to [semantic] search, that is the ~question~ query

## TLDR summary: you should (probably) fine-tune your document embeddings.

The table below shows the results for 1000 queries made to a dataset of 15K text passages, where for each query there is only one passage containing the corresponding answer. Here are some quick takeaways (discussed in more detail later in the notebook):

1. Unsurprisingly, your semantic search is only as good as your embeddings.
2. What makes a *good* embedding? Roughly speaking, there are two components: the size of the embedding model (mainly its number of parameters, assuming the training set was sufficiently large), and whether it has been trained or fine-tuned for what you are trying to do, in terms of both data and task.
3. **Having fine-tuned embeddings brings a larger improvement than what you get from simply using a larger embedding model.**
4. **Fine-tuned embeddings outperform lexical search**, e.g. BM25, a well-known TF-IDF (Term Frequency - Inverse Document Frequency) ranking algorithm, by a large margin.
5. **Lexical search can outperform smaller non-fine-tuned embeddings, and is comparable to larger ones - as long as they are not fine-tuned.**
6. Although only one of the embedding models (*gtr-t5-large*) was specifically fine-tuned on the exact task I am considering (semantic search on the MS MARCO dataset), the data that I used for these experiments more or less falls within the same domain as the data that the other models were trained on. It seems reasonable to propose that **the difference in the performance of the fine-tuned vs. non-fine-tuned models would be amplified for out-of-domain data** (e.g. specialized legal or medical documents).


| Search type and model | Number of parameters | Fine-tuned | Recall at k=1 | Recall at k=5 |
| --- | --- | --- | --- | --- |
| Semantic **paraphrase-MiniLM-L6-v2** | 22.7M | no | 76% | 92% |
| Hybrid **paraphrase-MiniLM-L6-v2** | 22.7M | no | 77% | 93% |
| Semantic **sentence-t5-large** | 335M | no | 83% | 96% |
| Hybrid **sentence-t5-large** | 335M | no | 86% | 96% |
| Semantic **gtr-t5-large** | 335M | yes | 93% | 99% |
| Hybrid **gtr-t5-large** | 335M | yes | 90% | 99% |
| BM25 | n/a | n/a | 78% | 90% |


## Introduction

### The Data

MS MARCO is ["a collection of datasets focused on deep learning in search"](https://microsoft.github.io/msmarco/), released by Microsoft. One of the datasets, aimed at the question answering task, contains, among other fields, queries (e.g. "*what is a corporation?*") and passages containing the corresponding answer (e.g. "*McDonald's Corporation is one of the most recognizable corporations in the world. A corporation is a company or group of people authorized to act as a single entity (legally a person) and recognized as such in law. Early incorporated entities were established by charter (i.e. by an ad hoc act granted by a monarch or passed by a parliament or legislature)*"). The code that I wrote to extract the query-passage pairs from the original MS MARCO dataset has been made available [here](https://github.com/opetrova/rag-experiments/blob/main/MS_MARCO_dataset_prep_RAG.ipynb).

Note that I sourced the data from the MS MARCO *dev* set rather than its *train* set, because one of the embedding models I'll be using has been fine-tuned on MS MARCO. Using the train set would have meant testing that model on its own training data!

### The Metrics

In this experiment I try to retrieve the passage that answers a given query, using only the text of the query. The two performance metrics I'll be keeping track of are:

* how often is the correct passage the top returned result? (recall at k=1)

* how often is the correct passage among the top 5 returned results? (recall at k=5)

In the context of RAG ([Retrieval Augmented Generation](https://en.wikipedia.org/wiki/Prompt_engineering#Retrieval-augmented_generation)), such recall-based metrics make more sense than the ranking ones commonly used in the broader Information Retrieval field.
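To make the two metrics concrete, here is a minimal sketch of recall at k for the setting used here, where each query has exactly one correct passage. The function name `recall_at_k` and the toy rankings are hypothetical and only serve as an illustration; the actual evaluation loops appear in the code section further down.

```python
from typing import Sequence


def recall_at_k(ranked_ids: Sequence[Sequence[int]], correct_ids: Sequence[int], k: int) -> float:
    """Fraction of queries whose single correct passage is among the top-k results."""
    if len(ranked_ids) != len(correct_ids):
        raise ValueError("ranked_ids and correct_ids must have the same length")
    if not correct_ids:
        raise ValueError("need at least one query")
    hits = sum(1 for ranked, gold in zip(ranked_ids, correct_ids) if gold in ranked[:k])
    return hits / len(correct_ids)


# Toy example (made-up data): three queries, each with its top-3 retrieved passage IDs.
toy_rankings = [[0, 7, 4], [5, 1, 9], [8, 3, 6]]
toy_gold = [0, 1, 2]  # the correct passage ID for each query

print(recall_at_k(toy_rankings, toy_gold, k=1))  # only the first query has its passage at rank 1
print(recall_at_k(toy_rankings, toy_gold, k=3))  # the first two queries have it in the top 3
```

With one correct passage per query, recall at k=1 is simply the share of queries whose top result is the right one.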

### The Models

The three embedding models I'll be comparing are all [Sentence Transformers, freely available through Hugging Face](https://huggingface.co/sentence-transformers):

* [**paraphrase-MiniLM-L6-v2**](https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2): a relatively lightweight 22.7M parameter model resulting in 384 dimensional embeddings. (Paper: [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084))

* [**gtr-t5-large**](https://huggingface.co/sentence-transformers/gtr-t5-large): 335M parameters, 768 dimensional embeddings, fine-tuned on MS MARCO for semantic search. (Paper: [Large Dual Encoders Are Generalizable Retrievers](https://arxiv.org/abs/2112.07899))

* [**sentence-t5-large**](https://huggingface.co/sentence-transformers/sentence-t5-large): also 335M parameters and 768 dimensional embeddings, but has **not** been fine-tuned for semantic search, or exposed to MS MARCO during training. (Paper: [Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models](https://arxiv.org/abs/2108.08877))

## Results

When it comes to query-passage retrieval, fine-tuned embeddings outperform both non fine-tuned embeddings and lexical search. In principle, there are two aspects to this fine-tuning: the data and the task itself (e.g. sentence similarity vs. semantic search). In my experiments it has been difficult to separate the two because when it comes to fine-tuning embedding models for semantic search, the MS MARCO dataset is commonly used. It would make sense for the ideal scenario to be fine-tuning the embedding model both on the domain one plans to use it in, **and** the task at hand (e.g. semantic search based on a query).

One experiment that is currently lacking from my analysis is fine-tuning a smaller model, e.g. *paraphrase-MiniLM-L6-v2* on the MS MARCO semantic search task, and comparing its performance to the other methods. My guess is that it will outperform *sentence-t5-large*, but probably not *gtr-t5-large*. To be explored in a future notebook!

A note on the hybrid search (the combination of lexical and semantic searches): I've seen a suggestion floating around, that adding lexical search into the mix could effectively serve as a replacement for fine-tuning the embeddings. Intuitively I see why this could be the case for out-of-domain data, for instance, but in my experiments hybrid search did not do much. Another potential avenue to explore.

## Code to reproduce the experiments

### 0. Setup

First I am going to take my [preprocessed](https://github.com/opetrova/rag-experiments/blob/main/MS_MARCO_dataset_prep_RAG.ipynb) query-passage dataset, check for any invalid entries, embed the queries and the passages using an open source [SentenceTransformers](https://huggingface.co/sentence-transformers) model, and upload both dense embedding and sparse BM25 vectors into a Pinecone vector store.

In [1]:
!pip install -U --quiet sentence-transformers pinecone-client pinecone-text

In [2]:
import json

import numpy as np
import pandas as pd
import torch
from google.colab import drive
from pinecone import Pinecone, PodSpec
from pinecone_text.sparse import BM25Encoder
from sentence_transformers import SentenceTransformer
from tqdm.auto import tqdm

In [3]:
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [4]:
df = pd.read_csv('/content/gdrive/MyDrive/test_RAG/MS_MARCO_retrieval.csv')
df.head()

|   | Unnamed: 0 | query | document |
| --- | --- | --- | --- |
| 0 | 0 | . what is a corporation? | McDonald's Corporation is one of the most reco... |
| 1 | 1 | why did rachel carson write an obligation to e... | The Obligation to Endure by Rachel Carson Rach... |
| 2 | 2 | symptoms of a dying mouse | The symptoms are similar but the mouse will be... |
| 3 | 3 | average number of lightning strikes per day | Although many lightning flashes are simply clo... |
| 4 | 4 | can you burn your lawn with fertilizer | Fertilizer burn is the result of over fertiliz... |


In [None]:
df.isnull().any()

Unnamed: 0    False
query         False
document       True
dtype: bool

In [5]:
df2=df.dropna()
df2.isnull().any()

Unnamed: 0    False
query         False
document      False
dtype: bool

In [6]:
# Number of query-passage pairs in the dev set (15K of these were used for the experiments to save time)
num_samples = len(df2)
print(num_samples)

55578


In [6]:
queries = df2['query'].tolist()
documents = df2['document'].tolist()

bm25 = BM25Encoder.default()
# (Pinecone's default BM25Encoder was fitted to MS MARCO)

device = 'cuda' if torch.cuda.is_available() else 'cpu'

encoder = SentenceTransformer('gtr-t5-large', device=device)
encoder


SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: T5EncoderModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Dense({'in_features': 1024, 'out_features': 768, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
  (3): Normalize()
)

In [7]:
pc = Pinecone(api_key="...")

index_name = "ms-marco-transfomers"

# only create index if it doesn't exist
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=encoder.get_sentence_embedding_dimension(),
        metric='dotproduct',  # Make sure to set the metric to dotproduct if you intend to run hybrid search
        spec=PodSpec(environment="gcp-starter")
    )

# now connect to the index
vector_store = pc.Index(index_name)

In [None]:
# in case you decide to start over:
pc.delete_index(index_name)

In [None]:
batch_size = 128

num_samples = 15000

for i in tqdm(range(0, num_samples, batch_size)):
    # find end of batch
    i_end = min(i+batch_size, num_samples)
    # create IDs batch
    ids = [str(x) for x in range(i, i_end)]

    # create embeddings
    dense_vectors = encoder.encode(documents[i:i_end])
    sparse_vectors = bm25.encode_documents(documents[i:i_end])

    # pair each ID with its sparse (BM25) and dense (embedding) vector
    vectors = []
    for _id, sparse, dense in zip(ids, sparse_vectors, dense_vectors):
        vectors.append({
            'id': _id,
            'sparse_values': sparse,
            'values': dense
        })

    vector_store.upsert(vectors)

# check number of records in the index
vector_store.describe_index_stats()
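The upsert loop above relies on one convention that the evaluation cells depend on: the ID of each record is the row position of its passage, stored as a string, so a query at position `ind` is answered correctly when the returned ID equals `ind`. Below is a small sketch of that batching and ID scheme on its own, with no Pinecone or model calls. The helper `iter_batches` and the toy passages are hypothetical names used only for illustration.

```python
import math


def iter_batches(items, batch_size):
    """Yield (start_index, batch) pairs that cover `items` in order."""
    for start in range(0, len(items), batch_size):
        yield start, items[start:start + batch_size]


# Toy stand-in for the passages; the notebook uses the first 15000 documents.
toy_documents = [f"passage {n}" for n in range(300)]

for start, batch in iter_batches(toy_documents, batch_size=128):
    # IDs are the global row positions as strings, matching the upsert loop above
    ids = [str(start + offset) for offset in range(len(batch))]
    print(f"first id={ids[0]}, last id={ids[-1]}, batch size={len(batch)}")

# Number of upsert calls needed for 15000 passages in batches of 128
print(math.ceil(15000 / 128))
```

The last batch is simply shorter than the others, which is what the `min(i+batch_size, num_samples)` line in the notebook takes care of.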

### 1. Semantic search


In [9]:
batch_size = 128

# pre-embed the first 1000 queries from the dataset above:
num_samples = 1000

dense_queries = np.zeros((num_samples, encoder.get_sentence_embedding_dimension()))

for i in tqdm(range(0, num_samples, batch_size)):
    # find end of batch
    i_end = min(i+batch_size, num_samples)

    # create embeddings
    dense_queries[i:i_end] = encoder.encode(queries[i:i_end])


In [16]:
num_correct_1 = 0
num_correct_5 = 0

for ind in range(num_samples):
  result = vector_store.query(top_k=5, vector=dense_queries[ind].tolist())
  if int(result['matches'][0]['id']) == ind:
    num_correct_1 += 1
  if ind in [int(result['matches'][i]['id']) for i in range(5)]:
    num_correct_5 += 1

print(f"Percentage of top results being correct: {(num_correct_1/num_samples)*100}%")
print(f"Recall at k=5: {(num_correct_5/num_samples)*100}%")

Percentage of top results being correct: 93.5%
Recall at k=5: 99.5%
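A side note on the metric: the encoder printed in the setup ends with a `Normalize()` module, so its embeddings have unit length, and ranking by `dotproduct` is then the same as ranking by cosine similarity. The sketch below illustrates this with a brute-force search over a few made-up vectors in plain Python. The helper names and the toy vectors are hypothetical, and this is not how Pinecone searches internally.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_k(query, docs, k):
    """Indices of the k documents with the highest dot product, best first."""
    scores = [dot(query, d) for d in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:k]


# Made-up 3-dimensional "embeddings" standing in for the real 768-dimensional ones
raw_docs = [[3.0, 4.0, 0.0], [0.0, 5.0, 0.0], [1.0, 0.0, 1.0], [0.0, 0.0, 2.0]]
raw_query = [2.0, 3.0, 0.0]

docs = [normalize(d) for d in raw_docs]
query = normalize(raw_query)

# Cosine similarity computed from the raw vectors ...
cosine = dot(raw_query, raw_docs[0]) / (
    math.sqrt(dot(raw_query, raw_query)) * math.sqrt(dot(raw_docs[0], raw_docs[0]))
)
# ... agrees with the plain dot product of the normalized vectors
print(math.isclose(dot(query, docs[0]), cosine))

print(top_k(query, docs, k=2))
```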


### 2. Lexical search

As of right now, to run pure [lexical search in Pinecone](https://www.pinecone.io/learn/hybrid-search-intro/), you define a function for hybrid search and set its scaling factor `alpha` to 0, which zeroes out the dense part of the query so that only the sparse BM25 vector contributes to the score.
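For readers who have not met BM25 before, here is a rough sketch of the textbook scoring function that the sparse vectors stand in for. It is an illustration under my own assumptions: whitespace tokenization, the commonly used parameters `k1=1.2` and `b=0.75`, and a non-negative variant of the inverse document frequency. Pinecone's `BM25Encoder` uses its own tokenization and splits the formula between document and query vectors, so its numbers will differ in detail.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document against the query with a textbook BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # longer-than-average documents are penalized through this term
        length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


# Toy corpus loosely based on the passages shown earlier (shortened, made up)
toy_docs = [
    "a corporation is a company authorized to act as a single entity",
    "fertilizer burn is the result of over fertilizing a lawn",
    "lightning strikes the earth many times per day",
]
for doc, score in zip(toy_docs, bm25_scores("what is a corporation", toy_docs)):
    print(f"{score:.3f}  {doc}")
```

Rare terms that match (here "corporation") carry most of the weight, while a passage that shares no terms with the query scores zero. That second property is the main weakness of lexical search compared to embeddings.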

In [18]:
def hybrid_scale(dense, sparse, alpha: float):
    # check alpha value is in range
    if alpha < 0 or alpha > 1:
        raise ValueError("Alpha must be between 0 and 1")
    # scale sparse and dense vectors to create hybrid search vecs
    hsparse = {
        'indices': sparse['indices'],
        'values':  [v * (1 - alpha) for v in sparse['values']]
    }
    hdense = [v * alpha for v in dense]
    return hdense, hsparse

num_correct_1 = 0
num_correct_5 = 0

for ind in range(num_samples):

  sparse = bm25.encode_queries(queries[ind])
  dense_vec, sparse_vec = hybrid_scale(dense_queries[ind], sparse, alpha=0)

  result = vector_store.query(top_k=5, vector=dense_vec, sparse_vector=sparse_vec)

  if int(result['matches'][0]['id']) == ind:
    num_correct_1 += 1
  if ind in [int(result['matches'][i]['id']) for i in range(5)]:
    num_correct_5 += 1

print(f"Percentage of top results being correct: {(num_correct_1/num_samples)*100}%")
print(f"Recall at k=5: {(num_correct_5/num_samples)*100}%")

Percentage of top results being correct: 77.7%
Recall at k=5: 90.4%
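To see why `alpha=0` gives pure lexical search, it helps to write out the score that the scaled query produces. The sketch below assumes that a sparse-dense query is scored as the dense dot product plus the sparse dot product over shared indices. `hybrid_scale` repeats the weighting used above, while `combined_score` and the toy vectors are hypothetical and only there for illustration.

```python
def hybrid_scale(dense, sparse, alpha):
    """Same weighting as in the cell above: dense * alpha, sparse * (1 - alpha)."""
    if not 0 <= alpha <= 1:
        raise ValueError("Alpha must be between 0 and 1")
    hsparse = {
        "indices": sparse["indices"],
        "values": [v * (1 - alpha) for v in sparse["values"]],
    }
    hdense = [v * alpha for v in dense]
    return hdense, hsparse


def combined_score(q_dense, q_sparse, d_dense, d_sparse):
    # Assumed scoring rule: dense dot product + sparse dot product over shared indices
    dense_part = sum(q * d for q, d in zip(q_dense, d_dense))
    d_lookup = dict(zip(d_sparse["indices"], d_sparse["values"]))
    sparse_part = sum(
        v * d_lookup.get(i, 0.0) for i, v in zip(q_sparse["indices"], q_sparse["values"])
    )
    return dense_part + sparse_part


# Made-up query and document: the dense dot product is 0.6, the sparse one is 1.0
q_dense, d_dense = [0.6, 0.8], [1.0, 0.0]
q_sparse = {"indices": [3, 7], "values": [1.0, 2.0]}
d_sparse = {"indices": [7, 9], "values": [0.5, 0.25]}

for alpha in (0.0, 0.5, 1.0):
    hdense, hsparse = hybrid_scale(q_dense, q_sparse, alpha)
    print(alpha, combined_score(hdense, hsparse, d_dense, d_sparse))
```

Under that assumption the score is `alpha * dense_score + (1 - alpha) * sparse_score`, so `alpha=0` leaves only the BM25 part and `alpha=1` leaves only the embedding part.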


### 3. Hybrid search

In [19]:
num_correct_1 = 0
num_correct_5 = 0

for ind in range(num_samples):

  sparse = bm25.encode_queries(queries[ind])

  result = vector_store.query(top_k=5, vector=dense_queries[ind].tolist(), sparse_vector=sparse)

  if int(result['matches'][0]['id']) == ind:
    num_correct_1 += 1
  if ind in [int(result['matches'][i]['id']) for i in range(5)]:
    num_correct_5 += 1

print(f"Percentage of top results being correct: {(num_correct_1/num_samples)*100}%")
print(f"Recall at k=5: {(num_correct_5/num_samples)*100}%")

Percentage of top results being correct: 90.2%
Recall at k=5: 99.2%
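One thing worth keeping in mind about the hybrid numbers: the query above passes the dense and sparse vectors unscaled. If the score is the sum of the dense and sparse dot products (my assumption about how sparse-dense queries are scored), this produces the same ranking as `alpha=0.5`, because halving both parts halves every score without changing their order. So the hybrid results reported here correspond to a single, equal weighting, and other values of `alpha` remain unexplored. The sketch below checks the ranking claim on made-up vectors; all names in it are hypothetical.

```python
def sparse_dot(a, b):
    """Dot product of two sparse vectors given as {'indices': [...], 'values': [...]}."""
    b_lookup = dict(zip(b["indices"], b["values"]))
    return sum(v * b_lookup.get(i, 0.0) for i, v in zip(a["indices"], a["values"]))


def rank(q_dense, q_sparse, docs):
    """Document positions sorted by dense + sparse dot product, best first."""
    scores = [
        sum(q * d for q, d in zip(q_dense, doc["values"]))
        + sparse_dot(q_sparse, doc["sparse_values"])
        for doc in docs
    ]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


# Made-up records in the same shape as the upserted vectors
toy_docs = [
    {"values": [1.0, 0.0], "sparse_values": {"indices": [1], "values": [0.2]}},
    {"values": [0.0, 1.0], "sparse_values": {"indices": [2], "values": [0.9]}},
    {"values": [0.6, 0.8], "sparse_values": {"indices": [1, 2], "values": [0.1, 0.1]}},
]
q_dense = [0.8, 0.6]
q_sparse = {"indices": [2], "values": [1.0]}

unscaled = rank(q_dense, q_sparse, toy_docs)

# Equivalent of alpha=0.5: both parts of the query multiplied by 0.5
half_dense = [0.5 * v for v in q_dense]
half_sparse = {"indices": q_sparse["indices"], "values": [0.5 * v for v in q_sparse["values"]]}
halved = rank(half_dense, half_sparse, toy_docs)

print(unscaled == halved)
```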
