# Improved Search Engine with Transformers and Rerankers

## Retrieval and Re-ranking

In the basic semantic search engine, we saw how to use SentenceTransformers to compute embeddings for queries, sentences, and paragraphs, and how to use these embeddings for semantic search.

For complex search tasks, such as question-answering retrieval, the search results can be significantly improved by using a Retrieve & Re-Rank strategy.


## Retrieve & Re-Rank Pipeline

A pipeline for information retrieval / question answering retrieval that works well is the following:

![](https://i.imgur.com/yIXJRSo.png)


Given a search query, we first use a simple retrieval strategy that retrieves a list of e.g. 100 possible hits which are potentially relevant for the query. Hits simply refer to the most similar documents retrieved.

For the retrieval, we can use either lexical search, e.g. with ElasticSearch, or dense retrieval with a bi-encoder. Simple lexical search can be based on TF-IDF, BM25, etc.
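
To make the lexical option concrete, here is a minimal sketch of BM25 scoring in plain Python. It is not how ElasticSearch implements it; the whitespace tokenizer, the parameter defaults `k1=1.5` and `b=0.75`, the helper names, and the toy documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in `docs` against `query` with a basic BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue  # lexical search: no literal match, no contribution
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = term_freq[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def rank_by_bm25(query, docs):
    """Return document indices ordered from best to worst BM25 score."""
    scores = bm25_scores(query, docs)
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


# Made-up toy collection, already lower-cased and without punctuation
toy_docs = [
    "new delhi is the capital of india",
    "mumbai is the largest city in india",
    "the olympic games are held every four years",
]
print(rank_by_bm25("capital of india", toy_docs))
```

A document that shares no literal word with the query gets a score of zero, which is exactly the limitation that dense retrieval tries to address.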

However, the retrieval system might retrieve documents that are not that relevant for the search query. Hence, in a second stage, we use a re-ranker model which is based on a cross-encoder transformer model that scores the relevancy (not similarity) of all candidates for the given search query.

The output will be a ranked list of hits we can present to the user.


## Retrieval: Bi-Encoder

For the retrieval of the candidate set, we use a bi-encoder for semantic search.
We can also try a hybrid approach which is a combination of the traditional lexical search with semantic search. Lexical search looks for literal matches of the query words in the document collection. It will not recognize synonyms, acronyms or spelling variations and relies solely on exact keyword search.

In contrast, semantic search (or dense retrieval) encodes the search query into a vector space and retrieves the document embeddings that are close to it in that space. It therefore retrieves the most contextually relevant or similar documents, even when they do not share exact keywords with the query.
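
The idea of "close in vector space" can be sketched without any model. The example below ranks documents by cosine similarity to a query vector. The 3-dimensional vectors are made up for illustration (real bi-encoder embeddings in this notebook have 384 dimensions), and the helper names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def dense_top_k(query_vec, doc_vecs, k):
    """Return the k (doc_id, score) pairs with the highest cosine similarity."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in enumerate(doc_vecs)]
    scored.sort(reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:k]]


# Hypothetical toy "embeddings"; a real system would get these from the bi-encoder
toy_doc_vecs = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.2],
    [0.7, 0.3, 0.1],
]
toy_query_vec = [1.0, 0.2, 0.0]

top_hits = dense_top_k(toy_query_vec, toy_doc_vecs, k=2)
for doc_id, score in top_hits:
    print(doc_id, round(score, 3))
```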


## Re-Ranker: Cross-Encoder

The retriever has to be efficient for large document collections with millions of entries. However, it might return irrelevant candidates.

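A re-ranker based on a Cross-Encoder can substantially improve the final results for the user. The query and a candidate document are passed simultaneously to the transformer network, which then outputs a single score indicating how relevant the document is for the given query. Depending on the model and its output activation, this score is either squashed into the range 0 to 1 or left as an unbounded raw score; the cross-encoder used later in this notebook returns raw scores such as 3.86, where higher means more relevant.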

![](https://i.imgur.com/PFgkrcI.png)

The advantage of Cross-Encoders is their higher ranking accuracy, as they perform attention across the query and the document jointly.

Scoring thousands or millions of (query, document)-pairs would be rather slow. Hence, we use the retriever to create a set of e.g. 100 possible candidates which are then re-ranked by the Cross-Encoder.

First, we use an efficient Bi-Encoder to retrieve e.g. the top-100 most similar sentences for a query. Then, we use a Cross-Encoder to re-rank these 100 hits by computing the score for every (query, hit) pair.
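
The two stages can be sketched as a generic function that takes any cheap scorer and any expensive scorer. Everything here is a stand-in: `word_overlap` plays the role of the retriever, `toy_relevance` plays the role of the Cross-Encoder, and the counting wrapper exists only to show that the expensive scorer runs on `top_k` candidates rather than on the whole collection.

```python
def retrieve_and_rerank(query, docs, cheap_score, expensive_score, top_k=100):
    """Stage 1: keep the top_k docs by cheap_score. Stage 2: re-sort them by expensive_score."""
    candidates = sorted(docs, key=lambda d: cheap_score(query, d), reverse=True)[:top_k]
    return sorted(candidates, key=lambda d: expensive_score(query, d), reverse=True)


def word_overlap(query, doc):
    """Hypothetical cheap retriever: number of shared lower-cased words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def toy_relevance(query, doc):
    """Hypothetical expensive scorer: rewards passages that directly state the answer."""
    return 1.0 if "is the capital of india" in doc.lower() else 0.0


def make_counting_scorer(score_fn):
    """Wrap a scorer and record every document it is asked to score."""
    calls = []

    def wrapped(query, doc):
        calls.append(doc)
        return score_fn(query, doc)

    return wrapped, calls


toy_docs = [
    "Mumbai is the financial capital of India",
    "New Delhi is the capital of India",
    "The Olympic Games are held every four years",
    "Asia is the largest continent on Earth",
]
counted_relevance, expensive_calls = make_counting_scorer(toy_relevance)
results = retrieve_and_rerank(
    "What is the capital of India", toy_docs, word_overlap, counted_relevance, top_k=2
)
print(results[0])
print("expensive scorer calls:", len(expensive_calls))
```

With four documents and `top_k=2`, the expensive scorer is called only twice, which is the whole point of retrieving first and re-ranking second.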





## Retrieve & Re-Rank Search Engine over Wikipedia

This example demonstrates the Retrieve & Re-Rank strategy for searching over Wikipedia.

You can input a query or a question. The script then uses semantic search to find relevant passages.

### Install Dependencies

In [2]:
!pip install -U sentence-transformers


### Load Transformer Models, Wikipedia Data and Generate Embeddings

For initial retrieval, we use `SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')` and retrieve 20 potentially relevant passages that answer the input query.

Next, we use a more powerful cross-encoder, `CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')`, that scores the query against each retrieved passage for relevancy. The cross-encoder further improves the quality of the ranking.

MS MARCO is a large-scale information retrieval corpus that was created by Microsoft based on real user search queries from the Bing search engine.

The provided models can be used for semantic search, i.e., given keywords / a search phrase / a question, the model will find passages that are relevant for the search query.

## Load Wikipedia Dataset

In [51]:
import os
import json
import gzip
import torch
from sentence_transformers import SentenceTransformer, CrossEncoder, util

# As dataset, we use Simple English Wikipedia. Compared to the full English Wikipedia, it has only
# about 170k articles. We split these articles into paragraphs and encode them with the bi-encoder.

wikipedia_filepath = 'simplewiki.jsonl.gz'

if not os.path.exists(wikipedia_filepath):
    util.http_get('http://sbert.net/datasets/simplewiki-2020-11-01.jsonl.gz', wikipedia_filepath)

passages = []
with gzip.open(wikipedia_filepath, 'rt', encoding='utf8') as fIn:
    for line in fIn:
        data = json.loads(line.strip())

        # Add all paragraphs
        # passages.extend(data['paragraphs'])

        # Only add the first paragraph
        passages.append(data['paragraphs'][0])

print("Passages:", len(passages))

Passages: 169597
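
The loading cell switches between "first paragraph only" and "all paragraphs" by commenting lines in and out. A small helper makes that choice a parameter. This is a sketch under assumptions: `load_passages` and its `first_only` flag are hypothetical names, and the demo runs on a tiny made-up file rather than on the real dump. It also skips articles with an empty `paragraphs` list instead of failing on `[0]`.

```python
import gzip
import json
import os
import tempfile


def load_passages(path, first_only=True):
    """Read a gzipped JSON-lines file and collect paragraphs from each article."""
    passages = []
    with gzip.open(path, "rt", encoding="utf8") as f_in:
        for line in f_in:
            line = line.strip()
            if not line:
                continue
            paragraphs = json.loads(line).get("paragraphs", [])
            if not paragraphs:
                continue  # skip articles without text instead of raising IndexError
            passages.extend(paragraphs[:1] if first_only else paragraphs)
    return passages


# Demo on a made-up file with the same "paragraphs" field as the cell above
toy_articles = [
    {"title": "A", "paragraphs": ["A first.", "A second."]},
    {"title": "B", "paragraphs": []},
    {"title": "C", "paragraphs": ["C first."]},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy.jsonl.gz")
    with gzip.open(toy_path, "wt", encoding="utf8") as f_out:
        for article in toy_articles:
            f_out.write(json.dumps(article) + "\n")
    first_paragraphs = load_passages(toy_path)
    all_paragraphs = load_passages(toy_path, first_only=False)

print(len(first_paragraphs), len(all_paragraphs))
```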


## Subset the Data

In [56]:
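# We subset our data so we only use a part of Wikipedia and things run faster.
# Note: a passage that contains several of the keywords is added once per matching
# keyword, so the subset can contain duplicate passages.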
passages = [passage for passage in passages for x in ['india', 'sports', 'politics',
                                                      'music', 'history', 'machine learning',
                                                      'artificial intelligence', 'movies',
                                                      'places', 'animals', 'books']
                                                    if x in passage.lower()]
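
The nested comprehension above adds a passage once for every keyword it contains. The sketch below contrasts that behaviour with an `any(...)` based filter that keeps each passage at most once. The function names and toy passages are made up for illustration. Swapping the filter in the notebook would change the passage count and the corpus indices shown in later outputs.

```python
def subset_with_duplicates(passages, keywords):
    """Same pattern as the cell above: one copy per matching keyword."""
    return [p for p in passages for kw in keywords if kw in p.lower()]


def subset_unique(passages, keywords):
    """Keep each passage at most once if it matches any keyword."""
    return [p for p in passages if any(kw in p.lower() for kw in keywords)]


toy_keywords = ["india", "music"]
toy_passages = [
    "Indian classical music is the music of India.",  # matches both keywords
    "The violin is used in music.",                   # matches one keyword
    "A triangle has three sides.",                    # matches none
]

nested = subset_with_duplicates(toy_passages, toy_keywords)
unique = subset_unique(toy_passages, toy_keywords)
print(len(nested), len(unique))
```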

## Look at sample documents

In [57]:
len(passages)

19867

In [58]:
passages[6]

'Warriors is a series of fantasy fiction books written by Erin Hunter. The series is about the adventures of wild cats as they try to survive in their forest homes. The series is made up of four mini-series with six books in each series. The first of these, called "Warriors", was released in 2003, starting with the book "Into the Wild". The authors were not planning to write another mini-series, but they eventually began the second mini-series. The second mini-series is titled "Warriors: The New Prophecy" and was published in 2005. The first book was called "Midnight". The first book of the third series "Warriors: The Power of Three", "The Sight", was released on April 24, 2007. The fourth series, "Warriors: Omen of the Stars", began with "The Fourth Apprentice"'

## Load Transformer Models

In [10]:
if not torch.cuda.is_available():
    print("Warning: No GPU found. Please add GPU to your notebook")

# We use the bi-encoder to encode all passages, so that we can use it for semantic search
bi_encoder = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
# The bi-encoder retrieves the top_k candidate passages (top_k=20 in the cells below).
# We use a cross-encoder to re-rank the result list and improve its quality
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')


## Get Wikipedia Document Embeddings

In [59]:
# We encode all passages into our vector space. This may take a few seconds (depending on your GPU speed)
corpus_embeddings = bi_encoder.encode(passages)

In [60]:
corpus_embeddings[6], corpus_embeddings[6].shape

(array([ 1.64197758e-02, -2.07085442e-02, -2.17787493e-02,  6.74611852e-02,
        -4.11994644e-02,  2.11862829e-02, -9.29941386e-02, -8.40707645e-02,
         1.07269898e-01,  7.50587061e-02, -9.32021998e-03,  9.74128172e-02,
         2.04353407e-02, -1.85746606e-02,  4.49149497e-02,  2.88625658e-02,
        -5.18515706e-02, -3.71433906e-02,  8.14397335e-02,  1.85641348e-02,
         3.34720500e-02, -2.08245777e-02, -1.78623050e-02,  1.66738387e-02,
         5.78140654e-02, -6.89734146e-02, -2.29526758e-02,  1.23695806e-02,
        -9.31518599e-02,  6.53616190e-02, -8.31344426e-02, -5.31980880e-02,
        -6.74653128e-02, -5.40473685e-03, -4.73618619e-02, -1.12483732e-03,
         3.80187668e-02,  2.77996268e-02, -4.28113341e-02,  2.24229880e-02,
        -5.16939815e-03,  1.13870287e-02,  2.42556091e-02,  4.37106052e-03,
        -1.57549437e-02,  4.50212993e-02, -1.22535400e-01, -4.95266318e-02,
         1.01762906e-01, -3.13249789e-02, -3.23150754e-02, -3.33762094e-02,
        -3.5

In [61]:
corpus_embeddings.shape

(19867, 384)

In [62]:
query = "What is the capital of India?"
query

'What is the capital of India?'

In [63]:
# Get Embedding for query
query_embedding = bi_encoder.encode(query)
query_embedding.shape

(384,)

In [64]:
# Get cosine similarity scores of the document embeddings against the query embedding
cos_scores = util.pytorch_cos_sim(query_embedding, corpus_embeddings)[0]
cos_scores

tensor([-0.0456,  0.1076,  0.0780,  ..., -0.0107, -0.0464,  0.1008])

In [65]:
# Get Most Similar Document ID
top_results = torch.topk(cos_scores, k=1)
idx = top_results.indices
idx

tensor([415])

In [66]:
# Get Most Similar Document
passages[idx]

"Mumbai (previously known as Bombay until 1996) is a natural harbor on the west coast of India, and is the capital city of Maharashtra state. It is India's largest city, and one of the world's most populous cities. It is the financial capital of India. The city is the second most-populous in the world. It has approximately 13 million people. Along with the neighboring cities of Navi Mumbai and Thane, it forms the world's 4th largest urban agglomeration. They have around 19.1 million people."

Mumbai is not the capital of India. This time the bi-encoder returns the wrong document as the most similar one.

In [67]:
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=1)
hits[0]

[{'corpus_id': 415, 'score': 0.5979241728782654}]

In [68]:
hits[0][0]['corpus_id']

415

## Bi Encoder + ReRanker Cross Encoder Search

The re-ranking strategy is similar to what is used in RAG (retrieval-augmented generation) pipelines. The cross-encoder model has been fine-tuned on (query, passage) pairs from MS MARCO. Given such a pair, it outputs a score that reflects how relevant the passage is to the query, rather than how similar the two texts are.

In [69]:
# Get top k similar documents to the query
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=20)[0]
# Format data for the reranker -> [query, similar_doc] for each of the top_k similar documents
reranker_inp = [[query, passages[hit['corpus_id']]] for hit in hits]
reranker_inp[:3] # look at the first 3 query inputs to the reranker cross encoder model

[['What is the capital of India?',
  "Mumbai (previously known as Bombay until 1996) is a natural harbor on the west coast of India, and is the capital city of Maharashtra state. It is India's largest city, and one of the world's most populous cities. It is the financial capital of India. The city is the second most-populous in the world. It has approximately 13 million people. Along with the neighboring cities of Navi Mumbai and Thane, it forms the world's 4th largest urban agglomeration. They have around 19.1 million people."],
 ['What is the capital of India?',
  "Kolkata (spelled Calcutta before 1 January 2001) is the capital city of the Indian state of West Bengal. It is the second largest city in India after Mumbai. It is on the east bank of the River Hooghly. When it is called Calcutta, it includes the suburbs. This makes it the third largest city of India. This also makes it the world's 8th largest metropolitan area as defined by the United Nations. Kolkata served as the capita

In [70]:
# Get Reranker score for every similar document
reranker_scores = cross_encoder.predict(reranker_inp)
reranker_scores[:3] # look at relevance scores from reranker cross encoder

array([3.861031 , 3.4595218, 2.7084024], dtype=float32)
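
The scores above are not confined to the range 0 to 1. Assuming they are raw logits (whether the library applies an output activation can depend on its version and on the model configuration), a logistic sigmoid maps them into (0, 1) without changing their order. The sketch below applies it to the three values printed above; the `sigmoid` helper is written here and is not part of the notebook.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    exp_x = math.exp(x)
    return exp_x / (1.0 + exp_x)


raw_scores = [3.861031, 3.4595218, 2.7084024]  # values printed by the cell above
probabilities = list(map(sigmoid, raw_scores))

for raw, prob in zip(raw_scores, probabilities):
    print(f"raw={raw:.4f} -> sigmoid={prob:.4f}")
```

Because the sigmoid is monotonic, sorting by the raw score and sorting by the squashed score give the same ranking, so re-ranking itself does not need this step.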

In [71]:
hits[:3]

[{'corpus_id': 415, 'score': 0.5979241728782654},
 {'corpus_id': 2761, 'score': 0.5937108993530273},
 {'corpus_id': 15636, 'score': 0.5878057479858398}]

In [72]:
# Add Reranker score back to the hits dictionary
for i, hit in enumerate(hits):  # `i` avoids shadowing the built-in `id`
    hit['reranker_score'] = reranker_scores[i]
hits[:3]

[{'corpus_id': 415, 'score': 0.5979241728782654, 'reranker_score': 3.861031},
 {'corpus_id': 2761, 'score': 0.5937108993530273, 'reranker_score': 3.4595218},
 {'corpus_id': 15636,
  'score': 0.5878057479858398,
  'reranker_score': 2.7084024}]

In [73]:
# Show the top similar document to query based on both models
print("Top Bi-Encoder Retrieval hit: ")
hit = sorted(hits, key=lambda x: x['score'], reverse=True)[0]
print(passages[hit['corpus_id']])

print("Top Reranker Retrieval hit: ")
hit = sorted(hits, key=lambda x: x['reranker_score'], reverse=True)[0]
print(passages[hit['corpus_id']])

Top Bi-Encoder Retrieval hit: 
Mumbai (previously known as Bombay until 1996) is a natural harbor on the west coast of India, and is the capital city of Maharashtra state. It is India's largest city, and one of the world's most populous cities. It is the financial capital of India. The city is the second most-populous in the world. It has approximately 13 million people. Along with the neighboring cities of Navi Mumbai and Thane, it forms the world's 4th largest urban agglomeration. They have around 19.1 million people.
Top Reranker Retrieval hit: 
New Delhi () is the capital of India and a union territory of the megacity of Delhi. It has a very old history and is home to several monuments where the city is expensive to live in. In traditional Indian geography it falls under the North Indian zone. The city has an area of about 42.7 km. New Delhi has a population of about 9.4 Million people.


Here the cross-encoder model re-ranks the outputs from the bi-encoder and returns the most relevant document, which contains the correct answer.

In [74]:
# function to return the top similar document based on any query
def improved_search(query, top_k=20):
    # print the input question
    print("Input question:", query)

    ##### Bi-Encoder: Semantic Search #####
    # Encode the query using the bi-encoder and find potentially relevant passages
    query_embedding = bi_encoder.encode(query)
    hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=top_k)[0]

    ##### Cross-Encoder: Re-Ranking #####
    # Now, score all retrieved passages with the reranker cross encoder
    reranker_inp = [[query, passages[hit['corpus_id']]] for hit in hits]
    reranker_scores = cross_encoder.predict(reranker_inp)

    # Store reranker cross encoder scores back into the hits variable
    for i, hit in enumerate(hits):  # `i` avoids shadowing the built-in `id`
        hit['reranker_score'] = reranker_scores[i]

    # Output of top-3 hit from bi-encoder
    print("Top-3 Bi-Encoder Retrieval hit: ")
    hit = sorted(hits, key=lambda x: x['score'], reverse=True)[:3]
    print(*[passages[h['corpus_id']] for h in hit], sep="\n")
    print("-------------------------------------------------------")
    # Output of top-3 hit from re-ranker
    print("Top-3 Reranker Retrieval hit: ")
    hit = sorted(hits, key=lambda x: x['reranker_score'], reverse=True)[:3]
    print(*[passages[h['corpus_id']] for h in hit], sep="\n")
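
`improved_search` only prints its results. For programmatic use it can help to return them instead. The sketch below is one hedged way to do that: `rerank_hits` is a hypothetical helper that takes the scoring function as a parameter, so in the notebook one could pass `cross_encoder.predict`, while the demo uses a made-up `fake_predict` so that it runs without downloading a model.

```python
def rerank_hits(query, passages, hits, score_fn, top_n=3):
    """Return copies of the top_n hits, sorted by the score that score_fn assigns."""
    pairs = [[query, passages[hit['corpus_id']]] for hit in hits]
    scores = score_fn(pairs)
    # Copy each hit so the caller's list is left unchanged
    rescored = [dict(hit, reranker_score=float(score)) for hit, score in zip(hits, scores)]
    rescored.sort(key=lambda h: h['reranker_score'], reverse=True)
    return rescored[:top_n]


def fake_predict(pairs):
    """Hypothetical stand-in for a cross-encoder: one score per [query, passage] pair."""
    return [1.0 if "is the capital of india" in passage.lower() else 0.0
            for _, passage in pairs]


toy_passages = [
    "Mumbai is the financial capital of India.",
    "New Delhi is the capital of India.",
]
toy_hits = [
    {'corpus_id': 0, 'score': 0.60},
    {'corpus_id': 1, 'score': 0.59},
]

top = rerank_hits("What is the capital of India?", toy_passages, toy_hits, fake_predict)
print(toy_passages[top[0]['corpus_id']])
```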

In [75]:
improved_search(query="What is the capital of India?")

Input question: What is the capital of India?
Top-3 Bi-Encoder Retrieval hit: 
Mumbai (previously known as Bombay until 1996) is a natural harbor on the west coast of India, and is the capital city of Maharashtra state. It is India's largest city, and one of the world's most populous cities. It is the financial capital of India. The city is the second most-populous in the world. It has approximately 13 million people. Along with the neighboring cities of Navi Mumbai and Thane, it forms the world's 4th largest urban agglomeration. They have around 19.1 million people.
Kolkata (spelled Calcutta before 1 January 2001) is the capital city of the Indian state of West Bengal. It is the second largest city in India after Mumbai. It is on the east bank of the River Hooghly. When it is called Calcutta, it includes the suburbs. This makes it the third largest city of India. This also makes it the world's 8th largest metropolitan area as defined by the United Nations. Kolkata served as the capita

In [83]:
improved_search(query = "Which place has the most population on Earth?")

Input question: Which place has the most population on Earth?
Top-3 Bi-Encoder Retrieval hit: 
Places in the United States of America:
Asia is the largest continent on Earth. It is in the northern hemisphere. Asia is connected to Europe in the west (creating a supercontinent called Eurasia). Some of the oldest human civilizations began in Asia, such as Sumer, China, and India. Asia was also home to some large empires such as the Persian Empire, the Mughal Empire, the Mongol Empire, and the Ming Empire. It is home to at least 44 countries. Turkey, Russia, Georgia and Cyprus are partly in other continents.
Places in the United States:
-------------------------------------------------------
Top-3 Reranker Retrieval hit: 
The Northern Hemisphere is the part of the planet that is north of the equator. It has about 90 percent of world's population and most of the world's land. All of North America and Europe are in the Northern Hemisphere. Most of Asia, two-thirds of Africa and 10 percent of

In [91]:
improved_search(query = "What is AI?")

Input question: What is AI?
Top-3 Bi-Encoder Retrieval hit: 
Artificial intelligence (AI) is the ability of a computer program or a machine to think and learn. It is also a field of study which tries to make computers "smart". They work on their own without being encoded with commands. John McCarthy came up with the name "Artificial Intelligence" in 1955.
A.I. Artificial Intelligence, or A.I., is a 2001 American science fiction drama movie directed by Steven Spielberg. The screenplay was by Spielberg based on the 1969 short story "Supertoys Last All Summer Long" by Brian Aldiss.
Air India (AI/AIC) () "(officially known as Air India Air Transport Services Limited)" is the national airline company of India. Air India is part of the "National Aviation Company of India Limited"
-------------------------------------------------------
Top-3 Reranker Retrieval hit: 
Artificial intelligence (AI) is the ability of a computer program or a machine to think and learn. It is also a field of study w

In [89]:
improved_search(query = "How old is the olympics?")

Input question: How old is the olympics?
Top-3 Bi-Encoder Retrieval hit: 
United States at the Olympics is a history which starts in the 1890s.
The 2010 Summer Youth Olympic Games, officially known as the I Olympic Youth Summer Games, is an international summer sports event that was celebrated from August 14 to August 26, 2010 for youths. It was the first Youth Olympic Games(YOG) and the host city was Singapore.
The Olympic Games () is an important international event featuring summer and winter sports. Summer Olympic Games and Winter Olympic Games are held every four years. Originally, the ancient Olympic Games were held in Ancient Greece at Olympia. The first games were in 776 BC. They were held every four years until the 6th century AD. The first "modern" Olympics happened in 1896 in Athens, Greece. Athletes participate in the Olympics Games to represent their country.
-------------------------------------------------------
Top-3 Reranker Retrieval hit: 
The Olympic Games () is an i

In [85]:
improved_search(query = "What is the most dangerous animal?")

Input question: What is the most dangerous animal?
Top-3 Bi-Encoder Retrieval hit: 
For most animals, defence against predators is vital. Being eaten is not the only threat to life: parasites and diseases may also be fatal. But animals, especially small animals, are often eaten.
The term is most used for the Pleistocene megafauna the large land animals of the last ice age, such as mammoths. It is also used for the largest living wild land animals, especially elephants, giraffes, hippopotamus, rhinoceros, elk, condors, etc.
The Zebra Turkeyfish ("Dendrochirus zebra") is a very venomous fish. It lives in the Indian and Pacific seas. The fish has 13 venomous spines along its back, used to look after itself. The fish is slow and quiet, but can be a danger. The fish rests in dark places such as under a rock or a piece of coral. They aren't affected by each other's venom. They are solitary fish that are not scared of anything, as they have no predators other than groupers.
------------------