# Rerankers

Rerankers are models used to improve the relevance of search or retrieval results. After an initial retrieval (often using a vector database or keyword search), rerankers re-evaluate the top candidates to sort them more accurately based on the original query.

They typically use **deep learning models**, especially **cross-encoders**, which consider the query and each candidate together for finer semantic matching.

### Rerankers vs Retrievers

* **Retrievers** (e.g. in VectorDBs) quickly return a list of candidates based on similarity in embedding space.
* **Rerankers** take the top-N retrieved results and **rescore them** based on deeper semantic alignment with the query.
* While retrievers are fast and scalable, rerankers are **more accurate** but **computationally heavier**.
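
The two-stage split described in the list above can be sketched in plain Python. This is a toy illustration, not a real retriever or reranker: `cheap_score` (word overlap) stands in for embedding similarity, `expensive_score` (overlap plus a phrase bonus) stands in for a cross-encoder, and both functions and the sample corpus are hypothetical.

```python
def cheap_score(query, doc):
    """Stand-in for a fast retriever score: number of shared words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def expensive_score(query, doc):
    """Stand-in for a reranker: looks at query and document together,
    here by rewarding documents that contain the query as a phrase."""
    bonus = 5 if query.lower() in doc.lower() else 0
    return cheap_score(query, doc) + bonus


def retrieve_then_rerank(query, corpus, top_n=3, top_k=1):
    # Stage 1: score every document cheaply and keep the top N candidates.
    candidates = sorted(corpus, key=lambda d: cheap_score(query, d), reverse=True)[:top_n]
    # Stage 2: rescore only those N candidates with the costlier function.
    reranked = sorted(candidates, key=lambda d: expensive_score(query, d), reverse=True)
    return reranked[:top_k]


corpus = [
    "revenue uber total annual figures",          # shares both words, but not the phrase
    "Uber revenue grew in 2024",                  # contains the exact phrase
    "drivers and couriers earn on the platform",
    "cost of revenue excludes depreciation",
]
best = retrieve_then_rerank("uber revenue", corpus)
print(best)
```

The first two documents tie in stage 1; only the second stage, which is applied to three candidates instead of the whole corpus, separates them.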

### Common Reranking Models

* **Cross-Encoders**: Models like `cross-encoder/ms-marco-MiniLM-L-6-v2` that take *(query, document)* pairs and return a relevance score.

  * They consider full interaction between query and document.
  * Typically used for reordering top-K results (e.g., top 100 to top 5).
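* **MonoBERT / MonoT5**: Pointwise rerankers that score each *(query, document)* pair independently. MonoBERT is an encoder-only cross-encoder; MonoT5 is an encoder-decoder (seq2seq) model that generates a "true" or "false" token, and the probability of "true" serves as the relevance score.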
* **LLMs (e.g., GPT)**: Can act as rerankers via prompt-based ranking or chain-of-thought relevance scoring.
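
For the prompt-based option, one possible shape is a listwise prompt plus a tolerant parser for the model's reply. The sketch below makes no API call; `build_rerank_prompt`, `parse_ranking`, and the prompt wording are hypothetical and only show the bookkeeping around an LLM reranker.

```python
import re


def build_rerank_prompt(query, passages):
    """Format a listwise ranking prompt; the wording is illustrative only."""
    lines = [
        f"Query: {query}",
        "Rank the passages below from most to least relevant to the query.",
        "Answer with passage numbers only, e.g. 2 > 1 > 3.",
        "",
    ]
    for i, passage in enumerate(passages, start=1):
        lines.append(f"[{i}] {passage}")
    return "\n".join(lines)


def parse_ranking(reply, n_passages):
    """Turn a reply such as '2 > 1 > 3' into zero-based indices.

    Out-of-range and repeated numbers are ignored, and any passage the
    model left out is appended in its original order.
    """
    order = []
    for token in re.findall(r"\d+", reply):
        idx = int(token) - 1
        if 0 <= idx < n_passages and idx not in order:
            order.append(idx)
    order += [i for i in range(n_passages) if i not in order]
    return order


passages = ["Uber's mission statement", "Consolidated statements of operations", "Risk factors"]
prompt = build_rerank_prompt("what is uber's revenue", passages)
order = parse_ranking("2 > 3", len(passages))  # a made-up model reply
print([passages[i] for i in order])
```

Falling back to the original order for omitted passages keeps the output a full permutation even when the reply is incomplete.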

### Tradeoffs

| Feature       | Retriever (e.g. VectorDB) | Reranker                  |
| ------------- | ------------------------- | ------------------------- |
| Speed         | Fast (sub-second)         | Slower (per item scoring) |
| Accuracy      | Moderate                  | High                      |
| Scalability   | High (millions of docs)   | Low (top-N only)          |
| Contextuality | Limited                   | Deep semantic matching    |
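
The "per item scoring" cost in the table grows with the number of candidates passed to the reranker. A rough back-of-the-envelope estimate, assuming candidates are scored in fixed-size batches and every batch takes about the same time (the function name and all numbers below are hypothetical, not measurements):

```python
import math


def estimate_rerank_seconds(n_candidates, batch_size, seconds_per_batch):
    """Estimate reranking time as (number of batches) x (time per batch)."""
    n_batches = math.ceil(n_candidates / batch_size)
    return n_batches * seconds_per_batch


# Hypothetical figures: batches of 32 pairs at 0.05 s per batch.
for n in (15, 100, 1000):
    print(n, "candidates ->", round(estimate_rerank_seconds(n, 32, 0.05), 2), "s")
```

This is why rerankers are applied to the top-N only: the estimate scales linearly with N, whereas the retriever's index lookup does not rescore every document per query.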

### Use in RAG Pipelines

Rerankers are widely used in **RAG (Retrieval-Augmented Generation)** systems to ensure only the **most relevant passages** are sent to the generator (e.g., GPT). This improves generation quality and factual accuracy.
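
A minimal sketch of that last step, assuming the passages are already sorted by reranker score and that the generator's context is limited by a simple character budget (`build_context` and `max_chars` are hypothetical names; a real pipeline would more likely count tokens):

```python
def build_context(ranked_passages, max_chars=2000):
    """Concatenate passages in rank order until the character budget is used.

    Only passage text counts toward the budget; the blank-line separators
    added by the join are not included.
    """
    selected, used = [], 0
    for passage in ranked_passages:
        if used + len(passage) > max_chars:
            break  # stop at the first passage that no longer fits
        selected.append(passage)
        used += len(passage)
    return "\n\n".join(selected)


ranked = ["a" * 10, "b" * 10, "c" * 10]
context = build_context(ranked, max_chars=25)
print(context)
```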


In [1]:
import pdfplumber
import pandas as pd
import numpy as np

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Vector store (alternatives include FAISS and Pinecone)
import chromadb
from chromadb import PersistentClient
from chromadb.config import Settings

In [2]:
pdf_reader = pdfplumber.open("../Data/Uber-2024-Annual-Report.pdf")
len(pdf_reader.pages)

142

#### Chunking Strategies
- Fixed-size chunking - chunks of a fixed length
- Sentence-based chunking - split at end-of-sentence (EOS) markers
- Newline-based chunking - split on \n
- Paragraph-based chunking - split on \n\n
- Page-based chunking - one chunk per page
- Token-based chunking - a fixed number of tokens rather than words
- Sliding-window chunking - each chunk overlaps some content from the previous chunk
- Hierarchical chunking - breaks documents down at multiple levels, such as sections, subsections, and paragraphs
- Content-aware chunking - text is chunked at paragraph level and tables are kept as separate entities
- Table-aware chunking
- Keyword-based chunking - split at marker headings such as Introduction, Conclusion, and Summary
- Hybrid chunking - different chunking strategies combined depending on the data
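
As one concrete case from the list, sliding-window chunking can be sketched as follows. This is an illustrative helper (`sliding_window_chunks` is not used elsewhere in this notebook) that works on a list of words; the same idea applies to tokens or characters.

```python
def sliding_window_chunks(words, size, overlap):
    """Split a list of words into chunks of `size` words, where each chunk
    repeats the last `overlap` words of the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):
            break  # this window already reaches the end of the text
    return chunks


words = "one two three four five six seven".split()
chunks = sliding_window_chunks(words, size=4, overlap=2)
for chunk in chunks:
    print(" ".join(chunk))
```

With `size=4` and `overlap=2` the window advances two words at a time, so a sentence that straddles a chunk boundary still appears whole in at least one chunk.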

In [3]:
text_content = []
document_name = "".join(pdf_reader.stream.name.split("/")[-1].split(".")[:-1])

for i, page in enumerate(pdf_reader.pages):
    text_page = page.extract_text()

    split_text = text_page.split("\n")

    for text in split_text:
        if len(text.split(" ")) > 10:
            text_content.append({
                "type" : "text",
                "document": document_name,
                "page": f"{i+1}",
                "content": text
            })

text_content[0]

{'type': 'text',
 'document': 'Uber-2024-Annual-Report',
 'page': '2',
 'content': 'We are Uber. The go-getters. The kind of people who are relentless about our'}

In [4]:
len(text_content)

4476

In [5]:
text_content = []

def find_middle_newline(s):
    # Step 1: Find all indexes of '\n'
    newline_indices = [i for i, char in enumerate(s) if char == '\n']
    
    if not newline_indices:
        return None  # No newline found
    
    # Step 2: Find the middle index
    middle_index = len(newline_indices) // 2
    
    # Step 3: Return the position of the middle '\n'
    return newline_indices[middle_index]


document_name = "".join(pdf_reader.stream.name.split("/")[-1].split(".")[:-1])


for i, page in enumerate(pdf_reader.pages):
    text_page = page.extract_text()

    if len(text_page.split(" ")) < 10:
        # Single quotes inside the f-string keep this valid on Python versions before 3.12.
        print(f"Page number: {i+1}, count: {len(text_page.split(' '))}")
        continue

    if len(text_page) > 5000:
        mid_index = find_middle_newline(text_page)
        text_content.append({
            "type" : "text",
            "document": document_name,
            "page": f"{i+1}",
            "split":f"0",
            "content": text_page[:mid_index]
        })

        text_content.append({
            "type" : "text",
            "document": document_name,
            "page": f"{i+1}",
            "split":f"1",
            "content": text_page[mid_index+1:]
        })
    else:
        text_content.append({
                    "type" : "text",
                    "document": document_name,
                    "page": f"{i+1}",
                    "split":f"0",
                    "content": text_page
                })

text_content[0]

Page number: 1, count: 5
Page number: 139, count: 2
Page number: 140, count: 5


{'type': 'text',
 'document': 'Uber-2024-Annual-Report',
 'page': '2',
 'split': '0',
 'content': 'Uber’s Mission\nWe reimagine the way the world moves for the better\nWe are Uber. The go-getters. The kind of people who are relentless about our\nmission to help people go anywhere and get anything and earn their way.\nMovement is what we power. It’s our lifeblood. It runs through our veins. It’s\nwhat gets us out of bed each morning. It pushes us to constantly reimagine\nhow we can move better. For you. For all the places you want to go. For all the\nthings you want to get. For all the ways you want to earn. Across the entire\nworld. In real time. At the incredible speed of now.'}

In [6]:
len(text_content)

205

In [7]:
text_doc = pd.DataFrame(text_content)
text_doc.head()

|     | type | document                | page | split | content                                            |
| --- | ---- | ----------------------- | ---- | ----- | -------------------------------------------------- |
| 0   | text | Uber-2024-Annual-Report | 2    | 0     | Uber’s Mission\nWe reimagine the way the world...  |
| 1   | text | Uber-2024-Annual-Report | 3    | 0     | UNITED STATES\nSECURITIES AND EXCHANGE COMMISS...  |
| 2   | text | Uber-2024-Annual-Report | 4    | 0     | Large accelerated filer ☒ Accelerated filer ☐\...  |
| 3   | text | Uber-2024-Annual-Report | 5    | 0     | UBER TECHNOLOGIES, INC.\nTABLE OF CONTENTS\nPa...  |
| 4   | text | Uber-2024-Annual-Report | 6    | 0     | SPECIAL NOTE REGARDING FORWARD-LOOKING STATEME...  |


In [8]:
text_doc["MetaData"] = text_doc.apply(lambda x: {"Document": x["document"], "Page": x["page"], "Split": x["split"], "Type": x["type"]}, axis=1)
text_doc = text_doc.drop(["type", "document", "page"], axis=1)
text_doc.head()


|     | split | content                                            | MetaData                                           |
| --- | ----- | -------------------------------------------------- | -------------------------------------------------- |
| 0   | 0     | Uber’s Mission\nWe reimagine the way the world...  | {'Document': 'Uber-2024-Annual-Report', 'Page'...  |
| 1   | 0     | UNITED STATES\nSECURITIES AND EXCHANGE COMMISS...  | {'Document': 'Uber-2024-Annual-Report', 'Page'...  |
| 2   | 0     | Large accelerated filer ☒ Accelerated filer ☐\...  | {'Document': 'Uber-2024-Annual-Report', 'Page'...  |
| 3   | 0     | UBER TECHNOLOGIES, INC.\nTABLE OF CONTENTS\nPa...  | {'Document': 'Uber-2024-Annual-Report', 'Page'...  |
| 4   | 0     | SPECIAL NOTE REGARDING FORWARD-LOOKING STATEME...  | {'Document': 'Uber-2024-Annual-Report', 'Page'...  |


In [9]:
model_name = "all-MiniLM-L6-v2"
embedding_model = SentenceTransformer(model_name)
only_text = text_doc["content"].tolist()

embeddings = embedding_model.encode(only_text)
ids = text_doc["MetaData"].apply(lambda x: f"{x['Document']}_p{x['Page']}_s{x['Split']}") 

In [10]:
Chroma_DB_Path = "../Store/2_VectorDB"
COLLECTION_NAME = "uber_revenue"

# chroma_client = chromadb.Client(Settings(
#     persist_directory=Chroma_DB_Path,
#     anonymized_telemetry=False
# ))

chroma_client = PersistentClient(path=Chroma_DB_Path)

# collection = chroma_client.get_or_create_collection(name=COLLECTION_NAME)

try:
    collection = chroma_client.get_collection(name=COLLECTION_NAME)
    print(f"Collection '{COLLECTION_NAME}' exists.")
    # You can now work with the 'collection' object
except Exception as e:
    print(f"Collection '{COLLECTION_NAME}' does not exist. {str(e)}")
    # You might choose to create the collection here
    collection = chroma_client.create_collection(name=COLLECTION_NAME)
    # print(f"Collection '{collection_name}' created.")
       
    collection.add(
        documents=text_doc['content'].tolist(),
        metadatas=text_doc['MetaData'].tolist(),
        ids=ids.tolist()
    )
    print("Successfully stored")

Collection 'uber_revenue' exists.


In [11]:
caching = []
cache_emd = []

In [12]:
def get_chroma_results(query):
    query_emd = embedding_model.encode([query])
    
    if len(cache_emd) > 0:
        cache_emd_array = np.vstack(cache_emd) 
        similarities = cosine_similarity(query_emd, cache_emd_array)
        best_match_indexes = [np.argmax(item) for item in similarities]

        if len(best_match_indexes) > 0 and similarities[0][best_match_indexes[0]] > 0.8:
            # Single quotes inside the f-string keep this valid on Python versions before 3.12.
            print(f"Returning from query: {caching[best_match_indexes[0]]['query']} cache with score: {similarities[0][best_match_indexes[0]]:.4f}")
            return (similarities[0][best_match_indexes[0]], caching[best_match_indexes[0]]["query"], caching[best_match_indexes[0]]["results"])
    

    results = collection.query(
        query_texts=[query],
        n_results=15
    )

    caching.append({"query": query, "results": results}) 
    cache_emd.append(query_emd)
    
    return (0, caching[-1]["query"], results)
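
The cache check in `get_chroma_results` is a small semantic cache: a new query reuses stored results when its embedding is close enough to an earlier query's embedding. The same lookup, reduced to plain Python with toy 2-D vectors in place of real embeddings (`cosine` and `cache_lookup` are illustrative names, and the 0.8 threshold mirrors the cell above):

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cache_lookup(query_vec, cached_vecs, threshold=0.8):
    """Return (index, similarity) of the closest cached query if it is
    above the threshold, otherwise None (a cache miss)."""
    best_index, best_sim = None, -1.0
    for i, vec in enumerate(cached_vecs):
        sim = cosine(query_vec, vec)
        if sim > best_sim:
            best_index, best_sim = i, sim
    if best_index is not None and best_sim > threshold:
        return best_index, best_sim
    return None


cached = [[1.0, 0.0], [0.0, 1.0]]         # embeddings of two earlier queries
hit = cache_lookup([0.9, 0.1], cached)    # close to the first cached query
miss = cache_lookup([0.7, 0.7], cached)   # equally far from both
print(hit, miss)
```

The threshold is a trade-off: a lower value returns cached results for more paraphrases but risks answering a different question with stale results.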

In [None]:
query = "what is uber\'s revenue"
sim_score, ret_query, result = get_chroma_results(query=query)

In [14]:
result['documents'][0]

['UBER TECHNOLOGIES, INC.\nCONSOLIDATED STATEMENTS OF OPERATIONS\n(In millions, except share amounts which are reflected in thousands, and per share amounts)\nYear Ended December 31,\n2022 2023 2024\nRevenue $ 31,877 $ 37,281 $ 43,978\nCosts and expenses\nCost of revenue, exclusive of depreciation and amortization shown separately below 19,659 22,457 26,651\nOperations and support 2,413 2,689 2,732\nSales and marketing 4,756 4,356 4,337\nResearch and development 2,798 3,164 3,109\nGeneral and administrative 3,136 2,682 3,639\nDepreciation and amortization 947 823 711\nTotal costs and expenses 33,709 36,171 41,179\nIncome (loss) from operations (1,832) 1,110 2,799\nInterest expense (565) (633) (523)\nOther income (expense), net (7,029) 1,844 1,849\nIncome (loss) before income taxes and income (loss) from equity method investments (9,426) 2,321 4,125\nProvision for (benefit from) income taxes (181) 213 (5,758)\nIncome (loss) from equity method investments 107 48 (38)\nNet income (loss) i

## 🔍 Reranker Models — Examples & Use Cases

| Model Name                                          | Type                         | Size      | Speed       | Accuracy       | When to Use                                                                                                      |
| --------------------------------------------------- | ---------------------------- | --------- | ----------- | -------------- | ---------------------------------------------------------------------------------------------------------------- |
| `cross-encoder/ms-marco-MiniLM-L-6-v2`              | Cross-Encoder                | ~22M      | ✅ Fast      | ⚠️ Medium-High | Best all-rounder for small to medium-scale reranking (Top 100 to Top 5–10).                                      |
| `cross-encoder/ms-marco-TinyBERT-L-2-v2`            | Cross-Encoder                | ~4M       | ⚡ Very Fast | ⚠️ Medium      | Ultra low-latency reranking with decent performance. Use on edge devices or low-resource setups.                 |
| `cross-encoder/ms-marco-MiniLM-L-12-v2`             | Cross-Encoder                | ~33M      | ❌ Slower    | ✅ High         | Better for reranking when latency is less of a concern (e.g., batch jobs).                                       |
| `cross-encoder/bert-base`                           | Cross-Encoder                | ~110M     | ❌ Slow      | ✅ High         | Baseline model with robust performance. Good for experimentation.                                                |
| `cross-encoder/roberta-base`                        | Cross-Encoder                | ~125M     | ❌ Slow      | ✅ High         | More accurate on some tasks than BERT. Use when accuracy is a priority.                                          |
| `rerank-multilingual-MiniLM` (from Cohere or SBERT) | Cross-Encoder (Multilingual) | ~65M      | ✅ Fast      | ⚠️ Medium-High | Use for multilingual documents. Works well across languages.                                                     |
| `monoT5-small`                                      | Seq2Seq                      | ~60M      | ⚠️ Medium   | ✅ High         | Lightweight T5 for reranking. Use for quality-first pipelines with moderate resources.                           |
| `monoT5-base` / `monoT5-large`                      | Seq2Seq                      | 220M–770M | ❌ Slow      | 🚀 Very High   | Use for high-quality document reranking where latency isn't critical. Ideal in offline or batch mode.            |
| OpenAI GPT (via API or prompt)                      | LLM Reranker                 | billions  | ⚠️ Variable | 🚀 Very High   | Use for few-shot or CoT-style reranking when transparency and reasoning are needed. Cost and latency are higher. |

---

## 📌 When to Use Which Reranker?

### ✅ If you need **speed** (real-time apps, chatbots):

* Use: `MiniLM-L6-v2`, `TinyBERT-L2-v2`
* Why: They balance speed and relevance well for interactive systems.

### ✅ If you need **accuracy** (offline batch reranking, fact-sensitive RAG):

* Use: `MiniLM-L12-v2`, `monoT5-base`, or `GPT-based reranking`
* Why: These give better semantic alignment, especially for nuanced queries.

### ✅ If you support **multiple languages**:

* Use: `rerank-multilingual-MiniLM`
* Why: Trained on multilingual data; good for cross-language search.

### ✅ If you're constrained on **resources**:

* Use: `TinyBERT`, `MiniLM-L6`
* Why: Small and efficient, suitable for edge devices or large-scale deployments.

### ✅ If you want **explainable reranking** (e.g., CoT or score reasoning):

* Use: GPT or Claude, via prompt engineering
* Why: LLMs can rank based on reasoning steps (e.g., “this passage better answers the question because…”).

In [15]:
rerank_cache = {}

In [18]:
from sentence_transformers import CrossEncoder

# Load the cross-encoder once, instead of reloading it on every call.
RERANKER_MODEL_NAME = 'cross-encoder/ms-marco-MiniLM-L-12-v2'
reranker_model = CrossEncoder(RERANKER_MODEL_NAME)

def rerank_results(query):
    # Relies on the globals sim_score, ret_query and result set by the
    # most recent get_chroma_results() call.
    if sim_score > 0.9 and rerank_cache.get(ret_query):
        print(f"Reranker returning from cache for query {ret_query}")
        return rerank_cache.get(ret_query)

    docs = result['documents'][0]
    metas = result['metadatas'][0]

    # The cross-encoder scores each (query, document) pair jointly.
    pairs = [(query, doc) for doc in docs]
    scores = reranker_model.predict(pairs)

    # Keep each document together with its metadata and score.
    new_docs = list(zip(docs, metas, scores))

    # Sort by score (descending) and keep the top 5
    sorted_results = sorted(new_docs, key=lambda x: x[2], reverse=True)
    top_5 = sorted_results[:5]

    rerank_cache[query] = top_5

    return top_5
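
The scores returned by `reranker_model.predict` are not bounded to [0, 1]; the top passage in this notebook's run scores 6.1961. If they are raw logits, which is an assumption here and not something the notebook verifies, a sigmoid maps them to a 0 to 1 range without changing the ranking:

```python
import math


def sigmoid(logit):
    """Map an unbounded score to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


# 6.1961 is the top score printed in this notebook; the others are made up.
for raw in (6.1961, 0.0, -4.0):
    print(f"{raw:8.4f} -> {sigmoid(raw):.4f}")
```

Because the sigmoid is monotonic, sorting by the transformed score gives the same top 5 as sorting by the raw score; the transformation only helps when a fixed cut-off or a human-readable value is wanted.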

In [17]:
# Output the results
top_5 = rerank_results(query=query)
for i, (doc, meta, score) in enumerate(top_5):
    print("--"*50)
    # Single quotes inside the f-string keep this valid on Python versions before 3.12.
    print(f"{i+1}. Score: {score:.4f} - \nDocument: {doc[:1000]}...\nPage: {meta['Page']}")

----------------------------------------------------------------------------------------------------
1. Score: 6.1961 - 
Document: UBER TECHNOLOGIES, INC.
CONSOLIDATED STATEMENTS OF OPERATIONS
(In millions, except share amounts which are reflected in thousands, and per share amounts)
Year Ended December 31,
2022 2023 2024
Revenue $ 31,877 $ 37,281 $ 43,978
Costs and expenses
Cost of revenue, exclusive of depreciation and amortization shown separately below 19,659 22,457 26,651
Operations and support 2,413 2,689 2,732
Sales and marketing 4,756 4,356 4,337
Research and development 2,798 3,164 3,109
General and administrative 3,136 2,682 3,639
Depreciation and amortization 947 823 711
Total costs and expenses 33,709 36,171 41,179
Income (loss) from operations (1,832) 1,110 2,799
Interest expense (565) (633) (523)
Other income (expense), net (7,029) 1,844 1,849
Income (loss) before income taxes and income (loss) from equity method investments (9,426) 2,321 4,125
Provision for (benefit from

In [19]:
query = "what is the revenue of uber"
sim_score, ret_query, result = get_chroma_results(query=query)

# Output the results
top_5 = rerank_results(query=query)
for i, (doc, meta, score) in enumerate(top_5):
    print("--"*50)
    # Single quotes inside the f-string keep this valid on Python versions before 3.12.
    print(f"{i+1}. Score: {score:.4f} - \nDocument: {doc[:1000]}...\nPage: {meta['Page']}")

Returning from query: what is ubers revenue cache with score: 0.9415
Reranker returning from cache for query what is ubers revenue
----------------------------------------------------------------------------------------------------
1. Score: 6.1961 - 
Document: UBER TECHNOLOGIES, INC.
CONSOLIDATED STATEMENTS OF OPERATIONS
(In millions, except share amounts which are reflected in thousands, and per share amounts)
Year Ended December 31,
2022 2023 2024
Revenue $ 31,877 $ 37,281 $ 43,978
Costs and expenses
Cost of revenue, exclusive of depreciation and amortization shown separately below 19,659 22,457 26,651
Operations and support 2,413 2,689 2,732
Sales and marketing 4,756 4,356 4,337
Research and development 2,798 3,164 3,109
General and administrative 3,136 2,682 3,639
Depreciation and amortization 947 823 711
Total costs and expenses 33,709 36,171 41,179
Income (loss) from operations (1,832) 1,110 2,799
Interest expense (565) (633) (523)
Other income (expense), net (7,029) 1,844 1,849