# Rerankers

Jay Urbain, PhD
8/22/2024, 3/13/2025

Rerankers add a final "reranking" step to retrieval pipelines. In pipelines such as **R**etrieval **A**ugmented **G**eneration (RAG), they can substantially improve retrieval accuracy by reordering the retrieved documents so that the most relevant ones come first.

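This notebook builds a retrieval pipeline with reranking. The [Cohere reranking model](https://txt.cohere.com/rerank/) (which is available for free) is one option; the code below uses Pinecone's rerank endpoint with the `bge-reranker-v2-m3` model.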

References:  

https://docs.cohere.com/docs/overview 


https://www.pinecone.io/learn/series/rag/rerankers/

In [1]:
# !pip install -qU datasets
# !pip install -qU openai
# !pip install -qU pinecone-client

In [1]:
!pip install -qU \
    datasets==2.14.5 \
    "pinecone[grpc]"==5.1.0

In [2]:
# !pip install -qU \
#     datasets==2.14.5 \
#     openai==1.6.1 \
#     pinecone-client==3.1.0 \
#     cohere==4.27

In [None]:
from getpass import getpass

PINECONE_API_KEY = getpass("Pinecone API key: ")  # prompt instead of hard-coding the key


### Ranking Metrics

In [5]:
# recall@k function
def recall(actual, predicted, k):
    act_set = set(actual)
    pred_set = set(predicted[:k])
    result = round(len(act_set & pred_set) / float(len(act_set)), 2)
    return result

actual = ["2", "4", "5", "7"]
predicted = ["1", "2", "3", "4", "5", "6", "7", "8"]
for k in range(1, 9):
    print(f"Recall@{k} = {recall(actual, predicted, k)}")

Recall@1 = 0.0
Recall@2 = 0.25
Recall@3 = 0.25
Recall@4 = 0.5
Recall@5 = 0.75
Recall@6 = 0.75
Recall@7 = 1.0
Recall@8 = 1.0
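
As a worked equation (the symbols $A$ and $P_k$ are notation introduced here, not taken from the notebook), recall@k is the fraction of relevant items that appear among the top $k$ predictions:

$$
\text{recall@}k = \frac{|A \cap P_k|}{|A|}
$$

where $A$ is the set of relevant items and $P_k$ is the set of the first $k$ predicted items. For the example above with $k = 5$: $A = \{2, 4, 5, 7\}$ and $P_5 = \{1, 2, 3, 4, 5\}$, so $|A \cap P_5| = 3$ and recall@5 $= 3/4 = 0.75$, which matches the printed value.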


In [6]:
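# Mean reciprocal rank (MRR)
# note: the predicted ranking is assumed to be [1, 2, ..., 8], so an item's id is also its rank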

# relevant results for query #1, #2, and #3
actual_relevant = [
    [2, 4, 5, 7],
    [1, 4, 5, 7],
    [5, 8]
]

# number of queries
Q = len(actual_relevant)

# calculate the reciprocal of the first actual relevant rank
cumulative_reciprocal = 0
for i in range(Q):
    first_result = actual_relevant[i][0]
    reciprocal = 1 / first_result
    cumulative_reciprocal += reciprocal
    print(f"query #{i+1} = 1/{first_result} = {reciprocal}")

# calculate mrr
mrr = 1/Q * cumulative_reciprocal

# generate results
print("MRR =", round(mrr,2))

query #1 = 1/2 = 0.5
query #2 = 1/1 = 1.0
query #3 = 1/5 = 0.2
MRR = 0.57
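
The cell above works because each item's id happens to equal its rank position. A minimal sketch of a more general version, assuming each query comes with its own ranked list of predictions, might look like the following. The function name `mean_reciprocal_rank` and the `predicted_rankings` input are illustrative and not part of the original notebook.

```python
def mean_reciprocal_rank(actual_relevant, predicted_rankings):
    """MRR over all queries; a query with no relevant hit contributes 0."""
    total = 0.0
    for relevant, ranking in zip(actual_relevant, predicted_rankings):
        relevant_set = set(relevant)
        # find the rank (1-based) of the first relevant item
        for rank, item in enumerate(ranking, start=1):
            if item in relevant_set:
                total += 1.0 / rank
                break
    return total / len(actual_relevant)


actual_relevant = [[2, 4, 5, 7], [1, 4, 5, 7], [5, 8]]
# assumption: the same ranking is returned for every query, as in the cell above
predicted_rankings = [[1, 2, 3, 4, 5, 6, 7, 8]] * 3

mrr = mean_reciprocal_rank(actual_relevant, predicted_rankings)
print("MRR =", round(mrr, 2))
```

With these inputs the result agrees with the value printed above; with real retrieval output, each query would supply its own ranked list.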


In [7]:
# mean average precision

# initialize variables
actual = [
    [2, 4, 5, 7],
    [1, 4, 5, 7],
    [5, 8]
]
Q = len(actual)
predicted = [1, 2, 3, 4, 5, 6, 7, 8]
k = 8
ap = []

# loop through and calculate AP for each query q
for q in range(Q):
    ap_num = 0
    # loop through k values
    for x in range(k):
        # calculate precision@k
        act_set = set(actual[q])
        pred_set = set(predicted[:x+1])
        precision_at_k = len(act_set & pred_set) / (x+1)
        # calculate rel_k values
        if predicted[x] in actual[q]:
            rel_k = 1
        else:
            rel_k = 0
        # calculate numerator value for ap
        ap_num += precision_at_k * rel_k
    # now we calculate the AP value as the average of AP
    # numerator values
    ap_q = ap_num / len(actual[q])
    print(f"AP@{k}_{q+1} = {round(ap_q,2)}")
    ap.append(ap_q)

# now we take the mean of all ap values to get mAP
map_at_k = sum(ap) / Q

# generate results
print(f"mAP@{k} = {round(map_at_k, 2)}")

AP@8_1 = 0.54
AP@8_2 = 0.67
AP@8_3 = 0.23
mAP@8 = 0.48


In [8]:
# normalized discounted cumulative gain (NDCG)

from math import log2

# initialize variables
relevance = [0, 7, 2, 4, 6, 1, 4, 3]
K = 8

dcg = 0
# loop through each item and calculate DCG
for k in range(1, K+1):
    rel_k = relevance[k-1]
    # calculate DCG
    dcg += rel_k / log2(1 + k)

# sort items in 'relevance' from most relevant to less relevant
ideal_relevance = sorted(relevance, reverse=True)

print(ideal_relevance)

idcg = 0
# as before, loop through each item and calculate *Ideal* DCG
for k in range(1, K+1):
    rel_k = ideal_relevance[k-1]
    # calculate DCG
    idcg += rel_k / log2(1 + k)


dcg = 0
idcg = 0

for k in range(1, K+1):
    # calculate rel_k values
    rel_k = relevance[k-1]
    ideal_rel_k = ideal_relevance[k-1]
    # calculate dcg and idcg
    dcg += rel_k / log2(1 + k)
    idcg += ideal_rel_k / log2(1 + k)
    # calculate ndcg (after the last iteration this is NDCG@K)
    ndcg = dcg / idcg



[7, 6, 4, 4, 3, 2, 1, 0]
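
The cell above computes `ndcg` but never prints it. As a minimal sketch, the same calculation can be wrapped in two small functions so that NDCG@k can be reported for any cutoff. The helper names `dcg_at_k` and `ndcg_at_k` are illustrative, not from the original notebook.

```python
from math import log2


def dcg_at_k(relevance, k):
    """Discounted cumulative gain over the first k positions (ranks are 1-based)."""
    return sum(rel / log2(1 + rank)
               for rank, rel in enumerate(relevance[:k], start=1))


def ndcg_at_k(relevance, k):
    """DCG@k divided by the DCG@k of the ideal (descending) ordering."""
    ideal = sorted(relevance, reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(relevance, k) / idcg if idcg > 0 else 0.0


relevance = [0, 7, 2, 4, 6, 1, 4, 3]
for k in (1, 4, 8):
    print(f"NDCG@{k} = {round(ndcg_at_k(relevance, k), 2)}")
```

An ideally ordered list scores 1.0 at every cutoff, which is a quick sanity check for an implementation like this.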


## Data Preparation

We start by downloading a dataset that we will encode and store. The dataset [`jamescalam/ai-arxiv-chunked`](https://huggingface.co/datasets/jamescalam/ai-arxiv-chunked) contains scraped data from many popular ArXiv papers centred around LLMs, including the Llama 2 and GPTQ papers and the GPT-4 technical paper.

In [17]:
from datasets import load_dataset

data = load_dataset("jamescalam/ai-arxiv-chunked", split="train")
data

Dataset({
    features: ['doi', 'chunk-id', 'chunk', 'id', 'title', 'summary', 'source', 'authors', 'categories', 'comment', 'journal_ref', 'primary_category', 'published', 'updated', 'references'],
    num_rows: 41584
})

In [18]:
data[0]

{'doi': '1910.01108',
 'chunk-id': '0',
 'chunk': 'DistilBERT, a distilled version of BERT: smaller,\nfaster, cheaper and lighter\nVictor SANH, Lysandre DEBUT, Julien CHAUMOND, Thomas WOLF\nHugging Face\n{victor,lysandre,julien,thomas}@huggingface.co\nAbstract\nAs Transfer Learning from large-scale pre-trained models becomes more prevalent\nin Natural Language Processing (NLP), operating these large models in on-theedge and/or under constrained computational training or inference budgets remains\nchallenging. In this work, we propose a method to pre-train a smaller generalpurpose language representation model, called DistilBERT, which can then be ﬁnetuned with good performances on a wide range of tasks like its larger counterparts.\nWhile most prior work investigated the use of distillation for building task-speciﬁc\nmodels, we leverage knowledge distillation during the pre-training phase and show\nthat it is possible to reduce the size of a BERT model by 40%, while retaining 97%\nof i

Reformat the dataset to be more Pinecone-friendly for the later embed and index steps: each record gets a unique `id`, a `text` field, and a `metadata` dictionary.

In [19]:
data = data.map(lambda x: {
    "id": f'{x["id"]}-{x["chunk-id"]}',
    "text": x["chunk"],
    "metadata": {
        "title": x["title"],
        "url": x["source"],
        "primary_category": x["primary_category"],
        "published": x["published"],
        "updated": x["updated"],
        "text": x["chunk"],
    }
})
# drop unneeded columns
data = data.remove_columns([
    "title", "summary", "source",
    "authors", "categories", "comment",
    "journal_ref", "primary_category",
    "published", "updated", "references",
    "doi", "chunk-id",
    "chunk"
])
data

Map:   0%|          | 0/41584 [00:00<?, ? examples/s]

Dataset({
    features: ['id', 'text', 'metadata'],
    num_rows: 41584
})

## Embed and index

We need an embedding model to create the embedding vectors for retrieval. This notebook originally used OpenAI's `text-embedding-ada-002`, which has some cost associated with it (costs for running this notebook are <$1); those lines are kept below, commented out. The active code uses `multilingual-e5-large` through Pinecone's inference API.

In [20]:
import os
import openai
import getpass  # platform.openai.com

# # get API key from top-right dropdown on OpenAI website
# openai.api_key = OPENAI_API_KEY

# embed_model = "text-embedding-ada-002"
# embed_model = "text-embedding-3-large"

In [21]:
from pinecone.grpc import PineconeGRPC

embed_model = "multilingual-e5-large"

# configure client
pc = PineconeGRPC(api_key=PINECONE_API_KEY)


In [22]:
# from pinecone import ServerlessSpec

# spec = ServerlessSpec(
#     cloud="aws", region="us-west-2"
# )

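When creating the index, set `dimension` equal to the dimensionality of the embedding model: `multilingual-e5-large` (`1024`), Ada-002 (`1536`), or `text-embedding-3-large` (`3072`). Use a compatible `metric` (for Ada-002 this can be either `cosine` or `dotproduct`; the code below uses `cosine`).

The next cell deletes the existing index; run it only when you want to reindex from scratch.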

In [34]:
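index_name = "rerankers"
# delete the index only if it already exists
if index_name in [index_info["name"] for index_info in pc.list_indexes()]:
    pc.delete_index(index_name)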

In [36]:
from pinecone import ServerlessSpec

spec = ServerlessSpec(
    cloud="aws", region="us-west-2"
)

In [37]:
import time

index_name = "rerankers"
existing_indexes = [
    index_info["name"] for index_info in pc.list_indexes()
]

# check if index already exists (it shouldn't if this is first time)
if index_name not in existing_indexes:
    # if does not exist, create index
    pc.create_index(
        index_name,
        dimension=1024,  # dimensionality of e5-large
        metric='cosine',
        spec=spec
    )
    # wait for index to be initialized
    while not pc.describe_index(index_name).status['ready']:
        time.sleep(1)

# connect to index
index = pc.Index(index_name)
time.sleep(1)
# view index stats
index.describe_index_stats()

{'dimension': 1024,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 0}},
 'total_vector_count': 0}

Define an embedding function to handle embedding with our model. Within the function, we also include the handling of rate limit errors.

In [38]:
from pinecone_plugins.inference.core.client.exceptions import PineconeApiException

def embed(batch: list[str]) -> list[list[float]]:
    # create embeddings (exponential backoff to avoid rate limit errors)
    passed = False
    for j in range(5):  # max 5 attempts
        try:
            res = pc.inference.embed(
                model=embed_model,
                inputs=batch,
                parameters={
                    "input_type": "passage",  # for docs/context/chunks
                    "truncate": "END",  # truncate to max length
                }
            )
            passed = True
            break  # success, so stop retrying
        except PineconeApiException:
            time.sleep(2**j)  # wait 2^j seconds before retrying
            print("Retrying...")
    if not passed:
        raise RuntimeError("Failed to create embeddings.")
    # get embeddings (one vector per input text)
    embeds = [x["values"] for x in res.data]
    return embeds

In [39]:
# test

aa = embed(["hello world"])
len(aa[0])

1024
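
The retry-with-backoff idea used in `embed` can be isolated into a small helper. The following is a minimal sketch under stated assumptions: `retry_with_backoff` and `flaky` are hypothetical names, the simulated `ConnectionError` stands in for a real rate limit error, and the `sleep` parameter exists only so the example can run without actually waiting.

```python
import time


def retry_with_backoff(fn, max_attempts=5, base_delay=1.0,
                       exceptions=(Exception,), sleep=time.sleep):
    """Call fn(); on a listed exception wait base_delay * 2**attempt and retry."""
    for attempt in range(max_attempts):
        try:
            return fn()  # success returns immediately, so no further calls are made
        except exceptions:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# hypothetical flaky call: fails twice, then succeeds
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated rate limit")
    return "ok"

delays = []  # record the waits instead of sleeping
result = retry_with_backoff(flaky, exceptions=(ConnectionError,), sleep=delays.append)
print(result, delays)
```

Returning from inside the `try` block matters: without an early exit, a successful call would be repeated on every remaining loop iteration.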

We can see the index is currently empty with a `total_vector_count` of `0`. We can begin populating it with `multilingual-e5-large` embeddings like so:

**⚠️ WARNING: Embedding costs for the full dataset as of 3 Jan 2024 were ~$5.70**

Run the cell below only when you need to (re-)index.

In [40]:
from tqdm.auto import tqdm

batch_size = 96  # how many embeddings we create and insert at once

for i in tqdm(range(0, len(data), batch_size)):
    passed = False
    # find end of batch
    i_end = min(len(data), i+batch_size)
    # create batch
    batch = data[i:i_end]
    embeds = embed(batch["text"])
    to_upsert = list(zip(batch["id"], embeds, batch["metadata"]))
    # upsert to Pinecone
    index.upsert(vectors=to_upsert)

  0%|          | 0/434 [00:00<?, ?it/s]

Retrying...
... (59 more "Retrying..." lines) ...


### Retrieval _without_ reranking model.

Define `get_docs` to return documents using only the first stage of retrieval (vector search).

In [42]:
def get_docs(query: str, top_k: int) -> list[dict]:
    # encode query
    res = pc.inference.embed(
        model=embed_model,
        inputs=[query],
        parameters={
            "input_type": "query",  # for queries
            "truncate": "END",  # truncate to max length
        }
    )
    xq = res.data[0]["values"]
    # search pinecone index
    res = index.query(vector=xq, top_k=top_k, include_metadata=True)
    # get doc text
    docs = [{
        "id": str(i),
        "text": x["metadata"]['text']
    } for i, x in enumerate(res["matches"])]
    return docs

Query about Reinforcement Learning from Human Feedback (RLHF):

In [48]:
query = "can you explain why we would want to do rlhf?"
docs = get_docs(query, top_k=25)
docs[:3]

[{'id': '0',
  'text': 'We examine the inﬂuence of the amount of RLHF training for two reasons. First, RLHF [13, 57] is an\nincreasingly popular technique for reducing harmful behaviors in large language models [3, 21, 52]. Some of\nthese models are already deployed [52], so we believe the impact of RLHF deserves further scrutiny. Second,\nprevious work shows that the amount of RLHF training can signiﬁcantly change metrics on a wide range of\npersonality, political preference, and harm evaluations for a given model size [41]. As a result, it is important\nto control for the amount of RLHF training in the analysis of our experiments.\n3.2 Experiments\n3.2.1 Overview\nWe test the effect of natural language instructions on two related but distinct moral phenomena: stereotyping\nand discrimination. Stereotyping involves the use of generalizations about groups in ways that are often\nharmful or undesirable.4To measure stereotyping, we use two well-known stereotyping benchmarks, BBQ\n[40] (§

You should see relevant chunks of data


### Retrieval _with_ reranking model.

Use Pinecone's rerank endpoint for this. 

In [49]:
rerank_name = "bge-reranker-v2-m3"

rerank_docs = pc.inference.rerank(
    model=rerank_name,
    query=query,
    documents=docs,
    top_n=25,
    return_documents=True
)

This returns a `RerankResult` object:

In [50]:
rerank_docs

RerankResult(
  model='bge-reranker-v2-m3',
  data=[
    { index=1, score=0.9071478,
      document={id="1", text="RLHF Response ! I..."} },
    { index=6, score=0.6962682,
      document={id="6", text="team, instead of ..."} },
    ... (21 more documents) ...,
    { index=17, score=0.13432105,
      document={id="17", text="helpfulness and h..."} },
    { index=23, score=0.1161611,
      document={id="23", text="responses respons..."} }
  ],
  usage={'rerank_units': 1}
)

Access the text content of the docs via `rerank_docs.data[0]["document"]["text"]`.

In [51]:
def compare(query: str, top_k: int, top_n: int):
    # first get vec search results
    top_k_docs = get_docs(query, top_k=top_k)
    # rerank
    top_n_docs = pc.inference.rerank(
        model=rerank_name,
        query=query,
        documents=top_k_docs,  # rerank the documents retrieved for this query
        top_n=top_n,
        return_documents=True
    )
    original_docs = []
    reranked_docs = []
    # compare order change
    print("[ORIGINAL] -> [NEW]")
    for i, doc in enumerate(top_n_docs.data):
        print(str(doc.index)+"\t->\t"+str(i))
        if i != doc.index:
            reranked_docs.append(f"[{doc.index}]\n"+doc["document"]["text"])
            original_docs.append(f"[{i}]\n"+top_k_docs[i]['text'])
        else:
            reranked_docs.append(doc["document"]["text"])
            original_docs.append(None)
    # print results
    for orig, rerank in zip(original_docs, reranked_docs):
        if not orig:
            print(f"SAME:\n{rerank}\n\n---\n")
        else:
            print(f"ORIGINAL:\n{orig}\n\nRERANKED:\n{rerank}\n\n---\n")

Start with the RLHF query. Follow the standard retrieve-then-rerank process: retrieve 25 documents (`top_k=25`), then rerank them and keep the top three (`top_n=3`).

In [52]:
compare(query, 25, 3)

[ORIGINAL] -> [NEW]
1	->	0
6	->	1
14	->	2
ORIGINAL:
[0]
We examine the inﬂuence of the amount of RLHF training for two reasons. First, RLHF [13, 57] is an
increasingly popular technique for reducing harmful behaviors in large language models [3, 21, 52]. Some of
these models are already deployed [52], so we believe the impact of RLHF deserves further scrutiny. Second,
previous work shows that the amount of RLHF training can signiﬁcantly change metrics on a wide range of
personality, political preference, and harm evaluations for a given model size [41]. As a result, it is important
to control for the amount of RLHF training in the analysis of our experiments.
3.2 Experiments
3.2.1 Overview
We test the effect of natural language instructions on two related but distinct moral phenomena: stereotyping
and discrimination. Stereotyping involves the use of generalizations about groups in ways that are often
harmful or undesirable.4To measure stereotyping, we use two well-known stereotyping be

After reranking, we have far more relevant information. Naturally, this can result in significantly better performance for RAG. It means we maximize relevant information while minimizing noise input into our LLM.

Reranking is one of the simplest methods for dramatically improving recall performance in Retrieval Augmented Generation (RAG) or any other retrieval-based pipeline.
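
The two-stage pattern can be sketched without any external service. In the toy example below, both scoring functions are deliberately simple stand-ins chosen for illustration: token overlap plays the role of vector search, and length-normalised overlap plays the role of the reranking model. The corpus, the query, and the names `first_stage` and `rerank` are all hypothetical.

```python
def tokenize(text):
    return text.lower().split()


def first_stage(query, corpus, top_k):
    """Cheap retrieval over the whole corpus: rank by number of shared tokens."""
    q = set(tokenize(query))
    scored = [(len(q & set(tokenize(doc))), i) for i, doc in enumerate(corpus)]
    scored.sort(key=lambda s: (-s[0], s[1]))  # best overlap first, ties by index
    return [i for _, i in scored[:top_k]]


def rerank(query, corpus, candidates, top_n):
    """Costlier scoring on the candidates only: overlap normalised by doc length."""
    q = set(tokenize(query))

    def score(i):
        tokens = tokenize(corpus[i])
        return len(q & set(tokens)) / len(tokens)

    return sorted(candidates, key=lambda i: (-score(i), i))[:top_n]


corpus = [
    "rlhf is mentioned here among many other unrelated topics such as cooking and travel",
    "why use rlhf",
    "a note on gradient descent",
    "rlhf aligns models",
]
query = "why rlhf"

candidates = first_stage(query, corpus, top_k=3)   # wide, cheap pass
final = rerank(query, corpus, candidates, top_n=2)  # narrow, careful pass
print("first stage:", candidates)
print("after rerank:", final)
```

The first stage keeps a long, mostly off-topic document in second place because it shares a token with the query; the second stage, which only has to score three candidates, moves the short on-topic document ahead of it.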

We've explored why rerankers can perform so much better than their embedding model counterparts, and how a two-stage retrieval system gives us the best of both: search at scale with high-quality results.