# RAG Advanced Technique: Reranking with Cohere AI

Reranking is a technique for improving retrieval quality in a RAG pipeline: a first-stage retriever returns a broad set of candidates, and a second model (the reranker) rescores them against the query so that the most relevant ones come first. It works by following these steps:

1. Generate a list of candidate documents using a first-stage retrieval model.
2. Score each candidate against the query using a reranker model.
3. Reorder the candidates based on the scores.
4. Return the top candidates.
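
The steps above can be sketched in plain Python. This is a minimal illustration, not the Cohere implementation: `rerank` and `word_overlap` are hypothetical helpers, and the word-overlap scorer is only a toy stand-in for a real reranker model.

```python
from typing import Callable


def rerank(query: str, candidates: list[str],
           score_fn: Callable[[str, str], float],
           top_n: int = 3) -> list[tuple[str, float]]:
    """Score every candidate against the query and return the top_n best."""
    scored = [(doc, score_fn(query, doc)) for doc in candidates]  # step 2: score
    scored.sort(key=lambda pair: pair[1], reverse=True)           # step 3: reorder
    return scored[:top_n]                                         # step 4: keep the top


def word_overlap(query: str, doc: str) -> float:
    """Toy scorer (assumption for illustration): share of query words found in doc."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words) if q_words else 0.0


# Step 1 stand-in: candidates as they might come back from a first-stage retriever
candidates = [
    "reward models predict a scalar score",
    "rlhf aligns a model with human preferences",
    "transformers use attention layers",
]
top = rerank("why use rlhf with human preferences", candidates, word_overlap, top_n=2)
```

In the rest of this notebook, step 1 is a Pinecone vector search and step 2 is Cohere's rerank endpoint.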


In [54]:
!pip install -qU \
    datasets \
    pinecone-client \
    cohere==4.34


## Loading the Data


In [5]:
from datasets import load_dataset

In [7]:
data = load_dataset("jamescalam/ai-arxiv-chunked", split="train")
data

Dataset({
    features: ['doi', 'chunk-id', 'chunk', 'id', 'title', 'summary', 'source', 'authors', 'categories', 'comment', 'journal_ref', 'primary_category', 'published', 'updated', 'references'],
    num_rows: 41584
})

In [8]:
data[0]

{'doi': '1910.01108',
 'chunk-id': '0',
 'chunk': 'DistilBERT, a distilled version of BERT: smaller,\nfaster, cheaper and lighter\nVictor SANH, Lysandre DEBUT, Julien CHAUMOND, Thomas WOLF\nHugging Face\n{victor,lysandre,julien,thomas}@huggingface.co\nAbstract\nAs Transfer Learning from large-scale pre-trained models becomes more prevalent\nin Natural Language Processing (NLP), operating these large models in on-theedge and/or under constrained computational training or inference budgets remains\nchallenging. In this work, we propose a method to pre-train a smaller generalpurpose language representation model, called DistilBERT, which can then be ﬁnetuned with good performances on a wide range of tasks like its larger counterparts.\nWhile most prior work investigated the use of distillation for building task-speciﬁc\nmodels, we leverage knowledge distillation during the pre-training phase and show\nthat it is possible to reduce the size of a BERT model by 40%, while retaining 97%\nof i

## Format the data


In [10]:
data = data.map(lambda x: {
	"id": f'{x["id"]}-{x["chunk-id"]}',
	"text": x["chunk"],
	"metadata": {
		"title": x["title"],
		"url": x["source"],
		"primary_category": x["primary_category"],
		"published": x["published"],
		"updated": x["updated"],
		"text": x["chunk"]
	}
})
# Drop unnecessary columns (remove_columns returns a new dataset, so reassign it)
data = data.remove_columns([
	"title", "summary", "source", "authors", "categories", "comment", "journal_ref", "primary_category", "published", "updated", "references", "doi", "chunk-id", "chunk"
])
data

Dataset({
    features: ['id', 'text', 'metadata'],
    num_rows: 41584
})

## Define the Embedding Function and DB Connection


### Define the Embedding Function


In [11]:
import os
import cohere
import getpass

In [12]:
co_api_key = os.getenv("COHERE_API_KEY") or getpass.getpass("Enter your Cohere API key: ")
co = cohere.Client(api_key=co_api_key)

In [59]:
def embed(docs: list[str], input_type: str = "search_document") -> list[list[float]]:
    doc_embeds = co.embed(
        texts=docs,
        input_type=input_type,
        model="embed-english-v3.0"
    )
    return doc_embeds.embeddings


### Define the DB Connection


In [16]:
from pinecone import Pinecone, ServerlessSpec

In [17]:
# Initialize Pinecone connection with your API key
pc_api_key = os.getenv("PINECONE_API_KEY") or getpass.getpass("Enter your Pinecone API key: ")

In [18]:
pc = Pinecone(api_key=pc_api_key)

Creating an index:


In [19]:
import time

In [140]:
index_name = "arxiv-rerankers"

In [139]:
pc.list_indexes()
# pc.delete_index(index_name)

{'indexes': []}

In [141]:
# Check if the index exists
if index_name not in pc.list_indexes().names():
    # If it does not exist, create a new index
    pc.create_index(
        name=index_name,
        dimension=1024, # Replace with your model dimensions
        metric="cosine", # Replace with your model metric
        spec=ServerlessSpec(
            cloud="aws",
            region="us-east-1"
        )
    )
    # Wait until the index reports that it is ready to accept requests
    while not pc.describe_index(index_name).status["ready"]:
        time.sleep(1)

In [142]:
# Connect to index
index = pc.Index(index_name)
time.sleep(1)
# View index stats
index.describe_index_stats()

{'dimension': 1024,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}
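
The index is configured with `metric="cosine"`, so matches are ranked by the cosine similarity between the query vector and each stored vector. As a rough sketch of what that score measures (the helper `cosine_similarity` is hypothetical and uses toy 2-dimensional vectors instead of the 1024-dimensional embeddings; it says nothing about how Pinecone computes it internally):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
```

Vectors pointing the same way score close to 1 regardless of their length, unrelated (orthogonal) vectors score close to 0, and opposite vectors score close to -1.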

Populating the index using Cohere's `embed-english-v3.0` model:


In [35]:
from tqdm.auto import tqdm

In [143]:
BATCH_SIZE = 100 # How many embeddings we create and insert at once

In [None]:
for i in tqdm(range(0, len(data), BATCH_SIZE)):
    # find end of batch
    i_end = min(len(data), i+BATCH_SIZE)
    # create batch
    batch = data[i:i_end]
    # embed the chunk texts (default input_type="search_document")
    embeds = embed(batch["text"])
    # pair each id with its vector and metadata
    to_upsert = list(zip(batch["id"], embeds, batch["metadata"]))
    # upsert to Pinecone
    index.upsert(vectors=to_upsert)
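
The slicing logic in the loop above can be checked in isolation, without calling Cohere or Pinecone. A small sketch, where `batch_ranges` is a hypothetical helper that is not part of the notebook:

```python
from typing import Iterator


def batch_ranges(total: int, batch_size: int) -> Iterator[tuple[int, int]]:
    """Yield (start, end) slice bounds that cover range(total) in batches."""
    for start in range(0, total, batch_size):
        # The final batch is clipped so it never runs past the end of the data
        yield start, min(total, start + batch_size)


ranges = list(batch_ranges(250, 100))
num_batches_full_dataset = len(list(batch_ranges(41584, 100)))
```

With 41,584 rows and `BATCH_SIZE = 100`, the loop makes 416 embed-and-upsert round trips, and the last batch holds the remaining 84 rows.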

### Test retrieval without Cohere's reranker


In [124]:
def get_docs(query: str, top_k: int = 5) -> dict:
	# Embed query
	query_embed = embed([query], input_type="search_query")[0]
	# Search Pinecone
	results = index.query(vector=query_embed, top_k=top_k, include_metadata=True)
	# Get the docs
	docs = {x["metadata"]["text"]: i for i, x in enumerate(results["matches"])}
	return docs
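
`get_docs` returns a dictionary that maps each chunk's text to its position in the vector search results. The sketch below rebuilds that mapping from a hypothetical stand-in for `results["matches"]` (real matches also carry ids and scores). It also shows a caveat: because the text is the key, two matches with identical text collapse into a single entry.

```python
# Hypothetical stand-in for results["matches"], reduced to the field get_docs reads
matches = [
    {"metadata": {"text": "chunk about reward models"}},
    {"metadata": {"text": "chunk about PPO"}},
    {"metadata": {"text": "chunk about reward models"}},  # duplicate text
]

# Same comprehension as in get_docs: text -> position in the search results
docs = {m["metadata"]["text"]: i for i, m in enumerate(matches)}

ordered_texts = list(docs.keys())
```

Here the duplicate overwrites the earlier position, so the dictionary holds two entries and the repeated chunk is recorded at position 2 rather than 0.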

In [125]:
query = "Can you explain why we would want to do rlhf?"
docs = get_docs(query, top_k=25)
list(docs.keys())[:5]

['preferences and values which are diﬃcult to capture by hard- coded reward functions.\nRLHF works by using a pre-trained LM to generate text, which i s then evaluated by humans by, for example,\nranking two model generations for the same prompt. This data is then collected to learn a reward model\nthat predicts a scalar reward given any generated text. The r eward captures human preferences when\njudging model output. Finally, the LM is optimized against s uch reward model using RL policy gradient\nalgorithms like PPO ( Schulman et al. ,2017). RLHF can be applied directly on top of a general-purpose LM\npre-trained via self-supervised learning. However, for mo re complex tasks, the model’s generations may not\nbe good enough. In such cases, RLHF is typically applied afte r an initial supervised ﬁne-tuning phase using\na small number of expert demonstrations for the correspondi ng downstream task ( Ramamurthy et al. ,2022;\nOuyang et al. ,2022;Stiennon et al. ,2020).\nA successful exam

## Reranking Responses

We can easily retrieve the passages we need if we return _many_ results, but passing all of them to an LLM does not work well. LLM recall **decreases as we add more to the context window**; we call this excessive filling of the context window _"context stuffing"_.

Fortunately, reranking offers a solution: we retrieve a large candidate set, let the reranker find relevant records that may not be within the top-3 results, and pull them into a smaller set of results to be given to the LLM.

We will use Cohere's rerank endpoint for this purpose.


In [120]:
rerank_docs = co.rerank(
	query=query,
	documents=list(docs.keys()),
	top_n=25,
	model="rerank-english-v3.0",
	return_documents=True
)

In [121]:
rerank_docs.results[:5]

[RerankResponseResultsItem(document=RerankResponseResultsItemDocument(text='preferences in order to make it more useful. One key component of RLHF is reward modeling,\nwhere the problem is formulated as a regression task to predict a scalar reward given a prompt and\na response (Askell et al., 2021; Ouyang et al., 2022). This approach typically requires large-scale\ncomparison data, where two model responses on the same prompt are compared Ouyang et al.\n(2022). Existing open-source works such as Alpaca, Vicuna, and Dolly (Databricks, 2023) do not\ninvolve RLHF due to the high cost of labeling comparison data. Meanwhile, recent studies show that\nGPT-4 is capable of identifying and ﬁxing its own mistakes, and accurately judging the quality of\nresponses(Peng et al., 2023; Bai et al., 2022; Madaan et al., 2023; Kim et al., 2023). Therefore, to\nfacilitate research on RLHF, we have created comparison data using GPT-4, as described in Section 2.\nFigure 2: The distribution of comparison\n
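
Besides the document text, each rerank result carries an `index` (the document's position in the list sent to the reranker) and a `relevance_score`. The sketch below uses a hypothetical stand-in class, `StandInResult`, with made-up scores so it runs without an API key. It only illustrates how those two fields can be read.

```python
from dataclasses import dataclass


@dataclass
class StandInResult:
    """Hypothetical stand-in mirroring two fields of a rerank result."""
    index: int              # position of the document in the list sent to the reranker
    relevance_score: float  # higher means more relevant to the query


# Illustrative values only; real scores come from the rerank endpoint
results = [
    StandInResult(index=3, relevance_score=0.98),
    StandInResult(index=0, relevance_score=0.91),
    StandInResult(index=22, relevance_score=0.87),
]

# Original vector search positions, in reranked order
original_positions = [r.index for r in results]

# Results are expected to arrive sorted from most to least relevant
is_sorted = all(a.relevance_score >= b.relevance_score
                for a, b in zip(results, results[1:]))
```

Reading `index` is an alternative to looking up positions by document text, and it keeps working when two chunks share the same text.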

### The impact of reranking on different queries


In [127]:
docs = get_docs(query, top_k=5)
rerank_docs = co.rerank(
	query=query,
	documents=list(docs.keys()),
	top_n=5,
	model="rerank-english-v3.0",
	return_documents=True
)
docs[rerank_docs.results[0].document.text]

3

In [135]:
def compare(query: str, top_k: int, top_n: int):
	"""Print how the reranker reorders the top_k vector search results."""
	# First get vec search results (maps text -> original rank)
	docs = get_docs(query, top_k)
	# Inverse mapping: original rank -> text
	i2doc = {docs[doc]: doc for doc in list(docs.keys())}
	# Then get rerank results
	rerank_docs = co.rerank(
		query=query,
		documents=list(docs.keys()),
		top_n=top_n,
		model="rerank-english-v3.0",
		return_documents=True
	)
	original_docs = []
	reranked_docs = []
	# Compare order change: new rank -> original rank
	for new_rank, doc in enumerate(rerank_docs.results):
		orig_rank = docs[doc.document.text]
		print(str(new_rank)+"\t->\t"+str(orig_rank))
		if new_rank != orig_rank:
			# Document the reranker placed at this position
			reranked_docs.append(f"[{orig_rank}]\n"+doc.document.text)
			# Document vector search had at this position
			original_docs.append(f"[{new_rank}]\n"+i2doc[new_rank])
	for orig, rerank in zip(original_docs, reranked_docs):
		print("ORIGINAL:\n"+orig+"\n\nRERANKED:\n"+rerank+"\n\n---\n")
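
The position bookkeeping inside `compare` can be tried on toy data without any API calls. A minimal sketch, where `rank_changes` is a hypothetical helper and the document strings are placeholders:

```python
def rank_changes(original: list[str], reranked: list[str]) -> list[tuple[int, int]]:
    """Return (new_position, original_position) for each reranked document."""
    position = {doc: i for i, doc in enumerate(original)}
    return [(new_i, position[doc]) for new_i, doc in enumerate(reranked)]


original = ["doc A", "doc B", "doc C", "doc D"]  # vector search order
reranked = ["doc D", "doc A", "doc C"]           # reranker's top 3

changes = rank_changes(original, reranked)
# Keep only the positions where the reranker disagreed with vector search
moved = [(new, old) for new, old in changes if new != old]
```

Each pair reads the same way as the `0 -> 3` lines printed by `compare`: the document now at position 0 was originally at position 3.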


In [136]:
compare(query, 25, 3)

0	->	3
1	->	0
2	->	22
ORIGINAL:
[0]
preferences and values which are diﬃcult to capture by hard- coded reward functions.
RLHF works by using a pre-trained LM to generate text, which i s then evaluated by humans by, for example,
ranking two model generations for the same prompt. This data is then collected to learn a reward model
that predicts a scalar reward given any generated text. The r eward captures human preferences when
judging model output. Finally, the LM is optimized against s uch reward model using RL policy gradient
algorithms like PPO ( Schulman et al. ,2017). RLHF can be applied directly on top of a general-purpose LM
pre-trained via self-supervised learning. However, for mo re complex tasks, the model’s generations may not
be good enough. In such cases, RLHF is typically applied afte r an initial supervised ﬁne-tuning phase using
a small number of expert demonstrations for the correspondi ng downstream task ( Ramamurthy et al. ,2022;
Ouyang et al. ,2022;Stiennon et al. ,

In [137]:
compare("what is red teaming?", top_k=25, top_n=3)

0	->	2
1	->	0
2	->	7
ORIGINAL:
[0]
including limitations and risks that might be exploited by m alicious actors. Further, existing
red teaming approaches are insufﬁcient for addressing thes e concerns in the AI context.
In order for AI developers to make veriﬁable claims about the ir AI systems being safe or secure, they need
processes for surfacing and addressing potential safety an d security risks. Practices such as red teaming
exercises help organizations to discover their own limitat ions and vulnerabilities as well as those of the
AI systems they develop, and to approach them holistically , in a way that takes into account the larger
environment in which they are operating.23
A red team exercise is a structured effort to ﬁnd ﬂaws and vuln erabilities in a plan, organization, or
technical system, often performed by dedicated "red teams" that seek to adopt an attacker’s mindset
and methods. In domains such as computer security , red teams are routinely tasked with emulating
attacke

In [None]:
pc.delete_index(index_name) # Clean up