<a href="https://colab.research.google.com/github/mertcan-basut/nlp/blob/main/retrieve_and_rerank.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Background Information

### Context recall

**recall** *(retrieval evaluation metric)*: the fraction of all relevant documents in the dataset that appear among the top K retrieved results.

`recall@K = (# of relevant docs in the top K results) / (# of relevant docs in the dataset)`
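
The formula can be checked on a toy example. The sketch below is illustrative only: the function name `recall_at_k` and the document IDs are assumptions, not part of this notebook's dataset.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must not be empty")
    top_k = retrieved_ids[:k]
    hits = len(relevant.intersection(top_k))
    return hits / len(relevant)


# Hypothetical ranking returned by a retriever, best match first
retrieved = ["d3", "d7", "d1", "d9", "d4"]
# Hypothetical ground truth: three relevant documents exist in the dataset
relevant = {"d1", "d4", "d8"}

recall_3 = recall_at_k(retrieved, relevant, k=3)  # only d1 is in the top 3
recall_5 = recall_at_k(retrieved, relevant, k=5)  # d1 and d4 are in the top 5
print(recall_3, recall_5)
```

Raising K can only keep recall the same or increase it, which is why the first retrieval stage usually returns more documents than the LLM will finally see.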

### LLM recall

![LLM recall](https://www.pinecone.io/_next/image/?url=https%3A%2F%2Fcdn.sanity.io%2Fimages%2Fvr8gru94%2Fproduction%2Fca206b6ada9163bffad313e0e18feee0b460c768-1212x688.png&w=1920&q=75)

**LLM recall** refers to the ability of an LLM to find information from the text placed within its context window.

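Information placed in the middle of a long context window is recalled less reliably than information at the start or end. In some reported cases, the LLM answers worse with the information buried mid-context than it would have without being given that information at all.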

### Two-stage retrieval

A **reranking model (cross-encoder)** takes a query and document pair and outputs a similarity score. Rerankers are generally much more accurate than embedding models (bi-encoders), but they are also much slower, because every query-document pair needs its own transformer pass at query time. Two-stage retrieval works around this cost: a fast first stage retrieves a small set of candidates from the large corpus, and the reranker then reorders only those candidates.

![reranker/cross-encoder](https://www.pinecone.io/_next/image/?url=https%3A%2F%2Fcdn.sanity.io%2Fimages%2Fvr8gru94%2Fproduction%2F9f0d2f75571bb58eecf2520a23d300a5fc5b1e2c-2440x1100.png&w=3840&q=75)

A reranker feeds the raw query and document text directly into one large transformer computation, so less information is lost. Because rerankers run at user query time, they can analyze the document's meaning with respect to the specific query.

![embedding model/bi-encoder](https://www.pinecone.io/_next/image/?url=https%3A%2F%2Fcdn.sanity.io%2Fimages%2Fvr8gru94%2Fproduction%2F4509817116ab72e27bae809c38cb48fbf1578b5d-2760x1420.png&w=3840&q=75)

Bi-encoders must compress all of the possible meanings of a document into a single vector, which results in information loss. Additionally, bi-encoders have no context on the query, because the document embeddings are created before user query time.
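
The two stages described above can be sketched with toy stand-ins. Everything below is an assumption made for illustration: `embed` is a bag-of-words counter rather than a real bi-encoder, `pair_score` is a simple term-overlap score rather than a real cross-encoder, and the corpus is made up. Only the control flow (cheap search over everything, expensive scoring on a few candidates) mirrors the technique.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a bi-encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def pair_score(query, doc):
    """Toy stand-in for a cross-encoder: looks at query and document together.
    Here it is just the fraction of query terms that occur in the document."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0


def two_stage_search(query, corpus, doc_vectors, first_k=3, final_k=2):
    q_vec = embed(query)
    # Stage 1: cheap vector comparison against every precomputed document vector
    candidates = sorted(
        range(len(corpus)),
        key=lambda i: cosine(q_vec, doc_vectors[i]),
        reverse=True,
    )[:first_k]
    # Stage 2: the expensive pairwise scorer runs on the candidates only
    reranked = sorted(candidates, key=lambda i: pair_score(query, corpus[i]), reverse=True)
    return [corpus[i] for i in reranked[:final_k]]


corpus = [
    "solar panels convert sunlight into electricity",
    "battery storage helps balance solar energy supply",
    "the printing press spread renaissance ideas",
    "wind energy is a renewable energy source",
]
# Document vectors are computed once, before any query arrives
doc_vectors = [embed(doc) for doc in corpus]

results = two_stage_search("solar energy storage", corpus, doc_vectors)
print(results)
```

With a real model, `first_k` trades recall against reranking cost: a larger candidate set gives the reranker more chances to surface a relevant document but requires more pairwise inference calls.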

### Sources
🌐 https://www.pinecone.io/learn/series/rag/rerankers/

## Implementation

In [1]:
!pip install -q langchain langchain-openai langchain-community
!pip install -q chromadb
!pip install -q python-dotenv

In [2]:
!echo "AZURE_OPENAI_API_KEY=editme" > .env
!echo "AZURE_OPENAI_ENDPOINT=editme" >> .env
!echo "OPENAI_API_VERSION=editme" >> .env

In [3]:
from openai import AzureOpenAI

import tiktoken

from langchain_openai.embeddings import AzureOpenAIEmbeddings
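# Chroma and Document now live in langchain_community / langchain_core;
# the old langchain.* import paths are deprecated
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document as LangChainDocument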

import pandas as pd

from math import exp
import json
from tenacity import retry, wait_random_exponential, stop_after_attempt

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv(), override=True) # read local .env file

from google.colab import drive
drive.mount("/content/drive", force_remount=True)

Mounted at /content/drive


### Prepare data and vector store

In [76]:
with open("/content/drive/MyDrive/data/corpus_dataset.json", 'r') as f:
  data = json.load(f)

In [77]:
docs = [
  LangChainDocument(
    page_content=element['text'],
    metadata={
      'topic': element['topic']
    }
  ) for element in data
]

In [None]:
vectordb = Chroma.from_documents(
  documents=docs,
  embedding=AzureOpenAIEmbeddings(model="text-embedding-ada-002"),
  persist_directory="/content/drive/MyDrive/data/chroma/"
)
vectordb._collection.count()

28

In [78]:
query = "Hello!"

### Similarity metrics

In [None]:
vectordb = Chroma(persist_directory="/content/drive/MyDrive/data/chroma/")
embeddings = vectordb.get(include=["embeddings"])['embeddings']

In [None]:
# cosine
# l2
# inner product
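
As a minimal sketch of the three metrics named in the cell above, the pure-Python functions below operate on plain lists of floats. The function names and the tiny example vectors are assumptions for illustration; in the notebook the inputs would be rows of `embeddings`.

```python
import math


def inner_product(a, b):
    """Sum of element-wise products; larger means more similar."""
    return sum(x * y for x, y in zip(a, b))


def l2_distance(a, b):
    """Euclidean distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Inner product of the length-normalized vectors; ignores magnitude."""
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )


a = [1.0, 0.0]
b = [0.0, 1.0]  # orthogonal to a
c = [2.0, 0.0]  # same direction as a, twice the length

print(cosine_similarity(a, c), l2_distance(a, c), inner_product(a, c))
print(cosine_similarity(a, b), l2_distance(a, b), inner_product(a, b))
```

For unit-length vectors the three metrics produce the same ranking, since the squared L2 distance equals `2 - 2 * cosine_similarity`. They can disagree once vector lengths differ, as with `a` and `c` here.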

### Lexical search

#### BM25
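
A minimal sketch of Okapi BM25 scoring in pure Python, assuming whitespace tokenization, a non-empty corpus, and the commonly used defaults `k1=1.5` and `b=0.75`. The function name `bm25_scores` and the example corpus are illustrative assumptions; a library implementation would normally be used in practice.

```python
import math
from collections import Counter


def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Return one BM25 score per document in corpus for the given query."""
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs
    df = Counter(term for doc in tokenized for term in set(doc))

    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher weight (the +1 keeps the idf positive)
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term frequency saturates with k1 and is normalized by document length via b
            tf_norm = tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            )
            score += idf * tf_norm
        scores.append(score)
    return scores


example_corpus = [
    "solar and wind are renewable energy sources",
    "ai chatbots answer customer questions",
    "smart grids distribute energy efficiently",
]
scores = bm25_scores("renewable energy", example_corpus)
print(scores)
```

Unlike embedding search, BM25 only rewards exact term matches, so the second document scores zero here even though a semantic model might consider it loosely related to other queries.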

### Semantic search

#### LangChain similarity search

In [79]:
vectordb = Chroma(persist_directory="/content/drive/MyDrive/data/chroma/", embedding_function=AzureOpenAIEmbeddings(model="text-embedding-ada-002"))

In [80]:
documents = vectordb.similarity_search_with_score(query=query, k=10) # the score is a distance: lower means more similar
documents

[(Document(page_content='AI-powered chatbots are transforming customer service by providing instant responses to queries. These chatbots use natural language processing to understand and respond to customer needs, improving user experience.', metadata={'topic': 'Artificial Intelligence'}),
  0.513806858444452),
 (Document(page_content='AI-powered diagnostic tools are helping doctors make more accurate diagnoses. By analyzing medical images and patient data, these tools can identify patterns and anomalies that may be missed by human doctors.', metadata={'topic': 'Artificial Intelligence'}),
  0.5512987281407215),
 (Document(page_content='The invention of the printing press by Johannes Gutenberg in the mid-15th century revolutionized the dissemination of knowledge. This innovation made books more accessible, fueling the spread of Renaissance ideas across Europe.', metadata={'topic': 'Renaissance'}),
  0.5521388674271172),
 (Document(page_content='The use of AI in autonomous vehicles is s

#### HuggingFace Bi-Encoder

### Reranking

#### HuggingFace Cross-Encoder

#### OpenAI Completions as Cross-Encoder

🌐 https://cookbook.openai.com/examples/search_reranking_with_cross-encoders

In [71]:
client = AzureOpenAI()
llm_model_name = "gpt-35-turbo"

tokenizer = tiktoken.encoding_for_model(llm_model_name)
yes_token, no_token = [tokenizer.encode(token)[0] for token in ["Yes", "No"]]
print("Token ID for 'Yes': ", yes_token)
print("Token ID for 'No': ", no_token)

sys_prompt = '''
You are an Assistant responsible for helping detect whether the retrieved document is relevant to the query. For a given input, you need to output a single token: "Yes" or "No" indicating the retrieved document is relevant to the query.

Query: How to plant a tree?
Document: """Cars were invented in 1886, when German inventor Carl Benz patented his Benz Patent-Motorwagen.[3][4][5] Cars became widely available during the 20th century. One of the first cars affordable by the masses was the 1908 Model T, an American car manufactured by the Ford Motor Company. Cars were rapidly adopted in the US, where they replaced horse-drawn carriages.[6] In Europe and other parts of the world, demand for automobiles did not increase until after World War II.[7] The car is considered an essential part of the developed economy."""
Relevant: No

Query: Has the coronavirus vaccine been approved?
Document: """The Pfizer-BioNTech COVID-19 vaccine was approved for emergency use in the United States on December 11, 2020."""
Relevant: Yes

Query: What is the capital of France?
Document: """Paris, France's capital, is a major European city and a global center for art, fashion, gastronomy and culture. Its 19th-century cityscape is crisscrossed by wide boulevards and the River Seine. Beyond such landmarks as the Eiffel Tower and the 12th-century, Gothic Notre-Dame cathedral, the city is known for its cafe culture and designer boutiques along the Rue du Faubourg Saint-Honoré."""
Relevant: Yes

Query: What are some papers to learn about PPO reinforcement learning?
Document: """Proximal Policy Optimization and its Dynamic Version for Sequence Generation: In sequence generation task, many works use policy gradient for model optimization to tackle the intractable backpropagation issue when maximizing the non-differentiable evaluation metrics or fooling the discriminator in adversarial learning. In this paper, we replace policy gradient with proximal policy optimization (PPO), which is a proved more efficient reinforcement learning algorithm, and propose a dynamic approach for PPO (PPO-dynamic). We demonstrate the efficacy of PPO and PPO-dynamic on conditional sequence generation tasks including synthetic experiment and chit-chat chatbot. The results show that PPO and PPO-dynamic can beat policy gradient by stability and performance."""
Relevant: Yes

Query: Explain sentence embeddings
Document: """Inside the bubble: exploring the environments of reionisation-era Lyman-α emitting galaxies with JADES and FRESCO: We present a study of the environments of 16 Lyman-α emitting galaxies (LAEs) in the reionisation era (5.8<z<8) identified by JWST/NIRSpec as part of the JWST Advanced Deep Extragalactic Survey (JADES). Unless situated in sufficiently (re)ionised regions, Lyman-α emission from these galaxies would be strongly absorbed by neutral gas in the intergalactic medium (IGM). We conservatively estimate sizes of the ionised regions required to reconcile the relatively low Lyman-α velocity offsets (ΔvLyα<300kms−1) with moderately high Lyman-α escape fractions (fesc,Lyα>5%) observed in our sample of LAEs, indicating the presence of ionised ``bubbles'' with physical sizes of the order of 0.1pMpc≲Rion≲1pMpc in a patchy reionisation scenario where the bubbles are embedded in a fully neutral IGM. Around half of the LAEs in our sample are found to coincide with large-scale galaxy overdensities seen in FRESCO at z∼5.8-5.9 and z∼7.3, suggesting Lyman-α transmission is strongly enhanced in such overdense regions, and underlining the importance of LAEs as tracers of the first large-scale ionised bubbles. Considering only spectroscopically confirmed galaxies, we find our sample of UV-faint LAEs (MUV≳−20mag) and their direct neighbours are generally not able to produce the required ionised regions based on the Lyman-α transmission properties, suggesting lower-luminosity sources likely play an important role in carving out these bubbles. These observations demonstrate the combined power of JWST multi-object and slitless spectroscopy in acquiring a unique view of the early stages of Cosmic Reionisation via the most distant LAEs."""
Relevant: No
'''

usr_prompt = '''
Query: {query}
Document: """{document}"""
Relevant:
'''

@retry(wait=wait_random_exponential(min=1, max=40), stop=stop_after_attempt(3))
def document_relevance(query, document):
  response = client.chat.completions.create(
    model="gpt-35-16k",  # on Azure, this is the name of your model deployment
    messages=[
      {'role': 'system', 'content': sys_prompt},
      {'role': 'user', 'content': usr_prompt.format(query=query, document=document)}
    ],
    temperature=0.0,
    logprobs=True,
    # logit_bias={yes_token: 1, no_token:1},
    max_tokens=1
  )

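  # strip stray whitespace so " Yes" and "Yes" are treated the same
  prediction = response.choices[0].message.content.strip()
  # logprob of the single generated token, converted back to a probability
  probability = exp(response.choices[0].logprobs.content[0].logprob)
  if prediction == "Yes":
    yes_probability = probability
  elif prediction == "No":
    # approximation: assumes the remaining probability mass belongs to "Yes"
    yes_probability = 1 - probability
  else:
    raise ValueError(f"Prediction: '{prediction}' is not a valid prediction. Valid predictions are 'Yes' and 'No'.")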

  return (
    query,
    document,
    prediction,
    yes_probability
  )

Token ID for 'Yes':  9642
Token ID for 'No':  2822
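
The conversion from a token log-probability to a relevance score can be tried in isolation, without calling the API. The helper name `yes_probability_from_logprob` and the example log-probabilities below are assumptions for illustration. Like the cell above, the sketch treats `1 - P("No")` as an approximation of `P("Yes")`.

```python
from math import exp


def yes_probability_from_logprob(prediction, logprob):
    """Map the generated token and its logprob to an approximate P("Yes")."""
    probability = exp(logprob)  # logprob is a natural log, so exp() inverts it
    if prediction == "Yes":
        return probability
    if prediction == "No":
        # Assumes the probability mass not spent on "No" belongs to "Yes"
        return 1.0 - probability
    raise ValueError(f"Unexpected prediction: {prediction!r}")


# Hypothetical model outputs: (generated token, logprob of that token)
confident_no = yes_probability_from_logprob("No", -0.01)
hesitant_no = yes_probability_from_logprob("No", -0.5)
confident_yes = yes_probability_from_logprob("Yes", -0.01)
print(confident_no, hesitant_no, confident_yes)
```

This is why documents that all receive the prediction "No" can still be ranked: a hesitant "No" yields a higher `yes_probability` than a confident one.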


In [96]:
output_list = []
for document, score in documents:
  try:
    output_list.append(document_relevance(query, document.page_content) + (document.metadata['topic'],))
  except Exception as e:
    print(e)

output_df = pd.DataFrame(
  output_list, columns=["query", "document", "prediction", "yes_probability", "topic"]
).reset_index()

reranked_df = output_df.sort_values(by=["yes_probability"], ascending=False)
reranked_df

| | index | query | document | prediction | yes_probability | topic |
|---|---|---|---|---|---|---|
| 9 | 9 | Hello! | AI in education is personalizing learning expe... | No | 0.373138 | Artificial Intelligence |
| 1 | 1 | Hello! | AI-powered diagnostic tools are helping doctor... | No | 0.271779 | Artificial Intelligence |
| 0 | 0 | Hello! | AI-powered chatbots are transforming customer ... | No | 0.186432 | Artificial Intelligence |
| 3 | 3 | Hello! | The use of AI in autonomous vehicles is set to... | No | 0.147678 | Artificial Intelligence |
| 4 | 4 | Hello! | Artificial Intelligence (AI) is transforming i... | No | 0.087923 | Artificial Intelligence |
| 6 | 6 | Hello! | Renewable energy technologies such as solar, w... | No | 0.087114 | Renewable Energy |
| 7 | 7 | Hello! | Smart grids are transforming the way we distri... | No | 0.067281 | Renewable Energy |
| 8 | 8 | Hello! | In the retail sector, AI is being used to enha... | No | 0.037379 | Artificial Intelligence |
| 2 | 2 | Hello! | The invention of the printing press by Johanne... | No | 0.027972 | Renaissance |
| 5 | 5 | Hello! | During the Renaissance, scientific inquiry flo... | No | 0.024704 | Renaissance |


### Two-stage retrieval

In [None]:
# langchain compression