<img src="https://www.rp.edu.sg/images/default-source/default-album/rp-logo.png" width="200" alt="Republic Polytechnic"/>

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/koayst-rplesson/C3669C-2025-01/blob/main/L18/L18.ipynb)

# Setup and Installation

You can run this Jupyter notebook either on your local machine or on Google Colab.

* For a local machine, it is recommended to install Anaconda and create a new development environment called `c3669c`.
* Install the libraries listed below with pip or conda when necessary.
---

# <font color='red'>ATTENTION</font>

## Google Colab
- If you are running this code in Google Colab, **DO NOT** store the API Key in a text file and load the key later from Google Drive. This is insecure and will expose the key.
- **DO NOT** hard code the API Key directly in the Python code, even though it might seem convenient for quick development.
- Enter the API key when the Python call `getpass.getpass()` prompts you for it.
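
The prompt-at-runtime pattern can be sketched as follows. This is a minimal illustration, assuming a hypothetical helper named `ensure_secret` that is not part of this notebook: it reuses a value that is already in the environment and only prompts with `getpass.getpass()` when the value is missing, so the key never appears in the notebook source.

```python
import getpass
import os


def ensure_secret(name, prompt=""):
    """Return os.environ[name], prompting for it only when it is missing.

    Hypothetical helper for illustration; not part of the lesson code.
    """
    value = os.environ.get(name)
    if not value:
        # getpass hides the typed characters, so the key is not echoed
        # into the cell output
        value = getpass.getpass(prompt or f"{name}: ")
        os.environ[name] = value
    return value
```

A call such as `ensure_secret("OPENAI_API_KEY", "OpenAI API key: ")` would then prompt once and reuse the stored value on later runs within the same session.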

## Local Environment/Laptop
- If you are running this code locally on your laptop, you can create an `env.txt` file and store the API key there.
- Make sure `env.txt` is in the same directory as this Jupyter notebook.
- You need to install `python-dotenv` and run the Python code below to load the API key.

---
```
%pip install python-dotenv

import os

from dotenv import load_dotenv

# read the key=value pairs in env.txt into the process environment
load_dotenv('env.txt')
openai_api_key = os.getenv('OPENAI_API_KEY')
```
---

## GitHub/GitLab
- **DO NOT** `commit` or `push` API Key to services like GitHub or GitLab.

# Lesson 18

## Techniques to improve RAG Performance
- `Better Embeddings`: Use high-quality embedding models (e.g., OpenAI’s text-embedding-ada-002, Cohere, or BGE) to improve the semantic relevance of retrieved documents.
- `Hybrid Search`: Combine dense (vector) search with sparse methods like BM25 to balance precision and recall.
- `Query Expansion`: Reformulate queries using LLMs or keyword expansion to improve retrieval accuracy.
- `Re-ranking`: Apply a re-ranker (e.g., Cohere Rerank, Cross-Encoders) to reorder retrieved documents based on relevance.

The list is non-exhaustive and new techniques are being discovered. In this notebook, we will explore `re-ranking`. 
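
To make one of the listed ideas concrete, here is a minimal sketch of hybrid search, assuming the simplest possible fusion rule: rescale the dense and sparse scores to a common range and take a weighted sum. The function names, the `alpha` weight and the scores are made up for illustration; production systems may use other fusion rules.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_rank(dense, sparse, alpha=0.5):
    """Blend dense and sparse scores; alpha is the weight of the dense side."""
    dense_n, sparse_n = min_max(dense), min_max(sparse)
    doc_ids = set(dense_n) | set(sparse_n)
    combined = {
        doc: alpha * dense_n.get(doc, 0.0) + (1 - alpha) * sparse_n.get(doc, 0.0)
        for doc in doc_ids
    }
    # highest combined score first
    return sorted(combined, key=combined.get, reverse=True)


# Made-up scores for three chunks (illustrative only)
dense_scores = {"chunk_a": 0.82, "chunk_b": 0.79, "chunk_c": 0.40}
sparse_scores = {"chunk_a": 1.2, "chunk_b": 7.5, "chunk_c": 3.1}

print(hybrid_rank(dense_scores, sparse_scores, alpha=0.5))
```

Setting `alpha=1.0` falls back to pure dense ranking, which is a quick way to see how much the sparse scores change the order.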

---

Reference: [LangChain Cohere Reranker](https://python.langchain.com/docs/integrations/retrievers/cohere-reranker/)

## Cohere ReRanker
This notebook uses Cohere's rerank endpoint in a retriever. You need an API key from Cohere.

- Sign up at [Cohere Dashboard](https://dashboard.cohere.com/welcome/login) to obtain an API key.
- Sign up at [Hugging Face](https://huggingface.co/) to obtain an access token.

In [None]:
%%capture --no-stderr

%pip install --quiet -U pypdf
%pip install --quiet -U faiss-cpu

%pip install --quiet -U langchain-community
%pip install --quiet -U langchain-huggingface
%pip install --quiet -U langchain-openai
%pip install --quiet -U langchain-cohere

## Run either option 1 or option 2 to set the API keys and Hugging Face token.

In [None]:
# Option 1
# Run this code if you did NOT set up secrets in Google Colab

import getpass
import os

# setup the OpenAI API Key
# setup the Cohere API Key
# setup the Hugging Face Token

# get the API keys ready and enter them when asked
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API key: ")
os.environ["COHERE_API_KEY"] = getpass.getpass("Cohere API key: ")
os.environ["HF_TOKEN"] = getpass.getpass("Hugging Face token: ")

In [None]:
# Option 2
# Run this code if you have set up secrets in Google Colab

import os
from google.colab import userdata

os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")
os.environ["COHERE_API_KEY"] = userdata.get("COHERE_API_KEY")
os.environ["HF_TOKEN"] = userdata.get("HF_TOKEN")

---

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import PyPDFDirectoryLoader

from langchain_huggingface import HuggingFaceEmbeddings

from langchain.chains import RetrievalQA

from langchain_openai import ChatOpenAI

from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere.rerank import CohereRerank

In [None]:
# helper function for printing docs
def pretty_print_docs(docs):
    print(
        f"\n{'-' * 100}\n".join( [f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)] )
    )

In [None]:
# download the zip file from GitHub repository
# Introduction-to-Management-Studies.pdf is in the zip file

!wget https://github.com/koayst-rplesson/SDGAI_LLMforGenAIApp_Labs/raw/refs/heads/main/L15/Introduction-to-Management-Studies.zip
!unzip Introduction-to-Management-Studies.zip
!ls -al

# make sure you can see "Introduction-to-Management-Studies.pdf" in the directory listing 

In [None]:
# load the PDF file from the current directory
pdf_folder_path = "."

loader = PyPDFDirectoryLoader(pdf_folder_path)
docs = loader.load()

print(len(docs))

print(docs[1].page_content)

In [None]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
texts = text_splitter.split_documents(docs)

print(len(texts))
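
The effect of `chunk_size` and `chunk_overlap` can be sketched with a plain sliding window. This is a simplification for illustration only: `RecursiveCharacterTextSplitter` splits on separators such as paragraph and line breaks first, so its chunks will not line up exactly with the fixed windows below. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size=500, chunk_overlap=100):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each window starts this far after the last
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# 1200 characters of dummy text
sample = "".join(str(i % 10) for i in range(1200))
demo_chunks = sliding_window_chunks(sample)

print([len(c) for c in demo_chunks])
# the tail of one chunk is repeated at the head of the next
print(demo_chunks[0][-100:] == demo_chunks[1][:100])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.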

In [None]:
model_name = "BAAI/bge-small-en-v1.5"
encode_kwargs = {'normalize_embeddings': True} # set True to compute cosine similarity

embeddings = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs={'device': 'cpu'},
    encode_kwargs=encode_kwargs
)

## Set up the base vector store (FAISS) retriever

Initialise a simple vector store retriever and store the document in chunks. We set up the retriever to return a high number (20) of chunks, which gives a later reranking step plenty of candidates to choose from.
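
What "retrieve the top k chunks" means can be sketched in pure Python. This is a toy illustration with made-up three-dimensional vectors, not how FAISS is implemented: each vector is scaled to unit length (the role of `normalize_embeddings`), after which the dot product equals the cosine similarity, and the `k` best-scoring chunks are kept.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k(query_vec, doc_vecs, k):
    """Return the k (score, doc_id) pairs with the highest cosine similarity."""
    q = normalize(query_vec)
    scored = (
        (sum(a * b for a, b in zip(q, normalize(v))), doc_id)
        for doc_id, v in doc_vecs.items()
    )
    return heapq.nlargest(k, scored)


# Made-up embeddings; real ones have hundreds of dimensions
toy_docs = {
    "ethics": [0.9, 0.1, 0.0],
    "planning": [0.1, 0.9, 0.1],
    "staffing": [0.0, 0.2, 0.9],
}
toy_query = [1.0, 0.2, 0.0]

for score, doc_id in top_k(toy_query, toy_docs, k=2):
    print(f"{doc_id}: {score:.3f}")
```

For unit-length vectors, ranking by Euclidean distance and ranking by cosine similarity give the same order, which is why normalising the embeddings makes the two interchangeable.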

In [None]:
vectorstore = FAISS.from_documents(texts, embeddings)

# retrieve the top 20 chunks
retriever = vectorstore.as_retriever(search_kwargs={"k": 20})

In [None]:
query = "According to Kelly and Williams what is ethics?"
# invoke() replaces the deprecated get_relevant_documents()
docs = retriever.invoke(query)

pretty_print_docs(docs)

In [None]:
model_name = "gpt-3.5-turbo-16k"

llm = ChatOpenAI(model_name=model_name, temperature=0.1)

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever
)

In [None]:
%%time

response = qa.invoke(query)

In [None]:
print(f"Query = {response['query']}")
print(f"Result = {response['result']}")

In [None]:
no_rerank_query = response['query']
no_rerank_response = response['result']

---

## Do Reranking with CohereRerank

Wrap the base retriever with a `ContextualCompressionRetriever` and add a `CohereRerank` compressor, which uses the Cohere rerank endpoint to rerank the returned results. Note that the model name must be specified in `CohereRerank`.
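
The wrapping idea can be sketched with toy classes. Everything below is hypothetical and only mirrors the shape of the pattern (fetch broadly, then rescore and trim); it is not LangChain's implementation, and the word-overlap score merely stands in for a real reranking model.

```python
class KeywordRetriever:
    """First stage: return up to k chunks that share any word with the query."""

    def __init__(self, chunks, k):
        self.chunks = chunks
        self.k = k

    def invoke(self, query):
        words = set(query.lower().split())
        hits = [c for c in self.chunks if words & set(c.lower().split())]
        return hits[: self.k]


class OverlapReranker:
    """Second stage: keep the top_n chunks that contain the most query words."""

    def __init__(self, top_n):
        self.top_n = top_n

    def compress(self, chunks, query):
        words = set(query.lower().split())

        def score(chunk):
            return len(words & set(chunk.lower().split()))

        return sorted(chunks, key=score, reverse=True)[: self.top_n]


class RerankingRetriever:
    """Wrapper: fetch candidates with the base retriever, then rerank and trim."""

    def __init__(self, base_retriever, compressor):
        self.base_retriever = base_retriever
        self.compressor = compressor

    def invoke(self, query):
        candidates = self.base_retriever.invoke(query)
        return self.compressor.compress(candidates, query)


toy_chunks = [
    "management involves planning and organising",
    "ethics is the set of moral principles",
    "ethics is what Kelly and Williams define as moral standards",
]
toy_retriever = RerankingRetriever(
    KeywordRetriever(toy_chunks, k=20), OverlapReranker(top_n=2)
)
toy_result = toy_retriever.invoke("what is ethics according to kelly and williams")
for chunk in toy_result:
    print(chunk)
```

The caller only sees one object with an `invoke` method, which is why the wrapped retriever can be dropped into a chain in place of the base retriever.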

In [None]:
compressor = CohereRerank(model="rerank-english-v3.0")

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)
# fetch the candidates with the base retriever, then rerank them with Cohere
compressed_docs = compression_retriever.invoke(query)

pretty_print_docs(compressed_docs)

In [None]:
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=compression_retriever 
)

In [None]:
%%time

response = qa.invoke(query)

In [None]:
print(f"Query = {response['query']}")
print(f"Result = {response['result']}")

In [None]:
rerank_query = response['query']
rerank_response = response['result']

## Observation

Do you think the reranked response is better?
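
Besides reading the two answers, it can help to see how the reranker moved individual chunks. The sketch below uses a hypothetical helper, `rank_shift`, and stand-in strings; it reports each surviving chunk's new position next to its original position.

```python
def rank_shift(before, after):
    """For each item in `after`, return (new_rank, old_rank, item); ranks start at 1.

    old_rank is None when the item was not in `before`.
    """
    old_rank = {}
    for i, item in enumerate(before, start=1):
        old_rank.setdefault(item, i)  # keep the first position of duplicates
    return [
        (new, old_rank.get(item), item)
        for new, item in enumerate(after, start=1)
    ]


# Stand-in chunk texts (illustrative only)
before_demo = ["chunk A", "chunk B", "chunk C", "chunk D"]
after_demo = ["chunk C", "chunk A"]

for new, old, item in rank_shift(before_demo, after_demo):
    print(f"rank {old} -> {new}: {item}")
```

In this notebook the inputs could be `[d.page_content for d in docs]` and `[d.page_content for d in compressed_docs]`, assuming both variables still hold the results of the two retrieval cells above.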

In [None]:
import pandas as pd

df = pd.DataFrame(
    {
        'Query': [no_rerank_query, rerank_query], 
        'Response': [no_rerank_response, rerank_response]
    },
    index=["Not Reranked", "Reranked"]
)

df

---

# Another Example Using Infinity Reranker

# Infinity Reranker

Source: [LangChain Infinity Reranker](https://python.langchain.com/docs/integrations/document_transformers/infinity_rerank/)

`Infinity` is a high-throughput, low-latency REST API for serving text-embedding, reranking and CLIP models.
For more information, see the [Infinity README](https://github.com/michaelfeil/infinity?tab=readme-ov-file#reranking).

This notebook shows how to use Infinity Reranker for document compression and retrieval.

You can launch an Infinity server with a reranker model from the command line (CLI):

```bash
pip install "infinity-emb[all]"
infinity_emb v2 --model-id mixedbread-ai/mxbai-rerank-xsmall-v1
```

In [None]:
%%capture --no-stderr

%pip install --quiet -U "infinity-emb[all]"
%pip install --quiet -U infinity_client
%pip install --quiet -U faiss-cpu
%pip install --quiet -U langchain-community
%pip install --quiet -U langchain-huggingface

In [None]:
%pip install colab-xterm
%load_ext colabxterm

## xterm 

The `%xterm` cell below opens an xterm terminal inside the notebook.  

Copy and paste the command string `infinity_emb v2 --model-id mixedbread-ai/mxbai-rerank-xsmall-v1` into the terminal.

<img src="ScreenShot_02.png" width="auto" height="auto">     

Wait for the server to finish starting up; this might take a while. Look out for `Application startup complete`, then take note of the HTTP URL and port number. In the screenshot shown, it is `http://0.0.0.0:7997`.

<img src="ScreenShot_01.png" width="auto" height="auto">                                                                       

In [None]:
%xterm

In [None]:
# Helper function for printing docs
def pretty_print_docs(docs):
    print(
        f"\n{'-' * 100}\n".join(
            [f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]
        )
    )

## Set up the base vector store retriever
Let's start by initializing a simple vector store retriever and storing the 2023 State of the Union speech (in chunks). We can set up the retriever to retrieve a high number (20) of docs.

In [None]:
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores.faiss import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [None]:
!wget https://github.com/koayst-rplesson/SDGAI_LLMforGenAIApp_Labs/raw/refs/heads/main/L15/state_of_the_union.txt
!ls -al

In [None]:
documents = TextLoader("./state_of_the_union.txt").load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
texts = text_splitter.split_documents(documents)

retriever = FAISS.from_documents(
    texts, HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
).as_retriever(search_kwargs={"k": 20})

In [None]:
query = "What did the president say about Ketanji Brown Jackson"
docs = retriever.invoke(query)
pretty_print_docs(docs)

## Reranking with InfinityRerank
Now let's wrap our base retriever with a `ContextualCompressionRetriever`. We'll use the `InfinityRerank` to rerank the returned results.

In [None]:
from infinity_client import Client
from langchain.retrievers import ContextualCompressionRetriever
from langchain_community.document_compressors.infinity_rerank import InfinityRerank

In [None]:
# change base_url to the HTTP URL noted earlier
base_url = "http://0.0.0.0:7997"

client = Client(base_url=base_url)
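
Before building the compressor, it can save time to confirm that something is listening at the noted URL. The sketch below uses only the standard library; the helper names `host_and_port` and `server_is_listening` are hypothetical. It checks that a TCP connection can be opened, not that the process behind the port is actually Infinity.

```python
import socket
from urllib.parse import urlparse


def host_and_port(base_url):
    """Extract (host, port) from a URL such as http://0.0.0.0:7997."""
    parsed = urlparse(base_url)
    # fall back to the scheme's default port when none is given
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    return parsed.hostname, port


def server_is_listening(base_url, timeout=2.0):
    """Return True if a TCP connection to the URL's host and port succeeds."""
    host, port = host_and_port(base_url)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


print(host_and_port("http://0.0.0.0:7997"))
```

Calling `server_is_listening(base_url)` after `Application startup complete` appears should return `True`; if it returns `False`, the server may still be loading the model or may be bound to a different port.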

In [None]:
# model "mixedbread-ai/mxbai-rerank-xsmall-v1" is downloaded from Hugging Face repository
compressor = InfinityRerank(client=client, model="mixedbread-ai/mxbai-rerank-xsmall-v1")

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever
)

In [None]:
compressed_docs = compression_retriever.invoke(
    "What did the president say about Ketanji Brown Jackson"
)

pretty_print_docs(compressed_docs)