<a href="https://colab.research.google.com/github/iamankurraj/RAG/blob/main/Hybrid_Search_in_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hybrid Search in RAG
This notebook shows a minimal working example that combines:
- Dense retrieval (FAISS + HuggingFace Embeddings)
- Sparse retrieval (BM25)
- Hybrid ensemble of both (LangChain `EnsembleRetriever`)


### 1. Install required dependencies

In [1]:
!pip install -q langchain-community

In [2]:
!pip install -q faiss-cpu

In [3]:
!pip install -q -U langchain-huggingface



In [4]:
!pip install -q langchain sentence-transformers pypdf

In [5]:
!pip install -q rank_bm25

In [6]:
!pip install -q -U bitsandbytes accelerate

In [8]:
!pip install -q torch transformers

In [9]:
!pip install -q huggingface_hub

### 2. Setup: Import Dependencies

In [10]:
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
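# BM25Retriever lives in langchain_community; EnsembleRetriever stays in langchain.retrievers.
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever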

In [23]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
# The langchain_community import of HuggingFacePipeline is deprecated; use langchain_huggingface.
from langchain_huggingface import HuggingFacePipeline
from langchain.chains import RetrievalQA

In [12]:
import os
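from huggingface_hub import login

# google/gemma-2b-it is a gated model, so a Hugging Face access token is required.
# If HF_API_KEY is not set, login() falls back to an interactive prompt.
hf_token = os.getenv("HF_API_KEY")
login(token=hf_token)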


### 3. Load and split the PDF

In [13]:
loader = PyPDFLoader("sample_Text.pdf")
pages = loader.load()

In [29]:
print(pages)

[Document(metadata={'producer': 'Adobe Acrobat 8.1', 'creator': 'Adobe Acrobat 8.1 Combine Files', 'creationdate': '2024-05-06T18:40:23+05:30', 'moddate': '2024-05-06T18:40:23+05:30', 'title': 'Untitled - 05 May 2024 at 21.26.35', 'source': 'sample_Text.pdf', 'total_pages': 100, 'page': 0, 'page_label': '1'}, page_content="SOLAR SYSTEM \nEXPLORATION AND INDIA'S \nCONTRIBUTION\nA Beginner's Guide\nDr. Tirtha Pratim Das\nSCIENCE \nFOR PEOPLE"), Document(metadata={'producer': 'Adobe Acrobat 8.1', 'creator': 'Adobe Acrobat 8.1 Combine Files', 'creationdate': '2024-05-06T18:40:23+05:30', 'moddate': '2024-05-06T18:40:23+05:30', 'title': 'Untitled - 05 May 2024 at 21.26.35', 'source': 'sample_Text.pdf', 'total_pages': 100, 'page': 1, 'page_label': '2'}, page_content='[1]'), Document(metadata={'producer': 'Adobe Acrobat 8.1', 'creator': 'Adobe Acrobat 8.1 Combine Files', 'creationdate': '2024-05-06T18:40:23+05:30', 'moddate': '2024-05-06T18:40:23+05:30', 'title': 'Untitled - 05 May 2024 at 21.

In [14]:
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(pages)
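
To make the `chunk_size` and `chunk_overlap` settings concrete, here is a minimal sketch of fixed-size windowing with overlap. It is a simplification: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and newlines, which this sketch does not attempt. The function name `sliding_window_chunks` is hypothetical and exists only for illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters. Illustrative helper,
    not part of LangChain.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


demo = sliding_window_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=4)
print(demo)
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of some duplicated text in the index.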

In [30]:
print(chunks)

[Document(metadata={'producer': 'Adobe Acrobat 8.1', 'creator': 'Adobe Acrobat 8.1 Combine Files', 'creationdate': '2024-05-06T18:40:23+05:30', 'moddate': '2024-05-06T18:40:23+05:30', 'title': 'Untitled - 05 May 2024 at 21.26.35', 'source': 'sample_Text.pdf', 'total_pages': 100, 'page': 0, 'page_label': '1'}, page_content="SOLAR SYSTEM \nEXPLORATION AND INDIA'S \nCONTRIBUTION\nA Beginner's Guide\nDr. Tirtha Pratim Das\nSCIENCE \nFOR PEOPLE"), Document(metadata={'producer': 'Adobe Acrobat 8.1', 'creator': 'Adobe Acrobat 8.1 Combine Files', 'creationdate': '2024-05-06T18:40:23+05:30', 'moddate': '2024-05-06T18:40:23+05:30', 'title': 'Untitled - 05 May 2024 at 21.26.35', 'source': 'sample_Text.pdf', 'total_pages': 100, 'page': 1, 'page_label': '2'}, page_content='[1]'), Document(metadata={'producer': 'Adobe Acrobat 8.1', 'creator': 'Adobe Acrobat 8.1 Combine Files', 'creationdate': '2024-05-06T18:40:23+05:30', 'moddate': '2024-05-06T18:40:23+05:30', 'title': 'Untitled - 05 May 2024 at 21.

### 4. Embedding Model

In [16]:
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")



### 5. Dense Retriever

In [17]:
vectorstore = FAISS.from_documents(chunks, embedding_model)
vectorstore_retreiver = vectorstore.as_retriever(search_kwargs={"k": 3})
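
Conceptually, the dense retriever embeds the query and returns the `k` chunks whose vectors lie closest to it. The sketch below shows that idea with hand-written three-dimensional toy vectors and Euclidean (L2) distance, which, as far as I know, is the default metric for a LangChain FAISS store. The vectors and the helper `top_k_by_l2` are assumptions for illustration; real embeddings from `all-MiniLM-L6-v2` have 384 dimensions.

```python
import math


def top_k_by_l2(query_vec, doc_vecs, k=3):
    """Return (index, distance) pairs for the k vectors closest to query_vec."""
    scored = [(i, math.dist(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1])  # smaller distance = more similar
    return scored[:k]


# Toy "embeddings" standing in for real chunk vectors.
toy_doc_vecs = [
    (0.9, 0.1, 0.0),  # chunk 0
    (0.0, 1.0, 0.0),  # chunk 1
    (0.8, 0.2, 0.1),  # chunk 2
    (0.0, 0.0, 1.0),  # chunk 3
]
toy_query_vec = (1.0, 0.0, 0.0)

nearest = top_k_by_l2(toy_query_vec, toy_doc_vecs, k=2)
print([i for i, _ in nearest])
# [0, 2]
```

FAISS performs the same kind of search, but over an index structure built for large collections rather than a Python loop.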

### 6. Sparse Retriever

In [18]:
keyword_retriever = BM25Retriever.from_documents(chunks)
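
BM25 ranks chunks by exact term overlap with the query, weighting rare terms more heavily and normalizing for document length. The following is a rough sketch of a textbook Okapi BM25 scorer over whitespace tokens. It is not the exact formula used by `rank_bm25`, and the function name, the `k1` and `b` defaults, and the toy documents are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a textbook BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Length normalization: long documents are penalized.
            norm = tf + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


toy_docs = [
    "a planet must orbit the sun",
    "comets have bright tails",
    "pluto is a dwarf planet",
]
toy_scores = bm25_scores("dwarf planet", toy_docs)
best = max(range(len(toy_docs)), key=toy_scores.__getitem__)
print(best, [round(s, 3) for s in toy_scores])
```

The third document matches both query terms and should rank first, while the document about comets shares no terms with the query and scores zero. This is the complementary behavior to dense retrieval: BM25 rewards literal keyword matches that an embedding model may blur.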

### 7. Hybrid Retriever

In [19]:
ensemble_retriever = EnsembleRetriever(
    retrievers=[vectorstore_retreiver, keyword_retriever],
    weights=[0.5, 0.5]
)
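
The ensemble has to merge two ranked lists whose scores are not directly comparable. To my understanding, `EnsembleRetriever` does this with weighted Reciprocal Rank Fusion (RRF), which uses only each document's rank in each list. The sketch below assumes the commonly used constant `c = 60`; the helper `weighted_rrf` and the chunk ids are hypothetical.

```python
from collections import defaultdict


def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse ranked lists of document ids into one list, best first.

    Each document earns weight / (rank + c) from every list it appears in.
    """
    if len(ranked_lists) != len(weights):
        raise ValueError("need one weight per ranked list")

    fused = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += weight / (rank + c)
    return sorted(fused, key=fused.get, reverse=True)


dense_hits = ["chunk_a", "chunk_b", "chunk_c"]
sparse_hits = ["chunk_d", "chunk_a", "chunk_b", "chunk_e"]

fused_order = weighted_rrf([dense_hits, sparse_hits], weights=[0.5, 0.5])
print(fused_order)
# ['chunk_a', 'chunk_b', 'chunk_d', 'chunk_c', 'chunk_e']
```

Documents found by both retrievers rise to the top, and the fused list is the deduplicated union of both result sets, so it can be longer than either retriever's own `k`.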

### 8. Query

In [20]:
query = "what all are needed to qualify as planet?"
result = ensemble_retriever.invoke(query)

In [21]:
for i, doc in enumerate(result, 1):
    print(f"{i}. {doc.page_content[:300]}...\n")

1. What all are needed to qualify as a ‘Planet’ .................................. 19 
How to classify the Solar System Bodies ...................................... 21 
What are ‘Asteroids’? ......................................................................... 23 
What are ‘Comets’? .................

2. and four terrestrial planets. The giant planets are also classified 
into Gas giants (Jupiter and Saturn) and Ice giants (Uranus and 
Neptune).  The terrestrial planets are Mercury, Venus, Earth and 
Mars, counting radially outwards from the Sun. They all differ in 
their composition and several oth...

3. [19] 
 
sharing their orbit around the Sun, just like Pluto does, along with 
Charon.  
Pluto was discovered in 1930 by American astronomer Clyde 
Tombaugh. At the time of its discovery, Pluto was considered the 
ninth planet in the solar system. However, it was soon realized 
that Pluto was signifi...

4. [20] 
 
During the year 2006 annual meeting of the International 
Astrono

### 9. Chain

In [22]:
model_id = "google/gemma-2b-it"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Wrap with pipeline for LangChain
hf_pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=256)
llm = HuggingFacePipeline(pipeline=hf_pipe)

Device set to use cuda:0


In [24]:
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=ensemble_retriever,
    return_source_documents=True
)
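
With its default settings, `RetrievalQA.from_chain_type` uses the "stuff" strategy: the retrieved chunks are concatenated into a single prompt and sent to the LLM in one call. The sketch below approximates that prompt assembly. The opening instruction matches the text echoed in this notebook's output; the `Question:` / `Helpful Answer:` tail and the helper `build_stuff_prompt` are assumptions for illustration.

```python
PROMPT_TEMPLATE = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Helpful Answer:"
)


def build_stuff_prompt(question, passages):
    """Join all retrieved passages and place them in a single prompt."""
    context = "\n\n".join(passages)
    return PROMPT_TEMPLATE.format(context=context, question=question)


toy_prompt = build_stuff_prompt(
    "what all are needed to qualify as planet?",
    ["Passage one about planets.", "Passage two about Pluto."],
)
print(toy_prompt)
```

Because everything goes into one prompt, the number of retrieved chunks times the chunk size has to fit within the model's context window.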

In [27]:
query = "what all are needed to qualify as planet?"
response = qa_chain.invoke(query)

print("Answer:\n", response['result'])


Answer:
 Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

What all are needed to qualify as a ‘Planet’ .................................. 19 
How to classify the Solar System Bodies ...................................... 21 
What are ‘Asteroids’? ......................................................................... 23 
What are ‘Comets’? ............................................................................ 24 
What are ‘Meteoroids’, ‘Meteors’ and ‘Meteorites’ ? .................. 25 
The Interplanetary Dust ................................................................... 26 
Attributes of a Planetary Body ........................................................ 28 
Orbit ................................................................................................ 29 
Mass .....................................................................................

In [40]:
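# The result echoes the full prompt before the completion,
# so keep only the text after the last "Helpful Answer:" marker.
helpful_answer = response['result'].strip().split("Helpful Answer:")[-1].strip()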
print(f"Question: {query}")
print(f"Answer: {helpful_answer}")

Question: what all are needed to qualify as planet?
Answer: According to the IAU, a planet must meet the following three criteria: 1) The celestial body must be in orbit around the Sun, 2) it has sufficient mass for its self-gravity to achieve hydrostatic equilibrium, and 3) it must have cleared the neighbourhood around its orbit.
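
The post-processing step above can be wrapped in a small reusable helper. This is a sketch under the assumption that the generated text contains the prompt followed by a `Helpful Answer:` marker; the function name `extract_answer` and the sample string are hypothetical.

```python
def extract_answer(generated_text, marker="Helpful Answer:"):
    """Return the text after the last marker, or the whole text if it is absent."""
    # rpartition returns ('', '', text) when the marker is missing,
    # so the tail is the full text in that case.
    _, _, tail = generated_text.rpartition(marker)
    return tail.strip()


sample_output = (
    "Use the following pieces of context...\n\n"
    "Question: q?\n"
    "Helpful Answer: Three criteria."
)
print(extract_answer(sample_output))
# Three criteria.
```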
