

### Summary of Hybrid Search and BM25 in LangChain

#### **What is Hybrid Search?**
Hybrid search combines **keyword-based search** with **semantic search**:  
1. **Keyword Search:** Finds exact matches for words/phrases using algorithms like BM25.  
2. **Semantic Search:** Retrieves results based on contextual meaning using embeddings and vector search.  

The hybrid approach ensures the benefits of both:
- Precise keyword lookups.
- Deeper, context-aware semantic retrieval.

---

#### **Understanding BM25**
- **BM25 Overview:**
  - An old, reliable algorithm (developed in the 70s–80s).
  - Based on **TF-IDF (Term Frequency - Inverse Document Frequency)** principles.
  - Creates **sparse vectors** by counting word frequencies (or N-grams).  
  - In some scenarios it remains competitive with modern deep-learning embedding techniques.

- **Advantages of BM25:**
  - Lightning-fast computations.
  - Well-suited for large-scale search tasks like those in **Elasticsearch**.

- **Limitations:**  
  BM25 retrieves results by exact term matching. For example:
  - Query: `"Apple"` retrieves documents containing `"Apple"` but doesn't differentiate between `"Apple the fruit"` and `"Apple the company"`.
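
To make the sparse-vector idea and its limitation concrete, here is a minimal sketch in plain Python. It is not how any library implements BM25; the helper `sparse_vector`, the two sample sentences, and the lowercasing step are all assumptions made for illustration.

```python
from collections import Counter

def sparse_vector(text: str) -> Counter:
    """Map each lowercased, whitespace-separated token to its count."""
    return Counter(text.lower().split())

# Hypothetical sample sentences: one about the fruit, one about the company.
fruit_doc = "Apple is a sweet fruit"
company_doc = "Apple makes computers"

query_vec = sparse_vector("Apple")
for doc in (fruit_doc, company_doc):
    doc_vec = sparse_vector(doc)
    # Only terms shared by the query and the document can contribute to a keyword score.
    shared = {term: doc_vec[term] for term in query_vec if term in doc_vec}
    print(doc, "->", shared)
```

Both sentences share exactly the same term with the query, so a purely term-based score has nothing to tell the fruit apart from the company.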

---

#### **Using BM25 in LangChain**
1. **Sparse Retrieval with BM25:**
   - Implemented using `BM25Retriever` in LangChain.
   - Simply import and set it up:
     ```python
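     from langchain.retrievers import BM25Retriever
     # doc_list holds plain strings, so build the index with from_texts
     # (from_documents expects Document objects instead).
     retriever = BM25Retriever.from_texts(doc_list)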
     ```
   - Pass your documents and queries; it computes sparse vectors automatically.

2. **Keyword Search Example:**
   ```python
   retriever.get_relevant_documents("Apple")  # capitalized so it matches the token "Apple" exactly
   ```
   - Output:
     - Documents with the word `"Apple"` appear (e.g., `"I like computers by Apple"`).

---

#### **Semantic Search with Embeddings**
- **Embeddings:** Contextual representations of text (e.g., OpenAI embeddings).
- **Dense Vector Search:** Unlike BM25, embeddings match on contextual similarity (not keywords).
  - Query: `"green fruit"`
  - Result:
    - `"I like apples"` and `"Apples and oranges are fruits"` (contextually similar).

- **LangChain Integration:**
   ```python
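   from langchain_community.vectorstores import FAISS
   # `embeddings` must be the same embedding model that built the saved index.
   # Recent versions also require opting in to pickle deserialization.
   vector_store = FAISS.load_local("path/to/vectorstore", embeddings, allow_dangerous_deserialization=True)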
   ```
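
The dense vector search described above usually comes down to comparing vectors with a similarity measure such as cosine similarity. The sketch below assumes hand-picked three-dimensional vectors in place of real embeddings, which typically have hundreds of dimensions. The names `toy_embeddings` and `query_vector` are hypothetical and the numbers are invented purely to show the mechanics.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented toy vectors; a real embedding model would produce these.
toy_embeddings = {
    "I like apples": [0.9, 0.1, 0.0],
    "Apples and oranges are fruits": [0.8, 0.2, 0.1],
    "I like computers by Apple": [0.1, 0.9, 0.2],
}
query_vector = [0.85, 0.05, 0.1]  # stand-in for the embedding of "green fruit"

# Rank documents from most to least similar to the query vector.
ranked = sorted(
    toy_embeddings,
    key=lambda text: cosine_similarity(query_vector, toy_embeddings[text]),
    reverse=True,
)
for text in ranked:
    print(round(cosine_similarity(query_vector, toy_embeddings[text]), 3), text)
```

With these made-up numbers the two fruit sentences rank above the computer sentence, even though none of them contains the word "green".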

---

#### **Combining BM25 and Embeddings: Ensemble Retriever**
- **Why Combine?**
  - BM25 is great for keyword matching.
  - Embeddings are great for semantic understanding.  
  Together, they create a **hybrid search**.

- **Implementation in LangChain:**
  1. Import and configure an **ensemble retriever**.
  2. Combine the **BM25 retriever** and the **embedding-based retriever**.
  3. Define weights to prioritize keyword or semantic results:
     ```python
     from langchain.retrievers import EnsembleRetriever
     ensemble_retriever = EnsembleRetriever(
         retrievers=[bm25_retriever, embedding_retriever],
         weights=[0.5, 0.5]  # Adjust weights as needed
     )
     ```

- **How It Works:**  
  - Query: `"green fruit"`  
    - BM25 retrieves: `"I love fruit juice"`.
    - Embeddings retrieve: `"I like apples"`.
    - Hybrid result merges and re-ranks these, producing a balanced list.
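
The merge-and-re-rank step can be sketched with weighted Reciprocal Rank Fusion, which is the fusion method LangChain's documentation describes for `EnsembleRetriever`. This is a simplified illustration rather than the library's code: the function name `weighted_rrf` is hypothetical, the constant `c = 60` is an assumption, and the two ranked lists are copied from the notebook outputs in these notes.

```python
from collections import defaultdict

def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse several ranked lists with weighted Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for docs, weight in zip(ranked_lists, weights):
        for rank, doc in enumerate(docs, start=1):
            # A document earns more the nearer it is to the top of a list.
            scores[doc] += weight / (rank + c)
    # sorted() is stable, so tied documents keep their first-seen order.
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["I love fruit juice", "I like computers by Apple"]
dense_hits = ["Apples and oranges are fruits", "I like apples"]

balanced = weighted_rrf([bm25_hits, dense_hits], weights=[0.5, 0.5])
semantic_heavy = weighted_rrf([bm25_hits, dense_hits], weights=[0.2, 0.8])
print(balanced)
print(semantic_heavy)
```

With equal weights the two lists interleave. Shifting weight toward the dense retriever moves both of its results ahead of the BM25 results.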

---

#### **Use Cases**
- **When Hybrid Search Helps:**
  - Exact terms need matching (e.g., names, IDs) **and** contextual results are crucial.
  - Applications in legal, medical, or large-scale document retrieval systems.

- **When Pure Semantic Search Works:**
  - If exact keyword matching isn't critical, semantic search might suffice.

---

#### **Key Takeaways**
1. BM25 excels in speed and keyword retrieval but lacks semantic understanding.
2. Embedding-based retrieval adds depth but can be slower.
3. Hybrid systems (like LangChain's ensemble retrievers) provide the best of both worlds.
4. Experiment with your own documents and weighting systems to fine-tune for your project!

---



This transcript is quite detailed, so let's go through it in clear, simple language and break down the whole concept of hybrid search and BM25. 🚀

---

### **What Does Hybrid Search Mean?**
Hybrid search is a combination of **keyword search** and **semantic search**.  
- **Keyword Search**: Finds exact word matches.  
  - *Example*: Whether "Apple" means the fruit or the company does not matter; only the word has to match.  
- **Semantic Search**: Uses embeddings to understand the meaning of a sentence or query.  
  - *Example*: "A green fruit" will be matched to concepts such as "apple" or "grapes".

**Benefit of hybrid search:**  
- You get both the **speed of keyword matching** and the **deeper understanding of semantic search**.

---

### **BM25 Algorithm Overview**
BM25 is a **keyword search algorithm** that is quite old (from the 1970s-1980s) but still very effective.  
- It works on **TF-IDF (Term Frequency-Inverse Document Frequency)** principles:
  - **Term Frequency**: How many times a word occurs in a document.
  - **Inverse Document Frequency**: Gives more weight to rare words and discounts common ones.  
- BM25 builds **sparse vectors** (which only count specific terms and are not dense like embeddings).

**Advantages of BM25:**
1. Super-fast computation, because it only has to count words.  
2. Widely used in Elasticsearch and other fast search tools.
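
To show how term frequency, inverse document frequency, and document length fit together, here is a minimal sketch of an Okapi-style BM25 score in plain Python. It is an illustration under stated assumptions, not the code used by LangChain or Elasticsearch: the function name `bm25_scores` is hypothetical, tokenization is plain whitespace splitting, the parameters `k1 = 1.5` and `b = 0.75` are common defaults, and the IDF uses the non-negative `log(... + 1)` variant.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25-style score per document for a whitespace-tokenized query."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_counts = Counter(tokens)
        # Longer-than-average documents are penalized through this factor.
        length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
        score = 0.0
        for term in query.split():
            tf = term_counts[term]
            if tf == 0:
                continue  # terms absent from the document contribute nothing
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores

sample_docs = [
    "I like apples",
    "I like oranges",
    "Apples and oranges are fruits",
    "I like computers by Apple",
    "I love fruit juice",
]
scores = bm25_scores("oranges", sample_docs)
for doc, score in zip(sample_docs, scores):
    print(round(score, 3), doc)
```

Only the two sentences containing the exact token `oranges` receive a non-zero score, and the shorter one scores higher because of the length normalization.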

---

### **How to Implement BM25 and Hybrid Search in LangChain**
Setting up hybrid search in LangChain is easy. Here are the transcript's steps, simplified:

1. **BM25 Sparse Retriever Setup**:  
   Implementing BM25 is straightforward:
   ```python
   from langchain.retrievers import BM25Retriever

   # Document List
   docs = ["I like apples", "I like oranges", "Apples and oranges are fruits",
           "I like computers by Apple", "I love fruit juice"]

   retriever = BM25Retriever.from_texts(docs)  # docs are plain strings, so use from_texts
   retriever.k = 2  # Top 2 results
   print(retriever.get_relevant_documents("Apple"))
   ```
   - Query: `"Apple"`
   - BM25: Matches **exact words** only. Example: `"I like computers by Apple"`.

2. **Embedding Retriever Setup**:  
   Semantic retrieval uses embeddings:
   ```python
   from langchain.embeddings import OpenAIEmbeddings
   from langchain.vectorstores import FAISS

   # Create embeddings and FAISS VectorStore
   embeddings = OpenAIEmbeddings()
   vectorstore = FAISS.from_texts(docs, embeddings)

   # Query results
   print(vectorstore.similarity_search("a green fruit", k=2))
   ```
   - Query: `"a green fruit"`
   - Result: `"I like apples", "Apples and oranges are fruits"` (based on semantic context).

3. **Hybrid Search (Ensemble Retriever)**:  
   Combine BM25 and embeddings to build a hybrid search:
   ```python
   from langchain.retrievers import EnsembleRetriever

   hybrid_retriever = EnsembleRetriever(
       retrievers=[retriever, vectorstore.as_retriever()],
       weights=[0.5, 0.5]  # Weights for BM25 and embeddings
   )

   # Query
   results = hybrid_retriever.get_relevant_documents("a green fruit")
   print(results)
   ```
   - Hybrid Search:  
     - Based on the **keyword match**: `"I love fruit juice"`.  
     - Based on **semantic understanding**: `"Apples and oranges are fruits"`.  

   Adjust the weights to decide whether keyword or semantic search gets higher priority.

---

### **Practical Use Cases for Hybrid Search**
1. **Precise Keyword Search**:
   - When you need to find specific names or terms (e.g., "John Doe").  
2. **Semantic Understanding**:
   - When you need to understand the context of complex queries (e.g., "A fruit that's green").  
3. **Combination**:
   - When both exact terms and context matter.

---

### **Summary**
Hybrid search means combining keyword and semantic search, which is achieved by using BM25 together with embeddings.  
- **BM25** is fast and effective, but it works on keywords only.  
- **Embeddings** understand semantic context, but they can be slow.  
- LangChain's **Ensemble Retriever** combines the two to build a hybrid search.

Now customize this for your own use case and try it out. If you have any other questions, just ask here. 😎

# Code: Hybrid Search

Hybrid search = keyword-style search + vector-style search, which gives the advantages of both keyword lookup and vector lookup.

## BM25 Retriever - Sparse retriever

In [1]:
!pip install -qU "langchain-chroma>=0.1.2"

In [2]:
!pip install -qU langchain-community faiss-cpu

In [6]:
%pip install --upgrade --quiet  langchain-google-genai

In [9]:
import getpass
import os

if "GOOGLE_API_KEY" not in os.environ:
    os.environ["GOOGLE_API_KEY"] = getpass.getpass("Provide your Google API key here")


In [15]:
from langchain.retrievers import BM25Retriever, EnsembleRetriever
from langchain.schema import Document

# from langchain.vectorstores import Chroma
from langchain_chroma import Chroma
# from langchain.vectorstores import FAISS
from langchain_community.vectorstores import FAISS

# from langchain.embeddings.openai import OpenAIEmbeddings
# embedding = OpenAIEmbeddings()

# Use the embedding model below only for the vector store, not for BM25.
# BM25 calculates its sparse vectors by itself.
from langchain_google_genai import GoogleGenerativeAIEmbeddings
embeddings = GoogleGenerativeAIEmbeddings(model="models/text-embedding-004")

In [13]:
sentence = "This is a test sentence to embed."
vector = embeddings.embed_query(sentence)
print(vector[:5])

[0.008006626740098, 0.032392993569374084, -0.08357647806406021, 0.038629960268735886, 0.04651472344994545]


In [14]:
doc_list = [
    "I like apples",
    "I like oranges",
    "Apples and oranges are fruits",
    "I like computers by Apple",
    "I love fruit juice"
]

In [17]:
!pip install rank_bm25

Collecting rank_bm25
  Downloading rank_bm25-0.2.2-py3-none-any.whl.metadata (3.2 kB)
Downloading rank_bm25-0.2.2-py3-none-any.whl (8.6 kB)
Installing collected packages: rank_bm25
Successfully installed rank_bm25-0.2.2


In [18]:
# initialize the BM25 retriever (the FAISS retriever is created further below)
bm25_retriever = BM25Retriever.from_texts(doc_list)
bm25_retriever.k = 2

In [20]:
import warnings
warnings.filterwarnings("ignore")

In [22]:
bm25_retriever.get_relevant_documents("Apple")

[Document(metadata={}, page_content='I like computers by Apple'),
 Document(metadata={}, page_content='I love fruit juice')]

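`I like computers by Apple` is the only sentence that contains the exact token `Apple`, so it is the genuine keyword match. `I love fruit juice` shares no term with the query; it most likely appears only because the retriever returns `k = 2` documents regardless of their scores.

BM25 is a keyword-based search and is very quick.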

In [23]:
bm25_retriever.get_relevant_documents("a green fruit")

[Document(metadata={}, page_content='I love fruit juice'),
 Document(metadata={}, page_content='I like computers by Apple')]
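
The two BM25 outputs above are easier to read once you check which documents actually share a token with each query. The sketch below assumes the retriever tokenizes on whitespace without lowercasing or stemming (an assumption about its default preprocessing). The helper `overlap_counts` is hypothetical and counts exact token overlaps rather than computing real BM25 scores.

```python
doc_list = [
    "I like apples",
    "I like oranges",
    "Apples and oranges are fruits",
    "I like computers by Apple",
    "I love fruit juice",
]

def overlap_counts(query, docs):
    """Count the query tokens that appear verbatim in each document."""
    query_tokens = set(query.split())
    return [len(query_tokens & set(doc.split())) for doc in docs]

for query in ("Apple", "a green fruit"):
    counts = overlap_counts(query, doc_list)
    # Documents with at least one exact token in common with the query.
    matches = [doc for doc, n in zip(doc_list, counts) if n > 0]
    print(query, "->", counts, matches)
```

Under that assumption each query has exactly one real keyword match (`apples` and `fruits` are different tokens from `Apple` and `fruit`), so the second document in each result is padding with no shared term.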

## Embeddings - Dense retrievers FAISS

In [25]:
faiss_vectorstore = FAISS.from_texts(doc_list, embeddings)
faiss_retriever = faiss_vectorstore.as_retriever(search_kwargs={"k": 2})

In [26]:
faiss_retriever.get_relevant_documents("A green fruit")

[Document(id='01f9f9af-039b-4928-a812-3ade4bd52cc3', metadata={}, page_content='Apples and oranges are fruits'),
 Document(id='d553dc42-ac91-44d6-be1e-d3ae452c768a', metadata={}, page_content='I like apples')]

Notice that the semantic embeddings retrieve both fruit-related sentences, even though neither contains the word "green".

## Ensemble Retriever

In [27]:
# initialize the ensemble retriever
ensemble_retriever = EnsembleRetriever(retrievers=[bm25_retriever, faiss_retriever],
                                       weights=[0.5, 0.5])

In [28]:
docs = ensemble_retriever.get_relevant_documents("A green fruit")
docs

[Document(metadata={}, page_content='I love fruit juice'),
 Document(id='01f9f9af-039b-4928-a812-3ade4bd52cc3', metadata={}, page_content='Apples and oranges are fruits'),
 Document(metadata={}, page_content='I like computers by Apple'),
 Document(id='d553dc42-ac91-44d6-be1e-d3ae452c768a', metadata={}, page_content='I like apples')]

In [29]:
docs = ensemble_retriever.get_relevant_documents("Apple Phones")
docs

[Document(metadata={}, page_content='I like computers by Apple'),
 Document(id='d553dc42-ac91-44d6-be1e-d3ae452c768a', metadata={}, page_content='I like apples'),
 Document(metadata={}, page_content='I love fruit juice')]

This shows the advantage of the hybrid search approach: the results include both the keyword lookup and the semantic lookup.

The results are returned in ranked order, with the highest-ranked (most relevant) document at the top.