

### 1. **BM25** – *Lexical, Token-Level Partial Match*

- **Use Case**: Matching documents that **contain some query terms**, not necessarily all.
- **Best for**:
  - Keyword-based search
  - Information retrieval
  - RAG systems
- **Example**:  
  Query: `"deep learning health"`  
  Matches: `"deep learning in healthcare"` through the shared tokens "deep" and "learning". Without stemming, "health" does not match "healthcare".

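> BM25 ranks a document by how often it contains each query term and how rare that term is across the collection, so it is great for finding **relevant documents even when only some terms match**.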
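
To make the partial-match behavior concrete, here is a minimal pure-Python sketch of Okapi BM25 scoring. It is not the implementation used by `rank_bm25` or LangChain; the whitespace tokenizer, the smoothed IDF variant, the parameter values `k1=1.5` and `b=0.75`, and the three sample documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (an assumed, deliberately naive tokenizer)."""
    return text.lower().split()


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with an Okapi BM25 formula."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue  # a missing query term simply contributes nothing
            # Smoothed IDF variant that stays non-negative (an assumption).
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Length normalization: long documents are penalized via b.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical mini-corpus built around the example query above.
documents = [
    "deep learning in healthcare",
    "healthcare policy and insurance",
    "deep sea fishing",
]
scores = bm25_scores("deep learning health", documents)
for doc, score in zip(documents, scores):
    print(f"{score:.3f}  {doc}")
```

The first document ranks highest because it shares two query tokens, the third gets a smaller score from "deep" alone, and the second scores zero because "health" and "healthcare" are different tokens to this tokenizer.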

---


### ✅ Summary: Which One to Choose?

| Feature                    | BM25                                  | Fuzzy Search                |
|----------------------------|---------------------------------------|-----------------------------|
| Partial term match         | ✔ Yes (whole tokens)                  | ✖ No (character level only) |
| Typo handling              | ✖ No                                  | ✔ Yes                       |
| Meaning/context based      | ✖ No (lexical; ranks by term statistics) | ✖ No                     |
| Works well for RAG         | ✔ Good fit for keyword retrieval      | ✖ Not ideal                 |

---

### ✅ Your Case (RAG or AI Search):
**Use BM25** if you care about **relevance and partial keyword matching**.

**Use Fuzzy Search** only if you're trying to correct user **typos** or **misspellings**.
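
The difference between the two can be sketched with a single misspelled word. The example below uses `difflib.SequenceMatcher` from the standard library as a stand-in for a fuzzy matcher; its scores will not equal those of `fuzzywuzzy`, and the helper names `similarity` and `shares_token` are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Character-level similarity in [0, 100], via difflib (a stand-in for fuzzywuzzy)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def shares_token(query, text):
    """Exact token overlap, the only kind of match a plain whitespace tokenizer sees."""
    return bool(set(query.lower().split()) & set(text.lower().split()))


correct = "Tendulkar"
typo = "Tendulker"  # one wrong character

typo_similarity = similarity(typo, correct)          # high: most characters line up
typo_token_match = shares_token(typo, correct)       # False: the tokens differ
unrelated_similarity = similarity(typo, "Ponting")   # low: few shared characters

print(typo_similarity, typo_token_match, unrelated_similarity)
```

A token-based ranker such as BM25 treats the misspelled word as a completely different term, while the character-level score still places it close to the intended name.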



In [60]:
%pip install --upgrade --quiet  rank_bm25  langchain nltk tiktoken fuzzywuzzy


In [35]:
from langchain.retrievers import BM25Retriever
from langchain.schema import Document
import nltk
from nltk.tokenize import sent_tokenize

# Download sentence tokenizer
nltk.download("punkt")
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [36]:
# Step 1: Load the text

with open("sachin_tendulkar_bio.txt", "r", encoding="utf-8") as f:
    content = f.read()
content

'\nSachin Ramesh Tendulkar: The Master Blaster of Cricket\n\nEarly Life:\nSachin Tendulkar was born on April 24, 1973, in Mumbai, India. He was introduced to cricket at an early age by his elder brother Ajit Tendulkar, who recognized his extraordinary talent. Under the mentorship of coach Ramakant Achrekar, Sachin began honing his cricketing skills at Shivaji Park. As a young boy, he played for his school team and gained immense attention by scoring centuries regularly in school-level tournaments.\n\nDomestic Debut:\nAt the age of 15, Sachin made his debut in first-class cricket for Mumbai in the Ranji Trophy. He scored a century in his debut match against Gujarat, becoming the youngest Indian to do so at the time. His performance in domestic cricket quickly earned him a place in the national team.\n\nInternational Debut:\nSachin Tendulkar made his international debut for India in a Test match against Pakistan in Karachi on November 15, 1989, at the age of 16. Despite facing a formidab

In [53]:
from langchain.text_splitter import TokenTextSplitter

# Step 2: Initialize the token-based text splitter
text_splitter = TokenTextSplitter(
    chunk_size=50,       # number of tokens per chunk
    chunk_overlap=5      # overlap to preserve context
)

# Split the text
chunks = text_splitter.split_text(content)

chunks[0]

'\nSachin Ramesh Tendulkar: The Master Blaster of Cricket\n\nEarly Life:\nSachin Tendulkar was born on April 24, 1973, in Mumbai, India. He was introduced to cricket at an early age'
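
The effect of `chunk_size` and `chunk_overlap` can be illustrated with a small sliding-window sketch. This is an assumption about the general technique, not the source of `TokenTextSplitter`; the function name `split_with_overlap` is hypothetical, and plain integers stand in for real token IDs.

```python
def split_with_overlap(tokens, chunk_size, chunk_overlap):
    """Slide a window of chunk_size tokens, stepping by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# Hypothetical input: integers standing in for token IDs.
tokens = list(range(12))
chunks = split_with_overlap(tokens, chunk_size=5, chunk_overlap=2)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last `chunk_overlap` tokens of the previous one, so a sentence cut at a chunk boundary is still partly visible in the next chunk.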

In [56]:
# Step 3: Wrap each chunk in a Document object
docs = [Document(page_content=chunk) for chunk in chunks]
docs[0]

Document(metadata={}, page_content='\nSachin Ramesh Tendulkar: The Master Blaster of Cricket\n\nEarly Life:\nSachin Tendulkar was born on April 24, 1973, in Mumbai, India. He was introduced to cricket at an early age')

In [57]:

# Step 4: Initialize BM25 Retriever
retriever = BM25Retriever.from_documents(docs)
retriever.k = 3  # Number of top results to retrieve

# Step 5: Query
query = "who born on 1973 april"
results = retriever.invoke(query)

# Step 6: Print results
print(" Top BM25 Results:\n")
for i, doc in enumerate(results, 1):
    print(f"{i}. {doc.page_content}\n")


 Top BM25 Results:

1. 
Sachin Ramesh Tendulkar: The Master Blaster of Cricket

Early Life:
Sachin Tendulkar was born on April 24, 1973, in Mumbai, India. He was introduced to cricket at an early age

2.  cricket at an early age by his elder brother Ajit Tendulkar, who recognized his extraordinary talent. Under the mentorship of coach Ramakant Achrekar, Sachin began honing his cricketing skills at Shivaji Park. As a

3.  a place in the national team.

International Debut:
Sachin Tendulkar made his international debut for India in a Test match against Pakistan in Karachi on November 15, 1989, at the age of 16. Despite facing a formidable



In [62]:
from fuzzywuzzy import process

# Example Documents
docs = [
    {"page_content": "Sachin Tendulkar was born in 1973, in April."},
    {"page_content": "The history of famous cricketers who were born in April includes Sachin Tendulkar."},
    {"page_content": "The famous cricketer Sachin Tendulkar was born on April 24th, 1973."},
    {"page_content": "Virat Kohli is a modern cricketer who plays for India."},
    {"page_content": "Cricket legends such as Brian Lara and Ricky Ponting shaped the sport."}
]

# Step 1: Query
query = "who born in 1973, in April"

# Step 2: Fuzzy Search (Using fuzzywuzzy's process.extract function)
top_k = 3  # Number of top results to retrieve
results = process.extract(query, [doc["page_content"] for doc in docs], limit=top_k)

# Step 3: Print results with matching score
print("Top Fuzzy Search Results:\n")
for i, (match, score) in enumerate(results, 1):
    print(f"{i}. {match}\nScore: {score}/100\n")


Top Fuzzy Search Results:

1. Sachin Tendulkar was born in 1973, in April.
Score: 86/100

2. The history of famous cricketers who were born in April includes Sachin Tendulkar.
Score: 86/100

3. The famous cricketer Sachin Tendulkar was born on April 24th, 1973.
Score: 86/100

