
### **Fuzzy Search** – *Character-Level Typo/Spell Match*

- **Use Case**: Handling **typos or minor spelling errors** in terms.
- **Best for**:
  - Searching misspelled names, usernames, etc.
  - Typo tolerance
- **Example**:  
  Query: `"heath"`  
  Matches: `"health"`, because the two strings differ by a single character (the missing `l`).

> Fuzzy search is great when users **mistype words or names**.
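
The "1 character different" idea above is usually measured as edit distance. The following is a minimal sketch of Levenshtein distance in plain Python; the function name `levenshtein` is introduced here for illustration and is not part of any library used in this notebook.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits that turn a into b."""
    # previous[j] is the distance between the prefix of a seen so far and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


print(levenshtein("heath", "health"))  # one insertion: the missing "l"
```

A fuzzy matcher typically turns a distance like this into a similarity score, so that small distances relative to the string length give high scores.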

---

### Summary: Which One to Choose?

| Feature                              | BM25                                             | Fuzzy Search                     |
|--------------------------------------|--------------------------------------------------|----------------------------------|
| Matches on a subset of query terms   | ✔ Yes                                            | ✖ No (character level only)      |
| Typo handling                        | ✖ No                                             | ✔ Yes                            |
| Meaning/context based                | Partly (term-weighted relevance, not semantics)  | ✖ No                             |
| Works well for RAG                   | ✔ Good fit for keyword retrieval                 | ✖ Not ideal                      |
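
The typo-handling row can be checked with a small sketch. It assumes a hand-written BM25 scorer (one common variant with a non-negative IDF, not the `rank_bm25` implementation), a made-up three-document corpus, and the standard-library `difflib` as a stand-in fuzzy matcher. All names here are hypothetical.

```python
import difflib
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the query with a basic BM25 formula."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))
    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term_freq[term] == 0:
                continue  # BM25 only rewards exact token matches
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + length_norm)
        scores.append(score)
    return scores


# Illustrative corpus (not taken from the notebook's data)
corpus = [s.split() for s in [
    "public health policy in india",
    "cricket world cup final",
    "health and fitness for cricket players",
]]
vocabulary = sorted({token for doc in corpus for token in doc})

exact = bm25_scores(["health"], corpus)  # correctly spelled query
typo = bm25_scores(["heath"], corpus)    # misspelled query
suggestion = difflib.get_close_matches("heath", vocabulary, n=1)

print(exact)
print(typo)        # all zeros: the token "heath" occurs in no document
print(suggestion)  # closest vocabulary word at the character level
```

One design this suggests is to use fuzzy matching to correct the query terms first and then rank with BM25, rather than treating the two as alternatives.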

---

### Your Case (RAG or AI Search):
**Use BM25** if you care about **relevance and partial keyword matching**.

**Use Fuzzy Search** only if you're trying to correct user **typos** or **misspellings**.


In [1]:
%pip install --upgrade --quiet rank_bm25 langchain nltk tiktoken fuzzywuzzy



In [7]:
from langchain.schema import Document
import nltk
from nltk.tokenize import sent_tokenize
from fuzzywuzzy import process

# Download the NLTK sentence tokenizer data (Punkt)
nltk.download("punkt")
nltk.download("punkt_tab")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [4]:
# Step 1: Load the source text

with open("sachin_tendulkar_bio.txt", "r", encoding="utf-8") as f:
    content = f.read()
content

'\nSachin Ramesh Tendulkar: The Master Blaster of Cricket\n\nEarly Life:\nSachin Tendulkar was born on April 24, 1973, in Mumbai, India. He was introduced to cricket at an early age by his elder brother Ajit Tendulkar, who recognized his extraordinary talent. Under the mentorship of coach Ramakant Achrekar, Sachin began honing his cricketing skills at Shivaji Park. As a young boy, he played for his school team and gained immense attention by scoring centuries regularly in school-level tournaments.\n\nDomestic Debut:\nAt the age of 15, Sachin made his debut in first-class cricket for Mumbai in the Ranji Trophy. He scored a century in his debut match against Gujarat, becoming the youngest Indian to do so at the time. His performance in domestic cricket quickly earned him a place in the national team.\n\nInternational Debut:\nSachin Tendulkar made his international debut for India in a Test match against Pakistan in Karachi on November 15, 1989, at the age of 16. Despite facing a formidab

In [14]:
from langchain.text_splitter import TokenTextSplitter
from fuzzywuzzy import fuzz

# Step 2: Initialize a token-based text splitter
text_splitter = TokenTextSplitter(
    chunk_size=20,       # number of tokens per chunk
    chunk_overlap=5      # overlap to preserve context
)

# Split the text
chunks = text_splitter.split_text(content)

chunks[0]

'\nSachin Ramesh Tendulkar: The Master Blaster of Cricket\n\nEarly Life'
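
To make `chunk_size` and `chunk_overlap` concrete, here is a rough sketch of overlapping windows over a token list. It is an approximation under stated assumptions: `TokenTextSplitter` counts tiktoken tokens, while this sketch uses placeholder word tokens, and `split_with_overlap` is a hypothetical helper, not a LangChain function.

```python
def split_with_overlap(tokens, chunk_size=20, chunk_overlap=5):
    """Cut a token list into windows that share chunk_overlap tokens with their neighbor."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each new chunk starts this many tokens later
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


# 50 placeholder tokens: w0, w1, ..., w49
words = [f"w{i}" for i in range(50)]
pieces = split_with_overlap(words)

for piece in pieces:
    print(piece[0], "...", piece[-1], f"({len(piece)} tokens)")
```

With 20-token chunks, each chunk holds only part of a sentence, which is why a fact such as a birth date can end up split across two neighboring chunks.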

In [15]:
# Step 3: Wrap each chunk as a Document object
docs = [Document(page_content=chunk) for chunk in chunks]
docs[0]

Document(metadata={}, page_content='\nSachin Ramesh Tendulkar: The Master Blaster of Cricket\n\nEarly Life')

In [19]:
top_k = 3  # Number of top results to retrieve
query = "in April 24, 1973  who born"

# Step 4: Score every chunk against the query.
# partial_ratio compares the shorter string with the best-matching
# part of the longer one and returns an integer from 0 to 100.
doc_contents = [doc.page_content for doc in docs]
results = [(text, fuzz.partial_ratio(query, text)) for text in doc_contents]

# Step 5: Sort by score (descending) and keep the top_k results
results.sort(key=lambda pair: pair[1], reverse=True)
top_results = results[:top_k]

# Step 6: Print each result with its match score
print("Top Fuzzy Search Results:\n")
for i, (match, score) in enumerate(top_results, 1):
    print(f"{i}. {match}\nScore: {score}/100\n")

Top Fuzzy Search Results:

1.  Cricket

Early Life:
Sachin Tendulkar was born on April 24, 1973
Score: 73/100

2.  on April 24, 1973, in Mumbai, India. He was introduced to cricket at an early age
Score: 70/100

3. 63 ODIs.
- Tendulkar was part of the Indian cricket team that won the 2011
Score: 41/100
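
The scores above come from character-level comparison, which is sensitive to word order. That is a likely reason the reordered query reaches only 73 on the chunk that contains every key phrase. The sketch below approximates the idea behind partial matching with the standard-library `difflib`; it is an assumption-laden stand-in, not fuzzywuzzy's actual algorithm, and `partial_ratio_sketch` is a hypothetical name.

```python
from difflib import SequenceMatcher


def partial_ratio_sketch(a: str, b: str) -> int:
    """Best similarity (0-100) between the shorter string and any same-length window of the longer."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    if not shorter:
        return 0
    window = len(shorter)
    best = 0.0
    # Slide a window of the shorter string's length across the longer string
    for start in range(len(longer) - window + 1):
        candidate = longer[start:start + window]
        ratio = SequenceMatcher(None, shorter, candidate, autojunk=False).ratio()
        best = max(best, ratio)
    return round(100 * best)


print(partial_ratio_sketch("born", "was born on April 24"))  # exact substring -> 100
print(partial_ratio_sketch("heath", "public health"))        # close, but not exact
print(partial_ratio_sketch("who born", "was born on"))       # word order and wording differ
```

Because the comparison works on characters rather than words, unrelated chunks can still pick up moderate scores from shared fragments, as the third result above shows.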

