<a href="https://colab.research.google.com/github/harshxmishra/zepto-advanced-search/blob/main/zepto's_advanced_search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Replicating Zepto's Advanced Search Correction System

**Objective:** Build a functional proof of concept that replicates Zepto's intelligent search correction system.

This system is designed to handle ambiguous user queries—including misspellings, slang, and multilingual terms—by using a Retrieval-Augmented Generation (RAG) architecture. The goal is to interpret user intent accurately and retrieve the correct products from our catalog.

### The Two-Stage Strategy

Our implementation follows the same two-stage strategy outlined by Zepto:

1.  **Semantic Retrieval:** We first take the user's raw query and find a list of `top-k` potentially relevant products from our entire catalog. This is done by comparing the query's vector embedding against the embeddings of our products stored in a vector database. This step provides the necessary context.

2.  **LLM-Powered Correction and Selection:** The retrieved products (the context) and the original query are then passed to a Large Language Model (LLM). The LLM's task is not just to correct spelling, but to analyze the context and *select the most likely product* the user intended to find. It then returns a clean, corrected query and the reasoning behind its decision in a structured format.
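The two stages can be illustrated without any external services. The sketch below is a toy, not the implementation built in this notebook: character-trigram overlap stands in for embedding similarity, and a simple rule stands in for the LLM. The function names (`retrieve`, `select`) and the five-item catalog subset are assumptions made for illustration.

```python
from typing import Optional

# Small subset of the catalog defined later in this notebook.
CATALOG = [
    "Aashirvaad Select Atta 5kg",
    "Onion 1kg (Kanda)",
    "Coca-Cola Original Taste 750ml",
    "Amul Cheese Slices 100g",
    "Tata Salt 1kg",
]


def trigrams(text: str) -> set:
    """Character trigrams of the lowercased, space-padded text."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a: str, b: str) -> float:
    """Jaccard overlap of trigram sets (toy stand-in for embedding similarity)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def retrieve(query: str, k: int = 3) -> list:
    """Stage 1: return the k catalog entries most similar to the query."""
    return sorted(CATALOG, key=lambda name: similarity(query, name), reverse=True)[:k]


def select(query: str, context: list) -> Optional[str]:
    """Stage 2 stand-in: pick the best candidate, or None if nothing overlaps."""
    best = max(context, key=lambda name: similarity(query, name))
    return best if similarity(query, best) > 0 else None


candidates = retrieve("cococola")
print(candidates)
print(select("cococola", candidates))
```

The split matters because stage 1 only has to be cheap and broad, while stage 2 sees a short candidate list and can afford a slower, more careful decision.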

## Phase 1: Environment Setup & Data Preparation

### Step 1.1: Install Dependencies

First, we install the necessary Python libraries. We'll use `langchain` to orchestrate the components, `langchain-groq` for LLM inference through Groq, `fastembed` for efficient local embeddings, `langchain-chroma` for the vector database, and `pandas` for data handling. `langchain-core` and `langchain-community` supply the prompt, parser, and embedding-wrapper classes used below.

In [None]:
!pip install -q pandas langchain langchain-core langchain-groq langchain-chroma fastembed langchain-community


### Step 1.2: Create an Expanded and Complex Dummy Dataset

To thoroughly test the system, we need a dataset that reflects real-world challenges. This CSV includes:
- 22 products across eight categories.
- Common brand names (e.g., `Coca-Cola`, `Maggi`).
- Multilingual and vernacular terms (`dhaniya`, `kanda`, `nimbu`).
- Potentially ambiguous items (`cheese spread`, `cheese slices`).

In [None]:
import pandas as pd
from io import StringIO

csv_data = """product_id,product_name,category,tags
1,Aashirvaad Select Atta 5kg,Staples,"atta, flour, gehu, aata, wheat"
2,Amul Gold Milk 1L,Dairy,"milk, doodh, paal, full cream milk"
3,Tata Salt 1kg,Staples,"salt, namak, uppu"
4,Kellogg's Corn Flakes 475g,Breakfast,"cornflakes, breakfast cereal, makkai"
5,Parle-G Gold Biscuit 1kg,Snacks,"biscuit, cookies, biscuits"
6,Cadbury Dairy Milk Silk,Chocolates,"chocolate, choco, silk, dairy milk"
7,Haldiram's Classic Banana Chips,Snacks,"kele chips, banana wafers, chips"
8,MDH Deggi Mirch Masala,Spices,"mirchi, masala, spice, red chili powder"
9,Fresh Coriander Bunch (Dhaniya),Vegetables,"coriander, dhaniya, kothimbir, cilantro"
10,Fresh Mint Leaves Bunch (Pudina),Vegetables,"mint, pudhina, pudina patta"
11,Taj Mahal Red Label Tea 500g,Beverages,"tea, chai, chaha, red label"
12,Nescafe Classic Coffee 100g,Beverages,"coffee, koffee, nescafe"
13,Onion 1kg (Kanda),Vegetables,"onion, kanda, pyaz"
14,Tomato 1kg,Vegetables,"tomato, tamatar"
15,Coca-Cola Original Taste 750ml,Beverages,"coke, coca-cola, soft drink, cold drink"
16,Maggi 2-Minute Noodles Masala,Snacks,"maggi, noodles, instant food"
17,Amul Cheese Slices 100g,Dairy,"cheese, cheese slice, paneer slice"
18,Britannia Cheese Spread 180g,Dairy,"cheese, cheese spread, creamy cheese"
19,Fresh Lemon 4pcs (Nimbu),Vegetables,"lemon, nimbu, lime"
20,Saffola Gold Edible Oil 1L,Staples,"oil, tel, cooking oil, saffola"
21,Basmati Rice 1kg,Staples,"rice, chawal, basmati"
22,Kurkure Masala Munch,Snacks,"kurkure, snacks, chips"
"""

df = pd.read_csv(StringIO(csv_data))

print("Product Catalog successfully loaded.")
df.head()

Product Catalog successfully loaded.


|   | product_id | product_name | category | tags |
|---|------------|--------------|----------|------|
| 0 | 1 | Aashirvaad Select Atta 5kg | Staples | atta, flour, gehu, aata, wheat |
| 1 | 2 | Amul Gold Milk 1L | Dairy | milk, doodh, paal, full cream milk |
| 2 | 3 | Tata Salt 1kg | Staples | salt, namak, uppu |
| 3 | 4 | Kellogg's Corn Flakes 475g | Breakfast | cornflakes, breakfast cereal, makkai |
| 4 | 5 | Parle-G Gold Biscuit 1kg | Snacks | biscuit, cookies, biscuits |


## Phase 2: Building the Core RAG Components

### Step 2.1: Initialize a Vector Database

We will convert our product data into numerical representations (embeddings) that capture semantic meaning. We use `FastEmbed` for this, as it's fast and runs locally. These embeddings are stored in `ChromaDB`, a lightweight vector store.

**Embedding Strategy:** For each product, we create a single text document that combines its name, category, and tags. This creates a rich, descriptive embedding that improves the chances of a successful semantic match.
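To make the embedding strategy concrete, here is a minimal sketch of the text that gets embedded for one product. The helper name `build_document_text` is hypothetical; the string format mirrors the `page_content` template used in the cell below.

```python
def build_document_text(row: dict) -> str:
    """Combine name, category, and tags into the single string that is embedded."""
    return f"{row['product_name']}. Category: {row['category']}. Tags: {row['tags']}"


# One row from the catalog above, written as a plain dict.
example_row = {
    "product_id": 13,
    "product_name": "Onion 1kg (Kanda)",
    "category": "Vegetables",
    "tags": "onion, kanda, pyaz",
}

print(build_document_text(example_row))
```

Because the tags carry the vernacular terms (`kanda`, `pyaz`), including them in the embedded text is what gives a query like `pyaz` a chance of landing near the onion entry.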

In [None]:
import os
import json
from langchain_core.documents import Document
from langchain_community.embeddings import FastEmbedEmbeddings
from langchain_chroma import Chroma

# Create LangChain Documents
documents = [
    Document(
        page_content=f"{row['product_name']}. Category: {row['category']}. Tags: {row['tags']}",
        metadata={
            "product_id": row['product_id'],
            "product_name": row['product_name'],
            "category": row['category']
        }
    ) for _, row in df.iterrows()
]

# Initialize embedding model and vector store
embedding_model = FastEmbedEmbeddings(model_name="BAAI/bge-small-en-v1.5")
vectorstore = Chroma.from_documents(documents, embedding_model)

# The retriever will be used to fetch the top-k most similar documents
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

print("Vector database initialized and retriever is ready.")

Vector database initialized and retriever is ready.
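Conceptually, the retriever ranks stored vectors by their closeness to the query vector and returns the top `k`. The sketch below shows that idea with cosine similarity on made-up three-dimensional vectors. It is an illustration only: the vectors are invented, real embeddings have many more dimensions, and the index structure and distance metric Chroma actually uses depend on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, items, k=2):
    """items is a list of (name, vector); return the k names closest to query_vec."""
    ranked = sorted(
        items,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Invented vectors for illustration; they are not real model outputs.
toy_index = [
    ("Onion 1kg (Kanda)", [0.9, 0.1, 0.0]),
    ("Tomato 1kg", [0.7, 0.3, 0.1]),
    ("Nescafe Classic Coffee 100g", [0.0, 0.2, 0.9]),
]
toy_query = [0.8, 0.2, 0.0]

print(top_k(toy_query, toy_index))
```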


### Step 2.2: Design the Advanced LLM Prompt

This is the most critical step. We design a prompt that instructs the LLM to act as an expert query interpreter, follow a strict process, and return a structured JSON object. Combined with the `response_format` setting passed to the model in the cell below, this keeps the output predictable and easy to use in our application.

**Key features of the prompt:**
- **Clear Role:** The LLM is told it's an expert system for a grocery store.
- **Context is Key:** It must base its decision on the list of retrieved products.
- **Mandatory JSON Output:** We instruct it to return a JSON object with a specific schema: `corrected_query`, `identified_product`, `confidence`, and `reasoning`. This is crucial for system reliability.
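Since downstream code relies on this schema, it can help to validate the response before using it. The following is a minimal sketch under the assumption that the four keys listed above are always required and that `confidence` takes one of the three values named in the prompt. The function name `validate_llm_output` and the sample response string are hypothetical.

```python
import json

REQUIRED_KEYS = {"corrected_query", "identified_product", "confidence", "reasoning"}
ALLOWED_CONFIDENCE = {"High", "Medium", "Low"}


def validate_llm_output(raw: str) -> dict:
    """Parse the LLM response and check it against the expected schema."""
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if data["confidence"] not in ALLOWED_CONFIDENCE:
        raise ValueError(f"unexpected confidence: {data['confidence']!r}")
    return data


# Hand-written sample response, not real model output.
sample = (
    '{"corrected_query": "atta", '
    '"identified_product": "Aashirvaad Select Atta 5kg", '
    '"confidence": "High", '
    '"reasoning": "aata is a variant spelling of atta."}'
)
print(validate_llm_output(sample)["corrected_query"])
```

A caller can catch `ValueError` (which `json.JSONDecodeError` subclasses) and fall back to the raw query, in the same spirit as the fallback in the pipeline later in this notebook.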

In [None]:
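from getpass import getpass

from langchain_groq import ChatGroq
from langchain_core.prompts import ChatPromptTemplate

# IMPORTANT: Supply your Groq API key through the GROQ_API_KEY environment variable.
# If it is not set, you are prompted for it so the key never lives in the notebook.
if not os.environ.get("GROQ_API_KEY"):
    os.environ["GROQ_API_KEY"] = getpass("Enter your Groq API key: ")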

llm = ChatGroq(
    temperature=0,
    model_name="llama3-8b-8192",
    model_kwargs={"response_format": {"type": "json_object"}},
)

prompt_template = """
You are a world-class search query interpretation engine for a grocery delivery service like Zepto.
Your primary goal is to understand the user's *intent*, even if their query is misspelled, in a different language, or uses slang.

Analyze the user's `RAW QUERY` and the `CONTEXT` of semantically similar products retrieved from our catalog.
Based on this, determine the most likely product the user is searching for.

**INSTRUCTIONS:**
1. Compare the `RAW QUERY` against the product names in the `CONTEXT`.
2. Identify the single best match from the `CONTEXT`.
3. Generate a clean, corrected search query for that product.
4. Provide a confidence score (High, Medium, Low) and a brief reasoning for your choice.
5. Return a single JSON object with the following schema:
   - "corrected_query": A clean, corrected search term.
   - "identified_product": The full name of the single most likely product from the context.
   - "confidence": Your confidence in the decision: "High", "Medium", or "Low".
   - "reasoning": A brief, one-sentence explanation of why you made this choice.

If the query is too ambiguous or has no good match in the context, confidence should be "Low" and `identified_product` can be `null`.

---
CONTEXT:
{context}

RAW QUERY:
{query}
---

JSON OUTPUT:
"""


prompt = ChatPromptTemplate.from_template(prompt_template)

print("LLM and Prompt Template are configured.")

LLM and Prompt Template are configured.


## Phase 3: Creating the End-to-End Pipeline

We now chain all the components together using LangChain Expression Language (LCEL). This creates a seamless flow from query to final result.

**Pipeline Flow:**
1. The user's query is passed to the `retriever` to fetch context.
2. The context and original query are formatted and fed into the `prompt`.
3. The formatted prompt is sent to the `LLM`.
4. The LLM's JSON output is returned as a string and then parsed into a Python dictionary with `json.loads`.
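The flow above can be mirrored with plain Python functions to show how data moves between steps. This is an analogy rather than LCEL itself: `fake_retriever` and `fake_llm` are stubs with canned return values, and `fan_out`, `render_prompt`, and `run_chain` are hypothetical names.

```python
import json


def fake_retriever(query: str) -> list:
    """Stub standing in for the vector-store retriever."""
    return ["Onion 1kg (Kanda)", "Tomato 1kg"]


def format_names(names: list) -> str:
    """Render product names as a bulleted block for the prompt."""
    return "\n".join(f"- {name}" for name in names)


def fan_out(query: str) -> dict:
    """Both branches receive the same input: one retrieves, one passes through."""
    return {"context": format_names(fake_retriever(query)), "query": query}


def render_prompt(inputs: dict) -> str:
    """Fill a shortened version of the prompt template."""
    return "CONTEXT:\n{context}\n\nRAW QUERY:\n{query}".format(**inputs)


def fake_llm(prompt_text: str) -> str:
    """Stub that returns a fixed JSON string instead of calling a model."""
    return '{"corrected_query": "onion"}'


def run_chain(query: str) -> dict:
    """Apply each step to the previous step's output, like a pipe."""
    result = query
    for step in (fan_out, render_prompt, fake_llm, json.loads):
        result = step(result)
    return result


print(run_chain("kanda"))
```

Each step takes exactly one input and returns one output, which is the property that lets the steps be chained with a pipe operator.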

In [None]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

def format_docs(docs):
    """Formats the retrieved documents for the prompt."""
    return "\n".join([f"- {d.metadata['product_name']}" for d in docs])

# The main RAG chain
rag_chain = (
    {"context": retriever | format_docs, "query": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

def search_pipeline(query: str):
    """Executes the full search and correction pipeline."""
    print(f"\n{'='*50}")
    print(f"Executing Pipeline for Query: '{query}'")
    print(f"{'='*50}")

    # --- Stage 1: Semantic Retrieval ---
    initial_context = retriever.invoke(query)
    print("\n[Stage 1: Semantic Retrieval]")
    print("Found the following products for context:")
    for doc in initial_context:
        print(f"  - {doc.metadata['product_name']}")

    # --- Stage 2: LLM Correction & Selection ---
    print("\n[Stage 2: LLM Correction & Selection]")
    llm_output_str = rag_chain.invoke(query)

    try:
        llm_output = json.loads(llm_output_str)
        print("LLM successfully parsed the query and returned:")
        print(json.dumps(llm_output, indent=2))
        corrected_query = llm_output.get('corrected_query', query)
    except (json.JSONDecodeError, AttributeError) as e:
        print(f"LLM output failed to parse. Error: {e}")
        print(f"Raw LLM output: {llm_output_str}")
        corrected_query = query # Fallback to original query

    # --- Final Step: Search with Corrected Query ---
    print("\n[Final Step: Search with Corrected Query]")
    print(f"Searching for the corrected term: '{corrected_query}'")
    final_results = vectorstore.similarity_search(corrected_query, k=3)
    print("\nTop 3 Product Results:")
    for i, doc in enumerate(final_results):
        print(f"  {i+1}. {doc.metadata['product_name']} (ID: {doc.metadata['product_id']})")
    print(f"{'='*50}\n")


print("End-to-end search pipeline is ready.")

End-to-end search pipeline is ready.


## Phase 4: Demonstration & Results

Now, let's test the system with a variety of challenging queries to see how it performs.
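Reading printed output works for a handful of queries, but a small scoring harness makes runs easier to compare. The sketch below assumes a pipeline function that returns the identified product name. `search_pipeline` as written only prints, so it would need to return that value to be plugged in. The labelled pairs and the `stub_pipeline` stand-in are hypothetical.

```python
def hit_rate(pipeline, labelled_queries) -> float:
    """Fraction of queries whose identified product matches the expected one."""
    hits = sum(1 for query, expected in labelled_queries if pipeline(query) == expected)
    return hits / len(labelled_queries)


# Hypothetical labels: product names come from the catalog, but the pairing of
# query to product is an assumption about user intent.
LABELLED = [
    ("aata", "Aashirvaad Select Atta 5kg"),
    ("kanda", "Onion 1kg (Kanda)"),
    ("cococola", "Coca-Cola Original Taste 750ml"),
]


def stub_pipeline(query: str):
    """Stand-in with canned answers; replace with a function returning the identified product."""
    canned = {
        "aata": "Aashirvaad Select Atta 5kg",
        "kanda": "Onion 1kg (Kanda)",
    }
    return canned.get(query)


print(f"hit rate: {hit_rate(stub_pipeline, LABELLED):.2f}")
```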

In [None]:
# --- Test Case 1: Simple Misspelling ---
search_pipeline("aata")

# --- Test Case 2: Vernacular Term ---
search_pipeline("kanda")

# --- Test Case 3: Brand Name + Misspelling ---
search_pipeline("cococola")

# --- Test Case 4: Ambiguous Query ---
search_pipeline("chese")

# --- Test Case 5: Highly Ambiguous / Vague Query ---
search_pipeline("drink")


Executing Pipeline for Query: 'aata'

[Stage 1: Semantic Retrieval]
Found the following products for context:
  - Aashirvaad Select Atta 5kg
  - Tata Salt 1kg
  - Maggi 2-Minute Noodles Masala
  - Amul Gold Milk 1L
  - Tomato 1kg

[Stage 2: LLM Correction & Selection]
LLM successfully parsed the query and returned:
{
  "corrected_query": "atta",
  "identified_product": "Aashirvaad Select Atta 5kg",
  "confidence": "High",
  "reasoning": "The query 'aata' is a common misspelling of 'atta', which is a well-known term in Indian cuisine, and the top match in the context is Aashirvaad Select Atta 5kg."
}

[Final Step: Search with Corrected Query]
Searching for the corrected term: 'atta'

Top 3 Product Results:
  1. Aashirvaad Select Atta 5kg (ID: 1)
  2. Tata Salt 1kg (ID: 3)
  3. Tomato 1kg (ID: 14)


Executing Pipeline for Query: 'kanda'

[Stage 1: Semantic Retrieval]
Found the following products for context:
  - Onion 1kg (Kanda)
  - Aashirvaad Select Atta 5kg
  - Fresh Lemon 4pcs (Nimbu)
  - Basmati Rice 1kg
  - Tata Salt 1kg

[Stage 2: LLM Correction & Selection]
LLM successfully parsed the query and returned:
{
  "corrected_query": "Onion",
  "identified_product": "Onion 1kg (Kanda)",
  "confidence": "High",
  "reasoning": "The query 'kanda' is a c

## Conclusion

This system successfully replicates the core strategy of Zepto's advanced search system. By combining fast semantic retrieval with intelligent LLM-based analysis, the system can:

- **Correct misspellings and slang**, as in the `aata` to `atta` correction shown above.
- **Understand multilingual queries** by matching them to the correct products.
- **Disambiguate queries** by using retrieved context to infer user intent (e.g., choosing between "cheese slices" and "cheese spread").
- **Provide structured, auditable outputs**, showing not just the correction but also the *reasoning* behind it.

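This notebook exercises the approach on a 22-product catalog and a handful of queries. At that scale, the RAG-based architecture shows a clear path toward improving user experience and search conversion rates; its robustness and scalability would need to be confirmed on a larger catalog and a labelled query set.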