# Introduction to Search Systems

## Learning Objectives
By the end of this notebook, you will be able to:
- Understand what a search system is and why it matters in e-commerce
- Identify the key components: query processing, indexing, retrieval, ranking, presentation
- Build and play with simple search functions over a toy product catalog
- Evaluate search quality with precision@K, recall@K, MAP, MRR, and NDCG



## Key Components of a Search System

1. Query Processing: tokenization, normalization, synonyms/typo handling
2. Indexing: build inverted index for fast lookup
3. Retrieval: collect candidate documents matching query terms
4. Ranking: score candidates (e.g., TF-IDF) and order by relevance
5. Presentation: facets, filters, suggestions, UI formatting
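The first four components can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production design: the mini-corpus `docs`, the whitespace `tokenize`, and the `retrieve`/`rank` helpers are all hypothetical names, and the score is a plain `tf * log(N / df)` sum rather than any particular library's TF-IDF variant.

```python
import math
from collections import defaultdict

# Hypothetical mini-corpus (doc_id -> text); not the notebook's catalog.
docs = {
    "D1": "apple smartphone with advanced camera",
    "D2": "android smartphone with great camera",
    "D3": "comfortable running shoes",
}

def tokenize(text):
    # Query processing (simplified): lowercase and split on whitespace.
    return text.lower().split()

# Indexing: term -> {doc_id: term frequency}
inverted_index = defaultdict(dict)
for doc_id, text in docs.items():
    for term in tokenize(text):
        inverted_index[term][doc_id] = inverted_index[term].get(doc_id, 0) + 1

def retrieve(query):
    # Retrieval: union of the postings of every query term.
    candidates = set()
    for term in tokenize(query):
        candidates |= set(inverted_index.get(term, {}))
    return candidates

def rank(query):
    # Ranking: sum of tf * idf over query terms, with idf = log(N / df).
    n_docs = len(docs)
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = inverted_index.get(term, {})
        if not postings:
            continue
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

candidates = retrieve("apple camera")
ranked = rank("apple camera")
print(sorted(candidates))
print(ranked)
```

Both camera phones are retrieved, but the rarer term "apple" carries a higher IDF, so the document containing it ranks first.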



In [15]:
import pandas as pd
import numpy as np

# Sample product data (tiny catalog for fast iteration)
products_data = [
    ("P001", "iPhone 14 Pro", "Latest Apple smartphone with advanced camera", "Electronics", 999.99, "Apple"),
    ("P002", "Samsung Galaxy S23", "Android smartphone with great camera", "Electronics", 799.99, "Samsung"),
    ("P003", "Nike Air Max", "Comfortable running shoes", "Shoes", 129.99, "Nike"),
    ("P004", "Adidas Ultraboost", "Premium running shoes", "Shoes", 180.00, "Adidas"),
    ("P005", "MacBook Pro 16", "Professional laptop for work", "Electronics", 2499.99, "Apple"),
    ("P006", "Dell XPS 13", "Ultrabook laptop", "Electronics", 1199.99, "Dell"),
    ("P007", "Red Dress", "Elegant evening dress", "Clothing", 89.99, "Fashion Brand"),
    ("P008", "Blue Jeans", "Classic denim jeans", "Clothing", 59.99, "Denim Co"),
]

df_products = pd.DataFrame(products_data, columns=["id","title","description","category","price","brand"])
print("Sample product Data (1 row):")
print("="*80)
print(df_products.head(1).T)
print(f"\nDataset Shape: {df_products.shape}")
print(f"Categories: {df_products.category.unique()}")
print(f"Brands: {df_products.brand.unique()}")
print(f"Price Range: ${df_products.price.min():.2f} - ${df_products.price.max():.2f}")



Sample product Data (1 row):
                                                        0
id                                                   P001
title                                       iPhone 14 Pro
description  Latest Apple smartphone with advanced camera
category                                      Electronics
price                                              999.99
brand                                               Apple

Dataset Shape: (8, 6)
Categories: ['Electronics' 'Shoes' 'Clothing']
Brands: ['Apple' 'Samsung' 'Nike' 'Adidas' 'Dell' 'Fashion Brand' 'Denim Co']
Price Range: $59.99 - $2499.99


In [16]:
# Quick data exploration examples
print("\nFind all Apple products:")
print(df_products[df_products.brand == "Apple"][['title','brand']])

print("\nFind all products under $200:")
print(df_products[df_products.price < 200][['title','price']])




Find all Apple products:
            title  brand
0   iPhone 14 Pro  Apple
4  MacBook Pro 16  Apple

Find all products under $200:
               title   price
2       Nike Air Max  129.99
3  Adidas Ultraboost  180.00
6          Red Dress   89.99
7         Blue Jeans   59.99


In [17]:
# Simple lexical search (playground)
def simple_search(products_df: pd.DataFrame, query: str):
    query_lower = str(query).lower()
    results = []
    for _, row in products_df.iterrows():
        title_match = query_lower in str(row['title']).lower()
        desc_match = query_lower in str(row['description']).lower()
        if title_match or desc_match:
            # Title hits weigh more than description hits:
            # title only = 2, description only = 1, both = 3
            score = (2 if title_match else 0) + (1 if desc_match else 0)
            results.append({'product': row.to_dict(), 'score': score})
    results.sort(key=lambda x: x['score'], reverse=True)
    return results

print('Testing simple search for "iPhone":')
for i, r in enumerate(simple_search(df_products, 'iPhone'), 1):
    p = r['product']
    print(f"{i}. {p['title']} - ${p['price']} (Score: {r['score']})")



Testing simple search for "iPhone":
1. iPhone 14 Pro - $999.99 (Score: 2)


In [20]:
# Improved search with tokenization and multi-term matching
def tokenize_text(text: str):
    punctuation_chars = ".,!?;:()[]{}'\"-"
    text = text.lower()
    for ch in punctuation_chars:
        text = text.replace(ch, ' ')
    return [w for w in text.split() if w]

def improved_search(products_df: pd.DataFrame, query: str):
    query_tokens = tokenize_text(query)
    results = []
    for _, row in products_df.iterrows():
        searchable_tokens = tokenize_text(f"{row['title']} {row['description']}")
        matches = sum(1 for t in query_tokens if t in searchable_tokens)
        if matches > 0:
            score = matches + (1 if any(t in row['title'].lower() for t in query_tokens) else 0)
            results.append({'product': row.to_dict(), 'score': score, 'matches': matches})
    results.sort(key=lambda x: x['score'], reverse=True)
    return results

print("Testing improved search for 'running shoes':")
for i, r in enumerate(improved_search(df_products, 'running shoes'), 1):
    p = r['product']
    print(f"{i}. {p['title']} - ${p['price']} (Score: {r['score']}, Matches: {r['matches']})")



Testing improved search for 'running shoes':
1. Nike Air Max - $129.99 (Score: 2, Matches: 2)
2. Adidas Ultraboost - $180.0 (Score: 2, Matches: 2)


## Why these metrics vs others (quick guide)
- Precision@K: top‑K quality; simple to interpret; ignores the order of results within the top K.
- Recall@K: coverage; useful for discovery; rises toward 1 as K grows regardless of ordering, so a large K can hide a poor ranking.
- MRR: time‑to‑first‑relevant; great for navigational queries; ignores subsequent relevant items.
- MAP: emphasizes ranking all relevant items high; binary labels; less aligned with graded relevance.
- NDCG: graded relevance + position discounting; aligns with user behavior; normalized for cross‑query comparability.

Rule of thumb:
- Navigational → MRR/Precision@K (small K)
- Discovery/browse → Recall@K + NDCG@K (larger K)
- Offline robustness → NDCG@K + MAP
- Always pair with online guardrails (CTR, Conversion, RPS, latency, zero‑results)


## How to choose the right evaluation metric (and why)

- Precision@K
  - What it measures: fraction of top-K results that are relevant
  - When it matters most: navigational/transactional queries where users expect high-quality results at the very top (e.g., brand, SKU, "iPhone 14 Pro")
  - Trade‑off: Raising precision can reduce recall if you return fewer diverse results

- Recall@K
  - What it measures: fraction of all relevant items that appear in the top K
  - When it matters most: exploratory/browse queries, long-tail attributes, when users might scroll or filter to find more options (e.g., "running shoes")
  - Trade‑off: Raising recall can lower precision by including more borderline results

- MAP (Mean Average Precision)
  - What it measures: average precision across ranks for each query, then averaged across queries
  - Why it’s useful: rewards ranking all relevant items high (not just the first one); stable across queries with multiple relevant items
  - When to prefer: offline relevance benchmarking with multiple relevant items per query

- MRR (Mean Reciprocal Rank)
  - What it measures: how quickly (at which rank) the first relevant item appears
  - Why it’s useful: simple, focused on the first hit; excellent for navigational queries (e.g., user wants a specific product)
  - When to prefer: scenarios where “first correct result fast” matters (e.g., product detail lookup)

- NDCG (Normalized Discounted Cumulative Gain)
  - What it measures: graded relevance with position discounting; rewards highly relevant results at the top
  - Why it’s useful: captures both relevance intensity and rank position; robust to multiple relevance levels
  - When to prefer: realistic ranking tasks with graded labels (e.g., “perfect”, “good”, “ok”); especially in e‑commerce where not all relevant items are equal

- Putting it together for e‑commerce
  - Checkout-focused searches (e.g., brand/SKU): prioritize MRR/Precision@K (K in [1,3])
  - Browse/discovery (e.g., “black running shoes”): prioritize Recall@K and NDCG@K (K in [10,20])
  - Overall offline evaluation: use NDCG@K and MAP for robustness; report Precision/Recall for interpretability
  - Online A/B guardrails: CTR, Add‑to‑Cart, Conversion Rate, Revenue/Search, plus Zero‑Results Rate and latency (p95/p99)
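The two operational guardrails named above, zero-results rate and tail latency, are simple to compute from a search log. The sketch below assumes a hypothetical log of `(query, result_count, latency_ms)` tuples; the field layout and values are made up for illustration.

```python
import numpy as np

# Hypothetical search log: (query, number of results, latency in ms).
search_log = [
    ("iphone", 12, 35.0),
    ("running shoes", 40, 48.0),
    ("tee", 0, 22.0),
    ("laptop", 7, 51.0),
    ("blue jeens", 0, 30.0),
    ("red dress", 3, 240.0),
]

result_counts = np.array([n for _, n, _ in search_log])
latencies_ms = np.array([ms for _, _, ms in search_log])

# Share of searches that returned nothing at all
zero_results_rate = float(np.mean(result_counts == 0))

# Tail latency: the value below which 95% / 99% of requests fall
p95_ms = float(np.percentile(latencies_ms, 95))
p99_ms = float(np.percentile(latencies_ms, 99))

print(f"Zero-results rate: {zero_results_rate:.1%}")
print(f"Latency p95: {p95_ms:.1f} ms, p99: {p99_ms:.1f} ms")
```

A single slow request barely moves the mean on a large log but dominates p95/p99, which is why the tail percentiles are the usual guardrail.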



## Why NDCG (vs only Precision/Recall)

1) Quick refresher:
- Precision@K: fraction of top‑K that are relevant
- Recall@K: fraction of all relevant items retrieved in top‑K

Limitations: both are binary (relevant vs not), treat every relevant item in the top K equally regardless of its position, and are sensitive to the choice of K.

2) Enter NDCG: graded relevance + position discounting
- DCG@K uses graded labels r_i (e.g., 2=highly relevant, 1=relevant, 0=irrelevant) with logarithmic discount so top ranks matter more.
- NDCG@K = DCG@K / IDCG@K normalizes by the ideal ranking, giving a 0–1 score comparable across queries.

3) Why NDCG is often better for search
- Accounts for rank position: a relevant result at rank 1 is better than at rank 10
- Supports graded relevance: not all relevant items are equal in e‑commerce (exact SKU match vs category match)
- Stable across queries: normalization makes scores comparable when query difficulty or number of relevant items differ
- Aligns with user behavior: users click top ranks more; discounts lower positions
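To make the position argument concrete, the sketch below scores the same three results in two different orders. The graded labels are hypothetical, and the helper names (`precision_at_3`, `dcg`, `ndcg`) are local to this example; the gain and discount follow the `(2^r - 1) / log2(i + 1)` form used in this notebook.

```python
import numpy as np

def precision_at_3(labels):
    # Binary view: any label > 0 counts as relevant.
    return sum(1 for r in labels[:3] if r > 0) / 3

def dcg(labels, k):
    # enumerate is 0-based, so rank i+1 gets discount log2(i + 2).
    return sum((2**r - 1) / np.log2(i + 2) for i, r in enumerate(labels[:k]))

def ndcg(labels, k):
    ideal = dcg(sorted(labels, reverse=True), k)
    return dcg(labels, k) / ideal if ideal > 0 else 0.0

# Hypothetical graded labels for the same three results in two orders
ranking_a = [2, 1, 0]  # best item first
ranking_b = [0, 1, 2]  # best item last

for name, labels in [("A", ranking_a), ("B", ranking_b)]:
    print(f"Ranking {name}: P@3={precision_at_3(labels):.3f}, NDCG@3={ndcg(labels, 3):.3f}")
```

Precision@3 is identical for both orders because it only counts relevant items, while NDCG@3 drops for ranking B because the highly relevant item sits at the bottom.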

4) When to use which
- Use MRR/Precision@K for navigational queries (find 1 best item fast)
- Use Recall@K when coverage matters (browse/discovery, facet exploration)
- Use NDCG@K for overall ranking quality with graded labels and position sensitivity (most e‑commerce use‑cases)
- Use MAP when you care about ranking all relevant items high without graded labels

5) Practical choices (retail search)
- Report NDCG@K with K in {3, 10, 20} (K=3 for first row; K=10/20 for page‑level)
- Pair with Precision@K for interpretability and Recall@K for coverage
- Always check zero‑results rate and latency; great relevance with poor coverage/latency fails users

6) Pitfalls and tips
- Garbage in, garbage out: NDCG needs sensible graded relevance labels
- Beware label imbalance (few positives): add multiple K’s and query buckets
- Don’t overfit K: pick K based on UX (e.g., products above the fold)
- Compare distributions, not just means; include confidence intervals when reporting
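One common way to attach a confidence interval to a mean metric is a percentile bootstrap over queries. The sketch below is an assumption-laden illustration: `per_query_ndcg` holds made-up per-query NDCG@10 values, and 2000 resamples with a fixed seed are arbitrary choices.

```python
import numpy as np

# Hypothetical per-query NDCG@10 values for one ranker
per_query_ndcg = np.array([0.91, 0.45, 0.78, 0.62, 1.00, 0.33, 0.85, 0.70, 0.58, 0.96])

rng = np.random.default_rng(seed=0)
n_resamples = 2000

# Resample queries with replacement and record the mean of each resample
boot_means = np.array([
    rng.choice(per_query_ndcg, size=per_query_ndcg.size, replace=True).mean()
    for _ in range(n_resamples)
])

# Percentile bootstrap: the middle 95% of resampled means
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"Mean NDCG@10: {per_query_ndcg.mean():.3f}")
print(f"95% bootstrap CI: [{ci_low:.3f}, {ci_high:.3f}]")
```

With only ten queries the interval is wide, which is the point: a small difference in mean NDCG between two rankers may not be distinguishable from noise.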

7) Interview sound‑bites you can use
- “Precision/Recall ignore rank; NDCG discounts lower positions, matching user scan behavior.”
- “NDCG supports graded relevance, critical when exact match ≻ partial match.”
- “We report NDCG@3/10 plus Precision@K and Recall@K, with business guardrails (CTR, conversion, latency).”
- “Normalization (IDCG) makes NDCG comparable across queries with different numbers of relevant items.”


## Practice: Build intuition by tweaking the code

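Try these small experiments (edit the relevant code cells in this notebook and re-run):

1) Precision vs Recall
- Change K in the `precision_at_k`/`recall_at_k` loop (e.g., K ∈ {1,2,3,5,10}). Explain why Precision usually decreases as K grows while Recall never decreases. Note that `precision_at_k` in this notebook divides by `min(k, len(retrieved_items))`, so Precision stops falling once K exceeds the number of results returned.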
- Modify `query_relevance` to add/remove relevant IDs. Predict and then verify impacts on Precision/Recall.

2) Title boost and ranking
- In `improved_search`, change the title boost from `+1` to `+2` or `0`. How does this affect the top-3 for queries like "Apple", "running shoes"? Why?
- Add a small price-related boost (e.g., favor cheaper or mid-range) and explain trade-offs for business goals.

3) MRR sensitivity
- Introduce an irrelevant top-1 result for a query by altering tokenization or boosts. Observe MRR drop and explain why MRR is sensitive to the first relevant rank.

4) NDCG with graded relevance
- Replace the toy `scores` with graded labels (e.g., 2 for exact match, 1 for category match, 0 otherwise) computed from your catalog.
- Compare NDCG@3 vs NDCG@10 for the same query. Why might they differ?

5) Offline to online mapping
- Propose an offline metric target (e.g., +0.02 NDCG@10) and map it to online guardrails (CTR, Conversion, Revenue/Search, zero-results, latency). What risks should you watch for in e‑commerce?

6) Edge cases
- Create a query with zero results. Add synonyms (e.g., "tee" → "t-shirt") to fix it; measure the change.
- Create a highly ambiguous query (e.g., "apple"). How would you balance relevance with diversity? What metrics would you monitor?
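For the zero-results exercise, one possible approach is query-side synonym expansion. The sketch below is a standalone illustration: the `SYNONYMS` map, the two-item `catalog`, and the `expand_query`/`search_titles` helpers are hypothetical and use plain whitespace tokenization.

```python
# Hypothetical synonym map; a real one would come from curation or query logs.
SYNONYMS = {
    "tee": ["t-shirt", "tshirt"],
    "sneakers": ["shoes"],
}

# Hypothetical mini-catalog (id -> title)
catalog = {
    "X1": "Classic cotton t-shirt",
    "X2": "Trail running shoes",
}

def expand_query(query):
    # Keep the original tokens and append any mapped synonyms.
    tokens = query.lower().split()
    expanded = list(tokens)
    for tok in tokens:
        expanded.extend(SYNONYMS.get(tok, []))
    return expanded

def search_titles(query, use_synonyms):
    # Match if any (possibly expanded) query term equals a title token.
    terms = expand_query(query) if use_synonyms else query.lower().split()
    return [pid for pid, title in catalog.items()
            if any(term in title.lower().split() for term in terms)]

print(search_titles("tee", use_synonyms=False))
print(search_titles("tee", use_synonyms=True))
```

When porting this idea to `improved_search`, keep in mind that `tokenize_text` replaces hyphens with spaces, so synonyms should be passed through the same tokenizer as the catalog text.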



## MAP (Mean Average Precision)

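- Per-query Average Precision (AP): sum the precision at each rank $k$ where a relevant item appears, then divide by the total number of relevant items $\lvert R_q \rvert$. Relevant items that are never retrieved contribute zero. In the formula, $K$ is the length of the ranked list and $\mathbb{1}[rel_k]$ is 1 when the item at rank $k$ is relevant.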

$$\mathrm{AP}(q) = \frac{1}{\lvert R_q \rvert} \sum_{k=1}^{K} \mathbb{1}[rel_k] \cdot \mathrm{P@}k$$

- Mean Average Precision over queries:

$$\mathrm{MAP} = \frac{1}{\lvert Q \rvert} \sum_{q\in Q} \mathrm{AP}(q)$$

Why MAP: rewards ranking all relevant items high (not just the first), more stable when multiple relevant items exist per query. Prefer MAP when labels are binary and you care about full ordering of relevant items.
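As a worked example with hypothetical numbers: suppose a query has three relevant items ($\lvert R_q \rvert = 3$) and the system returns five results, with relevant items at ranks 1 and 3 only. Precision at those ranks is $1/1$ and $2/3$, so:

$$\mathrm{AP}(q) = \frac{1}{3}\left(\frac{1}{1} + \frac{2}{3}\right) = \frac{5}{9} \approx 0.556$$

The third relevant item was never retrieved and contributes zero. Averaging over the two hits alone would give about 0.833 and hide the miss, which is why the normalizer is $\lvert R_q \rvert$ rather than the number of hits.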



In [None]:
# Compute MAP over example queries using improved_search outputs
from typing import List

def average_precision(relevant: set, retrieved: List[str]) -> float:
    if not relevant:
        return 0.0
    hits = 0
    precisions = []
    for k, rid in enumerate(retrieved, start=1):
        if rid in relevant:
            hits += 1
            precisions.append(hits / k)
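    # Normalize by the number of relevant items (as in the AP formula),
    # so relevant items that were never retrieved lower the score
    return float(sum(precisions)) / len(relevant)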

# Fallback local relevance mapping if not already defined above
if 'query_relevance' not in globals():
    query_relevance = {
        'iPhone': {'P001'},
        'laptop': {'P005','P006'},
        'shoes': {'P003','P004'},
    }

queries = ['iPhone','laptop','shoes']
APs = []
for q in queries:
    results = improved_search(df_products, q)
    retrieved = [r['product']['id'] for r in results]
    relevant = set(query_relevance[q])
    APs.append(average_precision(relevant, retrieved))

print("Per-query AP:")
for q, ap in zip(queries, APs):
    print(f"  {q}: {ap:.3f}")
print("MAP:", f"{np.mean(APs):.3f}")


Per-query AP:
  iPhone: 1.000
  laptop: 1.000
  shoes: 1.000
MAP: 1.000


## Evaluation Metrics

- Precision@K: Of the top K results, what fraction are relevant?

$$\mathrm{P@K} = \frac{\lvert\text{Relevant in top K}\rvert}{K}$$

- Recall@K: Of all relevant items, how many did we retrieve in top K?

$$\mathrm{R@K} = \frac{\lvert\text{Relevant in top K}\rvert}{\lvert\text{Total relevant}\rvert}$$

- Reciprocal Rank for a query (0 if none):

$$\mathrm{RR} = \frac{1}{\text{rank of first relevant}}$$

- Mean Reciprocal Rank over queries Q:

$$\mathrm{MRR} = \frac{1}{\lvert Q \rvert} \sum_{q \in Q} \mathrm{RR}_q$$

- Discounted Cumulative Gain@K (graded relevance r_i):

$$\mathrm{DCG@K} = \sum_{i=1}^{K} \frac{2^{r_i}-1}{\log_2(i+1)}$$

- Normalized DCG@K:

$$\mathrm{NDCG@K} = \frac{\mathrm{DCG@K}}{\mathrm{IDCG@K}}$$



In [24]:
# Inline metric helpers (binary relevance)

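def precision_at_k(relevant_items, retrieved_items, k: int) -> float:
    # Note: this variant divides by min(k, number retrieved) rather than K,
    # so a short result list is not penalized. The P@K formula above always
    # divides by K; swap the denominator for `k` to match it exactly.
    if k == 0 or not retrieved_items:
        return 0.0
    top_k = set(retrieved_items[:k])
    rel = set(relevant_items)
    return len(top_k & rel) / min(k, len(retrieved_items))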


def recall_at_k(relevant_items, retrieved_items, k: int) -> float:
    rel = set(relevant_items)
    if not rel:
        return 0.0
    top_k = set(retrieved_items[:k])
    return len(top_k & rel) / len(rel)

query_relevance = {
    'iPhone': {'P001'},
    'laptop': {'P005','P006'},
    'shoes': {'P003','P004'},
}

q = 'iPhone'
ret = [r['product']['id'] for r in improved_search(df_products, q)]
rel = query_relevance[q]

for k in [1,2,3]:
    p = precision_at_k(rel, ret, k)
    r = recall_at_k(rel, ret, k)
    print(f"{q}: P@{k}={p:.3f}, R@{k}={r:.3f}")



iPhone: P@1=1.000, R@1=1.000
iPhone: P@2=1.000, R@2=1.000
iPhone: P@3=1.000, R@3=1.000


In [32]:
# MRR across multiple queries (binary relevance)
def reciprocal_rank_for_query(relevant_ids, retrieved_ids):
    relevant = set(relevant_ids)
    for idx, pid in enumerate(retrieved_ids, start=1):
        if pid in relevant:
            return 1.0/idx
    return 0.0

queries = ['iPhone','laptop','shoes']

query_relevance = {
    'iPhone': {'P001'},
    'laptop': {'P005','P006'},
    'shoes': {'P003','P004'},
}

rrs = []
for q in queries:
    ret = [r['product']['id'] for r in improved_search(df_products, q)]
    rel = query_relevance[q]
    rrs.append(reciprocal_rank_for_query(rel, ret))

mrr = float(np.mean(rrs)) if rrs else 0.0
print(f"MRR over {queries}: {mrr:.3f}")



MRR over ['iPhone', 'laptop', 'shoes']: 1.000


In [33]:
# Graded relevance NDCG demo (keeps intuition clear)

def dcg_at_k(relevance_scores, k):
    dcg = 0.0
    for i in range(min(k, len(relevance_scores))):
        dcg += (2**relevance_scores[i] - 1) / np.log2(i + 2)
    return dcg

def ndcg_at_k(relevance_scores, k):
    dcg = dcg_at_k(relevance_scores, k)
    ideal = sorted(relevance_scores, reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg / idcg if idcg > 0 else 0.0

scores = [2,0,1,0,2]
for k in [1,3,5]:
    print(f"DCG@{k}={dcg_at_k(scores,k):.3f}, NDCG@{k}={ndcg_at_k(scores,k):.3f}")



DCG@1=3.000, NDCG@1=1.000
DCG@3=3.500, NDCG@3=0.649
DCG@5=4.661, NDCG@5=0.864


## Organizing Code with a Class
Benefits: encapsulation, reusability, maintainability. We'll expose a clean `search()` and reuse earlier tokenization and scoring ideas.



In [35]:
class SimpleSearchSystem:
    def __init__(self, products_df: pd.DataFrame):
        self.products_df = products_df
        self.query_relevance = {
            'iPhone': {'P001'},
            'laptop': {'P005','P006'},
            'shoes': {'P003','P004'},
        }
    def tokenize_text(self, text: str):
        punctuation_chars = ".,!?;:()[]{}'\"-"
        text = str(text).lower()
        for ch in punctuation_chars:
            text = text.replace(ch, ' ')
        return [w for w in text.split() if w]
    def search(self, query: str, top_k: int = 5):
        qtokens = self.tokenize_text(query)
        results = []
        for _, row in self.products_df.iterrows():
            searchable = self.tokenize_text(f"{row['title']} {row['description']}")
            matches = sum(1 for t in qtokens if t in searchable)
            if matches > 0:
                score = matches + (1 if any(t in str(row['title']).lower() for t in qtokens) else 0)
                results.append({'product': row.to_dict(), 'score': score, 'matches': matches})
        results.sort(key=lambda x: x['score'], reverse=True)
        return results[:top_k]
    def evaluate_query(self, query: str, k: int = 3):
        results = self.search(query, top_k=k)
        retrieved = [r['product']['id'] for r in results]
        relevant = self.query_relevance.get(query, set())
        p = precision_at_k(relevant, retrieved, k)
        r = recall_at_k(relevant, retrieved, k)
        return {'query': query, 'retrieved': retrieved, 'relevant': list(relevant), f'precision@{k}': p, f'recall@{k}': r}

s = SimpleSearchSystem(df_products)
for q in ['iPhone','laptop','shoes']:
    print(s.evaluate_query(q, k=3))



{'query': 'iPhone', 'retrieved': ['P001'], 'relevant': ['P001'], 'precision@3': 1.0, 'recall@3': 1.0}
{'query': 'laptop', 'retrieved': ['P005', 'P006'], 'relevant': ['P006', 'P005'], 'precision@3': 1.0, 'recall@3': 1.0}
{'query': 'shoes', 'retrieved': ['P003', 'P004'], 'relevant': ['P003', 'P004'], 'precision@3': 1.0, 'recall@3': 1.0}


## Search vs Recommendations

| Aspect | Search | Recommendations |
|---|---|---|
| Trigger | User types a query | System suggests proactively |
| Intent | Specific, known | Exploratory, discovery |
| Query | Required | None required |
| Context | Current session | User history, behavior |
| Goal | Find exact/close match | Discover related/new items |
| Control | User-driven | System-driven |



## Interview Prep and Key Takeaways

- Basic: how search works; search vs recommendations; measuring quality
- Technical: typos/normalization; inverted index; ranking signals
- Scale: billions of docs; latency budgets; freshness; caching
- Business: conversion; A/B tests; internationalization; zero-results handling

Key points:
- Relevance and latency drive conversion in retail search
- Evaluate with P@K, R@K, MAP, MRR, NDCG; connect to business metrics
- Clean interfaces enable iteration and experiments

Next steps: dive into TF-IDF/BM25, query understanding, and learning-to-rank.



## Interview Study Guide: What to Know Cold

- Core IR concepts
  - Inverted index, postings lists, document frequency (DF), term frequency (TF)
  - Text processing: tokenization, normalization, stemming vs lemmatization, stopwords, phrase queries
  - Query-likelihood vs vector-space models; BM25 intuition vs TF–IDF
- Ranking signals and scoring
  - Matching features: exact term match, field boosts (title vs description), BM25/TF–IDF
  - Business boosts: popularity, freshness, availability, margin, personalization hooks
  - Re-ranking: rule-based boosts, learning-to-rank (pointwise/pairwise/listwise)
- Query understanding
  - Spell correction (edit distance), synonyms, query expansion, entity/attribute detection
  - Facets and filters; handling numeric ranges (price), units (inch vs in), categories
- Evaluation (offline)
  - Precision@K, Recall@K, MAP, MRR, NDCG; graded vs binary relevance
  - Labeling strategies: heuristics, click-derived labels, human annotation
- Evaluation (online)
  - A/B testing basics: metrics (CTR, Conversion, Revenue/Search, Add-to-cart), guardrails (latency, zero-results rate)
  - Power, MDE, significance, sample size; experiment length and bucketing
- E-commerce specifics
  - Query intent (navigational vs informational vs transactional), SKU vs product grouping
  - Typos, variants ("tee" vs "t-shirt"), attribute-aware search (color/size/brand), internationalization
  - Merchandising rules, inventory/freshness, cold start, seasonality
- System design at a glance
  - Indexing pipeline, shards/replicas, caching (Redis), search API (FastAPI), logging/monitoring (latency, p95/p99)
  - Backfills, reindexing strategies, blue/green deploys, safety switches
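The spell-correction item in the query-understanding bullet mentions edit distance. A minimal sketch of Levenshtein distance and a naive nearest-word suggestion follows; the `vocabulary` list, the `suggest` helper, and the `max_distance=2` threshold are hypothetical choices, and scanning the whole vocabulary per term would not scale to a real catalog.

```python
def edit_distance(a: str, b: str) -> int:
    # Levenshtein distance via dynamic programming, keeping one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]

# Hypothetical vocabulary of known catalog terms
vocabulary = ["jeans", "dress", "laptop", "shoes"]

def suggest(term, max_distance=2):
    # Pick the closest known word, or None if nothing is close enough.
    best = min(vocabulary, key=lambda w: edit_distance(term, w))
    return best if edit_distance(term, best) <= max_distance else None

print(suggest("jeens"))
print(suggest("laptpo"))
```

Plain Levenshtein counts a transposition such as "laptpo" as two edits, which is one reason some systems use a variant that treats adjacent swaps as a single edit.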
