# Real Embeddings Demo

This notebook demonstrates **actual embeddings** using `sentence-transformers` with the `all-MiniLM-L6-v2` model.

**Model Details:**
- Size: ~80MB (downloads on first run)
- Dimensions: 384
- Speed: Fast, works on CPU

**What we'll explore:**
1. Basic semantic similarity
2. Sentence embeddings vs word embeddings
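3. Embedding failure modes (negation blindness, entity confusion, numerical blindness)
4. Simple semantic search
5. Hybrid search and metadata filtering with ChromaDB
6. Hierarchical (summary-first) retrieval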

**Prerequisites:**
```bash
pip install sentence-transformers chromadb
```


In [9]:
# ==========================================================================
# STEP 1: Load the embedding model
# ==========================================================================
# sentence-transformers wraps HuggingFace models for easy sentence embedding.
# all-MiniLM-L6-v2 is a popular choice: small, fast, and good quality.
#
# First run downloads the model (~80MB). Subsequent runs use cached version.
# ==========================================================================

from sentence_transformers import SentenceTransformer
import numpy as np

print("Loading all-MiniLM-L6-v2...")
model = SentenceTransformer('all-MiniLM-L6-v2')
print(f"Model loaded! Embedding dimension: {model.get_sentence_embedding_dimension()}")


Loading all-MiniLM-L6-v2...
Model loaded! Embedding dimension: 384


In [10]:
# ==========================================================================
# STEP 2: Basic embedding generation
# ==========================================================================
# Let's see what an embedding actually looks like.
# It's just a vector of 384 floating-point numbers (a NumPy array)!
# ==========================================================================

text = "The quick brown fox jumps over the lazy dog."
embedding = model.encode(text)

print(f"Text: '{text}'")
print(f"\nEmbedding shape: {embedding.shape}")
print(f"\nFirst 10 dimensions: {embedding[:10].round(4)}")
print(f"\nEmbedding range: [{embedding.min():.4f}, {embedding.max():.4f}]")
print(f"Embedding norm: {np.linalg.norm(embedding):.4f}")
print("\n(Embeddings are normalized, so norm ≈ 1.0)")


Text: 'The quick brown fox jumps over the lazy dog.'

Embedding shape: (384,)

First 10 dimensions: [ 0.0439  0.0589  0.0482  0.0775  0.0267 -0.0376 -0.0026 -0.0599 -0.0025
  0.0221]

Embedding range: [-0.1330, 0.1768]
Embedding norm: 1.0000

(Embeddings are normalized, so norm ≈ 1.0)
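Because the embedding has norm 1.0, cosine similarity between two such vectors reduces to a plain dot product. The following is a minimal sketch of that identity using made-up 3-dimensional vectors (not real model output); the helper names `norm`, `normalize`, `dot`, and `cosine` are illustrative only.

```python
import math

def norm(v):
    """Euclidean (L2) length of a vector."""
    return math.sqrt(sum(x * x for x in v))

def normalize(v):
    """Scale a vector to unit length."""
    n = norm(v)
    return [x / n for x in v]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity, valid for vectors of any non-zero length."""
    return dot(a, b) / (norm(a) * norm(b))

# Toy vectors, chosen only for illustration
a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

a_unit, b_unit = normalize(a), normalize(b)

print(f"norm(a_unit)        = {norm(a_unit):.4f}")
print(f"cosine(a, b)        = {cosine(a, b):.4f}")
print(f"dot(a_unit, b_unit) = {dot(a_unit, b_unit):.4f}")
```

The last two printed values agree, which is why vector stores that hold unit-length embeddings can often rank by dot product alone.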


In [11]:
# ==========================================================================
# STEP 3: Semantic similarity — the core use case
# ==========================================================================
# Embeddings capture MEANING. Similar meanings → similar vectors.
# We measure similarity using cosine similarity (dot product for unit vectors).
#
# This is the foundation of:
#   - Semantic search (find similar documents)
#   - RAG (retrieve relevant context)
#   - Clustering (group similar items)
# ==========================================================================

def cosine_similarity(a, b):
    """Cosine similarity between two vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Test sentences
sentences = [
    "I love programming in Python.",
    "Python is my favorite programming language.",
    "I enjoy coding with Python.",
    "The weather is nice today.",
    "It's a beautiful sunny day.",
    "My cat is sleeping on the couch.",
]

# Encode all sentences at once (batched for efficiency)
embeddings = model.encode(sentences)

print("Semantic Similarity Matrix")
print("=" * 70)
print("\nSentences:")
for i, s in enumerate(sentences):
    print(f"  [{i}] {s}")

print("\nSimilarity scores (1.0 = identical meaning, 0.0 = unrelated):")
print("-" * 70)

# Compare first sentence to all others
base = embeddings[0]
for i, (sent, emb) in enumerate(zip(sentences, embeddings)):
    sim = cosine_similarity(base, emb)
    bar = "█" * int(sim * 40)
    print(f"  [0] vs [{i}]: {sim:.3f} {bar}")

print("\n" + "-" * 70)
print("Notice: [0], [1], [2] are semantically similar (all about Python coding).")
print("[3], [4] are about weather. [5] is about cats — unrelated to [0].")


Semantic Similarity Matrix

Sentences:
  [0] I love programming in Python.
  [1] Python is my favorite programming language.
  [2] I enjoy coding with Python.
  [3] The weather is nice today.
  [4] It's a beautiful sunny day.
  [5] My cat is sleeping on the couch.

Similarity scores (1.0 = identical meaning, 0.0 = unrelated):
----------------------------------------------------------------------
  [0] vs [0]: 1.000 ████████████████████████████████████████
  [0] vs [1]: 0.878 ███████████████████████████████████
  [0] vs [2]: 0.931 █████████████████████████████████████
  [0] vs [3]: 0.063 ██
  [0] vs [4]: 0.138 █████
  [0] vs [5]: 0.095 ███

----------------------------------------------------------------------
Notice: [0], [1], [2] are semantically similar (all about Python coding).
[3], [4] are about weather. [5] is about cats — unrelated to [0].
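The cell above compares only sentence [0] against the rest. A full pairwise matrix can be computed in one step by normalizing the rows and multiplying the matrix by its transpose. This is a sketch on toy 2-D vectors (an assumption, not model output), and `cosine_similarity_matrix` is a hypothetical helper name.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity for the rows of a 2-D array."""
    vectors = np.asarray(vectors, dtype=float)
    # Divide each row by its own L2 norm so every row has length 1
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms
    # Dot products of unit vectors are cosine similarities
    return unit @ unit.T

# Toy vectors, chosen only for illustration
toy = np.array([
    [1.0, 0.0],
    [1.0, 1.0],
    [0.0, 2.0],
])
sim_matrix = cosine_similarity_matrix(toy)
print(sim_matrix.round(3))
```

Passing the `embeddings` array from the cell above in place of `toy` should give the 6×6 matrix for the six test sentences.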


In [12]:
# ==========================================================================
# STEP 4: Sentence embeddings vs word embeddings
# ==========================================================================
# The famous "king - man + woman ≈ queen" analogy comes from WORD embeddings
# (like Word2Vec), which are trained on individual tokens.
#
# Sentence embeddings are trained to capture the meaning of a WHOLE sentence.
# Single words can still be encoded, but the arithmetic is not guaranteed
# to behave the same way.
#
# Let's see what happens:
# ==========================================================================

words = ["king", "queen", "man", "woman", "prince", "princess"]
word_embeddings = {w: model.encode(w) for w in words}

print("Word Embedding Similarities")
print("=" * 70)

pairs = [
    ("king", "queen"),
    ("king", "man"),
    ("queen", "woman"),
    ("prince", "princess"),
    ("king", "prince"),
]

for w1, w2 in pairs:
    sim = cosine_similarity(word_embeddings[w1], word_embeddings[w2])
    bar = "█" * int(sim * 40)
    print(f"  {w1:10} ↔ {w2:10}: {sim:.3f} {bar}")

print("\n" + "-" * 70)
print("Vector arithmetic: king - man + woman = ?")
print("-" * 70)

result = word_embeddings["king"] - word_embeddings["man"] + word_embeddings["woman"]

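# Find the closest word among the remaining candidates (input words excluded).
# Only queen, prince, and princess are left, so this is a weak test: the
# input words can still score higher than the winner in the full listing below.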
best_word = None
best_sim = -1
for word, emb in word_embeddings.items():
    if word not in ["king", "man", "woman"]:
        sim = cosine_similarity(result, emb)
        if sim > best_sim:
            best_sim = sim
            best_word = word

print("\nFinding closest word to (king - man + woman):")
for word, emb in word_embeddings.items():
    sim = cosine_similarity(result, emb)
    bar = "█" * int(sim * 40) if sim > 0 else ""
    marker = " ← closest!" if word == best_word else ""
    print(f"  {word:10}: {sim:.3f} {bar}{marker}")

print("\n" + "-" * 70)
print(f"Result: Closest match is '{best_word}' with similarity {best_sim:.3f}")
if best_word == "queen":
    print("SUCCESS: The analogy worked!")
else:
    print("The analogy didn't perfectly work — this is EXPECTED.")
print()
print("NOTE: Sentence embeddings don't preserve word-level arithmetic.")
print("For classic king-man+woman=queen, use Word2Vec or GloVe.")
print("Sentence embeddings excel at comparing SENTENCES, not single words.")


Word Embedding Similarities
  king       ↔ queen     : 0.681 ███████████████████████████
  king       ↔ man       : 0.322 ████████████
  queen      ↔ woman     : 0.439 █████████████████
  prince     ↔ princess  : 0.681 ███████████████████████████
  king       ↔ prince    : 0.588 ███████████████████████

----------------------------------------------------------------------
Vector arithmetic: king - man + woman = ?
----------------------------------------------------------------------

Finding closest word to (king - man + woman):
  king      : 0.631 █████████████████████████
  queen     : 0.579 ███████████████████████ ← closest!
  man       : -0.236 
  woman     : 0.628 █████████████████████████
  prince    : 0.388 ███████████████
  princess  : 0.442 █████████████████

----------------------------------------------------------------------
Result: Closest match is 'queen' with similarity 0.579
SUCCESS: The analogy worked!

NOTE: Sentence embeddings don't preserve word-level arithmetic.
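Whether an analogy "works" depends heavily on which candidates are allowed to win. The sketch below ranks candidates both with and without the input words, using hypothetical 2-D vectors built so that the analogy holds exactly; `rank_analogy` and `toy_vectors` are illustrative names, not part of any library.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def rank_analogy(vectors, positive, negative, exclude_inputs=True):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for w in positive:
        target = [t + v for t, v in zip(target, vectors[w])]
    for w in negative:
        target = [t - v for t, v in zip(target, vectors[w])]
    # Optionally drop the words that were used to build the target vector
    skip = set(positive) | set(negative) if exclude_inputs else set()
    scored = [(w, cosine(target, v)) for w, v in vectors.items() if w not in skip]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical vectors: axis 0 ~ "royalty", axis 1 ~ "gender"
toy_vectors = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
}

ranked = rank_analogy(toy_vectors, positive=["king", "woman"], negative=["man"])
ranked_all = rank_analogy(toy_vectors, ["king", "woman"], ["man"], exclude_inputs=False)
print("Inputs excluded:", ranked)
print("Inputs included:", ranked_all)
```

With real sentence-embedding vectors, comparing the two rankings shows how much of the result comes from the exclusion rule rather than from the arithmetic itself.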


In [13]:
# ==========================================================================
# STEP 5: FAILURE MODE — Negation Blindness
# ==========================================================================
# This is a CRITICAL limitation for RAG systems!
#
# Sentence embeddings tend to encode topic and shared vocabulary.
# "X is secure" and "X is not secure" differ by a single token and
# discuss the same subject, so they land close together in vector space.
#
# Result: Negation is often UNDERWEIGHTED in similarity calculations.
# ==========================================================================

print("Negation Blindness Demo")
print("=" * 70)
print()
print("PROBLEM: Embeddings often treat 'X' and 'not X' as similar!")
print("This is dangerous for retrieval — you might get opposite meaning.")
print()

negation_pairs = [
    ("The system is secure.", "The system is not secure."),
    ("I love this movie.", "I hate this movie."),
    ("The product works well.", "The product does not work."),
    ("This is allowed.", "This is not allowed."),
    ("Payment was successful.", "Payment failed."),
]

print("Similarity between sentences and their NEGATIONS:")
print("-" * 70)

for pos, neg in negation_pairs:
    emb_pos = model.encode(pos)
    emb_neg = model.encode(neg)
    sim = cosine_similarity(emb_pos, emb_neg)
    
    # Flag the result: high similarity for opposite meanings = BAD
    warning = "⚠️ HIGH!" if sim > 0.7 else ("  (good)" if sim < 0.5 else "")
    bar = "█" * int(sim * 40)
    
    print(f"\n  '{pos}'")
    print(f"  '{neg}'")
    print(f"  Similarity: {sim:.3f} {bar} {warning}")

print("\n" + "-" * 70)
print("TAKEAWAY: Don't rely on embeddings alone for negation-sensitive queries.")
print("Consider hybrid search (embedding + keyword) or reranking.")


Negation Blindness Demo

PROBLEM: Embeddings often treat 'X' and 'not X' as similar!
This is dangerous for retrieval — you might get opposite meaning.

Similarity between sentences and their NEGATIONS:
----------------------------------------------------------------------

  'The system is secure.'
  'The system is not secure.'
  Similarity: 0.829 █████████████████████████████████ ⚠️ HIGH!

  'I love this movie.'
  'I hate this movie.'
  Similarity: 0.721 ████████████████████████████ ⚠️ HIGH!

  'The product works well.'
  'The product does not work.'
  Similarity: 0.642 █████████████████████████ 

  'This is allowed.'
  'This is not allowed.'
  Similarity: 0.875 ██████████████████████████████████ ⚠️ HIGH!

  'Payment was successful.'
  'Payment failed.'
  Similarity: 0.660 ██████████████████████████ 

----------------------------------------------------------------------
TAKEAWAY: Don't rely on embeddings alone for negation-sensitive queries.
Consider hybrid search (embedding + keyword) or reranking.
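One cheap guard that can sit next to embedding search is a lexical polarity check: count negation cues in the query and in each retrieved text, and flag pairs where the parity differs. The sketch below is a crude heuristic under stated assumptions (a small hand-picked cue list, English text, straight apostrophes), not a replacement for a reranker; `NEGATION_CUES`, `negation_count`, and `polarity_mismatch` are hypothetical names.

```python
import re

# Assumed, deliberately small cue list; a real system would need a richer one
NEGATION_CUES = {"not", "no", "never", "cannot", "without"}

def negation_count(text):
    """Count negation cues, including contractions such as "isn't"."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(1 for t in tokens if t in NEGATION_CUES or t.endswith("n't"))

def polarity_mismatch(a, b):
    """True when exactly one of the two texts has an odd number of negation cues."""
    return negation_count(a) % 2 != negation_count(b) % 2

pairs = [
    ("The system is secure.", "The system is not secure."),
    ("This is allowed.", "This isn't allowed."),
    ("The system is secure.", "The system is safe."),
    ("I love this movie.", "I hate this movie."),  # antonyms: NOT caught
]
for a, b in pairs:
    print(f"{polarity_mismatch(a, b)!s:5}  {a!r} vs {b!r}")
```

The last pair shows the limit of the approach: opposites expressed through antonyms carry no negation cue, so they still need a reranker or a similar second stage.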

In [14]:
# ==========================================================================
# STEP 6: FAILURE MODE — Entity Confusion
# ==========================================================================
# Embeddings capture semantic SIMILARITY, not identity.
# Similar entities (e.g., PostgreSQL vs MySQL) look similar in embedding space.
#
# This can be a problem when you need to distinguish between specific entities.
# ==========================================================================

print("Entity Confusion Demo")
print("=" * 70)
print()
print("PROBLEM: Embeddings see 'database config' as similar regardless of WHICH database!")
print()

# Real-world scenario: Generic query that doesn't mention the entity
# This is where confusion happens - user knows what they want but query is vague
print("Scenario: User wants PostgreSQL help but query doesn't mention it")
print()

query = "How do I configure database connection pooling?"  # No entity name!
docs = [
    ("PostgreSQL uses pgBouncer for connection pooling configuration.", "PostgreSQL"),
    ("MySQL connection pooling is configured in my.cnf with max_connections.", "MySQL"),
    ("MongoDB connection pools are set in the connection string options.", "MongoDB"),
]

query_emb = model.encode(query)

print(f"Query: '{query}'")
print()
print("Document rankings by embedding similarity:")
print("-" * 70)

results = []
for doc_text, db_name in docs:
    doc_emb = model.encode(doc_text)
    sim = cosine_similarity(query_emb, doc_emb)
    results.append((sim, doc_text, db_name))

# Sort by similarity only (highest first), so ties never fall back to comparing text
results.sort(key=lambda r: r[0], reverse=True)

for i, (sim, doc_text, db_name) in enumerate(results, 1):
    bar = "█" * int(sim * 40)
    print(f"  {i}. [{sim:.3f}] {bar}")
    print(f"     [{db_name}] {doc_text}")
    print()

# Calculate how close the scores are
scores = [r[0] for r in results]
score_spread = scores[0] - scores[-1]

print("-" * 70)
print(f"OBSERVATION: Score spread is only {score_spread:.3f}!")
print("All three databases score similarly for the same semantic concept.")
print()
print("If the user wanted PostgreSQL specifically, they might get MySQL docs!")
print("The embedding doesn't know WHICH database the user cares about.")
print()
print("TAKEAWAY: For entity-specific queries, combine embeddings with:")
print("  - Metadata filtering (filter by database='PostgreSQL' FIRST)")
print("  - Keyword matching (BM25 boosts 'PostgreSQL' keyword)")
print("  - Query expansion (explicitly add entity name to query)")


Entity Confusion Demo

PROBLEM: Embeddings see 'database config' as similar regardless of WHICH database!

Scenario: User wants PostgreSQL help but query doesn't mention it

Query: 'How do I configure database connection pooling?'

Document rankings by embedding similarity:
----------------------------------------------------------------------
  1. [0.635] █████████████████████████
     [MySQL] MySQL connection pooling is configured in my.cnf with max_connections.

  2. [0.616] ████████████████████████
     [PostgreSQL] PostgreSQL uses pgBouncer for connection pooling configuration.

  3. [0.604] ████████████████████████
     [MongoDB] MongoDB connection pools are set in the connection string options.

----------------------------------------------------------------------
OBSERVATION: Score spread is only 0.030!
All three databases score similarly for the same semantic concept.

If the user wanted PostgreSQL specifically, they might get MySQL docs!
The embedding doesn't know WHICH database the user cares about.
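Metadata filtering, the first mitigation listed in the takeaway, can be sketched without any vector database: restrict the candidates to the entity the user cares about, then rank what remains. The scores below are copied from the printed ranking above; `filter_then_rank` and its `entity` parameter are hypothetical, and in practice the entity would have to come from user context or an upstream classifier (an assumption).

```python
# (similarity, database) pairs copied from the printed ranking above
scored_docs = [
    (0.635, "MySQL"),
    (0.616, "PostgreSQL"),
    (0.604, "MongoDB"),
]

def filter_then_rank(scored, entity=None):
    """Keep only docs tagged with `entity` (if given), then sort by score."""
    kept = [item for item in scored if entity is None or item[1] == entity]
    return sorted(kept, key=lambda item: item[0], reverse=True)

unfiltered = filter_then_rank(scored_docs)
postgres_only = filter_then_rank(scored_docs, entity="PostgreSQL")

print("No filter:        ", unfiltered)
print("PostgreSQL filter:", postgres_only)
```

Filtering first means the 0.019 gap between the MySQL and PostgreSQL documents no longer decides which one the user sees.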

In [15]:
# ==========================================================================
# STEP 7: Simple Semantic Search
# ==========================================================================
# This is the foundation of RAG retrieval.
# Given a query, find the most similar documents.
# ==========================================================================

print("Semantic Search Demo")
print("=" * 70)

# Simulated document corpus (in real RAG, these are your chunks)
documents = [
    "Python is a high-level programming language known for its readability.",
    "Machine learning models require large datasets for training.",
    "PostgreSQL is a powerful open-source relational database system.",
    "REST APIs use HTTP methods like GET, POST, PUT, DELETE.",
    "Docker containers package applications with their dependencies.",
    "Git is a distributed version control system for tracking code changes.",
    "Neural networks are inspired by biological brain structure.",
    "Kubernetes orchestrates containerized applications at scale.",
    "SQL queries retrieve data from relational databases.",
    "TensorFlow and PyTorch are popular deep learning frameworks.",
]

# Pre-compute document embeddings (do this once, store in vector DB)
doc_embeddings = model.encode(documents)

def semantic_search(query, top_k=3):
    """Find top_k most similar documents to query."""
    query_emb = model.encode(query)
    
    # Compute similarities
    similarities = [cosine_similarity(query_emb, doc_emb) for doc_emb in doc_embeddings]
    
    # Sort by similarity (descending)
    ranked = sorted(enumerate(similarities), key=lambda x: x[1], reverse=True)
    
    return [(documents[i], sim) for i, sim in ranked[:top_k]]

# Test queries
queries = [
    "How do I write SQL to fetch data?",
    "What tools are used for deep learning?",
    "How to manage code versions?",
]

for query in queries:
    print(f"\nQuery: '{query}'")
    print("-" * 60)
    results = semantic_search(query)
    for i, (doc, sim) in enumerate(results, 1):
        print(f"  {i}. [{sim:.3f}] {doc}")

print("\n" + "=" * 70)
print("This is the core of RAG retrieval!")
print("In production, use a vector database (Pinecone, Weaviate, pgvector).")


Semantic Search Demo

Query: 'How do I write SQL to fetch data?'
------------------------------------------------------------
  1. [0.675] SQL queries retrieve data from relational databases.
  2. [0.248] PostgreSQL is a powerful open-source relational database system.
  3. [0.111] REST APIs use HTTP methods like GET, POST, PUT, DELETE.

Query: 'What tools are used for deep learning?'
------------------------------------------------------------
  1. [0.657] TensorFlow and PyTorch are popular deep learning frameworks.
  2. [0.363] Machine learning models require large datasets for training.
  3. [0.248] Neural networks are inspired by biological brain structure.

Query: 'How to manage code versions?'
------------------------------------------------------------
  1. [0.538] Git is a distributed version control system for tracking code changes.
  2. [0.138] Docker containers package applications with their dependencies.
  3. [0.101] Python is a high-level programming language known for its readability.
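The per-document Python loop in `semantic_search` is fine for ten documents. For larger corpora the same ranking can be computed with one matrix-vector product. This is a sketch on toy 3-D vectors (an assumption, not model output); `top_k_cosine` is a hypothetical helper name.

```python
import numpy as np

def top_k_cosine(query_vec, doc_matrix, k=3):
    """Return (index, score) pairs for the k rows most similar to query_vec."""
    # Normalize the query and every document row to unit length
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    # One matrix-vector product yields all cosine similarities at once
    scores = d @ q
    k = min(k, len(scores))
    top = np.argsort(-scores)[:k]  # indices sorted by descending score
    return [(int(i), float(scores[i])) for i in top]

# Toy data, chosen only for illustration
toy_docs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
toy_query = np.array([2.0, 1.0, 0.0])

hits = top_k_cosine(toy_query, toy_docs, k=2)
print(hits)
```

With the notebook's data, `top_k_cosine(model.encode(query), doc_embeddings)` should return the same documents as `semantic_search`, as index/score pairs.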

In [16]:
# ==========================================================================
# STEP 8: Why Hybrid Search?
# ==========================================================================
# Pure embedding search has limitations (negation, entities).
# Hybrid search combines embeddings with keyword matching.
#
# This demo shows WHY you might want hybrid search.
# ==========================================================================

print("Why Hybrid Search?")
print("=" * 70)
print()

# Scenario: User wants PostgreSQL-specific docs, but a generic doc can score nearly as high
corpus = [
    "PostgreSQL supports advanced indexing with GIN and GiST.",
    "MySQL uses InnoDB as its default storage engine.",
    "Database indexing improves query performance significantly.",
]

query = "PostgreSQL indexing"
query_emb = model.encode(query)
corpus_embs = model.encode(corpus)

print(f"Query: '{query}'")
print("\nPure Embedding Search (semantic only):")
print("-" * 60)

for doc, emb in zip(corpus, corpus_embs):
    sim = cosine_similarity(query_emb, emb)
    print(f"  [{sim:.3f}] {doc}")

print("\n" + "-" * 60)
print("PROBLEM: The generic indexing doc scores almost as high as the PostgreSQL doc!")
print("\nWith hybrid search, we'd also check for 'PostgreSQL' keyword.")
print("Keyword match would boost the PostgreSQL doc's rank.")
print("\nHybrid = Semantic similarity + Keyword relevance (e.g., BM25)")


Why Hybrid Search?

Query: 'PostgreSQL indexing'

Pure Embedding Search (semantic only):
------------------------------------------------------------
  [0.667] PostgreSQL supports advanced indexing with GIN and GiST.
  [0.137] MySQL uses InnoDB as its default storage engine.
  [0.603] Database indexing improves query performance significantly.

------------------------------------------------------------
PROBLEM: The generic indexing doc scores almost as high as the PostgreSQL doc!

With hybrid search, we'd also check for 'PostgreSQL' keyword.
Keyword match would boost the PostgreSQL doc's rank.

Hybrid = Semantic similarity + Keyword relevance (e.g., BM25)
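The "semantic + keyword" idea can be sketched as a weighted blend of two scores. The semantic scores below are copied from the printed output above. The keyword score is a simple term-overlap fraction standing in for BM25, and the weight `alpha=0.5` is an arbitrary assumption; `keyword_overlap` and `hybrid_score` are hypothetical helpers.

```python
import re

def keyword_overlap(query, doc):
    """Fraction of query terms that appear in the doc (a crude stand-in for BM25)."""
    q_terms = set(re.findall(r"\w+", query.lower()))
    d_terms = set(re.findall(r"\w+", doc.lower()))
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0

def hybrid_score(semantic, keyword, alpha=0.5):
    """Weighted blend; alpha=1.0 is pure semantic, alpha=0.0 is pure keyword."""
    return alpha * semantic + (1 - alpha) * keyword

demo_query = "PostgreSQL indexing"
# (document, semantic score) pairs copied from the printed output above
scored = [
    ("PostgreSQL supports advanced indexing with GIN and GiST.", 0.667),
    ("MySQL uses InnoDB as its default storage engine.", 0.137),
    ("Database indexing improves query performance significantly.", 0.603),
]

hybrid = sorted(
    ((hybrid_score(sem, keyword_overlap(demo_query, doc)), doc) for doc, sem in scored),
    reverse=True,
)
for score, doc in hybrid:
    print(f"  [{score:.3f}] {doc}")
```

The gap between the PostgreSQL doc and the generic indexing doc widens because only the former matches both query terms. Production systems typically normalize the two score scales, or fuse ranks instead of raw scores, before blending.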


In [17]:
# ==========================================================================
# STEP 9: FAILURE MODE — Numerical Blindness
# ==========================================================================
# Embeddings treat numbers as tokens, not quantities.
# "€100" and "€10,000" are just different token sequences — no math!
# ==========================================================================

print("Numerical Blindness Demo")
print("=" * 70)
print()
print("PROBLEM: Embeddings don't understand numbers as quantities!")
print()

# Pairs that should be VERY different but embeddings see as similar
number_pairs = [
    ("The price is €100", "The price is €10,000"),
    ("Server uptime is 99.9%", "Server uptime is 9.9%"),
    ("Deadline is January 5", "Deadline is January 25"),
    ("2024 annual report", "2014 annual report"),
]

print("Pairs with VERY different meanings:")
print("-" * 70)

for text1, text2 in number_pairs:
    emb1 = model.encode(text1)
    emb2 = model.encode(text2)
    sim = cosine_similarity(emb1, emb2)
    
    bar = "█" * int(sim * 40)
    warning = "⚠️ HIGH!" if sim > 0.8 else ""
    
    print(f"\n  '{text1}'")
    print(f"  '{text2}'")
    print(f"  Similarity: {sim:.3f} {bar} {warning}")

print("\n" + "-" * 70)
print("WHY THIS HAPPENS:")
print("-" * 70)
print()
print("To embeddings, '100' and '10,000' are just different tokens.")
print("There's no built-in understanding that 10,000 > 100.")
print()
print("TAKEAWAY: Extract numbers into structured metadata for comparison.")
print("  → price_euros: 100 vs price_euros: 10000")
print("  → Use numeric filters, not embedding similarity for numbers")


Numerical Blindness Demo

PROBLEM: Embeddings don't understand numbers as quantities!

Pairs with VERY different meanings:
----------------------------------------------------------------------

  'The price is €100'
  'The price is €10,000'
  Similarity: 0.905 ████████████████████████████████████ ⚠️ HIGH!

  'Server uptime is 99.9%'
  'Server uptime is 9.9%'
  Similarity: 0.973 ██████████████████████████████████████ ⚠️ HIGH!

  'Deadline is January 5'
  'Deadline is January 25'
  Similarity: 0.955 ██████████████████████████████████████ ⚠️ HIGH!

  '2024 annual report'
  '2014 annual report'
  Similarity: 0.694 ███████████████████████████ 

----------------------------------------------------------------------
WHY THIS HAPPENS:
----------------------------------------------------------------------

To embeddings, '100' and '10,000' are just different tokens.
There's no built-in understanding that 10,000 > 100.

TAKEAWAY: Extract numbers into structured metadata for comparison.
  → price_euros: 100 vs price_euros: 10000
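Extracting numbers into structured metadata can start with a regular expression. The sketch below assumes English-style formatting (comma as thousands separator, dot as decimal point) and treats the first number in a text as the value of interest; both are simplifying assumptions, and `extract_numbers` and `passes_max_price` are hypothetical helpers.

```python
import re

def extract_numbers(text):
    """Pull numeric values out of text, dropping thousands separators."""
    matches = re.findall(r"\d[\d,]*(?:\.\d+)?", text)
    return [float(m.replace(",", "")) for m in matches]

def passes_max_price(text, max_price):
    """Numeric filter: True if the first number in the text is <= max_price."""
    numbers = extract_numbers(text)
    return bool(numbers) and numbers[0] <= max_price

texts = ["The price is €100", "The price is €10,000", "Server uptime is 99.9%"]
for t in texts:
    print(f"{t!r:28} -> {extract_numbers(t)}")

print(passes_max_price("The price is €100", 500))     # True
print(passes_max_price("The price is €10,000", 500))  # False
```

The two price sentences that scored 0.905 on embedding similarity are separated cleanly once the comparison is done on the extracted values.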

In [19]:
# ==========================================================================
# STEP 10: Hybrid Search with ChromaDB — Production-Ready RAG
# ==========================================================================
# ChromaDB is a lightweight vector database that supports:
#   - Semantic search (embeddings)
#   - Metadata filtering (exact matches)
#   - Hybrid approaches combining both
#
# This demo shows a REAL failure mode: PLAN CONFUSION
# Embeddings see "API rate limits" docs as similar regardless of the plan!
#
# Prerequisites: pip install chromadb
# ==========================================================================

import chromadb

print("Hybrid Search with ChromaDB")
print("=" * 70)
print()
print("SCENARIO: Product documentation — user needs SPECIFIC product info")
print("PROBLEM: User context is 'Pro plan', but query doesn't mention it!")
print()

# Create in-memory ChromaDB client
client = chromadb.Client()

# Create collection (uses default embedding function)
collection = client.create_collection(
    name="docs",
    metadata={"hnsw:space": "cosine"}
)

# Our document corpus - REALISTIC SCENARIO
# Three docs cover "API rate limits", each for a DIFFERENT plan; two cover other topics.
documents = [
    "API rate limits: 100 requests per minute. Exceeding this will return 429 errors. Contact support for increases.",
    "API rate limits: 1000 requests per minute. Burst capacity available. Enterprise SLA guarantees 99.9% uptime.",
    "API rate limits: 500 requests per minute. Includes webhook support and priority queue processing.",
    "API authentication uses OAuth 2.0. Generate tokens in the dashboard. Tokens expire after 24 hours.",
    "API versioning follows semver. Current stable version is v3. Deprecation notices sent 6 months in advance.",
]

# Metadata - THIS IS THE KEY!
metadatas = [
    {"product": "free", "topic": "rate-limits"},
    {"product": "enterprise", "topic": "rate-limits"},
    {"product": "pro", "topic": "rate-limits"},
    {"product": "all", "topic": "authentication"},
    {"product": "all", "topic": "versioning"},
]

# Add documents to collection
collection.add(
    documents=documents,
    metadatas=metadatas,
    ids=[f"doc_{i}" for i in range(len(documents))]
)

print("Documents indexed in ChromaDB:")
for doc, meta in zip(documents, metadatas):
    print(f"  [{meta['product']:10}] {doc[:50]}...")
print()

# The query - user is on Pro plan but doesn't say so in query!
query = "What are the API rate limits?"

print(f"Query: '{query}'")
print()
print("-" * 70)

# ========== METHOD 1: Pure Semantic Search ==========
print("METHOD 1: Pure Semantic Search (embedding only)")
print("-" * 70)

results_semantic = collection.query(
    query_texts=[query],
    n_results=3
)

print("Top 3 results (user is on PRO plan but didn't say so):")
for i, (doc, distance, meta) in enumerate(zip(
    results_semantic['documents'][0],
    results_semantic['distances'][0],
    results_semantic['metadatas'][0]
)):
    similarity = 1 - distance  # ChromaDB returns distance, not similarity
    product = meta['product']
    marker = "✓ CORRECT!" if product == "pro" else f"← WRONG ({product})!"
    print(f"  {i+1}. [{similarity:.3f}] [{product:10}] {doc[:40]}... {marker}")

print()
print("PROBLEM: FREE and ENTERPRISE limits outrank the user's PRO plan!")
print("All 'rate limit' docs look nearly identical to embeddings.")
print()

# ========== METHOD 2: Metadata Filtering ==========
print("-" * 70)
print("METHOD 2: Semantic Search + Metadata Filter")
print("-" * 70)

# In production: user's plan is known from their session/auth context
user_plan = "pro"

results_filtered = collection.query(
    query_texts=[query],
    n_results=3,
    where={"product": user_plan}  # Filter by user's actual plan!
)

print(f"Filter: product = '{user_plan}' (from user session)")
print("Results:")
for i, (doc, distance, meta) in enumerate(zip(
    results_filtered['documents'][0],
    results_filtered['distances'][0],
    results_filtered['metadatas'][0]
)):
    similarity = 1 - distance
    product = meta['product']
    print(f"  {i+1}. [{similarity:.3f}] [{product:10}] {doc[:40]}... ✓")

print()
print("SUCCESS: User gets their PRO plan limits!")
print()

# ========== METHOD 3: Combine topic + product filters ==========
print("-" * 70)
print("METHOD 3: Multi-Filter (topic + product)")
print("-" * 70)

results_multi = collection.query(
    query_texts=[query],
    n_results=3,
    where={"$and": [{"product": user_plan}, {"topic": "rate-limits"}]}
)

print(f"Filter: product='{user_plan}' AND topic='rate-limits'")
print("Results:")
for i, (doc, distance, meta) in enumerate(zip(
    results_multi['documents'][0],
    results_multi['distances'][0],
    results_multi['metadatas'][0]
)):
    similarity = 1 - distance
    product = meta['product']
    print(f"  {i+1}. [{similarity:.3f}] [{product:10}] {doc[:40]}... ✓")

print()
print("PRECISE: Exact match on both product AND topic.")

print()
print("-" * 70)
print("KEY TAKEAWAYS:")
print("-" * 70)
print()
print("1. Pure semantic search FAILS when context matters")
print("   → All 'rate limit' docs look nearly identical to embeddings")
print("   → User gets wrong product's info (could be 5x off!)")
print()
print("2. Metadata filtering uses CONTEXT (user plan, region, etc)")
print("   → User's plan is known from session — inject as filter")
print("   → No need for user to say 'Pro plan' in every query")
print()
print("3. Multi-filter combines constraints")
print("   → product=X AND topic=Y for precise results")
print()
print("PRODUCTION PATTERN:")
print("  1. Extract metadata at ingestion: product, version, region, date")
print("  2. Get user context from session: plan, permissions, locale")
print("  3. Inject context as filters BEFORE semantic search")
print("  4. User query stays natural: 'What are the rate limits?'")


Hybrid Search with ChromaDB

SCENARIO: Product documentation — user needs SPECIFIC product info
PROBLEM: User context is 'Pro plan', but query doesn't mention it!

Documents indexed in ChromaDB:
  [free      ] API rate limits: 100 requests per minute. Exceedin...
  [enterprise] API rate limits: 1000 requests per minute. Burst c...
  [pro       ] API rate limits: 500 requests per minute. Includes...
  [all       ] API authentication uses OAuth 2.0. Generate tokens...
  [all       ] API versioning follows semver. Current stable vers...

Query: 'What are the API rate limits?'

----------------------------------------------------------------------
METHOD 1: Pure Semantic Search (embedding only)
----------------------------------------------------------------------
Top 3 results (user is on PRO plan but didn't say so):
  1. [0.763] [free      ] API rate limits: 100 requests per minute... ← WRONG (free)!
  2. [0.727] [enterprise] API rate limits: 1000 requests per minut... ← WRONG (enterprise)!
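The cell above uses two filter shapes: a plain `{"product": ...}` dict for one condition and an `{"$and": [...]}` wrapper for two. A small helper can build either shape from whatever session context is available. `build_where` is a hypothetical helper, not part of ChromaDB, and this sketch covers only equality conditions.

```python
def build_where(**conditions):
    """Build a Chroma-style `where` filter from keyword conditions.

    Conditions whose value is None are skipped. Returns None when nothing
    is left, a plain {field: value} dict for a single condition, and an
    {"$and": [...]} wrapper when there are several.
    """
    clauses = [{field: value} for field, value in conditions.items() if value is not None]
    if not clauses:
        return None
    if len(clauses) == 1:
        return clauses[0]
    return {"$and": clauses}

print(build_where(product="pro"))
print(build_where(product="pro", topic="rate-limits"))
print(build_where(product=None))
```

The result would be passed as the `where` argument of `collection.query`, for example `where=build_where(product=user_plan, topic="rate-limits")`, so the user's query text can stay free of plan names.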

In [20]:
# ==========================================================================
# STEP 11: Hierarchical Retrieval — Summary-First Strategy
# ==========================================================================
# Instead of searching full documents directly:
#   1. First pass: Search summaries (cheap, fast)
#   2. Second pass: Load full docs for top matches only
#
# This can substantially reduce token usage in production RAG systems,
# because only the top matches are expanded to full text for the LLM.
# ==========================================================================

import chromadb

print("Hierarchical Retrieval Demo")
print("=" * 70)
print()

# Create fresh client
client = chromadb.Client()

# Simulate a document corpus with summaries and full text
documents = [
    {
        "id": "doc_1",
        "summary": "Refund policy overview. 30-day money-back guarantee for all products.",
        "full_text": """REFUND POLICY - Complete Guide

Our company offers a comprehensive 30-day money-back guarantee on all products. 
If you're not satisfied with your purchase, you can request a full refund within 
30 days of delivery. The product must be in original condition with all packaging.

To request a refund:
1. Log into your account
2. Go to Order History
3. Click "Request Refund" on the relevant order
4. Fill out the reason for return
5. Print the prepaid shipping label

Refunds are processed within 5-7 business days after we receive the returned item.
Original shipping costs are non-refundable unless the return is due to our error.

For digital products, refunds are available within 14 days if the product hasn't 
been downloaded or accessed more than once."""
    },
    {
        "id": "doc_2", 
        "summary": "Shipping information. Free shipping over $50, 3-5 business days delivery.",
        "full_text": """SHIPPING POLICY - Detailed Information

Standard Shipping: 3-5 business days, $5.99 flat rate
Express Shipping: 1-2 business days, $15.99
Free Shipping: Orders over $50 qualify for free standard shipping

We ship to all 50 US states and select international destinations.
International shipping rates vary by destination and package weight.

Tracking information is sent via email within 24 hours of shipment.
All orders are processed within 1 business day of payment confirmation."""
    },
    {
        "id": "doc_3",
        "summary": "Account security. Two-factor authentication and password requirements.",
        "full_text": """ACCOUNT SECURITY - Best Practices

We take your security seriously. All accounts support two-factor authentication (2FA).
To enable 2FA, go to Settings > Security > Enable 2FA.

Password requirements:
- Minimum 12 characters
- At least one uppercase letter
- At least one number
- At least one special character

We recommend using a password manager and never reusing passwords across sites.
If you suspect unauthorized access, contact support immediately."""
    },
    {
        "id": "doc_4",
        "summary": "Payment methods. Credit cards, PayPal, and installment options available.",
        "full_text": """PAYMENT OPTIONS - Complete Guide

We accept the following payment methods:
- Visa, Mastercard, American Express, Discover
- PayPal and PayPal Credit
- Apple Pay and Google Pay
- Affirm (buy now, pay later)

For orders over $100, Affirm offers 0% APR financing for 3-12 months.
All transactions are encrypted with 256-bit SSL security."""
    },
    {
        "id": "doc_5",
        "summary": "Product warranty. 1-year manufacturer warranty on all electronics.",
        "full_text": """WARRANTY INFORMATION

All electronics come with a 1-year manufacturer warranty covering defects in 
materials and workmanship. This warranty does not cover damage from misuse, 
accidents, or unauthorized modifications.

To file a warranty claim:
1. Contact customer support with your order number
2. Describe the issue in detail
3. Provide photos if applicable
4. We'll arrange repair or replacement

Extended warranty options are available at checkout for an additional fee."""
    },
]

# Create two collections: summaries and full documents
summaries_collection = client.create_collection(name="summaries")
full_docs_collection = client.create_collection(name="full_docs")

# Index both
for doc in documents:
    summaries_collection.add(
        documents=[doc["summary"]],
        ids=[doc["id"]],
        metadatas=[{"doc_id": doc["id"]}]
    )
    full_docs_collection.add(
        documents=[doc["full_text"]],
        ids=[doc["id"]],
        metadatas=[{"doc_id": doc["id"]}]
    )

print(f"Indexed {len(documents)} documents")
print(f"  Summaries collection: ~{sum(len(d['summary'].split()) for d in documents)} words")
print(f"  Full docs collection: ~{sum(len(d['full_text'].split()) for d in documents)} words")
print()

# The query
query = "How do I get a refund?"

print(f"Query: '{query}'")
print()
print("-" * 70)
print("STEP 1: Search summaries (fast, cheap)")
print("-" * 70)

# First pass: search summaries
summary_results = summaries_collection.query(
    query_texts=[query],
    n_results=2  # Get top 2
)

print("Top 2 matching summaries:")
for i, (doc_id, summary, distance) in enumerate(zip(
    summary_results['ids'][0],
    summary_results['documents'][0],
    summary_results['distances'][0]
)):
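    # Rough score only: assuming the collection uses Chroma's default squared-L2
    # space, 1 - distance is not a cosine similarity and can go negative.
    # For unit-length embeddings, cosine similarity would be 1 - distance / 2.
    sim = 1 - distance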
    print(f"  {i+1}. [{sim:.3f}] {doc_id}: {summary[:50]}...")

# Get the IDs of relevant docs
relevant_ids = summary_results['ids'][0]

print()
print("-" * 70)
print("STEP 2: Fetch full docs for top matches only")
print("-" * 70)

# Second pass: get full documents by ID
full_results = full_docs_collection.get(ids=relevant_ids)

print(f"Loaded {len(relevant_ids)} full documents:")
for doc_id, full_text in zip(full_results['ids'], full_results['documents']):
    word_count = len(full_text.split())
    print(f"  {doc_id}: {word_count} words — '{full_text[:60]}...'")

print()
print("-" * 70)
print("TOKEN SAVINGS ANALYSIS")
print("-" * 70)
print()

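# Calculate savings. Tokens are estimated as words * 1.3, a rough heuristic
# for English text rather than a real tokenizer count.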
total_summary_tokens = sum(len(d['summary'].split()) * 1.3 for d in documents)
total_full_tokens = sum(len(d['full_text'].split()) * 1.3 for d in documents)
loaded_full_tokens = sum(len(d.split()) * 1.3 for d in full_results['documents'])

print("NAIVE approach (search all full docs):")
print(f"  Index/search: {int(total_full_tokens):,} tokens")
print()
print("HIERARCHICAL approach:")
print(f"  Step 1 - Search summaries: {int(total_summary_tokens):,} tokens")
print(f"  Step 2 - Load top 2 full:  {int(loaded_full_tokens):,} tokens")
print(f"  Total:                     {int(total_summary_tokens + loaded_full_tokens):,} tokens")
print()

savings_pct = (1 - (total_summary_tokens + loaded_full_tokens) / total_full_tokens) * 100
print(f"SAVINGS: {savings_pct:.0f}% fewer tokens!")
print()
print("At scale (10K docs, 50K queries/day):")
# Back-of-envelope daily cost estimates, assuming $0.003 per 1K tokens.
# These values are not printed; the lines below describe the trade-off qualitatively.
naive_daily = 50000 * (total_full_tokens / len(documents)) * 10000 / 1000 * 0.003
hier_daily = 50000 * ((total_summary_tokens + loaded_full_tokens) / len(documents)) * 100 / 1000 * 0.003
print("  Naive:        Would be impractical (full docs in every search)")
print("  Hierarchical: Search summaries, load only what's needed")
print()
print("PRODUCTION PATTERN:")
print("  1. Pre-compute summaries at ingestion (LLM or extractive)")
print("  2. Index summaries in vector DB")
print("  3. First pass: semantic search on summaries")
print("  4. Second pass: fetch full docs by ID for top-k")
print("  5. Feed only relevant full docs to LLM")


Hierarchical Retrieval Demo

Indexed 5 documents
  Summaries collection: ~43 words
  Full docs collection: ~389 words

Query: 'How do I get a refund?'

----------------------------------------------------------------------
STEP 1: Search summaries (fast, cheap)
----------------------------------------------------------------------
Top 2 matching summaries:
  1. [-0.005] doc_1: Refund policy overview. 30-day money-back guarante...
  2. [-0.394] doc_4: Payment methods. Credit cards, PayPal, and install...

----------------------------------------------------------------------
STEP 2: Fetch full docs for top matches only
----------------------------------------------------------------------
Loaded 2 full documents:
  doc_1: 125 words — 'REFUND POLICY - Complete Guide

Our company offers a compreh...'
  doc_4: 54 words — 'PAYMENT OPTIONS - Complete Guide

We accept the following pa...'

----------------------------------------------------------------------
TOKEN SAVINGS ANALYSIS
----------
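
The word counts printed above (43 summary words, 389 full-text words, and 125 + 54 words for the two loaded documents) are enough to reproduce the savings estimate by hand. The sketch below reuses the cell's words × 1.3 heuristic. Because the cell truncates with `int()` per line, its printed token counts may differ from these by one.

```python
# Word counts taken from the output above. The 1.3 tokens-per-word factor is
# the same rough heuristic used in the cell, not a real tokenizer.
TOKENS_PER_WORD = 1.3

summary_words = 43        # all five summaries
full_words = 389          # all five full documents
loaded_words = 125 + 54   # doc_1 and doc_4, the two documents fetched in step 2

summary_tokens = summary_words * TOKENS_PER_WORD
full_tokens = full_words * TOKENS_PER_WORD
loaded_tokens = loaded_words * TOKENS_PER_WORD

# Hierarchical cost = searching every summary + loading only the top matches.
hierarchical_tokens = summary_tokens + loaded_tokens
savings_pct = (1 - hierarchical_tokens / full_tokens) * 100

print(f"Naive:        ~{full_tokens:.0f} tokens")
print(f"Hierarchical: ~{hierarchical_tokens:.0f} tokens")
print(f"Savings:      ~{savings_pct:.0f}%")
```

With only five documents and two of them loaded, the estimate comes to roughly 43%. The gap should widen as the corpus grows while the number of loaded documents stays fixed.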

## Summary

**What embeddings are good at:**
- Capturing semantic similarity ("happy" ≈ "joyful")
- Finding related content even with different words
- Powering semantic search and RAG retrieval

**What embeddings struggle with:**
- Negation ("is secure" ≈ "is not secure")
- Entity disambiguation (PostgreSQL ≈ MySQL)
- Exact matching requirements
- Numerical reasoning

**Production patterns demonstrated:**
- **Hybrid search** (STEP 10): Semantic + metadata filtering
- **Hierarchical retrieval** (STEP 11): Summary-first, then full docs
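
The summary-first pattern can be sketched without a vector database. The example below is a minimal illustration, not the notebook's implementation: it swaps real embeddings for a toy bag-of-words cosine similarity, and `bow_vector`, `hierarchical_search`, and the placeholder full texts are all hypothetical names and data introduced here.

```python
import math
import re
from collections import Counter


def bow_vector(text):
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Tier 1: short summaries that get searched.
summaries = {
    "doc_1": "Refund policy overview. 30-day money-back guarantee for all products.",
    "doc_2": "Shipping information. Free shipping over $50, 3-5 business days delivery.",
    "doc_3": "Account security. Two-factor authentication and password requirements.",
}

# Tier 2: full texts, only ever looked up by ID (placeholder strings here).
full_texts = {
    "doc_1": "Full refund policy text (placeholder).",
    "doc_2": "Full shipping policy text (placeholder).",
    "doc_3": "Full account security text (placeholder).",
}


def hierarchical_search(query, k=1):
    """Rank summaries against the query, then fetch full text for the top k IDs."""
    q = bow_vector(query)
    ranked = sorted(
        summaries,
        key=lambda doc_id: cosine(q, bow_vector(summaries[doc_id])),
        reverse=True,
    )
    return [(doc_id, full_texts[doc_id]) for doc_id in ranked[:k]]


results = hierarchical_search("How do I get a refund?")
print(results[0][0])
```

Only the first pass touches every document, and it touches only the short summary. The full text is read for the `k` winners alone, which is where the token savings come from.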

**Practical recommendations:**
- Use hybrid search (embeddings + metadata) for production
- Add metadata filtering for entity-specific queries
- Use hierarchical retrieval for large document sets
- Consider reranking for critical applications
- Test with your actual use cases — embedding quality varies by domain
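
To illustrate why metadata filtering helps with entity-specific queries, here is a small sketch under stated assumptions. The 3-dimensional vectors are hand-written stand-ins for real embeddings, and the `corpus` records, the `db` metadata field, and both search functions are hypothetical. The vectors are chosen so that the PostgreSQL and MySQL backup documents sit very close together, mimicking the entity confusion described above.

```python
import math


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical corpus: toy vectors plus a metadata field naming the database.
corpus = [
    {"id": "pg_backup", "db": "postgresql", "vec": [0.90, 0.10, 0.20]},
    {"id": "mysql_backup", "db": "mysql", "vec": [0.92, 0.08, 0.18]},
    {"id": "pg_indexes", "db": "postgresql", "vec": [0.20, 0.90, 0.10]},
]


def semantic_only(query_vec, k=1):
    """Rank every document by similarity, ignoring metadata."""
    ranked = sorted(corpus, key=lambda d: cosine(query_vec, d["vec"]), reverse=True)
    return [d["id"] for d in ranked[:k]]


def hybrid_search(query_vec, db, k=1):
    """Filter by metadata first, then rank the survivors by similarity."""
    candidates = [d for d in corpus if d["db"] == db]
    candidates.sort(key=lambda d: cosine(query_vec, d["vec"]), reverse=True)
    return [d["id"] for d in candidates[:k]]


# Pretend this vector encodes "How do I back up my PostgreSQL database?"
query_vec = [1.0, 0.0, 0.2]

print("semantic only:", semantic_only(query_vec))
print("hybrid:       ", hybrid_search(query_vec, db="postgresql"))
```

In this toy setup the similarity-only ranking picks the MySQL document because its vector happens to be marginally closer, while the filtered search cannot return it at all. A real system would apply the same idea through the vector store's own metadata filter.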
