# Vector Databases and Embeddings

## GenAI Foundation Training - Day 2

---

### What You'll Learn

In this notebook, you'll learn the foundational concepts behind semantic search and RAG (Retrieval-Augmented Generation) systems:

1. **What embeddings are** and why they're crucial for modern AI applications
2. **Generate embeddings** using OpenAI, Google AI, and open-source models
3. **Implement similarity search** with different metrics (cosine, Euclidean)
4. **Choose appropriate chunking strategies** for your documents
5. **Work with vector databases** (ChromaDB) for semantic search
6. **Know when to use** vector databases vs traditional databases

### Why This Matters

Embeddings and vector databases are the foundation of:
- **RAG systems** - Chatbots with custom knowledge
- **Semantic search** - Finding documents by meaning, not just keywords
- **Recommendation engines** - "Users who liked X also liked Y"
- **Duplicate detection** - Finding similar content at scale

### Connection to Previous Notebook

In the previous notebook, we learned how to:
- Make API calls to LLMs (OpenAI, Anthropic, Google AI)
- Get structured outputs with Pydantic

Now we'll learn how to:
- Store and retrieve relevant context for those LLMs
- Build the "R" in RAG (Retrieval-Augmented Generation)

### Duration

This notebook takes approximately **2 hours** to complete.

### Prerequisites

- Basic Python knowledge
- Completed Environment Setup notebook (or have API keys ready)
- OpenAI and/or Google AI API keys

Let's get started!

---

## Section 2: Package Installation

Let's install all the packages we'll need for this notebook.

### Packages We'll Use:

- **openai** - OpenAI API client for generating embeddings
- **google-generativeai** - Google AI client for Gemini embeddings
- **sentence-transformers** - Open-source embedding models that run locally
- **chromadb** - Vector database for storing and searching embeddings
- **numpy** - Numerical operations on vectors
- **scikit-learn** - Machine learning utilities (similarity metrics)

Run the cell below to install everything:

In [None]:
# Install all required packages
!pip install openai google-generativeai sentence-transformers chromadb numpy scikit-learn -q

print("✅ All packages installed successfully!")

### Import Libraries

In [None]:
import os
import json
from typing import List, Dict
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

# LLM clients
import openai
import google.generativeai as genai
from sentence_transformers import SentenceTransformer

# Vector database
import chromadb
from chromadb.config import Settings

print("✅ All libraries imported successfully!")

### Setup API Keys (Using Google Colab Secrets)

We'll use the same secure approach from the previous notebook.

**Setting Up Google Colab Secrets:**

1. Click the **🔑 (key icon)** in the left sidebar
2. Add these secrets:
   - `OPENAI_API_KEY` - Your OpenAI API key
   - `GOOGLE_API_KEY` - Your Google AI API key (optional)
3. Toggle **"Notebook access"** ON for each key

**Note:** You'll need at least an OpenAI API key for this notebook. Google AI is optional.

In [None]:
# Import userdata for Colab secrets
from google.colab import userdata

# Retrieve API keys from Colab secrets
try:
    OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
    openai_client = openai.OpenAI(api_key=OPENAI_API_KEY)
    print("✅ OpenAI API key loaded successfully!")
except Exception as e:
    print(f"❌ Error loading OpenAI API key: {e}")
    print("Please set OPENAI_API_KEY in Google Colab Secrets.")

# Setup Google AI (optional)
try:
    GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')
    genai.configure(api_key=GOOGLE_API_KEY)
    print("✅ Google AI API key loaded successfully!")
except Exception as e:
    print(f"⚠️  Google AI API key not found (optional): {e}")

# This line runs even if a key failed to load, so check the messages above
print("\n✅ API key setup finished (check the messages above for any errors).")

---

## Section 3: What Are Embeddings?

### The Core Concept

**Embeddings** are dense numerical representations of text that capture semantic meaning.

Think of embeddings as coordinates in a multi-dimensional space where:
- Similar concepts are close together
- Different concepts are far apart
- The distance between two vectors reflects how similar their meanings are (smaller distance = more similar)

### Why Embeddings Matter

**Traditional keyword search:**
- "cat" only matches documents containing "cat"
- Misses "feline", "kitten", "pet"

**Semantic search with embeddings:**
- "cat" finds documents about cats, felines, kittens, pets
- Understands meaning, not just exact words

### The Famous Example

Embeddings capture relationships:
```
King - Man + Woman ≈ Queen
```

This works because embeddings capture semantic relationships in vector space!
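
To make the arithmetic concrete, here is a minimal sketch using hand-written 3-dimensional toy vectors. The vectors, the dimension meanings, and the helper names `analogy` and `nearest` are all assumptions made up for illustration; real embeddings have hundreds of dimensions and their individual dimensions are not interpretable like this.

```python
# Toy 3-dimensional "embeddings", hand-written for illustration only.
# Assumed meaning of each dimension: [royalty, maleness, femaleness]
toy_vectors = {
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.1, 0.1],
}

def analogy(a, b, c):
    """Return the vector a - b + c, computed element by element."""
    return [x - y + z for x, y, z in zip(toy_vectors[a], toy_vectors[b], toy_vectors[c])]

def nearest(target, exclude):
    """Return the word whose toy vector has the smallest squared distance to target."""
    def sq_dist(word):
        return sum((p - q) ** 2 for p, q in zip(toy_vectors[word], target))
    return min((w for w in toy_vectors if w not in exclude), key=sq_dist)

result = analogy("king", "man", "woman")
# The input words are excluded, as is usual when evaluating analogies
print(nearest(result, exclude={"king", "man", "woman"}))
```

With these toy values, subtracting "man" removes the maleness component and adding "woman" adds femaleness, which lands on the "queen" vector.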

### Dimensionality

Embeddings are vectors with many dimensions:
- **384 dimensions** - Lightweight models (all-MiniLM-L6-v2)
- **768 dimensions** - Medium models (Google embedding-001)
- **1536 dimensions** - Large models (OpenAI text-embedding-3-small)
- **3072 dimensions** - Extra large (OpenAI text-embedding-3-large)

More dimensions = more nuanced representations, but also more storage and computation.
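
The storage side of that trade-off is easy to estimate. The sketch below assumes each value is stored as a 4-byte float32 and counts only the raw vectors (no index structures or metadata); `index_size_mb` is a hypothetical helper name.

```python
def index_size_mb(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> float:
    """Raw storage for the vectors alone, in megabytes (1 MB = 1,000,000 bytes)."""
    return num_vectors * dimensions * bytes_per_value / 1_000_000

# Compare the dimensionalities listed above for one million stored vectors
for dims in (384, 768, 1536, 3072):
    print(f"{dims:>5} dims: {index_size_mb(1_000_000, dims):,.0f} MB per million vectors")
```

Under these assumptions, going from 384 to 3072 dimensions multiplies raw storage by eight, and similarity computations grow by the same factor.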

### How LLMs Use Embeddings

LLMs work with numbers, not text:
1. Text → Embeddings (numbers)
2. LLM processes numbers
3. Output numbers → Text

Embeddings are the bridge between human language and machine learning.

### Let's See an Example

We'll generate an embedding for a simple sentence and look at the raw numbers:

In [None]:
# Generate a sample embedding
sample_text = "Machine learning is a subset of artificial intelligence"

response = openai_client.embeddings.create(
    model="text-embedding-3-small",
    input=sample_text
)

embedding = response.data[0].embedding

print(f"Text: {sample_text}")
print(f"\nEmbedding dimensions: {len(embedding)}")
print(f"\nFirst 10 values: {embedding[:10]}")
print(f"\nThese {len(embedding)} numbers capture the semantic meaning of the text!")

### Visual Understanding

Imagine a 2D space (in reality it's 1536D, but we can't visualize that!):

```
       ^                     
       |   dog              
       | cat  puppy          
       |                    
  -----.------------------>  
       |                     
       |         car         
       |      vehicle        
```

- "cat", "dog", "puppy" are close (similar meanings)
- "car", "vehicle" are close
- Animals and vehicles are far apart

This is how embeddings represent meaning in high-dimensional space!

---

## Section 4: Generating Embeddings - OpenAI

OpenAI offers two main embedding models:

| Model | Dimensions | Cost per 1M tokens | Quality | Use Case |
|-------|------------|-------------------|---------|----------|
| text-embedding-3-small | 1536 | $0.02 | Excellent | Production, cost-effective |
| text-embedding-3-large | 3072 | $0.13 | Best | When quality is critical |

### Single Embedding

Let's create a helper function to generate embeddings:

In [None]:
def get_openai_embedding(text: str, model: str = "text-embedding-3-small") -> List[float]:
    """
    Generate embedding using OpenAI.
    
    Args:
        text: Text to embed
        model: OpenAI embedding model to use
    
    Returns:
        List of floats representing the embedding
    """
    response = openai_client.embeddings.create(
        model=model,
        input=text
    )
    return response.data[0].embedding

# Test it
sample_text = "Machine learning is a subset of artificial intelligence"
embedding = get_openai_embedding(sample_text)

print(f"Text: {sample_text}")
print(f"Embedding dimensions: {len(embedding)}")
print(f"First 10 values: {embedding[:10]}")
print("\n✅ OpenAI embedding generated successfully!")

### Batch Embeddings (More Efficient)

When embedding multiple texts, it's much more efficient to send them in a batch:
- **Faster** - Single API call instead of multiple
- **Less overhead** - Token cost is the same, but there are fewer network round trips
- **Better rate limits** - Fewer requests
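
Providers cap how much a single request may contain, so large collections are usually split into several batches rather than sent all at once. The following is a minimal sketch of that pattern; `batched`, `embed_in_batches`, the default `batch_size=100`, and the offline `fake_embed_batch` stand-in are assumptions for illustration, so check your provider's documented limits before picking a size.

```python
from typing import Callable, Iterator, List

def batched(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(
    texts: List[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 100,
) -> List[List[float]]:
    """Call embed_batch once per slice and concatenate results in input order."""
    embeddings: List[List[float]] = []
    for batch in batched(texts, batch_size):
        embeddings.extend(embed_batch(batch))
    return embeddings

# Stand-in embedder so the sketch runs offline: one fake 1-D vector per text
def fake_embed_batch(batch: List[str]) -> List[List[float]]:
    return [[float(len(text))] for text in batch]

vectors = embed_in_batches(["a", "bb", "ccc", "dddd", "eeeee"], fake_embed_batch, batch_size=2)
print(f"{len(vectors)} embeddings produced")
```

In the notebook, any function that takes a list of texts and returns one embedding per text could be passed as `embed_batch`.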

In [None]:
def get_openai_embeddings_batch(texts: List[str], model: str = "text-embedding-3-small") -> List[List[float]]:
    """
    Generate embeddings for multiple texts efficiently.
    
    Args:
        texts: List of texts to embed
        model: OpenAI embedding model to use
    
    Returns:
        List of embeddings (each embedding is a list of floats)
    """
    response = openai_client.embeddings.create(
        model=model,
        input=texts
    )
    return [item.embedding for item in response.data]

# Test with multiple texts
texts = [
    "Python is a programming language",
    "Machine learning uses algorithms to learn from data",
    "Neural networks are inspired by biological neurons"
]

embeddings = get_openai_embeddings_batch(texts)

print(f"Generated {len(embeddings)} embeddings in a single API call")
print(f"Each embedding has {len(embeddings[0])} dimensions")
print("\n✅ Batch embeddings generated successfully!")

### When to Use Which Model

**Use text-embedding-3-small when:**
- Building most production applications
- Cost is a consideration
- Quality is already sufficient (it's quite good!)

**Use text-embedding-3-large when:**
- Quality is absolutely critical
- Working with complex, nuanced content
- Budget allows for premium quality

**Pro tip:** Start with small, upgrade to large only if needed!

---

## Section 5: Generating Embeddings - Google AI & Open-Source

OpenAI isn't the only option! Let's explore alternatives:

### Provider Comparison

| Provider | Model | Dimensions | Cost | Speed | Quality | Privacy |
|----------|-------|------------|------|-------|---------|----------|
| OpenAI | text-embedding-3-small | 1536 | $0.02/1M | Fast | Excellent | API call |
| OpenAI | text-embedding-3-large | 3072 | $0.13/1M | Medium | Best | API call |
| Google | embedding-001 | 768 | Free tier | Fast | Good | API call |
| Open-source | all-MiniLM-L6-v2 | 384 | Free | Fastest | Good | 100% local |

### Google AI Embeddings

In [None]:
def get_google_embedding(text: str) -> List[float]:
    """
    Generate embedding using Google AI.
    
    Args:
        text: Text to embed
    
    Returns:
        List of floats representing the embedding
    """
    result = genai.embed_content(
        model="models/embedding-001",
        content=text,
        task_type="retrieval_document"  # Optimized for retrieval
    )
    return result['embedding']

# Test it
sample_text = "Machine learning is a subset of artificial intelligence"
embedding = get_google_embedding(sample_text)

print(f"Text: {sample_text}")
print(f"Embedding dimensions: {len(embedding)}")
print(f"First 10 values: {embedding[:10]}")
print("\n✅ Google AI embedding generated successfully!")

### Open-Source Embeddings (Sentence Transformers)

**Sentence Transformers** is a Python library that provides state-of-the-art sentence, text, and image embeddings.

**Benefits:**
- Runs 100% locally (no API calls)
- Complete privacy (data never leaves your machine)
- Free (no usage costs)
- Fast (especially on GPU)

**Popular models:**
- `all-MiniLM-L6-v2` - 384 dimensions, fast, good quality
- `all-mpnet-base-v2` - 768 dimensions, higher quality
- `multi-qa-MiniLM-L6-cos-v1` - Optimized for Q&A

In [None]:
# Load a local model (downloads on first use, then cached)
print("Loading Sentence Transformer model...")
local_model = SentenceTransformer('all-MiniLM-L6-v2')
print("✅ Model loaded!\n")

def get_local_embedding(text: str) -> List[float]:
    """
    Generate embedding using local Sentence Transformer model.
    
    Args:
        text: Text to embed
    
    Returns:
        List of floats representing the embedding
    """
    return local_model.encode(text).tolist()

# Test it
sample_text = "Machine learning is a subset of artificial intelligence"
embedding = get_local_embedding(sample_text)

print(f"Text: {sample_text}")
print(f"Embedding dimensions: {len(embedding)}")
print(f"First 10 values: {embedding[:10]}")
print("\n✅ Local embedding generated successfully!")
print("\nNote: This ran entirely on your machine - no API call!")

### When to Use Each Provider

**Use OpenAI (text-embedding-3-small) when:**
- Building production applications
- Need excellent quality
- Budget allows ($0.02/1M tokens is quite affordable)

**Use OpenAI (text-embedding-3-large) when:**
- Quality is absolutely critical
- Working with complex, nuanced content

**Use Google AI when:**
- Experimenting and learning
- Have free tier quota
- Quality is good enough for your use case

**Use Open-Source when:**
- Privacy is critical (healthcare, legal, finance)
- Working offline or air-gapped environments
- Want zero ongoing API costs
- Have sufficient compute resources

**Pro tip:** Start with OpenAI small for simplicity, consider alternatives based on your constraints!

---

## Section 6: Similarity Search Fundamentals

Now that we can generate embeddings, how do we compare them?

### Similarity Metrics

**1. Cosine Similarity** (Most Common)
- Range: -1 to 1
  - 1 = Identical meaning
  - 0 = Unrelated
  - -1 = Opposite meaning
- Measures angle between vectors
- Ignores magnitude, focuses on direction

**2. Euclidean Distance**
- Range: 0 to infinity
  - 0 = Identical
  - Larger = More different
- Straight-line distance between points
- Considers magnitude

**3. Dot Product**
- Similar to cosine but includes magnitude
- Used when vector lengths matter

For most semantic search use cases, **cosine similarity** is the standard choice.
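
To see what each metric actually computes, here is a plain-Python sketch of all three. The function names are illustrative and chosen so they do not shadow the scikit-learn imports used elsewhere in this notebook; in practice you would use NumPy or scikit-learn for speed.

```python
import math
from typing import Sequence

def dot_product_manual(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))

def euclidean_distance_manual(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance between the two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity_manual(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product divided by the product of the vector lengths."""
    norm_a = math.sqrt(dot_product_manual(a, a))
    norm_b = math.sqrt(dot_product_manual(b, b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot_product_manual(a, b) / (norm_a * norm_b)

v = [1.0, 2.0, 3.0]
same_direction = [2.0, 4.0, 6.0]   # v scaled by 2
opposite = [-1.0, -2.0, -3.0]      # v flipped
orthogonal = [3.0, 0.0, -1.0]      # dot product with v is 0

for name, other in [("same direction", same_direction),
                    ("opposite", opposite),
                    ("orthogonal", orthogonal)]:
    print(f"{name:>14}: cosine={cosine_similarity_manual(v, other):+.2f}  "
          f"euclidean={euclidean_distance_manual(v, other):.2f}  "
          f"dot={dot_product_manual(v, other):+.1f}")
```

The scaled vector scores a cosine of 1 even though its Euclidean distance from `v` is not zero, which is the "ignores magnitude" property described above.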

### Let's Compare Some Texts

In [None]:
def calculate_cosine_similarity(vec1: List[float], vec2: List[float]) -> float:
    """
    Calculate cosine similarity between two vectors.
    
    Args:
        vec1: First vector
        vec2: Second vector
    
    Returns:
        Similarity score between -1 and 1
    """
    return cosine_similarity([vec1], [vec2])[0][0]

# Example texts
text1 = "The cat sat on the mat"
text2 = "A feline rested on the rug"
text3 = "Python is a programming language"

# Generate embeddings
emb1 = get_openai_embedding(text1)
emb2 = get_openai_embedding(text2)
emb3 = get_openai_embedding(text3)

# Calculate similarities
sim_1_2 = calculate_cosine_similarity(emb1, emb2)
sim_1_3 = calculate_cosine_similarity(emb1, emb3)

print("Similarity Comparison:\n")
print(f"Text 1: {text1}")
print(f"Text 2: {text2}")
print(f"Similarity: {sim_1_2:.4f}")
print("\n---\n")
print(f"Text 1: {text1}")
print(f"Text 3: {text3}")
print(f"Similarity: {sim_1_3:.4f}")
print("\n" + "="*60)
print("Notice: Text 1 and 2 are highly similar (about cats/mats)")
print("Text 1 and 3 have low similarity (different topics)")

### Understanding the Scores

**High similarity (0.7 - 1.0):**
- Same topic, similar meaning
- Synonyms, paraphrases
- Related concepts

**Medium similarity (0.4 - 0.7):**
- Related but distinct topics
- Tangentially connected

**Low similarity (0.0 - 0.4):**
- Different topics
- Unrelated content

### Top-K Search

In real applications, we want to find the **K most similar** documents from a collection:

In [None]:
def find_most_similar(query_embedding: List[float], 
                     document_embeddings: List[List[float]], 
                     documents: List[str],
                     k: int = 3) -> List[tuple]:
    """
    Find top-k most similar documents to a query.
    
    Args:
        query_embedding: Embedding of the search query
        document_embeddings: List of document embeddings
        documents: List of document texts
        k: Number of results to return
    
    Returns:
        List of tuples: (index, document, similarity_score)
    """
    # Calculate similarities
    similarities = [
        cosine_similarity([query_embedding], [doc_emb])[0][0]
        for doc_emb in document_embeddings
    ]
    
    # Get top k indices (argsort returns ascending, so reverse it)
    top_k_indices = np.argsort(similarities)[-k:][::-1]
    
    # Return results
    return [
        (idx, documents[idx], similarities[idx]) 
        for idx in top_k_indices
    ]

# Example documents
documents = [
    "Python is a popular programming language for data science",
    "Machine learning uses algorithms to learn from data",
    "Deep learning uses neural networks with multiple layers",
    "Natural language processing helps computers understand human language",
    "The cat sat on the mat and looked out the window",
    "Computer vision enables machines to interpret visual information"
]

# Generate embeddings for all documents
doc_embeddings = get_openai_embeddings_batch(documents)

# Search query
query = "How do neural networks work?"
query_embedding = get_openai_embedding(query)

# Find top 3 most similar
results = find_most_similar(query_embedding, doc_embeddings, documents, k=3)

print(f"Query: {query}\n")
print("Top 3 most similar documents:\n")
for rank, (idx, doc, score) in enumerate(results, 1):
    print(f"{rank}. Similarity: {score:.4f}")
    print(f"   Document: {doc}")
    print()

### Other Distance Metrics (Brief Overview)

While cosine similarity is most common, here are other options:

**Euclidean Distance:**

In [None]:
def calculate_euclidean_distance(vec1: List[float], vec2: List[float]) -> float:
    """Calculate Euclidean distance (smaller = more similar)"""
    return euclidean_distances([vec1], [vec2])[0][0]

# Compare
dist = calculate_euclidean_distance(emb1, emb2)
print(f"Euclidean distance between similar texts: {dist:.4f}")
print("(Smaller distance = more similar)")

**Dot Product:**

In [None]:
def calculate_dot_product(vec1: List[float], vec2: List[float]) -> float:
    """Calculate dot product (larger = more similar)"""
    return np.dot(vec1, vec2)

# Compare
dot = calculate_dot_product(emb1, emb2)
print(f"Dot product between similar texts: {dot:.4f}")
print("(Larger value = more similar)")

### Which Metric to Use?

**For most semantic search: Use Cosine Similarity**
- Standard in the industry
- Works well with normalized embeddings
- Easy to interpret (scores range from -1 to 1; text embeddings usually score between 0 and 1)

**When to consider alternatives:**
- Euclidean: When magnitude matters
- Dot product: For certain specialized models

When in doubt, stick with cosine!

### When to Use Each Similarity Metric - Detailed Guide

Now that you know **how** to calculate each metric, let's understand **when** to use each one.

#### 📊 Quick Decision Guide

| Metric | Use When | Common Applications | Don't Use When |
|--------|----------|---------------------|----------------|
| **Cosine Similarity** | Direction matters, magnitude doesn't | Text search, RAG, semantic similarity | Vectors need exact magnitude comparison |
| **Euclidean Distance** | Magnitude matters, absolute distance counts | Clustering, spatial data, image similarity | Vector magnitudes differ for reasons unrelated to meaning |
| **Dot Product** | Both direction AND magnitude matter | Ranking, scoring, specialized models | General semantic search |

---

#### 🎯 Cosine Similarity - The Default Choice (90%+ of use cases)

**Use for:**
- ✅ **Text/document similarity** - Finding semantically similar content
- ✅ **RAG systems** - Retrieving relevant context for LLMs
- ✅ **Semantic search** - "Find documents about machine learning"
- ✅ **Recommendation engines** - "Users who liked X also liked Y"
- ✅ **Normalized embeddings** - When vectors are already normalized (most embedding models)

**Why it works:**
- Focuses on **direction** (meaning), not magnitude (length)
- Range 0-1 is easy to interpret (or -1 to 1 if including opposite meanings)
- Works consistently across different embedding models
- Unaffected by vector magnitude, so results are the same whether or not the embeddings are normalized

**Real-world examples:**
```python
# Example: Finding similar product descriptions
query = "wireless bluetooth headphones"
# Cosine similarity finds:
# 1. "noise cancelling wireless earbuds" (0.89)
# 2. "bluetooth audio headset" (0.85)
# 3. "over-ear wireless headphones" (0.83)
```

**When NOT to use:**
- ❌ Vectors represent actual measurements/coordinates (use Euclidean)
- ❌ Model specifically designed for dot product similarity

---

#### 📏 Euclidean Distance - For Clustering & Spatial Data

**Use for:**
- ✅ **Clustering algorithms** - K-means, DBSCAN (spatial proximity)
- ✅ **Image embeddings** - Comparing visual similarity where magnitude matters
- ✅ **User profiles with features** - Age, income, preferences (actual measurements)
- ✅ **Anomaly detection** - Finding outliers based on distance

**Why it works:**
- Measures **straight-line distance** in vector space
- Considers both direction AND magnitude
- Natural for spatial/geometric data
- Works well when all dimensions are comparable/normalized

**Real-world examples:**
```python
# Example: Clustering user profiles
user1 = [25, 50000, 3.5, 2]  # age, income, avg_rating, purchases
user2 = [27, 52000, 3.7, 3]  # Similar user (small Euclidean distance)
user3 = [65, 150000, 4.8, 50]  # Different segment (large distance)

# K-means clustering uses Euclidean distance to group similar users
```

**When NOT to use:**
- ❌ Text embeddings (direction matters more than length)
- ❌ Comparing vectors of very different scales (need normalization first)
- ❌ Semantic similarity tasks

**Key difference from Cosine:**
```python
vec_a = [1, 0]
vec_b = [10, 0]  # Same direction, 10x magnitude

# Cosine: 1.0 (identical direction)
# Euclidean: 9.0 (far apart due to magnitude)
```
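
The gap between the two metrics in the example above comes entirely from magnitude. As a sketch in plain Python (helper names are illustrative), scaling both vectors to unit length makes their Euclidean distance collapse to zero, and for any pair of unit vectors the two metrics are tied by `distance² = 2 - 2 × cosine`, so they rank results identically.

```python
import math

def normalize(v):
    """Scale v to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

vec_a = normalize([1.0, 0.0])
vec_b = normalize([10.0, 0.0])   # same direction as vec_a, so the same unit vector
vec_c = normalize([3.0, 4.0])    # a different direction

print("distance a-b after normalizing:", euclidean(vec_a, vec_b))

# For unit vectors: distance^2 == 2 - 2 * cosine
d = euclidean(vec_a, vec_c)
cos = cosine(vec_a, vec_c)
print("identity holds:", math.isclose(d ** 2, 2 - 2 * cos))
```

This is why the choice between cosine and Euclidean matters little for embeddings that are already normalized, and matters a lot for raw measurements.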

---

#### ⚡ Dot Product - For Ranking & Specialized Models

**Use for:**
- ✅ **Ranking models** - When model trained with dot product similarity
- ✅ **Scoring relevance** - Combining direction and magnitude for ranking
- ✅ **Certain recommendation systems** - Where magnitude encodes confidence/importance
- ✅ **Dense retrieval models** - Some models (DPR) use dot product natively

**Why it works:**
- Combines **both direction and magnitude**
- Can be faster to compute (no normalization needed)
- Some neural models optimize for dot product directly
- Magnitude can encode additional signal (confidence, importance)

**Real-world examples:**
```python
# Example: Search ranking where confidence matters
query_vec = [0.8, 0.6, 0.9]  # High confidence query

doc1_vec = [0.9, 0.7, 0.85]  # High confidence, similar direction → High score
doc2_vec = [0.1, 0.1, 0.15]  # Low confidence, similar direction → Low score

# Dot product ranks doc1 higher (captures both similarity AND confidence)
```

**When NOT to use:**
- ❌ General semantic search (use cosine instead)
- ❌ When you don't know if model is trained for dot product
- ❌ Comparing unnormalized vectors from different sources

**Relationship to Cosine:**
```
Dot Product = Cosine Similarity × ||A|| × ||B||

For normalized vectors (||A|| = ||B|| = 1):
Dot Product = Cosine Similarity
```
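
The identity can be checked numerically. This sketch reuses the query and document vectors from the ranking example earlier in this section; the helper names `dot`, `norm`, and `cosine` are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(v):
    """Vector length ||v||."""
    return math.sqrt(dot(v, v))

def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))

query_vec = [0.8, 0.6, 0.9]
doc1_vec = [0.9, 0.7, 0.85]   # long vector
doc2_vec = [0.1, 0.1, 0.15]   # short vector pointing in nearly the same direction

for name, doc in (("doc1", doc1_vec), ("doc2", doc2_vec)):
    d = dot(query_vec, doc)
    c = cosine(query_vec, doc)
    # Check the identity: dot = cosine * ||A|| * ||B||
    assert math.isclose(d, c * norm(query_vec) * norm(doc))
    print(f"{name}: dot={d:.3f}  cosine={c:.3f}  length={norm(doc):.3f}")
```

Both documents have almost the same cosine score against the query, while their dot products differ by several times. The whole difference comes from vector length.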

---

#### 🧪 Practical Comparison Example

Let's see how each metric behaves with the same data:
```python
# Same query, same documents, different metrics

query = "machine learning tutorial"
doc1 = "beginner guide to ML algorithms"
doc2 = "introduction to machine learning"
doc3 = "advanced deep learning research paper"

# Results with Cosine Similarity:
# 1. doc2 (0.89) - Very similar meaning
# 2. doc1 (0.82) - Related but slightly different focus
# 3. doc3 (0.65) - Same domain but different level

# Results with Euclidean Distance (smaller = better):
# 1. doc2 (0.15) - Closest in vector space
# 2. doc1 (0.22) - In between
# 3. doc3 (0.45) - Farthest

# Results with Dot Product:
# (Depends heavily on vector normalization)
# If normalized: Similar to cosine
# If not normalized: Magnitude affects ranking
```

---

#### 💡 Best Practice Recommendations

**Default choice:** Start with **Cosine Similarity**
- Works for 90%+ of semantic search use cases
- Easy to interpret (scores range from -1 to 1; text embeddings usually score between 0 and 1)
- Robust across different embedding models

**When to reconsider:**
1. **Use Euclidean** if:
   - Doing clustering (K-means, hierarchical)
   - Vectors represent measurements/coordinates
   - Magnitude has semantic meaning

2. **Use Dot Product** if:
   - Model documentation specifically recommends it
   - Doing ranking/scoring tasks
   - Working with models like DPR, ColBERT

**Testing approach:**
```python
# Test all three metrics with your data
results_cosine = search_with_cosine(query)
results_euclidean = search_with_euclidean(query)
results_dot = search_with_dot_product(query)

# Manually review top 10 results from each
# Choose the metric that gives best quality
```
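
The `search_with_*` functions above are placeholders. A runnable sketch of the same comparison is shown below, using tiny hand-written 2-D vectors so the differences are easy to see. The `rank` helper, the `METRICS` table, and the toy data are assumptions for illustration; with real embeddings you would pass your own query and document vectors.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def negative_euclidean(a, b):
    # Negated so that, like the other metrics, a larger score means "more similar"
    return -math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

METRICS = {"cosine": cosine, "euclidean": negative_euclidean, "dot": dot}

def rank(query, docs, metric_name):
    """Return document names ordered from best to worst under the chosen metric."""
    score = METRICS[metric_name]
    return sorted(docs, key=lambda name: score(query, docs[name]), reverse=True)

query = [1.0, 1.0]
docs = {
    "same_direction_long": [5.0, 5.0],   # identical direction, much larger magnitude
    "nearby_point": [1.0, 0.6],          # close in space, slightly different direction
    "off_topic": [-1.0, 0.5],            # points elsewhere
}

for metric_name in METRICS:
    print(f"{metric_name:>9}: {rank(query, docs, metric_name)}")
```

On this toy data, cosine and dot product put the long same-direction vector first, while Euclidean distance puts it last because of its magnitude.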

---

#### 📚 Summary Cheat Sheet

**For beginners:**
- Building a chatbot? → **Cosine Similarity**
- Semantic search? → **Cosine Similarity**
- RAG system? → **Cosine Similarity**
- Not sure? → **Cosine Similarity**

**For specific tasks:**
- Clustering users/documents? → **Euclidean Distance**
- Image similarity with CNNs? → **Euclidean Distance**
- Using DPR or specialized ranking model? → **Dot Product**

**When in doubt:**
Test cosine first. It's the safest default for semantic similarity tasks!

---

## Section 7: Chunking Strategies

### Why Do We Need Chunking?

Real-world documents are often too long for embedding models:
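- **Token limits:** Embedding models cap input length (about 8K tokens for OpenAI's text-embedding-3 models; many local models, such as all-MiniLM-L6-v2, truncate much sooner)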
- **Precision:** Smaller chunks = more precise retrieval
- **Context:** LLMs work better with focused, relevant chunks

**The challenge:** How do we split documents effectively?

### Chunking Strategy Comparison

| Strategy | Chunk Size | Pros | Cons | Use Case |
|----------|-----------|------|------|----------|
| Fixed-size | 256-512 tokens | Simple, predictable | May split mid-sentence | General purpose |
| Sentence-based | 3-5 sentences | Natural boundaries | Variable size | Articles, blogs |
| Semantic | Variable | Intelligent splits | Complex, slower | Long documents |

### 1. Fixed-Size Chunking

Split text into chunks of a fixed size with overlap. For simplicity, the function below counts whitespace-separated words as a rough stand-in for tokens:

In [None]:
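def chunk_by_tokens(text: str, chunk_size: int = 512, overlap: int = 50) -> List[str]:
    """
    Split text into fixed-size chunks with overlap.
    
    Words (split on whitespace) stand in for tokens here.
    
    Args:
        text: Text to chunk
        chunk_size: Number of words per chunk
        overlap: Number of overlapping words between chunks
    
    Returns:
        List of text chunks
    """
    # A step of zero or less would make range() fail or never advance
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    
    words = text.split()
    chunks = []
    step = chunk_size - overlap  # how far the window advances each time
    
    for i in range(0, len(words), step):
        chunks.append(' '.join(words[i:i + chunk_size]))
        if i + chunk_size >= len(words):
            break  # this window reached the end; another chunk would only repeat its tail
    
    return chunks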

# Example
long_text = """Machine learning is a subset of artificial intelligence that focuses on 
building systems that can learn from data. Deep learning is a subset of machine learning 
that uses neural networks with multiple layers. These neural networks are inspired by the 
structure of the human brain. Natural language processing is another important area of AI 
that helps computers understand and generate human language. Computer vision enables machines 
to interpret and analyze visual information from the world."""

chunks = chunk_by_tokens(long_text, chunk_size=20, overlap=5)

print(f"Original text length: {len(long_text.split())} words\n")
print(f"Number of chunks: {len(chunks)}\n")
for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i} ({len(chunk.split())} words):")
    print(f"  {chunk}\n")

### 2. Sentence-Based Chunking

Group sentences together (more natural boundaries):

In [None]:
def chunk_by_sentences(text: str, sentences_per_chunk: int = 3) -> List[str]:
    """
    Split text by sentences, group into chunks.
    
    Args:
        text: Text to chunk
        sentences_per_chunk: Number of sentences per chunk
    
    Returns:
        List of text chunks
    """
    # Simple sentence splitting (for production, use spaCy or NLTK).
    # Collapse newlines and repeated spaces first so chunks contain no stray gaps.
    sentences = ' '.join(text.split()).split('. ')
    chunks = []
    
    for i in range(0, len(sentences), sentences_per_chunk):
        chunk = '. '.join(sentences[i:i + sentences_per_chunk])
        if not chunk.endswith('.'):
            chunk += '.'
        chunks.append(chunk.strip())
    
    return chunks

# Example
chunks_sentences = chunk_by_sentences(long_text, sentences_per_chunk=2)

print(f"Number of sentence-based chunks: {len(chunks_sentences)}\n")
for i, chunk in enumerate(chunks_sentences, 1):
    print(f"Chunk {i}:")
    print(f"  {chunk}\n")

### 3. Chunking with Metadata

**Critical concept:** Always include metadata with chunks!

Metadata helps with:
- **Citations** - Show users where information came from
- **Filtering** - Search within specific sources or date ranges
- **Debugging** - Track down retrieval issues
- **Quality** - Prioritize trusted sources
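
As a small illustration of the filtering and citation uses, the sketch below stores chunks as plain dictionaries. The sample texts, the metadata keys, and the helpers `filter_chunks` and `cite` are all hypothetical; a vector database would apply the same kind of filter on its side.

```python
from typing import Any, Dict, List

chunks: List[Dict[str, Any]] = [
    {"text": "Refunds are processed within 14 days.",
     "metadata": {"source": "policy.pdf", "page": 4, "year": 2024}},
    {"text": "Refunds were processed within 30 days.",
     "metadata": {"source": "policy_old.pdf", "page": 2, "year": 2019}},
    {"text": "Shipping is free above 50 EUR.",
     "metadata": {"source": "policy.pdf", "page": 7, "year": 2024}},
]

def filter_chunks(items: List[Dict[str, Any]], **required: Any) -> List[Dict[str, Any]]:
    """Keep chunks whose metadata matches every key=value pair in required."""
    return [
        chunk for chunk in items
        if all(chunk["metadata"].get(key) == value for key, value in required.items())
    ]

def cite(chunk: Dict[str, Any]) -> str:
    """Format a short citation from the chunk's metadata."""
    meta = chunk["metadata"]
    return f"{meta['source']}, p. {meta['page']}"

# Filtering: only search the current policy document
current = filter_chunks(chunks, source="policy.pdf", year=2024)

# Citations: show where each retrieved chunk came from
for chunk in current:
    print(f"{chunk['text']} [{cite(chunk)}]")
```

Without the `year` and `source` fields, there would be no way to keep the outdated 30-day refund policy out of the results.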

In [None]:
class DocumentChunk:
    """
    Represents a text chunk with metadata.
    """
    def __init__(self, text: str, metadata: Dict):
        self.text = text
        self.metadata = metadata  # source, page, section, date, author, etc.
        self.embedding = None
    
    def embed(self, embedding_function):
        """Generate embedding for this chunk."""
        self.embedding = embedding_function(self.text)
    
    def __repr__(self):
        return f"DocumentChunk(text='{self.text[:50]}...', metadata={self.metadata})"

# Example: Creating chunks with rich metadata
chunks_with_metadata = [
    DocumentChunk(
        text="Machine learning is a subset of artificial intelligence that focuses on building systems that can learn from data.",
        metadata={
            "source": "ai_introduction.pdf",
            "page": 1,
            "section": "Introduction to Machine Learning",
            "date": "2024-01-15",
            "author": "Dr. Smith"
        }
    ),
    DocumentChunk(
        text="Deep learning uses neural networks with multiple layers to learn complex patterns.",
        metadata={
            "source": "ai_introduction.pdf",
            "page": 2,
            "section": "Deep Learning Fundamentals",
            "date": "2024-01-15",
            "author": "Dr. Smith"
        }
    )
]

# Generate embeddings for chunks
for chunk in chunks_with_metadata:
    chunk.embed(get_openai_embedding)

print("Chunks with metadata:")
for i, chunk in enumerate(chunks_with_metadata, 1):
    print(f"\nChunk {i}:")
    print(f"  Text: {chunk.text}")
    print(f"  Source: {chunk.metadata['source']}")
    print(f"  Page: {chunk.metadata['page']}")
    print(f"  Section: {chunk.metadata['section']}")
    print(f"  Embedding: {len(chunk.embedding)} dimensions")

### Best Practices for Chunking

**1. Chunk Size:**
- **Sweet spot:** 256-512 tokens (~200-400 words)
- Too small: Lost context, too many chunks
- Too large: Less precise retrieval, may hit limits

**2. Overlap:**
- **Recommended:** 10-20% of chunk size
- Maintains context continuity between chunks
- Prevents important information from being split

**3. Metadata:**
- **Always include:** source, page/location
- **Often useful:** date, author, section, category
- **For citations:** exact page numbers, URLs

**4. Testing:**
- Test different strategies with your data
- Measure retrieval quality
- Iterate based on results
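
To make the chunk-size and overlap guidance concrete, this sketch estimates how many chunks a document produces. It assumes a sliding window that advances by `chunk_size - overlap` and stops as soon as one window reaches the end of the text; `estimate_chunk_count` is a hypothetical helper, not part of any library.

```python
import math

def estimate_chunk_count(total_tokens: int, chunk_size: int = 512, overlap: int = 64) -> int:
    """Number of sliding windows needed to cover total_tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    if total_tokens <= 0:
        return 0
    if total_tokens <= chunk_size:
        return 1
    step = chunk_size - overlap
    # One window covers the first chunk_size tokens; each later window adds `step` new tokens
    return math.ceil((total_tokens - chunk_size) / step) + 1

doc_tokens = 10_000
for overlap in (0, 64, 128):
    count = estimate_chunk_count(doc_tokens, chunk_size=512, overlap=overlap)
    print(f"overlap={overlap:>3}: {count} chunks")
```

More overlap means more chunks to embed and store, which is the cost of the extra context continuity.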

### Semantic Chunking (Advanced)

**Note:** We won't implement this, but you should know it exists.

Semantic chunking uses NLP to:
- Detect topic boundaries
- Group related sentences
- Create more coherent chunks

**Tools:**
- LangChain's SemanticChunker
- spaCy for sentence detection
- Custom models for topic segmentation

**When to use:**
- Very long documents
- Multiple topics per document
- Quality is more important than speed
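
One common way to detect topic boundaries is to compare adjacent sentences and start a new chunk when their similarity drops. The sketch below shows that idea only; it is not how LangChain's SemanticChunker is implemented. To stay runnable offline it uses word counts as a stand-in for embeddings, and the names `bow_vector`, `bow_cosine`, `semantic_chunks`, and the `threshold=0.2` value are assumptions chosen for this toy data.

```python
import math
import re
from collections import Counter
from typing import List

def bow_vector(sentence: str) -> Counter:
    """Stand-in 'embedding': lowercase word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))

def bow_cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def semantic_chunks(sentences: List[str], threshold: float = 0.2) -> List[List[str]]:
    """Start a new chunk whenever two adjacent sentences fall below the similarity threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for previous, current in zip(sentences, sentences[1:]):
        if bow_cosine(bow_vector(previous), bow_vector(current)) < threshold:
            chunks.append([current])      # similarity dropped: treat as a topic boundary
        else:
            chunks[-1].append(current)    # same topic: extend the current chunk
    return chunks

sentences = [
    "Neural networks learn patterns from data.",
    "Deep neural networks stack many layers.",
    "Sourdough bread needs a long fermentation.",
    "Good bread needs patient fermentation.",
]

for i, chunk in enumerate(semantic_chunks(sentences), 1):
    print(f"Chunk {i}: {' '.join(chunk)}")
```

Swapping `bow_vector` for a real embedding function, and tuning the threshold on your own documents, turns the sketch into a basic semantic chunker.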

---

## Section 8: Introduction to Vector Databases

### What is a Vector Database?

A **vector database** is a specialized database optimized for storing and searching high-dimensional vectors (embeddings).

**Key difference from traditional databases:**
- Traditional DB: Exact match queries ("WHERE name = 'John'")
- Vector DB: Similarity queries ("Find documents similar to this query")

### Traditional DB vs Vector DB

| Feature | Traditional DB | Vector DB |
|---------|---------------|----------|
| **Storage** | Rows, columns (structured data) | Vectors (embeddings) |
| **Query Type** | SQL, exact match | Similarity search |
| **Use Case** | CRUD operations, transactions | Semantic search, recommendations |
| **Search Method** | Indexes (B-tree, hash) | HNSW, IVF indexes |
| **Performance** | Fast exact queries | Fast similarity queries |
| **Examples** | PostgreSQL, MySQL, MongoDB | Chroma, Pinecone, Weaviate |

### Example Queries

**Traditional DB:**
```sql
SELECT * FROM products WHERE price < 100 AND category = 'electronics'
```

**Vector DB:**
```python
# Find products similar to "wireless headphones with noise cancellation"
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=10
)
```

### Popular Vector Databases

**1. Chroma (We'll use this)**
- Open-source, easy to use
- Great for learning and prototyping
- Local or client-server deployment
- Python-first design

**2. Pinecone**
- Managed cloud service
- Production-ready, highly scalable
- Pay-as-you-go pricing
- Excellent documentation

**3. Weaviate**
- Open-source, GraphQL API
- Hybrid search (vector + keyword)
- Self-hosted or cloud
- Multi-tenant support

**4. FAISS (Facebook AI)**
- Library, not a full database
- Extremely fast
- Requires more manual setup
- Best for researchers/advanced users

**Others:** Qdrant, Milvus, Vespa, pgvector (PostgreSQL extension)

### When to Use Vector DB vs Traditional DB

**Use Vector Database when:**
- Semantic search is needed
- Building RAG applications
- Recommendation systems
- Similarity-based queries
- Content discovery
- Duplicate detection

**Use Traditional Database when:**
- Exact match queries
- Transactions (ACID guarantees)
- Structured data with relationships
- Traditional CRUD operations
- Complex joins and aggregations

**Use Both (Hybrid Architecture):**
```
PostgreSQL (user data, transactions)
    +
ChromaDB (document embeddings, semantic search)
    +
Redis (caching)
    =
Complete production system
```

### How Vector Databases Work (Simplified)

1. **Indexing:**
   - Store vectors with metadata
   - Build specialized indexes (HNSW, IVF)
   - Optimize for fast similarity search

2. **Querying:**
   - Convert query to embedding
   - Use approximate nearest neighbor (ANN) search
   - Return top-K most similar vectors

3. **Optimization:**
   - Quantization (reduce memory)
   - Sharding (distribute load)
   - Caching (speed up repeated queries)

Don't worry about the details - the vector DB handles this for you!
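
To see what the indexes are approximating, here is a minimal sketch of exact (brute-force) top-K search in plain Python. It compares the query against every stored vector, which is what ANN indexes such as HNSW avoid at scale. The function names are illustrative and do not come from any vector database API.

```python
import heapq
import math
from typing import List, Sequence, Tuple

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def exact_top_k(
    query: Sequence[float],
    vectors: List[Sequence[float]],
    k: int = 2,
) -> List[Tuple[int, float]]:
    """Score every stored vector against the query and keep the k best (index, score) pairs."""
    scored = ((i, cosine_similarity(query, v)) for i, v in enumerate(vectors))
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
top = exact_top_k([1.0, 0.1], vectors, k=2)
print(top)  # indices 0 and 2 rank highest
```

This costs one similarity computation per stored vector per query, which is fine for thousands of vectors and too slow for millions. That gap is why vector databases trade a little accuracy for speed with approximate indexes.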

---

## Section 9: Hands-On with ChromaDB

Let's get practical with ChromaDB - one of the easiest vector databases to use.

### Why ChromaDB?

- Simple Python API
- No separate server needed (can run in-memory)
- Perfect for learning and prototyping
- Can scale to production with client-server mode

### Step 1: Initialize ChromaDB

In [None]:
# Initialize Chroma client (in-memory mode)
chroma_client = chromadb.Client(Settings(
    anonymized_telemetry=False  # Disable telemetry
))

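# Create a collection (like a table in SQL).
# "hnsw:space": "cosine" makes Chroma return cosine distances, so `1 - distance`
# later on is a cosine similarity. Chroma's default space is squared L2.
collection = chroma_client.create_collection(
    name="my_documents",
    metadata={"description": "A collection of document embeddings",
              "hnsw:space": "cosine"}
)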

print("✅ ChromaDB initialized!")
print(f"Collection: {collection.name}")
print(f"Count: {collection.count()} documents")

### Step 2: Add Documents to ChromaDB

ChromaDB stores:
- **documents**: The actual text
- **embeddings**: Vector representations
- **metadata**: Additional information
- **ids**: Unique identifiers

In [None]:
# Sample documents about AI/ML
documents = [
    "Machine learning is a subset of AI that uses algorithms to learn from data.",
    "Deep learning uses neural networks with multiple layers to learn complex patterns.",
    "Python is a popular programming language for data science and machine learning.",
    "Natural language processing helps computers understand human language.",
    "Computer vision enables machines to interpret visual information from images and videos.",
    "Reinforcement learning teaches agents to make decisions through trial and error."
]

# Generate embeddings
print("Generating embeddings...")
embeddings = get_openai_embeddings_batch(documents)

# Add to ChromaDB
collection.add(
    documents=documents,
    embeddings=embeddings,
    ids=[f"doc_{i}" for i in range(len(documents))],
    metadatas=[
        {"source": "training_data", "topic": "machine_learning", "index": i} 
        for i in range(len(documents))
    ]
)

print(f"\n✅ Added {len(documents)} documents to ChromaDB")
print(f"Total documents in collection: {collection.count()}")

### Step 3: Query ChromaDB (Semantic Search)

Now comes the magic - searching by meaning, not keywords!

In [None]:
def search_chroma(query: str, n_results: int = 3) -> Dict:
    """
    Search ChromaDB for similar documents.
    
    Args:
        query: Search query
        n_results: Number of results to return
    
    Returns:
        Dictionary with documents, metadata, and distances
    """
    # Generate query embedding
    query_embedding = get_openai_embedding(query)
    
    # Search
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results
    )
    
    return results

# Test semantic search
query = "How do neural networks work?"
results = search_chroma(query, n_results=3)

print(f"Query: {query}\n")
print("Top 3 results:\n")
for i, (doc, metadata, distance) in enumerate(zip(
    results['documents'][0],
    results['metadatas'][0],
    results['distances'][0]
), 1):
    similarity = 1 - distance  # Cosine distance -> similarity (valid only for a cosine collection)
    print(f"{i}. Similarity: {similarity:.4f}")
    print(f"   Document: {doc}")
    print(f"   Metadata: {metadata}")
    print()

### Understanding the Results

Notice that:
- The query mentions "neural networks"
- Top result talks about "deep learning" and "neural networks"
- This is **semantic search** - understanding meaning, not just matching keywords!
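
The `1 - distance` conversion in the loop above yields a cosine similarity only when the collection measures cosine distance. Chroma documents three spaces: `cosine` (1 minus cosine similarity), `ip` (1 minus inner product), and `l2` (squared Euclidean distance, the default). The sketch below converts each to a cosine similarity, assuming unit-length embeddings for `l2` and `ip` (OpenAI documents its embeddings as normalized to length 1). The helper name is hypothetical.

```python
def distance_to_cosine_similarity(distance: float, space: str) -> float:
    """Convert a Chroma-style distance into a cosine similarity.

    Assumes unit-length vectors for the "l2" and "ip" spaces.
    """
    if space == "cosine":
        return 1.0 - distance        # distance = 1 - cos
    if space == "ip":
        return 1.0 - distance        # distance = 1 - dot, and dot == cos for unit vectors
    if space == "l2":
        return 1.0 - distance / 2.0  # squared L2 = 2 - 2*cos for unit vectors
    raise ValueError(f"Unknown space: {space!r}")

# Check the l2 identity on two unit-length vectors.
a = [1.0, 0.0]
b = [0.6, 0.8]
squared_l2 = sum((x - y) ** 2 for x, y in zip(a, b))
cosine = sum(x * y for x, y in zip(a, b))  # both vectors have length 1
print(round(distance_to_cosine_similarity(squared_l2, "l2"), 6), round(cosine, 6))
```

If similarity scores look surprisingly low or even negative, the collection's distance space is the first thing to check.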

### Step 4: Filtering with Metadata

ChromaDB supports filtering results by metadata:

In [None]:
# Search with metadata filter
query_embedding = get_openai_embedding("Tell me about AI")

results = collection.query(
    query_embeddings=[query_embedding],
    n_results=3,
    where={"topic": "machine_learning"}  # Filter by topic
)

print("Results filtered by metadata (topic='machine_learning'):\n")
for i, (doc, metadata) in enumerate(zip(
    results['documents'][0],
    results['metadatas'][0]
), 1):
    print(f"{i}. {doc}")
    print(f"   Topic: {metadata['topic']}")
    print()
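
Chroma's `where` filter expects a single top-level condition, so several equality checks have to be combined under `$and`. The helper below is a hypothetical convenience (not part of Chroma) that builds such a filter from keyword arguments. It is a sketch of the filter shape only and does not call Chroma itself.

```python
from typing import Any, Dict

def build_where(**equals: Any) -> Dict[str, Any]:
    """Build a Chroma-style `where` filter from equality conditions."""
    conditions = [{key: value} for key, value in equals.items()]
    if not conditions:
        raise ValueError("At least one condition is required")
    if len(conditions) == 1:
        return conditions[0]         # A single condition needs no operator
    return {"$and": conditions}      # Several conditions go under one $and

print(build_where(topic="machine_learning"))
print(build_where(topic="machine_learning", source="training_data"))
```

The result can be passed as `where=` to `collection.query(...)`. Chroma also documents comparison operators such as `$gte` and `$in` for numeric ranges and membership tests.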

### Step 5: Update and Delete

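You can also delete documents and add them back. For in-place changes, Chroma additionally provides `collection.update()` and `collection.upsert()`: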

In [None]:
# Check the current count
print("Original count:", collection.count())

# Delete a document
collection.delete(ids=["doc_0"])
print("After deletion:", collection.count())

# Add it back
collection.add(
    documents=[documents[0]],
    embeddings=[embeddings[0]],
    ids=["doc_0"],
    metadatas=[{"source": "training_data", "topic": "machine_learning", "index": 0}]
)
print("After adding back:", collection.count())

print("\n✅ Delete and re-add operations work!")

### ChromaDB Key Operations Summary

**Create collection:**
```python
collection = chroma_client.create_collection(name="my_collection")
```

**Add documents:**
```python
collection.add(
    documents=[...],
    embeddings=[...],
    ids=[...],
    metadatas=[...]
)
```

**Query (semantic search):**
```python
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=3,
    where={"key": "value"}  # Optional metadata filter
)
```

**Delete:**
```python
collection.delete(ids=["doc_1", "doc_2"])
```

**Get count:**
```python
count = collection.count()
```

---

## Section 10: Practical Example - Document Search System

Let's build a complete, reusable document search engine class.

This will encapsulate everything we've learned into a clean, reusable API.

In [None]:
class VectorSearchEngine:
    """
    A complete vector search engine using ChromaDB and OpenAI embeddings.
    
    Features:
    - Add documents with metadata
    - Semantic search
    - Metadata filtering
    - Similarity scores
    """
    
    def __init__(self, collection_name: str = "documents", embedding_function=None):
        """
        Initialize the search engine.
        
        Args:
            collection_name: Name for the ChromaDB collection
            embedding_function: Function to generate embeddings (defaults to OpenAI)
        """
        self.client = chromadb.Client(Settings(anonymized_telemetry=False))
        self.collection_name = collection_name
        
        # Create the collection, or reuse it if it already exists.
        # Cosine space keeps `1 - distance` in search() a cosine similarity.
        self.collection = self.client.get_or_create_collection(
            name=collection_name,
            metadata={"hnsw:space": "cosine"}
        )
        
        # Use provided embedding function or default to OpenAI
        self.embedding_function = embedding_function or get_openai_embedding
    
    def add_documents(self, documents: List[str], metadatas: List[Dict] = None) -> None:
        """
        Add documents to the search engine.
        
        Args:
            documents: List of document texts
            metadatas: Optional list of metadata dicts (one per document)
        """
        # Generate embeddings (batch for efficiency)
        if self.embedding_function is get_openai_embedding:
            embeddings = get_openai_embeddings_batch(documents)
        else:
            embeddings = [self.embedding_function(doc) for doc in documents]
        
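        # Generate IDs from the current count. This is fine for an append-only
        # demo, but IDs can collide once documents have been deleted.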
        start_id = self.collection.count()
        ids = [f"doc_{start_id + i}" for i in range(len(documents))]
        
        # Metadata is optional. Pass None through as-is: Chroma validates
        # metadata and (in the versions we know of) rejects empty dicts,
        # so we don't substitute {} for missing metadata.
        
        # Add to collection
        self.collection.add(
            documents=documents,
            embeddings=embeddings,
            ids=ids,
            metadatas=metadatas
        )
        
        print(f"✅ Added {len(documents)} documents. Total: {self.collection.count()}")
    
    def search(self, query: str, n_results: int = 5, metadata_filter: Dict = None) -> Dict:
        """
        Search for similar documents.
        
        Args:
            query: Search query
            n_results: Number of results to return
            metadata_filter: Optional metadata filter (e.g., {"source": "paper.pdf"})
        
        Returns:
            Dictionary with query, results (document, metadata, similarity)
        """
        # Generate query embedding
        query_emb = self.embedding_function(query)
        
        # Search
        search_kwargs = {
            "query_embeddings": [query_emb],
            "n_results": n_results
        }
        
        if metadata_filter:
            search_kwargs["where"] = metadata_filter
        
        results = self.collection.query(**search_kwargs)
        
        # Format results
        formatted_results = {
            "query": query,
            "results": [
                {
                    "document": doc,
                    "metadata": meta,
                    "similarity": 1 - dist  # Convert distance to similarity
                }
                for doc, meta, dist in zip(
                    results['documents'][0],
                    results['metadatas'][0],
                    results['distances'][0]
                )
            ]
        }
        
        return formatted_results
    
    def get_stats(self) -> Dict:
        """Get statistics about the search engine."""
        return {
            "collection_name": self.collection_name,
            "total_documents": self.collection.count()
        }

print("✅ VectorSearchEngine class defined!")

### Let's Test Our Search Engine!

We'll create a search engine with documents about AI and programming:

In [None]:
# Create search engine
engine = VectorSearchEngine(collection_name="test_search")

# Add AI-related documents
ai_docs = [
    "Machine learning models learn patterns from training data.",
    "Neural networks consist of interconnected layers of artificial neurons.",
    "Deep learning has revolutionized computer vision and NLP tasks.",
    "Transformers use self-attention mechanisms for sequence processing."
]

ai_metadata = [
    {"category": "ai", "topic": "machine_learning"},
    {"category": "ai", "topic": "neural_networks"},
    {"category": "ai", "topic": "deep_learning"},
    {"category": "ai", "topic": "transformers"}
]

engine.add_documents(ai_docs, ai_metadata)

# Add programming-related documents
prog_docs = [
    "Python uses dynamic typing and automatic memory management.",
    "JavaScript is the primary language for web browser scripting.",
    "React is a JavaScript library for building user interfaces.",
    "Docker containers package applications with their dependencies."
]

prog_metadata = [
    {"category": "programming", "topic": "python"},
    {"category": "programming", "topic": "javascript"},
    {"category": "programming", "topic": "react"},
    {"category": "programming", "topic": "docker"}
]

engine.add_documents(prog_docs, prog_metadata)

print(f"\n{engine.get_stats()}")

### Test 1: Search for AI Topics

In [None]:
# Search for AI-related content
result = engine.search("How do neural networks learn?", n_results=3)

print(f"Query: {result['query']}\n")
print("Top 3 results:\n")
for i, r in enumerate(result['results'], 1):
    print(f"{i}. Similarity: {r['similarity']:.4f}")
    print(f"   Document: {r['document']}")
    print(f"   Category: {r['metadata']['category']}")
    print(f"   Topic: {r['metadata']['topic']}")
    print()

### Test 2: Search with Metadata Filter

In [None]:
# Search only in programming documents
result = engine.search(
    "Tell me about web development",
    n_results=3,
    metadata_filter={"category": "programming"}
)

print(f"Query: {result['query']}")
print("Filter: category='programming'\n")
print("Results:\n")
for i, r in enumerate(result['results'], 1):
    print(f"{i}. Similarity: {r['similarity']:.4f}")
    print(f"   Document: {r['document']}")
    print(f"   Topic: {r['metadata']['topic']}")
    print()

### Test 3: Cross-Category Search

In [None]:
# General search across all categories
result = engine.search("What programming tools are popular?", n_results=5)

print(f"Query: {result['query']}\n")
print("Top 5 results (all categories):\n")
for i, r in enumerate(result['results'], 1):
    print(f"{i}. Similarity: {r['similarity']:.4f} | Category: {r['metadata']['category']}")
    print(f"   {r['document']}")
    print()

### What We Built

We now have a compact, reusable vector search engine that:

- ✅ **Stores documents** with rich metadata
- ✅ **Performs semantic search** (meaning-based, not keyword)
- ✅ **Filters by metadata** (category, topic, etc.)
- ✅ **Returns similarity scores** for ranking
- ✅ **Handles batch operations** efficiently

This is the foundation of:
- RAG chatbots
- Document search engines
- Recommendation systems
- Content discovery platforms

---

## Section 11: Best Practices & Common Pitfalls

### Best Practices

**1. Embedding Model Selection**

✅ **DO:**
- Use the **same model** for indexing and querying
- Start with cost-effective models (OpenAI small, open-source)
- Test quality before choosing expensive models

❌ **DON'T:**
- Mix embeddings from different models
- Assume larger models are always better

**2. Chunking Strategy**

✅ **DO:**
- Use 256-512 token chunks for most use cases
- Add 10-20% overlap between chunks
- Include rich metadata (source, page, section)
- Test different strategies with your data

❌ **DON'T:**
- Use chunks that are too large (>1000 tokens)
- Use chunks that are too small (<100 tokens)
- Forget to add metadata

**3. Indexing**

✅ **DO:**
- Batch embedding API calls (fewer requests, faster)
- Cache embeddings when possible
- Update indexes incrementally
- Monitor API costs

❌ **DON'T:**
- Generate embeddings one at a time
- Regenerate embeddings unnecessarily
- Ignore rate limits
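
Caching and batching work well together: look up each text by a hash of its content, then send only the cache misses in a single batched call. The sketch below keeps the cache in memory and takes any batch embedding function as a parameter; in this notebook that could be `get_openai_embeddings_batch`. The `EmbeddingCache` class and the fake embedder are hypothetical illustrations.

```python
import hashlib
from typing import Callable, Dict, List

class EmbeddingCache:
    """In-memory embedding cache keyed by a hash of the text (illustrative only)."""

    def __init__(self, embed_batch: Callable[[List[str]], List[List[float]]]):
        self._embed_batch = embed_batch
        self._store: Dict[str, List[float]] = {}

    @staticmethod
    def _key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get_many(self, texts: List[str]) -> List[List[float]]:
        # Collect texts not seen before, de-duplicated, preserving order.
        missing: Dict[str, str] = {}
        for text in texts:
            key = self._key(text)
            if key not in self._store and key not in missing:
                missing[key] = text
        if missing:
            # One batched call covers all cache misses.
            vectors = self._embed_batch(list(missing.values()))
            for key, vector in zip(missing.keys(), vectors):
                self._store[key] = vector
        return [self._store[self._key(text)] for text in texts]

# Fake batch embedder (hypothetical) that records how many texts each call received.
calls: List[int] = []

def fake_embed_batch(texts: List[str]) -> List[List[float]]:
    calls.append(len(texts))
    return [[float(len(t))] for t in texts]

cache = EmbeddingCache(fake_embed_batch)
cache.get_many(["alpha", "beta", "alpha"])
cache.get_many(["beta", "gamma"])
print(calls)  # [2, 1]: only unseen texts reach the embedder
```

A persistent store (a file or a key-value database) would survive notebook restarts. The in-memory dict here only shows the lookup pattern.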

**4. Search Quality**

✅ **DO:**
- Set a minimum similarity threshold, tuned on your own data (e.g., start around 0.7 and adjust)
- Combine vector search with metadata filters
- Consider hybrid search (vector + keyword)
- Use reranking for top results

❌ **DON'T:**
- Return all results regardless of similarity
- Rely solely on vector search for everything
- Ignore metadata filtering capabilities
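
One common way to combine vector and keyword results in a hybrid search is reciprocal rank fusion (RRF): each document scores `1 / (k + rank)` in every ranking it appears in, and the scores are summed. The sketch below assumes you already have two ranked lists of document IDs. The IDs are made up, and `k = 60` is a conventional default rather than a tuned value.

```python
from collections import defaultdict
from typing import Dict, List

def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge several ranked ID lists into one, best first."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

vector_hits = ["doc_2", "doc_5", "doc_1"]   # e.g. from a vector DB query
keyword_hits = ["doc_5", "doc_3", "doc_2"]  # e.g. from a full-text search
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)  # ['doc_5', 'doc_2', 'doc_3', 'doc_1']
```

Documents that rank well in both lists rise to the top, and the raw scores from the two systems never need to be on the same scale.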

**5. Performance**

✅ **DO:**
- Use appropriate vector DB for your scale
  - Small: ChromaDB, FAISS
  - Large: Pinecone, Weaviate
- Monitor query latency
- Implement caching for frequent queries
- Consider async operations for scale

❌ **DON'T:**
- Use in-memory DB for large datasets
- Ignore performance metrics
- Block on embedding generation

### Common Pitfalls

**1. Mixing Embedding Models**

❌ **Problem:**
```python
# Index with model A
embeddings = get_openai_embedding(text)

# Query with model B
query_emb = get_google_embedding(query)  # ❌ Different model!
```

✅ **Solution:**
```python
# Use same model for both
embeddings = get_openai_embedding(text)
query_emb = get_openai_embedding(query)  # ✅ Same model
```

**2. Wrong Chunk Sizes**

❌ **Problem:**
```python
# Too large - loses precision
chunks = chunk_by_tokens(text, chunk_size=2000)  

# Too small - loses context
chunks = chunk_by_tokens(text, chunk_size=50)  
```

✅ **Solution:**
```python
# Just right - 256-512 tokens
chunks = chunk_by_tokens(text, chunk_size=512, overlap=50)
```

**3. Missing Metadata**

❌ **Problem:**
```python
# No metadata - can't filter or cite
collection.add(documents=docs, embeddings=embs, ids=ids)
```

✅ **Solution:**
```python
# Rich metadata for filtering and citations (one dict per document)
collection.add(
    documents=docs,
    embeddings=embs,
    ids=ids,
    metadatas=[
        {"source": "paper.pdf", "page": page, "date": "2024-01-15"}
        for page, _ in enumerate(docs, start=1)
    ]
)
```

**4. No Text Preprocessing**

❌ **Problem:**
```python
# Raw text with HTML, special characters
text = "<div>Machine learning is...</div> \n\n\n   "
embedding = get_openai_embedding(text)
```

✅ **Solution:**
```python
# Clean text before embedding
import re
text = re.sub(r'<[^>]+>', ' ', text)  # Replace HTML tags with a space
text = ' '.join(text.split())  # Normalize whitespace
embedding = get_openai_embedding(text)
```

**5. Not Testing Similarity Thresholds**

❌ **Problem:**
```python
# Return all results, even low-quality ones
results = collection.query(query_embeddings=[emb], n_results=10)
return results  # May include irrelevant documents
```

✅ **Solution:**
```python
# Filter by similarity threshold (assumes a cosine-distance collection)
results = collection.query(query_embeddings=[emb], n_results=10)
filtered = [
    (doc, 1 - dist)
    for doc, dist in zip(results['documents'][0], results['distances'][0])
    if (1 - dist) > 0.7  # Only high-quality matches
]
return filtered
```

### Production Checklist

Before deploying to production:

- [ ] Same embedding model for index and query
- [ ] Optimal chunk size tested (256-512 tokens)
- [ ] Overlap between chunks (10-20%)
- [ ] Rich metadata for all documents
- [ ] Similarity threshold configured
- [ ] Batch API calls for efficiency
- [ ] Error handling and retries
- [ ] Monitoring and logging
- [ ] Caching strategy
- [ ] Rate limiting
- [ ] Cost monitoring

---

## Section 12: When to Use Vector Databases

Understanding when to use (and not use) vector databases is critical for building effective systems.

### Use Vector Databases When:

**1. Semantic Search**

Finding documents by **meaning**, not exact keywords.

**Example:**
- Query: "budget management strategies"
- Matches: "financial planning", "cost control", "expense tracking"

**Use cases:**
- Document search engines
- Knowledge bases
- FAQ systems

**2. RAG (Retrieval-Augmented Generation)**

Retrieving relevant context for LLMs.

**Example:**
- User asks: "What's our refund policy?"
- Vector DB retrieves relevant policy documents
- LLM generates answer using retrieved context

**Use cases:**
- Chatbots with custom knowledge
- Q&A systems
- Customer support agents

**3. Recommendation Systems**

Finding similar items based on embeddings.

**Example:**
- "Users who liked this article also liked..."
- "Similar products to what you viewed"

**Use cases:**
- Content recommendations
- Product recommendations
- User matching

**4. Duplicate Detection**

Finding similar or duplicate content at scale.

**Example:**
- Detect duplicate support tickets
- Find plagiarized content
- Identify similar documents

**Use cases:**
- Content moderation
- Data deduplication
- Plagiarism detection
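
As a small-scale illustration of duplicate detection, the sketch below compares every pair of embeddings and reports pairs above a cosine-similarity threshold. The ticket IDs, the two-dimensional vectors, and the 0.95 threshold are made up for the example. A vector database replaces the all-pairs loop with an index lookup per item.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def find_near_duplicates(
    embeddings: Dict[str, Sequence[float]],
    threshold: float = 0.95,
) -> List[Tuple[str, str, float]]:
    """Return (id_a, id_b, similarity) for every pair at or above the threshold."""
    pairs = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(embeddings.items(), 2):
        sim = cosine_similarity(vec_a, vec_b)
        if sim >= threshold:
            pairs.append((id_a, id_b, sim))
    return pairs

tickets = {
    "ticket_1": [1.0, 0.0],
    "ticket_2": [0.99, 0.05],  # Nearly the same direction as ticket_1
    "ticket_3": [0.0, 1.0],
}
duplicates = find_near_duplicates(tickets)
print([(a, b) for a, b, _ in duplicates])  # [('ticket_1', 'ticket_2')]
```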

**5. Anomaly Detection**

Finding outliers in high-dimensional data.

**Example:**
- Unusual transaction patterns
- Anomalous user behavior
- Fraud detection

**Use cases:**
- Security monitoring
- Fraud prevention
- Quality control

### Don't Use Vector Databases When:

**1. Exact Match Queries**

❌ **Wrong tool:**
```python
# Finding user by email
vector_db.query("user@example.com")  # Overkill!
```

✅ **Use traditional DB:**
```sql
SELECT * FROM users WHERE email = 'user@example.com'
```

**2. Transactional Data**

❌ **Wrong tool:**
- Banking transactions
- E-commerce orders
- User account management

✅ **Use traditional DB:**
- Need ACID guarantees
- PostgreSQL, MySQL
- Relational integrity

**3. Simple Keyword Search**

❌ **Wrong tool:**
```python
# Searching by exact title or ID
vector_db.query("DOC-12345")
```

✅ **Use full-text search:**
- Elasticsearch
- PostgreSQL full-text search
- Simple keyword matching

**4. Structured Relationships**

❌ **Wrong tool:**
- Complex joins (users → orders → products)
- Foreign key relationships
- Multi-table aggregations

✅ **Use relational DB:**
- PostgreSQL, MySQL
- SQL for complex queries
- Referential integrity

**5. Real-time Analytics**

❌ **Wrong tool:**
```sql
-- Aggregations, GROUP BY, SUM
SELECT category, COUNT(*) FROM products GROUP BY category
```

✅ **Use analytics DB:**
- ClickHouse
- BigQuery
- Time-series databases

### Hybrid Architecture (Best Practice)

Most production systems use **both** traditional and vector databases:

```
┌─────────────────────────────────────────┐
│         Application Layer               │
└─────────────────────────────────────────┘
           │            │           │
           ▼            ▼           ▼
    ┌──────────┐  ┌──────────┐  ┌──────┐
    │PostgreSQL│  │ ChromaDB │  │Redis │
    │          │  │          │  │      │
    │User data │  │Embeddings│  │Cache │
    │Orders    │  │Semantic  │  │      │
    │          │  │search    │  │      │
    └──────────┘  └──────────┘  └──────┘
```

**Example: E-commerce with RAG**

```python
# Illustrative pseudocode: postgres, chroma, redis, and combine() are placeholders.
# User data and orders → PostgreSQL
user = postgres.query("SELECT * FROM users WHERE id = 123")
orders = postgres.query("SELECT * FROM orders WHERE user_id = 123")

# Product descriptions and reviews → Vector DB
similar_products = chroma.search("wireless headphones noise cancelling")

# Frequent queries → Redis cache
cached_results = redis.get("popular_products")

# Combine results
recommendations = combine(user, orders, similar_products)
```

### Decision Matrix

| Scenario | Use This |
|----------|----------|
| Semantic search | Vector DB |
| RAG chatbot | Vector DB |
| Recommendations | Vector DB |
| User authentication | Traditional DB |
| E-commerce orders | Traditional DB |
| Simple keyword search | Full-text search |
| Analytics queries | Analytics DB |
| Caching | Redis/Memcached |

### Summary

**Vector databases are excellent for:**
- Semantic similarity
- Meaning-based search
- RAG applications
- Recommendations

**But they're not a replacement for:**
- Traditional databases (exact queries, transactions)
- Full-text search (keyword matching)
- Analytics databases (aggregations, reporting)

**Use the right tool for each job!**

---

## Section 13: Summary & Next Steps

Congratulations! You've completed the Vector Databases and Embeddings notebook.

### What We Learned Today

✅ **Embeddings Fundamentals**
- Text → vectors that capture semantic meaning
- Different providers: OpenAI, Google, open-source
- When to use each model (cost, quality, privacy trade-offs)

✅ **Similarity Search**
- Cosine similarity, Euclidean distance, dot product
- Top-k search implementation
- Measuring semantic similarity between texts

✅ **Chunking Strategies**
- Fixed-size chunking (256-512 tokens)
- Sentence-based chunking
- Importance of overlap (10-20%) and metadata

✅ **Vector Databases**
- Hands-on experience with ChromaDB
- When to use vector DB vs traditional DB
- Production considerations and hybrid architecture

✅ **Best Practices**
- Consistent embedding models
- Proper chunking with metadata
- Similarity thresholds
- Batch operations for efficiency
- Common pitfalls to avoid

### Key Takeaways

**1. Embeddings are the foundation of modern AI applications**
- Enable semantic search (meaning-based, not keyword-based)
- Critical for RAG, recommendations, and more

**2. Chunking strategy matters**
- Chunk size: 256-512 tokens (sweet spot)
- Add overlap for context continuity
- Always include rich metadata

**3. Use the right tool for the job**
- Vector DB: Semantic search, RAG, recommendations
- Traditional DB: Exact queries, transactions
- Hybrid: Most production systems use both

**4. Start simple, scale smart**
- Begin with OpenAI small or open-source
- Use ChromaDB for prototyping
- Upgrade models/infrastructure based on needs

**5. Test and iterate**
- Different chunking strategies for different data
- Set similarity thresholds based on testing
- Monitor quality and costs

### Preview: Next Session - RAG (Retrieval-Augmented Generation)

In the next notebook, we'll combine everything we've learned:

**Topics we'll cover:**
- Building complete RAG pipeline
- Document loading (PDF, TXT, web)
- Combining vector search with LLMs
- Citation and source tracking
- Handling multi-document queries
- **Hands-on project:** Build a RAG chatbot

**You'll be able to:**
- Load documents from various sources
- Chunk and embed them efficiently
- Retrieve relevant context for queries
- Generate answers with citations
- Build production-ready RAG applications

### Resources for Deeper Learning

**Official Documentation:**
- [OpenAI Embeddings Guide](https://platform.openai.com/docs/guides/embeddings)
- [ChromaDB Documentation](https://docs.trychroma.com/)
- [Sentence Transformers](https://www.sbert.net/)

**Vector Database Comparisons:**
- [ANN Benchmarks (approximate nearest neighbor algorithms)](https://github.com/erikbern/ann-benchmarks)
- [Vector DB Landscape](https://www.pinecone.io/learn/vector-database/)

**Advanced Topics:**
- Hybrid search (vector + keyword)
- Reranking models
- Fine-tuning embedding models
- Quantization and compression

### Practice Exercises

Before the next session, try:

1. **Build a mini search engine** for your own documents
2. **Compare embedding models** - test OpenAI vs Google vs open-source on your data
3. **Experiment with chunking** - try different sizes and overlap percentages
4. **Add metadata filtering** - categorize documents and filter searches

### Questions?

Common questions:

**Q: Which embedding model should I use?**
A: Start with OpenAI text-embedding-3-small (good quality/cost balance). Upgrade if needed.

**Q: How do I know if my chunk size is right?**
A: Test retrieval quality with sample queries. If results are too broad, use smaller chunks. If too fragmented, use larger chunks.

**Q: Should I use ChromaDB in production?**
A: ChromaDB works well for small-medium datasets. For large-scale production, consider Pinecone or Weaviate.

**Q: How much does this cost?**
A: With OpenAI `text-embedding-3-small` at $0.02 per 1M tokens (check current pricing), a 10,000-document corpus at ~500 tokens each is about 5M tokens, or ~$0.10.
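
The cost arithmetic is easy to redo for your own corpus. The sketch below uses the $0.02 per 1M tokens figure quoted above as its default; the function name is hypothetical, and the price should be replaced with whatever your provider currently charges.

```python
def estimate_embedding_cost(
    num_documents: int,
    avg_tokens_per_document: int,
    price_per_million_tokens: float = 0.02,  # Figure quoted in this notebook
) -> float:
    """Estimate the one-off cost (in dollars) of embedding a corpus."""
    total_tokens = num_documents * avg_tokens_per_document
    return total_tokens / 1_000_000 * price_per_million_tokens

cost = estimate_embedding_cost(10_000, 500)
print(f"${cost:.2f}")  # 5M tokens at $0.02 per 1M -> $0.10
```

Query embeddings add to this over time, but each query is usually only a few tokens.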

### You're Ready for RAG!

You now understand the core building blocks:
- ✅ How to generate embeddings
- ✅ How to measure similarity
- ✅ How to chunk documents
- ✅ How to store and search vectors

Next, we'll put it all together and build a RAG chatbot that can answer questions using your custom knowledge base!

---

**Great work! See you in the next session! 🚀**