# Lab 1: Embeddings & Semantic Search

**Week 4 - RAG Fundamentals**

**Provided by:** ADC ENGINEERING & CONSULTING LTD

## Objectives

In this lab, you will:
- Understand text embeddings and vector representations
- Generate embeddings using OpenAI's embedding models
- Implement similarity search algorithms
- Build a vector database from scratch
- Work with different distance metrics
- Create a semantic search engine
- Optimize embedding storage and retrieval
- Handle multilingual embeddings
- Evaluate search quality

## Prerequisites

- Completed Week 3 labs
- Understanding of vectors and linear algebra basics
- OpenAI API key configured
- Python 3.9+

## Setup and Installation

In [None]:
# Install required packages
!pip install openai python-dotenv tiktoken numpy scikit-learn pandas matplotlib seaborn --quiet

In [None]:
import os
import json
import numpy as np
from typing import List, Dict, Optional, Tuple, Any
from dataclasses import dataclass, field
from datetime import datetime
import pickle

from openai import OpenAI
from dotenv import load_dotenv
import tiktoken

# For similarity computations
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
from sklearn.decomposition import PCA

# For visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Load environment variables
load_dotenv()

# Initialize OpenAI client
client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

print("✓ Setup complete!")

## Part 1: Understanding Text Embeddings

Text embeddings are dense vector representations of text that capture semantic meaning. Texts with similar meanings map to vectors that lie close together, even when they share few words.

### Key Concepts:

1. **Embedding Models**: Transform text → vectors
2. **Dimensionality**: Number of dimensions in the vector (e.g., 1536 for text-embedding-3-small)
3. **Semantic Similarity**: Similar meaning → close vectors
4. **Distance Metrics**: Measure similarity between vectors
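
To make "close vectors" concrete before calling the API, here is a minimal sketch with hand-made 3-dimensional vectors. The vectors and their labels are invented for illustration; real embeddings come from the model and have far more dimensions.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (not produced by any model)
toy_vectors = {
    "cat":    [0.9, 0.1, 0.0],
    "feline": [0.8, 0.2, 0.1],
    "pizza":  [0.0, 0.2, 0.9],
}

sim_related = cosine(toy_vectors["cat"], toy_vectors["feline"])
sim_unrelated = cosine(toy_vectors["cat"], toy_vectors["pizza"])
print(f"cat vs feline: {sim_related:.3f}")
print(f"cat vs pizza:  {sim_unrelated:.3f}")
```

The related pair points in nearly the same direction, so its cosine is close to 1, while the unrelated pair scores close to 0.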

Let's start by generating embeddings:

In [None]:
def get_embedding(text: str, model: str = "text-embedding-3-small") -> List[float]:
    """
    Get embedding for a text using OpenAI's API.
    
    Args:
        text: Text to embed
        model: Embedding model to use
    
    Returns:
        Embedding vector
    """
    # Replace newlines with spaces
    text = text.replace("\n", " ")
    
    response = client.embeddings.create(
        input=[text],
        model=model
    )
    
    return response.data[0].embedding

# Test with simple examples
texts = [
    "The cat sits on the mat",
    "A feline rests on the rug",
    "The dog plays in the garden",
    "Python is a programming language",
    "I love eating pizza"
]

print("="*80)
print("GENERATING EMBEDDINGS")
print("="*80)

embeddings = []
for text in texts:
    embedding = get_embedding(text)
    embeddings.append(embedding)
    print(f"\nText: {text}")
    print(f"Embedding dimension: {len(embedding)}")
    print(f"First 5 values: {embedding[:5]}")

# Convert to numpy array for easier manipulation
embeddings = np.array(embeddings)
print(f"\n✓ Generated {len(embeddings)} embeddings with shape {embeddings.shape}")

### Exercise 1.1: Explore Embedding Properties

Generate embeddings and explore their properties:

In [None]:
# TODO: Explore embedding properties

# 1. TODO: Generate embeddings for these sentence pairs
sentence_pairs = [
    ("The weather is sunny", "It's a bright day"),
    ("I am happy", "I am sad"),
    ("Machine learning is fascinating", "AI is interesting"),
    ("The car is red", "The vehicle is crimson"),
    ("Hello world", "Goodbye universe")
]

# 2. TODO: For each pair, calculate the cosine similarity
# Hint: Use sklearn's cosine_similarity or implement manually

# 3. TODO: Which pairs are most similar? Least similar?

# 4. TODO: Generate embeddings for the same text with small variations
# Example: "Hello" vs "Hello!" vs "hello" vs "HELLO"
# Do punctuation and case affect embeddings?

# Your code here:
# for pair in sentence_pairs:
#     emb1 = get_embedding(pair[0])
#     emb2 = get_embedding(pair[1])
#     similarity = ...
#     print(f"{pair[0]} <-> {pair[1]}: {similarity:.4f}")

## Part 2: Similarity Metrics

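Several metrics can measure how similar two vectors are. The class below implements four of them: cosine similarity, Euclidean distance, dot product, and Manhattan distance.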

In [None]:
class SimilarityCalculator:
    """
    Calculate various similarity metrics between vectors.
    """
    
    @staticmethod
    def cosine_similarity(vec1: np.ndarray, vec2: np.ndarray) -> float:
        """
        Cosine similarity: measures angle between vectors.
        Range: [-1, 1], where 1 = identical direction
        """
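        dot_product = np.dot(vec1, vec2)
        norm1 = np.linalg.norm(vec1)
        norm2 = np.linalg.norm(vec2)
        if norm1 == 0 or norm2 == 0:
            # Cosine similarity is undefined for zero vectors; return 0.0 by convention
            return 0.0
        return float(dot_product / (norm1 * norm2))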
    
    @staticmethod
    def euclidean_distance(vec1: np.ndarray, vec2: np.ndarray) -> float:
        """
        Euclidean distance: straight-line distance.
        Range: [0, ∞), where 0 = identical
        """
        return np.linalg.norm(vec1 - vec2)
    
    @staticmethod
    def dot_product(vec1: np.ndarray, vec2: np.ndarray) -> float:
        """
        Dot product: measures similarity and magnitude.
        Range: (-∞, ∞)
        """
        return np.dot(vec1, vec2)
    
    @staticmethod
    def manhattan_distance(vec1: np.ndarray, vec2: np.ndarray) -> float:
        """
        Manhattan distance: sum of absolute differences.
        Range: [0, ∞), where 0 = identical
        """
        return np.sum(np.abs(vec1 - vec2))
    
    @staticmethod
    def compare_all_metrics(vec1: np.ndarray, vec2: np.ndarray) -> Dict[str, float]:
        """Compare using all metrics."""
        return {
            "cosine_similarity": SimilarityCalculator.cosine_similarity(vec1, vec2),
            "euclidean_distance": SimilarityCalculator.euclidean_distance(vec1, vec2),
            "dot_product": SimilarityCalculator.dot_product(vec1, vec2),
            "manhattan_distance": SimilarityCalculator.manhattan_distance(vec1, vec2)
        }

# Test with our earlier examples
calc = SimilarityCalculator()

print("\n" + "="*80)
print("SIMILARITY COMPARISON")
print("="*80)

comparisons = [
    (0, 1, "cat on mat", "feline on rug"),
    (0, 2, "cat on mat", "dog in garden"),
    (0, 3, "cat on mat", "Python language"),
    (3, 4, "Python language", "eating pizza")
]

for idx1, idx2, text1, text2 in comparisons:
    metrics = calc.compare_all_metrics(embeddings[idx1], embeddings[idx2])
    
    print(f"\n{text1} <-> {text2}")
    print("-" * 80)
    for metric, value in metrics.items():
        print(f"  {metric:20s}: {value:>10.4f}")
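
These metrics are closely related when vectors have unit length: the dot product then equals the cosine similarity, and the squared Euclidean distance equals 2 - 2 * cosine. A minimal sketch on random vectors that are normalized by hand. Whether a given embedding model returns unit-length vectors is something to check with `np.linalg.norm` rather than assume.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=8)
b = rng.normal(size=8)

# Scale both vectors to unit length
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

cos = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
dot_unit = float(np.dot(a_unit, b_unit))
sq_dist_unit = float(np.sum((a_unit - b_unit) ** 2))

print(f"cosine(a, b)          = {cos:.6f}")
print(f"dot(a_unit, b_unit)   = {dot_unit:.6f}")
print(f"||a_unit - b_unit||^2 = {sq_dist_unit:.6f}")
print(f"2 - 2 * cosine        = {2 - 2 * cos:.6f}")
```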

### Exercise 2.1: Compare Distance Metrics

Compare different metrics for semantic search:

In [None]:
# TODO: Compare distance metrics for search

# 1. TODO: Create a small corpus of documents (10-15 sentences on different topics)
corpus = [
    # TODO: Add diverse sentences
    "The quick brown fox jumps over the lazy dog",
    # Add more...
]

# 2. TODO: Generate embeddings for all documents

# 3. TODO: Pick a query (e.g., "animals in nature")

# 4. TODO: Rank documents using each metric
# - Cosine similarity (higher = better)
# - Euclidean distance (lower = better)
# - Dot product (higher = better)

# 5. TODO: Do different metrics give different rankings?

# Your implementation:

## Part 3: Building a Simple Vector Database

Implement a basic vector database for semantic search.
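
At its core, the search in such a database is a matrix operation: stack the document vectors into one matrix, score every row against the query, and keep the best `top_k`. A minimal sketch with hand-made 2-dimensional vectors; the data and the helper name `top_k_cosine` are illustrative assumptions, not part of the lab code.

```python
import numpy as np

def top_k_cosine(matrix: np.ndarray, query: np.ndarray, k: int):
    """Return (indices, scores) of the k rows most similar to query."""
    # Normalize rows and query so the dot product equals cosine similarity
    matrix_unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    scores = matrix_unit @ query_unit
    order = np.argsort(scores)[::-1][:k]  # highest score first
    return order, scores[order]

doc_vectors = np.array([
    [1.0, 0.0],   # doc 0
    [0.0, 1.0],   # doc 1
    [1.0, 1.0],   # doc 2
    [-1.0, 0.0],  # doc 3
])
query_vector = np.array([2.0, 0.5])

indices, scores = top_k_cosine(doc_vectors, query_vector, k=2)
print(indices, scores)
```

This brute-force scan compares the query with every document, which is fine for the small collections used in this lab.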

In [None]:
@dataclass
class Document:
    """A document with its embedding."""
    id: str
    text: str
    embedding: np.ndarray
    metadata: Dict[str, Any] = field(default_factory=dict)
    created_at: str = field(default_factory=lambda: datetime.now().isoformat())

class SimpleVectorDB:
    """
    A simple in-memory vector database.
    """
    
    def __init__(self, embedding_model: str = "text-embedding-3-small"):
        """
        Initialize vector database.
        
        Args:
            embedding_model: OpenAI embedding model to use
        """
        self.embedding_model = embedding_model
        self.documents: List[Document] = []
        self.embeddings_matrix: Optional[np.ndarray] = None
    
    def add_document(
        self,
        text: str,
        doc_id: Optional[str] = None,
        metadata: Optional[Dict] = None
    ):
        """
        Add a document to the database.
        
        Args:
            text: Document text
            doc_id: Optional document ID (auto-generated if not provided)
            metadata: Optional metadata
        """
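        if doc_id is None:
            # Pick the next unused ID; len() alone can collide after deletions
            existing_ids = {doc.id for doc in self.documents}
            next_index = len(self.documents)
            while f"doc_{next_index}" in existing_ids:
                next_index += 1
            doc_id = f"doc_{next_index}"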
        
        # Generate embedding
        embedding = np.array(get_embedding(text, self.embedding_model))
        
        # Create document
        doc = Document(
            id=doc_id,
            text=text,
            embedding=embedding,
            metadata=metadata or {}
        )
        
        self.documents.append(doc)
        
        # Rebuild embeddings matrix
        self._rebuild_embeddings_matrix()
        
        return doc_id
    
    def add_documents(self, texts: List[str], metadatas: Optional[List[Dict]] = None):
        """Add multiple documents."""
        if metadatas is None:
            # One independent dict per document (avoids sharing a single dict object)
            metadatas = [{} for _ in texts]
        
        for text, metadata in zip(texts, metadatas):
            self.add_document(text, metadata=metadata)
    
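    def _rebuild_embeddings_matrix(self):
        """Rebuild the embeddings matrix from the current documents."""
        if self.documents:
            self.embeddings_matrix = np.vstack([doc.embedding for doc in self.documents])
        else:
            # No documents left (e.g. after deleting the last one)
            self.embeddings_matrix = None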
    
    def search(
        self,
        query: str,
        top_k: int = 5,
        metric: str = "cosine"
    ) -> List[Tuple[Document, float]]:
        """
        Search for similar documents.
        
        Args:
            query: Search query
            top_k: Number of results to return
            metric: Similarity metric ('cosine', 'euclidean', 'dot')
        
        Returns:
            List of (document, score) tuples
        """
        if not self.documents:
            return []
        
        # Get query embedding
        query_embedding = np.array(get_embedding(query, self.embedding_model))
        
        # Calculate similarities
        if metric == "cosine":
            # Cosine similarity (higher is better)
            similarities = cosine_similarity(
                query_embedding.reshape(1, -1),
                self.embeddings_matrix
            )[0]
            # Sort descending
            top_indices = np.argsort(similarities)[::-1][:top_k]
            scores = similarities[top_indices]
        
        elif metric == "euclidean":
            # Euclidean distance (lower is better)
            distances = euclidean_distances(
                query_embedding.reshape(1, -1),
                self.embeddings_matrix
            )[0]
            # Sort ascending
            top_indices = np.argsort(distances)[:top_k]
            # Convert to similarity score (inverse distance)
            scores = 1 / (1 + distances[top_indices])
        
        elif metric == "dot":
            # Dot product (higher is better)
            scores = np.dot(self.embeddings_matrix, query_embedding)
            top_indices = np.argsort(scores)[::-1][:top_k]
            scores = scores[top_indices]
        
        else:
            raise ValueError(f"Unknown metric: {metric}")
        
        # Return results
        results = [
            (self.documents[idx], float(score))
            for idx, score in zip(top_indices, scores)
        ]
        
        return results
    
    def get_document(self, doc_id: str) -> Optional[Document]:
        """Get document by ID."""
        for doc in self.documents:
            if doc.id == doc_id:
                return doc
        return None
    
    def delete_document(self, doc_id: str) -> bool:
        """Delete document by ID."""
        for i, doc in enumerate(self.documents):
            if doc.id == doc_id:
                del self.documents[i]
                self._rebuild_embeddings_matrix()
                return True
        return False
    
    def save(self, filepath: str):
        """Save database to file."""
        with open(filepath, 'wb') as f:
            pickle.dump(self, f)
        print(f"✓ Database saved to {filepath}")
    
    @staticmethod
    def load(filepath: str) -> 'SimpleVectorDB':
        """
        Load database from file.
        
        Only load pickle files from sources you trust: unpickling can
        execute arbitrary code.
        """
        with open(filepath, 'rb') as f:
            db = pickle.load(f)
        print(f"✓ Database loaded from {filepath}")
        return db
    
    def __len__(self):
        """Return number of documents."""
        return len(self.documents)
    
    def __repr__(self):
        """String representation."""
        return f"SimpleVectorDB(documents={len(self.documents)}, model={self.embedding_model})"

# Test the vector database
print("\n" + "="*80)
print("BUILDING VECTOR DATABASE")
print("="*80)

# Create database
db = SimpleVectorDB()

# Add sample documents
sample_docs = [
    "Python is a high-level programming language known for its simplicity.",
    "Machine learning is a subset of artificial intelligence.",
    "The Eiffel Tower is located in Paris, France.",
    "Neural networks are inspired by biological neurons.",
    "JavaScript is commonly used for web development.",
    "Deep learning uses multiple layers to learn representations.",
    "The Great Wall of China is one of the world's wonders.",
    "Natural language processing deals with text and speech.",
    "React is a popular JavaScript library for building UIs.",
    "Computer vision enables machines to interpret visual information."
]

print(f"\nAdding {len(sample_docs)} documents...")
db.add_documents(sample_docs)
print(f"✓ Database contains {len(db)} documents")

# Search examples
queries = [
    "programming languages",
    "artificial intelligence",
    "famous landmarks"
]

for query in queries:
    print(f"\n{'='*80}")
    print(f"Query: '{query}'")
    print(f"{'='*80}")
    
    results = db.search(query, top_k=3)
    
    for i, (doc, score) in enumerate(results, 1):
        print(f"\n{i}. Score: {score:.4f}")
        print(f"   Text: {doc.text}")
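
`add_documents` above makes one API request per document. The embeddings endpoint also accepts a list of inputs, so several texts can be embedded in a single request. A hedged sketch: `embed_batch` and `batch_size` are hypothetical names, and the client is passed in as an argument so the function can be tried with a stub. It assumes each item in `response.data` carries an `index` and an `embedding`, as in the current `openai` Python client.

```python
from typing import List

def embed_batch(client, texts: List[str], model: str = "text-embedding-3-small",
                batch_size: int = 100) -> List[List[float]]:
    """Embed texts in chunks of batch_size, preserving input order."""
    vectors: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        # Same newline cleanup as get_embedding
        chunk = [t.replace("\n", " ") for t in texts[start:start + batch_size]]
        response = client.embeddings.create(input=chunk, model=model)
        # Sort by index so results line up with the inputs
        ordered = sorted(response.data, key=lambda item: item.index)
        vectors.extend(item.embedding for item in ordered)
    return vectors
```

With the lab's client this would be called as `vectors = embed_batch(client, sample_docs)`. The default `batch_size` of 100 is an arbitrary illustrative value, not a documented limit.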

### Exercise 3.1: Build Your Own Vector Database

Enhance the vector database with additional features:

In [None]:
# TODO: Enhance the vector database

class EnhancedVectorDB(SimpleVectorDB):
    """
    TODO: Add these features:
    
    1. Metadata filtering
       - Search with filters (e.g., category="tech", date_range)
    
    2. Hybrid search
       - Combine keyword matching with semantic search
    
    3. Update documents
       - Update document text and regenerate embedding
    
    4. Batch operations
       - Add/delete multiple documents efficiently
    
    5. Statistics
       - Track search queries, popular documents, etc.
    """
    
    def __init__(self, embedding_model: str = "text-embedding-3-small"):
        super().__init__(embedding_model)
        # TODO: Add your enhancements
    
    def search_with_filter(
        self,
        query: str,
        filters: Dict[str, Any],
        top_k: int = 5
    ) -> List[Tuple[Document, float]]:
        """
        TODO: Search with metadata filters.
        
        Example:
            results = db.search_with_filter(
                "AI news",
                filters={"category": "technology", "year": 2024}
            )
        """
        pass
    
    def hybrid_search(
        self,
        query: str,
        top_k: int = 5,
        semantic_weight: float = 0.7
    ) -> List[Tuple[Document, float]]:
        """
        TODO: Combine keyword and semantic search.
        
        semantic_weight: Weight for semantic vs keyword (0-1)
        """
        pass
    
    def update_document(self, doc_id: str, new_text: str):
        """TODO: Update a document's text and embedding."""
        pass

# Test your enhancements
# enhanced_db = EnhancedVectorDB()
# enhanced_db.add_document("AI is transforming healthcare", metadata={"category": "tech", "year": 2024})
# results = enhanced_db.search_with_filter("medical technology", filters={"category": "tech"})

## Part 4: Semantic Search Engine

Build a practical semantic search engine.

In [None]:
class SemanticSearchEngine:
    """
    A semantic search engine with advanced features.
    """
    
    def __init__(self, embedding_model: str = "text-embedding-3-small"):
        """Initialize search engine."""
        self.db = SimpleVectorDB(embedding_model)
        self.search_history: List[Dict] = []
    
    def index_documents(
        self,
        documents: List[Dict[str, Any]],
        text_field: str = "text",
        id_field: str = "id"
    ):
        """
        Index documents for search.
        
        Args:
            documents: List of document dictionaries
            text_field: Field containing text to index
            id_field: Field containing document ID
        """
        print(f"Indexing {len(documents)} documents...")
        
        for doc in documents:
            text = doc.get(text_field, "")
            doc_id = doc.get(id_field)
            
            # Extract metadata (all fields except text)
            metadata = {k: v for k, v in doc.items() if k != text_field}
            
            self.db.add_document(text, doc_id=doc_id, metadata=metadata)
        
        print(f"✓ Indexed {len(self.db)} documents")
    
    def search(
        self,
        query: str,
        top_k: int = 5,
        return_scores: bool = True,
        min_score: Optional[float] = None
    ) -> List[Dict[str, Any]]:
        """
        Search documents.
        
        Args:
            query: Search query
            top_k: Number of results
            return_scores: Include similarity scores
            min_score: Minimum similarity score threshold
        
        Returns:
            List of search results
        """
        # Record search
        search_record = {
            "query": query,
            "timestamp": datetime.now().isoformat(),
            "top_k": top_k
        }
        
        # Perform search
        results = self.db.search(query, top_k=top_k)
        
        # Filter by minimum score if specified
        if min_score is not None:
            results = [(doc, score) for doc, score in results if score >= min_score]
        
        # Format results
        formatted_results = []
        for doc, score in results:
            result = {
                "id": doc.id,
                "text": doc.text,
                "metadata": doc.metadata
            }
            if return_scores:
                result["score"] = score
            
            formatted_results.append(result)
        
        search_record["num_results"] = len(formatted_results)
        self.search_history.append(search_record)
        
        return formatted_results
    
    def multi_query_search(
        self,
        queries: List[str],
        top_k: int = 5,
        aggregation: str = "max"
    ) -> List[Dict[str, Any]]:
        """
        Search using multiple queries and aggregate results.
        
        Args:
            queries: List of search queries
            top_k: Number of final results
            aggregation: How to combine scores ('max', 'mean', 'sum')
        
        Returns:
            Aggregated search results
        """
        # Collect results from all queries
        all_results = {}
        
        for query in queries:
            results = self.db.search(query, top_k=top_k * 2)  # Get more to aggregate
            
            for doc, score in results:
                if doc.id not in all_results:
                    all_results[doc.id] = {
                        "doc": doc,
                        "scores": []
                    }
                all_results[doc.id]["scores"].append(score)
        
        # Aggregate scores
        aggregated = []
        for doc_id, data in all_results.items():
            scores = data["scores"]
            
            if aggregation == "max":
                final_score = max(scores)
            elif aggregation == "mean":
                final_score = np.mean(scores)
            elif aggregation == "sum":
                final_score = sum(scores)
            else:
                raise ValueError(f"Unknown aggregation: {aggregation}")
            
            aggregated.append((data["doc"], final_score))
        
        # Sort and return top_k
        aggregated.sort(key=lambda x: x[1], reverse=True)
        aggregated = aggregated[:top_k]
        
        # Format results
        return [
            {
                "id": doc.id,
                "text": doc.text,
                "metadata": doc.metadata,
                "score": score
            }
            for doc, score in aggregated
        ]
    
    def get_similar_documents(
        self,
        doc_id: str,
        top_k: int = 5
    ) -> List[Dict[str, Any]]:
        """
        Find documents similar to a given document.
        
        Args:
            doc_id: Document ID
            top_k: Number of similar documents
        
        Returns:
            Similar documents
        """
        doc = self.db.get_document(doc_id)
        if doc is None:
            return []
        
        # Use document text as query
        results = self.search(doc.text, top_k=top_k + 1)
        
        # Remove the document itself
        results = [r for r in results if r["id"] != doc_id][:top_k]
        
        return results
    
    def get_search_analytics(self) -> Dict[str, Any]:
        """Get analytics about search usage."""
        if not self.search_history:
            return {"message": "No searches yet"}
        
        total_searches = len(self.search_history)
        avg_results = np.mean([s["num_results"] for s in self.search_history])
        
        # Most common queries
        from collections import Counter
        query_counts = Counter(s["query"] for s in self.search_history)
        top_queries = query_counts.most_common(5)
        
        return {
            "total_searches": total_searches,
            "avg_results_per_search": avg_results,
            "top_queries": top_queries,
            "total_documents": len(self.db)
        }

# Create a semantic search engine with sample data
print("\n" + "="*80)
print("SEMANTIC SEARCH ENGINE")
print("="*80)

engine = SemanticSearchEngine()

# Sample documents
tech_articles = [
    {
        "id": "art1",
        "text": "Artificial intelligence is revolutionizing healthcare with diagnostic tools and personalized treatment plans.",
        "category": "healthcare",
        "author": "Dr. Smith"
    },
    {
        "id": "art2",
        "text": "Python has become the dominant language for data science and machine learning applications.",
        "category": "programming",
        "author": "Jane Doe"
    },
    {
        "id": "art3",
        "text": "Cloud computing enables businesses to scale their infrastructure dynamically and cost-effectively.",
        "category": "cloud",
        "author": "John Tech"
    },
    {
        "id": "art4",
        "text": "Deep learning models are achieving remarkable results in natural language understanding tasks.",
        "category": "AI",
        "author": "Dr. Smith"
    },
    {
        "id": "art5",
        "text": "Cybersecurity threats are evolving rapidly, requiring advanced detection and prevention systems.",
        "category": "security",
        "author": "Alice Guard"
    },
    {
        "id": "art6",
        "text": "Quantum computing promises to solve problems that are intractable for classical computers.",
        "category": "quantum",
        "author": "Bob Quantum"
    },
    {
        "id": "art7",
        "text": "Medical imaging powered by AI helps radiologists detect diseases earlier and more accurately.",
        "category": "healthcare",
        "author": "Dr. Smith"
    },
    {
        "id": "art8",
        "text": "JavaScript frameworks like React and Vue make building interactive web applications easier.",
        "category": "programming",
        "author": "Jane Doe"
    }
]

# Index documents
engine.index_documents(tech_articles)

# Single query search
print("\n" + "="*80)
print("Single Query Search")
print("="*80)
query = "AI in medicine"
results = engine.search(query, top_k=3)
print(f"\nQuery: '{query}'")
for i, result in enumerate(results, 1):
    print(f"\n{i}. [{result['id']}] Score: {result['score']:.4f}")
    print(f"   {result['text']}")
    print(f"   Category: {result['metadata']['category']}, Author: {result['metadata']['author']}")

# Multi-query search
print("\n" + "="*80)
print("Multi-Query Search")
print("="*80)
queries = ["artificial intelligence", "machine learning", "neural networks"]
results = engine.multi_query_search(queries, top_k=3)
print(f"\nQueries: {queries}")
for i, result in enumerate(results, 1):
    print(f"\n{i}. [{result['id']}] Aggregated Score: {result['score']:.4f}")
    print(f"   {result['text'][:80]}...")

# Find similar documents
print("\n" + "="*80)
print("Similar Documents")
print("="*80)
doc_id = "art1"
similar = engine.get_similar_documents(doc_id, top_k=2)
print(f"\nDocuments similar to '{doc_id}':")
for i, result in enumerate(similar, 1):
    print(f"\n{i}. [{result['id']}] Score: {result['score']:.4f}")
    print(f"   {result['text'][:80]}...")

# Analytics
print("\n" + "="*80)
print("Search Analytics")
print("="*80)
analytics = engine.get_search_analytics()
for key, value in analytics.items():
    print(f"{key}: {value}")
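
The three aggregation modes in `multi_query_search` can rank the same documents differently. A document only collects a score from queries whose result list it appears in, so `mean` averages over fewer values for documents that match a single query, while `sum` favors documents that match many queries. A small sketch with made-up scores (the numbers are illustrative, not real search output):

```python
from statistics import mean

# Hypothetical per-query scores collected for each document
collected_scores = {
    "art4": [0.62, 0.58, 0.60],  # matched all three queries
    "art2": [0.70],              # matched one query only
    "art1": [0.55, 0.50],        # matched two queries
}

aggregators = {"max": max, "mean": mean, "sum": sum}

rankings = {}
for name, func in aggregators.items():
    ranked = sorted(collected_scores, key=lambda d: func(collected_scores[d]), reverse=True)
    rankings[name] = ranked
    print(f"{name:>4}: {ranked}")
```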

### Exercise 4.1: Build a Domain-Specific Search Engine

Create a search engine for a specific domain:

In [None]:
# TODO: Build a domain-specific search engine

# 1. TODO: Choose a domain (e.g., medical papers, legal documents, code snippets)

# 2. TODO: Create a dataset (15-20 documents)
# You can generate synthetic data or use real examples

domain_documents = [
    # TODO: Add domain-specific documents
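    {
        "id": "doc1",
        "text": "...",
        # Add metadata as top-level fields, e.g. "category": "...".
        # index_documents() stores every field except the text field as metadata.
    }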
]

# 3. TODO: Implement domain-specific features
class DomainSearchEngine(SemanticSearchEngine):
    """
    TODO: Add domain-specific features:
    - Custom preprocessing for domain
    - Domain-specific metadata extraction
    - Specialized ranking logic
    - Query expansion using domain knowledge
    """
    pass

# 4. TODO: Test with domain-specific queries
# domain_engine = DomainSearchEngine()
# domain_engine.index_documents(domain_documents)
# results = domain_engine.search("your domain query")

## Part 5: Visualization and Analysis

Visualize embeddings and analyze search quality.
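
The 2D plot in this part relies on PCA, which projects the vectors onto the directions of greatest variance. A minimal NumPy-only sketch of that projection on synthetic data (two made-up clusters in 5 dimensions) shows where the explained-variance percentages on the axis labels come from. It illustrates the idea and is not a replacement for scikit-learn's `PCA`.

```python
import numpy as np

rng = np.random.default_rng(42)
# Two synthetic clusters of 20 points each in 5 dimensions
cluster_a = rng.normal(loc=0.0, scale=0.1, size=(20, 5))
cluster_b = rng.normal(loc=1.0, scale=0.1, size=(20, 5))
points = np.vstack([cluster_a, cluster_b])

# PCA: center the data, then take the top right-singular vectors
centered = points - points.mean(axis=0)
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)
projected = centered @ components[:2].T  # shape (40, 2)

# Share of total variance captured by each component
explained = singular_values**2 / np.sum(singular_values**2)
print(f"Projected shape: {projected.shape}")
print(f"PC1 explains {explained[0]:.1%}, PC2 explains {explained[1]:.1%}")
```

Because the two clusters differ mainly along one direction, the first component captures most of the variance and separates them on its own.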

In [None]:
class EmbeddingVisualizer:
    """
    Visualize embeddings and similarity relationships.
    """
    
    @staticmethod
    def plot_embeddings_2d(
        embeddings: np.ndarray,
        labels: List[str],
        title: str = "Embedding Visualization"
    ):
        """
        Plot embeddings in 2D using PCA.
        
        Args:
            embeddings: Embedding matrix (n_samples, n_dimensions)
            labels: Labels for each embedding
            title: Plot title
        """
        # Reduce to 2D using PCA
        pca = PCA(n_components=2)
        embeddings_2d = pca.fit_transform(embeddings)
        
        # Create plot
        plt.figure(figsize=(12, 8))
        scatter = plt.scatter(
            embeddings_2d[:, 0],
            embeddings_2d[:, 1],
            c=range(len(labels)),
            cmap='viridis',
            s=100,
            alpha=0.6
        )
        
        # Add labels
        for i, label in enumerate(labels):
            plt.annotate(
                label[:30] + "..." if len(label) > 30 else label,
                (embeddings_2d[i, 0], embeddings_2d[i, 1]),
                fontsize=8,
                alpha=0.7
            )
        
        plt.title(title)
        plt.xlabel(f"PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)")
        plt.ylabel(f"PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)")
        plt.colorbar(scatter, label="Document Index")
        plt.tight_layout()
        plt.show()
    
    @staticmethod
    def plot_similarity_matrix(
        embeddings: np.ndarray,
        labels: List[str],
        title: str = "Similarity Matrix"
    ):
        """
        Plot similarity matrix heatmap.
        
        Args:
            embeddings: Embedding matrix
            labels: Labels for each embedding
            title: Plot title
        """
        # Compute similarity matrix
        similarities = cosine_similarity(embeddings)
        
        # Create plot
        plt.figure(figsize=(10, 8))
        sns.heatmap(
            similarities,
            xticklabels=labels,
            yticklabels=labels,
            annot=True,
            fmt=".2f",
            cmap="YlOrRd",
            square=True,
            cbar_kws={"label": "Cosine Similarity"}
        )
        plt.title(title)
        plt.tight_layout()
        plt.show()
    
    @staticmethod
    def plot_query_results(
        query: str,
        results: List[Tuple[str, float]],
        title: str = "Search Results"
    ):
        """
        Visualize search results as bar chart.
        
        Args:
            query: Search query
            results: List of (text, score) tuples
            title: Plot title
        """
        texts = [text[:40] + "..." if len(text) > 40 else text for text, _ in results]
        scores = [score for _, score in results]
        
        plt.figure(figsize=(12, 6))
        bars = plt.barh(range(len(texts)), scores, color='steelblue', alpha=0.7)
        plt.yticks(range(len(texts)), texts)
        plt.xlabel("Similarity Score")
        plt.title(f"{title}\nQuery: '{query}'")
        plt.xlim(0, max(scores) * 1.1)
        
        # Add score labels on bars
        for i, (bar, score) in enumerate(zip(bars, scores)):
            plt.text(score, i, f' {score:.3f}', va='center')
        
        plt.tight_layout()
        plt.show()

# Visualize our tech articles
print("\n" + "="*80)
print("VISUALIZATION")
print("="*80)

# Get embeddings and labels from search engine
embeddings_viz = np.vstack([doc.embedding for doc in engine.db.documents])
labels_viz = [doc.metadata.get('category', 'unknown') + ": " + doc.text[:40] 
              for doc in engine.db.documents]

# Plot embeddings in 2D
visualizer = EmbeddingVisualizer()
visualizer.plot_embeddings_2d(
    embeddings_viz,
    labels_viz,
    title="Tech Articles Embeddings (PCA)"
)

# Plot similarity matrix
visualizer.plot_similarity_matrix(
    embeddings_viz[:6],  # First 6 documents for readability
    [f"art{i+1}" for i in range(6)],
    title="Document Similarity Matrix"
)

# Plot search results
query = "AI and machine learning"
search_results = engine.search(query, top_k=5)
results_viz = [(r['text'], r['score']) for r in search_results]
visualizer.plot_query_results(query, results_viz, title="Search Results Visualization")
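
To analyze search quality numerically, the ranked result IDs can be compared against a hand-labeled set of relevant documents. A minimal sketch of precision@k and reciprocal rank; the relevance labels and the ranked list below are invented for illustration, and the function names are not part of the lab code.

```python
from typing import List, Set

def precision_at_k(ranked_ids: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    top = ranked_ids[:k]
    if not top:
        return 0.0
    return sum(1 for doc_id in top if doc_id in relevant) / len(top)

def reciprocal_rank(ranked_ids: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is found."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# Hypothetical labels for the query "AI in medicine"
relevant_ids = {"art1", "art7"}
ranked = ["art7", "art4", "art1"]  # e.g. [r["id"] for r in engine.search(...)]

p_at_3 = precision_at_k(ranked, relevant_ids, k=3)
rr = reciprocal_rank(ranked, relevant_ids)
print(f"precision@3 = {p_at_3:.2f}, reciprocal rank = {rr:.2f}")
```

This version divides by the number of results actually returned when fewer than `k` come back; dividing by `k` instead is an equally common convention.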

### Exercise 5.1: Analyze Embedding Clusters

Analyze how embeddings cluster by topic:

In [None]:
# TODO: Clustering analysis

# 1. TODO: Create a larger dataset with clear categories (50+ documents, 4-5 categories)

# 2. TODO: Use clustering algorithms (K-means, DBSCAN) to group similar documents
# from sklearn.cluster import KMeans, DBSCAN

# 3. TODO: Visualize clusters in 2D

# 4. TODO: Evaluate cluster quality
# - Are documents in the same cluster actually similar?
# - Do clusters align with your intended categories?

# 5. TODO: Try different embedding models and compare
# - text-embedding-3-small
# - text-embedding-3-large
# Do they produce different clusters?

# Your implementation:

## Part 6: Optimizing Embeddings

This part covers techniques for reducing the storage and retrieval cost of embeddings: dimensionality reduction, quantization, and caching.
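
One such technique is scalar quantization: map each float to an 8-bit integer and keep the offset and scale so the values can be approximately restored. A minimal round-trip sketch on random data; `quantize_uint8` and `dequantize_uint8` are hypothetical helper names, and the error bound assumes rounding to the nearest level.

```python
import numpy as np

def quantize_uint8(values: np.ndarray):
    """Map floats to uint8 levels 0..255; return (codes, min_val, scale)."""
    min_val = float(values.min())
    max_val = float(values.max())
    scale = (max_val - min_val) / 255 or 1.0  # constant input: avoid division by zero
    codes = np.round((values - min_val) / scale).astype(np.uint8)
    return codes, min_val, scale

def dequantize_uint8(codes: np.ndarray, min_val: float, scale: float) -> np.ndarray:
    """Approximately restore the original floats."""
    return codes.astype(np.float32) * scale + min_val

rng = np.random.default_rng(0)
original = rng.normal(size=(4, 16)).astype(np.float32)

codes, min_val, scale = quantize_uint8(original)
restored = dequantize_uint8(codes, min_val, scale)
max_error = float(np.abs(original - restored).max())

print(f"Storage: {original.nbytes} -> {codes.nbytes} bytes")
print(f"Max absolute error: {max_error:.5f} (half a step is {scale / 2:.5f})")
```

Keeping `min_val` and `scale` alongside the codes is what makes the round trip possible; without them the integers cannot be mapped back to floats.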

In [None]:
class EmbeddingOptimizer:
    """
    Optimize embedding storage and retrieval.
    """
    
    @staticmethod
    def reduce_dimensionality(
        embeddings: np.ndarray,
        target_dim: int = 256,
        method: str = "pca"
    ) -> Tuple[np.ndarray, Any]:
        """
        Reduce embedding dimensionality.
        
        Args:
            embeddings: Original embeddings
            target_dim: Target dimensionality
            method: Reduction method ('pca')
        
        Returns:
            (reduced_embeddings, transformer)
        """
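        if method == "pca":
            # PCA cannot produce more components than min(n_samples, n_features),
            # so clamp the target for small datasets instead of raising ValueError.
            n_components = min(target_dim, *embeddings.shape)
            pca = PCA(n_components=n_components)
            reduced = pca.fit_transform(embeddings)
            print(f"✓ Reduced from {embeddings.shape[1]} to {n_components} dimensions")
            print(f"  Explained variance: {pca.explained_variance_ratio_.sum():.1%}")
            return reduced, pca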
        else:
            raise ValueError(f"Unknown method: {method}")
    
    @staticmethod
    def quantize_embeddings(
        embeddings: np.ndarray,
        bits: int = 8
    ) -> np.ndarray:
        """
        Quantize embeddings to reduce storage.
        
        Args:
            embeddings: Original embeddings
            bits: Bits per value (only 8 is currently supported)
        
        Returns:
            Quantized embeddings
        """
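        if bits == 8:
            # Quantize linearly to uint8 codes (0-255). To dequantize later,
            # keep min_val and scale: value ≈ code * scale + min_val.
            min_val = embeddings.min()
            max_val = embeddings.max()
            scale = (max_val - min_val) / 255
            if scale == 0:
                scale = 1.0  # all values identical; avoid division by zero
            # Round to the nearest code instead of truncating toward zero
            quantized = np.round((embeddings - min_val) / scale).astype(np.uint8)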
            
            # Calculate compression ratio
            original_size = embeddings.nbytes
            quantized_size = quantized.nbytes
            ratio = original_size / quantized_size
            
            print(f"✓ Quantized to {bits} bits")
            print(f"  Original size: {original_size:,} bytes")
            print(f"  Quantized size: {quantized_size:,} bytes")
            print(f"  Compression ratio: {ratio:.1f}x")
            
            return quantized
        else:
            raise ValueError(f"Unsupported bits: {bits}")
    
    @staticmethod
    def cache_embeddings(
        texts: List[str],
        cache_file: str = "embedding_cache.pkl"
    ) -> Dict[str, List[float]]:
        """
        Cache embeddings to avoid regenerating.
        
        Args:
            texts: Texts to embed
            cache_file: Cache file path
        
        Returns:
            Dictionary of text -> embedding
        """
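        # Load existing cache.
        # Note: only unpickle files you created yourself; pickle.load can
        # execute arbitrary code from an untrusted file.
        cache = {}
        if os.path.exists(cache_file):
            with open(cache_file, 'rb') as f:
                cache = pickle.load(f)
            print(f"✓ Loaded cache with {len(cache)} embeddings")
        
        # Generate embeddings for new texts (deduplicated, order preserved,
        # so a repeated text triggers only one API call)
        new_texts = [t for t in dict.fromkeys(texts) if t not in cache]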
        if new_texts:
            print(f"Generating {len(new_texts)} new embeddings...")
            for text in new_texts:
                cache[text] = get_embedding(text)
        
        # Save cache
        with open(cache_file, 'wb') as f:
            pickle.dump(cache, f)
        print(f"✓ Cache saved with {len(cache)} embeddings")
        
        return cache
    
    @staticmethod
    def benchmark_search_speed(
        db: SimpleVectorDB,
        query: str,
        num_runs: int = 100
    ):
        """
        Benchmark search speed.
        
        Args:
            db: Vector database
            query: Search query
            num_runs: Number of search runs
        """
        import time
        
        times = []
        for _ in range(num_runs):
            # perf_counter is a monotonic, high-resolution clock meant for timing
            start = time.perf_counter()
            db.search(query, top_k=10)
            times.append(time.perf_counter() - start)
        
        avg_time = np.mean(times)
        std_time = np.std(times)
        
        print(f"\n✓ Search benchmark ({num_runs} runs):")
        print(f"  Average time: {avg_time*1000:.2f} ms")
        print(f"  Std deviation: {std_time*1000:.2f} ms")
        print(f"  Min time: {min(times)*1000:.2f} ms")
        print(f"  Max time: {max(times)*1000:.2f} ms")

# Test optimization techniques
print("\n" + "="*80)
print("EMBEDDING OPTIMIZATION")
print("="*80)

optimizer = EmbeddingOptimizer()

# Test dimensionality reduction
embeddings_full = np.vstack([doc.embedding for doc in engine.db.documents])
embeddings_reduced, pca_transformer = optimizer.reduce_dimensionality(embeddings_full, target_dim=128)

# Test quantization
embeddings_quantized = optimizer.quantize_embeddings(embeddings_full)

# Benchmark search speed
print("\n" + "="*80)
print("Search Speed Benchmark")
print("="*80)
optimizer.benchmark_search_speed(engine.db, "artificial intelligence", num_runs=50)
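
Quantization is only useful if the codes can be mapped back to approximate float values at search time. The sketch below is a standalone illustration of that round trip, not part of `EmbeddingOptimizer`: the helper names `quantize_uint8` and `dequantize_uint8` are hypothetical, and random unit vectors stand in for real embeddings.

```python
import numpy as np

def quantize_uint8(embeddings: np.ndarray):
    """Map floats linearly onto 0..255 and return what is needed to invert it."""
    min_val = float(embeddings.min())
    max_val = float(embeddings.max())
    scale = (max_val - min_val) / 255
    if scale == 0:
        scale = 1.0  # constant input; avoid division by zero
    codes = np.round((embeddings - min_val) / scale).astype(np.uint8)
    return codes, min_val, scale

def dequantize_uint8(codes: np.ndarray, min_val: float, scale: float) -> np.ndarray:
    """Approximate reconstruction of the original floats."""
    return codes.astype(np.float32) * scale + min_val

# Random unit vectors as a stand-in for real embeddings (assumption).
rng = np.random.default_rng(0)
original = rng.normal(size=(100, 64)).astype(np.float32)
original /= np.linalg.norm(original, axis=1, keepdims=True)

codes, min_val, scale = quantize_uint8(original)
restored = dequantize_uint8(codes, min_val, scale)

# Per-value error is bounded by half a quantization step.
max_error = float(np.abs(original - restored).max())

# Cosine similarity between each original vector and its reconstruction.
cos = np.sum(original * restored, axis=1) / (
    np.linalg.norm(original, axis=1) * np.linalg.norm(restored, axis=1)
)

print(f"Storage: {original.nbytes:,} -> {codes.nbytes:,} bytes")
print(f"Max absolute error: {max_error:.5f} (step = {scale:.5f})")
print(f"Min cosine(original, restored): {cos.min():.5f}")
```

Using a single global `min_val` and `scale` keeps the example short; a per-dimension or per-vector scale is a common refinement when value ranges differ widely.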

### Exercise 6.1: Optimize for Production

Optimize a vector database for production use:

In [None]:
# TODO: Production optimization

class ProductionVectorDB(SimpleVectorDB):
    """
    TODO: Implement production optimizations:
    
    1. Lazy loading
       - Don't load all embeddings into memory at once
       - Load on-demand for large databases
    
    2. Approximate nearest neighbor search
       - Use libraries like FAISS or Annoy for faster search
       - Trade accuracy for speed
    
    3. Sharding
       - Split database into shards for parallel search
       - Useful for very large databases
    
    4. Caching
       - Cache frequent queries
       - Cache recent search results
    
    5. Monitoring
       - Track search latency
       - Monitor memory usage
       - Alert on slow queries
    """
    
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # TODO: Add production features
    
    # TODO: Implement optimized methods

# Test production features
# prod_db = ProductionVectorDB()
# Test with large dataset and measure improvements
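
Items 4 (caching) and 5 (monitoring) of the TODO list could start from something like the sketch below. It is one possible design, not the expected solution: `CachedSearch` is a hypothetical wrapper around any function that takes `(query, top_k)`, and the demo uses a fake search function so it runs without an API key.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, List, Tuple

class CachedSearch:
    """Wrap a search function with an LRU result cache and latency tracking."""

    def __init__(self, search_fn: Callable[[str, int], List[Any]], max_entries: int = 128):
        self.search_fn = search_fn
        self.max_entries = max_entries
        self._cache: "OrderedDict[Tuple[str, int], List[Any]]" = OrderedDict()
        self.latencies_ms: List[float] = []
        self.hits = 0
        self.misses = 0

    def search(self, query: str, top_k: int = 10) -> List[Any]:
        key = (query, top_k)
        start = time.perf_counter()
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as most recently used
            self.hits += 1
            results = self._cache[key]
        else:
            self.misses += 1
            results = self.search_fn(query, top_k)
            self._cache[key] = results
            if len(self._cache) > self.max_entries:
                self._cache.popitem(last=False)  # evict least recently used
        self.latencies_ms.append((time.perf_counter() - start) * 1000)
        return results

    def clear(self) -> None:
        """Call this whenever documents are added, or cached results go stale."""
        self._cache.clear()

# Demo with a stand-in search function that records how often it is called.
calls: List[str] = []

def fake_search(query: str, top_k: int) -> List[str]:
    calls.append(query)
    return [f"{query}-result-{i}" for i in range(top_k)]

cached = CachedSearch(fake_search, max_entries=2)
for q in ["a", "a", "b", "c", "a"]:
    cached.search(q, top_k=2)

print(f"hits={cached.hits}, misses={cached.misses}, backend calls={calls}")
```

To wrap the lab's database instead of the fake function, something like `CachedSearch(lambda q, k: db.search(q, top_k=k))` should work, assuming `db.search` keeps the signature used in the benchmark above.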

## Challenge Projects

### Challenge 1: Multi-Modal Search

Build a search engine that combines semantic text search with structured filters (dates, numeric ranges, locations):

In [None]:
class MultiModalSearchEngine:
    """
    Search engine supporting multiple data types.
    
    TODO: Implement:
    1. Text + metadata search
    2. Date range filtering
    3. Numeric range filtering (e.g., price, rating)
    4. Geospatial search (location-based)
    5. Combined scoring (semantic + metadata + filters)
    6. Faceted search (category counts, etc.)
    """
    
    def __init__(self):
        """Initialize multi-modal search."""
        pass
    
    def search_with_filters(
        self,
        query: str,
        filters: Dict[str, Any],
        top_k: int = 10
    ) -> List[Dict]:
        """
        TODO: Search with multiple filter types.
        
        Example:
            results = engine.search_with_filters(
                "best restaurants",
                filters={
                    "price_range": (10, 50),
                    "rating_min": 4.0,
                    "distance_km": 5,
                    "categories": ["italian", "pizza"]
                }
            )
        """
        pass

# Usage example:
# multi_engine = MultiModalSearchEngine()
# multi_engine.index_documents(restaurant_data)
# results = multi_engine.search_with_filters("authentic pizza", filters={...})
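
As a possible starting point for the filtering part, the sketch below checks one document's metadata against the filter keys from the docstring example. The metadata field names (`price`, `rating`, `location`, `categories`), the `origin` argument, and the "match any category" rule are assumptions made for illustration; the challenge leaves these design choices open.

```python
import math
from typing import Any, Dict, Optional, Tuple

def haversine_km(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Great-circle distance in km between two (latitude, longitude) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def passes_filters(
    meta: Dict[str, Any],
    filters: Dict[str, Any],
    origin: Optional[Tuple[float, float]] = None,
) -> bool:
    """Return True if the metadata satisfies every filter that is present."""
    if "price_range" in filters:
        low, high = filters["price_range"]
        price = meta.get("price")
        if price is None or not (low <= price <= high):
            return False
    if "rating_min" in filters:
        rating = meta.get("rating")
        if rating is None or rating < filters["rating_min"]:
            return False
    if "distance_km" in filters and origin is not None:
        location = meta.get("location")
        if location is None or haversine_km(origin, location) > filters["distance_km"]:
            return False
    if "categories" in filters:
        # Assumption: a document matches if it shares at least one category.
        if not set(filters["categories"]) & set(meta.get("categories", [])):
            return False
    return True

# Hypothetical restaurant record and a user position about 1.5 km away.
restaurant = {
    "price": 25,
    "rating": 4.5,
    "location": (41.9028, 12.4964),
    "categories": ["italian"],
}
filters = {
    "price_range": (10, 50),
    "rating_min": 4.0,
    "distance_km": 5,
    "categories": ["italian", "pizza"],
}
origin = (41.8902, 12.4922)

print(passes_filters(restaurant, filters, origin))
```

Filtering before the similarity search (pre-filtering) shrinks the candidate set, while filtering afterwards (post-filtering) is simpler but may return fewer than `top_k` results.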

### Challenge 2: Real-Time Search System

Build a search system that updates in real-time:

In [None]:
from typing import Callable


class RealTimeSearchSystem:
    """
    Real-time search with incremental updates.
    
    TODO: Implement:
    1. Stream new documents as they arrive
    2. Update index without rebuilding
    3. Handle concurrent reads and writes
    4. Implement write-ahead log for durability
    5. Support rollback on errors
    6. Real-time search analytics dashboard
    """
    
    def __init__(self):
        """Initialize real-time search."""
        pass
    
    def stream_document(self, document: Dict):
        """TODO: Add document to index in real-time."""
        pass
    
    def subscribe_to_updates(self, callback: Callable[[Dict], None]):
        """TODO: Subscribe to index updates."""
        pass

# Usage:
# rt_system = RealTimeSearchSystem()
# rt_system.subscribe_to_updates(lambda doc: print(f"New doc: {doc}"))
# rt_system.stream_document({"text": "Breaking news..."})
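
Items 3 and 4 (concurrent access and a write-ahead log) can be prototyped with the standard library alone. The sketch below is a minimal illustration under several assumptions: `StreamingIndexSketch` is a hypothetical class, documents are stored as plain dicts without embeddings, and the log is a JSON-lines file. A real index would also embed each document before making it searchable.

```python
import json
import os
import tempfile
import threading
from typing import Callable, Dict, List

class StreamingIndexSketch:
    """In-memory document list guarded by a lock, backed by an append-only log."""

    def __init__(self, wal_path: str):
        self.wal_path = wal_path
        self._lock = threading.Lock()
        self._documents: List[Dict] = []
        self._subscribers: List[Callable[[Dict], None]] = []

    def subscribe_to_updates(self, callback: Callable[[Dict], None]) -> None:
        with self._lock:
            self._subscribers.append(callback)

    def stream_document(self, document: Dict) -> None:
        with self._lock:
            # 1. Durability first: append to the log and force it to disk.
            with open(self.wal_path, "a", encoding="utf-8") as wal:
                wal.write(json.dumps(document) + "\n")
                wal.flush()
                os.fsync(wal.fileno())
            # 2. Only then update the in-memory state.
            self._documents.append(document)
            subscribers = list(self._subscribers)
        # 3. Notify outside the lock so a slow callback cannot block writers.
        for callback in subscribers:
            callback(document)

    def snapshot(self) -> List[Dict]:
        """Consistent copy of the current documents for readers."""
        with self._lock:
            return list(self._documents)

    def recover(self) -> int:
        """Rebuild in-memory state from the log, e.g. after a restart."""
        with self._lock:
            self._documents = []
            if os.path.exists(self.wal_path):
                with open(self.wal_path, encoding="utf-8") as wal:
                    self._documents = [json.loads(line) for line in wal if line.strip()]
            return len(self._documents)

# Demo: four writer threads, then a "restart" that replays the log.
wal_path = os.path.join(tempfile.mkdtemp(), "index.wal")
index = StreamingIndexSketch(wal_path)
received: List[Dict] = []
index.subscribe_to_updates(received.append)

def writer(worker_id: int) -> None:
    for i in range(25):
        index.stream_document({"text": f"doc {worker_id}-{i}"})

threads = [threading.Thread(target=writer, args=(w,)) for w in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

recovered_count = StreamingIndexSketch(wal_path).recover()
print(f"indexed={len(index.snapshot())}, notified={len(received)}, recovered={recovered_count}")
```

A single lock serializes all writes, which is simple but limits throughput; batching log writes or using a reader-writer lock are possible refinements.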

### Challenge 3: Distributed Search System

Build a distributed search system for scale:

In [None]:
class DistributedSearchSystem:
    """
    Distributed search across multiple nodes.
    
    TODO: Implement:
    1. Shard data across multiple nodes
    2. Parallel search across shards
    3. Result aggregation from multiple nodes
    4. Load balancing
    5. Fault tolerance (handle node failures)
    6. Replication for redundancy
    7. Consistent hashing for shard assignment
    """
    
    def __init__(self, num_shards: int = 4):
        """Initialize distributed search."""
        self.num_shards = num_shards
        self.shards = []
    
    def add_shard(self, shard_db: SimpleVectorDB):
        """TODO: Add a shard to the system."""
        pass
    
    def search_distributed(
        self,
        query: str,
        top_k: int = 10
    ) -> List[Dict]:
        """
        TODO: Search across all shards in parallel.
        
        Steps:
        1. Send query to all shards
        2. Collect results from each shard
        3. Merge and re-rank
        4. Return top_k global results
        """
        pass

# Usage:
# dist_system = DistributedSearchSystem(num_shards=4)
# for i in range(4):
#     shard = SimpleVectorDB()
#     # Add documents to shard...
#     dist_system.add_shard(shard)
# results = dist_system.search_distributed("query")
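
The four steps in the `search_distributed` docstring can be sketched as a scatter-gather with a thread pool. This is an illustrative outline, not the reference solution: each shard is modelled as a plain function, and results are assumed to be dicts with a `score` key where higher means more similar. The fake shards below stand in for `SimpleVectorDB` instances.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Assumption: a shard is anything callable as shard(query, top_k) -> list of hits.
ShardSearch = Callable[[str, int], List[Dict]]

def search_all_shards(shards: List[ShardSearch], query: str, top_k: int = 10) -> List[Dict]:
    """Query every shard in parallel and return the global top_k by score."""
    # Steps 1-2: send the query to all shards and collect their results.
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        per_shard = list(pool.map(lambda shard: shard(query, top_k), shards))
    # Step 3: merge the per-shard lists into one candidate pool.
    merged = [hit for hits in per_shard for hit in hits]
    # Step 4: keep only the best top_k overall, highest score first.
    return heapq.nlargest(top_k, merged, key=lambda hit: hit["score"])

def make_fake_shard(name: str, scores: List[float]) -> ShardSearch:
    """Build a stand-in shard that returns fixed scores, best first."""
    def search(query: str, top_k: int) -> List[Dict]:
        hits = [{"text": f"{name}: {query}", "score": s} for s in sorted(scores, reverse=True)]
        return hits[:top_k]
    return search

shards = [
    make_fake_shard("shard-a", [0.9, 0.4]),
    make_fake_shard("shard-b", [0.8, 0.7]),
    make_fake_shard("shard-c", [0.95, 0.1]),
]
results = search_all_shards(shards, "query", top_k=3)
print([(r["text"], r["score"]) for r in results])
```

Each shard must return its own `top_k` results for the merged ranking to be exact, because the global best hits could all live on a single shard.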

## Summary

In this lab, you've learned:

1. ✅ Text embeddings and vector representations
2. ✅ Similarity metrics (cosine, euclidean, dot product)
3. ✅ Building a vector database from scratch
4. ✅ Implementing semantic search
5. ✅ Multi-query and aggregation strategies
6. ✅ Visualizing embeddings and similarities
7. ✅ Optimizing embeddings for production
8. ✅ Advanced search features

### Key Takeaways

**Embeddings:**
- Dense vector representations of text
- Capture semantic meaning
- Similar texts have similar embeddings
- Different models produce vectors of different dimensionality (e.g., 1536 for text-embedding-3-small)

**Vector Databases:**
- Store and search embeddings efficiently
- Support similarity-based retrieval
- Enable semantic search
- Foundation for RAG systems

**Similarity Metrics:**
- **Cosine similarity**: Angle between vectors (most common)
- **Euclidean distance**: Straight-line distance
- **Dot product**: Considers magnitude and direction
- Choice depends on use case
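
The relationship between the three metrics can be checked numerically. The sketch below uses random vectors as stand-ins for embeddings (an assumption for the demo) and verifies two standard identities: on unit-length vectors the dot product equals cosine similarity, and squared Euclidean distance equals `2 - 2 * cosine`, so all three give the same ranking.

```python
import numpy as np

rng = np.random.default_rng(7)
docs = rng.normal(size=(5, 8))   # 5 stand-in document vectors
query = rng.normal(size=8)       # 1 stand-in query vector

# Cosine similarity on the raw (unnormalized) vectors.
cosine = (docs @ query) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))

# Normalize everything to unit length.
docs_unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)
query_unit = query / np.linalg.norm(query)

# On unit vectors, a plain dot product is the cosine similarity.
dot_unit = docs_unit @ query_unit

# Squared Euclidean distance on unit vectors is 2 - 2 * cosine.
sq_euclid = np.sum((docs_unit - query_unit) ** 2, axis=1)

# Ranking by highest cosine and by smallest distance gives the same order.
rank_by_cosine = np.argsort(-cosine)
rank_by_distance = np.argsort(sq_euclid)

print("dot == cosine:        ", np.allclose(dot_unit, cosine))
print("dist^2 == 2 - 2*cos:  ", np.allclose(sq_euclid, 2 - 2 * cosine))
print("same ranking:         ", np.array_equal(rank_by_cosine, rank_by_distance))
```

On unnormalized vectors the dot product also rewards longer vectors, which is where the metrics start to disagree.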

**Optimization Techniques:**
- Dimensionality reduction (PCA)
- Quantization for storage
- Caching for performance
- Approximate search for scale

### Best Practices

1. **Choose the right metric**: Cosine similarity for most text use cases
2. **Normalize embeddings**: Scale vectors to unit length before using dot product, so that it equals cosine similarity
3. **Cache embeddings**: Don't regenerate unnecessarily
4. **Monitor quality**: Evaluate search relevance regularly
5. **Consider scale**: Use approximate methods for large datasets
6. **Metadata matters**: Combine semantic + metadata filtering
7. **Test with real queries**: User queries differ from synthetic ones
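
For practice 4 (monitoring quality), two common relevance metrics are precision@k and mean reciprocal rank (MRR). The sketch below shows one way to compute them; the document IDs and relevance judgments are made up for the example, and in practice the retrieved lists would come from your search engine and the judgments from labeled queries.

```python
from typing import Dict, List, Set

def precision_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k

def reciprocal_rank(retrieved: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# Hypothetical evaluation set: ranked results and relevance judgments per query.
retrieved_by_query: Dict[str, List[str]] = {
    "q1": ["d3", "d1", "d7"],
    "q2": ["d5", "d9", "d2"],
}
relevant_by_query: Dict[str, Set[str]] = {
    "q1": {"d1", "d2"},
    "q2": {"d5"},
}

p_at_3 = {
    q: precision_at_k(retrieved_by_query[q], relevant_by_query[q], k=3)
    for q in retrieved_by_query
}
mrr = sum(
    reciprocal_rank(retrieved_by_query[q], relevant_by_query[q])
    for q in retrieved_by_query
) / len(retrieved_by_query)

print(f"precision@3 per query: {p_at_3}")
print(f"MRR: {mrr:.2f}")
```

Tracking these numbers over time on a fixed query set makes it easier to notice when a change to the model, chunking, or index degrades relevance.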

### Next Steps

- Complete the challenge projects
- Experiment with different embedding models
- Build domain-specific search engines
- Move on to Lab 2: Basic RAG Implementation
- Explore vector search libraries and databases (FAISS, Pinecone, Weaviate)
