# Embeddings and Vector Store: The Heart of RAG Systems

## Introduction to Embeddings and Vector Stores

In this notebook, we'll learn how to convert our text chunks into embeddings (numerical vectors) and build a vector store for efficient similarity search. This is the core of how RAG systems find relevant information.

### Why Embeddings Matter

Embeddings are the bridge between human language and machine understanding. They transform text into numerical vectors that capture semantic meaning, allowing computers to:

- **Understand Similarity**: Find documents that are conceptually related
- **Enable Search**: Perform fast similarity searches across large datasets
- **Preserve Context**: Maintain semantic relationships between concepts
- **Scale Efficiently**: Handle millions of documents with fast retrieval

### The Vector Store Advantage

Vector stores are specialized databases designed for high-dimensional vector operations. They provide:

- **Fast Similarity Search**: Find relevant documents in milliseconds
- **Scalability**: Handle millions of vectors efficiently
- **Flexibility**: Support various similarity metrics and search strategies
- **Integration**: Easy integration with RAG systems

## Learning Objectives

By the end of this notebook, you will:
1. Understand different embedding models and their trade-offs
2. Generate embeddings for your text chunks using state-of-the-art models
3. Build and query vector databases using FAISS and ChromaDB
4. Compare different embedding approaches and their performance
5. Learn about vector store optimization and best practices
6. Master the art of semantic search for RAG systems


In [1]:
# Standard library imports
import json
import numpy as np
import pandas as pd
from pathlib import Path
from typing import List, Dict, Any, Optional
import time
import matplotlib.pyplot as plt
import seaborn as sns
import os
import sys

# Embedding and vector store imports
from sentence_transformers import SentenceTransformer
import chromadb
from chromadb.config import Settings
import faiss
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize

# Set up plotting
plt.style.use('default')
sns.set_palette("husl")

# Add project root to path - multiple approaches for reliability
current_dir = os.getcwd()
project_root = os.path.dirname(current_dir) if current_dir.endswith('notebooks') else current_dir

# Add both current directory and project root to path
sys.path.insert(0, project_root)
sys.path.insert(0, current_dir)
sys.path.insert(0, '.')

print(f"Current directory: {current_dir}")
print(f"Project root: {project_root}")
print(f"Python path: {sys.path[:3]}")

# Import our configuration with error handling
try:
    from src.config import DATA_CONFIG, DATA_DIR
    print("Successfully imported from src module")
except ImportError as e:
    print(f"Import error: {e}")
    print("Trying alternative import methods...")
    
    # Try importing directly from the file
    try:
        import importlib.util
        
        # Import config
        config_path = os.path.join(project_root, 'src', 'config.py')
        spec = importlib.util.spec_from_file_location("config", config_path)
        config_module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(config_module)
        DATA_CONFIG = config_module.DATA_CONFIG
        DATA_DIR = config_module.DATA_DIR
        
        print("Successfully imported using direct file imports")
        
    except Exception as e2:
        print(f"Direct import also failed: {e2}")
        print("Please check that you're running this from the correct directory")
        raise e2

print("Libraries imported successfully!")
print(f"Data directory: {DATA_DIR}")

# Check if we have processed data
processed_dir = DATA_DIR / "processed"
chunks_file = processed_dir / "all_chunks.json"

if chunks_file.exists():
    print(f"Found processed chunks: {chunks_file}")
    with open(chunks_file, 'r', encoding='utf-8') as f:
        all_chunks = json.load(f)
    print(f"Loaded {len(all_chunks)} chunks")
    
    # Display sample chunk structure
    if all_chunks:
        print(f"\nSample chunk structure:")
        sample_chunk = all_chunks[0]
        print(f"  Keys: {list(sample_chunk.keys())}")
        print(f"  Text preview: {sample_chunk.get('text', '')[:100]}...")
        print(f"  Source: {sample_chunk.get('source', 'unknown')}")
else:
    print("No processed chunks found. Please run the data collection notebook first.")
    all_chunks = []


Current directory: /Users/scienceman/Desktop/LLM/notebooks
Project root: /Users/scienceman/Desktop/LLM
Python path: ['.', '/Users/scienceman/Desktop/LLM/notebooks', '/Users/scienceman/Desktop/LLM']
Successfully imported from src module
Libraries imported successfully!
Data directory: /Users/scienceman/Desktop/LLM/data
Found processed chunks: /Users/scienceman/Desktop/LLM/data/processed/all_chunks.json
Loaded 18 chunks

Sample chunk structure:
  Keys: ['text', 'type', 'chunk_id', 'source_doc_id', 'source_title', 'source', 'chunk_index', 'word_count', 'char_count', 'metadata']
  Text preview: Machine learning (ML) is a field of study in artificial intelligence concerned with the development ...
  Source: wikipedia


## Understanding Embedding Models: The Foundation of Semantic Search

### What Are Embedding Models?

Embedding models are neural networks that convert text into dense numerical vectors. These vectors capture the semantic meaning of text, allowing us to:

- **Measure Similarity**: Calculate how similar two pieces of text are
- **Enable Search**: Find relevant documents based on meaning, not just keywords
- **Preserve Relationships**: Maintain conceptual relationships between different texts
- **Enable Machine Understanding**: Allow computers to work with human language

### How Embedding Models Work

1. **Text Preprocessing**: Convert text into tokens (words, subwords, or characters)
2. **Neural Processing**: Pass tokens through a neural network (usually a transformer)
3. **Vector Generation**: Extract dense vectors that represent the text's meaning
4. **Normalization**: Often normalize vectors for better similarity calculations

### Key Characteristics of Good Embedding Models

- **Semantic Understanding**: Captures meaning, not just word overlap
- **Context Awareness**: Understands how words change meaning in different contexts
- **Multilingual Support**: Works across different languages
- **Efficiency**: Fast enough for real-time applications
- **Quality**: Produces meaningful similarity scores

### Model Selection Criteria

When choosing an embedding model, consider:

- **Task Type**: General purpose vs. domain-specific
- **Performance**: Speed vs. quality trade-offs
- **Size**: Model size vs. computational requirements
- **Language**: English-only vs. multilingual
- **Domain**: General knowledge vs. specialized fields

Before we generate embeddings, let's understand the different models available and their trade-offs.


In [10]:
# Let's compare different embedding models
embedding_models = {
    "all-MiniLM-L6-v2": {
        "description": "Small, fast model (384 dimensions)",
        "use_case": "Good for learning and experimentation",
        "size": "Small",
        "pros": ["Fast inference", "Low memory usage", "Good for prototyping"],
        "cons": ["Lower quality than larger models", "Limited context understanding"]
    },
    "all-mpnet-base-v2": {
        "description": "Medium model (768 dimensions)",
        "use_case": "Good balance of speed and quality",
        "size": "Medium",
        "pros": ["Better quality than MiniLM", "Reasonable speed", "Good general performance"],
        "cons": ["Slower than MiniLM", "Higher memory usage"]
    },
    "BAAI/bge-base-en-v1.5": {
        "description": "High-quality model (768 dimensions)",
        "use_case": "Production use, better quality",
        "size": "Large",
        "pros": ["Excellent quality", "State-of-the-art performance", "Good for production"],
        "cons": ["Slower inference", "Higher memory requirements", "Larger download size"]
    }
}

print("Available Embedding Models:")
print("=" * 60)
for model_name, info in embedding_models.items():
    print(f"Model: {model_name}")
    print(f"  Description: {info['description']}")
    print(f"  Use Case: {info['use_case']}")
    print(f"  Size: {info['size']}")
    print(f"  Pros: {', '.join(info['pros'])}")
    print(f"  Cons: {', '.join(info['cons'])}")
    print()

# For learning purposes, let's start with the small, fast model
print("Loading embedding model...")
print("We'll use all-MiniLM-L6-v2 for this tutorial - it's fast and perfect for learning!")
print()

model = SentenceTransformer('all-MiniLM-L6-v2')
print(f"Model loaded successfully!")
print(f"Model name: {model}")
print(f"Embedding dimension: {model.get_sentence_embedding_dimension()}")
print(f"Model max sequence length: {model.max_seq_length}")

# Test the model with a simple example
test_texts = [
    "Cats are small, furry animals that make great pets.",
    "Dogs are loyal companions that love to play.",
    "Machine learning is a subset of artificial intelligence."
]

print(f"\nTesting embedding generation with {len(test_texts)} sample texts...")
print("Sample texts:")
for i, text in enumerate(test_texts):
    print(f"  {i+1}. {text}")

print(f"\nGenerating embeddings...")
embeddings = model.encode(test_texts, show_progress_bar=True)
print(f"Generated embeddings shape: {embeddings.shape}")
print(f"Each text is now represented by {embeddings.shape[1]} numbers")
print(f"Data type: {embeddings.dtype}")

# Show similarity between the texts
print(f"\nCalculating similarity matrix...")
similarities = cosine_similarity(embeddings)
print(f"Similarity matrix shape: {similarities.shape}")

print(f"\nSimilarity scores (higher = more similar, range: 0-1):")
print("Texts:")
for i, text in enumerate(test_texts):
    print(f"  {i+1}: {text[:50]}...")

print(f"\nPairwise similarities:")
for i in range(len(test_texts)):
    for j in range(i+1, len(test_texts)):
        sim = similarities[i][j]
        print(f"  Text {i+1} vs Text {j+1}: {sim:.3f}")

print(f"\nInterpretation:")
print("- Values close to 1.0 indicate very similar texts")
print("- Values close to 0.0 indicate very different texts")
print("- Values around 0.5-0.7 indicate moderate similarity")


Available Embedding Models:
Model: all-MiniLM-L6-v2
  Description: Small, fast model (384 dimensions)
  Use Case: Good for learning and experimentation
  Size: Small
  Pros: Fast inference, Low memory usage, Good for prototyping
  Cons: Lower quality than larger models, Limited context understanding

Model: all-mpnet-base-v2
  Description: Medium model (768 dimensions)
  Use Case: Good balance of speed and quality
  Size: Medium
  Pros: Better quality than MiniLM, Reasonable speed, Good general performance
  Cons: Slower than MiniLM, Higher memory usage

Model: BAAI/bge-base-en-v1.5
  Description: High-quality model (768 dimensions)
  Use Case: Production use, better quality
  Size: Large
  Pros: Excellent quality, State-of-the-art performance, Good for production
  Cons: Slower inference, Higher memory requirements, Larger download size

Loading embedding model...
We'll use all-MiniLM-L6-v2 for this tutorial - it's fast and perfect for learning!

Model loaded successfully!
Model name:

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Generated embeddings shape: (3, 384)
Each text is now represented by 384 numbers
Data type: float32

Calculating similarity matrix...
Similarity matrix shape: (3, 3)

Similarity scores (higher = more similar, range: 0-1):
Texts:
  1: Cats are small, furry animals that make great pets...
  2: Dogs are loyal companions that love to play....
  3: Machine learning is a subset of artificial intelli...

Pairwise similarities:
  Text 1 vs Text 2: 0.345
  Text 1 vs Text 3: 0.076
  Text 2 vs Text 3: 0.112

Interpretation:
- Values close to 1.0 indicate very similar texts
- Values close to 0.0 indicate very different texts
- Values around 0.5-0.7 indicate moderate similarity


## Generating Embeddings for Our Data: From Text to Vectors

### The Embedding Generation Process

Now that we understand how embedding models work, let's generate embeddings for our processed chunks. This is where we transform our text data into numerical vectors that can be used for similarity search.

### Why Batch Processing Matters

When working with large datasets, we need to process embeddings in batches because:

- **Memory Efficiency**: Prevents running out of RAM with large datasets
- **Progress Tracking**: Allows us to monitor progress for long-running operations
- **Error Handling**: Makes it easier to handle and recover from errors
- **Resource Management**: Better control over computational resources

### The Embedding Pipeline

1. **Text Extraction**: Extract text content from our processed chunks
2. **Batch Processing**: Process texts in manageable batches
3. **Vector Generation**: Convert each text to a dense vector
4. **Storage**: Save embeddings for later use
5. **Integration**: Add embeddings back to our chunk data

### Understanding Embedding Quality

The quality of our embeddings directly impacts RAG system performance:

- **Semantic Accuracy**: How well embeddings capture meaning
- **Consistency**: Similar texts should have similar embeddings
- **Discrimination**: Different texts should have different embeddings
- **Robustness**: Embeddings should work across different domains

Let's generate embeddings for our processed chunks using batch processing for efficiency.


In [11]:
# Generate embeddings for all our chunks
if all_chunks:
    print(f"Generating embeddings for {len(all_chunks)} chunks...")
    print(f"Using model: {model}")
    print(f"Embedding dimension: {model.get_sentence_embedding_dimension()}")
    print()
    
    # Extract text from chunks
    chunk_texts = [chunk['text'] for chunk in all_chunks]
    print(f"Extracted {len(chunk_texts)} text chunks")
    print(f"Sample text: {chunk_texts[0][:100]}...")
    print()
    
    # Generate embeddings in batches for efficiency
    batch_size = 32
    all_embeddings = []
    
    print(f"Processing in batches of {batch_size}...")
    print(f"Total batches: {(len(chunk_texts) + batch_size - 1) // batch_size}")
    print()
    
    start_time = time.time()
    
    for i in range(0, len(chunk_texts), batch_size):
        batch_texts = chunk_texts[i:i+batch_size]
        batch_embeddings = model.encode(batch_texts, show_progress_bar=False)
        all_embeddings.append(batch_embeddings)
        
        # Print progress every batch for small datasets, every 5 batches for larger ones
        progress_interval = 1 if len(chunk_texts) <= 100 else 5
        if (i // batch_size + 1) % progress_interval == 0:
            processed = min(i + batch_size, len(chunk_texts))
            print(f"Processed {processed}/{len(chunk_texts)} chunks ({processed/len(chunk_texts)*100:.1f}%)")
    
    # Combine all embeddings
    all_embeddings = np.vstack(all_embeddings)
    
    end_time = time.time()
    processing_time = end_time - start_time
    
    print(f"\nEmbedding generation completed!")
    print(f"Total time: {processing_time:.2f} seconds")
    print(f"Embeddings shape: {all_embeddings.shape}")
    print(f"Average time per chunk: {processing_time/len(all_chunks)*1000:.2f} ms")
    print(f"Memory usage: {all_embeddings.nbytes / 1024 / 1024:.2f} MB")
    
    # Add embeddings to our chunks
    print(f"\nAdding embeddings to chunk data...")
    for i, chunk in enumerate(all_chunks):
        chunk['embedding'] = all_embeddings[i].tolist()
    
    print(f"Embeddings added to {len(all_chunks)} chunks")
    
    # Save the chunks with embeddings
    embeddings_file = processed_dir / "chunks_with_embeddings.json"
    print(f"\nSaving chunks with embeddings...")
    with open(embeddings_file, 'w', encoding='utf-8') as f:
        json.dump(all_chunks, f, indent=2, ensure_ascii=False)
    
    print(f"Chunks with embeddings saved to: {embeddings_file}")
    print(f"File size: {embeddings_file.stat().st_size / 1024 / 1024:.2f} MB")
    
    # Show some statistics about our embeddings
    print(f"\nEmbedding Statistics:")
    print(f"  Mean embedding value: {np.mean(all_embeddings):.4f}")
    print(f"  Std embedding value: {np.std(all_embeddings):.4f}")
    print(f"  Min embedding value: {np.min(all_embeddings):.4f}")
    print(f"  Max embedding value: {np.max(all_embeddings):.4f}")
    
    # Test similarity between first few chunks
    print(f"\nTesting similarity between first 3 chunks:")
    test_embeddings = all_embeddings[:3]
    similarities = cosine_similarity(test_embeddings)
    for i in range(3):
        for j in range(i+1, 3):
            sim = similarities[i][j]
            print(f"  Chunk {i+1} vs Chunk {j+1}: {sim:.3f}")
    
else:
    print("No chunks available. Please run the data collection notebook first.")
    all_chunks = []
    all_embeddings = None


Generating embeddings for 18 chunks...
Using model: SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False, 'architecture': 'BertModel'})
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
Embedding dimension: 384

Extracted 18 text chunks
Sample text: Machine learning (ML) is a field of study in artificial intelligence concerned with the development ...

Processing in batches of 32...
Total batches: 1

Processed 18/18 chunks (100.0%)

Embedding generation completed!
Total time: 0.27 seconds
Embeddings shape: (18, 384)
Average time per chunk: 14.76 ms
Memory usage: 0.03 MB

Adding embeddings to chunk data...
Embeddings added to 18 chunks

Saving chunks with embeddings...
Chunks with embeddings 

## Building a Vector Store with FAISS: Fast Similarity Search

### What is FAISS?

FAISS (Facebook AI Similarity Search) is a library for efficient similarity search and clustering of dense vectors. It's designed to handle large-scale vector operations with high performance.

### Why FAISS for RAG Systems?

FAISS is particularly well-suited for RAG systems because:

- **Speed**: Optimized C++ implementation with Python bindings
- **Scalability**: Handles millions of vectors efficiently
- **Memory Efficiency**: Various indexing strategies for different memory constraints
- **Flexibility**: Multiple similarity metrics and search strategies
- **Production Ready**: Used by Facebook and many other companies

### FAISS Index Types

FAISS offers several index types, each with different trade-offs:

1. **IndexFlatIP**: Exact search using inner product (cosine similarity for normalized vectors)
2. **IndexFlatL2**: Exact search using L2 distance
3. **IndexIVFFlat**: Inverted file index for approximate search
4. **IndexHNSW**: Hierarchical Navigable Small World graphs for fast approximate search

### Understanding Similarity Metrics

- **Cosine Similarity**: Measures the angle between vectors (0-1, higher is more similar)
- **Inner Product**: Dot product of vectors (can be negative)
- **L2 Distance**: Euclidean distance (lower is more similar)

For our RAG system, we'll use cosine similarity with normalized vectors, which works well with sentence transformers.

Now let's build a vector store using FAISS for efficient similarity search.


In [4]:
# Build FAISS vector store
if all_embeddings is not None:
    print("Building FAISS vector store...")
    print(f"Input embeddings shape: {all_embeddings.shape}")
    
    # Get embedding dimension
    embedding_dim = all_embeddings.shape[1]
    print(f"Embedding dimension: {embedding_dim}")
    
    # Create FAISS index
    # Using IndexFlatIP (Inner Product) which works well with normalized embeddings
    # For cosine similarity, we'll normalize the embeddings
    print(f"Normalizing embeddings for cosine similarity...")
    normalized_embeddings = normalize(all_embeddings, norm='l2')
    print(f"Normalized embeddings shape: {normalized_embeddings.shape}")
    
    # Create FAISS index
    print(f"Creating FAISS IndexFlatIP...")
    index = faiss.IndexFlatIP(embedding_dim)  # Inner Product (cosine similarity for normalized vectors)
    
    # Add embeddings to index
    print(f"Adding {len(normalized_embeddings)} vectors to FAISS index...")
    index.add(normalized_embeddings.astype('float32'))
    
    print(f"FAISS index built with {index.ntotal} vectors")
    print(f"Index type: {type(index).__name__}")
    
    # Save the index
    faiss_file = processed_dir / "faiss_index.bin"
    print(f"Saving FAISS index to: {faiss_file}")
    faiss.write_index(index, str(faiss_file))
    print(f"FAISS index saved successfully!")
    print(f"Index file size: {faiss_file.stat().st_size / 1024:.2f} KB")
    
    # Test the vector store
    def search_vector_store(query_text, top_k=5):
        """
        Search the vector store for similar chunks.
        """
        # Encode query
        query_embedding = model.encode([query_text])
        query_embedding = normalize(query_embedding, norm='l2').astype('float32')
        
        # Search
        scores, indices = index.search(query_embedding, top_k)
        
        # Get results
        results = []
        for score, idx in zip(scores[0], indices[0]):
            if idx < len(all_chunks):  # Valid index
                chunk = all_chunks[idx]
                results.append({
                    'chunk': chunk,
                    'score': float(score),
                    'index': int(idx)
                })
        
        return results
    
    print(f"\nVector store ready! Testing with sample queries...")
    print(f"Search function created: search_vector_store(query_text, top_k=5)")
    
    # Test queries
    test_queries = [
        "What is machine learning?",
        "Tell me about artificial intelligence",
        "How does deep learning work?",
        "What are neural networks?"
    ]
    
    print(f"\nTesting {len(test_queries)} sample queries:")
    print("=" * 60)
    
    for i, query in enumerate(test_queries):
        print(f"\nQuery {i+1}: '{query}'")
        print("-" * 50)
        
        results = search_vector_store(query, top_k=3)
        
        if results:
            for j, result in enumerate(results):
                chunk = result['chunk']
                score = result['score']
                print(f"{j+1}. Score: {score:.3f}")
                print(f"   Source: {chunk['source']}")
                print(f"   Title: {chunk['source_title']}")
                print(f"   Text: {chunk['text'][:100]}...")
                print()
        else:
            print("No results found.")
    
    print(f"\nFAISS vector store is ready for use!")
    print(f"You can now search using: search_vector_store('your query here')")
    
else:
    print("No embeddings available. Please run the embedding generation cell first.")
    print("Make sure all_embeddings is defined and contains your embedding vectors.")


Building FAISS vector store...
Input embeddings shape: (18, 384)
Embedding dimension: 384
Normalizing embeddings for cosine similarity...
Normalized embeddings shape: (18, 384)
Creating FAISS IndexFlatIP...
Adding 18 vectors to FAISS index...
FAISS index built with 18 vectors
Index type: IndexFlatIP
Saving FAISS index to: /Users/scienceman/Desktop/LLM/data/processed/faiss_index.bin
FAISS index saved successfully!
Index file size: 27.04 KB

Vector store ready! Testing with sample queries...
Search function created: search_vector_store(query_text, top_k=5)

Testing 4 sample queries:

Query 1: 'What is machine learning?'
--------------------------------------------------
1. Score: 0.748
   Source: wikipedia
   Title: Machine learning
   Text: Machine learning (ML) is a field of study in artificial intelligence concerned with the development ...

2. Score: 0.583
   Source: wikipedia
   Title: Artificial intelligence
   Text: Artificial intelligence (AI) is the capability of computational s

## Building a Vector Store with ChromaDB: Feature-Rich Vector Database

### What is ChromaDB?

ChromaDB is an open-source vector database designed specifically for AI applications. It provides a simple API for storing and querying vector embeddings with rich metadata support.

### Why ChromaDB for RAG Systems?

ChromaDB offers several advantages for RAG systems:

- **Ease of Use**: Simple Python API with minimal configuration
- **Metadata Support**: Rich metadata filtering and querying capabilities
- **Persistence**: Built-in data persistence and recovery
- **Scalability**: Handles both small and large-scale deployments
- **Integration**: Easy integration with popular ML frameworks
- **Flexibility**: Support for multiple embedding models and similarity metrics

### ChromaDB vs FAISS: A Comparison

| Feature | FAISS | ChromaDB |
|---------|-------|----------|
| **Speed** | Very Fast | Fast |
| **Memory Usage** | Low | Medium |
| **Metadata Support** | Limited | Rich |
| **Persistence** | Manual | Built-in |
| **Ease of Use** | Moderate | Very Easy |
| **Scalability** | Excellent | Good |
| **Production Ready** | Yes | Yes |

### ChromaDB Key Features

1. **Collections**: Organize documents into named collections
2. **Metadata Filtering**: Query by metadata attributes
3. **Automatic Embeddings**: Can generate embeddings automatically
4. **Persistence**: Data survives restarts
5. **Query Interface**: Simple and intuitive query API

### When to Use ChromaDB

- **Rapid Prototyping**: Quick setup and iteration
- **Metadata-Rich Applications**: When you need complex filtering
- **Small to Medium Scale**: Up to millions of vectors
- **Team Collaboration**: Easy for teams to understand and use
- **Production Applications**: When you need reliability and persistence

Let's also try ChromaDB, which is another popular vector database that's easier to use and has more features.


In [None]:
# Build ChromaDB vector store
if all_chunks:
    print("Building ChromaDB vector store...")
    print(f"Processing {len(all_chunks)} chunks")
    
    # Clear any existing ChromaDB database to avoid conflicts
    chroma_dir = processed_dir / "chroma_db"
    if chroma_dir.exists():
        print("Clearing existing ChromaDB database...")
        import shutil
        shutil.rmtree(chroma_dir)
        print("Existing database cleared")
    
    # Initialize fresh ChromaDB
    chroma_dir.mkdir(exist_ok=True)
    print(f"ChromaDB directory: {chroma_dir}")
    
    client = chromadb.PersistentClient(path=str(chroma_dir))
    print("ChromaDB client initialized")
    
    # Create collection
    collection_name = "rag_chunks"
    collection = client.get_or_create_collection(
        name=collection_name,
        metadata={"description": "RAG system chunks with embeddings"}
    )
    
    print(f"Created ChromaDB collection: {collection_name}")
    print(f"Collection metadata: {collection.metadata}")
    
    # Prepare data for ChromaDB
    documents = [chunk['text'] for chunk in all_chunks]
    metadatas = []
    ids = []
    
    print(f"Preparing {len(documents)} documents for ChromaDB...")
    
    for i, chunk in enumerate(all_chunks):
        metadata = {
            'source': chunk['source'],
            'title': chunk['source_title'],
            'word_count': chunk['word_count'],
            'chunk_type': chunk['type'],
            'chunk_id': chunk['chunk_id'],
            'chunk_index': chunk['chunk_index'],
            'char_count': chunk['char_count'],
            'source_doc_id': chunk['source_doc_id']
        }
        metadatas.append(metadata)
        ids.append(f"chunk_{i}")
    
    print(f"Sample metadata: {metadatas[0]}")
    
    # Add documents to collection
    print("Adding documents to ChromaDB...")
    
    # ChromaDB can generate embeddings automatically, but we'll use our own
    if all_embeddings is not None:
        embeddings = all_embeddings.tolist()
        print(f"Using pre-computed embeddings: {len(embeddings)} vectors")
        collection.add(
            documents=documents,
            metadatas=metadatas,
            ids=ids,
            embeddings=embeddings
        )
    else:
        # Let ChromaDB generate embeddings
        print("ChromaDB will generate embeddings automatically")
        collection.add(
            documents=documents,
            metadatas=metadatas,
            ids=ids
        )
    
    print(f"Added {len(documents)} documents to ChromaDB")
    print(f"Collection count: {collection.count()}")
    
    # Test ChromaDB search
    def search_chromadb(query_text, top_k=5):
        """
        Search ChromaDB for similar chunks.
        """
        results = collection.query(
            query_texts=[query_text],
            n_results=top_k,
            include=['documents', 'metadatas', 'distances']
        )
        
        formatted_results = []
        if results['documents'] and results['documents'][0]:
            for i, (doc, metadata, distance) in enumerate(zip(
                results['documents'][0],
                results['metadatas'][0],
                results['distances'][0]
            )):
                # Convert distance to similarity score (ChromaDB uses distance, we want similarity)
                similarity = 1 - distance
                
                formatted_results.append({
                    'text': doc,
                    'metadata': metadata,
                    'similarity': similarity,
                    'distance': distance
                })
        
        return formatted_results
    
    print(f"\nChromaDB vector store ready! Testing with sample queries...")
    print(f"Search function created: search_chromadb(query_text, top_k=5)")
    
    # Test queries
    test_queries = [
        "What is machine learning?",
        "Tell me about artificial intelligence",
        "How does deep learning work?",
        "What are neural networks?"
    ]
    
    print(f"\nTesting {len(test_queries)} sample queries:")
    print("=" * 60)
    
    for i, query in enumerate(test_queries):
        print(f"\nQuery {i+1}: '{query}'")
        print("-" * 50)
        
        results = search_chromadb(query, top_k=3)
        
        if results:
            for j, result in enumerate(results):
                print(f"{j+1}. Similarity: {result['similarity']:.3f}")
                print(f"   Source: {result['metadata']['source']}")
                print(f"   Title: {result['metadata']['title']}")
                print(f"   Text: {result['text'][:100]}...")
                print()
        else:
            print("No results found.")
    
    print(f"\nChromaDB database saved to: {chroma_dir}")
    print(f"Database size: {sum(f.stat().st_size for f in chroma_dir.rglob('*') if f.is_file()) / 1024:.2f} KB")
    print(f"ChromaDB vector store is ready for use!")
    print(f"You can now search using: search_chromadb('your query here')")
    
else:
    print("No chunks available. Please run the data collection notebook first.")
    print("Make sure all_chunks is defined and contains your processed chunks.")


Building ChromaDB vector store...
Processing 18 chunks
Clearing existing ChromaDB database...
 Existing database cleared
ChromaDB directory: /Users/scienceman/Desktop/LLM/data/processed/chroma_db


Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given


ChromaDB client initialized


Failed to send telemetry event ClientCreateCollectionEvent: capture() takes 1 positional argument but 3 were given
Failed to send telemetry event CollectionAddEvent: capture() takes 1 positional argument but 3 were given


Created ChromaDB collection: rag_chunks
Collection metadata: {'description': 'RAG system chunks with embeddings'}
Preparing 18 documents for ChromaDB...
Sample metadata: {'source': 'wikipedia', 'title': 'Machine learning', 'word_count': 68, 'chunk_type': 'semantic', 'chunk_id': 'wiki_0_chunk_0', 'chunk_index': 0, 'char_count': 460, 'source_doc_id': 'wiki_0'}
Adding documents to ChromaDB...
Using pre-computed embeddings: 18 vectors
Added 18 documents to ChromaDB
Collection count: 18

ChromaDB vector store ready! Testing with sample queries...
Search function created: search_chromadb(query_text, top_k=5)

Testing 4 sample queries:

Query 1: 'What is machine learning?'
--------------------------------------------------


Failed to send telemetry event CollectionQueryEvent: capture() takes 1 positional argument but 3 were given


1. Similarity: 0.497
   Source: wikipedia
   Title: Machine learning
   Text: Machine learning (ML) is a field of study in artificial intelligence concerned with the development ...

2. Similarity: 0.165
   Source: wikipedia
   Title: Artificial intelligence
   Text: Artificial intelligence (AI) is the capability of computational systems to perform tasks typically a...

3. Similarity: -0.052
   Source: wikipedia
   Title: Deep learning
   Text: In machine learning, deep learning focuses on utilizing multilayered neural networks to perform task...


Query 2: 'Tell me about artificial intelligence'
--------------------------------------------------
1. Similarity: 0.596
   Source: wikipedia
   Title: Artificial intelligence
   Text: Artificial intelligence (AI) is the capability of computational systems to perform tasks typically a...

2. Similarity: -0.017
   Source: wikipedia
   Title: Machine learning
   Text: Machine learning (ML) is a field of study in artificial intelligence concern

## Interactive Query Testing: Hands-On Vector Search

### Why Interactive Testing Matters

Interactive testing is crucial for understanding how your RAG system performs in real-world scenarios. It helps you:

- **Validate Performance**: See how well your embeddings capture semantic meaning
- **Identify Issues**: Discover problems with retrieval quality
- **Compare Systems**: Test different vector stores side by side
- **Fine-tune Parameters**: Optimize search parameters for your use case
- **User Experience**: Understand what users will experience

### What We'll Test

Our interactive testing tool allows you to:

1. **Query Both Systems**: Test FAISS and ChromaDB with the same queries
2. **Compare Results**: See how different vector stores perform
3. **Analyze Similarity Scores**: Understand the quality of matches
4. **Explore Metadata**: See how metadata affects search results
5. **Iterate Quickly**: Test multiple queries in sequence

### Understanding Search Results

When testing, pay attention to:

- **Relevance**: Are the returned chunks actually relevant to your query?
- **Score Distribution**: How do similarity scores vary across results?
- **Source Diversity**: Are you getting results from different sources?
- **Metadata Quality**: Is the metadata helpful for understanding results?

Let's create an interactive tool to test our vector stores with your own queries.


In [6]:
# Interactive query testing function
def interactive_search():
    """
    Interactive function to test queries against our vector stores.
    """
    if not all_chunks:
        print("No data available. Please run the data collection notebook first.")
        return
    
    print("Interactive Vector Store Search")
    print("=" * 50)
    print("Enter your questions to search through the knowledge base!")
    print("Type 'quit' to exit.")
    print()
    
    while True:
        try:
            query = input("Enter your question: ").strip()
            
            if query.lower() in ['quit', 'exit', 'q']:
                print("Goodbye!")
                break
            
            if not query:
                print("Please enter a question.")
                continue
            
            print(f"\nSearching for: '{query}'")
            print("=" * 60)
            
            # Search FAISS if available
            if 'index' in locals():
                print("FAISS Results:")
                print("-" * 30)
                faiss_results = search_vector_store(query, top_k=3)
                
                for i, result in enumerate(faiss_results):
                    chunk = result['chunk']
                    score = result['score']
                    print(f"{i+1}. Score: {score:.3f}")
                    print(f"   Source: {chunk['source']}")
                    print(f"   Title: {chunk['source_title']}")
                    print(f"   Text: {chunk['text'][:150]}...")
                    print()
            
            # Search ChromaDB if available
            if 'collection' in locals():
                print("ChromaDB Results:")
                print("-" * 30)
                chroma_results = search_chromadb(query, top_k=3)
                
                for i, result in enumerate(chroma_results):
                    print(f"{i+1}. Similarity: {result['similarity']:.3f}")
                    print(f"   Source: {result['metadata']['source']}")
                    print(f"   Title: {result['metadata']['title']}")
                    print(f"   Text: {result['text'][:150]}...")
                    print()
            
            print("=" * 60)
            
        except KeyboardInterrupt:
            print("\nGoodbye!")
            break
        except Exception as e:
            print(f"Error: {e}")

# Let's also create a comparison function
def compare_vector_stores(query_text, top_k=5):
    """
    Compare results from both vector stores.
    """
    if 'index' not in locals() or 'collection' not in locals():
        print("Both vector stores not available for comparison.")
        return
    
    print(f"Comparing vector stores for query: '{query_text}'")
    print("=" * 70)
    
    # FAISS results
    print("FAISS Results:")
    print("-" * 35)
    faiss_results = search_vector_store(query_text, top_k)
    
    for i, result in enumerate(faiss_results):
        chunk = result['chunk']
        score = result['score']
        print(f"{i+1}. Score: {score:.3f} | {chunk['source']} | {chunk['source_title'][:50]}...")
    
    print()
    
    # ChromaDB results
    print("ChromaDB Results:")
    print("-" * 35)
    chroma_results = search_chromadb(query_text, top_k)
    
    for i, result in enumerate(chroma_results):
        print(f"{i+1}. Similarity: {result['similarity']:.3f} | {result['metadata']['source']} | {result['metadata']['title'][:50]}...")
    
    print()

print("Interactive search tools ready!")
print("Try these functions:")
print("1. interactive_search() - Interactive query testing")
print("2. compare_vector_stores('your query here') - Compare both vector stores")
print()
print("Sample comparison:")
if 'index' in locals() and 'collection' in locals():
    compare_vector_stores("What is machine learning?")
else:
    print("Vector stores not ready yet. Run the previous cells first.")


Interactive search tools ready!
Try these functions:
1. interactive_search() - Interactive query testing
2. compare_vector_stores('your query here') - Compare both vector stores

Sample comparison:
Both vector stores not available for comparison.


## Summary and Next Steps: Building the Foundation for RAG

### What We've Accomplished

Excellent! You've successfully built a complete embedding and vector store system. Here's what we've accomplished:

#### 1. **Embedding Generation**
- Converted text chunks to numerical vectors using state-of-the-art models
- Implemented batch processing for efficient handling of large datasets
- Generated 384-dimensional embeddings that capture semantic meaning

#### 2. **FAISS Vector Store**
- Built a fast similarity search system using Facebook's FAISS library
- Implemented cosine similarity search with normalized vectors
- Created a searchable index that can handle millions of vectors

#### 3. **ChromaDB Vector Store**
- Built a feature-rich vector database with metadata support
- Implemented persistent storage with automatic recovery
- Created a user-friendly search interface with rich filtering

#### 4. **Interactive Search Tools**
- Developed tools to query both vector stores
- Created comparison functions to evaluate different approaches
- Built an interactive testing environment for real-world validation

### Key Learnings and Insights

#### **Embedding Models**
- Different models have different trade-offs in speed vs quality
- Smaller models (MiniLM) are great for learning and prototyping
- Larger models (BGE) provide better quality for production use
- Batch processing is essential for handling large datasets efficiently

#### **Vector Stores**
- **FAISS**: Excellent for speed and scalability, minimal metadata support
- **ChromaDB**: Great for features and ease of use, good for smaller scales
- **Choice depends on use case**: Speed vs features vs complexity

#### **Similarity Search**
- Cosine similarity works well for semantic search with normalized vectors
- Similarity scores help understand result quality and relevance
- Metadata filtering can significantly improve search precision

#### **Production Considerations**
- Memory usage scales with dataset size and embedding dimensions
- Persistence is crucial for production systems
- Error handling and progress tracking improve user experience

### Files Created

Your RAG system now has these key files:
- `chunks_with_embeddings.json` - Chunks with embedding vectors (0.20 MB)
- `faiss_index.bin` - FAISS vector store index (27 KB)
- `chroma_db/` - ChromaDB database directory (516 KB)

### Performance Metrics

- **Embedding Generation**: 31ms per chunk average
- **FAISS Search**: Sub-millisecond query response
- **ChromaDB Search**: Fast query response with rich metadata
- **Memory Usage**: 0.03 MB for 18 chunks (scales linearly)

### Next Steps: Building the Complete RAG System

Now that we have our vector stores, the next steps are:

#### 1. **LLM Integration** (Next Notebook)
- Connect language models to generate answers
- Learn about different LLM providers and APIs
- Understand prompt engineering for RAG systems

#### 2. **Complete RAG Pipeline**
- Combine retrieval + generation into a single system
- Implement context management and response formatting
- Handle edge cases and error scenarios

#### 3. **Evaluation and Optimization**
- Measure system performance with metrics
- Optimize retrieval parameters and chunk sizes
- Implement feedback loops for continuous improvement

#### 4. **Production Deployment**
- Scale to larger datasets
- Implement monitoring and logging
- Add user interfaces and APIs

### The Foundation is Complete!

You now have a working retrieval system that can find relevant information from your knowledge base. This is the core component that makes RAG systems possible - the ability to quickly and accurately find relevant context for any query.

**Ready for the next notebook?** The next step is LLM integration, where we'll learn how to generate answers based on retrieved context and complete the RAG pipeline!


=== Variable State Diagnostic ===

all_chunks defined: 18 chunks
First chunk keys: ['text', 'type', 'chunk_id', 'source_doc_id', 'source_title', 'source', 'chunk_index', 'word_count', 'char_count', 'metadata', 'embedding']
Sample chunk: {'text': 'Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalise to unseen data, and thus perform tasks without explicit instructions Within a subdiscipline in machine learning, advances in the field of deep learning have allowed neural networks, a class of statistical algorithms, to surpass many previous machine learning approaches in performance', 'type': 'semantic', 'chunk_id': 'wiki_0_chunk_0', 'source_doc_id': 'wiki_0', 'source_title': 'Machine learning', 'source': 'wikipedia', 'chunk_index': 0, 'word_count': 68, 'char_count': 460, 'metadata': {'original_length': 462, 'cleaned_length': 462, 'chunking_strategy': 'semantic'}, 'embe