# Embeddings and Vector Store

In this notebook, we'll learn how to convert our text chunks into embeddings (numerical vectors) and build a vector store for efficient similarity search. This is the core of how RAG systems find relevant information.

## Learning Objectives
By the end of this notebook, you will:
1. Understand different embedding models and their trade-offs
2. Generate embeddings for your text chunks
3. Build and query a vector database
4. Compare two vector stores, FAISS and ChromaDB, on the same queries
5. Learn about vector store optimization


## Setup and Imports

Let's import the libraries we need and load our processed data.


In [None]:
# Standard library imports
import json
import numpy as np
import pandas as pd
from pathlib import Path
from typing import List, Dict, Any, Optional
import time
import matplotlib.pyplot as plt
import seaborn as sns

# Embedding and vector store imports
from sentence_transformers import SentenceTransformer
import chromadb
from chromadb.config import Settings
import faiss
from sklearn.metrics.pairwise import cosine_similarity

# Set up plotting
plt.style.use('default')
sns.set_palette("husl")

# Add project root to path
import sys
sys.path.append(str(Path.cwd().parent))

# Import our configuration
from src.config import DATA_CONFIG, DATA_DIR

print("Libraries imported successfully!")
print(f"Data directory: {DATA_DIR}")

# Check if we have processed data
processed_dir = DATA_DIR / "processed"
chunks_file = processed_dir / "all_chunks.json"

if chunks_file.exists():
    print(f"Found processed chunks: {chunks_file}")
    with open(chunks_file, 'r', encoding='utf-8') as f:
        all_chunks = json.load(f)
    print(f"Loaded {len(all_chunks)} chunks")
else:
    print("No processed chunks found. Please run the data collection notebook first.")
    all_chunks = []


## Understanding Embedding Models

Before we generate embeddings, let's understand the different models available and their trade-offs.
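
One trade-off that is easy to quantify is storage: doubling the embedding dimension doubles the size of the vector matrix. The sketch below is a rough estimate only, assuming float32 vectors (4 bytes per value) and a hypothetical corpus of 10,000 chunks. `embedding_memory_mb` is an illustrative helper, not part of any library.

```python
def embedding_memory_mb(n_vectors: int, dim: int, bytes_per_value: int = 4) -> float:
    """Approximate size of a dense embedding matrix in megabytes (1 MB = 1e6 bytes)."""
    return n_vectors * dim * bytes_per_value / 1_000_000

n_chunks = 10_000  # hypothetical corpus size, chosen for illustration
for name, dim in [("all-MiniLM-L6-v2", 384), ("all-mpnet-base-v2", 768)]:
    print(f"{name}: {embedding_memory_mb(n_chunks, dim):.1f} MB")
```

This counts only the raw vectors. Any index structure, stored documents, and metadata come on top.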


In [None]:
# Let's compare different embedding models
embedding_models = {
    "all-MiniLM-L6-v2": {
        "description": "Small, fast model (384 dimensions)",
        "use_case": "Good for learning and experimentation",
        "size": "Small"
    },
    "all-mpnet-base-v2": {
        "description": "Medium model (768 dimensions)",
        "use_case": "Good balance of speed and quality",
        "size": "Medium"
    },
    "BAAI/bge-base-en-v1.5": {
        "description": "High-quality model (768 dimensions)",
        "use_case": "Production use, better quality",
        "size": "Medium"  # a base-size model, comparable to all-mpnet-base-v2
    }
}

print("Available Embedding Models:")
print("=" * 50)
for model_name, info in embedding_models.items():
    print(f"Model: {model_name}")
    print(f"  Description: {info['description']}")
    print(f"  Use Case: {info['use_case']}")
    print(f"  Size: {info['size']}")
    print()

# For learning purposes, let's start with the small, fast model
print("Loading embedding model...")
model = SentenceTransformer('all-MiniLM-L6-v2')
print(f"Model loaded: {model}")
print(f"Embedding dimension: {model.get_sentence_embedding_dimension()}")

# Test the model with a simple example
test_texts = [
    "Cats are small, furry animals that make great pets.",
    "Dogs are loyal companions that love to play.",
    "Machine learning is a subset of artificial intelligence."
]

print("\nTesting embedding generation...")
embeddings = model.encode(test_texts)
print(f"Generated embeddings shape: {embeddings.shape}")
print(f"Each text is now represented by {embeddings.shape[1]} numbers")

# Show similarity between the texts
similarities = cosine_similarity(embeddings)
print(f"\nSimilarity matrix:")
print("Texts:")
for i, text in enumerate(test_texts):
    print(f"  {i}: {text[:50]}...")

print("\nSimilarity scores (higher = more similar):")
for i in range(len(test_texts)):
    for j in range(i+1, len(test_texts)):
        sim = similarities[i][j]
        print(f"  Text {i} vs Text {j}: {sim:.3f}")
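
To make the similarity scores less of a black box, here is a minimal sketch of what cosine similarity computes: the dot product of two vectors divided by the product of their lengths. The three-dimensional vectors are made up for illustration and are not real model outputs. `cosine_sim` is a hypothetical helper that does by hand what `sklearn`'s `cosine_similarity` does for a single pair.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: dot product over the product of norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings" (invented values, for illustration only)
cat = np.array([1.0, 0.9, 0.0])
dog = np.array([0.9, 1.0, 0.1])
ml = np.array([0.0, 0.1, 1.0])

print(f"cat vs dog: {cosine_sim(cat, dog):.3f}")
print(f"cat vs ml:  {cosine_sim(cat, ml):.3f}")
```

Vectors pointing in nearly the same direction score close to 1, and nearly orthogonal vectors score close to 0. Vector length does not affect the score.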


## Generating Embeddings for Our Data

Now let's generate embeddings for our processed chunks. We'll do this in batches to handle large amounts of data efficiently.
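
The batching itself is plain list slicing. As a minimal sketch, with `batched` as a hypothetical helper and 70 placeholder strings standing in for `chunk_texts`:

```python
from typing import Iterator, List, Sequence

def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])

texts = [f"chunk {i}" for i in range(70)]  # stand-in for chunk_texts
sizes = [len(batch) for batch in batched(texts, 32)]
print(sizes)  # [32, 32, 6]
```

Note that `SentenceTransformer.encode` also accepts a `batch_size` argument and batches internally, so a manual loop is mainly useful when you want your own progress reporting.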


In [None]:
# Generate embeddings for all our chunks
if all_chunks:
    print(f"Generating embeddings for {len(all_chunks)} chunks...")
    
    # Extract text from chunks
    chunk_texts = [chunk['text'] for chunk in all_chunks]
    
    # Generate embeddings in batches for efficiency
    batch_size = 32
    all_embeddings = []
    
    print(f"Processing in batches of {batch_size}...")
    
    start_time = time.time()
    
    for i in range(0, len(chunk_texts), batch_size):
        batch_texts = chunk_texts[i:i+batch_size]
        batch_embeddings = model.encode(batch_texts, show_progress_bar=False)
        all_embeddings.append(batch_embeddings)
        
        if (i // batch_size + 1) % 5 == 0:  # Print progress every 5 batches
            print(f"Processed {min(i + batch_size, len(chunk_texts))}/{len(chunk_texts)} chunks")
    
    # Combine all embeddings
    all_embeddings = np.vstack(all_embeddings)
    
    end_time = time.time()
    processing_time = end_time - start_time
    
    print(f"\nEmbedding generation completed!")
    print(f"Total time: {processing_time:.2f} seconds")
    print(f"Embeddings shape: {all_embeddings.shape}")
    print(f"Average time per chunk: {processing_time/len(all_chunks)*1000:.2f} ms")
    
    # Add embeddings to our chunks
    for i, chunk in enumerate(all_chunks):
        chunk['embedding'] = all_embeddings[i].tolist()
    
    print(f"Embeddings added to {len(all_chunks)} chunks")
    
    # Save the chunks with embeddings
    embeddings_file = processed_dir / "chunks_with_embeddings.json"
    with open(embeddings_file, 'w', encoding='utf-8') as f:
        json.dump(all_chunks, f, indent=2, ensure_ascii=False)
    
    print(f"Chunks with embeddings saved to: {embeddings_file}")
    
else:
    print("No chunks available. Please run the data collection notebook first.")
    all_embeddings = None


## Building a Vector Store with FAISS

Now let's build a vector store using FAISS (Facebook AI Similarity Search) for efficient similarity search.
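
Conceptually, an exact inner-product search scores the query against every stored vector and keeps the best `k`. The NumPy sketch below illustrates that idea on made-up random data. It is not how FAISS is implemented internally, and `top_k_inner_product` is a hypothetical name.

```python
import numpy as np

def top_k_inner_product(query: np.ndarray, vectors: np.ndarray, k: int):
    """Exact search: score every stored vector, return the k best (scores, indices)."""
    scores = vectors @ query           # one inner product per stored vector
    order = np.argsort(-scores)[:k]    # indices sorted by descending score
    return scores[order], order

rng = np.random.default_rng(0)
vectors = rng.normal(size=(100, 8)).astype("float32")      # made-up data
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)   # scale rows to unit length

query = vectors[42]  # query with one of the stored vectors
scores, idx = top_k_inner_product(query, vectors, k=3)
print(idx, scores)
```

Because every row has unit length, the inner product equals the cosine similarity, so a stored vector is always its own best match.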


In [None]:
# Build FAISS vector store
if all_embeddings is not None:
    print("Building FAISS vector store...")
    
    # Get embedding dimension
    embedding_dim = all_embeddings.shape[1]
    print(f"Embedding dimension: {embedding_dim}")
    
    # IndexFlatIP ranks by inner product; on L2-normalized vectors
    # the inner product equals cosine similarity
    from sklearn.preprocessing import normalize
    
    # Normalize embeddings to unit length
    normalized_embeddings = normalize(all_embeddings, norm='l2')
    
    # Create an exact (brute-force) inner-product index
    index = faiss.IndexFlatIP(embedding_dim)
    
    # Add embeddings to index
    index.add(normalized_embeddings.astype('float32'))
    
    print(f"FAISS index built with {index.ntotal} vectors")
    
    # Save the index
    faiss_file = processed_dir / "faiss_index.bin"
    faiss.write_index(index, str(faiss_file))
    print(f"FAISS index saved to: {faiss_file}")
    
    # Test the vector store
    def search_vector_store(query_text, top_k=5):
        """
        Search the vector store for similar chunks.
        """
        # Encode query
        query_embedding = model.encode([query_text])
        query_embedding = normalize(query_embedding, norm='l2').astype('float32')
        
        # Search
        scores, indices = index.search(query_embedding, top_k)
        
        # Get results
        results = []
        for score, idx in zip(scores[0], indices[0]):
            if 0 <= idx < len(all_chunks):  # FAISS pads missing results with -1
                chunk = all_chunks[idx]
                results.append({
                    'chunk': chunk,
                    'score': float(score),
                    'index': int(idx)
                })
        
        return results
    
    print("\nVector store ready! Testing with sample queries...")
    
    # Test queries
    test_queries = [
        "What is machine learning?",
        "Tell me about cats",
        "How does artificial intelligence work?",
        "What are the benefits of pets?"
    ]
    
    for query in test_queries:
        print(f"\nQuery: '{query}'")
        print("-" * 40)
        
        results = search_vector_store(query, top_k=3)
        
        for i, result in enumerate(results):
            chunk = result['chunk']
            score = result['score']
            print(f"{i+1}. Score: {score:.3f}")
            print(f"   Source: {chunk['source']}")
            print(f"   Title: {chunk['source_title']}")
            print(f"   Text: {chunk['text'][:100]}...")
            print()
    
else:
    print("No embeddings available. Please run the embedding generation cell first.")
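
One detail worth knowing when mapping FAISS results back to chunks: when fewer than `top_k` neighbors are available, FAISS pads the index array with `-1`, and in Python `all_chunks[-1]` would silently return the last chunk. The sketch below simulates a search result with hand-written arrays (no FAISS call involved) to show the guard. `collect_hits` is a hypothetical helper.

```python
import numpy as np

def collect_hits(scores, indices, chunks):
    """Pair each valid row index with its chunk; skip the -1 padding entries."""
    hits = []
    for score, idx in zip(scores[0], indices[0]):
        if 0 <= idx < len(chunks):
            hits.append({"chunk": chunks[idx], "score": float(score), "index": int(idx)})
    return hits

chunks = [{"text": "a"}, {"text": "b"}]                      # made-up two-chunk corpus
scores = np.array([[0.91, 0.40, -3.4e38]], dtype="float32")  # shape (1, k), simulated
indices = np.array([[1, 0, -1]], dtype="int64")              # -1 marks "no result"

hits = collect_hits(scores, indices, chunks)
print([h["index"] for h in hits])  # [1, 0]
```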


## Building a Vector Store with ChromaDB

Let's also try ChromaDB, another popular vector database. Unlike a bare FAISS index, it stores the documents and their metadata alongside the vectors and handles persistence to disk for us.


In [None]:
# Build ChromaDB vector store
if all_chunks:
    print("Building ChromaDB vector store...")
    
    # Initialize ChromaDB
    chroma_dir = processed_dir / "chroma_db"
    chroma_dir.mkdir(exist_ok=True)
    
    client = chromadb.PersistentClient(path=str(chroma_dir))
    
    # Create collection
    collection_name = "rag_chunks"
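    # Use cosine distance so that 1 - distance is a cosine similarity.
    # (Chroma's default space is L2; this setting only takes effect when the collection is first created.)
    collection = client.get_or_create_collection(
        name=collection_name,
        metadata={
            "description": "RAG system chunks with embeddings",
            "hnsw:space": "cosine"
        }
    )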
    
    print(f"Created ChromaDB collection: {collection_name}")
    
    # Prepare data for ChromaDB
    documents = [chunk['text'] for chunk in all_chunks]
    metadatas = []
    ids = []
    
    for i, chunk in enumerate(all_chunks):
        metadata = {
            'source': chunk['source'],
            'title': chunk['source_title'],
            'word_count': chunk['word_count'],
            'chunk_type': chunk['type'],
            'chunk_id': chunk['chunk_id']
        }
        metadatas.append(metadata)
        ids.append(f"chunk_{i}")
    
    # Add documents to collection
    print("Adding documents to ChromaDB...")
    
    # ChromaDB can generate embeddings automatically, but we'll use our own
    if all_embeddings is not None:
        embeddings = all_embeddings.tolist()
        collection.add(
            documents=documents,
            metadatas=metadatas,
            ids=ids,
            embeddings=embeddings
        )
    else:
        # Let ChromaDB generate embeddings
        collection.add(
            documents=documents,
            metadatas=metadatas,
            ids=ids
        )
    
    print(f"Added {len(documents)} documents to ChromaDB")
    
    # Test ChromaDB search
    def search_chromadb(query_text, top_k=5):
        """
        Search ChromaDB for similar chunks.
        """
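        # Embed the query with the same model that produced the stored embeddings
        query_embedding = model.encode([query_text]).tolist()
        results = collection.query(
            query_embeddings=query_embedding,
            n_results=top_k,
            include=['documents', 'metadatas', 'distances']
        )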
        
        formatted_results = []
        if results['documents'] and results['documents'][0]:
            for i, (doc, metadata, distance) in enumerate(zip(
                results['documents'][0],
                results['metadatas'][0],
                results['distances'][0]
            )):
                # Convert distance to similarity score (ChromaDB uses distance, we want similarity)
                similarity = 1 - distance
                
                formatted_results.append({
                    'text': doc,
                    'metadata': metadata,
                    'similarity': similarity,
                    'distance': distance
                })
        
        return formatted_results
    
    print("\nChromaDB vector store ready! Testing with sample queries...")
    
    # Test queries
    test_queries = [
        "What is machine learning?",
        "Tell me about cats",
        "How does artificial intelligence work?",
        "What are the benefits of pets?"
    ]
    
    for query in test_queries:
        print(f"\nQuery: '{query}'")
        print("-" * 40)
        
        results = search_chromadb(query, top_k=3)
        
        for i, result in enumerate(results):
            print(f"{i+1}. Similarity: {result['similarity']:.3f}")
            print(f"   Source: {result['metadata']['source']}")
            print(f"   Title: {result['metadata']['title']}")
            print(f"   Text: {result['text'][:100]}...")
            print()
    
    print(f"ChromaDB database saved to: {chroma_dir}")
    
else:
    print("No chunks available. Please run the data collection notebook first.")
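
A caution on turning distances into similarities: `1 - distance` is only a cosine similarity when the collection uses cosine distance. As I understand the Chroma documentation (worth verifying for your version), the `l2` space reports squared Euclidean distance, for which that conversion is not meaningful. The NumPy sketch below uses two made-up vectors and hypothetical helper names to show the difference. It does not call ChromaDB.

```python
import numpy as np

def cosine_distance(a, b):
    """1 minus the cosine similarity of a and b."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def squared_l2_distance(a, b):
    """Sum of squared coordinate differences."""
    return float(np.sum((a - b) ** 2))

a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

print(1 - cosine_distance(a, b))       # the cosine similarity, about 0.96
print(1 - squared_l2_distance(a, b))   # negative here: not a similarity

# For unit-length vectors the two are related: squared_l2 = 2 * cosine_distance
a_unit, b_unit = a / np.linalg.norm(a), b / np.linalg.norm(b)
print(squared_l2_distance(a_unit, b_unit), 2 * cosine_distance(a_unit, b_unit))
```

So with normalized embeddings and squared L2 distances, a cosine similarity can still be recovered as `1 - distance / 2`.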


## Interactive Query Testing

Let's create an interactive tool to test our vector stores with your own queries.


In [None]:
# Interactive query testing function
def interactive_search():
    """
    Interactive function to test queries against our vector stores.
    """
    if not all_chunks:
        print("No data available. Please run the data collection notebook first.")
        return
    
    print("Interactive Vector Store Search")
    print("=" * 50)
    print("Enter your questions to search through the knowledge base!")
    print("Type 'quit' to exit.")
    print()
    
    while True:
        try:
            query = input("Enter your question: ").strip()
            
            if query.lower() in ['quit', 'exit', 'q']:
                print("Goodbye!")
                break
            
            if not query:
                print("Please enter a question.")
                continue
            
            print(f"\nSearching for: '{query}'")
            print("=" * 60)
            
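            # Search FAISS if available (index is a notebook-level variable,
            # so check globals(); locals() only sees this function's variables)
            if 'index' in globals():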
                print("FAISS Results:")
                print("-" * 30)
                faiss_results = search_vector_store(query, top_k=3)
                
                for i, result in enumerate(faiss_results):
                    chunk = result['chunk']
                    score = result['score']
                    print(f"{i+1}. Score: {score:.3f}")
                    print(f"   Source: {chunk['source']}")
                    print(f"   Title: {chunk['source_title']}")
                    print(f"   Text: {chunk['text'][:150]}...")
                    print()
            
            # Search ChromaDB if available (collection is a notebook-level variable)
            if 'collection' in globals():
                print("ChromaDB Results:")
                print("-" * 30)
                chroma_results = search_chromadb(query, top_k=3)
                
                for i, result in enumerate(chroma_results):
                    print(f"{i+1}. Similarity: {result['similarity']:.3f}")
                    print(f"   Source: {result['metadata']['source']}")
                    print(f"   Title: {result['metadata']['title']}")
                    print(f"   Text: {result['text'][:150]}...")
                    print()
            
            print("=" * 60)
            
        except KeyboardInterrupt:
            print("\nGoodbye!")
            break
        except Exception as e:
            print(f"Error: {e}")

# Let's also create a comparison function
def compare_vector_stores(query_text, top_k=5):
    """
    Compare results from both vector stores.
    """
    # index and collection live at notebook level, so check globals(), not locals()
    if 'index' not in globals() or 'collection' not in globals():
        print("Both vector stores not available for comparison.")
        return
    
    print(f"Comparing vector stores for query: '{query_text}'")
    print("=" * 70)
    
    # FAISS results
    print("FAISS Results:")
    print("-" * 35)
    faiss_results = search_vector_store(query_text, top_k)
    
    for i, result in enumerate(faiss_results):
        chunk = result['chunk']
        score = result['score']
        print(f"{i+1}. Score: {score:.3f} | {chunk['source']} | {chunk['source_title'][:50]}...")
    
    print()
    
    # ChromaDB results
    print("ChromaDB Results:")
    print("-" * 35)
    chroma_results = search_chromadb(query_text, top_k)
    
    for i, result in enumerate(chroma_results):
        print(f"{i+1}. Similarity: {result['similarity']:.3f} | {result['metadata']['source']} | {result['metadata']['title'][:50]}...")
    
    print()

print("Interactive search tools ready!")
print("Try these functions:")
print("1. interactive_search() - Interactive query testing")
print("2. compare_vector_stores('your query here') - Compare both vector stores")
print()
print("Sample comparison:")
if 'index' in locals() and 'collection' in locals():
    compare_vector_stores("What is machine learning?")
else:
    print("Vector stores not ready yet. Run the previous cells first.")
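
Beyond eyeballing the two printed lists, a single number can summarize how much the stores agree. One simple option is the Jaccard overlap of the returned chunk identifiers, sketched below. `overlap_at_k` is a hypothetical helper and the ID lists are invented. In this notebook you would need to collect matching identifiers from each store first, for example the FAISS row index and the number in the ChromaDB `chunk_{i}` ID.

```python
def overlap_at_k(ids_a, ids_b):
    """Jaccard overlap between two result lists, ignoring rank order."""
    set_a, set_b = set(ids_a), set(ids_b)
    if not set_a and not set_b:
        return 1.0  # two empty result lists agree trivially
    return len(set_a & set_b) / len(set_a | set_b)

faiss_ids = [12, 7, 3]    # hypothetical chunk indices returned by FAISS
chroma_ids = [7, 12, 9]   # hypothetical chunk indices returned by ChromaDB
print(overlap_at_k(faiss_ids, chroma_ids))  # 0.5
```

A value of 1.0 means both stores returned the same set of chunks, and 0.0 means they share none. The measure ignores ranking, so it is a coarse check rather than a full evaluation.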


## Summary and Next Steps

Excellent! You've successfully built a complete embedding and vector store system. Here's what we've accomplished:

### What We've Built:
1. **Embedding Generation** - Converted text chunks to numerical vectors
2. **FAISS Vector Store** - Fast similarity search using Facebook's library
3. **ChromaDB Vector Store** - Feature-rich vector database with metadata
4. **Interactive Search** - Tools to query your knowledge base
5. **Comparison Tools** - Compare different vector store approaches

### Key Learnings:
- **Embedding Models** - Different models have different trade-offs in speed vs quality
- **Vector Stores** - FAISS is a lean, fast index library; ChromaDB adds document storage, metadata, and persistence
- **Similarity Search** - Cosine similarity works well for semantic search
- **Batch Processing** - Efficient embedding generation for large datasets
- **Metadata** - Important for filtering and understanding search results

### Files Created:
- `chunks_with_embeddings.json` - Chunks with embedding vectors
- `faiss_index.bin` - FAISS vector store index
- `chroma_db/` - ChromaDB database directory

### Next Steps:
Now that we have our vector stores, the next steps are:
1. **LLM Integration** - Connect language models to generate answers
2. **Prompt Engineering** - Design effective prompts for the LLM
3. **Complete RAG Pipeline** - Combine retrieval + generation
4. **Evaluation** - Measure how well the system works
5. **Optimization** - Improve performance and accuracy

The foundation is now complete - you have a working retrieval system that can find relevant information from your knowledge base!

**Ready for the next notebook?** The next step is LLM integration, where we'll learn how to generate answers based on retrieved context.
