# Embeddings and Vector Store

In this notebook, we'll learn how to convert our text chunks into embeddings (numerical vectors) and build a vector store for efficient similarity search. Similarity search over these vectors is how a RAG system retrieves the chunks most relevant to a query.
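
To make the idea of similarity search concrete before we load any models, here is a minimal sketch using only NumPy. The three-dimensional vectors are made up for illustration (real embedding models produce hundreds of dimensions), and `cosine_scores` is a hypothetical helper name, not part of any library used later.

```python
import numpy as np

# Toy 3-dimensional "embeddings"; these values are invented for illustration
# and do not come from a real embedding model.
chunk_vectors = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
])
query_vector = np.array([0.9, 0.1, 0.0])


def cosine_scores(query: np.ndarray, matrix: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and each row of matrix."""
    query_unit = query / np.linalg.norm(query)
    matrix_unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix_unit @ query_unit


scores = cosine_scores(query_vector, chunk_vectors)
ranking = np.argsort(scores)[::-1]  # chunk indices, most similar first
print("Ranking:", ranking.tolist())
print("Scores:", scores.round(3).tolist())
```

A vector store does essentially this ranking, but with indexing structures that avoid comparing the query against every stored vector.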

## Learning Objectives
By the end of this notebook, you will:
1. Understand different embedding models and their trade-offs
2. Generate embeddings for your text chunks
3. Build and query a vector database
4. Compare different embedding approaches
5. Learn about vector store optimization


## Setup and Imports

Let's import the libraries we need and load our processed data.


In [None]:
# Standard library imports
import json
import time
from pathlib import Path
from typing import List, Dict, Any, Optional

# Third-party data and plotting imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Embedding and vector store imports
from sentence_transformers import SentenceTransformer
import chromadb
from chromadb.config import Settings
import faiss
from sklearn.metrics.pairwise import cosine_similarity

# Set up plotting
plt.style.use('default')
sns.set_palette("husl")

# Add the project root to the import path
# (assumes the notebook runs from a directory directly under the project root)
import sys
sys.path.append(str(Path.cwd().parent))

# Import our configuration
from src.config import DATA_CONFIG, DATA_DIR

print("Libraries imported successfully!")
print(f"Data directory: {DATA_DIR}")

# Check if we have processed data
processed_dir = DATA_DIR / "processed"
chunks_file = processed_dir / "all_chunks.json"

if chunks_file.exists():
    print(f"Found processed chunks: {chunks_file}")
    with open(chunks_file, 'r', encoding='utf-8') as f:
        all_chunks = json.load(f)
    print(f"Loaded {len(all_chunks)} chunks")
else:
    print("No processed chunks found. Please run the data collection notebook first.")
    all_chunks = []
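
Before generating embeddings, it can help to sanity-check what was loaded. The sketch below is one way to do that under a stated assumption: each chunk is either a plain string or a dict with a `"text"` key. The actual schema of `all_chunks.json` is not shown in this notebook, so `chunk_text` and `summarize_chunks` are hypothetical helpers that may need adjusting to match your data.

```python
from typing import Any, Dict, List


def chunk_text(chunk: Any) -> str:
    """Return the text of a chunk (hypothetical helper).

    Assumes a chunk is either a plain string or a dict with a "text" key;
    adjust this to match the real schema of all_chunks.json.
    """
    if isinstance(chunk, str):
        return chunk
    if isinstance(chunk, dict):
        return str(chunk.get("text", ""))
    return ""


def summarize_chunks(chunks: List[Any]) -> Dict[str, float]:
    """Count chunks, empty chunks, and mean length (characters) of non-empty ones."""
    lengths = [len(chunk_text(c).strip()) for c in chunks]
    non_empty = [n for n in lengths if n > 0]
    return {
        "total": len(lengths),
        "empty": len(lengths) - len(non_empty),
        "mean_chars": sum(non_empty) / len(non_empty) if non_empty else 0.0,
    }


# Small made-up sample so the sketch runs on its own
sample = [
    {"text": "Embeddings map text to vectors."},
    "plain string chunk",
    {"text": ""},
]
summary = summarize_chunks(sample)
print(summary)
```

In the notebook itself, you could call `summarize_chunks(all_chunks)` after loading; empty chunks are worth filtering out, since embedding them wastes compute and adds noise to the index.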
