# Embedding-Based Retrieval with ChromaDB and OpenAI

This notebook creates embeddings from text data and stores them in a local ChromaDB vector store.

**Features:** Auto-refresh on each run | Local storage | Batch processing

## 1. Setup and Configuration

In [1]:
import chromadb
import openai
import os
from dotenv import load_dotenv

# Configuration
CHUNK_SIZE = 1000
BATCH_SIZE = 100
COLLECTION_NAME = "space_exploration"
CHROMA_PATH = "./chroma_db"
SOURCE_FILE = "llm.txt"
N_RESULTS = 10  # Number of results to return from queries

# Load API key
load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")

if not openai.api_key:
    raise ValueError("OPENAI_API_KEY not found in .env file")

print("✓ Configuration loaded")
print("  API Key: loaded (value hidden)")
print(f"  Collection: {COLLECTION_NAME}")
print(f"  Storage: {CHROMA_PATH}")
print(f"  Query results: {N_RESULTS}")

✓ Configuration loaded
  API Key: loaded (value hidden)
  Collection: space_exploration
  Storage: ./chroma_db
  Query results: 10


## 2. Load and Chunk Data

In [2]:
# Load text file
if not os.path.exists(SOURCE_FILE):
    raise FileNotFoundError(f"{SOURCE_FILE} not found. Run 1_Data_collection_preparation.ipynb first.")

with open(SOURCE_FILE, 'r', encoding='utf-8') as f:
    text = f.read()

# Create chunks
chunks = [text[i:i+CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE)]

print(f"✓ Data loaded and chunked")
print(f"  File size: {len(text):,} characters")
print(f"  Total chunks: {len(chunks)}")
print(f"  Chunk size: {CHUNK_SIZE} characters")

✓ Data loaded and chunked
  File size: 1,144,194 characters
  Total chunks: 1145
  Chunk size: 1000 characters
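
The slicing above cuts the text every `CHUNK_SIZE` characters, so a sentence (or even a word) can be split across two chunks. One common mitigation is to let consecutive chunks overlap. The following is a minimal sketch of that idea, not what this notebook runs: `chunk_text` and its `overlap` parameter are hypothetical names introduced here, and with `overlap=0` the function reproduces the list comprehension used above.

```python
def chunk_text(text, chunk_size=1000, overlap=0):
    """Split text into fixed-size character windows, optionally overlapping."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "abcdefghij" * 3  # 30 characters of toy data
plain = chunk_text(sample, chunk_size=10)                 # no overlap
overlapped = chunk_text(sample, chunk_size=10, overlap=4)  # windows share 4 chars

print(len(plain), len(overlapped))
print(overlapped[0], overlapped[1])
```

Overlap increases the number of chunks, and therefore the number of embedding calls and the storage size, so it is a trade-off rather than a free improvement.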


## 3. Setup ChromaDB Collection

In [3]:
# Initialize ChromaDB client
client = chromadb.PersistentClient(path=CHROMA_PATH)

# Delete existing collection and create fresh one
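try:
    client.delete_collection(name=COLLECTION_NAME)
    print("✓ Deleted existing collection")
except Exception:
    # delete_collection raises when the collection does not exist;
    # the exact exception type depends on the chromadb version
    print("  No existing collection to delete")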

# Create fresh collection
collection = client.create_collection(
    name=COLLECTION_NAME,
    metadata={"description": "Space exploration articles from Wikipedia"}
)

print(f"✓ Collection '{COLLECTION_NAME}' created")

✓ Deleted existing collection
✓ Collection 'space_exploration' created


## 4. Generate Embeddings and Store

In [4]:
def get_embeddings(texts, model="text-embedding-3-small"):
    """Get embeddings from OpenAI"""
    if isinstance(texts, str):
        texts = [texts]
    texts = [t.replace("\n", " ") for t in texts]
    response = openai.embeddings.create(input=texts, model=model)
    return [data.embedding for data in response.data]

# Process chunks in batches
print(f"Processing {len(chunks)} chunks in batches of {BATCH_SIZE}...\n")

for batch_start in range(0, len(chunks), BATCH_SIZE):
    batch_end = min(batch_start + BATCH_SIZE, len(chunks))
    batch = chunks[batch_start:batch_end]
    
    # Generate IDs, embeddings, and metadata
    ids = [f"chunk_{i}" for i in range(batch_start, batch_end)]
    embeddings = get_embeddings(batch)
    metadatas = [{"source": SOURCE_FILE, "chunk_id": i} for i in range(batch_start, batch_end)]
    
    # Add to ChromaDB
    collection.add(ids=ids, documents=batch, embeddings=embeddings, metadatas=metadatas)
    
    print(f"  ✓ Processed {batch_end}/{len(chunks)} chunks")

print(f"\n✓ All chunks stored in vector database")
print(f"  Total documents: {collection.count()}")

Processing 1145 chunks in batches of 100...

  ✓ Processed 100/1145 chunks
  ✓ Processed 200/1145 chunks
  ✓ Processed 300/1145 chunks
  ✓ Processed 400/1145 chunks
  ✓ Processed 500/1145 chunks
  ✓ Processed 600/1145 chunks
  ✓ Processed 700/1145 chunks
  ✓ Processed 800/1145 chunks
  ✓ Processed 900/1145 chunks
  ✓ Processed 1000/1145 chunks
  ✓ Processed 1100/1145 chunks
  ✓ Processed 1145/1145 chunks

✓ All chunks stored in vector database
  Total documents: 1145


## 5. Test Query

In [5]:
# Test the vector store with a sample query
test_query = "What is the International Space Station?"
query_embedding = get_embeddings([test_query])[0]

results = collection.query(query_embeddings=[query_embedding], n_results=N_RESULTS)

print(f"Query: {test_query}")
print(f"Requested results: {N_RESULTS}")
print(f"Actual results returned: {len(results['documents'][0])}\n")
print("="*80)

for i in range(len(results['documents'][0])):
    doc = results['documents'][0][i]
    dist = results['distances'][0][i]
    print(f"\nResult {i+1} (distance: {dist:.4f}):")
    print(f"{doc[:200]}...")

print("="*80)

Query: What is the International Space Station?
Requested results: 10
Actual results returned: 10


Result 1 (distance: 0.5572):
 Programare:

TheInternational Space Station(ISS) is a largespace stationthat wasassembledand is maintained inlow Earth orbitby a collaboration of five space agencies and their contractors:NASA(United...

Result 2 (distance: 0.7319):
 architectures and associated timelines relevant to lunar and Mars exploration and science. TheInternational Space Station(ISS) combines NASA'sSpace StationFreedomproject with the RussianMir-2station,...

Result 3 (distance: 0.7457):
n Space Agency's headquarters inSaint-Hubert, Quebec. The ISS is currently maintained in a nearly circular orbit with a minimum mean altitude of 370 km (230 mi) and a maximum of 460 km (290 mi),in the...

Result 4 (distance: 0.7488):
arth at an average altitude of 400 kilometres (250 miles)and circles the Earth in roughly 93 minutes, completing 15.5 orbits per day. TheISS programmecombines two previo

## 6. Understanding Embeddings and Distance Metrics

### What are Embeddings?

An **embedding** is a vector representation of text in high-dimensional space. Each text chunk is converted to a vector:

$$\mathbf{v} = [v_1, v_2, v_3, \ldots, v_{1536}] \in \mathbb{R}^{1536}$$

Where:
- $\mathbf{v}$ is the embedding vector
- Each $v_i$ is a real number (typically between -0.1 and 0.1)
- The dimension is 1536 for the `text-embedding-3-small` model

**Example:** When the output shows `[-0.0077, -0.0325, 0.0753, ...]`, these are the first few components of the 1536-dimensional vector.

### Understanding Embedding Value Statistics

When we analyze embedding statistics, we're looking at the **distribution of the 1536 values** within each vector:

**Mean of embedding values:**
$$\mu = \frac{1}{1536} \sum_{i=1}^{1536} v_i$$

- **Interpretation:** Average value across all dimensions
- **Typical value:** Close to 0. Unit-length normalization alone does not guarantee this, but it does bound $|\mu| \le 1/\sqrt{1536} \approx 0.0255$
- **Insight:** If mean $\approx 0$, the embedding is balanced (not biased toward positive/negative)

**Standard deviation:**
$$\sigma = \sqrt{\frac{1}{1536} \sum_{i=1}^{1536} (v_i - \mu)^2}$$

- **Interpretation:** How spread out the values are across dimensions
- **Typical value:** About 0.0255 for unit-length embeddings with mean near 0, because $\sigma^2 = \frac{1}{1536} - \mu^2$ when $\sum_i v_i^2 = 1$
- **Insight:** Higher std = more varied features; Lower std = more concentrated representation

**Min/Max values:**
- **Interpretation:** The range of values in the vector
- **Typical range:** [-0.1, 0.1] after normalization
- **Insight:** Shows if any dimension has unusually strong activation (outlier features)
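
To make these statistics concrete, here is a small sketch that computes them for a synthetic unit-length vector. It assumes the embeddings are normalized to length 1, as described above; the vector is random data standing in for a real embedding, so the printed values only illustrate the order of magnitude.

```python
import math
import random
import statistics

DIM = 1536  # dimension of text-embedding-3-small, as stated above

random.seed(0)
raw = [random.gauss(0.0, 1.0) for _ in range(DIM)]
norm = math.sqrt(sum(x * x for x in raw))
v = [x / norm for x in raw]  # synthetic stand-in for a unit-length embedding

mu = statistics.fmean(v)
sigma = statistics.pstdev(v)  # population std, matching the 1/1536 formula

# For any unit-length vector: sigma^2 + mu^2 == 1 / DIM
identity_gap = abs(sigma**2 + mu**2 - 1 / DIM)

print(f"mean={mu:.6f}  std={sigma:.6f}  min={min(v):.4f}  max={max(v):.4f}")
print(f"upper bound on std: {1 / math.sqrt(DIM):.6f}")
```

Because the vector has length 1, the mean and standard deviation are tied together by $\sigma^2 + \mu^2 = 1/1536$, which is why the standard deviation of a normalized embedding cannot exceed roughly 0.0255.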

### How Distance Works

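**Distance** measures how far apart two embeddings are; a smaller distance means closer meaning. The collection above was created without specifying a metric, so it uses ChromaDB's default `l2` space, which reports the **squared Euclidean distance**:

$$d(\mathbf{v}_1, \mathbf{v}_2) = \sum_{i=1}^{1536} (v_{1,i} - v_{2,i})^2$$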

**Interpretation** (rough rules of thumb, not fixed thresholds):
- $d = 0$: Identical embeddings
- $d < 0.5$: Very similar content
- $0.5 \leq d < 1.0$: Moderately similar
- $d \geq 1.0$: Less similar or unrelated
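
For unit-length vectors, Euclidean distance and cosine similarity carry the same ranking information, since $\lVert \mathbf{a} - \mathbf{b} \rVert^2 = 2 - 2\cos\theta$. The sketch below checks this identity on two made-up three-dimensional unit vectors (not real embeddings); the helper names are illustrative only. Depending on its configuration, a vector store may report the squared distance rather than the distance itself, so it is worth checking which one a given number represents before applying thresholds.

```python
import math


def euclidean(a, b):
    """Plain Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Two toy unit-length vectors
a = [0.6, 0.8, 0.0]
b = [0.0, 0.6, 0.8]

d = euclidean(a, b)
cos = cosine_similarity(a, b)

print(f"euclidean={d:.4f}  squared={d**2:.4f}  2-2cos={2 - 2 * cos:.4f}")
```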

### Practical Insight

When you query the vector database:
1. Your query text → embedding vector $\mathbf{q} \in \mathbb{R}^{1536}$
2. Calculate $d(\mathbf{q}, \mathbf{v}_i)$ for each stored chunk
3. Results ranked by distance (lowest = most relevant)
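
The three steps above can be sketched as an exhaustive scan over toy two-dimensional vectors. This is only an illustration of the ranking logic: `rank_by_distance` is a hypothetical helper, and ChromaDB itself uses an approximate nearest-neighbor index (HNSW) rather than comparing the query against every stored vector.

```python
import math


def rank_by_distance(query, vectors, n_results=3):
    """Return (index, distance) pairs sorted from nearest to farthest."""
    scored = [(i, math.dist(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])  # lowest distance first
    return scored[:n_results]


# Toy "stored chunks" and a toy query vector
stored = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.6], [-1.0, 0.0]]
query = [0.9, 0.1]

top = rank_by_distance(query, stored, n_results=2)
for index, distance in top:
    print(f"vector {index}: distance {distance:.4f}")
```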

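**Why statistics matter:** They are mainly a sanity check. Unit-length embeddings from the same model should have nearly identical mean and standard deviation, so a vector whose statistics differ sharply from the rest points to a problem such as a wrong model or corrupted data. How similar two pieces of content are is captured by the distance between their vectors, not by these summary statistics.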

In [7]:
import numpy as np

# Calculate storage size
def get_dir_size(path):
    total = 0
    for dirpath, _, filenames in os.walk(path):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            if os.path.exists(fp):
                total += os.path.getsize(fp)
    return total

# Display summary
print("="*80)
print("VECTOR STORE SUMMARY")
print("="*80)
print(f"Collection: {collection.name}")
print(f"Total documents: {collection.count()}")
print(f"Storage path: {CHROMA_PATH}")

if os.path.exists(CHROMA_PATH):
    size = get_dir_size(CHROMA_PATH)
    print(f"Storage size: {size / (1024*1024):.2f} MB")

print(f"Metadata: {collection.metadata}")
print("="*80)

# Show sample documents retrieved by semantic search
print(f"\nSAMPLE RETRIEVED DOCUMENTS (Top {N_RESULTS} from semantic search):")
print("="*80)

# Perform a sample query to get actual retrieved documents
sample_query = "space exploration and missions"
sample_query_embedding = get_embeddings([sample_query])[0]
sample_results = collection.query(
    query_embeddings=[sample_query_embedding],
    n_results=N_RESULTS,
    include=["documents", "metadatas", "embeddings", "distances"]
)

print(f"Query: '{sample_query}'")
print(f"Retrieving top {N_RESULTS} most similar chunks by vector distance\n")

for i in range(len(sample_results['ids'][0])):
    chunk_id = sample_results['ids'][0][i]
    doc = sample_results['documents'][0][i]
    distance = sample_results['distances'][0][i]
    embedding = sample_results['embeddings'][0][i]
    
    print(f"--- Result {i+1}: {chunk_id} (distance: {distance:.4f}) ---")
    print(f"Text: {doc[:200]}...")
    print(f"Embedding vector (first 5 of {len(embedding)} dims): {[f'{x:.4f}' for x in embedding[:5]]}")
    print()

print("="*80)

# Show embedding statistics
print("\nEMBEDDING STATISTICS:")
print("="*80)

all_embeddings = sample_results['embeddings'][0]
embeddings_array = np.array(all_embeddings)

print(f"Model: text-embedding-3-small")
print(f"Dimensions: {embeddings_array.shape[1]}")
print(f"Documents analyzed: {embeddings_array.shape[0]}")
print(f"Value range: [{embeddings_array.min():.4f}, {embeddings_array.max():.4f}]")
print(f"Mean: {embeddings_array.mean():.6f}")
print(f"Std deviation: {embeddings_array.std():.6f}")
print("="*80)

# Show metadata distribution across entire database
print("\nDATABASE METADATA:")
print("="*80)

all_data = collection.get(include=["metadatas"])
chunk_ids = [m['chunk_id'] for m in all_data['metadatas']]

print(f"Total chunks: {len(chunk_ids)}")
print(f"Chunk range: 0 to {max(chunk_ids)}")
print(f"Source: {all_data['metadatas'][0]['source']}")
print("="*80)

VECTOR STORE SUMMARY
Collection: space_exploration
Total documents: 1145
Storage path: ./chroma_db
Storage size: 58.63 MB
Metadata: {'description': 'Space exploration articles from Wikipedia'}

SAMPLE RETRIEVED DOCUMENTS (Top 10 from semantic search):
Query: 'space exploration and missions'
Retrieving top 10 most similar chunks by vector distance

--- Result 1: chunk_0 (distance: 0.6388) ---
Text: Space explorationis the physical investigation ofouter spacebyuncrewed robotic space probesand throughhuman spaceflight. While the observation of objects in space, known asastronomy, predates reliable...
Embedding vector (first 5 of 1536 dims): ['-0.0077', '-0.0325', '0.0753', '0.0074', '0.0128']

--- Result 2: chunk_733 (distance: 0.7296) ---
Text: t based on the HALO-module for the Gateway station. NASA has conducted many uncrewed and robotic spaceflight programs throughout its history. More than 1,000 uncrewed missions have been designed to ex...
Embedding vector (first 5 of 1536 dims): ['