# Embedding Generation - 2024 SEC Filings

**Purpose:** Generate embeddings for contextually-chunked 2024 SEC filings

**Input:** `output/processed_2024_500tok_contextual.json` (from Step 2)
**Output:** `output/embeddings_2024_500tok_contextual.npy` (for RAPTOR clustering)

---

## Embedding Model: all-MiniLM-L6-v2

**Why this model?**

### 1. Research-Backed for RAG Systems

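**RAPTOR Paper (ICLR 2024):**
> RAPTOR embeds its text chunks with SBERT, a BERT-based sentence encoder
- Source: https://arxiv.org/abs/2401.18059
- Same model family (Sentence-BERT) as the one used here, though the paper uses a different checkpoint (`multi-qa-mpnet-base-cos-v1`)
- Applied there to hierarchical, tree-based retrieval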

**Original Sentence-BERT Paper (2019):**
> Reimers & Gurevych report that finding the most similar pair among 10,000 sentences drops from about 65 hours with BERT to about 5 seconds with SBERT
- Source: https://arxiv.org/abs/1908.10084
- Optimized for semantic similarity tasks
- Foundation for modern embedding models

**MTEB Benchmark (2022):**
> MTEB evaluates embedding models on 58 datasets spanning 8 task types
- Source: https://arxiv.org/abs/2210.07316
- Strong performance-to-size ratio among small models
- Score: 56.26/100 (average) on the MTEB Leaderboard; relative rank shifts as new models are added, so check the current leaderboard
- Leaderboard: https://huggingface.co/spaces/mteb/leaderboard

### 2. Technical Specifications

- **Model:** sentence-transformers/all-MiniLM-L6-v2
- **Dimensions:** 384 (compact but effective)
- **Parameters:** 22.7M (lightweight)
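- **Speed:** ~1000 sentences/second on CPU (rough figure for short inputs; throughput depends on hardware and drops for long chunks)
- **Context Window:** 256 word pieces by default (`model.max_seq_length`); longer input is truncated, so our ~700-token extended chunks are only partly embedded unless the chunks are shortened or the limit is raised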
- **Training:** Trained on 1B+ sentence pairs
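
Whether a chunk fits depends on the limit the loaded model actually reports. As a rough illustration, the sketch below estimates how much of a chunk survives truncation at a given limit. The helper name `tokens_kept` and the two limits tried (256 and 512) are assumptions for illustration only; the authoritative value is whatever `model.max_seq_length` prints in Section 3, and real counts depend on the model's own tokenizer, including the slots taken by special tokens.

```python
from typing import Tuple


def tokens_kept(chunk_tokens: int, max_seq_length: int) -> Tuple[int, float]:
    """Return (tokens kept, fraction kept) when input is cut at max_seq_length."""
    kept = min(chunk_tokens, max_seq_length)
    return kept, kept / chunk_tokens


# Hypothetical limits; replace with the value reported by the loaded model.
for limit in (256, 512):
    kept, frac = tokens_kept(700, limit)
    print(f"limit={limit}: {kept} of 700 tokens embedded ({frac:.0%})")
```

Under these assumptions a 700-token chunk is cut to about 37% of its length at a limit of 256 and about 73% at 512, so the tail of the extended text does not contribute to the embedding in either case.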

### 3. Why Not Larger Models?

**Considered alternatives:**

| Model | Dims | Speed | Why Not? |
|-------|------|-------|----------|
| all-mpnet-base-v2 | 768 | Slower | 2x dimensions, marginal quality gain |
| bge-large-en-v1.5 | 1024 | Much slower | ~2.7x dimensions, overkill for our use case |
| text-embedding-3-large | 3072 | API cost | OpenAI API = expensive + privacy concerns |

**Decision:** all-MiniLM-L6-v2 offers the best **speed × quality × cost** tradeoff for this corpus
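
To make the dimension column concrete, the sketch below estimates raw float32 storage for the embedding matrix under each model. It assumes the ~2.9M chunk count used elsewhere in this notebook and 4 bytes per value; the helper name `embedding_storage_gb` is hypothetical, and the figures ignore file headers and any index overhead.

```python
NUM_CHUNKS = 2_900_000   # ~2.9M chunks (estimate used in this notebook)
BYTES_PER_FLOAT32 = 4

# Embedding dimensions from the comparison table above.
MODEL_DIMS = {
    "all-MiniLM-L6-v2": 384,
    "all-mpnet-base-v2": 768,
    "bge-large-en-v1.5": 1024,
    "text-embedding-3-large": 3072,
}


def embedding_storage_gb(num_chunks: int, dims: int) -> float:
    """Raw size of a float32 (num_chunks x dims) matrix in decimal GB."""
    return num_chunks * dims * BYTES_PER_FLOAT32 / 1e9


for name, dims in MODEL_DIMS.items():
    size_gb = embedding_storage_gb(NUM_CHUNKS, dims)
    print(f"{name:<24} {dims:>5} dims  {size_gb:6.1f} GB")
```

Storage grows linearly with dimensions, so the 3072-dimension option needs roughly eight times the disk and RAM of the 384-dimension one for the same corpus.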

### 4. Production Usage

Widely used in the ecosystem:
- ChromaDB (default embedding function)
- LangChain and LlamaIndex (a common choice in their Hugging Face embedding examples, though not the built-in default)
- Pinecone, Weaviate (documentation examples)
- 200M+ downloads on Hugging Face

---

## Performance Expectations

**Data to embed:**
- ~2.9M chunks (26K filings × 111 chunks/filing)
- Extended text: ~700 tokens per chunk (500 core + 100 context + header)

**Estimated time:**
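- CPU: ~50 minutes at 1000 chunks/second (2.9M ÷ 1000 ≈ 2,900 s); budget 1-2 hours, since long chunks will likely run at 400-800 chunks/second or slower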
- GPU (if available): 15-30 minutes

**Memory usage:**
- Model: 80 MB
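- Working memory: ~2-4 GB for batching, plus the loaded input texts; `model.encode` returns the full array in RAM, so peak usage also includes the output embeddings below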
- Output embeddings: ~4.4 GB (2.9M × 384 × 4 bytes)
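
The estimates above follow from simple arithmetic, sketched here so the numbers can be re-derived if the inputs change. The filing and chunk counts come from this section; the throughput values are hypothetical, and the real rate is only known once the timing cell in Section 5 has run.

```python
FILINGS = 26_000
CHUNKS_PER_FILING = 111
EMBEDDING_DIMS = 384
BYTES_PER_FLOAT32 = 4

total_chunks = FILINGS * CHUNKS_PER_FILING


def encode_minutes(num_chunks: int, chunks_per_second: float) -> float:
    """Wall-clock minutes to embed num_chunks at a steady throughput."""
    return num_chunks / chunks_per_second / 60


print(f"Total chunks: {total_chunks:,}")

# Hypothetical throughputs (chunks/second); measure the real value on your hardware.
for rate in (1000, 500, 250):
    print(f"  {rate:>4} chunks/s -> {encode_minutes(total_chunks, rate):6.1f} minutes")

embedding_bytes = total_chunks * EMBEDDING_DIMS * BYTES_PER_FLOAT32
print(f"Embeddings: {embedding_bytes / 1e9:.2f} GB ({embedding_bytes / 2**30:.2f} GiB)")
```

Halving the throughput doubles the run time, which is why the time estimate is best treated as a range rather than a single figure.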

---

## 1. Setup & Dependencies

In [None]:
import sys
from pathlib import Path
import json
import numpy as np
import time
from datetime import datetime

# Project root
project_root = Path.cwd().parent.parent
sys.path.insert(0, str(project_root))

# Paths
INPUT_FILE = project_root / 'notebooks' / 'prototyping' / 'output' / 'processed_2024_500tok_contextual.json'
OUTPUT_DIR = project_root / 'notebooks' / 'prototyping' / 'output'
OUTPUT_FILE = OUTPUT_DIR / 'embeddings_2024_500tok_contextual.npy'

print(f"[INFO] Input file: {INPUT_FILE}")
print(f"[INFO] Input exists: {INPUT_FILE.exists()}")
print(f"[INFO] Output directory: {OUTPUT_DIR}")

if INPUT_FILE.exists():
    file_size_mb = INPUT_FILE.stat().st_size / (1024*1024)
    print(f"[OK] Input file size: {file_size_mb:,.2f} MB")

## 2. Install Sentence Transformers

In [None]:
# Install sentence-transformers if needed
try:
    from sentence_transformers import SentenceTransformer
    print("[OK] sentence-transformers already installed")
except ImportError:
    print("[INFO] Installing sentence-transformers...")
    # %pip installs into the environment of the running kernel
    %pip install sentence-transformers
    from sentence_transformers import SentenceTransformer
    print("[OK] sentence-transformers installed")

## 3. Load Sentence Transformer Model

In [None]:
print("[INFO] Loading sentence-transformers/all-MiniLM-L6-v2...")
print("[INFO] This will download ~80 MB on first run\n")

start_time = time.time()

# Load model (will download on first run)
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

load_time = time.time() - start_time

print(f"\n[OK] Model loaded in {load_time:.2f} seconds")
print(f"[INFO] Model details:")
print(f"  - Model name: all-MiniLM-L6-v2")
print(f"  - Embedding dimensions: {model.get_sentence_embedding_dimension()}")
print(f"  - Max sequence length: {model.max_seq_length} tokens")
print(f"  - Device: {model.device}")

## 4. Load Processed Chunks

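**Important:** We embed the `text_for_embedding` field (the extended ~700-token version with context), not the 500-token core text. The model truncates any input longer than `model.max_seq_length`, so compare the value printed in Section 3 against these chunk lengths.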
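
The loading cell expects a particular JSON layout. The sketch below shows that layout as inferred from the fields this notebook reads; every value in `sample_filings` is made up, the helper name `flatten_chunks` is hypothetical, and the real file from Step 2 may carry additional fields.

```python
# Hypothetical sample mirroring only the fields accessed in this notebook.
sample_filings = [
    {
        "file_name": "example_filing.txt",
        "chunks": [
            {
                "chunk_id": "example_filing_0",
                "text_for_embedding": "[header] first extended chunk",
                "metadata": {
                    "company": "Example Corp", "form_type": "10-K",
                    "filing_date": "2024-01-01", "cik": "0000000000",
                    "chunk_index": 0, "core_tokens": 500, "extended_tokens": 700,
                },
            },
            {
                "chunk_id": "example_filing_1",
                "text_for_embedding": "[header] second extended chunk",
                "metadata": {
                    "company": "Example Corp", "form_type": "10-K",
                    "filing_date": "2024-01-01", "cik": "0000000000",
                    "chunk_index": 1, "core_tokens": 500, "extended_tokens": 700,
                },
            },
        ],
    }
]


def flatten_chunks(filings):
    """Flatten filings into parallel lists: texts[i] belongs to metadata[i]."""
    texts, metadata = [], []
    for filing in filings:
        for chunk in filing["chunks"]:
            texts.append(chunk["text_for_embedding"])
            metadata.append({
                "file_name": filing["file_name"],
                "chunk_id": chunk["chunk_id"],
                **chunk["metadata"],
            })
    return texts, metadata


texts, metadata = flatten_chunks(sample_filings)
print(len(texts), metadata[1]["chunk_id"])
```

Keeping the two lists in the same order matters because row `i` of the saved embedding array is later matched to entry `i` of the metadata file.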

In [None]:
print(f"[INFO] Loading processed chunks from {INPUT_FILE.name}...")

with open(INPUT_FILE, 'r', encoding='utf-8') as f:
    data = json.load(f)

print(f"[OK] Loaded {len(data):,} filings")

# Extract texts for embedding
embedding_texts = []
chunk_metadata = []

for filing in data:
    for chunk in filing['chunks']:
        # Use 'text_for_embedding' - the extended version with context!
        embedding_texts.append(chunk['text_for_embedding'])
        
        # Store metadata for later (ChromaDB step)
        chunk_metadata.append({
            'file_name': filing['file_name'],
            'company': chunk['metadata']['company'],
            'form_type': chunk['metadata']['form_type'],
            'filing_date': chunk['metadata']['filing_date'],
            'cik': chunk['metadata']['cik'],
            'chunk_id': chunk['chunk_id'],
            'chunk_index': chunk['metadata']['chunk_index'],
            'core_tokens': chunk['metadata']['core_tokens'],
            'extended_tokens': chunk['metadata']['extended_tokens']
        })

print(f"\n[OK] Extracted {len(embedding_texts):,} chunks for embedding")
print(f"[INFO] Using 'text_for_embedding' field (extended with context)")

# Preview first chunk
print(f"\n[Preview] First chunk text (first 400 chars):")
print(embedding_texts[0][:400])
print(f"\n[Metadata]:")
print(json.dumps(chunk_metadata[0], indent=2))

## 5. Generate Embeddings

**Process:**
1. Batch encode all chunks (batch_size=32 for efficiency)
2. Normalize embeddings (important for cosine similarity in clustering)
3. Convert to numpy array
4. Save to disk

**Estimated time:** 1-2 hours on CPU for ~2.9M chunks (less on GPU)
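
Normalization matters because, for unit-length vectors, the dot product equals cosine similarity, so downstream steps can use a plain dot product. The sketch below demonstrates this on two toy 2-dimensional vectors using only the standard library; the helper names are illustrative and are not part of sentence-transformers.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity computed from the raw (unnormalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors standing in for 384-dimensional embeddings.
a, b = [3.0, 4.0], [4.0, 3.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

print(f"cosine(a, b)          = {cosine_similarity(a, b):.4f}")
print(f"dot(a_unit, b_unit)   = {dot(a_unit, b_unit):.4f}")
print(f"norm of a_unit        = {math.sqrt(dot(a_unit, a_unit)):.4f}")
```

Both similarity values agree, which is the property the `normalize_embeddings=True` setting relies on.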

In [None]:
print(f"{'='*80}")
print(f"EMBEDDING GENERATION")
print(f"{'='*80}")
print(f"\nStarting embedding generation...")
print(f"  Total chunks: {len(embedding_texts):,}")
print(f"  Batch size: 32")
print(f"  Device: {model.device}")
print(f"  Started at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print(f"\nThis will take approximately 1-2 hours...\n")

start_time = time.time()

# Generate embeddings
embeddings = model.encode(
    embedding_texts,
    batch_size=32,
    show_progress_bar=True,
    convert_to_numpy=True,
    normalize_embeddings=True,  # Important for cosine similarity!
    device=None  # Auto-detect (use GPU if available)
)

embedding_time = time.time() - start_time

print(f"\n{'='*80}")
print(f"EMBEDDING GENERATION COMPLETE")
print(f"{'='*80}")
print(f"\nGeneration time: {embedding_time:.2f} seconds ({embedding_time/60:.2f} minutes)")
print(f"Speed: {len(embedding_texts) / embedding_time:.2f} chunks/second")
print(f"\nEmbeddings shape: {embeddings.shape}")
print(f"Expected: [{len(embedding_texts):,}, 384]")
print(f"\nEmbedding statistics:")
print(f"  Min value: {embeddings.min():.6f}")
print(f"  Max value: {embeddings.max():.6f}")
print(f"  Mean: {embeddings.mean():.6f}")
print(f"  Std: {embeddings.std():.6f}")

## 6. Validate Embeddings

**Quick sanity checks:**

In [None]:
print(f"[INFO] Running validation checks...\n")

# Check 1: Correct shape
assert embeddings.shape == (len(embedding_texts), 384), "Incorrect embedding shape!"
print("[OK] Shape check passed")

# Check 2: No NaN or Inf values
assert not np.isnan(embeddings).any(), "NaN values found in embeddings!"
assert not np.isinf(embeddings).any(), "Inf values found in embeddings!"
print("[OK] No NaN/Inf values")

# Check 3: Embeddings are normalized (L2 norm ≈ 1)
norms = np.linalg.norm(embeddings, axis=1)
assert np.allclose(norms, 1.0, atol=1e-6), "Embeddings not properly normalized!"
print(f"[OK] Embeddings normalized (L2 norm = {norms.mean():.6f})")

# Check 4: Sanity-check similarity values
# Dot product of a normalized vector with itself should be ~1.0
similarity = np.dot(embeddings[0], embeddings[0])
assert np.isclose(similarity, 1.0, atol=1e-5), "Self-similarity is not ~1.0!"
print(f"[OK] Self-similarity check: {similarity:.6f} (should be ~1.0)")

# Compare first two chunks (should be high if from same document)
similarity_adjacent = np.dot(embeddings[0], embeddings[1])
print(f"[INFO] Adjacent chunk similarity: {similarity_adjacent:.6f}")

print(f"\n[SUCCESS] All validation checks passed!")

## 7. Save Embeddings

In [None]:
print(f"[INFO] Saving embeddings to {OUTPUT_FILE}...")

# Save as numpy array
np.save(OUTPUT_FILE, embeddings)

file_size_mb = OUTPUT_FILE.stat().st_size / (1024*1024)

print(f"[OK] Embeddings saved!")
print(f"  File: {OUTPUT_FILE.name}")
print(f"  Size: {file_size_mb:,.2f} MB")

# Also save metadata for convenience (ChromaDB will need this)
metadata_file = OUTPUT_DIR / 'chunk_metadata_2024.json'
with open(metadata_file, 'w', encoding='utf-8') as f:
    json.dump(chunk_metadata, f, indent=2)

metadata_size_mb = metadata_file.stat().st_size / (1024*1024)
print(f"\n[OK] Metadata saved!")
print(f"  File: {metadata_file.name}")
print(f"  Size: {metadata_size_mb:,.2f} MB")

## 8. Summary Statistics

In [None]:
print(f"{'='*80}")
print(f"EMBEDDING GENERATION SUMMARY")
print(f"{'='*80}")

print(f"\nModel:")
print(f"  Name: sentence-transformers/all-MiniLM-L6-v2")
print(f"  Dimensions: 384")
print(f"  Parameters: 22.7M")
print(f"  Device: {model.device}")

print(f"\nData:")
print(f"  Input file: {INPUT_FILE.name}")
print(f"  Total filings: {len(data):,}")
print(f"  Total chunks: {len(embedding_texts):,}")
print(f"  Avg chunks/filing: {len(embedding_texts) / len(data):.1f}")

print(f"\nPerformance:")
print(f"  Generation time: {embedding_time:.2f} seconds ({embedding_time/60:.2f} minutes)")
print(f"  Speed: {len(embedding_texts) / embedding_time:.2f} chunks/second")
print(f"  Speed: {len(embedding_texts) / embedding_time * 60:.0f} chunks/minute")

print(f"\nOutput:")
print(f"  Embeddings file: {OUTPUT_FILE.name}")
print(f"  Embeddings size: {file_size_mb:,.2f} MB")
print(f"  Embeddings shape: {embeddings.shape}")
print(f"  Metadata file: {metadata_file.name}")
print(f"  Metadata size: {metadata_size_mb:,.2f} MB")

print(f"\nStorage breakdown:")
print(f"  Per-chunk embedding: {384 * 4 / 1024:.2f} KB (384 dims × 4 bytes)")
print(f"  Total embeddings: {file_size_mb:,.2f} MB")

print(f"\nNext steps:")
print(f"  1. Option A: Simple RAG - Load into ChromaDB for basic retrieval testing")
print(f"  2. Option B: RAPTOR - Implement clustering (UMAP + GMM) and summarization")
print(f"  3. Run experimental comparison: Baseline vs Simple RAG vs RAPTOR RAG")
print(f"  4. Evaluate with RAGAS framework")

print(f"\nResearch citations:")
print(f"  - Sentence-BERT: https://arxiv.org/abs/1908.10084")
print(f"  - MTEB Benchmark: https://arxiv.org/abs/2210.07316")
print(f"  - RAPTOR Paper: https://arxiv.org/abs/2401.18059")
print(f"  - MTEB Leaderboard: https://huggingface.co/spaces/mteb/leaderboard")

print(f"\n{'='*80}")