# Production Deployment: RAPTOR RAG with SEC EDGAR Filings (2024 Complete)

## Overview
Production RAPTOR RAG system deployment on AWS EC2:
- **Data**: All SEC 10-K/10-Q filings for 2024 (26,014 files) 🔄 RE-CHUNKING WITH CONTEXTUAL SUMMARIES
- **Infrastructure**: AWS EC2 t3.xlarge (8 vCPUs, 64 GB RAM) - All processing and deployment
- **Models**: Sentence Transformers for embeddings, Ollama for LLM contextual summaries and queries
- **Goal**: Production-ready RAG system with hierarchical retrieval via Open WebUI

---

## Current Status (2025-10-24)

### ✅ Infrastructure Deployed
**AWS EC2 Instance "secAI":**
- **Instance Type**: t3.xlarge
- **vCPUs**: 8
- **RAM**: 64 GB
- **OS**: Ubuntu 24.04 LTS
- **Public IP**: 35.175.134.36
- **SSH Access**: Configured (key-based auth)
- **Data Directory**: `/app/data/`

**Deployed Services:**
- Docker: edgar-chunking image (8.4 GB)
- Open WebUI: Configured for deployment
- Security groups: SSH access configured

### 🔄 2024 Data Re-Chunking IN PROGRESS

- **Previous chunking (DELETED):** Simple 500-token chunks without contextual summaries
- **New approach:** Anthropic's contextual retrieval method with LLM-generated summaries

**EC2 File Structure:**
```
/app/data/
├── edgar/extracted/2024/
│   ├── QTR1/ (6,337 .txt files)   ✅ READY
│   ├── QTR2/ (7,247 .txt files)   ✅ READY
│   ├── QTR3/ (6,248 .txt files)   ✅ READY
│   └── QTR4/ (6,182 .txt files)   ✅ READY
│
├── processed/2024/
│   └── (empty - previous data deleted)
│
└── embeddings/
    └── test/ (next: 3 test files after re-chunking)
```

**2024 Target Processing (with LLM context generation):**

| Quarter | Files | Expected Time | Status |
|---------|-------|---------------|--------|
| Q1 | 6,337 | ~2-3 hours | ⏸️ Pending |
| Q2 | 7,247 | ~2-3 hours | ⏸️ Pending |
| Q3 | 6,248 | ~2-3 hours | ⏸️ Pending |
| Q4 | 6,182 | ~2-3 hours | ⏸️ Pending |
| **TOTAL** | **26,014** | **~8-12 hours** | ⏸️ Pending |

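**Note:** LLM context generation adds processing time, but Anthropic reports 35-49% fewer retrieval failures with contextualized chunks (35% with contextual embeddings alone, 49% when combined with contextual BM25)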
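
As a rough sanity check on the time budget, the sketch below computes the throughput that the table implies. It uses the 2.8M chunk estimate quoted elsewhere in this document and assumes one LLM call per chunk; both the helper name and that one-call assumption are illustrative.

```python
# Throughput implied by the 8-12 hour budget (one LLM call per chunk is an assumption).
TOTAL_CHUNKS = 2_800_000   # chunk estimate quoted in this document
TOTAL_FILES = 26_014       # 2024 10-K/10-Q files


def required_rate(items: int, hours: float) -> float:
    """Items per second needed to finish `items` within `hours`."""
    return items / (hours * 3600)


for hours in (8, 12):
    print(
        f"{hours:>2} h: {required_rate(TOTAL_CHUNKS, hours):.1f} chunks/s, "
        f"{required_rate(TOTAL_FILES, hours):.2f} files/s"
    )
```

If the measured rate on the three test files is well below roughly 65 chunks per second, the 8-12 hour estimate should be revised before the full run.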

---

## Next Phase: Contextual Chunking

### Implementation: Anthropic's Contextual Retrieval Method

**What we're doing:**
For each 500-token chunk, generate a 50-100 token LLM summary explaining what the chunk discusses in relation to the full document.

**Example:**
- **Chunk text:** "The company's revenue grew by 3% over the previous quarter"
- **LLM-generated context:** "This chunk is from ACME Corp's Q2 2023 10-Q filing discussing quarterly revenue performance. Previous quarter revenue was $314M."
- **Stored:** Original chunk (500 tokens) + context summary (50-100 tokens)
- **Embedded:** [context + chunk] for better retrieval

**Expected improvement:** 35-49% reduction in retrieval failures (per Anthropic research)
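
A minimal sketch of how a chunk and its context could be combined. The prompt sentence is the template from this document; the `<document>`/`<chunk>` wrapper, the function names, and the injected `generate` callable (standing in for an Ollama call) are assumptions for illustration, not the pipeline's actual code.

```python
from typing import Callable

# The final sentence is this document's prompt template; the surrounding
# <document>/<chunk> wrapper is an assumed layout.
PROMPT_TEMPLATE = (
    "<document>\n{document}\n</document>\n"
    "Here is the chunk we want to situate within the document:\n"
    "<chunk>\n{chunk}\n</chunk>\n"
    "Provide a brief, factual summary (50-100 tokens) explaining what this "
    "chunk discusses in relation to the full {form_type} filing."
)


def build_prompt(document: str, chunk: str, form_type: str) -> str:
    """Fill the template for one chunk of one filing."""
    return PROMPT_TEMPLATE.format(document=document, chunk=chunk, form_type=form_type)


def contextualize(
    document: str, chunk: str, form_type: str, generate: Callable[[str], str]
) -> dict:
    """Return the original chunk, its LLM context, and the text to embed."""
    context = generate(build_prompt(document, chunk, form_type)).strip()
    return {
        "chunk": chunk,                               # stored unchanged
        "context": context,                           # 50-100 token summary
        "embedding_text": f"{context}\n\n{chunk}",   # [context + chunk]
    }
```

Passing `generate` in as a parameter keeps the sketch independent of any particular LLM client, so it can be exercised with a stub before Ollama is wired in.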

### Test Files (Q4 2024):
1. `20241024_10-Q_edgar_data_1318605_0001628280-24-043486.txt`
2. `20241030_10-Q_edgar_data_789019_0000950170-24-118967.txt`
3. `20241101_10-K_edgar_data_320193_0000320193-24-000123.txt`
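
The file names above appear to follow a `DATE_FORM_edgar_data_CIK_ACCESSION.txt` pattern. The sketch below parses that pattern into metadata fields; the pattern itself, the reading of the leading date as the filing date, and the field names are assumptions inferred from the three examples.

```python
import re
from datetime import datetime

# Assumed layout: YYYYMMDD_<form>_edgar_data_<CIK>_<accession>.txt
FILENAME_RE = re.compile(
    r"^(?P<date>\d{8})_(?P<form>.+?)_edgar_data_(?P<cik>\d+)_"
    r"(?P<accession>\d{10}-\d{2}-\d{6})\.txt$"
)


def parse_filing_name(name: str) -> dict:
    """Extract date, form type, CIK, and accession number from a file name."""
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized filing name: {name!r}")
    return {
        "filing_date": datetime.strptime(match["date"], "%Y%m%d").date(),
        "form_type": match["form"],
        "cik": int(match["cik"]),
        "accession": match["accession"],
    }


print(parse_filing_name("20241024_10-Q_edgar_data_1318605_0001628280-24-043486.txt"))
```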

### Embedding Model: `multi-qa-mpnet-base-dot-v1` (768-dim)

**Selection Rationale:**
- **High-dimensional (768)** for finer separation of closely worded passages
- **Trained for Q&A** tasks - perfect for "find X in filings" queries
- **Preserves jargon** - financial/legal term distinctions maintained
- **No overfitting concerns** - pre-trained model, inference only

**Why NOT lower-dimensional models:**
- 384-dim (`all-MiniLM-L6-v2`): Loses nuance needed for legal/financial precision
- Use case requires distinguishing closely worded legal/financial passages, not just general semantic similarity
- 2x storage cost (8.6 GB vs 4.3 GB) is worth the quality improvement

### Storage Impact:
- **2.8M chunks × 768 dims × 4 bytes = ~8.6 GB** (embeddings)
- **26,014 JSON files with contextual summaries** (~20-25 GB estimated)
- Acceptable for EC2 EBS volume
- Enables precise retrieval for complex financial queries
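
The embedding figures can be reproduced with a few lines of arithmetic. This sketch assumes float32 storage (4 bytes per value) and decimal gigabytes (10^9 bytes); the function name is illustrative.

```python
N_CHUNKS = 2_800_000   # chunk estimate from this document
N_FILES = 26_014


def embedding_bytes(n_chunks: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw size of a dense float32 embedding matrix, excluding index overhead."""
    return n_chunks * dims * bytes_per_value


for dims in (768, 384):
    print(f"{dims}-dim float32: {embedding_bytes(N_CHUNKS, dims) / 1e9:.1f} GB")

print(f"average chunks per file: {N_CHUNKS / N_FILES:.0f}")
```

These are raw vector sizes only; a vector store's index and metadata add to them.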

---

## Implementation Pipeline

### Phase 1: Data Processing 🔄 IN PROGRESS
1. ✅ Extract all 2024 filings from ZIP archives
2. 🔄 Re-chunk with LLM-generated contextual summaries
   - Core chunk: 500 tokens (tiktoken)
   - LLM context: 50-100 tokens per chunk
   - Model: qwen2.5:1.5b via Ollama
   - Embedded chunk: [context + core chunk]
3. ✅ Metadata extraction (CIK, company, form, date)
4. ⏸️ JSON output: 26,014 files with 2.8M contextualized chunks
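
A minimal sketch of the 500-token core chunking in step 2. It assumes consecutive, non-overlapping windows (no overlap is mentioned in this document) and takes the tokenizer's encode/decode functions as parameters; the function names and the `cl100k_base` encoding in the usage comment are assumptions.

```python
from typing import Callable, Iterator, Sequence


def chunk_tokens(tokens: Sequence, size: int = 500) -> Iterator[Sequence]:
    """Yield consecutive, non-overlapping windows of at most `size` tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(tokens), size):
        yield tokens[start:start + size]


def chunk_text(
    text: str,
    encode: Callable[[str], Sequence],
    decode: Callable[[list], str],
    size: int = 500,
) -> list:
    """Tokenize `text`, split it into windows, and decode each window to a string."""
    return [decode(list(window)) for window in chunk_tokens(encode(text), size)]


# Usage with tiktoken (the encoding name is an assumption):
#   import tiktoken
#   enc = tiktoken.get_encoding("cl100k_base")
#   chunks = chunk_text(filing_text, enc.encode, enc.decode)
```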

### Phase 2: Embedding Generation ⏸️
1. ✅ Create `embedding_generator.py` script
2. ⏸️ Test on 3 files (validation)
3. ⏸️ Scale to full 2024 (26,014 files)
4. ⏸️ Store in `/app/data/embeddings/2024/`
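
The batching logic for this phase might look like the sketch below. It is not the contents of `embedding_generator.py`; the record layout (an `embedding_text` key), the batch size, and the `Encoder` protocol are assumptions. Any object with an `encode(list_of_strings)` method fits, which includes a `SentenceTransformer` instance.

```python
from typing import Iterator, Protocol, Sequence


class Encoder(Protocol):
    """Anything with an encode() method that maps texts to vectors."""

    def encode(self, sentences: list) -> Sequence: ...


def batched(items: list, size: int) -> Iterator[list]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_chunks(records: list, encoder: Encoder, batch_size: int = 64) -> list:
    """Attach an `embedding` list to each record, encoding in batches."""
    out = []
    for batch in batched(records, batch_size):
        vectors = encoder.encode([r["embedding_text"] for r in batch])
        for record, vector in zip(batch, vectors):
            # Convert to plain floats so the result is JSON-serializable.
            out.append({**record, "embedding": [float(x) for x in vector]})
    return out


# Usage (downloads the model on first run):
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("sentence-transformers/multi-qa-mpnet-base-dot-v1")
#   embedded = embed_chunks(records, model)
```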

### Phase 3: RAPTOR Implementation ⏸️
1. Hierarchical clustering (UMAP + GMM)
2. Recursive summarization (3 levels via Ollama)
3. ChromaDB setup and ingestion
4. Cluster validation
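
The recursion in steps 1 and 2 can be sketched independently of the clustering and LLM back ends. In the sketch below, `cluster` stands in for UMAP + GMM and `summarize` for an Ollama call; both are injected callables, and the reading of "3 levels" as three summary layers above the leaf chunks is an assumption.

```python
from typing import Callable

Cluster = Callable[[list], list]      # texts -> groups of indices into texts
Summarize = Callable[[list], str]     # texts in one group -> summary text


def build_tree(
    leaves: list, cluster: Cluster, summarize: Summarize, max_levels: int = 3
) -> list:
    """Return [leaves, level-1 summaries, level-2 summaries, ...]."""
    levels = [list(leaves)]
    for _ in range(max_levels):
        current = levels[-1]
        if len(current) <= 1:
            break  # a single root node cannot be clustered further
        groups = cluster(current)
        parents = [summarize([current[i] for i in group]) for group in groups if group]
        if len(parents) >= len(current):
            break  # clustering did not reduce the node count; stop recursing
        levels.append(parents)
    return levels
```

RAPTOR itself clusters on embeddings and allows a node to belong to more than one cluster; a `cluster` callable that returns overlapping index groups would preserve that behavior here.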

### Phase 4: Deployment ⏸️
1. Open WebUI integration
2. ChromaDB retrieval pipeline
3. LLM query interface
4. End-to-end testing

---

## Technical Specifications

### Chunking Strategy (Updated - Anthropic's Method)
- **Core chunk:** 500 tokens (tiktoken)
- **Contextual summary:** 50-100 tokens (LLM-generated via Ollama)
- **LLM model:** qwen2.5:1.5b (fast, efficient)
- **Prompt template:** "Provide a brief, factual summary (50-100 tokens) explaining what this chunk discusses in relation to the full [FORM_TYPE] filing"
- **Stored:** Both original chunk AND contextualized chunk
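- **Embedded:** [context + chunk] for better semantic search (550-600 tokens combined; the selected embedding model truncates input beyond 512 word pieces, so verify that chunk tails are not cut off)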
- **Metadata:** CIK, company name, form type, filing date
- **Rationale:** 35-49% fewer retrieval failures vs. non-contextual chunks (per Anthropic; the 49% figure also requires contextual BM25)
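
One possible shape for a stored chunk record, covering the fields listed above. The class name, field names, and JSON layout are hypothetical; the actual output schema is not specified in this document.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ChunkRecord:
    """One chunk plus its filing metadata (hypothetical schema)."""

    cik: int
    company: str
    form_type: str
    filing_date: str   # ISO date string, e.g. "2024-10-24"
    chunk_index: int
    chunk: str         # original 500-token core chunk
    context: str       # 50-100 token LLM summary

    @property
    def contextualized_chunk(self) -> str:
        """Text that gets embedded: context followed by the chunk."""
        return f"{self.context}\n\n{self.chunk}"

    def to_json(self) -> str:
        """Serialize both the original and the contextualized chunk."""
        payload = asdict(self)
        payload["contextualized_chunk"] = self.contextualized_chunk
        return json.dumps(payload, ensure_ascii=False)
```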

### Embedding Model (Selected)
- **Model:** sentence-transformers/multi-qa-mpnet-base-dot-v1
- **Dimensions:** 768
- **Parameters:** ~110M (MPNet-base; the model download is roughly 420 MB)
- **Training:** Question-answering tasks
- **Similarity:** Dot-product (the model is trained for dot-product scoring and its embeddings are not normalized)
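
Because the embeddings are not normalized, dot-product and cosine scoring can rank the same candidates differently, so the vector store's distance setting matters. The toy vectors below are made up purely to illustrate this.

```python
import math


def dot(a: list, b: list) -> float:
    """Plain inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a: list, b: list) -> float:
    """Inner product after scaling both vectors to unit length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


query = [1.0, 0.0]
docs = {"short": [0.9, 0.1], "long": [3.0, 2.0]}   # illustrative vectors only

best_by_dot = max(docs, key=lambda name: dot(query, docs[name]))
best_by_cosine = max(docs, key=lambda name: cosine(query, docs[name]))
print(best_by_dot, best_by_cosine)   # prints "long short"
```

In practice this means the ChromaDB collection should be configured for inner-product similarity rather than left on its default distance.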

### EC2 Infrastructure
- **Instance:** t3.xlarge
- **RAM:** 64 GB
- **Storage:** EBS volume at `/app/data/`
- **Docker:** edgar-chunking (8.4 GB image)
- **Ollama:** Running inside Docker for LLM context generation

---

## Research Citations

**Contextual Retrieval:**
- **Anthropic Contextual Retrieval (2024):** https://www.anthropic.com/news/contextual-retrieval
- **Key finding:** 35% fewer top-20 retrieval failures with contextual embeddings alone, 49% fewer when combined with contextual BM25

**Embedding Selection:**
- **Sentence-BERT (2019):** https://arxiv.org/abs/1908.10084
- **MPNet (2020):** https://arxiv.org/abs/2004.09297
- **MTEB Benchmark (2022):** https://arxiv.org/abs/2210.07316

**RAPTOR:**
- **RAPTOR Paper (2024):** https://arxiv.org/abs/2401.18059

**Tools:**
- **ChromaDB:** https://docs.trychroma.com/
- **Ollama:** https://ollama.com/
- **Open WebUI:** https://github.com/open-webui/open-webui

---

**Last Updated:** 2025-10-24

**Status:** 🔄 Re-chunking with LLM-generated contextual summaries → Then embedding generation