# Production Deployment: RAPTOR RAG with SEC EDGAR Filings (2024 Complete)

## Overview
Production RAPTOR RAG system deployment on AWS EC2:
- **Data**: All SEC 10-K/10-Q filings for 2024 (26,014 files) ✅ CHUNKED
- **Infrastructure**: AWS EC2 t3.xlarge (8 vCPUs, 64 GB RAM) - All processing and deployment
- **Models**: Sentence Transformers for embeddings, Ollama for LLM queries
- **Goal**: Production-ready RAG system with hierarchical retrieval via Open WebUI

---

## Current Status (2025-10-24)

### ✅ Infrastructure Deployed
**AWS EC2 Instance "secAI":**
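- **Instance Type**: t3.xlarge (AWS lists this type at 4 vCPUs and 16 GiB, which does not match the figures below; confirm the actual instance type)
- **vCPUs**: 8
- **RAM**: 64 GB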
- **OS**: Ubuntu 24.04 LTS
- **Public IP**: 35.175.134.36
- **SSH Access**: Configured (key-based auth)
- **Data Directory**: `/app/data/`

**Deployed Services:**
- Docker: edgar-chunking image (8.4 GB)
- Open WebUI: Configured for deployment
- Security groups: SSH access configured

### ✅ 2024 Data Chunking COMPLETE

**EC2 File Structure:**
```
/app/data/
├── edgar/extracted/2024/
│   ├── QTR1/ (6,337 .txt files)   ✅ CHUNKED
│   ├── QTR2/ (7,247 .txt files)   ✅ CHUNKED
│   ├── QTR3/ (6,248 .txt files)   ✅ CHUNKED
│   └── QTR4/ (6,182 .txt files)   ✅ CHUNKED
│
├── processed/2024/
│   ├── QTR1/ (6,337 JSON files)   ✅ COMPLETE
│   ├── QTR2/ (7,247 JSON files)   ✅ COMPLETE
│   ├── QTR3/ (6,248 JSON files)   ✅ COMPLETE
│   └── QTR4/ (6,182 JSON files)   ✅ COMPLETE
│
└── embeddings/
    └── test/ (next: 3 test files)
```

**2024 Processing Results:**

| Quarter | Files | Chunks | Tokens | Status |
|---------|-------|--------|--------|--------|
| Q1 | 6,337 | 1,235,886 | 616,372,446 | ✅ Complete |
| Q2 | 7,247 | 584,914 | 290,670,736 | ✅ Complete |
| Q3 | 6,248 | 522,716 | 259,802,386 | ✅ Complete |
| Q4 | 6,182 | 498,682 | 247,809,543 | ✅ Complete |
| **TOTAL** | **26,014** | **2,842,198** | **1,414,655,111** | ✅ Complete |

- **Processing completed:** 2025-10-24 using Docker container `edgar-chunking`
- **Total processing time:** ~4 hours (quarters processed sequentially)

---

## Next Phase: Embedding Generation

### Test Files (Q4 2024):
1. `20241024_10-Q_edgar_data_1318605_0001628280-24-043486.txt`
2. `20241030_10-Q_edgar_data_789019_0000950170-24-118967.txt`
3. `20241101_10-K_edgar_data_320193_0000320193-24-000123.txt`
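
The three test filenames appear to follow the pattern `<YYYYMMDD>_<form>_edgar_data_<CIK>_<accession>.txt`. That pattern is inferred from the examples above, not from a documented spec, so the sketch below is only a starting point for recovering metadata from a filename. `FILENAME_RE` and `parse_filing_name` are hypothetical names.

```python
import re
from datetime import date, datetime

# Pattern inferred from the three test filenames (an assumption, not a spec).
FILENAME_RE = re.compile(
    r"^(?P<filed>\d{8})_(?P<form>[^_]+)_edgar_data_"
    r"(?P<cik>\d+)_(?P<accession>\d{10}-\d{2}-\d{6})\.txt$"
)


def parse_filing_name(name: str) -> dict:
    """Split a filing filename into filing date, form type, CIK, and accession number."""
    m = FILENAME_RE.match(name)
    if m is None:
        raise ValueError(f"unrecognized filing name: {name}")
    return {
        "filing_date": datetime.strptime(m["filed"], "%Y%m%d").date(),
        "form_type": m["form"],
        "cik": int(m["cik"]),
        "accession": m["accession"],
    }


info = parse_filing_name("20241101_10-K_edgar_data_320193_0000320193-24-000123.txt")
print(info)
```

Raising on an unrecognized name, rather than returning partial data, makes it easy to spot files that break the assumed pattern during the 3-file test.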

### Embedding Model: `multi-qa-mpnet-base-dot-v1` (768-dim)

**Selection Rationale:**
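- **768 dimensions**: more capacity than 384-dim models to separate closely related passages
- **Trained for Q&A**: tuned on question-answer pairs, which matches "find X in filings" queries
- **Jargon sensitivity**: expected to keep financial and legal term distinctions apart (to be confirmed in the 3-file test)
- **No fine-tuning**: the pre-trained model is used for inference only, so overfitting to this corpus is not a concern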

**Why NOT lower-dimensional models:**
- 384-dim (`all-MiniLM-L6-v2`): likely to lose nuance needed for legal/financial precision
- The use case leans on retrieving specific wording, not just broad topical similarity
- Doubling storage (~8.6 GB vs ~4.3 GB) is an acceptable price for the expected quality gain

### Storage Impact:
- **2.8M chunks × 768 dims × 4 bytes = ~8.6 GB**
- Acceptable for EC2 EBS volume
- Enables precise retrieval for complex financial queries
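
The estimate above rounds the chunk count to 2.8M. As a quick check, the sketch below repeats the arithmetic with the exact total from the processing table (2,842,198 chunks). It covers raw float32 vectors only; index structures and stored metadata add to the footprint. `embedding_bytes` is a hypothetical helper name.

```python
CHUNKS = 2_842_198       # total chunks from the 2024 processing table
BYTES_PER_FLOAT32 = 4


def embedding_bytes(n_vectors: int, dims: int, bytes_per_value: int = BYTES_PER_FLOAT32) -> int:
    """Raw size of n_vectors dense vectors, excluding any index overhead."""
    return n_vectors * dims * bytes_per_value


for dims in (768, 384):
    size = embedding_bytes(CHUNKS, dims)
    print(f"{dims}-dim: {size / 1e9:.2f} GB ({size / 2**30:.2f} GiB)")
```

With the exact count, the 768-dim figure comes out at roughly 8.7 GB, slightly above the rounded 8.6 GB, and the 2x ratio to the 384-dim option is unchanged.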

---

## Implementation Pipeline

### Phase 1: Data Processing ✅ COMPLETE
1. ✅ Extract all 2024 filings from ZIP archives
2. ✅ Process with Docker chunking container
3. ✅ 500-token chunks using tiktoken
4. ✅ Metadata extraction (CIK, company, form, date)
5. ✅ JSON output: 26,014 files with 2.8M chunks

### Phase 2: Embedding Generation 🔄 IN PROGRESS
1. Create `embedding_generator.py` script
2. Test on 3 files (validation)
3. Scale to full 2024 (26,014 files)
4. Store in `/app/data/embeddings/2024/`
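
A minimal sketch of what `embedding_generator.py` could look like, assuming each processed JSON file has a top-level `"chunks"` list whose items carry a `"text"` field (the actual schema is not shown in this document) and that one `.npy` file per filing is an acceptable output format. The helper names are hypothetical. Third-party imports are deferred into the functions that need them so the path and loading helpers can be tested without the model installed.

```python
import json
from pathlib import Path

MODEL_NAME = "sentence-transformers/multi-qa-mpnet-base-dot-v1"


def load_chunk_texts(json_path: Path) -> list[str]:
    """Read chunk texts from one processed filing (JSON schema is an assumption)."""
    with json_path.open(encoding="utf-8") as f:
        doc = json.load(f)
    return [chunk["text"] for chunk in doc["chunks"]]


def output_path(json_path: Path, in_root: Path, out_root: Path) -> Path:
    """Mirror the processed/ layout under embeddings/, swapping .json for .npy."""
    return (out_root / json_path.relative_to(in_root)).with_suffix(".npy")


def embed_file(model, json_path: Path, in_root: Path, out_root: Path, batch_size: int = 32):
    """Embed every chunk of one filing and save the vectors as float32."""
    import numpy as np  # third-party, imported lazily

    texts = load_chunk_texts(json_path)
    if not texts:
        return None  # nothing to embed for this filing
    vectors = model.encode(
        texts, batch_size=batch_size, convert_to_numpy=True, show_progress_bar=False
    )
    target = output_path(json_path, in_root, out_root)
    target.parent.mkdir(parents=True, exist_ok=True)
    np.save(target, vectors.astype(np.float32))
    return target


def run(in_root: Path, out_root: Path) -> None:
    """Embed all quarterly JSON files under in_root; call this explicitly to start."""
    from sentence_transformers import SentenceTransformer  # third-party, lazy import

    model = SentenceTransformer(MODEL_NAME)
    for json_path in sorted(in_root.glob("QTR*/*.json")):
        embed_file(model, json_path, in_root, out_root)


# Example invocation (not executed here):
# run(Path("/app/data/processed/2024"), Path("/app/data/embeddings/2024"))
```

One thing worth checking during the 3-file test: Sentence Transformers models truncate input beyond `model.max_seq_length`, and a 500-token tiktoken chunk is not guaranteed to fit, because the two tokenizers count tokens differently.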

### Phase 3: RAPTOR Implementation ⏸️
1. Hierarchical clustering (UMAP + GMM)
2. Recursive summarization (3 levels via Ollama)
3. ChromaDB setup and ingestion
4. Cluster validation
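
The clustering step could be sketched as follows: reduce the embeddings with UMAP, fit Gaussian mixtures for several cluster counts, keep the one with the lowest BIC, and let each chunk join every cluster whose posterior probability clears a threshold. This is a single-stage simplification for illustration, not the full RAPTOR procedure, and the defaults (`threshold=0.1`, `reduced_dims=10`, `max_clusters=50`) are assumptions to be tuned. It relies on the `umap-learn` and `scikit-learn` packages, imported lazily.

```python
from typing import Sequence


def soft_assign(probabilities: Sequence[Sequence[float]], threshold: float = 0.1) -> list[list[int]]:
    """Map each chunk to every cluster whose posterior probability exceeds threshold.

    Falls back to the single most likely cluster so no chunk is left unassigned.
    """
    assignments = []
    for row in probabilities:
        picked = [k for k, p in enumerate(row) if p > threshold]
        if not picked:
            picked = [max(range(len(row)), key=lambda k: row[k])]
        assignments.append(picked)
    return assignments


def cluster_embeddings(embeddings, max_clusters: int = 50, reduced_dims: int = 10, seed: int = 0):
    """Reduce with UMAP, pick the GMM with the lowest BIC, return soft assignments."""
    import umap  # provided by the umap-learn package
    from sklearn.mixture import GaussianMixture

    reduced = umap.UMAP(
        n_components=reduced_dims, metric="cosine", random_state=seed
    ).fit_transform(embeddings)

    best_gmm, best_bic = None, float("inf")
    for k in range(1, min(max_clusters, len(reduced)) + 1):
        gmm = GaussianMixture(n_components=k, random_state=seed).fit(reduced)
        bic = gmm.bic(reduced)
        if bic < best_bic:
            best_gmm, best_bic = gmm, bic

    return soft_assign(best_gmm.predict_proba(reduced).tolist())


print(soft_assign([[0.6, 0.35, 0.05], [0.05, 0.05, 0.9]]))
```

Soft assignment lets a chunk that touches two topics, for example risk factors and liquidity, contribute to both cluster summaries.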

### Phase 4: Deployment ⏸️
1. Open WebUI integration
2. ChromaDB retrieval pipeline
3. LLM query interface
4. End-to-end testing
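
A rough sketch of the retrieval and query path, assuming `collection` is a ChromaDB collection that already holds chunk texts with their embeddings, and that Ollama is listening on its default local port. The model name `llama3` and the prompt wording are placeholders, not choices made in this document.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_prompt(question: str, passages: list[str]) -> str:
    """Number the retrieved passages and place them ahead of the question."""
    context = "\n\n".join(f"[{i}] {text}" for i, text in enumerate(passages, start=1))
    return (
        "Answer using only the filing excerpts below. Cite excerpt numbers.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )


def retrieve(collection, query_embedding: list[float], k: int = 5) -> list[str]:
    """Fetch the k nearest stored chunk texts from a ChromaDB collection."""
    result = collection.query(query_embeddings=[query_embedding], n_results=k)
    return result["documents"][0]


def ask_ollama(prompt: str, model: str = "llama3") -> str:
    """Send a non-streaming generate request to Ollama (model name is a placeholder)."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


print(build_prompt("What was total revenue?", ["Excerpt A", "Excerpt B"]))
```

ChromaDB collections default to L2 distance, so a collection meant for a dot-product model would probably need its distance space set to inner product at creation time; the query embedding must come from the same model used for the stored chunks.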

---

## Technical Specifications

### Chunking Strategy (Completed)
- **Core chunk:** 500 tokens (tiktoken)
- **No contextual window:** Direct chunking for baseline
- **Metadata:** CIK, company name, form type, filing date
- **Rationale:** Simpler baseline; can add contextual embeddings later if needed
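
The baseline strategy amounts to slicing the token stream into consecutive, non-overlapping 500-token windows. A minimal sketch follows, assuming the `cl100k_base` encoding (this document does not say which tiktoken encoding the container uses). `chunk_tokens` and `chunk_text` are hypothetical names, and `tiktoken` is imported lazily so the windowing logic runs without it.

```python
from typing import Iterator, Sequence

CHUNK_SIZE = 500  # tokens per chunk, as stated above


def chunk_tokens(tokens: Sequence[int], size: int = CHUNK_SIZE) -> Iterator[Sequence[int]]:
    """Yield consecutive, non-overlapping windows of at most size tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(tokens), size):
        yield tokens[start:start + size]


def chunk_text(text: str, encoding_name: str = "cl100k_base") -> list[str]:
    """Tokenize with tiktoken, split into windows, and decode each window to text."""
    import tiktoken  # third-party, imported lazily

    enc = tiktoken.get_encoding(encoding_name)
    # disallowed_special=() treats special-token strings in filings as plain text.
    tokens = enc.encode(text, disallowed_special=())
    return [enc.decode(list(window)) for window in chunk_tokens(tokens)]


print([len(w) for w in chunk_tokens(list(range(1201)))])
```

The last window is simply shorter than 500 tokens, which is consistent with the table's average of just under 500 tokens per chunk.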

### Embedding Model (Selected)
- **Model:** sentence-transformers/multi-qa-mpnet-base-dot-v1
- **Dimensions:** 768
- **Parameters:** ~110M (the model download is about 420 MB)
- **Training:** Question-answer pairs for semantic search
- **Similarity:** Dot product, the score function the model was trained with
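
The choice of score function matters because, as far as its model card indicates, this model's vectors are not length-normalized, so dot product and cosine similarity can rank the same documents differently. The toy vectors below are invented purely to illustrate that effect.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Dot product divided by both vector norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


query = [1.0, 0.0]
docs = {
    "short_aligned": [1.0, 0.0],  # same direction as the query, small norm
    "long_offaxis": [3.0, 3.0],   # 45 degrees off the query, larger norm
}


def rank(score):
    """Return document names ordered from best to worst under the given score."""
    return sorted(docs, key=lambda name: score(query, docs[name]), reverse=True)


print("dot:   ", rank(dot))
print("cosine:", rank(cosine))
```

Dot product favors the longer vector while cosine favors the better-aligned one, so the vector store should be configured for the same score function the model was trained with.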

### EC2 Infrastructure
- **Instance:** t3.xlarge
- **RAM:** 64 GB
- **Storage:** EBS volume at `/app/data/`
- **Docker:** edgar-chunking (8.4 GB image)

---

## Research Citations

**Embedding Selection:**
- **Sentence-BERT (2019):** https://arxiv.org/abs/1908.10084
- **MPNet (2020):** https://arxiv.org/abs/2004.09297
- **MTEB Benchmark (2022):** https://arxiv.org/abs/2210.07316

**RAPTOR:**
- **RAPTOR Paper (2024):** https://arxiv.org/abs/2401.18059

**Tools:**
- **ChromaDB:** https://docs.trychroma.com/
- **Ollama:** https://ollama.com/
- **Open WebUI:** https://github.com/open-webui/open-webui

---

**Last Updated:** 2025-10-24

**Status:** 🔄 Chunking Complete → Starting Embedding Generation