# Production Deployment: RAPTOR RAG with SEC EDGAR Filings (2024 Complete)

## Overview
Production RAPTOR RAG system deployment on AWS EC2:
- **Data**: All SEC 10-K/10-Q filings for 2024 (26,014 files)
- **Infrastructure**: AWS EC2 t3.xlarge (8 vCPUs, 64 GB RAM) - All processing and deployment
- **Models**: Sentence Transformers for embeddings, Ollama for LLM queries
- **Goal**: Production-ready RAG system with hierarchical retrieval via Open WebUI

---

## Current Status (2025-10-24)

### ✅ Infrastructure Deployed
**AWS EC2 Instance "secAI":**
- **Instance Type**: t3.xlarge
- **vCPUs**: 8
- **RAM**: 64 GB
- **OS**: Ubuntu 24.04 LTS
- **Public IP**: 35.175.134.36
- **SSH Access**: Configured (key-based auth)
- **Data Directory**: `/app/data/`

**Deployed Services:**
- Docker: edgar-chunking image (8.4 GB)
- Open WebUI: Configured for deployment
- Security groups: SSH access configured

### ✅ 2024 Data Chunking: Q1 COMPLETE

- **Method:** NVIDIA's 15% overlap chunking (no LLM context generation)
- **Rationale:** Anthropic's LLM-based contextual method was too slow (~11 days for all 2024 filings); NVIDIA's overlap method is fast (~4 hours total) with good retrieval quality

**EC2 File Structure:**
```
/app/data/
├── edgar/extracted/2024/
│   ├── QTR1/ (6,337 .txt files)   ✅ READY
│   ├── QTR2/ (7,247 .txt files)   ✅ READY
│   ├── QTR3/ (6,248 .txt files)   ✅ READY
│   └── QTR4/ (6,182 .txt files)   ✅ READY
│
├── processed/2024/
│   ├── QTR1/ (6,337 .json files)  ✅ CHUNKED (15% overlap)
│   ├── QTR2/                      ⏸️ Pending
│   ├── QTR3/                      ⏸️ Pending
│   └── QTR4/                      ⏸️ Pending
│
└── embeddings/
    ├── test_q1/                   ✅ TEST (3 companies, 286 chunks)
    │   ├── embeddings.parquet     (1.5 MB - 286 rows × 768 dims)
    │   └── metadata.parquet       (3.3 KB - 286 rows)
    └── 2024/
        ├── QTR1/                  ⏸️ Next (after test validation)
        ├── QTR2/                  ⏸️ Pending
        ├── QTR3/                  ⏸️ Pending
        └── QTR4/                  ⏸️ Pending
```

**2024 Processing Progress:**

| Quarter | Files | Chunks (est.) | Chunking Status | Embedding Status |
|---------|-------|---------------|-----------------|------------------|
| Q1 | 6,337 | ~108,000 | ✅ Complete | 🔄 Test (3 files) |
| Q2 | 7,247 | ~123,000 | ⏸️ Pending | ⏸️ Pending |
| Q3 | 6,248 | ~106,000 | ⏸️ Pending | ⏸️ Pending |
| Q4 | 6,182 | ~105,000 | ⏸️ Pending | ⏸️ Pending |
| **TOTAL** | **26,014** | **~442,000** | **Q1 done** | **Test done** |

**Test Embedding Results (Q1 - 3 companies):**
- **Companies**: Tesla (CIK 1318605), Microsoft (CIK 789019), Apple (CIK 320193)
- **Files**: Tesla 10-K (171 chunks), Microsoft 10-Q (87 chunks), Apple 10-Q (28 chunks)
- **Total**: 286 chunks embedded
- **Time**: 2min 40sec
- **Output**: `/app/data/embeddings/test_q1/` (embeddings.parquet + metadata.parquet)
- **Performance**: ~95 chunks/min
- **Estimated Q1 full**: ~19 hours for 108,000 chunks
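
The full-quarter estimate follows directly from the reported throughput. A minimal sketch of that arithmetic, using the figures listed above (the helper name `estimate_hours` is illustrative and not part of the project code):

```python
def estimate_hours(total_chunks: int, chunks_per_min: float) -> float:
    """Projected wall-clock hours to embed `total_chunks` at a steady rate."""
    return total_chunks / chunks_per_min / 60


# Figures reported above: ~108,000 Q1 chunks at ~95 chunks/min
q1_hours = estimate_hours(108_000, 95)
print(f"Q1 estimate: {q1_hours:.1f} hours")
```

The estimate assumes throughput stays constant at the test rate; a long run on the full quarter may behave differently.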

---

## Implementation Details

### Chunking Strategy: NVIDIA's 15% Overlap Method

**Configuration:**
- **Core chunk**: 500 tokens (tiktoken `cl100k_base`)
- **Overlap**: 15% = 75 tokens
- **Step size**: 425 tokens (500 - 75)
- **Format**: JSON per filing with chunks array

**Why 15% overlap?**
- Research-backed: NVIDIA papers recommend 10-20% overlap for optimal retrieval
- No LLM overhead: Anthropic's contextual method was too slow for this corpus (~11 days vs ~4 hours)
- Good retrieval: Overlap provides context without requiring LLM summarization
- Fast processing: ~4 hours for all 26,014 files

**Example chunk overlap:**
```
Chunk 0: tokens 0-499    (500 tokens)
Chunk 1: tokens 425-924  (500 tokens, 75 token overlap with Chunk 0)
Chunk 2: tokens 850-1349 (500 tokens, 75 token overlap with Chunk 1)
```
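
The windowing above can be reproduced with a small sliding-window function. This is a sketch under stated assumptions, not the project's `text_processor`: the function names and the output dictionary keys are hypothetical, and the `tiktoken` import is deferred so the core logic runs without it.

```python
def chunk_tokens(token_ids, chunk_size=500, overlap=75):
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    step = chunk_size - overlap  # 425 with the configuration above
    chunks = []
    for start in range(0, len(token_ids), step):
        window = token_ids[start:start + chunk_size]
        chunks.append({
            "chunk_id": len(chunks),
            "start": start,                  # index of the first token
            "end": start + len(window) - 1,  # index of the last token (inclusive)
            "tokens": window,
        })
        if start + chunk_size >= len(token_ids):
            break  # this window already reaches the end of the document
    return chunks


def chunk_text(text, encoding_name="cl100k_base"):
    """Tokenize with tiktoken, chunk, and decode each window back to text."""
    import tiktoken  # third-party; imported lazily so chunk_tokens has no dependencies
    enc = tiktoken.get_encoding(encoding_name)
    return [enc.decode(c["tokens"]) for c in chunk_tokens(enc.encode(text))]


# A 1,350-token document yields the three windows shown above
spans = [(c["start"], c["end"]) for c in chunk_tokens(list(range(1350)))]
print(spans)  # [(0, 499), (425, 924), (850, 1349)]
```

The final window is shorter than 500 tokens whenever the document length does not line up with the step size.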

### Embedding Model: `multi-qa-mpnet-base-dot-v1` (768-dim)

**Selection Rationale:**
- **High-dimensional (768)** for precise retrieval of exact wording
- **Trained for Q&A** tasks - perfect for "find X in filings" queries
- **Preserves jargon** - financial/legal term distinctions maintained
- **Dot-product similarity** - the model is tuned for dot-product scoring, which skips the vector normalization that cosine similarity requires
- **No overfitting concerns** - pre-trained model, inference only
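
Because this model is scored with the dot product, vector magnitude affects ranking, which is not the case with cosine similarity. The sketch below illustrates the difference on toy 2-dimensional vectors (not real embeddings) and then shows one way to score passages with Sentence Transformers; `embed_and_score` is a hypothetical helper, not the project's `embedding_generator`.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def rank(query, docs, score):
    """Return document indices ordered from best to worst score."""
    return sorted(range(len(docs)), key=lambda i: score(query, docs[i]), reverse=True)


# Toy vectors: doc 0 points almost exactly along the query, doc 1 is longer but off-axis
query = [1.0, 0.0]
docs = [[0.9, 0.1], [3.0, 3.0]]
dot_order = rank(query, docs, dot)     # doc 1 first: magnitude dominates
cos_order = rank(query, docs, cosine)  # doc 0 first: direction dominates


def embed_and_score(question, passages, model_name="multi-qa-mpnet-base-dot-v1"):
    """Dot-product scores of each passage against the question (downloads the model)."""
    from sentence_transformers import SentenceTransformer, util  # third-party
    model = SentenceTransformer(model_name)
    q = model.encode(question, convert_to_tensor=True)
    p = model.encode(passages, convert_to_tensor=True)
    return util.dot_score(q, p)[0].tolist()
```

Any vector store used later should therefore be configured for inner-product scoring rather than cosine distance, so rankings match what the model was trained for.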

**Why NOT lower-dimensional models:**
- 384-dim (`all-MiniLM-L6-v2`): Loses nuance needed for legal/financial precision
- Use case requires exact wording retrieval, not general semantic similarity
- Storage cost acceptable for precision gain

**Parquet Storage Format:**
- **embeddings.parquet**: N rows × 768 columns (one row per chunk)
- **metadata.parquet**: N rows × 2 columns (file_name, chunk_id)
- **Why Parquet?**: Columnar format, efficient compression, vector DB ready
- **Consolidated output**: one embeddings file and one metadata file per quarter, rather than many individual JSON files
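
A sketch of how rows in this layout could be assembled and written. The column names `dim_0` … `dim_767` are an assumption (the source only states 768 columns), and both helper names are hypothetical; writing requires pandas with a Parquet engine such as pyarrow.

```python
from pathlib import Path


def to_rows(file_name, vectors):
    """Build one embedding row and one metadata row per chunk vector."""
    emb_rows = [{f"dim_{i}": v for i, v in enumerate(vec)} for vec in vectors]
    meta_rows = [{"file_name": file_name, "chunk_id": i} for i in range(len(vectors))]
    return emb_rows, meta_rows


def write_parquet(emb_rows, meta_rows, out_dir):
    """Write the two Parquet files described above; row order links them."""
    import pandas as pd  # third-party; imported lazily so to_rows has no dependencies
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    pd.DataFrame(emb_rows).to_parquet(out / "embeddings.parquet", index=False)
    pd.DataFrame(meta_rows).to_parquet(out / "metadata.parquet", index=False)


# Two 3-dimensional toy vectors standing in for 768-dimensional embeddings
emb_rows, meta_rows = to_rows("example_10k.json", [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
print(meta_rows[1])
```

Since the two files are joined only by row position, both must be written from the same ordered pass over the chunks.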

### Storage Estimates:
- **Chunked JSON**: 26,014 files × ~1 MB avg = ~26 GB
- **Embeddings**: 442,000 chunks × 768 dims × 4 bytes (float32) = ~1.4 GB raw, before Parquet compression
- **Total**: ~27.4 GB for all 2024 data
- **EC2 EBS**: Sufficient capacity
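
The estimates above can be checked with a few lines of arithmetic. The inputs are the figures from this section; the ~1 MB average per chunked JSON file is the document's own estimate, not a measured value.

```python
CHUNKS = 442_000          # estimated chunks for all of 2024
DIMS = 768                # embedding dimensions
BYTES_PER_FLOAT32 = 4
FILES = 26_014            # chunked JSON files
AVG_JSON_MB = 1           # assumed average size per file

embeddings_gb = CHUNKS * DIMS * BYTES_PER_FLOAT32 / 1e9
json_gb = FILES * AVG_JSON_MB / 1000
total_gb = embeddings_gb + json_gb

print(f"Embeddings: {embeddings_gb:.2f} GB")
print(f"Chunked JSON: {json_gb:.1f} GB")
print(f"Total: {total_gb:.1f} GB")
```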

---

## Implementation Pipeline

### Phase 1: Data Processing ✅ Q1 COMPLETE, Q2-Q4 PENDING
1. ✅ Extract all 2024 filings from ZIP archives
2. ✅ Chunk Q1 with 15% overlap (NVIDIA method)
   - Core chunk: 500 tokens (tiktoken)
   - Overlap: 75 tokens (15%)
   - No LLM context generation
3. ✅ Metadata extraction (CIK, company, form, date)
4. 🔄 JSON output: Q1 complete (6,337 files), Q2-Q4 pending

### Phase 2: Embedding Generation 🔄 TEST COMPLETE, Q1 FULL PENDING
1. ✅ Create `embedding_generator.py` script
2. ✅ Test on 3 Q1 files (Tesla, Microsoft, Apple - 286 chunks)
3. ⏸️ Scale to full Q1 (6,337 files, ~108,000 chunks - est. 19 hours)
4. ⏸️ Scale to Q2-Q4
5. ⏸️ Store in `/app/data/embeddings/2024/QTR{1-4}/`

### Phase 3: RAPTOR Implementation ⏸️
1. Hierarchical clustering (UMAP + GMM)
2. Recursive summarization (3 levels via Ollama)
3. ChromaDB setup and ingestion
4. Cluster validation
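
Since this phase is not yet implemented, the following is only a sketch of the recursive structure: cluster the current level, summarize each cluster, and repeat on the summaries. Clustering (planned: UMAP + GMM) and summarization (planned: Ollama) are passed in as callables; `build_raptor_tree` and the stub functions are hypothetical placeholders, not the RAPTOR reference implementation.

```python
def build_raptor_tree(chunks, cluster_fn, summarize_fn, max_levels=3):
    """Return a list of levels: level 0 is the leaf chunks, higher levels are summaries."""
    levels = [list(chunks)]
    for _ in range(max_levels):
        current = levels[-1]
        if len(current) <= 1:
            break  # a single root node cannot be clustered further
        groups = cluster_fn(current)
        if len(groups) >= len(current):
            break  # clustering made no progress, so stop instead of looping
        levels.append([summarize_fn(group) for group in groups])
    return levels


# Stubs standing in for UMAP + GMM clustering and LLM summarization
def pair_up(items):
    return [items[i:i + 2] for i in range(0, len(items), 2)]


def join_texts(group):
    return "+".join(group)


tree = build_raptor_tree(list("abcdefgh"), pair_up, join_texts)
for depth, nodes in enumerate(tree):
    print(depth, nodes)
```

In a real pipeline each summary would also be embedded and stored alongside the leaf chunks, so retrieval can match a query at any level of the tree.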

### Phase 4: Deployment ⏸️
1. Open WebUI integration
2. ChromaDB retrieval pipeline
3. LLM query interface
4. End-to-end testing

---

## Docker Configuration

### `docker-compose.chunking.yml` Volume Mounts:
```yaml
volumes:
  - /app/data/edgar:/app/data/edgar:ro          # Read-only input (extracted .txt)
  - /app/data/processed:/app/data/processed     # Write output (chunked .json)
  - /app/data/embeddings:/app/data/embeddings   # Embeddings output (.parquet)
  - ./src:/app/src                              # Source code (development)
```

**Key Points:**
- All three data directories are mounted for persistence
- The embeddings directory is now mounted (this fixed an issue where embeddings were lost)
- Files persist on the EC2 host filesystem after the container is removed (`--rm` flag)

---

## Technical Specifications

### EC2 Infrastructure
- **Instance:** t3.xlarge
- **RAM:** 64 GB
- **Storage:** EBS volume at `/app/data/`
- **Docker:** edgar-chunking (8.4 GB image)
- **Python environment**: Inside Docker container only

### Commands Reference

**Chunking (Q2-Q4 pending):**
```bash
# Q2
docker compose -f docker-compose.chunking.yml run --rm chunking \
  python -m src.data.text_processor \
  --input /app/data/edgar/extracted/2024/QTR2 \
  --output /app/data/processed/2024/QTR2 \
  --chunk-size 500

# Q3, Q4 similar
```

**Embedding generation:**
```bash
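# Full Q1
# The glob is expanded by the host shell; the paths are valid inside the
# container because /app/data/processed is mounted at the same location.
docker compose -f docker-compose.chunking.yml run --rm chunking \
  python -m src.models.embedding_generator \
  --input /app/data/processed/2024/QTR1 \
  --output /app/data/embeddings/2024/QTR1 \
  --files /app/data/processed/2024/QTR1/*.json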
```

**Monitoring:**
```bash
# Count chunked files
ls -1 /app/data/processed/2024/QTR1/*.json 2>/dev/null | wc -l

# Check embedding files
ls -lh /app/data/embeddings/test_q1/

# Check embedding directory size
du -sh /app/data/embeddings/test_q1/
```
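
The same progress check can be scripted across all four quarters. A sketch assuming the directory layout shown earlier and one chunked `.json` file per extracted `.txt` filing; `chunking_progress` is a hypothetical helper, not part of the project.

```python
from pathlib import Path


def chunking_progress(data_root, year="2024"):
    """Map each quarter to (chunked .json count, extracted .txt count)."""
    root = Path(data_root)
    progress = {}
    for quarter in ("QTR1", "QTR2", "QTR3", "QTR4"):
        extracted = root / "edgar" / "extracted" / year / quarter
        processed = root / "processed" / year / quarter
        progress[quarter] = (
            len(list(processed.glob("*.json"))),
            len(list(extracted.glob("*.txt"))),
        )
    return progress


if __name__ == "__main__":
    for quarter, (chunked, extracted) in chunking_progress("/app/data").items():
        print(f"{quarter}: {chunked}/{extracted} files chunked")
```

Run it on the EC2 host (or inside the container, where the same paths are mounted) to see which quarters still need chunking.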

---

## Research Citations

**Chunking Strategy:**
- **NVIDIA Retrieval QA (2023)**: Overlap-based chunking for improved retrieval
- **Recommended overlap**: 10-20% for balance between context and efficiency

**Embedding Selection:**
- **Sentence-BERT (2019):** https://arxiv.org/abs/1908.10084
- **MPNet (2020):** https://arxiv.org/abs/2004.09297
- **MTEB Benchmark (2022):** https://arxiv.org/abs/2210.07316

**RAPTOR:**
- **RAPTOR Paper (2024):** https://arxiv.org/abs/2401.18059

**Tools:**
- **ChromaDB:** https://docs.trychroma.com/
- **Ollama:** https://ollama.com/
- **Open WebUI:** https://github.com/open-webui/open-webui
- **Sentence Transformers:** https://www.sbert.net/

---

**Last Updated:** 2025-10-24

**Status:** ✅ Q1 chunking complete → ✅ Test embeddings complete (3 companies) → Next: Full Q1 embedding generation