# Production Plan: RAPTOR RAG with SEC EDGAR Filings (1993-2024)

## Overview
This notebook documents the full production RAPTOR RAG system deployment:
- **Data**: All SEC 10-K/10-Q filings (1993-2024) - ~51GB total, expanding from initial 2024 prototype
- **Infrastructure**: AWS EC2 (secai instance) - All processing, embedding, and deployment
- **Models**: Ollama with FinGPT-v3 and other financial LLMs
- **Goal**: Production-ready RAG system with hierarchical retrieval via Open WebUI

## Evolution: Prototype → Production

**Initial Prototype Plan (Archived):**
- 2024 only (26,014 filings)
- Local PC processing → AWS EC2 embedding → handoff
- Test RAPTOR before scaling

**Current Production Reality:**
- **2021-2024 uploaded** (6.6GB compressed, ~100K+ filings)
- **Target: 1993-2024** (full dataset, 32 filing years inclusive, ~300K+ filings)
- **All operations on AWS EC2** (processing + embedding + RAPTOR + deployment)
- **Open WebUI already deployed** (port 3000, Docker container)
- **Team collaboration**: Processing/embedding (Kabe) + Ollama/RAPTOR (Betsy)

---

## Current Status (2025-10-17)

### ✅ Infrastructure Deployed
- AWS EC2 instance "secai" (Ubuntu 24.04)
- SSH access configured (kabe@35.175.134.36)
- Data directory: `/app/data/edgar/`
- Open WebUI: Running on port 3000 (Docker)
- Security groups: Configured for team access

### ✅ Data Uploaded to EC2
**Location:** `/app/data/edgar/`
- ✅ 10-X_C_2021.zip (1.6GB)
- ✅ 10-X_C_2022.zip (1.8GB)
- ✅ 10-X_C_2023.zip (1.7GB)
- ✅ 10-X_C_2024.zip (1.6GB)
- ✅ 2024/ (unzipped, ready for processing)
- ⏳ 1993-2020 data (to be uploaded)

**Total uploaded:** 6.6GB compressed (~25-30GB uncompressed when all extracted)

### ✅ Local Prototype Completed (Reference Only)
**Note:** This was done locally for testing and is archived. Production uses AWS EC2.

**Completed on local PC:**
- ✅ Text processing: 26,014 filings → 2.7M chunks (Step 2 complete)
- ✅ Output: `processed_2024_500tok_contextual.json` (15GB)
- ❌ Embedding generation: Aborted locally (too resource-intensive); moved to AWS EC2

**Key learnings from local prototype:**
- 500-token chunks worked well for SEC filings in the prototype
- Contextual chunking (Anthropic method) works well
- 42 minutes to process 26K filings
- 15GB output manageable for transfer

---

## Production Pipeline (AWS EC2)

### Phase 1: Data Processing (Kabe) 🔄
**Status:** Starting

**Tasks:**
1. Process unzipped 2024 data on EC2
2. Run `run_02_processing.py` adapted for EC2 paths
3. Generate 500-token chunks with surrounding context (contextual chunking)
4. Repeat for 2021-2023 data
5. Scale to 1993-2020 once uploaded

**Expected output (2021-2024 total, assuming each year resembles 2024):**
- ~26K filings/year × 4 years = ~104K filings
- ~2.7M chunks/year × 4 years = ~11M chunks
- ~15GB JSON/year × 4 years = ~60GB processed data

**Location:** `/app/data/processed/`
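
The four-year totals above are a straight-line extrapolation from the 2024 prototype run. A minimal sketch of that arithmetic, assuming every year resembles 2024 (filing volume actually varies by year, so treat the results as rough). The names `Estimate` and `scale_from_2024` are illustrative and not part of the existing scripts:

```python
from dataclasses import dataclass

# Measured on the 2024 local prototype run (see the archived results section).
FILINGS_2024 = 26_014
CHUNKS_2024 = 2_725_171
JSON_GB_2024 = 15.0


@dataclass(frozen=True)
class Estimate:
    filings: int
    chunks: int
    json_gb: float


def scale_from_2024(years: int) -> Estimate:
    """Linear extrapolation: assumes each year looks like 2024."""
    return Estimate(
        filings=FILINGS_2024 * years,
        chunks=CHUNKS_2024 * years,
        json_gb=JSON_GB_2024 * years,
    )


if __name__ == "__main__":
    est = scale_from_2024(4)
    print(f"{est.filings:,} filings, {est.chunks:,} chunks, ~{est.json_gb:.0f}GB JSON")
```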

### Phase 2: Embedding Generation (Kabe) ⏸️
**Status:** Awaiting Phase 1 completion

**Tasks:**
1. Load processed chunks on EC2
2. Run `run_03_embeddings.py` on EC2
3. Generate embeddings using sentence-transformers
4. Store embeddings + metadata

**Expected output:**
- ~11M embeddings (384 dims each)
- ~17GB embeddings file (.npy)
- Metadata JSON (~2GB)

**Location:** `/app/data/embeddings/`
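
The ~17GB figure follows from the vector count and dimensionality. A small sketch of the calculation, assuming the embeddings are stored as a dense float32 array (4 bytes per value) and ignoring the small `.npy` header. The function name `embedding_file_gb` is illustrative:

```python
def embedding_file_gb(n_vectors: int, dims: int = 384, bytes_per_value: int = 4) -> float:
    """Raw size in GB (10**9 bytes) of a dense n_vectors x dims array."""
    return n_vectors * dims * bytes_per_value / 1e9


if __name__ == "__main__":
    # ~11M chunks for 2021-2024, 384-dim float32 vectors
    print(f"{embedding_file_gb(11_000_000):.1f} GB")
```

Storing float16 instead (`bytes_per_value=2`) would halve the file size at some cost in precision.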

### Phase 3: RAPTOR Implementation (Betsy + Kabe) ⏸️
**Status:** Awaiting Phase 2

**Tasks:**
1. Hierarchical clustering (UMAP + GMM)
2. Recursive summarization (3 levels via Ollama)
3. ChromaDB setup and ingestion
4. Cluster validation

**Collaborative work:**
- Kabe: Clustering implementation
- Betsy: Ollama integration + summarization
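
Tasks 1-2 amount to a loop: cluster the current level, summarize each cluster, then treat the summaries as the next level. A minimal sketch of that control flow, assuming the clustering and summarization steps are passed in as functions. `build_raptor_levels`, `pair_cluster`, and `join_summary` are hypothetical names; the two stand-ins exist only so the example runs without UMAP, a GMM, or Ollama, and they are not the planned implementation:

```python
from typing import Callable, List

# texts -> groups of indices into texts (groups may overlap for soft clustering)
Cluster = Callable[[List[str]], List[List[int]]]
# texts belonging to one cluster -> a single summary string
Summarize = Callable[[List[str]], str]


def build_raptor_levels(
    chunks: List[str],
    cluster: Cluster,
    summarize: Summarize,
    max_levels: int = 3,
) -> List[List[str]]:
    """Return [leaf chunks, level-1 summaries, level-2 summaries, ...]."""
    levels = [list(chunks)]
    for _ in range(max_levels):
        current = levels[-1]
        if len(current) <= 1:
            break  # nothing left to merge
        groups = cluster(current)
        summaries = [summarize([current[i] for i in g]) for g in groups if g]
        if len(summaries) >= len(current):
            break  # clustering no longer reduces the level; stop
        levels.append(summaries)
    return levels


def pair_cluster(texts: List[str]) -> List[List[int]]:
    """Stand-in for UMAP + GMM: groups neighbors two at a time."""
    return [list(range(i, min(i + 2, len(texts)))) for i in range(0, len(texts), 2)]


def join_summary(texts: List[str]) -> str:
    """Stand-in for an LLM summary: concatenates the inputs."""
    return " + ".join(texts)


if __name__ == "__main__":
    tree = build_raptor_levels(list("abcdefgh"), pair_cluster, join_summary)
    for depth, level in enumerate(tree):
        print(depth, level)
```

In the real pipeline, `cluster` would operate on embeddings of the current level and `summarize` would call the model served by Ollama; keeping both injectable lets the clustering and summarization work proceed independently.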

### Phase 4: Deployment (Betsy) ⏸️
**Status:** Infrastructure ready, awaiting data pipeline

**Completed:**
- ✅ Open WebUI Docker container
- ✅ Port 3000 accessible

**In progress:**
- ⏳ Ollama connection to Open WebUI (troubleshooting)

**Remaining:**
- Load ChromaDB with embeddings + summaries
- Configure RAG query pipeline
- Test end-to-end queries
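
When troubleshooting the Ollama connection, it can help to first confirm that Ollama answers at all from the place Open WebUI runs. The sketch below queries Ollama's `/api/tags` endpoint, which lists installed models. The base URL is an assumption (11434 is Ollama's default port); note that inside a Docker container `localhost` refers to the container itself, so the URL Open WebUI needs may be the host address or the Ollama container's name instead. `list_ollama_models` is an illustrative helper name:

```python
import json
import urllib.error
import urllib.request
from typing import List, Optional


def list_ollama_models(
    base_url: str = "http://localhost:11434",  # assumed default; adjust for Docker networking
    timeout: float = 5.0,
) -> Optional[List[str]]:
    """Return installed model names, or None if Ollama cannot be reached."""
    url = f"{base_url.rstrip('/')}/api/tags"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, bad URL, or a non-JSON response
        return None
    return [m.get("name", "") for m in payload.get("models", [])]


if __name__ == "__main__":
    models = list_ollama_models()
    if models is None:
        print("Ollama not reachable at the configured URL")
    else:
        print("Ollama models:", models)
```

Running this once on the EC2 host and once inside the Open WebUI container would show whether the problem is Ollama itself or the network path between the containers.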

---

## Division of Labor

### Kabe's Responsibilities
1. ⏳ Upload remaining EDGAR data (1993-2020)
2. 🔄 Process all years (1993-2024) into chunks
3. 🔄 Generate embeddings for all chunks
4. 🔄 Implement clustering (UMAP + GMM)
5. 🔄 Validation and testing

### Betsy's Responsibilities
1. ✅ Set up AWS EC2 infrastructure
2. ✅ Deploy Open WebUI Docker container
3. 🔄 Install and configure Ollama
4. 🔄 Implement recursive summarization
5. 🔄 ChromaDB setup and ingestion
6. 🔄 RAG pipeline integration

### Collaborative Work
- RAPTOR clustering + summarization
- Query interface testing
- Evaluation with RAGAS
- Documentation

---

## Technical Specifications

### Chunking Strategy (Proven from Local Prototype)
- **Core chunk:** 500 tokens (stored)
- **Context window:** 100 tokens (50 before + 50 after)
- **Extended chunk:** ~600 tokens (500 core + 100 context; this is the text that gets embedded)
- **Method:** Fixed neighbor-token context window, adapted from Anthropic Contextual Retrieval (which prepends LLM-generated context instead)
- **Overhead:** 19.9% (efficient)
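
A minimal sketch of this chunking scheme, assuming the filing has already been tokenized into a list (the tokenizer used by `run_02_processing.py` is not specified here). The function name `contextual_chunks` and the dictionary keys are illustrative:

```python
from typing import Dict, List


def contextual_chunks(
    tokens: List[str], core_size: int = 500, context: int = 50
) -> List[Dict]:
    """Split tokens into core chunks, each with up to `context` tokens on both sides."""
    chunks = []
    for start in range(0, len(tokens), core_size):
        end = min(start + core_size, len(tokens))
        ext_start = max(0, start - context)        # clipped at document start
        ext_end = min(len(tokens), end + context)  # clipped at document end
        chunks.append(
            {
                "core": tokens[start:end],              # stored
                "extended": tokens[ext_start:ext_end],  # embedded
                "core_span": (start, end),
            }
        )
    return chunks


if __name__ == "__main__":
    doc = [f"tok{i}" for i in range(1200)]
    for c in contextual_chunks(doc):
        print(c["core_span"], len(c["core"]), len(c["extended"]))
```

An interior chunk embeds 600 tokens for 500 stored, a 20% overhead; chunks at the start or end of a filing get context on one side only, which is consistent with a measured overhead slightly under 20%.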

### Embedding Model
- **Model:** sentence-transformers/all-MiniLM-L6-v2
- **Dimensions:** 384
- **Speed:** ~1000 chunks/second (CPU), faster on GPU
- **Normalized:** Yes (L2 norm = 1.0)
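
A hedged sketch of how these settings map onto the `sentence-transformers` API. The batch size is an assumption to tune for the instance, and `embed_chunks` and `check_unit_norm` are illustrative names rather than the contents of `run_03_embeddings.py`. The library import is deferred so the norm check can be used without the model installed:

```python
import math
from typing import List, Sequence

MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"


def embed_chunks(texts: List[str], batch_size: int = 256):
    """Encode texts into L2-normalized 384-dim vectors (NumPy array)."""
    # Requires `pip install sentence-transformers`; imported lazily on purpose.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(MODEL_NAME)
    return model.encode(
        texts,
        batch_size=batch_size,       # assumed value; tune for available memory
        normalize_embeddings=True,   # yields unit-length vectors
        show_progress_bar=True,
    )


def check_unit_norm(vectors: Sequence[Sequence[float]], tol: float = 1e-3) -> bool:
    """True if every vector has L2 norm within `tol` of 1.0."""
    return all(
        abs(math.sqrt(sum(x * x for x in v)) - 1.0) <= tol for v in vectors
    )


if __name__ == "__main__":
    print(check_unit_norm([[0.6, 0.8]]))  # a unit-length example vector
```

One thing worth verifying before the full run: the model card for all-MiniLM-L6-v2 states that input longer than 256 word pieces is truncated by default, so `model.max_seq_length` should be checked against the length of the extended chunks.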

### Infrastructure
- **Instance:** AWS EC2 "secai" (Ubuntu 24.04)
- **Location:** `/app/` (project root)
- **Docker:** Open WebUI + Ollama containers
- **Access:** SSH via key-based auth, port 3000 for WebUI

### Data Scale (Full Production)
**When complete (1993-2024):**
- **Filings:** ~300,000+ (32 filing years, 1993-2024 inclusive)
- **Chunks:** ~30-40 million
- **Embeddings:** ~46-61GB (384-dim float32 vectors for 30-40 million chunks)
- **Processed data:** ~180-200GB
- **ChromaDB:** ~250GB total (with summaries)

---

## Archived Local Prototype Results

### Step 2: Text Processing (Local - Reference Only)
**Status:** ✅ Completed locally on 2025-10-16

**Results (2024 data only):**
- Filings processed: 26,014 / 26,014 (100%)
- Total chunks: 2,725,171
- Processing time: 42.1 minutes
- Output: `processed_2024_500tok_contextual.json` (15GB)

**Token statistics:**
- Total tokens: 1.36 billion (core)
- Extended tokens: 1.63 billion (for embedding)
- Context overhead: 19.9%
- Avg tokens/filing: 52,128

**Note:** This local prototype validated the approach. Production processing happens on AWS EC2 for all years.

---

## Success Criteria

### Phase 1-2 (Processing + Embedding)
- [ ] All 1993-2024 data processed successfully
- [ ] Embeddings generated for all chunks
- [ ] No data loss or corruption
- [ ] Reasonable processing time (<1 week total)

### Phase 3 (RAPTOR)
- [ ] Clustering produces coherent topics
- [ ] 3-level summarization accurate
- [ ] Hierarchical structure adds value
- [ ] Manual validation passes

### Phase 4 (Deployment)
- [ ] Open WebUI connects to Ollama
- [ ] ChromaDB retrieval works correctly
- [ ] Query latency < 10 seconds
- [ ] System stable under load

### Overall System
- [ ] Answers factually accurate (90%+)
- [ ] RAPTOR outperforms simple RAG
- [ ] Citations reference correct filings
- [ ] RAGAS evaluation scores high

---

## Next Immediate Steps

1. **Kabe:**
   - Transfer `run_02_processing.py` to EC2
   - Modify paths for `/app/data/edgar/2024/`
   - Run processing on 2024 data (already unzipped)
   - Monitor and validate output

2. **Betsy:**
   - Fix Ollama connection to Open WebUI
   - Verify Docker containers healthy
   - Prepare for RAPTOR implementation

3. **Both:**
   - Upload remaining EDGAR data (1993-2020)
   - Plan RAPTOR collaboration workflow
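
Adapting the scripts for EC2 mostly means replacing local paths. One way to keep that in a single place is sketched below, assuming the scripts can take their paths from a small config object. `PipelinePaths` and the `SECAI_DATA_ROOT` environment variable are hypothetical; the directory names are the ones listed in this plan:

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Optional, Union


@dataclass(frozen=True)
class PipelinePaths:
    raw: Path         # unzipped filings for one year
    processed: Path   # chunk JSON output
    embeddings: Path  # .npy embeddings + metadata

    @classmethod
    def for_year(
        cls, year: int, root: Optional[Union[str, Path]] = None
    ) -> "PipelinePaths":
        # SECAI_DATA_ROOT is a hypothetical override for local testing.
        base = Path(root or os.environ.get("SECAI_DATA_ROOT", "/app/data"))
        return cls(
            raw=base / "edgar" / str(year),
            processed=base / "processed",
            embeddings=base / "embeddings",
        )


if __name__ == "__main__":
    paths = PipelinePaths.for_year(2024)
    print(paths.raw, paths.processed, paths.embeddings)
```

With this shape, the same script can run on a laptop by pointing the root at a local folder, and on EC2 with no arguments.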

---

## Resources

### Completed Files (Local Prototype):
- ✅ `02_text_processing.ipynb` - Local processing (archived)
- ✅ `run_02_processing.py` - Script (to be adapted for EC2)
- ✅ `03_embedding_generation.ipynb` - Embedding notebook
- ✅ `run_03_embeddings.py` - Script (to be run on EC2)

### To Be Created:
- `04_raptor_clustering.ipynb`
- `05_raptor_summarization.ipynb`
- `06_chromadb_setup.ipynb`
- `07_rag_query_interface.ipynb`

### References:
- [Anthropic Contextual Retrieval](https://www.anthropic.com/news/contextual-retrieval)
- [RAPTOR Paper](https://arxiv.org/abs/2401.18059)
- [Sentence-BERT Paper](https://arxiv.org/abs/1908.10084)
- [MTEB Benchmark](https://arxiv.org/abs/2210.07316)
- [ChromaDB Docs](https://docs.trychroma.com/)
- [Ollama](https://ollama.com/)
- [Open WebUI](https://github.com/open-webui/open-webui)

---

**Status:** 🔄 Production Deployment In Progress

**Last Updated:** 2025-10-17

**Key Change:** Transitioned from prototype to full production deployment on AWS EC2 with complete 1993-2024 dataset.