# Production Plan: RAPTOR RAG with SEC EDGAR Filings (1993-2024)

## Overview
This notebook documents the full production RAPTOR RAG system deployment:
- **Data**: All SEC 10-K/10-Q filings (1993-2024) - ~51GB total, expanding from initial 2024 prototype
- **Infrastructure**: AWS EC2 g6.2xlarge (GPU-accelerated) - All processing, embedding, and deployment
- **Models**: Ollama with FinGPT-v3 and other financial LLMs
- **Goal**: Production-ready RAG system with hierarchical retrieval via Open WebUI

## Evolution: Prototype → Production

**Initial Prototype Plan (Archived):**
- 2024 only (26,014 filings)
- Local PC processing → AWS EC2 embedding → handoff
- Test RAPTOR before scaling

**Current Production Reality:**
- **2021-2024 uploaded** (6.6GB compressed, ~100K+ filings)
- **Target: 1993-2024** (full 32-year dataset, ~300K+ filings)
- **All operations on AWS EC2** (processing + embedding + RAPTOR + deployment)
- **Open WebUI deployed** (currently port 3000 during troubleshooting, target port 8080)

---

## Current Status (2025-10-17)

### ✅ Infrastructure Deployed
**AWS EC2 Instance "secai":**
- **Instance Type**: g6.2xlarge (GPU-accelerated)
- **vCPUs**: 8
- **RAM**: 32 GB
- **GPU**: 1x NVIDIA L4 (24 GB GPU memory)
- **OS**: Ubuntu 24.04 LTS
- **Public IP**: 35.175.134.36
- **SSH Access**: Configured (key-based auth)
- **Data Directory**: `/app/data/edgar/`

**Deployed Services:**
- Open WebUI: Running in Docker on port 3000 while Ollama connectivity is being debugged; it moves to port 8080 once the connection works
- Security groups: Configured for authorized access (port 8080 already allowed)
- Admin account: dev@onyxgs.com (local auth, new users require admin approval)

### ✅ Data Uploaded to EC2
**Location:** `/app/data/edgar/`
- ✅ 10-X_C_2021.zip (1.6GB)
- ✅ 10-X_C_2022.zip (1.8GB)
- ✅ 10-X_C_2023.zip (1.7GB)
- ✅ 10-X_C_2024.zip (1.6GB)
- ✅ 2024/ (unzipped, ready for processing)
- 🔄 1993-2020 data (upload in progress)

**Total uploaded:** 6.6GB compressed (~25-30GB uncompressed when all extracted)
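
The remaining archives still need to be unzipped next to the existing `2024/` directory. A minimal sketch of that step follows, assuming the archives keep the `10-X_C_<year>.zip` naming shown above and that each archive should land in a directory named after its year. The internal layout of the archives is not documented here, so the target path may need adjusting if an archive already contains a top-level year folder. The function name `extract_year_archives` is hypothetical.

```python
import re
import zipfile
from pathlib import Path

# Assumed naming scheme, taken from the file list above (e.g. 10-X_C_2021.zip).
ARCHIVE_PATTERN = re.compile(r"^10-X_C_(\d{4})\.zip$")


def extract_year_archives(data_dir: Path) -> list[int]:
    """Extract each yearly archive into data_dir/<year>/.

    Years whose directory already exists are skipped, so rerunning is safe.
    Returns the list of years extracted by this call.
    """
    extracted = []
    for archive in sorted(data_dir.glob("10-X_C_*.zip")):
        match = ARCHIVE_PATTERN.match(archive.name)
        if match is None:
            continue  # ignore files that do not follow the assumed naming scheme
        year = int(match.group(1))
        target = data_dir / str(year)
        if target.exists():
            continue  # already unzipped, like 2024/ in the listing above
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
        extracted.append(year)
    return extracted


if __name__ == "__main__":
    print(extract_year_archives(Path("/app/data/edgar")))
```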

### ✅ Local Prototype Completed (Reference Only)
**Note:** This was done locally for testing and is archived. Production uses AWS EC2.

**Completed on local PC:**
- ✅ Text processing: 26,014 filings → 2.7M chunks (Step 2 complete)
- ✅ Output: `processed_2024_500tok_contextual.json` (15GB)
- ⏳ Embedding generation: Aborted locally (too resource-intensive)

**Key learnings from local prototype:**
- 500-token chunks worked well for SEC filings in the prototype
- Contextual chunking (Anthropic method) works well
- 42 minutes to process 26K filings
- 15GB output manageable for transfer

---

## Production Pipeline (AWS EC2)

### Phase 1: Data Processing 🔄
**Status:** Starting

**Tasks:**
1. Process unzipped 2024 data on EC2
2. Run `run_02_processing.py` adapted for EC2 paths
3. Generate chunks with surrounding context for embedding (contextual chunking)
4. Repeat for 2021-2023 data
5. Scale to 1993-2020 once uploaded

**Expected output for 2021-2024 (per-year figures from the 2024 prototype × 4 years):**
- ~26K filings/year × 4 years = ~104K filings
- ~2.7M chunks/year × 4 years = ~11M chunks
- ~15GB JSON/year × 4 years = ~60GB processed data
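
The projection above is a straight multiplication of the 2024 prototype figures. The sketch below reproduces it, under the assumption that every year resembles 2024 in filing count and chunk volume, which has not been verified for other years. The names `PER_YEAR` and `project` are hypothetical.

```python
# Per-year figures measured on the 2024 local prototype.
PER_YEAR = {
    "filings": 26_014,
    "chunks": 2_725_171,
    "json_gb": 15,
}


def project(years: int) -> dict:
    """Scale the 2024 figures linearly to the given number of years."""
    return {key: value * years for key, value in PER_YEAR.items()}


if __name__ == "__main__":
    estimate = project(4)
    print(f"filings: {estimate['filings']:,}")
    print(f"chunks:  {estimate['chunks']:,}")
    print(f"JSON:    {estimate['json_gb']} GB")
```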

**Location:** `/app/data/processed/`

### Phase 2: Embedding Generation ⏸️
**Status:** Awaiting Phase 1 completion

**Tasks:**
1. Load processed chunks on EC2
2. Run `run_03_embeddings.py` on EC2
3. Generate embeddings using sentence-transformers (GPU-accelerated on NVIDIA L4)
4. Store embeddings + metadata
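
The contents of `run_03_embeddings.py` are not shown in this plan, so the following is only a sketch of how batched GPU embedding could look. The helper names (`batched`, `embed_chunks`, `make_gpu_encoder`) and the batch size of 512 are assumptions. The `sentence_transformers` import is deferred so the batching helpers can be exercised without the library or a GPU.

```python
from typing import Callable, Iterator, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_chunks(
    texts: Sequence[str],
    encode_batch: Callable[[Sequence[str]], Sequence],
    batch_size: int = 512,  # assumed value; tune against GPU memory
) -> list:
    """Embed texts batch by batch and return one vector per text, in order."""
    vectors = []
    for batch in batched(texts, batch_size):
        vectors.extend(encode_batch(batch))
    return vectors


def make_gpu_encoder(model_name: str = "sentence-transformers/all-MiniLM-L6-v2"):
    """Build an encode_batch callable backed by sentence-transformers on CUDA."""
    # Imported here so the helpers above do not require the library.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name, device="cuda")
    return lambda batch: model.encode(list(batch), normalize_embeddings=True)
```

On the EC2 instance, usage would be along the lines of `embed_chunks(texts, make_gpu_encoder())`. Passing the encoder as a callable keeps the batching logic testable with a stub.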

**Expected output:**
- ~11M embeddings (384 dims each)
- ~17GB embeddings file (.npy)
- Metadata JSON (~2GB)

**Performance estimate with GPU:**
- NVIDIA L4 24GB GPU should process ~5,000-10,000 chunks/second
- Total time for 11M chunks: ~20-40 minutes (vs. 3+ hours on CPU)
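
The storage and timing estimates above follow from simple arithmetic. The sketch below reproduces them, assuming embeddings are stored as 4-byte float32 values and that the throughput figures are the estimates quoted in this plan rather than measurements.

```python
CHUNKS = 11_000_000
DIMS = 384
BYTES_PER_VALUE = 4  # assumption: float32 storage in the .npy file


def embedding_size_gb(chunks: int, dims: int = DIMS) -> float:
    """Raw size of the embedding matrix in decimal gigabytes."""
    return chunks * dims * BYTES_PER_VALUE / 1e9


def minutes_needed(chunks: int, chunks_per_second: float) -> float:
    """Wall-clock minutes at a constant throughput."""
    return chunks / chunks_per_second / 60


if __name__ == "__main__":
    print(f"embeddings: {embedding_size_gb(CHUNKS):.1f} GB")
    print(f"GPU at  5,000/s: {minutes_needed(CHUNKS, 5_000):.0f} min")
    print(f"GPU at 10,000/s: {minutes_needed(CHUNKS, 10_000):.0f} min")
    print(f"CPU at  1,000/s: {minutes_needed(CHUNKS, 1_000) / 60:.1f} h")
```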

**Location:** `/app/data/embeddings/`

### Phase 3: RAPTOR Implementation ⏸️
**Status:** Awaiting Phase 2

**Tasks:**
1. Hierarchical clustering (UMAP + GMM)
2. Recursive summarization (3 levels via Ollama)
3. ChromaDB setup and ingestion
4. Cluster validation
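
The recursive structure of tasks 1 and 2 can be sketched independently of the clustering and summarization backends. In the sketch below, `cluster_fn` stands in for the UMAP + GMM step and `summarize_fn` for the Ollama call; both are hypothetical callables, and the function name `build_raptor_tree` is an assumption. It is a simplification: it assigns each node to the clusters returned by `cluster_fn` and leaves embedding of nodes to that callable.

```python
from typing import Callable, Sequence


def build_raptor_tree(
    leaves: Sequence[str],
    cluster_fn: Callable[[Sequence[str]], list],
    summarize_fn: Callable[[Sequence[str]], str],
    max_levels: int = 3,
) -> list:
    """Return [leaves, level-1 summaries, level-2 summaries, ...].

    cluster_fn receives the current level's texts and returns a list of
    clusters, each a list of indices into that level.
    """
    levels = [list(leaves)]
    for _ in range(max_levels):
        current = levels[-1]
        if len(current) <= 1:
            break  # a single root node cannot be merged further
        clusters = cluster_fn(current)
        summaries = [
            summarize_fn([current[i] for i in members])
            for members in clusters
            if members
        ]
        if len(summaries) >= len(current):
            break  # clustering did not reduce the node count, so stop
        levels.append(summaries)
    return levels
```

Every level, including the leaves, would then be embedded and indexed so retrieval can match a query against either raw chunks or summaries.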

### Phase 4: Deployment ⏸️
**Status:** Infrastructure ready, awaiting data pipeline

**Completed:**
- ✅ Open WebUI Docker container deployed
- ✅ Port 8080 security rules configured
- ✅ Admin account configured (dev@onyxgs.com)
- 🔄 Currently running on port 3000 while troubleshooting Ollama connection

**Remaining:**
- Fix Ollama connection to Open WebUI (adjusting Docker run command)
- Move to port 8080 once connection working
- Load ChromaDB with embeddings + summaries
- Configure RAG query pipeline
- Test end-to-end queries
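
Before changing the Docker run command, it can help to confirm that Ollama itself answers on its HTTP API. The sketch below queries Ollama's `/api/tags` endpoint, which lists installed models, on the default port 11434. The function names are hypothetical, and the base URL is an assumption that depends on where the check runs.

```python
import json
import urllib.request


def tags_url(base_url: str) -> str:
    """Build the URL of Ollama's model-listing endpoint."""
    return base_url.rstrip("/") + "/api/tags"


def parse_model_names(payload: dict) -> list:
    """Extract model names from an /api/tags response body."""
    return [model["name"] for model in payload.get("models", [])]


def list_ollama_models(base_url: str = "http://localhost:11434", timeout: float = 5.0):
    """Return the installed model names, or None if Ollama is unreachable."""
    try:
        with urllib.request.urlopen(tags_url(base_url), timeout=timeout) as resp:
            return parse_model_names(json.load(resp))
    except (OSError, ValueError):
        # OSError covers refused connections and timeouts; ValueError covers bad JSON.
        return None


if __name__ == "__main__":
    models = list_ollama_models()
    print("Ollama unreachable" if models is None else models)
```

Running this on the host and again from inside the Open WebUI container narrows the problem down. Inside a container, `localhost` refers to the container itself, so a base URL that works on the host may fail there. That is one common cause of this symptom, not a confirmed diagnosis for this deployment.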

---

## Implementation Workflow

### Infrastructure & Deployment
1. ✅ AWS EC2 g6.2xlarge instance provisioned
2. ✅ Open WebUI Docker container deployed
3. ✅ Admin account configured
4. 🔄 Ollama connection troubleshooting (currently port 3000)
5. ⏸️ Finalize on port 8080 once connection stable
6. ⏸️ ChromaDB setup and ingestion
7. ⏸️ RAG pipeline integration

### Data Pipeline
1. ✅ Upload 2021-2024 data (6.6GB)
2. 🔄 Upload remaining 1993-2020 data (in progress)
3. ⏸️ Process all years into chunks
4. ⏸️ Generate embeddings for all chunks (GPU-accelerated)
5. ⏸️ Implement RAPTOR clustering (UMAP + GMM)
6. ⏸️ Recursive summarization via Ollama

### Testing & Validation
- ⏸️ Query interface testing
- ⏸️ Evaluation with RAGAS
- ⏸️ Performance benchmarking
- ⏸️ Documentation

---

## Technical Specifications

### AWS EC2 Instance
- **Type**: g6.2xlarge (GPU-accelerated compute)
- **vCPUs**: 8
- **RAM**: 32 GB
- **GPU**: 1x NVIDIA L4 Tensor Core GPU
  - GPU Memory: 24 GB GDDR6
  - CUDA Cores: 7,424
  - Tensor Cores: 232 (4th generation)
  - FP32 Performance: 30.3 TFLOPS
  - Optimized for: AI inference, embedding generation, LLM serving
- **OS**: Ubuntu 24.04 LTS
- **Location**: `/app/` (project root)
- **Network**: Enhanced networking enabled

### Chunking Strategy (Proven from Local Prototype)
- **Core chunk:** 500 tokens (stored)
- **Context window:** 100 tokens (50 before + 50 after)
- **Extended chunk:** ~600 tokens (500 core + 100 context; embedded)
- **Method:** Anthropic Contextual Retrieval
- **Overhead:** 19.9% (efficient)

### Embedding Model
- **Model:** sentence-transformers/all-MiniLM-L6-v2
- **Dimensions:** 384
- **Speed (CPU):** ~1,000 chunks/second
- **Speed (GPU - NVIDIA L4):** ~5,000-10,000 chunks/second (estimated)
- **Normalized:** Yes (L2 norm = 1.0)
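
Normalization matters because, for unit-length vectors, the dot product equals cosine similarity, so the vector store can use the cheaper inner product. The pure-Python sketch below illustrates the property on toy vectors; the helper names are hypothetical, and production code would use array operations instead.

```python
import math
from typing import Sequence


def l2_normalize(vec: Sequence[float]) -> list:
    """Scale vec to unit L2 norm; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity computed from the raw (unnormalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


if __name__ == "__main__":
    a, b = [3.0, 4.0], [1.0, 2.0]
    # After normalization the plain dot product matches cosine similarity.
    print(dot(l2_normalize(a), l2_normalize(b)), cosine(a, b))
```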

### Infrastructure
- **Docker:** Open WebUI + Ollama containers
- **Access:** SSH via key-based auth
- **Web UI:** Port 8080 (target), currently 3000 during setup, admin: dev@onyxgs.com

### Data Scale (Full Production)
**When complete (1993-2024):**
- **Filings:** ~300,000+ (32 years)
- **Chunks:** ~30-40 million
- **Embeddings:** ~46-61GB (30-40M chunks × 384 dims × 4 bytes, assuming float32)
- **Processed data:** ~180-200GB
- **ChromaDB:** ~250GB total (with summaries)

---

## Archived Local Prototype Results

### Step 2: Text Processing (Local - Reference Only)
**Status:** ✅ Completed locally on 2025-10-16

**Results (2024 data only):**
- Filings processed: 26,014 / 26,014 (100%)
- Total chunks: 2,725,171
- Processing time: 42.1 minutes
- Output: `processed_2024_500tok_contextual.json` (15GB)

**Token statistics:**
- Total tokens: 1.36 billion (core)
- Extended tokens: 1.63 billion (for embedding)
- Context overhead: 19.9%
- Avg tokens/filing: 52,128
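
The overhead figure can be checked from the two token totals. The sketch below uses the rounded totals reported above, so derived values may differ slightly from figures computed on exact counts.

```python
CORE_TOKENS = 1.36e9      # rounded total from the prototype run
EXTENDED_TOKENS = 1.63e9  # rounded total including context windows


def context_overhead(core: float, extended: float) -> float:
    """Fraction of extra tokens added by the context windows."""
    return extended / core - 1


if __name__ == "__main__":
    print(f"overhead: {context_overhead(CORE_TOKENS, EXTENDED_TOKENS):.1%}")
```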

**Note:** This local prototype validated the approach. Production processing happens on AWS EC2 for all years.

---

## Success Criteria

### Phase 1-2 (Processing + Embedding)
- [ ] All 1993-2024 data processed successfully
- [ ] Embeddings generated for all chunks
- [ ] No data loss or corruption
- [ ] Reasonable processing time (<1 week total)
- [ ] GPU utilization optimized (>80% during embedding generation)

### Phase 3 (RAPTOR)
- [ ] Clustering produces coherent topics
- [ ] 3-level summarization accurate
- [ ] Hierarchical structure adds value
- [ ] Manual validation passes

### Phase 4 (Deployment)
- [ ] Open WebUI connects to Ollama
- [ ] Service running on port 8080
- [ ] ChromaDB retrieval works correctly
- [ ] Query latency < 10 seconds
- [ ] System stable under load

### Overall System
- [ ] Answers factually accurate (90%+)
- [ ] RAPTOR outperforms simple RAG
- [ ] Citations reference correct filings
- [ ] RAGAS evaluation scores high

---

## Next Immediate Steps

1. **Infrastructure:**
   - Fix Ollama connection to Open WebUI (adjust Docker run command)
   - Verify Docker containers healthy
   - Move to port 8080 once stable
   - Verify GPU accessibility for embedding generation

2. **Data Processing:**
   - Complete upload of 1993-2020 data
   - Transfer `run_02_processing.py` to EC2
   - Modify paths for `/app/data/edgar/2024/`
   - Run processing on 2024 data (already unzipped)
   - Monitor and validate output

3. **GPU Optimization:**
   - Install NVIDIA drivers and the CUDA toolkit
   - Configure PyTorch (the backend used by sentence-transformers) for GPU
   - Test GPU-accelerated embedding generation
   - Benchmark performance vs. CPU
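
One way to verify that the GPU is visible, and later to watch utilization during embedding runs, is to query `nvidia-smi` from Python. The sketch below uses its `--query-gpu` CSV output; the function names are hypothetical, and it returns an empty list when the driver tools are not installed.

```python
import shutil
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=name,memory.total,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_gpu_rows(output: str) -> list:
    """Parse nvidia-smi CSV rows into dicts, one per GPU."""
    rows = []
    for line in output.strip().splitlines():
        name, memory_mib, util = [field.strip() for field in line.split(",")]
        rows.append({
            "name": name,
            "memory_mib": int(memory_mib),
            "utilization_pct": int(util),
        })
    return rows


def query_gpus() -> list:
    """Return GPU info, or an empty list if nvidia-smi is not on PATH."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_rows(result.stdout)


if __name__ == "__main__":
    print(query_gpus() or "No GPU visible: check driver installation")
```

Once PyTorch is installed, `torch.cuda.is_available()` gives a second check that the framework itself can reach the device.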

---

## Resources

### Completed Files (Local Prototype):
- ✅ `02_text_processing.ipynb` - Local processing (archived)
- ✅ `run_02_processing.py` - Script (to be adapted for EC2)
- ✅ `03_embedding_generation.ipynb` - Embedding notebook
- ✅ `run_03_embeddings.py` - Script (to be run on EC2 with GPU)

### To Be Created:
- `04_raptor_clustering.ipynb`
- `05_raptor_summarization.ipynb`
- `06_chromadb_setup.ipynb`
- `07_rag_query_interface.ipynb`

### References:
- [Anthropic Contextual Retrieval](https://www.anthropic.com/news/contextual-retrieval)
- [RAPTOR Paper](https://arxiv.org/abs/2401.18059)
- [Sentence-BERT Paper](https://arxiv.org/abs/1908.10084)
- [MTEB Benchmark](https://arxiv.org/abs/2210.07316)
- [ChromaDB Docs](https://docs.trychroma.com/)
- [Ollama](https://ollama.com/)
- [Open WebUI](https://github.com/open-webui/open-webui)
- [AWS g6 Instances](https://aws.amazon.com/ec2/instance-types/g6/)
- [NVIDIA L4 GPU](https://www.nvidia.com/en-us/data-center/l4/)

---

**Status:** 🔄 Production Deployment In Progress

**Last Updated:** 2025-10-17

**Key Change:** Transitioned from prototype to full production deployment on AWS EC2 g6.2xlarge with GPU acceleration for embedding generation.