# SEC 10-K/10-Q Analysis - RAPTOR RAG Project Plan

## Project Overview
AI-powered system for analyzing complete SEC 10-K and 10-Q filings (1993-2024) using RAPTOR RAG (Recursive Abstractive Processing for Tree-Organized Retrieval). The system builds an enhanced knowledge base from financial filings that users can query interactively to identify year-over-year changes, patterns, and potential anomalies.

**Data Coverage:** 1993-2024 (31 years of SEC EDGAR filings, ~51 GB)

---

## Core Architecture

### Infrastructure
- **Development Environment**: Local Windows laptop with Ollama + Docker Desktop
- **Production Deployment**: AWS EC2 instance (r6i.4xlarge: 128 GB RAM, 500 GB EBS storage)
- **Containerization**: Docker Compose for service orchestration
- **Model Hosting**: Ollama (containerized)
- **User Interface**: Open WebUI (containerized)
- **Vector Database**: ChromaDB (file-based, stored on EC2 EBS volume)
- **Data Storage**: EC2 EBS volume for processed embeddings and knowledge base

### Deployment Architecture

**Development (Phase 2):**
```
Local Laptop (32 GB RAM)
├── Docker Compose
│   ├── Ollama Container (gpt-oss 13 GB)
│   ├── RAPTOR API Container (Python app)
│   └── Open WebUI Container
└── Local Volumes
    ├── ./data/external (sample data)
    ├── ./data/embeddings (ChromaDB)
    └── ./models (Ollama models)
```

**Production (Phase 4):**
```
AWS EC2 (r6i.4xlarge - 128 GB RAM)
├── Docker Compose
│   ├── Ollama Container (llama3-sec 49 GB)
│   ├── RAPTOR API Container (Python app)
│   └── Open WebUI Container
└── EBS Volume (500 GB)
    ├── /data/external (51 GB SEC filings)
    ├── /data/processed (chunked filings)
    ├── /data/embeddings (ChromaDB vector DB)
    └── /data/models (Ollama models)
```

### Architecture Diagram

![System Architecture](diagrams/architecture.png)

### RAPTOR RAG System
Unlike traditional RAG systems that use simple similarity search, RAPTOR implements:
- **Hierarchical Clustering**: Multi-level organization (global + local) using UMAP and Gaussian Mixture Models
- **Recursive Summarization**: 3-level hierarchical summaries capturing both granular details and high-level concepts
- **Enhanced Context Retrieval**: Cluster-aware retrieval providing richer context for LLM queries

---

## Technical Stack

### NLP & ML

**Development Model (Phase 2):**
- **gpt-oss** (13 GB) - General purpose with reasoning capabilities
- Suitable for: Local testing, rapid iteration, pipeline validation
- RAM requirement: ~16-20 GB

**Production Model (Phase 4):**
- **llama3-sec** (Arcee AI, Llama-3-70B based, fine-tuned for SEC filings)
  - Source: `arcee-ai/llama3-sec` on Ollama
  - Trained on 72B tokens of SEC filing data (20B-token checkpoint available)
  - Deployment: Via Ollama in Docker container
  - Size: 49 GB download, 4-bit quantized
  - RAM requirement: at least the 49 GB of weights plus KV-cache and runtime overhead, which exceeds the 32 GB laptop (hence the 128 GB EC2 instance)
  - Specialized for: SEC data analysis, investment analysis, risk assessment

**Fallback Models:**
- qwen2.5:1.5b (986 MB) - Lightweight for quick testing

**RAPTOR Implementation:**
- Adapted from FinGPT's `fingpt/FinGPT_FinancialReportAnalysis/utils/rag.py`
- Source: https://github.com/AI4Finance-Foundation/FinGPT
- Custom implementation in `src/models/raptor.py`

**Embeddings & Clustering:**
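- Sentence Transformers (`all-MiniLM-L6-v2`) for local, cost-free embedding generation; the model truncates inputs longer than 256 word pieces by default, so larger chunks are only partially embedded unless they are split further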
- UMAP (dimensionality reduction) + scikit-learn GMM for clustering
- LLM Interface: Ollama Python client

### Infrastructure & Deployment

**Containerization:**
- Docker + Docker Compose for all services
- Reproducible environments (dev → production)
- Service isolation and orchestration

**AWS EC2 Configuration (Production):**
- **Instance Type**: r6i.4xlarge (memory-optimized)
  - 16 vCPUs
  - 128 GB RAM (sufficient for llama3-sec 70B model)
  - ~$0.80-1.00/hour (~$600-750/month if running 24/7)
- **Storage**: 500 GB EBS gp3 volume
  - 51 GB raw data + processed embeddings + models
- **OS**: Ubuntu 22.04 LTS
- **Security**: VPC with restricted security groups, SSH key access

**Data Processing:**
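- Chunking: LangChain `RecursiveCharacterTextSplitter` (target 2000-4000 tokens/chunk; the splitter counts characters by default, so token-based sizing needs `from_tiktoken_encoder` or a custom `length_function`)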
- Vector Storage: ChromaDB (file-based, stored on EBS volume)
- Data Format: JSON/Parquet for structured storage
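
To make the chunk-size parameter concrete, here is a standard-library stand-in for the chunker. It is a sketch under stated assumptions, not the LangChain splitter: whitespace-separated words approximate tokens, and the `overlap` parameter is a hypothetical addition that the plan does not specify.

```python
from typing import List

def chunk_by_tokens(text: str, chunk_size: int = 2000, overlap: int = 200) -> List[str]:
    """Split text into windows of about chunk_size whitespace tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    tokens = text.split()
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks

sample = " ".join(f"w{i}" for i in range(10))
print(chunk_by_tokens(sample, chunk_size=4, overlap=1))
# ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```

Running the same function with each candidate size on a sample filing gives a quick way to compare how many chunks each setting produces.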

### Libraries
- `langchain`, `langchain_community` - LLM orchestration
- `sentence-transformers` - Local embeddings
- `umap-learn` - Dimensionality reduction
- `scikit-learn` - Clustering algorithms (GMM)
- `pandas`, `numpy` - Data manipulation
- `requests` - SEC EDGAR API access
- `ollama` - Python client for Ollama
- `docker`, `docker-compose` - Containerization

---

## Data Scope

### Current Data Holdings
- **Time Period:** 1993-2024 (31 years)
- **Data Size:** ~51 GB
- **Data Location:** `data/external/`
- **Filing Types:** Complete 10-K (annual reports) and 10-Q (quarterly filings)
- **Processing Scope:** Full filing text (all sections)
- **Analysis Focus:** Year-over-year changes, topic trends, boilerplate vs. substantive disclosure

**Why Process Complete Filings:**
- RAG enables users to ask questions about ANY section (not just risk factors)
- RAPTOR clustering naturally organizes content by topic regardless of section
- Maximizes system flexibility and future-proofs the knowledge base
- Supports queries like: "How did revenue recognition policies change?" or "Compare executive compensation across years"

---

## RAPTOR Pipeline Flowchart

![RAPTOR Pipeline](diagrams/raptor_pipeline.png)

---

## Data Processing Workflow

![Data Processing Workflow](diagrams/data_processing_workflow.png)

---

## Implementation Phases

### Phase 1: Model Research & Setup (Week 1) ✅ COMPLETE
**Objectives:**
- [x] Clarify FinGPT components: FinGPT-v3 (an LLM) vs RAPTOR (a Python implementation)
- [x] Set up Ollama and test local LLM deployment
- [x] Evaluate and download appropriate models for SEC filing analysis
- [x] Set up project structure (`src/`, `data/`, `notebooks/`, `dashboard/`)
- [ ] Copy and adapt RAPTOR class from FinGPT's `rag.py` (deferred to Phase 3)
- [ ] Create base `Raptor` class skeleton in `src/models/raptor.py` (deferred to Phase 3)

**Completed Actions:**
- ✅ Installed Ollama v0.12.5 on Windows
- ✅ Downloaded qwen2.5:1.5b (986 MB) - lightweight model for testing
- ✅ Downloaded gpt-oss (13 GB) - reasoning-capable model for development
- ✅ Verified Python ollama package integration
- ✅ Tested model inference via command line and Python

**Model Selection Strategy:**
- **Development (Phase 2):** gpt-oss (13 GB) on local laptop
- **Production (Phase 4):** llama3-sec (49 GB) on AWS EC2
- **Rationale:** Start with smaller model for faster iteration, scale to specialized model in production

**Key Clarification:**
- ✅ **FinGPT-v3** = Fine-tuned LLM (downloadable from Hugging Face, runnable in Ollama)
- ✅ **RAPTOR** = Python implementation for hierarchical clustering/summarization (we copy the code)
- ✅ **fingpt-rag** = Deprecated project name (not a model), replaced by newer implementations
- ✅ **llama3-sec** = Domain-specific SEC filing model (best fit for production)
- ✅ **gpt-oss** = General purpose reasoning model (best fit for development)

---

### Phase 2: Data Processing Pipeline (Week 2) - IN PROGRESS (Sample Testing)
**Objectives:**
- [ ] Extract filings from sample archives
- [ ] Parse complete 10-K/10-Q text from HTML/XML/SGML formats
- [ ] Implement document chunking (test 2000-4000 token ranges)
- [ ] Generate embeddings using local Sentence Transformers
- [ ] Store structured data (chunks + metadata) in JSON/Parquet
- [ ] Set up Docker Compose for local development environment

**Key Files:**
- `src/data/filing_extractor.py` - Unzip archives, extract full filing text
- `src/data/text_processor.py` - Clean text, chunk into configurable token segments
- `src/models/embedding_generator.py` - Embedding creation
- `docker-compose.dev.yml` - Development environment setup

**Development Setup (Docker Compose):**
```yaml
services:
  ollama:
    image: ollama/ollama
    volumes:
      - ./models:/root/.ollama
    ports:
      - "11434:11434"
  
  raptor-api:
    build: ./src
    volumes:
      - ./data:/app/data
      - ./src:/app/src
    environment:
      - OLLAMA_HOST=ollama:11434
      - MODEL_NAME=gpt-oss
    depends_on:
      - ollama
  
  webui:
    image: ghcr.io/open-webui/open-webui
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama
```

**Current Status:**
- Working with sample filings to validate processing approach
- Testing chunk sizes: 200, 500, 1000, 2000, 3000, 4000 tokens
- Using gpt-oss (13 GB) for development
- Full pipeline implementation with complete dataset deferred to Phase 4

**Validation:**
- Test on 3-5 sample filings from different time periods (1995, 2010, 2024)
- Verify text extraction accuracy across HTML/XML/SGML formats
- Confirm chunking preserves semantic coherence
- Compare chunk quality across different token sizes

---

### Phase 3: RAPTOR System Implementation (Week 3)
**Objectives:**
- [ ] Copy RAPTOR class from FinGPT to `src/models/raptor.py`
- [ ] Implement hierarchical clustering (adapted from FinGPT's RAPTOR):
  - Global clustering (UMAP → GMM with BIC for optimal cluster count)
  - Local clustering (secondary refinement within global clusters)
- [ ] Build recursive summarization engine (3 levels deep)
- [ ] Create enhanced knowledge base combining:
  - Original document chunks
  - Level 1 summaries (cluster summaries)
  - Level 2 summaries (summary of summaries)
  - Level 3 summaries (highest abstraction)
- [ ] Implement cluster-aware retrieval mechanism
- [ ] Test on sample data with gpt-oss

**Key Methods in `Raptor` class (adapted from FinGPT):**
```python
class Raptor:
    # Signatures only; bodies are adapted from FinGPT's rag.py in Phase 3.
    def global_cluster_embeddings(self, embeddings, dim, n_neighbors, metric="cosine"): ...
    def local_cluster_embeddings(self, embeddings, dim, num_neighbors=10): ...
    def get_optimal_clusters(self, embeddings, max_clusters=50): ...
    def GMM_cluster(self, embeddings, threshold, random_state=0): ...
    def perform_clustering(self, embeddings, dim, threshold): ...
    def recursive_embed_cluster_summarize(self, texts, level=1, n_levels=3): ...
```
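
The `threshold` argument of `GMM_cluster` suggests soft assignment: a chunk joins every cluster whose membership probability exceeds the threshold, so one chunk can feed several summaries. The sketch below shows only that step. It is our reading of the signature rather than a verified copy of FinGPT's code; `assign_clusters` is a hypothetical helper and the probability matrix is made up for illustration.

```python
from typing import List

def assign_clusters(probabilities: List[List[float]], threshold: float) -> List[List[int]]:
    """For each chunk, return the indices of clusters with probability > threshold."""
    return [
        [cluster for cluster, p in enumerate(row) if p > threshold]
        for row in probabilities
    ]

# Rows are chunks, columns are clusters (illustrative values only).
probs = [
    [0.90, 0.05, 0.05],  # clearly cluster 0
    [0.45, 0.50, 0.05],  # straddles clusters 0 and 1
    [0.05, 0.05, 0.90],  # clearly cluster 2
]
labels = assign_clusters(probs, threshold=0.3)
print(labels)  # [[0], [0, 1], [2]]
```

A lower threshold lets more chunks appear in several clusters; a threshold that is too high can leave a chunk with no cluster at all, which is worth checking during validation.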

**Source Reference:**
- Original implementation: https://github.com/AI4Finance-Foundation/FinGPT/blob/master/fingpt/FinGPT_FinancialReportAnalysis/utils/rag.py

**Testing:**
- Validate clustering quality on sample documents
- Review generated summaries for coherence
- Ensure topics are properly grouped (e.g., all revenue-related content clusters together)
- Benchmark with gpt-oss before scaling to production

---

### Phase 4: Production Deployment on AWS EC2 (Week 4-5)
**Objectives:**
- [ ] Provision AWS EC2 instance (r6i.4xlarge, 128 GB RAM, 500 GB storage)
- [ ] Set up Docker + Docker Compose on EC2
- [ ] Deploy Ollama container with llama3-sec model (49 GB)
- [ ] Deploy RAPTOR API container with full pipeline
- [ ] Deploy Open WebUI container for user interaction
- [ ] Process complete 51 GB dataset into knowledge base
- [ ] Implement query handling for diverse topics across filings
- [ ] Create sample query templates for common use cases
- [ ] Set up monitoring and logging

**EC2 Setup Steps:**
1. Launch r6i.4xlarge instance (Ubuntu 22.04)
2. Configure security groups (restrict access)
3. Attach 500 GB EBS gp3 volume
4. Install Docker + Docker Compose
5. Clone repository to EC2
6. Start services with `docker-compose -f docker-compose.prod.yml up -d`
7. Pull llama3-sec model via Ollama (the Ollama container must be running first)
8. Process full 51 GB dataset

**Production Docker Compose:**
```yaml
services:
  ollama:
    image: ollama/ollama
    volumes:
      - /data/models:/root/.ollama
    deploy:
      resources:
        limits:
          memory: 60G
  
  raptor-api:
    build: ./src
    volumes:
      - /data:/app/data
    environment:
      - OLLAMA_HOST=ollama:11434
      - MODEL_NAME=llama3-sec
      - CHUNK_SIZE=2000
    depends_on:
      - ollama
  
  webui:
    image: ghcr.io/open-webui/open-webui
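    ports:
      # Open WebUI serves plain HTTP on 8080; publishing it on 443 does not add TLS.
      # HTTPS needs a TLS-terminating reverse proxy in front of this service.
      - "443:8080"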
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama
```

**Integration Workflow:**
1. User submits query via Open WebUI (web interface)
2. RAPTOR retrieves relevant chunks + hierarchical summaries from ChromaDB
3. Context passed to Ollama LLM (llama3-sec) via API
4. LLM generates response with supporting evidence
5. Results displayed in WebUI with citations
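
The workflow above can be sketched end to end with an in-memory knowledge base. Everything here is a simplified assumption: `retrieve`, `build_prompt`, and `answer` are hypothetical names, the vectors and texts are placeholders rather than real filings, and `generate` is any callable that takes a prompt. In the planned system, retrieval would query ChromaDB and `generate` would wrap the Ollama client.

```python
import math
from typing import Callable, Dict, List, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec: Sequence[float], items: List[Dict], k: int = 2) -> List[Dict]:
    """Rank chunks and summaries together by similarity to the query."""
    ranked = sorted(items, key=lambda it: cosine(query_vec, it["vector"]), reverse=True)
    return ranked[:k]

def build_prompt(question: str, hits: List[Dict]) -> str:
    """Number each retrieved passage so the answer can cite it as [n]."""
    context = "\n".join(
        f"[{i}] ({h['source']}) {h['text']}" for i, h in enumerate(hits, 1)
    )
    return (
        "Answer using only the context and cite sources as [n].\n\n"
        f"{context}\n\nQuestion: {question}"
    )

def answer(question: str, query_vec: Sequence[float], items: List[Dict],
           generate: Callable[[str], str], k: int = 2) -> str:
    """Retrieve context, build the prompt, and hand it to the LLM callable."""
    return generate(build_prompt(question, retrieve(query_vec, items, k)))

# Toy knowledge base: placeholder texts and 2-D vectors.
kb = [
    {"source": "chunk", "text": "Placeholder risk-factor passage.", "vector": [1.0, 0.0]},
    {"source": "level-1 summary", "text": "Placeholder cluster summary.", "vector": [0.8, 0.6]},
    {"source": "chunk", "text": "Placeholder compensation passage.", "vector": [0.0, 1.0]},
]
prompt = build_prompt("What risks were disclosed?", retrieve([1.0, 0.1], kb))
print(prompt)
```

Numbering the passages in the prompt is one simple way to support the citations in step 5, since the model can refer back to `[1]`, `[2]`, and so on.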

**Example Queries:**
- "What cyber risks did Apple disclose in 2023?"
- "How have revenue recognition policies evolved from 2010 to 2024?"
- "Compare executive compensation disclosures between tech companies"
- "Show boilerplate vs. substantive language in risk disclosures"

**Cost Estimation:**
- EC2 r6i.4xlarge: ~$0.80-1.00/hour
- EBS gp3 500GB: ~$40/month
- Data transfer: Minimal (queries only)
- **Total**: ~$600-800/month for 24/7 operation
- **Optimization**: Stop instance when not in use, use spot instances for batch processing
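
The monthly total follows from simple arithmetic on the figures above. The sketch below assumes an average month of 730 hours; the part-time schedule (8 hours a day, 22 days a month) is a hypothetical usage pattern added only to show the effect of stopping the instance.

```python
HOURS_PER_MONTH = 730  # average month: 8760 hours / 12

def monthly_cost(hourly_rate: float, hours: float = HOURS_PER_MONTH,
                 ebs_monthly: float = 40.0) -> float:
    """EC2 compute plus EBS storage for one month, in USD."""
    return hourly_rate * hours + ebs_monthly

low = monthly_cost(0.80)                      # 0.80 * 730 + 40
high = monthly_cost(1.00)                     # 1.00 * 730 + 40
part_time = monthly_cost(1.00, hours=8 * 22)  # hypothetical 176 h/month
print(f"24/7: ${low:.0f}-${high:.0f}/month; part-time: ${part_time:.0f}/month")
```

Note that the EBS volume is billed whether or not the instance is running, so stopping the instance removes only the compute portion of the bill.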

**Deliverables:**
- Functional production system on EC2
- Dockerized, reproducible deployment
- Full 51 GB dataset processed into knowledge base
- Open WebUI interface accessible via HTTPS (requires a TLS-terminating reverse proxy or load balancer in front of Open WebUI)
- Documentation for deployment and maintenance

---

## Docker Strategy

### Development vs. Production

**Development (Local Laptop):**
- Purpose: Build and test pipeline with sample data
- Model: gpt-oss (13 GB) - fits in 32 GB RAM
- Data: Sample filings (~1-5 GB)
- Services: Ollama + RAPTOR API + Open WebUI
- Command: `docker-compose -f docker-compose.dev.yml up`

**Production (AWS EC2):**
- Purpose: Full-scale deployment with complete dataset
- Model: llama3-sec (49 GB) - requires 128 GB RAM
- Data: Complete 51 GB SEC filings (1993-2024)
- Services: Same as dev, with production configs
- Command: `docker-compose -f docker-compose.prod.yml up -d`

### Benefits of Docker Approach:

1. **Reproducibility**: Same environment dev → production
2. **Isolation**: Services don't conflict (different Python versions, dependencies)
3. **Portability**: Works on laptop, EC2, teammate's machine
4. **Scalability**: Easy to add services (monitoring, caching, etc.)
5. **Version Control**: Infrastructure as code (`Dockerfile`, `docker-compose.yml`)
6. **Easy Deployment**: `git pull && docker-compose up -d --build` deploys updates (`--build` rebuilds the RAPTOR API image)

### Repository Structure with Docker
```
edgar_anomaly_detection/
├── data/
│   ├── external/         # Downloaded filings (gitignored)
│   ├── processed/        # Chunked filings (gitignored)
│   └── embeddings/       # ChromaDB files (gitignored)
├── src/
│   ├── Dockerfile        # RAPTOR API container definition
│   ├── requirements.txt  # Python dependencies
│   ├── data/
│   │   ├── filing_extractor.py
│   │   └── text_processor.py
│   ├── models/
│   │   ├── raptor.py
│   │   ├── embedding_generator.py
│   │   └── clustering.py
│   └── api/
│       └── main.py       # FastAPI server for RAPTOR
├── notebooks/
│   └── 01_project_plan.ipynb
├── docker-compose.dev.yml   # Development setup
├── docker-compose.prod.yml  # Production setup (EC2)
├── .dockerignore
├── .gitignore
├── requirements.txt
└── README.md
```

---

## RAPTOR vs. Traditional RAG Comparison

| Feature | Traditional RAG | RAPTOR RAG |
|---------|----------------|------------|
| Text Processing | Simple chunking | Recursive, hierarchical |
| Clustering | None or basic | Multi-level (global + local) |
| Summarization | None or single-level | Recursive, 3-level |
| Context Selection | Similarity-based only | Cluster-aware + similarity |
| Document Understanding | Flat representation | Hierarchical representation |
| Knowledge Integration | Direct chunks only | Chunks + multi-level summaries |

**Why RAPTOR for Financial Filings?**
- Financial documents have hierarchical structure (sections, subsections, themes)
- YoY analysis requires understanding both granular changes and high-level shifts
- Boilerplate detection benefits from cluster analysis (repetitive language clusters together)
- Complex queries need multi-level context (e.g., "How did regulatory disclosures evolve?")
- Historical coverage (1993-2024) enables long-term trend analysis

---

## Success Metrics
- [ ] Successfully process 90%+ of downloaded filings (1993-2024) into knowledge base
- [ ] Clustering produces coherent, interpretable topic groups
- [ ] Generated summaries accurately capture content at each hierarchical level
- [ ] LLM queries return relevant, accurate responses with supporting evidence
- [ ] System responds to queries in <10 seconds (including retrieval + generation)
- [ ] Manual validation: Test 10 diverse queries across different topics and decades, verify accuracy
- [ ] Docker deployment: Services start successfully on both dev and production
- [ ] EC2 deployment: System runs stably for 7+ days without intervention

---

## Key Advantages of AI-First Approach
1. **No Manual Feature Engineering**: LLM infers patterns from enhanced context (vs. building YoY diff algorithms)
2. **Flexible Queries**: Users can ask arbitrary questions about ANY topic or section
3. **Semantic Understanding**: Detects substantive changes even when wording differs
4. **Scalable**: Adding new filings just requires re-running RAPTOR pipeline
5. **Explainable**: LLM can cite specific sections supporting its conclusions
6. **Historical Depth**: 31 years of data enables long-term trend analysis
7. **Reproducible**: Docker ensures consistent environment across deployments
8. **Cost-Effective**: EC2 instance can be stopped when not in use

---

## Technical Challenges & Mitigations

### Challenge 1: Model Size vs. Available RAM
- **Issue**: llama3-sec (49 GB) does not fit in the local laptop's 32 GB of RAM
- **Solution**: 
  - Development: Use gpt-oss (13 GB) on laptop
  - Production: Deploy llama3-sec on AWS EC2 r6i.4xlarge (128 GB RAM)
  - Benefits: Faster iteration locally, best quality in production

### Challenge 2: Embedding Generation at Scale
- **Issue**: Processing 31 years of complete filings (~51 GB) requires significant compute power
- **Solution**: Use EC2 instance, batch processing, cache embeddings, process in chronological chunks
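
Batching and caching can be combined so that an interrupted run resumes without re-embedding anything. The sketch below is one possible shape, with several assumptions: `embed_with_cache` is a hypothetical helper, `embed_fn` stands in for the real embedding model, and the JSON file is only for illustration (at full scale the vectors would more plausibly live in ChromaDB or Parquet).

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List

def embed_with_cache(
    texts: List[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    cache_path: Path,
    batch_size: int = 64,
) -> List[List[float]]:
    """Embed texts in batches, skipping any text already in the JSON cache."""
    cache: Dict[str, List[float]] = {}
    if cache_path.exists():
        cache = json.loads(cache_path.read_text())
    keys = [hashlib.sha256(t.encode("utf-8")).hexdigest() for t in texts]
    # Deduplicate while keeping order, so each new text is embedded once.
    pending = list({k: t for k, t in zip(keys, texts) if k not in cache}.items())
    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        vectors = embed_fn([text for _, text in batch])
        for (key, _), vector in zip(batch, vectors):
            cache[key] = list(vector)
        # Persist after every batch so an interrupted run can resume.
        cache_path.write_text(json.dumps(cache))
    return [cache[k] for k in keys]

def fake_embed(batch: List[str]) -> List[List[float]]:
    """Stand-in for a real model: a one-dimensional 'embedding' of text length."""
    return [[float(len(t))] for t in batch]

with tempfile.TemporaryDirectory() as tmp:
    demo = embed_with_cache(["alpha", "beta", "alpha"], fake_embed, Path(tmp) / "cache.json")
print(demo)  # [[5.0], [4.0], [5.0]]
```

Keying the cache on a content hash means identical boilerplate passages repeated across filings are embedded only once.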

### Challenge 3: Infrastructure Complexity
- **Issue**: Managing multiple services (Ollama, RAPTOR API, WebUI, ChromaDB)
- **Solution**: 
  - Docker Compose orchestrates all services
  - Single command deployment: `docker-compose up`
  - Services isolated and independently scalable

### Challenge 4: Clustering Quality
- **Issue**: Poorly defined clusters reduce summary quality
- **Solution**: Use BIC for optimal cluster count, validate clusters manually on samples
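
For reference, the BIC criterion trades goodness of fit against model size: BIC = p · ln(n) − 2 · ln(L), where p is the number of free parameters, n the number of points, and L the likelihood. The sketch below applies it to a full-covariance GMM. The log-likelihood values are made up for illustration, and the helper names are hypothetical; in practice scikit-learn's `GaussianMixture.bic` computes the score from a fitted model.

```python
import math
from typing import Dict

def gmm_param_count(k: int, d: int) -> int:
    """Free parameters of a k-component, full-covariance GMM in d dimensions."""
    means = k * d
    covariances = k * d * (d + 1) // 2
    weights = k - 1  # mixture weights sum to 1
    return means + covariances + weights

def bic(log_likelihood: float, k: int, d: int, n: int) -> float:
    """BIC = p * ln(n) - 2 * ln(L); lower is better."""
    return gmm_param_count(k, d) * math.log(n) - 2.0 * log_likelihood

def pick_cluster_count(log_likelihoods: Dict[int, float], d: int, n: int) -> int:
    """Return the candidate k with the lowest BIC."""
    return min(log_likelihoods, key=lambda k: bic(log_likelihoods[k], k, d, n))

# Made-up total log-likelihoods for k = 1..4 on n = 500 points in d = 2.
fits = {1: -2100.0, 2: -1800.0, 3: -1790.0, 4: -1785.0}
best_k = pick_cluster_count(fits, d=2, n=500)
print(best_k)  # 2
```

In this toy case the fit barely improves beyond two clusters, so the parameter penalty makes k = 2 the minimum.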

### Challenge 5: Context Window Limits
- **Issue**: LLMs have token limits, can't ingest entire knowledge base
- **Solution**: RAPTOR's hierarchical retrieval provides most relevant chunks + summaries
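
One simple way to respect the limit is to pack the best-scoring passages into a fixed token budget. This is a sketch under assumptions: `pack_context` is a hypothetical helper, the scores are illustrative, and whitespace word counts stand in for real token counts.

```python
from typing import List, Tuple

def pack_context(ranked: List[Tuple[float, str]], budget_tokens: int) -> List[str]:
    """Greedily keep the highest-scoring passages that fit in the token budget."""
    selected: List[str] = []
    used = 0
    for _, text in sorted(ranked, key=lambda pair: pair[0], reverse=True):
        cost = len(text.split())  # crude token estimate: whitespace words
        if used + cost <= budget_tokens:
            selected.append(text)
            used += cost
    return selected

candidates = [
    (0.91, "level-2 summary " + "x " * 8),  # 10 words
    (0.88, "chunk A " + "x " * 28),         # 30 words
    (0.75, "chunk B " + "x " * 13),         # 15 words
]
packed = pack_context(candidates, budget_tokens=30)
print(len(packed))  # 2: the summary and chunk B fit, chunk A does not
```

Because summaries are short, they tend to fit alongside one or two detailed chunks, which is how the hierarchy helps within a fixed context window.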

### Challenge 6: Data Format Evolution
- **Issue**: SEC filing formats changed significantly between 1993 and 2024 (SGML → HTML → XML-based XBRL)
- **Solution**: Build robust parsing logic that handles multiple formats, test across time periods
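
A first pass at multi-format handling might detect the markup family and then strip tags with one shared routine. The heuristics below are assumptions for illustration, not a tested parser for EDGAR: `detect_format` and `extract_text` are hypothetical names, and the marker strings it checks are a guess at what distinguishes each era.

```python
import re
from html.parser import HTMLParser

def detect_format(raw: str) -> str:
    """Guess the markup family from the first few kilobytes of a filing."""
    head = raw[:4096].lower()
    if head.lstrip().startswith("<?xml") or "<xbrl" in head:
        return "xml"
    if "<html" in head:
        return "html"
    if "<sec-document>" in head or "<document>" in head:
        return "sgml"
    return "text"

class _TextExtractor(HTMLParser):
    """Collect text nodes and drop tags."""
    def __init__(self) -> None:
        super().__init__()
        self.parts = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)

def extract_text(raw: str) -> str:
    """Strip markup and collapse whitespace, whatever the detected format."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()  # flush any buffered text
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()

# Synthetic snippets, not real filings.
print(detect_format("<SEC-DOCUMENT>\n<DOCUMENT>\n<TYPE>10-K\n<TEXT>\nAnnual report"))
print(extract_text("<html><body><p>Risk   factors</p><p>apply</p></body></html>"))
```

Running both functions over the 1995, 2010, and 2024 sample filings is a cheap way to find formats the heuristics miss before building format-specific parsers.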

### Challenge 7: Processing Time
- **Issue**: Unzipping and processing 51 GB of data could take significant time
- **Solution**: Parallel processing where possible, start with subset (one year) to validate pipeline
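
Extraction can be parallelized with a small worker pool. This sketch assumes the archives are `.zip` files (the plan mentions unzipping) and uses threads for simplicity; a process pool may suit CPU-heavy parsing better. `extract_archive` and `extract_all` are hypothetical helper names.

```python
import tempfile
import zipfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import List

def extract_archive(archive: Path, out_root: Path) -> int:
    """Unzip one archive into its own folder and return the member count."""
    target = out_root / archive.stem
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)
        return len(zf.namelist())

def extract_all(archives: List[Path], out_root: Path, workers: int = 4) -> int:
    """Extract archives concurrently; returns the total number of members."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        counts = pool.map(lambda a: extract_archive(a, out_root), archives)
        return sum(counts)

# Demo on two tiny synthetic archives in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    demo_archives = []
    for name in ("sample_a", "sample_b"):
        path = root / f"{name}.zip"
        with zipfile.ZipFile(path, "w") as zf:
            zf.writestr("filing1.txt", "alpha")
            zf.writestr("filing2.txt", "beta")
        demo_archives.append(path)
    total = extract_all(demo_archives, root / "out", workers=2)
print(total)  # 4
```

Passing only one year's archives to `extract_all` gives the subset run suggested above for validating the pipeline.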

### Challenge 8: AWS Costs
- **Issue**: Running r6i.4xlarge 24/7 costs ~$600-800/month
- **Solution**: 
  - Stop instance when not in use
  - Use spot instances for batch processing (AWS advertises discounts of up to 90%; actual savings vary by instance type and region)
  - Process data once, serve queries on-demand

---

## Next Steps
1. ✅ Download and test gpt-oss model locally
2. [ ] Set up Docker Desktop on local laptop
3. [ ] Create `docker-compose.dev.yml` for local development
4. [ ] Test Ollama container with gpt-oss
5. [ ] Copy RAPTOR class from FinGPT GitHub to `src/models/raptor.py`
6. [ ] Complete Phase 2 with sample data and gpt-oss
7. [ ] Provision AWS EC2 r6i.4xlarge instance
8. [ ] Deploy llama3-sec on EC2 for production

---

## References
- **FinGPT GitHub:** https://github.com/AI4Finance-Foundation/FinGPT
- **FinGPT RAPTOR Implementation:** https://github.com/AI4Finance-Foundation/FinGPT/blob/master/fingpt/FinGPT_FinancialReportAnalysis/utils/rag.py
- **FinGPT Models on Hugging Face:** https://huggingface.co/AI4Finance-Foundation
- **RAPTOR RAG Documentation:** https://deepwiki.com/AI4Finance-Foundation/FinGPT/5.1-raptor-rag-system
- **SEC EDGAR API:** https://www.sec.gov/edgar/sec-api-documentation
- **Ollama:** https://ollama.ai/
- **Ollama Docker:** https://hub.docker.com/r/ollama/ollama
- **Ollama Models Library:** https://ollama.com/library
- **Arcee AI llama3-sec:** https://ollama.com/arcee-ai/llama3-sec
- **Open WebUI:** https://github.com/open-webui/open-webui
- **Docker Documentation:** https://docs.docker.com/
- **AWS EC2 Instance Types:** https://aws.amazon.com/ec2/instance-types/