# SEC 10-K/10-Q Analysis - RAPTOR RAG Project Plan

## Project Overview
AI-powered system for analyzing complete SEC 10-K and 10-Q filings (1993-2024) using RAPTOR RAG (Recursive Abstractive Processing for Tree-Organized Retrieval). The system builds an enhanced knowledge base from financial filings that users can query interactively to identify year-over-year changes, patterns, and potential anomalies.

**Data Coverage:** 1993-2024 (31 years of SEC EDGAR filings, ~51 GB)

---

## Core Architecture

### Infrastructure
- **Development Environment**: Local Windows laptop with Ollama + Docker Desktop
- **Production Deployment**: AWS EC2 Instance: secAI (8 vCPUs, 64 GB RAM, Ubuntu 24.04)
- **Containerization**: Docker Compose for service orchestration
- **Model Hosting**: Ollama (installed on EC2, not containerized)
- **User Interface**: Open WebUI (containerized)
- **Vector Database**: ChromaDB (containerized, port 8000)
- **Data Storage**: EC2 EBS volume for processed embeddings and knowledge base

### Deployment Architecture

**Development (Phase 2):**
```
Local Laptop (32 GB RAM)
├── Docker Compose
│   ├── Ollama Container (gpt-oss 13 GB)
│   ├── RAPTOR API Container (Python app)
│   └── Open WebUI Container
└── Local Volumes
    ├── ./data/external (sample data)
    ├── ./data/embeddings (ChromaDB)
    └── ./models (Ollama models)
```

**Production (Phase 4):**
```
AWS EC2 Instance: secAI (8 vCPUs, 64 GB RAM, Ubuntu 24.04)
├── Ollama (installed directly, not containerized)
├── Docker Containers
│   ├── ChromaDB Container (port 8000)
│   ├── Open WebUI Container
│   └── Data Processing Containers (chunking, embedding)
└── /app/data/ (EBS Volume)
    ├── edgar/extracted/ (51 GB SEC filings - 1993-2024, organized by year/quarter)
    ├── processed/ (chunked filings with metadata)
    ├── embeddings/ (ChromaDB vector DB)
    └── models/ (Ollama models)
```

### Architecture Diagram

![System Architecture](diagrams/architecture.png)

### RAPTOR RAG System
Unlike traditional RAG systems that use simple similarity search, RAPTOR implements:
- **Hierarchical Clustering**: Multi-level organization (global + local) using UMAP and Gaussian Mixture Models
- **Recursive Summarization**: 3-level hierarchical summaries capturing both granular details and high-level concepts
- **Enhanced Context Retrieval**: Cluster-aware retrieval providing richer context for LLM queries

---

## Technical Stack

### NLP & ML

**Development Model (Phase 2):**
- **gpt-oss** (13 GB) - General purpose with reasoning capabilities
- Suitable for: Local testing, rapid iteration, pipeline validation
- RAM requirement: ~16-20 GB

**Production Model (Phase 4):**
- **llama3-sec** (Arcee AI, Llama-3-70B based, fine-tuned for SEC filings)
  - Source: `arcee-ai/llama3-sec` on Ollama
  - Trained on 72B tokens of SEC filing data (20B checkpoint available)
  - Deployment: Via Ollama installed directly on EC2 (not containerized), per the Infrastructure section
  - Specialized for: SEC data analysis, investment analysis, risk assessment

**Fallback Models:**
- qwen2.5:1.5b (986 MB) - Lightweight for quick testing

**RAPTOR Implementation:**
- Adapted from FinGPT's `FinancialReportAnalysis/utils/rag.py`
- Source: https://github.com/AI4Finance-Foundation/FinGPT
- Custom implementation in `src/models/raptor.py`

**Embeddings & Clustering:**
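- Sentence Transformers for local, cost-free embedding generation (GPU-accelerated where a GPU is available); `all-MiniLM-L6-v2` was the initial choice, and the Embedding Strategy section later selects `multi-qa-mpnet-base-dot-v1`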
- UMAP (dimensionality reduction) + scikit-learn GMM for clustering
- LLM Interface: Ollama Python client

### Infrastructure & Deployment

**Containerization:**
- Docker + Docker Compose for all services
- Reproducible environments (dev → production)
- Service isolation and orchestration

**AWS EC2 Configuration (Production):**
- **Instance Name**: secAI
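- **Instance Type**: t3.xlarge as recorded (a standard t3.xlarge provides 4 vCPUs and 16 GB RAM, so the type should be confirmed against the specs below)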
  - 8 vCPUs
  - 64 GB RAM (upgraded from 32 GB)
- **Storage**: EBS volume at /app/data
  - 51 GB raw data (1993-2024, ~1.2M files)
  - Organized: /app/data/edgar/extracted/YEAR/QTRN/
  - Processed output: /app/data/processed/
- **OS**: Ubuntu 24.04 LTS
- **Security**: VPC with restricted security groups, SSH key access

**Data Processing:**
- **Chunking Strategy**: 500-token chunks with 100-token contextual window (Anthropic's approach)
  - Base chunk: 500 tokens (semantic unit)
  - Contextual summary: 100-token LLM-generated summary prepended to each chunk
  - Total overhead: ~20% additional tokens for context (100 context tokens per 500-token chunk)
  - Library: `tiktoken` for accurate token counting
- **Vector Storage**: ChromaDB (containerized, port 8000)
- **Data Format**: JSON for structured storage (metadata + chunks)
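
As a rough illustration of the base chunking step, the sketch below splits an already-tokenized filing into fixed-size windows. The function name `chunk_tokens` is hypothetical, and the token IDs are assumed to come from a tokenizer such as `tiktoken` (for example `tiktoken.get_encoding("cl100k_base").encode(text)`); which encoding the pipeline actually uses is not stated in this plan.

```python
from typing import List, Sequence


def chunk_tokens(token_ids: Sequence[int], chunk_size: int = 500) -> List[List[int]]:
    """Split a token sequence into consecutive chunks of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        list(token_ids[i:i + chunk_size])
        for i in range(0, len(token_ids), chunk_size)
    ]


# Stand-in token IDs; a real run would tokenize the filing text first.
tokens = list(range(1234))
chunks = chunk_tokens(tokens)
print(len(chunks), [len(c) for c in chunks])  # 3 [500, 500, 234]
```

The final chunk is simply shorter than the rest; whether short trailing chunks should be merged into their predecessor is a design choice this sketch leaves open.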

### Libraries
- `langchain`, `langchain_community` - LLM orchestration
- `sentence-transformers` - Local embeddings (GPU-accelerated)
- `umap-learn` - Dimensionality reduction
- `scikit-learn` - Clustering algorithms (GMM)
- `pandas`, `numpy` - Data manipulation
- `requests` - SEC EDGAR API access
- `ollama` - Python client for Ollama
- `docker`, `docker-compose` - Containerization
- `tiktoken` - Token counting for chunking
- `tqdm` - Progress bars

---

## Data Scope

### Current Data Holdings
- **Time Period:** 1993-2024 (31 years)
- **Data Size:** ~51 GB
- **Data Location:** `/app/data/edgar/extracted/` on EC2
- **File Count:** ~1.2 million files
- **Organization:** YEAR/QTRN structure (e.g., 2024/QTR1/)
- **Filing Types:** Complete 10-K (annual reports) and 10-Q (quarterly filings)
- **Processing Scope:** Full filing text (all sections)
- **Analysis Focus:** Year-over-year changes, topic trends, boilerplate vs. substantive disclosure

**Why Process Complete Filings:**
- RAG enables users to ask questions about ANY section (not just risk factors)
- RAPTOR clustering naturally organizes content by topic regardless of section
- Maximizes system flexibility and future-proofs the knowledge base
- Supports queries like: "How did revenue recognition policies change?" or "Compare executive compensation across years"
- Historical coverage (1993-2024) enables long-term trend analysis

---

## RAPTOR Pipeline Flowchart

![RAPTOR Pipeline](diagrams/raptor_pipeline.png)

---

## Data Processing Workflow

![Data Processing Workflow](diagrams/data_processing_workflow.png)

---

## Implementation Phases

### Phase 1: Model Research & Setup (Week 1) ✅ COMPLETE
**Objectives:**
- [x] Clarify FinGPT components: RAPTOR (Python implementation)
- [x] Set up Ollama and test local LLM deployment
- [x] Evaluate and download appropriate models for SEC filing analysis
- [x] Set up project structure (`src/`, `data/`, `notebooks/`, `dashboard/`)
- [ ] Copy and adapt RAPTOR class from FinGPT's `rag.py` (deferred to Phase 3)
- [ ] Create base `Raptor` class skeleton in `src/models/raptor.py` (deferred to Phase 3)

**Completed Actions:**
- ✅ Installed Ollama v0.12.5 on Windows
- ✅ Downloaded qwen2.5:1.5b (986 MB) - lightweight model for testing
- ✅ Downloaded gpt-oss (13 GB) - reasoning-capable model for development
- ✅ Verified Python ollama package integration
- ✅ Tested model inference via command line and Python

**Model Selection Strategy:**
- **Development (Phase 2):** gpt-oss (13 GB) on local laptop
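- **Production (Phase 4):** llama3-sec on AWS EC2 (planned here as g6.2xlarge; the provisioned `secAI` instance is described under AWS EC2 Configuration, and the two should be reconciled)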
- **Rationale:** Start with smaller model for faster iteration, scale to specialized model in production

**Key Clarification:**
- ✅ **RAPTOR** = Python implementation for hierarchical clustering/summarization (we copy the code)
- ✅ **llama3-sec** = Domain-specific SEC filing model (production)
- ✅ **gpt-oss** = General purpose reasoning model (development)

---

### Phase 2: Data Processing Pipeline (Week 2) - IN PROGRESS
**Objectives:**
- [x] Extract filings from archives (complete - 1.2M files on EC2)
- [x] Parse complete 10-K/10-Q text from HTML/XML/SGML formats (complete)
- [x] Implement document chunking (500-token chunks; the planned 100-token contextual summaries were deferred in the initial run)
- [ ] Generate embeddings using local Sentence Transformers
- [ ] Store structured data (chunks + metadata) in JSON
- [x] Set up Docker for data processing containers

**Key Files:**
- `src/data/filing_extractor.py` - Unzip archives, extract full filing text
- `src/data/text_processor.py` - Clean text, chunk into 500-token segments with metadata extraction
- `src/models/embedding_generator.py` - Embedding creation
- `src/Dockerfile` - Data processing container definition
- `docker-compose.chunking.yml` - Chunking service orchestration
- `deploy_and_run_chunking.py` - Automated deployment to EC2

**Docker-Based Processing:**
- Built image: `edgar-chunking` (8.4 GB, Python 3.12 + dependencies)
- Volume mounts:
  - Input: `/app/data/edgar/` (read-only)
  - Output: `/app/data/processed/`
- Resource limits: 4 CPUs, 8GB memory
- Deployment: Automated via `deploy_and_run_chunking.py`

**Current Status:**
- Complete dataset extracted on EC2 (1.2M files, 1993-2024)
- Docker image built successfully on EC2
- Ready to process 2024 Q1 as initial test (~6,337 files)
- Chunking implementation: 500-token base chunks (the 100-token Anthropic-style context was planned but deferred in the initial run)

**Validation:**
- Test on 2024 Q1 first (single quarter)
- Verify text extraction accuracy and metadata capture
- Confirm chunking preserves semantic coherence
- Validate JSON output structure
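
One way the JSON structure check could look is sketched below. The field names in `REQUIRED_FIELDS` are hypothetical stand-ins for the per-chunk metadata (CIK, company name, form type, filing date); the real schema written by `text_processor.py` may use different keys.

```python
import json
from typing import Any, List

# Hypothetical field names; the real chunk schema may differ.
REQUIRED_FIELDS = ("text", "cik", "company_name", "form_type", "filing_date")


def validate_chunks(payload: Any) -> List[str]:
    """Return a list of problems; an empty list means the payload passed."""
    if not isinstance(payload, list):
        return ["top-level JSON value is not a list of chunks"]
    problems = []
    for i, chunk in enumerate(payload):
        if not isinstance(chunk, dict):
            problems.append(f"chunk {i}: not an object")
            continue
        for field in REQUIRED_FIELDS:
            if not chunk.get(field):
                problems.append(f"chunk {i}: missing or empty '{field}'")
    return problems


# Placeholder values for illustration only.
sample = json.loads(
    '[{"text": "Item 1A. Risk Factors", "cik": "0000000000", '
    '"company_name": "Example Corp", "form_type": "10-K", '
    '"filing_date": "2024-01-31"}, {"text": ""}]'
)
print(validate_chunks(sample))
```

Running a check like this over every file in a quarter gives a quick pass/fail count before any embeddings are generated.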

---

### Phase 3: RAPTOR System Implementation (Week 3)
**Objectives:**
- [ ] Copy RAPTOR class from FinGPT to `src/models/raptor.py`
- [ ] Implement hierarchical clustering (adapted from FinGPT's RAPTOR):
  - Global clustering (UMAP → GMM with BIC for optimal cluster count)
  - Local clustering (secondary refinement within global clusters)
- [ ] Build recursive summarization engine (3 levels deep) using llama3-sec
- [ ] Create enhanced knowledge base combining:
  - Original document chunks
  - Level 1 summaries (cluster summaries)
  - Level 2 summaries (summary of summaries)
  - Level 3 summaries (highest abstraction)
- [ ] Implement cluster-aware retrieval mechanism
- [ ] Test on sample data with gpt-oss

**Key Methods in `Raptor` class (adapted from FinGPT):**
```python
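# Signatures only: bodies are elided with `...` and `self` is omitted for brevity.
def global_cluster_embeddings(embeddings, dim, n_neighbors, metric="cosine"): ...  # UMAP, global view
def local_cluster_embeddings(embeddings, dim, num_neighbors=10): ...  # UMAP within a global cluster
def get_optimal_clusters(embeddings, max_clusters=50): ...  # choose cluster count via BIC
def GMM_cluster(embeddings, threshold, random_state=0): ...  # soft assignment with a GMM
def perform_clustering(embeddings, dim, threshold): ...  # global clustering, then local refinement
def recursive_embed_cluster_summarize(texts, level=1, n_levels=3): ...  # build the summary levels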
```
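
To make the recursion in `recursive_embed_cluster_summarize` concrete, here is a minimal sketch of the control flow only. It is not the FinGPT code: `recursive_summarize`, `pair_up`, and `join` are hypothetical names, clustering is replaced by pairing neighbours, and summarization by string joining, so that the example runs without UMAP, a GMM, or an LLM.

```python
from typing import Callable, Dict, List

ClusterFn = Callable[[List[str]], List[List[str]]]
SummarizeFn = Callable[[List[str]], str]


def recursive_summarize(
    texts: List[str],
    cluster_fn: ClusterFn,
    summarize_fn: SummarizeFn,
    level: int = 1,
    n_levels: int = 3,
) -> Dict[int, List[str]]:
    """Return {level: summaries}, building each level from the one below it."""
    clusters = cluster_fn(texts)
    summaries = [summarize_fn(members) for members in clusters]
    results = {level: summaries}
    # Recurse only while more than one summary remains to be merged.
    if level < n_levels and len(summaries) > 1:
        results.update(
            recursive_summarize(summaries, cluster_fn, summarize_fn, level + 1, n_levels)
        )
    return results


# Toy stand-ins for UMAP + GMM clustering and for the LLM summarizer.
def pair_up(items: List[str]) -> List[List[str]]:
    return [items[i:i + 2] for i in range(0, len(items), 2)]


def join(members: List[str]) -> str:
    return "(" + " + ".join(members) + ")"


tree = recursive_summarize(["a", "b", "c", "d"], pair_up, join)
print(tree)  # {1: ['(a + b)', '(c + d)'], 2: ['((a + b) + (c + d))']}
```

Note that the recursion stops early once a level collapses to a single summary, so fewer than `n_levels` levels may be produced on small inputs.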

**Source Reference:**
- Original implementation: https://github.com/AI4Finance-Foundation/FinGPT/blob/master/fingpt/FinGPT_FinancialReportAnalysis/utils/rag.py

**Testing:**
- Validate clustering quality on sample documents
- Review generated summaries for coherence
- Ensure topics are properly grouped (e.g., all revenue-related content clusters together)
- Benchmark with gpt-oss before scaling to production

---

### Phase 4: Production Deployment on AWS EC2 (Week 4-5)
**Objectives:**
- [x] Provision AWS EC2 instance (secAI - 8 vCPUs, 64GB RAM)
- [x] Set up Docker + Docker Compose on EC2
- [x] Deploy Ollama (installed directly, not containerized)
- [x] Deploy ChromaDB container (port 8000)
- [x] Deploy Open WebUI container
- [ ] Deploy RAPTOR API container with full pipeline
- [ ] Process complete 51 GB dataset into knowledge base
- [ ] Implement query handling for diverse topics across filings
- [ ] Create sample query templates for common use cases
- [ ] Set up monitoring and logging

**EC2 Setup Steps:**
1. ✅ Launch EC2 instance (secAI - t3.xlarge, 64GB RAM)
2. ✅ Attach EBS volume at /app/data
3. ✅ Install Docker + Docker Compose
4. ✅ Install Ollama directly on EC2
5. ✅ Clone repository to EC2
6. ✅ Extract all SEC filings (1993-2024, 1.2M files)
7. ✅ Deploy ChromaDB container (port 8000)
8. ✅ Deploy Open WebUI container
9. [ ] Process full 51 GB dataset with chunking pipeline
10. [ ] Generate embeddings and store in ChromaDB

**Integration Workflow:**
1. User submits query via Open WebUI (web interface)
2. RAPTOR retrieves relevant chunks + hierarchical summaries from ChromaDB
3. Context passed to Ollama LLM (llama3-sec) via API
4. LLM generates response with supporting evidence
5. Results displayed in WebUI with citations
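
Steps 2 to 4 of this workflow hinge on how retrieved passages are handed to the model. The sketch below shows one possible prompt layout that numbers each passage so the answer can cite it; `build_prompt` and the hit field names are hypothetical, and the prompt wording is an assumption rather than the project's actual template.

```python
from typing import Dict, List


def build_prompt(query: str, hits: List[Dict[str, str]]) -> str:
    """Number each retrieved passage so the model can cite it as [1], [2], ..."""
    context_lines = []
    for n, hit in enumerate(hits, start=1):
        source = f"{hit['company_name']}, {hit['form_type']}, {hit['filing_date']}"
        context_lines.append(f"[{n}] ({source}) {hit['text']}")
    context = "\n".join(context_lines)
    return (
        "Answer the question using only the numbered context passages. "
        "Cite passages by number.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )


# Placeholder hits for illustration only.
hits = [
    {"company_name": "Example Corp", "form_type": "10-K",
     "filing_date": "2024-01-31", "text": "We face cybersecurity risks."},
    {"company_name": "Example Corp", "form_type": "10-Q",
     "filing_date": "2024-04-30", "text": "No material incidents occurred."},
]
prompt = build_prompt("What cyber risks were disclosed?", hits)
print(prompt)
```

The resulting string could then be sent to the model through the Ollama Python client (for example as the user message of `ollama.chat`), with the numbered sources mapped back to filings for display in the WebUI.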

**Example Queries:**
- "What cyber risks did Apple disclose in 2023?"
- "How have revenue recognition policies evolved from 2010 to 2024?"
- "Compare executive compensation disclosures between tech companies"
- "Show boilerplate vs. substantive language in risk disclosures"

**Deliverables:**
- Functional production system on EC2
- Dockerized, reproducible deployment
- Full 51 GB dataset processed into knowledge base
- Open WebUI interface accessible
- Documentation for deployment and maintenance

---

## Understanding Docker: A Beginner's Guide

### What is Docker?

Docker is a tool that packages software and all its dependencies into standardized units called **containers**. Think of it like shipping containers for software - everything your code needs to run is packaged together in one box.

### Docker Images vs. Docker Containers

**Docker Image:**
- A **blueprint** or **recipe** containing your code + dependencies + configuration
- Stored on disk, doesn't run by itself
- Like an MP3 file sitting on your hard drive
- Example: `edgar-chunking` image (8.4 GB) contains Python 3.12 + tiktoken + tqdm + our chunking script

**Docker Container:**
- A **running instance** of an image
- Actively executing your code
- Like playing an MP3 file (the music you hear)
- Example: When we run `docker run edgar-chunking`, it creates a container that executes our chunking script

**Analogy:**
- **Image** = Recipe for chocolate chip cookies (stored in a cookbook)
- **Container** = Actual cookies baking in the oven (active process)

### Why Use Docker for This Project?

**1. Consistency Across Environments**
- Your laptop, EC2 instance, teammate's machine - all run the same code the exact same way
- "It works on my machine" problem solved

**2. Dependency Management**
- No need to manually install Python, tiktoken, tqdm, etc. on EC2
- Everything bundled in the image
- Avoids version conflicts

**3. Isolation**
- Each container runs independently
- ChromaDB container won't interfere with Ollama container
- Different services can use different Python versions if needed

**4. Reproducibility**
- Dockerfile is code - checked into git
- Anyone can rebuild the exact same environment
- Deployment = `docker-compose up`

**5. Easy Deployment**
- Build image once, run anywhere
- Update code → rebuild image → deploy new container
- No manual server configuration

### Our Docker Setup

**Current Docker Images on EC2:**
- `edgar-chunking` (8.4 GB) - Contains Python 3.12 + all dependencies + chunking script
- Used to process SEC filing text into 500-token chunks

**How We Use Docker:**
1. Write code locally (e.g., `text_processor.py`)
2. Create `Dockerfile` (recipe for building the image)
3. Upload files to EC2 via SCP
4. Build image on EC2: `docker build -t edgar-chunking .`
5. Run container: `docker run edgar-chunking` or `docker compose run chunking`
6. Container executes our script, outputs processed data to `/app/data/processed/`

**Docker Compose:**
- Tool for managing multiple containers at once
- We use `docker-compose.chunking.yml` to configure:
  - Which image to use (`edgar-chunking`)
  - Volume mounts (share data between host and container)
  - Resource limits (CPU, memory)
  - Command to run

**Volume Mounts Explained:**
- Volumes let containers access files on the host machine
- Example: `-v /app/data/edgar:/app/data/edgar:ro`
  - `/app/data/edgar` on EC2 host → `/app/data/edgar` inside container
  - `:ro` means read-only (container can't modify input data)
- Output: `-v /app/data/processed:/app/data/processed` (read-write)
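
For scripted runs (for instance from a deployment script), the same mounts and resource limits can be assembled programmatically. This is a sketch under assumptions: `docker_run_args` is a hypothetical helper, and the defaults mirror the 4 CPU / 8 GB limits listed earlier in this plan.

```python
from typing import List


def docker_run_args(image: str, input_dir: str, output_dir: str,
                    cpus: float = 4, memory: str = "8g") -> List[str]:
    """Build a `docker run` argument list with a read-only input mount."""
    return [
        "docker", "run", "--rm",
        "--cpus", str(cpus),
        "--memory", memory,
        "-v", f"{input_dir}:{input_dir}:ro",  # input: read-only in the container
        "-v", f"{output_dir}:{output_dir}",   # output: read-write
        image,
    ]


args = docker_run_args("edgar-chunking", "/app/data/edgar", "/app/data/processed")
print(" ".join(args))
```

Passing the list to `subprocess.run(args, check=True)` avoids shell quoting issues; the sketch only builds the arguments and does not start a container.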

---

## Docker Strategy

### Development vs. Production

**Development (Local Laptop):**
- Purpose: Build and test pipeline with sample data
- Model: gpt-oss (13 GB) - fits in 32 GB RAM
- Data: Sample filings (~1-5 GB)
- Services: Ollama + RAPTOR API + Open WebUI
- Command: `docker-compose -f docker-compose.dev.yml up`

**Production (AWS EC2):**
- Purpose: Full-scale deployment with complete dataset
- Model: llama3-sec - optimized for SEC filings
- Data: Complete 51 GB SEC filings (1993-2024)
- Services: Ollama (installed) + ChromaDB (containerized) + Open WebUI (containerized) + Data processing (containerized)
- Command: `docker compose -f docker-compose.chunking.yml run --rm chunking`

### Benefits of Docker Approach:

1. **Reproducibility**: Same environment dev → production
2. **Isolation**: Services don't conflict (different Python versions, dependencies)
3. **Portability**: Works on laptop, EC2, teammate's machine
4. **Scalability**: Easy to add services (monitoring, caching, etc.)
5. **Version Control**: Infrastructure as code (`Dockerfile`, `docker-compose.yml`)
6. **Easy Deployment**: `git pull && docker-compose up` deploys updates
7. **Resource Management**: Set CPU/memory limits per container

### Repository Structure with Docker
```
edgar_anomaly_detection/
├── data/
│   ├── external/         # Downloaded filings (gitignored)
│   ├── processed/        # Chunked filings (gitignored)
│   └── embeddings/       # ChromaDB files (gitignored)
├── src/
│   ├── Dockerfile        # Data processing container definition
│   ├── requirements.txt  # Python dependencies
│   ├── data/
│   │   ├── filing_extractor.py
│   │   └── text_processor.py
│   ├── models/
│   │   ├── raptor.py
│   │   ├── embedding_generator.py
│   │   └── clustering.py
│   └── api/
│       └── main.py       # FastAPI server for RAPTOR
├── notebooks/
│   └── 01_project_plan.ipynb
├── docker-compose.chunking.yml   # Chunking service
├── docker-compose.dev.yml        # Development setup
├── docker-compose.prod.yml       # Production setup (EC2)
├── deploy_and_run_chunking.py    # Automated deployment script
├── .dockerignore
├── .gitignore
├── requirements.txt
└── README.md
```

---

## RAPTOR vs. Traditional RAG Comparison

| Feature | Traditional RAG | RAPTOR RAG |
|---------|----------------|------------|
| Text Processing | Simple chunking | Recursive, hierarchical |
| Clustering | None or basic | Multi-level (global + local) |
| Summarization | None or single-level | Recursive, 3-level |
| Context Selection | Similarity-based only | Cluster-aware + similarity |
| Document Understanding | Flat representation | Hierarchical representation |
| Knowledge Integration | Direct chunks only | Chunks + multi-level summaries |

**Why RAPTOR for Financial Filings?**
- Financial documents have hierarchical structure (sections, subsections, themes)
- YoY analysis requires understanding both granular changes and high-level shifts
- Boilerplate detection benefits from cluster analysis (repetitive language clusters together)
- Complex queries need multi-level context (e.g., "How did regulatory disclosures evolve?")
- Historical coverage (1993-2024) enables long-term trend analysis

---

## Success Metrics
- [ ] Successfully process 90%+ of downloaded filings (1993-2024) into knowledge base
- [ ] Clustering produces coherent, interpretable topic groups
- [ ] Generated summaries accurately capture content at each hierarchical level
- [ ] LLM queries return relevant, accurate responses with supporting evidence
- [ ] System responds to queries in <10 seconds (including retrieval + generation)
- [ ] Manual validation: Test 10 diverse queries across different topics and decades, verify accuracy
- [ ] Docker deployment: Services start successfully on both dev and production
- [ ] EC2 deployment: System runs stably for 7+ days without intervention

---

## Key Advantages of AI-First Approach
1. **No Manual Feature Engineering**: LLM infers patterns from enhanced context (vs. building YoY diff algorithms)
2. **Flexible Queries**: Users can ask arbitrary questions about ANY topic or section
3. **Semantic Understanding**: Detects substantive changes even when wording differs
4. **Scalable**: Adding new filings just requires re-running RAPTOR pipeline
5. **Explainable**: LLM can cite specific sections supporting its conclusions
6. **Historical Depth**: 31 years of data enables long-term trend analysis
7. **Reproducible**: Docker ensures consistent environment across deployments

---

## Technical Challenges & Mitigations

### Challenge 1: Model Size vs. Available RAM
- **Issue**: llama3-sec is a 70B-parameter model and needs roughly 40 GB of memory even at 4-bit quantization; the local laptop has 32 GB
- **Solution**: 
  - Development: Use gpt-oss (13 GB) on laptop
  - Production: Deploy llama3-sec on AWS EC2 (64 GB RAM)
  - Benefits: Faster iteration locally, best quality in production

### Challenge 2: Embedding Generation at Scale
- **Issue**: Processing 31 years of complete filings (~51 GB, 1.2M files) requires significant compute power
- **Solution**: Process in batches (quarterly), parallelize where possible
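
Quarterly batching maps directly onto the `YEAR/QTRn` directory layout. The sketch below enumerates those directories in chronological order so each can be handed to a worker; `iter_quarter_dirs` is a hypothetical helper, and the temporary directory stands in for `/app/data/edgar/extracted/`.

```python
import re
import tempfile
from pathlib import Path
from typing import Iterator, Tuple


def iter_quarter_dirs(root: Path) -> Iterator[Tuple[int, int, Path]]:
    """Yield (year, quarter, path) for each YEAR/QTRn directory, oldest first."""
    found = []
    for year_dir in root.iterdir():
        if not (year_dir.is_dir() and re.fullmatch(r"\d{4}", year_dir.name)):
            continue
        for qtr_dir in year_dir.iterdir():
            match = re.fullmatch(r"QTR([1-4])", qtr_dir.name)
            if qtr_dir.is_dir() and match:
                found.append((int(year_dir.name), int(match.group(1)), qtr_dir))
    yield from sorted(found)


# Demo on a throwaway directory tree; non-matching folders are ignored.
with tempfile.TemporaryDirectory() as tmp:
    for rel in ("2024/QTR2", "2024/QTR1", "1993/QTR4", "2024/notes"):
        (Path(tmp) / rel).mkdir(parents=True)
    batches = [(year, qtr) for year, qtr, _ in iter_quarter_dirs(Path(tmp))]

print(batches)  # [(1993, 4), (2024, 1), (2024, 2)]
```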

### Challenge 3: Infrastructure Complexity
- **Issue**: Managing multiple services (Ollama, RAPTOR API, WebUI, ChromaDB)
- **Solution**: 
  - Docker Compose orchestrates all services
  - Single command deployment: `docker-compose up`
  - Services isolated and independently scalable

### Challenge 4: Clustering Quality
- **Issue**: Poorly defined clusters reduce summary quality
- **Solution**: Use BIC for optimal cluster count, validate clusters manually on samples
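
For reference, the BIC used to pick the cluster count trades model fit against parameter count: BIC = p·ln(n) − 2·ln(L), where lower is better. The sketch below computes it for a full-covariance GMM from a fitted model's total log-likelihood. The helper names are hypothetical and the log-likelihood values are made up for illustration; in practice scikit-learn's `GaussianMixture.bic(X)` returns this quantity directly.

```python
import math
from typing import Dict


def gmm_parameter_count(n_components: int, n_features: int) -> int:
    """Free parameters of a full-covariance GMM: weights + means + covariances."""
    weights = n_components - 1
    means = n_components * n_features
    covariances = n_components * n_features * (n_features + 1) // 2
    return weights + means + covariances


def bic(total_log_likelihood: float, n_samples: int, n_params: int) -> float:
    """Bayesian information criterion; lower is better."""
    return n_params * math.log(n_samples) - 2.0 * total_log_likelihood


def pick_cluster_count(log_likelihoods: Dict[int, float],
                       n_samples: int, n_features: int) -> int:
    """Choose the component count whose fitted model has the lowest BIC."""
    return min(
        log_likelihoods,
        key=lambda k: bic(log_likelihoods[k], n_samples,
                          gmm_parameter_count(k, n_features)),
    )


# Illustrative (made-up) total log-likelihoods for k = 1, 2, 3 components.
fits = {1: -20000.0, 2: -19000.0, 3: -18950.0}
best_k = pick_cluster_count(fits, n_samples=1000, n_features=10)
print(best_k)  # 2
```

In this toy case the third component improves the likelihood only slightly, so the parameter penalty outweighs it and two clusters are selected.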

### Challenge 5: Context Window Limits
- **Issue**: LLMs have token limits, can't ingest entire knowledge base
- **Solution**: RAPTOR's hierarchical retrieval provides most relevant chunks + summaries
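
A simple way to respect the context window is to rank all candidate nodes (original chunks and summaries alike) by similarity and keep adding them until a token budget is reached. The sketch below assumes that approach; `Node`, `select_context`, and the scores are hypothetical and do not describe the retrieval logic in FinGPT.

```python
from typing import List, NamedTuple


class Node(NamedTuple):
    text: str
    score: float   # similarity to the query (higher is better)
    n_tokens: int  # token count of the text
    level: int     # 0 = original chunk, 1-3 = summary levels


def select_context(nodes: List[Node], max_tokens: int) -> List[Node]:
    """Greedily keep the highest-scoring nodes that fit in the token budget."""
    selected, used = [], 0
    for node in sorted(nodes, key=lambda n: n.score, reverse=True):
        if used + node.n_tokens <= max_tokens:
            selected.append(node)
            used += node.n_tokens
    return selected


candidates = [
    Node("chunk A", 0.91, 500, 0),
    Node("L1 summary", 0.88, 150, 1),
    Node("chunk B", 0.80, 500, 0),
    Node("L2 summary", 0.75, 120, 2),
]
picked = select_context(candidates, max_tokens=800)
print([n.text for n in picked])  # ['chunk A', 'L1 summary', 'L2 summary']
```

Here the second raw chunk does not fit, but a shorter higher-level summary does, which is the kind of mixed-granularity context the hierarchy is meant to provide.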

### Challenge 6: Data Format Evolution
- **Issue**: SEC filing formats changed significantly between 1993 and 2024 (SGML → HTML → XML)
- **Solution**: Built robust parsing logic that handles multiple formats, tested across time periods

### Challenge 7: Processing Time
- **Issue**: Processing 1.2M files takes significant time
- **Solution**: Start with single quarter (2024 Q1), then scale to full dataset, use Docker for resource management

---

## Next Steps
1. ✅ Download and test gpt-oss model locally
2. ✅ Extract complete SEC filing dataset to EC2 (1.2M files)
3. ✅ Build Docker image for data processing (`edgar-chunking`)
4. [ ] Run chunking pipeline on 2024 Q1 (test run)
5. [ ] Validate chunked output and metadata extraction
6. [ ] Copy RAPTOR class from FinGPT GitHub to `src/models/raptor.py`
7. [ ] Scale chunking to full dataset (1993-2024)
8. [ ] Generate embeddings and store in ChromaDB

---

## References
- **FinGPT GitHub:** https://github.com/AI4Finance-Foundation/FinGPT
- **FinGPT RAPTOR Implementation:** https://github.com/AI4Finance-Foundation/FinGPT/blob/master/fingpt/FinGPT_FinancialReportAnalysis/utils/rag.py
- **FinGPT Models on Hugging Face:** https://huggingface.co/AI4Finance-Foundation
- **RAPTOR RAG Documentation:** https://deepwiki.com/AI4Finance-Foundation/FinGPT/5.1-raptor-rag-system
- **SEC EDGAR API:** https://www.sec.gov/edgar/sec-api-documentation
- **Ollama:** https://ollama.ai/
- **Ollama Docker:** https://hub.docker.com/r/ollama/ollama
- **Ollama Models Library:** https://ollama.com/library
- **Arcee AI llama3-sec:** https://ollama.com/arcee-ai/llama3-sec
- **Open WebUI:** https://github.com/open-webui/open-webui
- **Docker Documentation:** https://docs.docker.com/
- **AWS EC2 Instance Types:** https://aws.amazon.com/ec2/instance-types/
- **Anthropic Contextual Embeddings:** https://www.anthropic.com/news/contextual-retrieval

## EC2 Production File Structure & Processing Status (Updated 2025-10-24)

### Actual EC2 Data Organization

**Location:** AWS EC2 Instance `secAI` at `/app/data/`

```
/app/data/
├── edgar/                          # Raw SEC EDGAR filings
│   └── extracted/                  # Unzipped filing text files
│       └── 2024/                   # Year-based organization
│           ├── QTR1/               # Q1 2024 (6,337 .txt files)   ✅ CHUNKED
│           ├── QTR2/               # Q2 2024 (7,247 .txt files)   ✅ CHUNKED
│           ├── QTR3/               # Q3 2024 (6,248 .txt files)   ✅ CHUNKED
│           └── QTR4/               # Q4 2024 (6,182 .txt files)   ✅ CHUNKED
│
├── processed/                      # Chunked JSON output (500-token chunks)
│   └── 2024/                       # Mirrors input structure
│       ├── QTR1/                   # 6,337 JSON files (one per filing)
│       │   └── YYYYMMDD_FORM_edgar_data_CIK_ACCESSION.json
│       ├── QTR2/                   # 7,247 JSON files
│       ├── QTR3/                   # 6,248 JSON files
│       └── QTR4/                   # 6,182 JSON files
│
└── embeddings/                     # Vector embeddings (NEXT PHASE)
    ├── test/                       # Test subset (3 files for validation)
    │   ├── embeddings.parquet      # Embedding vectors (768-dim)
    │   ├── metadata.parquet        # Chunk metadata (CIK, date, form, etc.)
    │   └── index.faiss             # Optional: FAISS index for fast search
    │
    └── 2024/                       # Full 2024 embeddings (future)
        ├── QTR1/
        │   ├── embeddings.parquet
        │   └── metadata.parquet
        ├── QTR2/
        ├── QTR3/
        └── QTR4/
```
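
The `YYYYMMDD_FORM_edgar_data_CIK_ACCESSION` naming pattern carries enough information to recover basic metadata without opening a file. A minimal parsing sketch follows; `parse_filing_name` and `FilingId` are hypothetical names, and the accession pattern assumes the standard 10-2-6 digit EDGAR format.

```python
import re
from datetime import date, datetime
from typing import NamedTuple, Optional

FILENAME_RE = re.compile(
    r"^(?P<date>\d{8})_(?P<form>.+)_edgar_data_(?P<cik>\d+)_"
    r"(?P<accession>\d{10}-\d{2}-\d{6})\.(?:txt|json)$"
)


class FilingId(NamedTuple):
    filing_date: date
    form_type: str
    cik: str
    accession: str


def parse_filing_name(name: str) -> Optional[FilingId]:
    """Parse a filing file name; return None if it does not fit the pattern."""
    match = FILENAME_RE.match(name)
    if match is None:
        return None
    try:
        filing_date = datetime.strptime(match["date"], "%Y%m%d").date()
    except ValueError:  # eight digits that are not a real calendar date
        return None
    return FilingId(filing_date, match["form"], match["cik"], match["accession"])


parsed = parse_filing_name("20241101_10-K_edgar_data_320193_0000320193-24-000123.txt")
print(parsed)
```

Returning `None` for unexpected names makes it easy to count and inspect files that fall outside the convention instead of failing mid-batch.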

### 2024 Processing Results (✅ COMPLETE - All Quarters)

**Chunking completed on 2025-10-24 using Docker container `edgar-chunking`**

| Quarter | Files | Chunks | Tokens | Status |
|---------|-------|--------|--------|--------|
| **Q1** | 6,337 | 1,235,886 | 616,372,446 | ✅ Complete |
| **Q2** | 7,247 | 584,914 | 290,670,736 | ✅ Complete |
| **Q3** | 6,248 | 522,716 | 259,802,386 | ✅ Complete |
| **Q4** | 6,182 | 498,682 | 247,809,543 | ✅ Complete |
| **TOTAL** | **26,014** | **2,842,198** | **1,414,655,111** | ✅ Complete |

**Processing Details:**
- **Method**: Docker Compose orchestration via `docker-compose.chunking.yml`
- **Container**: `edgar-chunking` (8.4 GB image with Python 3.12 + tiktoken + tqdm)
- **Total Processing Time**: ~4 hours (sequential quarterly processing)
- **Chunk Size**: 500 tokens (tiktoken tokenizer)
- **Metadata Extracted**: CIK, company name, form type, filing date per chunk
- **Output Format**: JSON (one file per filing, array of chunks with metadata)
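
The totals in the table can be re-derived from the per-quarter rows, which also exposes two useful ratios: average tokens per chunk (about 498, consistent with a 500-token target) and chunks per file (roughly 195 in Q1 versus about 81 to 84 in the other quarters). The sketch below is just that arithmetic on the figures reported above.

```python
# Per-quarter figures copied from the 2024 processing results table.
quarters = {
    "Q1": {"files": 6_337, "chunks": 1_235_886, "tokens": 616_372_446},
    "Q2": {"files": 7_247, "chunks": 584_914, "tokens": 290_670_736},
    "Q3": {"files": 6_248, "chunks": 522_716, "tokens": 259_802_386},
    "Q4": {"files": 6_182, "chunks": 498_682, "tokens": 247_809_543},
}

totals = {
    key: sum(q[key] for q in quarters.values())
    for key in ("files", "chunks", "tokens")
}
avg_tokens_per_chunk = totals["tokens"] / totals["chunks"]
chunks_per_file = {name: q["chunks"] / q["files"] for name, q in quarters.items()}

print(totals)
print(f"average tokens per chunk: {avg_tokens_per_chunk:.1f}")
for name, ratio in chunks_per_file.items():
    print(f"{name}: {ratio:.1f} chunks per file")
```

The much higher Q1 ratio is plausibly explained by longer annual 10-K filings clustering in that quarter, though that is an inference and has not been checked against the form-type metadata.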

### Chunking Strategy: 500-Token Base Chunks (No Contextual Window)

**Implementation:**
- **Core Approach**: Direct 500-token chunks using `tiktoken` tokenizer
- **No Contextual Window**: Initial implementation does NOT use Anthropic's contextual retrieval method
- **Rationale**: Simpler baseline for testing; can add contextual embeddings later if needed

**Why We Chose This Approach:**
1. **Faster Initial Processing**: No LLM calls required for context generation
2. **Baseline for Comparison**: Establishes simple RAG baseline before adding complexity
3. **Likely Sufficient with 768-dim Embeddings**: Working assumption that 768-dim embeddings capture enough nuance without prepended context; to be verified in retrieval tests
4. **Can Add Later**: If retrieval quality is insufficient, can reprocess with contextual windows

**Alternative Considered (Anthropic Contextual Retrieval):**
- 500-token core + 100-token LLM-generated contextual summary
- ~20% token overhead (100 context tokens per 500-token chunk)
- Deferred for future iteration if needed

---

## Embedding Strategy: High-Dimensional for Precise Retrieval

### Model Selection: `multi-qa-mpnet-base-dot-v1` (768 dimensions)

**Why High-Dimensional Embeddings for This Project?**

**Use Case Requirements:**
- **Exact wording retrieval** from SEC filings (not general summarization)
- **Fine-grained distinctions** between similar financial/legal terms
- **No overfitting concerns** (using pre-trained models, not training)
- **Jargon preservation** (e.g., "material adverse effect" vs "material impact")

**Model Comparison:**

| Model | Dimensions | Use Case | Decision |
|-------|-----------|----------|----------|
| **all-MiniLM-L6-v2** | 384 | General semantic similarity | ❌ Lower-dimensional; its default 256-word-piece input limit would truncate 500-token chunks |
| **all-mpnet-base-v2** | 768 | Best overall quality | ✅ Good option |
| **multi-qa-mpnet-base-dot-v1** | 768 | Question-answering retrieval | ✅✅ **SELECTED** |

**Why `multi-qa-mpnet-base-dot-v1`?**
1. **768 dimensions** → Captures fine-grained semantic distinctions
2. **Trained for Q&A tasks** → Perfect for "find exact wording about X" queries
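3. **Dot-product similarity** → The model is trained for dot-product scoring and its embeddings are not normalized, so the vector index should be configured for dot product (inner product) rather than cosine or L2 distance
4. **High quality** → MPNet-base architecture with roughly 110M parameters (the ~420 MB figure is the model's size on disk)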
5. **Exact retrieval** → Preserves legal/financial terminology precision
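
Because this model's embeddings are not normalized, dot product and cosine similarity can rank the same documents differently, so the scoring function configured in the vector store matters. The toy vectors below are invented purely to show the effect.

```python
import math
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


query = [1.0, 1.0]
docs = {
    "short_aligned": [1.0, 1.0],  # same direction as the query, small norm
    "long_off_axis": [4.0, 1.0],  # different direction, large norm
}

by_dot = max(docs, key=lambda name: dot(query, docs[name]))
by_cosine = max(docs, key=lambda name: cosine(query, docs[name]))
print(by_dot, by_cosine)  # long_off_axis short_aligned
```

Dot product rewards vector length as well as direction, which is what a dot-product-trained model expects; scoring its vectors with cosine or L2 distance may silently change the ranking.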

**Storage Impact:**
- **2.8M chunks × 768 dims × 4 bytes = ~8.6 GB** (embeddings only)
- Manageable for EC2 storage
- Quality improvement worth the 2x storage vs 384-dim
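
The storage estimate is a direct product of vector count, dimensionality, and 4 bytes per float32 value. The sketch below repeats it with the exact 2024 chunk total (2,842,198), which comes out slightly above the rounded 2.8M figure; index structures and metadata add further overhead that is not included.

```python
def embedding_bytes(n_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw size of a dense float matrix, excluding index and metadata overhead."""
    return n_vectors * dims * bytes_per_value


N_CHUNKS = 2_842_198  # total chunks reported for 2024

for dims in (384, 768):
    size = embedding_bytes(N_CHUNKS, dims)
    print(f"{dims}-dim: {size / 1e9:.2f} GB ({size / 2**30:.2f} GiB)")
# 384-dim: 4.37 GB (4.07 GiB)
# 768-dim: 8.73 GB (8.13 GiB)
```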

### Test Plan: 3-File Validation

**Test Files (Q4 2024):**
1. `20241024_10-Q_edgar_data_1318605_0001628280-24-043486.txt`
2. `20241030_10-Q_edgar_data_789019_0000950170-24-118967.txt`
3. `20241101_10-K_edgar_data_320193_0000320193-24-000123.txt`

**Test Objectives:**
1. Validate embedding generation pipeline
2. Test retrieval quality with high-dimensional embeddings
3. Verify metadata preservation
4. Benchmark performance before scaling to full 2024 dataset

**Expected Outputs:**
- `embeddings.parquet` → 768-dim vectors for all chunks in 3 files
- `metadata.parquet` → CIK, company, form, date, chunk_id for each chunk
- Test queries to evaluate retrieval precision
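
Retrieval precision for the test queries can be scored with a small helper once a handful of chunks have been labelled relevant by hand. The sketch below assumes that workflow; `precision_at_k` and the chunk IDs are hypothetical.

```python
from typing import Sequence, Set


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunk IDs that are labelled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = list(retrieved)[:k]
    return sum(1 for chunk_id in top_k if chunk_id in relevant) / k


# Hypothetical ranked results and hand-labelled relevant chunks for one query.
retrieved_ids = ["c7", "c2", "c9", "c4", "c1"]
relevant_ids = {"c2", "c4", "c5"}

print(precision_at_k(retrieved_ids, relevant_ids, k=3))
print(precision_at_k(retrieved_ids, relevant_ids, k=5))
```

Averaging this score over the test queries gives a single number that can be compared between embedding models or chunking variants.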

---

## Why High Dimensions Are Better For This Project

**1. Precision Over Generalization**
- **Low-dim (384)**: Adequate for general topics, but may blur subtle distinctions
- **High-dim (768)**: Intended to better separate phrasings such as "revenue decreased" vs "revenue declined slightly" vs "revenue fell sharply" (to be confirmed in the retrieval tests)

**2. No Overfitting Risk**
- Overfitting only matters when **training** models
- We use **pre-trained** embeddings for inference only
- Higher dimensions give more representational capacity, which can help retrieval, although quality depends on how the model was trained and not on dimension alone

**3. Financial/Legal Jargon Preservation**
- SEC filings use highly specific terminology
- "Subsequent event" vs "subsequent development" → legally distinct
- High-dim embeddings preserve these critical distinctions

**4. Storage Trade-off Is Acceptable**
- 384-dim: ~4.3 GB for 2.8M chunks
- 768-dim: ~8.6 GB for 2.8M chunks
- **2x storage for significantly better retrieval quality = worth it**

**5. Supports Complex Queries**
- "Find all instances where companies disclosed cybersecurity incidents in Q1 vs Q4"
- Requires distinguishing between similar but distinct concepts
- High-dimensional space enables precise matching

---

## Research-Backed Embedding Selection

**Sentence-BERT Foundation (2019):**
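- Reimers & Gurevych: SBERT cuts the search for the most similar pair among 10,000 sentences from about 65 hours with BERT to about 5 seconds
- Source: https://arxiv.org/abs/1908.10084
- The `sentence-transformers` MPNet models apply this SBERT-style sentence-embedding training to a pre-trained MPNet encoder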

**MPNet: Masked and Permuted Pre-training (2020):**
- Microsoft Research: MPNet outperforms BERT, RoBERTa, XLNet on GLUE/SQuAD
- Source: https://arxiv.org/abs/2004.09297
- Provides a strong pre-trained encoder for the sentence-embedding models considered here

**MTEB Benchmark (2022):**
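- MTEB covers 8 embedding task types across 58 datasets, and `all-mpnet-base-v2` is among the models evaluated in the paper; rankings shift as new models are added, so check the current leaderboard for its relative position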
- Source: https://arxiv.org/abs/2210.07316
- Leaderboard: https://huggingface.co/spaces/mteb/leaderboard

**Multi-QA Training:**
- Fine-tuned on question-answer pairs from Stack Exchange, Yahoo Answers, etc.
- Optimized for: "given a question, find the best passage"
- Perfect match for RAG retrieval: "given a query, find exact wording in filings"

---

## Next Steps (Embedding Phase)

**Immediate (Test on 3 files):**
1. Create `src/models/embedding_generator.py` script
2. Load 3 test files from `/app/data/processed/2024/QTR4/`
3. Generate embeddings using `multi-qa-mpnet-base-dot-v1`
4. Save to `/app/data/embeddings/test/`
5. Validate retrieval quality with sample queries
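
A possible shape for the batching logic in `embedding_generator.py` is sketched below. The encoder is injected as a function so the example runs without downloading a model; `embed_in_batches` and `fake_encode` are hypothetical, and in a real run `encode_fn` might be the `encode` method of `SentenceTransformer("sentence-transformers/multi-qa-mpnet-base-dot-v1")`.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

EncodeFn = Callable[[List[str]], Sequence[Sequence[float]]]


def embed_in_batches(
    texts: List[str], encode_fn: EncodeFn, batch_size: int = 64
) -> Iterator[Tuple[int, Sequence[float]]]:
    """Yield (index, vector) pairs, encoding batch_size texts at a time."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        vectors = encode_fn(batch)
        if len(vectors) != len(batch):
            raise ValueError("encoder returned a different number of vectors than texts")
        for offset, vector in enumerate(vectors):
            yield start + offset, vector


# Stand-in encoder: a 2-dim "embedding" based on text length.
def fake_encode(batch: List[str]) -> List[List[float]]:
    return [[float(len(text)), 0.0] for text in batch]


pairs = list(embed_in_batches(["alpha", "be", "gam"], fake_encode, batch_size=2))
print(pairs)  # [(0, [5.0, 0.0]), (1, [2.0, 0.0]), (2, [3.0, 0.0])]
```

Yielding the chunk index alongside each vector keeps embeddings aligned with their metadata rows when both are later written out.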

**After Test Success (Scale to Full 2024):**
1. Process all 26,014 files in 2024
2. Generate ~2.8M embeddings
3. Store in `/app/data/embeddings/2024/` (organized by quarter)
4. Set up ChromaDB or FAISS for similarity search
5. Benchmark retrieval performance

**Future (Full 1993-2024 Dataset):**
1. Scale embedding generation to all 31 years
2. Implement RAPTOR clustering on embeddings
3. Generate hierarchical summaries
4. Deploy production RAG system with Open WebUI

---