# Prototype Plan: RAPTOR RAG with 2024 SEC Filings

## Overview
This notebook outlines the step-by-step plan for prototyping the RAPTOR RAG system using:
- **Data**: All 2024 SEC 10-K/10-Q filings (~26K filings)
- **Models**: Test with both `gpt-oss` (13 GB) and `llama3-sec` (49 GB) via Ollama
- **Goal**: Validate complete RAPTOR pipeline before scaling to full 51 GB dataset

## Why 2024 Data Only?

- **Previous approach:** 1,375 sample filings spread across 1993-2024
- **New approach:** All ~26K filings from 2024

**Rationale:**
- **More data = better clustering**: 26K filings vs 1,375 samples (19x more data)
- **Temporal consistency**: Same regulatory environment, accounting standards, economic conditions
- **Better testing**: Can answer "compare Apple vs Microsoft 2024 risks" type queries
- **Cleaner baseline**: Avoids format/style drift across 30+ years during prototyping
- **Statistical robustness**: More documents for RAPTOR hierarchical clustering

**Archive location:** Multi-year prototype archived in `archive_v1_multi_year/`

---

## Prototype Objectives

1. **Process all 2024 filings** - Extract, clean, chunk 26K filings
2. **Test RAPTOR hierarchical clustering** on substantial dataset
3. **Compare model performance** - gpt-oss vs llama3-sec for summarization
4. **Verify recursive summarization** quality (3 levels)
5. **Build query interface** for retrieving and answering questions
6. **Measure performance** (speed, quality, resource usage)
7. **Identify issues** before production deployment

---

## Data Scope

**Source:** `data/external/10-X_C_2024.zip`
- **Time period:** Full year 2024 (Q1-Q4)
- **Total filings:** 26,018
- **Compressed size:** 1.6 GB
- **Estimated uncompressed:** 5-8 GB
- **Form types:** 10-K, 10-Q, 10-K/A, 10-Q/A, 10-QT

**Expected processing output:**
- **Chunk size:** 500 tokens (validated optimal from FinGPT research)
- **Total chunks:** ~2.9M (26K filings × ~111 chunks/filing)
- **Embedding storage:** ~4.4 GB (2.9M × 384-dim float32 embeddings from `all-MiniLM-L6-v2`)
- **Processing time:** 1.5-3 hours for chunking + embedding (Step 2: 30-60 minutes, Step 3: 1-2 hours)
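
The chunk count and storage figures above are simple products, so they are easy to re-derive when an input changes. The sketch below is a back-of-envelope calculation only: it assumes float32 embeddings stored as a dense array and uses the plan's average of ~111 chunks per filing. The 384-dim case corresponds to `all-MiniLM-L6-v2` (the model used in Step 3); 1536 is included only for comparison with a larger embedding model.

```python
# Back-of-envelope sizing for the 2024 prototype (inputs taken from this plan).
N_FILINGS = 26_018        # total filings in the 2024 archive
CHUNKS_PER_FILING = 111   # approximate average, an estimate rather than a measurement
BYTES_PER_FLOAT32 = 4


def estimate_chunks(n_filings: int, chunks_per_filing: int) -> int:
    """Approximate total number of chunks."""
    return n_filings * chunks_per_filing


def embedding_gb(n_chunks: int, dim: int) -> float:
    """Size of a dense float32 embedding matrix in decimal gigabytes."""
    return n_chunks * dim * BYTES_PER_FLOAT32 / 1e9


total_chunks = estimate_chunks(N_FILINGS, CHUNKS_PER_FILING)
print(f"chunks: {total_chunks:,}")
for dim in (384, 1536):
    print(f"{dim}-dim float32: {embedding_gb(total_chunks, dim):.1f} GB")
```

Re-running this with the measured chunk count after Step 2 gives a firmer storage figure before embeddings are generated.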

---

## Current Status

### ✅ Step 1: Archive Multi-Year Prototype
- Moved previous work to `archive_v1_multi_year/`
- Renamed files: `01_prototype_plan_multi_year.ipynb`, `02_text_processing_multi_year.ipynb`
- Preserved all 12 chunk size outputs (200-8000 tokens)
- Key finding validated: 500-1000 tokens optimal for SEC filings

### ⏳ Step 2: Text Processing (READY TO RUN)
**Notebook:** `02_text_processing.ipynb`

**Status:** Notebook created, ready to run

**Tasks:**
1. Extract all 26K filings from `10-X_C_2024.zip`
2. Parse SRAF-XML format (metadata + clean text)
3. Chunk at **500 tokens** with 50 token overlap (10%)
4. Add contextual headers (company, CIK, form type, date)
5. Export to `output/processed_2024_500tok.json`

**Expected output:**
- ~2.9M contextual chunks
- Avg ~111 chunks per filing
- JSON file size: ~6-8 GB

**Estimated time:** 30-60 minutes
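
Tasks 3 and 4 (overlapping 500-token chunks with contextual headers) can be sketched as follows. This is a minimal illustration, not the code in `02_text_processing.ipynb`: whitespace splitting stands in for a real tokenizer, and the metadata keys (`company`, `cik`, `form_type`, `filing_date`) and example values are hypothetical names chosen for the sketch.

```python
from typing import Dict, List


def chunk_tokens(tokens: List[str], size: int = 500, overlap: int = 50) -> List[List[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


def add_header(chunk_text: str, meta: Dict[str, str]) -> str:
    """Prefix a chunk with filing context (key names are hypothetical)."""
    header = (
        f"[Company: {meta['company']} | CIK: {meta['cik']} | "
        f"Form: {meta['form_type']} | Date: {meta['filing_date']}]"
    )
    return f"{header}\n{chunk_text}"


# Illustrative input: 1,200 numbered "tokens" and placeholder metadata.
example_tokens = [str(i) for i in range(1200)]
example_meta = {
    "company": "Example Corp",
    "cik": "0000000000",
    "form_type": "10-K",
    "filing_date": "2024-02-01",
}
raw_chunks = chunk_tokens(example_tokens)
contextual_chunks = [add_header(" ".join(c), example_meta) for c in raw_chunks]
print(len(contextual_chunks), [len(c) for c in raw_chunks])
```

The header adds tokens on top of the 500-token body, which matters if the embedding model has a shorter input limit than the chunk size.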

---

## Remaining Pipeline Steps

### Step 3: Embedding Generation ⏸️
**Notebook:** `03_embedding_generation.ipynb` (to be created)

**Tasks:**
1. Load processed chunks from Step 2
2. Load Sentence Transformers model (`all-MiniLM-L6-v2`)
3. Generate embeddings for all ~2.9M chunks
4. Store embeddings as NumPy array
5. Measure embedding generation time and memory usage

**Implementation:**
```python
from sentence_transformers import SentenceTransformer
import json
import numpy as np

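# Load model
model = SentenceTransformer('all-MiniLM-L6-v2')

# all-MiniLM-L6-v2 truncates inputs beyond its max_seq_length (256 word pieces
# by default), so text past that point in a 500-token chunk is not embedded.
print(f"[INFO] Model max_seq_length: {model.max_seq_length}")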

# Load processed chunks
with open('output/processed_2024_500tok.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

# Extract all chunk texts
all_chunks = []
for filing in data:
    for chunk in filing['chunks']:
        all_chunks.append(chunk['text'])

print(f"[INFO] Generating embeddings for {len(all_chunks):,} chunks...")

# Generate embeddings in batches
batch_size = 1000
embeddings = model.encode(
    all_chunks,
    batch_size=batch_size,
    show_progress_bar=True,
    convert_to_numpy=True
)

# Save embeddings
np.save('output/embeddings_2024_500tok.npy', embeddings)
print(f"[OK] Saved embeddings: shape {embeddings.shape}")
```

**Output:** 
- NumPy array: shape `[~2.9M, 384]`
- File size: ~4.4 GB

**Success Criteria:**
- All chunks embedded successfully
- Dimension = 384 (all-MiniLM-L6-v2 output)
- Generation throughput ≥ ~400 chunks/second (needed to embed 2.9M chunks in under 2 hours)
- Memory usage < 16 GB during generation

**Estimated time:** 1-2 hours (for 2.9M chunks)
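
The 1-2 hour estimate implies a minimum embedding throughput, which can be checked on a small sample before committing to the full run. The sketch below is an assumption-laden helper, not part of the planned notebook: `required_throughput` and `measure_throughput` are hypothetical names, and the lambda is a stand-in for the real encoder.

```python
import time
from typing import Callable, List, Sequence


def required_throughput(n_chunks: int, hours: float) -> float:
    """Chunks per second needed to finish n_chunks within the given hours."""
    return n_chunks / (hours * 3600)


def measure_throughput(encode: Callable[[Sequence[str]], object],
                       sample: List[str], batch_size: int = 1000) -> float:
    """Time `encode` over `sample` in batches and return chunks per second."""
    start = time.perf_counter()
    for i in range(0, len(sample), batch_size):
        encode(sample[i:i + batch_size])
    elapsed = time.perf_counter() - start
    return len(sample) / elapsed if elapsed > 0 else float("inf")


target = required_throughput(2_900_000, hours=2)
print(f"needed: {target:.0f} chunks/sec")

# Stand-in encoder for illustration; pass the real model's encode method instead.
rate = measure_throughput(lambda batch: [len(t) for t in batch], ["text"] * 5000)
print(f"measured (stand-in): {rate:.0f} chunks/sec")
```

In the notebook, passing `model.encode` and a few thousand real chunks as `sample` would show whether the hardware can meet the target.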

---

### Step 4: RAPTOR Hierarchical Clustering ⏸️
**Notebook:** `04_raptor_clustering.ipynb` (to be created)

**Tasks:**
1. **Copy RAPTOR class from FinGPT** to `src/models/raptor.py`
   - Source: [FinGPT rag.py](https://github.com/AI4Finance-Foundation/FinGPT/blob/master/fingpt/FinGPT_FinancialReportAnalysis/utils/rag.py)
2. Load embeddings from Step 3
3. Implement global clustering (UMAP → GMM)
4. Implement local clustering within global clusters
5. Determine optimal cluster count using BIC
6. Visualize clusters (UMAP plot)
7. Validate cluster coherence (manual review)
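
For task 5, BIC trades model fit against parameter count: `BIC = k·ln(n) − 2·ln(L)`, and the candidate with the lowest value wins. The sketch below shows only that selection step in plain Python, assuming a full-covariance GMM and that total log-likelihoods for each candidate count are already available. The helper names and the example log-likelihoods are made up for illustration; scikit-learn's `GaussianMixture.bic(X)` computes the same criterion directly from a fitted model.

```python
import math
from typing import Dict


def gmm_param_count(n_components: int, dim: int) -> int:
    """Free parameters of a full-covariance GMM: means, covariances, weights."""
    means = n_components * dim
    covariances = n_components * dim * (dim + 1) // 2
    weights = n_components - 1
    return means + covariances + weights


def bic(log_likelihood: float, n_params: int, n_samples: int) -> float:
    """Bayesian information criterion; lower is better."""
    return n_params * math.log(n_samples) - 2.0 * log_likelihood


def best_component_count(log_likelihoods: Dict[int, float], dim: int, n_samples: int) -> int:
    """Return the component count whose fitted model has the lowest BIC."""
    scores = {
        k: bic(ll, gmm_param_count(k, dim), n_samples)
        for k, ll in log_likelihoods.items()
    }
    return min(scores, key=scores.get)


# Made-up total log-likelihoods, for illustration only (not measured results).
example_ll = {2: -5200.0, 3: -4700.0, 4: -4690.0}
print(best_component_count(example_ll, dim=2, n_samples=1000))
```

In this toy input, going from 3 to 4 components barely improves the likelihood, so the extra parameters are penalized and 3 is selected.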

**Implementation:**
```python
from src.models.raptor import RaptorRAG
import numpy as np

# Load embeddings
embeddings = np.load('output/embeddings_2024_500tok.npy')

# Initialize RAPTOR
raptor = RaptorRAG(
    embedding_model="all-MiniLM-L6-v2",
    llm_model="gpt-oss",
    ollama_host="localhost:11434"
)

# Perform hierarchical clustering
clusters = raptor.cluster_embeddings(
    embeddings=embeddings,
    dim=10,  # UMAP dimensions
    n_neighbors=10,
    metric="cosine"
)

print(f"[OK] Formed {len(set(clusters))} clusters")

# Save cluster assignments
np.save('output/clusters_2024_500tok.npy', clusters)
```

**Output:**
- Cluster assignments for each chunk
- Cluster metadata (size, topic keywords)
- UMAP visualization

**Success Criteria:**
- Clusters are semantically coherent (manual review of 20+ clusters)
- No single dominant cluster (>30% of chunks)
- Reasonable cluster count (50-500 for 2.9M chunks)
- Clear topic separation in UMAP plot

**Estimated time:** 3-4 hours (larger dataset than multi-year)
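
Two of the success criteria above (no cluster holding more than 30% of chunks, and a cluster count between 50 and 500) can be checked automatically before the manual review. A minimal sketch, assuming cluster assignments are a flat sequence of one integer label per chunk; `check_clusters` is a hypothetical helper name.

```python
from collections import Counter
from typing import Dict, Sequence


def check_clusters(labels: Sequence[int], max_share: float = 0.30,
                   min_clusters: int = 50, max_clusters: int = 500) -> Dict[str, object]:
    """Summarize cluster sizes and test them against the plan's thresholds."""
    counts = Counter(int(label) for label in labels)
    total = sum(counts.values())
    n_clusters = len(counts)
    largest_share = max(counts.values()) / total if total else 0.0
    return {
        "n_clusters": n_clusters,
        "largest_share": largest_share,
        "count_ok": min_clusters <= n_clusters <= max_clusters,
        "no_dominant_cluster": largest_share <= max_share,
    }


# Illustrative labels: 1,000 chunks spread evenly over 100 clusters.
balanced = check_clusters([i % 100 for i in range(1000)])
# Illustrative labels: one cluster holds 40 of 100 chunks.
skewed = check_clusters([0] * 40 + list(range(1, 61)))
print(balanced)
print(skewed)
```

The same function accepts the array loaded from `output/clusters_2024_500tok.npy`, provided it holds one label per chunk.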

---

### Step 5: Recursive Summarization (3 Levels) ⏸️
**Notebook:** `05_raptor_summarization.ipynb` (to be created)

**Tasks:**
1. Load chunks and cluster assignments
2. **Test both models:** gpt-oss vs llama3-sec
3. Generate **Level 1 summaries** (per-chunk)
4. Generate **Level 2 summaries** (cluster-level)
5. Generate **Level 3 summaries** (document-level)
6. Compare model quality (manual review)
7. Measure summarization time for each model

**Model Comparison:**

| Model | Size | Speed | Quality (Expected) |
|-------|------|-------|-------------------|
| gpt-oss | 13 GB | Fast (~2-3 sec/chunk) | Good |
| llama3-sec | 49 GB | Slower (~5-7 sec/chunk) | Excellent (fine-tuned for SEC) |

**Implementation:**
```python
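import ollama
import json
import time

# Load chunk texts from the Step 2 output so this notebook runs on its own
with open('output/processed_2024_500tok.json', 'r', encoding='utf-8') as f:
    data = json.load(f)
all_chunks = [chunk['text'] for filing in data for chunk in filing['chunks']]

# Test both models
models_to_test = ['gpt-oss', 'llama3-sec']

# Level 1: Summarize sample chunks (not all 2.9M - too expensive)
sample_chunks = all_chunks[:1000]  # Test on 1000 chunks

for model_name in models_to_test:
    print(f"\n[INFO] Testing {model_name}...")

    level1_summaries = []
    start = time.perf_counter()
    for i, chunk in enumerate(sample_chunks):
        if i % 100 == 0:
            print(f"  Processing chunk {i}/{len(sample_chunks)}...")

        # Built piece by piece so code indentation does not leak into the prompt
        prompt = (
            "Summarize this SEC filing excerpt in 2-3 sentences.\n"
            "Focus on key financial metrics, risks, and business updates.\n\n"
            f"Text:\n{chunk}\n\n"
            "Summary:"
        )

        response = ollama.chat(
            model=model_name,
            messages=[{"role": "user", "content": prompt}]
        )
        level1_summaries.append(response['message']['content'])

    # Record time per chunk for the model comparison
    elapsed = time.perf_counter() - start
    print(f"[INFO] {model_name}: {elapsed / max(len(sample_chunks), 1):.2f} sec/chunk")

    # Save summaries for comparison
    with open(f'output/level1_summaries_{model_name}_sample.json', 'w', encoding='utf-8') as f:
        json.dump(level1_summaries, f, indent=2)

    print(f"[OK] {model_name} summaries saved")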
```

**Output:** 
- 3-level summary hierarchy for each model
- Quality comparison report
- Performance metrics (time per chunk)

**Success Criteria:**
- Level 1 summaries capture chunk essence (no hallucination)
- Level 2 summaries coherently integrate themes
- Level 3 summaries provide accurate document overview
- llama3-sec outperforms gpt-oss in quality (expected)
- 90%+ summaries factually accurate (manual review)

**Estimated time:** 6-10 hours (testing both models on sample)

**Note:** Don't summarize all 2.9M chunks yet - test on subset (1K-10K chunks) first
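
The Level 2 (cluster-level) step can be sketched as grouping Level 1 summaries by cluster label and prompting once per cluster. This is an illustration under assumptions rather than the planned implementation: the function names, the prompt wording, and the `max_items` cap are hypothetical, and `generate` is any callable that maps a prompt to text, so the example runs with a stub instead of a live model.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence


def group_by_cluster(summaries: Sequence[str], labels: Sequence[int]) -> Dict[int, List[str]]:
    """Collect Level 1 summaries under their cluster label."""
    if len(summaries) != len(labels):
        raise ValueError("summaries and labels must have the same length")
    groups: Dict[int, List[str]] = defaultdict(list)
    for summary, label in zip(summaries, labels):
        groups[int(label)].append(summary)
    return dict(groups)


def build_cluster_prompt(members: List[str], max_items: int = 20) -> str:
    """Build one prompt per cluster, capped so it stays within the context window."""
    bullet_list = "\n".join(f"- {m}" for m in members[:max_items])
    return (
        "The following are summaries of related SEC filing excerpts.\n"
        "Write one paragraph that integrates their common themes.\n\n"
        f"{bullet_list}\n\nSummary:"
    )


def summarize_clusters(summaries: Sequence[str], labels: Sequence[int],
                       generate: Callable[[str], str]) -> Dict[int, str]:
    """Return one Level 2 summary per cluster."""
    groups = group_by_cluster(summaries, labels)
    return {cid: generate(build_cluster_prompt(members)) for cid, members in groups.items()}


# Stub generator for illustration: reports how many bullet lines it received.
def stub_generate(prompt: str) -> str:
    return f"{sum(line.startswith('- ') for line in prompt.splitlines())} items summarized"


level2 = summarize_clusters(["a", "b", "c"], [0, 1, 0], stub_generate)
print(level2)
```

To use a real model, `generate` can wrap the `ollama.chat` call from the implementation above and return `response['message']['content']`.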

---

### Step 6: Vector Database Setup (ChromaDB) ⏸️
**Notebook:** `06_chromadb_setup.ipynb` (to be created)

**Tasks:**
1. Initialize ChromaDB
2. Create collection for 2024 SEC filings
3. Store chunks + embeddings + metadata
4. Store summaries from chosen model
5. Test similarity search
6. Benchmark query performance

**Implementation:**
```python
import chromadb
import json
import numpy as np

# Initialize ChromaDB
client = chromadb.PersistentClient(path="../../data/embeddings/chromadb")

# Create collection (get_or_create avoids an error when the notebook is re-run)
collection = client.get_or_create_collection(
    name="sec_filings_2024_raptor",
    metadata={"description": "2024 SEC 10-K/10-Q filings with RAPTOR (500 token chunks)"}
)

# Load data
with open('output/processed_2024_500tok.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

# Flatten chunk texts in the same order used for embedding in Step 3
all_chunks = [chunk['text'] for filing in data for chunk in filing['chunks']]

embeddings = np.load('output/embeddings_2024_500tok.npy')
clusters = np.load('output/clusters_2024_500tok.npy')
assert len(all_chunks) == len(embeddings) == len(clusters), "chunk/embedding/cluster counts differ"

# Add to ChromaDB in batches (2.9M is large)
batch_size = 5000

for batch_start in range(0, len(all_chunks), batch_size):
    batch_end = min(batch_start + batch_size, len(all_chunks))

    batch_docs = all_chunks[batch_start:batch_end]
    batch_embeddings = embeddings[batch_start:batch_end]
    batch_ids = [f"chunk_{i}" for i in range(batch_start, batch_end)]

    # Cluster label from Step 4 (assumes one integer label per chunk);
    # add further batch metadata here...
    batch_metadatas = [{"cluster": int(c)} for c in clusters[batch_start:batch_end]]

    collection.add(
        embeddings=batch_embeddings.tolist(),
        documents=batch_docs,
        metadatas=batch_metadatas,
        ids=batch_ids
    )

    if batch_start % 50000 == 0:
        print(f"[Progress] Added {batch_end:,} / {len(all_chunks):,} chunks")

print(f"[OK] Added {len(all_chunks):,} chunks to ChromaDB")
```

**Output:** ChromaDB database in `data/embeddings/chromadb/`

**Success Criteria:**
- All 2.9M chunks stored successfully
- Semantic search returns relevant results
- Query time < 2 seconds for top-10 retrieval
- Metadata correctly attached

**Estimated time:** 3-4 hours (large dataset)
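
Attaching metadata requires keeping each chunk's filing context aligned with its ID. One way to do that is sketched below, as an assumption rather than the planned implementation: apart from `chunks` and `text`, the JSON keys (`company`, `cik`, `form_type`, `filing_date`) are hypothetical, as are the helper names and the example filings. The generator yields parallel `ids`, `documents`, and `metadatas` lists sized for one batch.

```python
from typing import Dict, Iterator, List, Tuple

Batch = Tuple[List[str], List[str], List[Dict[str, str]]]
META_KEYS = ("company", "cik", "form_type", "filing_date")  # hypothetical key names


def iter_batches(data: List[dict], batch_size: int = 5000) -> Iterator[Batch]:
    """Yield (ids, documents, metadatas) in filing order, batch_size chunks at a time."""
    ids: List[str] = []
    docs: List[str] = []
    metas: List[Dict[str, str]] = []
    idx = 0
    for filing in data:
        # Copy filing-level fields onto every chunk; missing fields become "".
        meta = {key: str(filing.get(key, "")) for key in META_KEYS}
        for chunk in filing["chunks"]:
            ids.append(f"chunk_{idx}")
            docs.append(chunk["text"])
            metas.append(dict(meta))
            idx += 1
            if len(ids) == batch_size:
                yield ids, docs, metas
                ids, docs, metas = [], [], []
    if ids:
        yield ids, docs, metas  # final partial batch


# Illustrative data with placeholder values.
example_data = [
    {"company": "Example Corp", "cik": "0000000000", "form_type": "10-K",
     "filing_date": "2024-02-01", "chunks": [{"text": "a"}, {"text": "b"}]},
    {"company": "Sample Inc", "cik": "0000000001", "form_type": "10-Q",
     "filing_date": "2024-05-01", "chunks": [{"text": "c"}]},
]
example_batches = list(iter_batches(example_data, batch_size=2))
print(example_batches)
```

Because IDs follow the flattened filing order, they line up with the embedding rows only if Step 3 flattened the chunks in the same order.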

---

### Step 7: RAG Query Interface ⏸️
**Notebook:** `07_rag_query_interface.ipynb` (to be created)

**Tasks:**
1. Build complete query pipeline: retrieve → augment → generate
2. Implement cluster-aware retrieval (RAPTOR)
3. Test with both models (gpt-oss vs llama3-sec)
4. Test diverse sample queries
5. Evaluate answer quality
6. Measure end-to-end latency
7. Compare RAPTOR vs simple RAG

**Test Queries (2024-specific):**
1. "What are the main risk factors disclosed in 2024 10-Ks?"
2. "Compare revenue trends across tech companies in 2024."
3. "What cybersecurity risks did companies disclose in 2024?"
4. "How did companies describe AI impacts in their 2024 filings?"
5. "What supply chain issues were mentioned in Q1 2024?"
6. "Summarize regulatory risks disclosed in 2024."
7. "What are common executive compensation structures in 2024?"
8. "How did companies discuss inflation in 2024 filings?"
9. "What climate-related risks were disclosed in 2024?"
10. "Compare Apple vs Microsoft 2024 risk factors."

**Success Criteria:**
- Answers factually accurate (90%+ on manual review)
- Citations reference correct filings
- End-to-end query time < 10 seconds
- Cluster-aware retrieval improves context
- No hallucinations

**Estimated time:** 3-4 hours
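
The retrieve → augment → generate flow from task 1 can be sketched with the retriever and the LLM passed in as plain callables, so the same function works for either model and for both RAPTOR and simple RAG retrieval. Everything here is illustrative: the function names, the chunk layout (`source` and `text` keys), the prompt wording, and the stub retriever and generator are assumptions, not the planned interface.

```python
from typing import Callable, Dict, List

Chunk = Dict[str, str]  # hypothetical layout: {"source": ..., "text": ...}


def build_prompt(question: str, chunks: List[Chunk]) -> str:
    """Augment the question with numbered excerpts so answers can cite them."""
    context = "\n\n".join(
        f"[{i}] ({c['source']})\n{c['text']}" for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the numbered excerpts below. "
        "Cite excerpts by number, and say so if they do not contain the answer.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )


def answer(question: str,
           retrieve: Callable[[str, int], List[Chunk]],
           generate: Callable[[str], str],
           top_k: int = 10) -> Dict[str, object]:
    """Run retrieve -> augment -> generate and return the answer with its sources."""
    chunks = retrieve(question, top_k)
    reply = generate(build_prompt(question, chunks))
    return {"answer": reply, "sources": [c["source"] for c in chunks]}


# Stubs for illustration; real versions would query ChromaDB and call Ollama.
def stub_retrieve(question: str, top_k: int) -> List[Chunk]:
    corpus = [
        {"source": "filing_A", "text": "Risk factors include supply chain delays."},
        {"source": "filing_B", "text": "Cybersecurity incidents could disrupt operations."},
    ]
    return corpus[:top_k]


def stub_generate(prompt: str) -> str:
    return "Supply chain delays [1] and cybersecurity incidents [2]."


result = answer("What risks were disclosed?", stub_retrieve, stub_generate, top_k=2)
print(result)
```

Returning the sources alongside the answer makes the "citations reference correct filings" criterion checkable during manual review.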

---

## Performance Metrics to Track

### Measure and Record:

1. **Data Processing** (Step 2)
   - Text extraction: 26K filings
   - Chunking time
   - Total chunks: ~2.9M
   - Output file size

2. **Embedding Generation** (Step 3)
   - Embeddings per second
   - Total embedding time
   - Memory usage peak

3. **RAPTOR Clustering** (Step 4)
   - Clustering time (UMAP + GMM)
   - Number of clusters formed
   - Cluster size distribution
   - Coherence score (manual)

4. **Summarization** (Step 5)
   - **gpt-oss:** Time per chunk, quality score
   - **llama3-sec:** Time per chunk, quality score
   - Model comparison results
   - Winner for production use

5. **Query Performance** (Step 7)
   - Retrieval time (ChromaDB)
   - LLM generation time (both models)
   - End-to-end latency
   - Answer quality (1-5 scoring)
   - RAPTOR vs simple RAG comparison

### Expected Performance:
- Total processing time: roughly 14-21 hours for 26K filings (sum of the Step 2-6 estimates)
- Query response time: < 10 seconds
- Memory usage: < 32 GB RAM during processing
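
Timing and peak memory for each step can be recorded with one small wrapper so the numbers end up in a single place. A minimal sketch using only the standard library; `track` and `metrics` are hypothetical names. Note the assumption baked in: `tracemalloc` reports memory allocated through Python's allocator, so memory used by native libraries may not be fully counted, and process-level monitoring is a better check against the 32 GB limit.

```python
import time
import tracemalloc
from contextlib import contextmanager
from typing import Dict, Iterator

metrics: Dict[str, Dict[str, float]] = {}


@contextmanager
def track(step: str) -> Iterator[None]:
    """Record wall-clock seconds and peak traced memory for a pipeline step.

    Not designed for nesting: the inner block would stop tracing for the outer one.
    """
    tracemalloc.start()
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        _, peak_bytes = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        metrics[step] = {"seconds": elapsed, "peak_mb": peak_bytes / 1e6}


# Illustrative workload standing in for a real pipeline step.
with track("demo_step"):
    buffer = [0] * 1_000_000

print(metrics)
```

Wrapping each notebook's main cell in `with track("step_3_embeddings"):` (or similar) would give comparable figures across steps.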

---

## Validation Checklist

### Before Moving to Full Dataset:

**Data Quality**
- [ ] All 26K filings processed successfully (>95%)
- [ ] Metadata correctly parsed
- [ ] Chunks maintain semantic coherence
- [ ] No major data quality issues

**RAPTOR System**
- [ ] Clustering produces interpretable topics
- [ ] Level 1-3 summaries accurate and useful
- [ ] Hierarchical structure adds value
- [ ] Cluster coherence validated

**RAG Pipeline**
- [ ] ChromaDB stores/retrieves correctly
- [ ] Similarity search returns relevant chunks
- [ ] LLM generates accurate answers
- [ ] Citations work correctly

**Performance**
- [ ] Query latency acceptable (< 10 sec)
- [ ] Memory usage within limits (< 32 GB)
- [ ] No crashes or errors
- [ ] System stable under load

**Quality**
- [ ] 20+ query responses verified accurate
- [ ] Edge cases tested (long filings, unusual formats)
- [ ] RAPTOR outperforms simple RAG
- [ ] Model comparison complete (choose winner)

---

## Model Selection Decision

**After Step 5, choose production model:**

**If gpt-oss is sufficient:**
- Pros: Faster (2-3 sec/chunk), smaller (13 GB), good quality
- Cons: Not fine-tuned for SEC filings
- Use case: Quick prototyping, resource-constrained

**If llama3-sec is better:**
- Pros: Fine-tuned for SEC, excellent quality, better understanding
- Cons: Slower (5-7 sec/chunk), larger (49 GB), more resources
- Use case: Production deployment, best quality needed

**Decision criteria:**
1. Manual quality review (10+ summaries each model)
2. Accuracy on test queries
3. Hallucination rate
4. Speed vs quality tradeoff
5. Available compute resources

---

## Success Definition

**Prototype is successful if:**

1. **End-to-end pipeline runs** on 26K 2024 filings (>95% success rate)
2. **RAPTOR clustering coherent** (manual validation of 20+ clusters, as in Step 4)
3. **Summaries accurate** at all 3 levels (90%+ accuracy)
4. **Query responses correct** and well-cited (90%+ accuracy, as in Step 7)
5. **Performance acceptable** (< 10 sec query, < 32 GB RAM)
6. **Clear model winner** identified for production
7. **Measurable improvement** over simple RAG

**If successful → proceed to full dataset (51 GB) + EC2 deployment**

**If issues found → iterate on 2024 data until resolved**

---

## Next Steps After Prototype

### If Prototype Succeeds:

1. **Document lessons learned**
   - Optimal parameters (chunk size: 500, UMAP dims, etc.)
   - Winning model (gpt-oss vs llama3-sec)
   - Performance characteristics
   - Known issues and workarounds

2. **Containerize pipeline**
   - Ollama container with chosen model
   - RAPTOR API (FastAPI)
   - ChromaDB container
   - Open WebUI
   - Docker Compose setup

3. **Prepare for EC2 deployment**
   - Provision r6i.4xlarge (128 GB RAM, 16 vCPUs)
   - Attach 500 GB EBS volume
   - Security group configuration
   - GPU instance if needed for embeddings

4. **Scale to full dataset**
   - Process all 51 GB (1993-2024)
   - Generate embeddings for full dataset
   - Full RAPTOR clustering
   - Production ChromaDB setup

5. **Deploy to production**
   - Launch Docker containers on EC2
   - Configure Open WebUI
   - Set up monitoring
   - Create backup strategy

### Timeline:
- **Step 2 (Processing):** 30-60 min
- **Step 3 (Embeddings):** 1-2 hours
- **Step 4 (Clustering):** 3-4 hours
- **Step 5 (Summarization):** 6-10 hours
- **Step 6 (ChromaDB):** 3-4 hours
- **Step 7 (Query Interface):** 3-4 hours
- **Validation:** 1-2 days
- **Total:** 3-4 days for prototype

**Then:**
- **Docker Setup:** 2-3 days
- **EC2 Deployment:** 1 week
- **Full Dataset Processing:** 2-3 weeks

**Total time to production: 4-5 weeks**

---

## Resources

### Key Files:
- `02_text_processing.ipynb` - Text extraction and chunking (ready to run)
- `archive_v1_multi_year/` - Previous multi-year prototype
- `data/external/10-X_C_2024.zip` - 2024 SEC filings source

### To Be Created:
- `03_embedding_generation.ipynb`
- `04_raptor_clustering.ipynb`
- `05_raptor_summarization.ipynb`
- `06_chromadb_setup.ipynb`
- `07_rag_query_interface.ipynb`
- `src/models/raptor.py`

### References:
- [FinGPT RAPTOR](https://github.com/AI4Finance-Foundation/FinGPT/blob/master/fingpt/FinGPT_FinancialReportAnalysis/utils/rag.py)
- [ChromaDB Docs](https://docs.trychroma.com/)
- [Sentence Transformers](https://www.sbert.net/)
- [Ollama Python](https://github.com/ollama/ollama-python)

---

**Status:** ✅ Archive Complete | ⏳ Ready for Step 2 (Text Processing)

**Last Updated:** 2025-10-15