# Prototype Plan: RAPTOR RAG with Sample Data + gpt-oss

## Overview
This notebook outlines the step-by-step plan for prototyping the RAPTOR RAG system using:
- **Sample data**: 1,375 SEC 10-K/10-Q filings (already processed)
- **Model**: gpt-oss (13 GB) running locally via Ollama
- **Goal**: Validate the complete RAPTOR pipeline before scaling to the full 51 GB dataset

---

## Prototype Objectives

1. ✅ **Data processing pipeline validated** - 1,375 filings processed (see `02_text_processing.ipynb`)
2. **Test RAPTOR hierarchical clustering** on sample documents
3. **Verify recursive summarization** quality with gpt-oss
4. **Build query interface** for retrieving and answering questions
5. **Measure performance** (speed, quality, resource usage)
6. **Identify issues** before production deployment

---

## Current Status

### ✅ Steps 1-2: COMPLETED (see `02_text_processing.ipynb`)

**What's been done:**
- ✅ Processed 1,375 sample SEC filings from `eda/samples/`
- ✅ Extracted metadata (CIK, company name, filing date, form type)
- ✅ Cleaned text (removed SRAF wrappers, HTML/XML tags)
- ✅ Tested **12 different chunk sizes** (200-8000 tokens)
- ✅ Created contextual chunks with document metadata
- ✅ Exported 12 JSON files (one per chunk size) + comparison CSV

**Key Results:**
- Total files: 1,375 filings
- Chunk sizes tested: 200, 300, 400, 500, 750, 1000, 1500, 2000, 3000, 4000, 5000, 8000 tokens
- FinGPT-recommended range: 500-1000 tokens (still to be confirmed on this data in the remaining steps)
- Files available: `output/processed_samples_{size}tok.json`

**Sample Statistics (500 tokens - FinGPT optimal):**
- Total chunks: 153,207
- Avg chunks/filing: 111.4
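- Storage requirement: 897.70 MB (embeddings; this figure matches 1,536-dimensional float32 vectors, whereas the 384-dimensional `all-MiniLM-L6-v2` embeddings planned in Step 3 need about 224 MB)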

**Next decision:** Choose the optimal chunk size(s) for the remainder of the pipeline (Step 3 onwards)

---

## Remaining Pipeline Steps

### Step 3: Embedding Generation ⏳ NEXT
**Notebook**: `03_embedding_generation.ipynb` (to be created)

**Tasks**:
1. Select chunk size(s) for embedding (recommended: 500, 1000, 2000 tokens)
2. Load Sentence Transformers model (`all-MiniLM-L6-v2`)
3. Generate embeddings for selected chunk size(s)
4. Store embeddings with chunk IDs
5. Measure embedding generation time and memory usage

**Implementation**:
```python
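from sentence_transformers import SentenceTransformer
import json
import numpy as np

# Load model (produces 384-dimensional vectors)
model = SentenceTransformer('all-MiniLM-L6-v2')

# Load processed chunks (choose chunk size)
chunk_size = 500  # or 1000, 2000
with open(f'output/processed_samples_{chunk_size}tok.json', 'r') as f:
    data = json.load(f)

# Extract all chunk texts and build a parallel list of chunk IDs
# (row i of the embedding array corresponds to chunk_ids[i])
all_chunks = []
chunk_ids = []
for filing in data:
    for chunk in filing['chunks']:
        all_chunks.append(chunk['text'])
        chunk_ids.append(f"{filing['file_name']}::{chunk['chunk_id']}")

# Generate embeddings
print(f"Generating embeddings for {len(all_chunks)} chunks...")
embeddings = model.encode(all_chunks, batch_size=64, show_progress_bar=True)

# Save embeddings together with their chunk IDs
# (the chunk_ids file name is an assumed convention, not an existing artifact)
np.save(f'output/embeddings_{chunk_size}tok.npy', embeddings)
with open(f'output/chunk_ids_{chunk_size}tok.json', 'w') as f:
    json.dump(chunk_ids, f)
print(f"Saved embeddings: shape {embeddings.shape}")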
```

**Output**: NumPy array with embeddings (shape: [num_chunks, 384])

**Success Criteria**:
- Embeddings generated for all chunks
- Dimension = 384 (all-MiniLM-L6-v2 output size)
- Generation time well under 1 second per chunk (the 30-60 minute estimate below requires roughly 43-85 chunks per second)
- Memory usage < 8 GB during generation

**Estimated time**: 30-60 minutes (for 153K chunks at 500 tokens)
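
Task 5 asks for timing measurements. A minimal sketch of how throughput could be measured on a small sample and projected to the full corpus is shown here. The names `measure_throughput`, `projected_minutes`, and `encode_fn` are hypothetical; in the notebook, `encode_fn` would be something like `model.encode`.

```python
import time

def measure_throughput(encode_fn, texts, batch_size=64):
    """Run encode_fn over texts in batches and return chunks per second."""
    start = time.perf_counter()
    for i in range(0, len(texts), batch_size):
        encode_fn(texts[i:i + batch_size])
    elapsed = time.perf_counter() - start
    return len(texts) / elapsed if elapsed > 0 else float("inf")

def projected_minutes(total_chunks, chunks_per_second):
    """Project the wall-clock minutes needed to embed total_chunks."""
    return total_chunks / chunks_per_second / 60
```

Timing a few thousand chunks and passing the measured rate to `projected_minutes(153_207, rate)` gives an early indication of whether the 30-60 minute estimate is realistic on the target machine.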

---

### Step 4: RAPTOR Hierarchical Clustering ⏸️
**Notebook**: `04_raptor_clustering.ipynb` (to be created)

**Tasks**:
1. **Copy RAPTOR class from FinGPT** to `src/models/raptor.py`
   - Source: https://github.com/AI4Finance-Foundation/FinGPT/blob/master/fingpt/FinGPT_FinancialReportAnalysis/utils/rag.py
2. Load embeddings from Step 3
3. Implement global clustering (UMAP → GMM)
4. Implement local clustering within global clusters
5. Determine optimal cluster count using BIC (Bayesian Information Criterion)
6. Visualize clusters (t-SNE or UMAP plot)
7. Validate cluster coherence (manually review sample chunks from each cluster)
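
Task 5 (choosing the cluster count with BIC) can be sketched with scikit-learn's `GaussianMixture`. This is an illustration under assumptions, not the FinGPT code: `select_n_components` is a hypothetical helper, and the synthetic blobs stand in for the UMAP-reduced embeddings.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_n_components(X, candidates, seed=0):
    """Fit one GMM per candidate count and return (best_k, {k: bic})."""
    bics = {}
    for k in candidates:
        gmm = GaussianMixture(n_components=k, random_state=seed).fit(X)
        bics[k] = gmm.bic(X)  # lower BIC = better fit/complexity trade-off
    best_k = min(bics, key=bics.get)
    return best_k, bics

# Synthetic stand-in for reduced embeddings: three well-separated blobs
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in centers])

best_k, bics = select_n_components(X, candidates=range(1, 7))
print(best_k, {k: round(v, 1) for k, v in bics.items()})
```

On the real data, fitting many candidate counts on 153K points can be slow, so it may be worth running the search on the low-dimensional UMAP output and over a coarse grid of candidates first.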

**Implementation**:
```python
from src.models.raptor import RaptorRAG
import numpy as np

# Load embeddings
embeddings = np.load('output/embeddings_500tok.npy')

# Initialize RAPTOR
raptor = RaptorRAG(
    embedding_model="all-MiniLM-L6-v2",
    llm_model="gpt-oss",
    ollama_host="localhost:11434"
)

# Perform hierarchical clustering
clusters = raptor.cluster_embeddings(
    embeddings=embeddings,
    dim=10,  # UMAP dimensions
    n_neighbors=10,  # UMAP neighbors
    metric="cosine"  # Distance metric
)

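print(f"Formed {len(set(clusters))} clusters")

# Save cluster assignments so later steps can load them
np.save('output/clusters_500tok.npy', np.asarray(clusters))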

# Analyze cluster distribution
from collections import Counter
cluster_counts = Counter(clusters)
print(f"Cluster sizes: min={min(cluster_counts.values())}, "
      f"max={max(cluster_counts.values())}, "
      f"avg={sum(cluster_counts.values())/len(cluster_counts):.1f}")
```

**Output**: 
- Cluster assignments for each chunk
- Cluster metadata (size, topic keywords)
- Visualization plots

**Success Criteria**:
- Clusters are semantically coherent (manual review of 10+ clusters)
- No single dominant cluster (>50% of chunks)
- Cluster count reasonable (10-100 clusters for 153K chunks)
- Clear topic separation visible in UMAP plot
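
The two numeric criteria above (no dominant cluster, cluster count in range) can be checked automatically. A minimal sketch follows; `check_cluster_distribution` is a hypothetical helper whose defaults mirror the thresholds in this list.

```python
from collections import Counter

def check_cluster_distribution(labels, max_share=0.5, min_clusters=10, max_clusters=100):
    """Summarize cluster sizes and test them against the success criteria."""
    counts = Counter(int(label) for label in labels)
    total = sum(counts.values())
    largest_id, largest_size = counts.most_common(1)[0]
    share = largest_size / total
    return {
        "n_clusters": len(counts),
        "largest_cluster": largest_id,
        "largest_share": share,
        "no_dominant_cluster": share <= max_share,
        "count_in_range": min_clusters <= len(counts) <= max_clusters,
    }
```

Calling `check_cluster_distribution(clusters)` right after clustering gives a quick pass/fail signal before the manual coherence review.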

**Estimated time**: 2-3 hours

---

### Step 5: Recursive Summarization (3 Levels) ⏸️
**Notebook**: `05_raptor_summarization.ipynb` (to be created)

**Tasks**:
1. Load chunks and cluster assignments from Steps 3-4
2. Generate **Level 1 summaries** (per-chunk summaries using gpt-oss)
3. Generate **Level 2 summaries** (cluster-level summaries)
4. Generate **Level 3 summaries** (document-level summaries)
5. Test summarization quality (manual review of samples)
6. Measure summarization time and gpt-oss performance

**Implementation**:
```python
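import ollama
import json
import numpy as np

# Initialize Ollama client
ollama_client = ollama.Client()

# Load chunks (Step 2 output) and cluster assignments (Step 4 output),
# in the same filing-by-filing order used when the embeddings were generated
with open('output/processed_samples_500tok.json', 'r') as f:
    data = json.load(f)
all_chunks = [chunk['text'] for filing in data for chunk in filing['chunks']]
clusters = np.load('output/clusters_500tok.npy')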

# Level 1: Summarize individual chunks
def summarize_chunk(chunk_text):
    prompt = f"""Summarize the following SEC filing excerpt in 2-3 sentences.
    Focus on key financial metrics, risks, and business updates.
    
    Text:
    {chunk_text}
    
    Summary:"""
    
    response = ollama_client.chat(
        model="gpt-oss",
        messages=[{"role": "user", "content": prompt}]
    )
    return response['message']['content']

# Process all chunks
level1_summaries = []
for i, chunk in enumerate(all_chunks):
    if i % 100 == 0:
        print(f"Processing chunk {i}/{len(all_chunks)}...")
    summary = summarize_chunk(chunk)
    level1_summaries.append(summary)

# Save Level 1 summaries
with open('output/level1_summaries_500tok.json', 'w') as f:
    json.dump(level1_summaries, f, indent=2)


# Level 2: Summarize clusters
def summarize_cluster(cluster_summaries):
    combined = "\n\n".join(cluster_summaries[:10])  # Limit to 10 chunks per cluster
    prompt = f"""Create a cohesive summary integrating these related SEC filing sections.
    
    Sections:
    {combined}
    
    Integrated summary:"""
    
    response = ollama_client.chat(
        model="gpt-oss",
        messages=[{"role": "user", "content": prompt}]
    )
    return response['message']['content']

# Group summaries by cluster
# (cast IDs to plain int: NumPy integers are not valid JSON keys when saving)
cluster_groups = {}
for idx, cluster_id in enumerate(clusters):
    cluster_id = int(cluster_id)
    if cluster_id not in cluster_groups:
        cluster_groups[cluster_id] = []
    cluster_groups[cluster_id].append(level1_summaries[idx])

# Summarize each cluster
level2_summaries = {}
for cluster_id, summaries in cluster_groups.items():
    print(f"Summarizing cluster {cluster_id} ({len(summaries)} chunks)...")
    level2_summaries[cluster_id] = summarize_cluster(summaries)

# Save Level 2 summaries
with open('output/level2_summaries_500tok.json', 'w') as f:
    json.dump(level2_summaries, f, indent=2)


# Level 3: Document-level summary
def summarize_document(cluster_ids):
    # Combine the Level 2 summaries of the clusters this filing's chunks belong to
    relevant_clusters = [level2_summaries[cid] for cid in cluster_ids]
    combined = "\n\n".join(relevant_clusters)
    
    prompt = f"""Create a high-level executive summary of this SEC filing.
    
    Key themes:
    {combined}
    
    Executive summary:"""
    
    response = ollama_client.chat(
        model="gpt-oss",
        messages=[{"role": "user", "content": prompt}]
    )
    return response['message']['content']

# Generate Level 3 summaries for each filing
level3_summaries = []
chunk_idx = 0
for filing in data:
    # Chunks are stored filing by filing, so this filing owns the next
    # len(filing['chunks']) entries of `clusters`
    n_chunks = len(filing['chunks'])
    filing_cluster_ids = sorted({int(c) for c in clusters[chunk_idx:chunk_idx + n_chunks]})
    chunk_idx += n_chunks

    filing_summary = summarize_document(filing_cluster_ids)
    level3_summaries.append({
        'file_name': filing['file_name'],
        'company': filing['metadata']['company'],
        'summary': filing_summary
    })

# Save Level 3 summaries
with open('output/level3_summaries_500tok.json', 'w') as f:
    json.dump(level3_summaries, f, indent=2)
```

**Output**: 3-level summary hierarchy stored in JSON files

**Success Criteria**:
- Level 1 summaries capture chunk essence without hallucination
- Level 2 summaries coherently integrate cluster themes
- Level 3 summary provides accurate document overview
- Summarization time < 5 seconds per chunk with gpt-oss
- Manual review: 90%+ summaries are factually accurate
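
The per-chunk time ceiling above translates directly into a wall-clock budget, which is worth computing before launching the run. The helper names in this small sketch are hypothetical.

```python
def summarization_hours(n_chunks, seconds_per_chunk):
    """Wall-clock hours for sequential Level 1 summarization."""
    return n_chunks * seconds_per_chunk / 3600

def chunks_within_budget(budget_hours, seconds_per_chunk):
    """How many chunks fit into a given time budget."""
    return int(budget_hours * 3600 // seconds_per_chunk)

print(f"{summarization_hours(153_207, 5):.1f} h")  # 212.8 h
print(chunks_within_budget(6, 5))                  # 4320
```

Replacing the 5-second figure with the latency actually measured for gpt-oss shows how large a subset is feasible in a given session.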

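**Estimated time**: 4-6 hours only if gpt-oss averages roughly 0.1 seconds per chunk. At the 5-second ceiling above, Level 1 alone would take about 213 hours (roughly 9 days) for all 153,207 chunks, and 4-6 hours would cover only about 2,900-4,300 chunks.

**Note**: This is the most time-consuming step. Consider processing a subset first (e.g., 100 filings) to validate the approach.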
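
Because this step can run for many hours, it may help to make the Level 1 loop resumable. The sketch below appends each summary to a JSONL checkpoint and skips finished chunks on restart. `summarize_with_checkpoint` and the checkpoint format are assumptions, and the sketch does not handle a half-written final line after a crash.

```python
import json
from pathlib import Path

def summarize_with_checkpoint(chunks, summarize_fn, checkpoint_path):
    """Summarize chunks in order, appending each result to a JSONL file.

    On restart, indices already present in the file are skipped.
    """
    path = Path(checkpoint_path)
    done = {}
    if path.exists():
        with path.open("r", encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    record = json.loads(line)
                    done[record["index"]] = record["summary"]
    with path.open("a", encoding="utf-8") as f:
        for i, chunk in enumerate(chunks):
            if i in done:
                continue  # already summarized in an earlier run
            summary = summarize_fn(chunk)
            f.write(json.dumps({"index": i, "summary": summary}) + "\n")
            f.flush()  # hand each record to the OS promptly
            done[i] = summary
    return [done[i] for i in range(len(chunks))]
```

In the notebook this could be called as `summarize_with_checkpoint(all_chunks, summarize_chunk, 'output/level1_checkpoint.jsonl')`, where the checkpoint path is only an example.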

---

### Step 6: Vector Database Setup (ChromaDB) ⏸️
**Notebook**: `06_chromadb_setup.ipynb` (to be created)

**Tasks**:
1. Install and initialize ChromaDB
2. Create collection for SEC filings
3. Store chunks + embeddings + metadata
4. Store all 3 levels of summaries as additional documents
5. Test similarity search
6. Benchmark query performance

**Implementation**:
```python
import chromadb
import json
import numpy as np

# Initialize ChromaDB
client = chromadb.PersistentClient(path="../../data/embeddings/chromadb")

# Create the collection (get_or_create avoids an error if it already exists)
collection = client.get_or_create_collection(
    name="sec_filings_raptor_500tok",
    metadata={"description": "SEC 10-K/10-Q filings with RAPTOR summaries (500 token chunks)"}
)

# Load data
with open('output/processed_samples_500tok.json', 'r') as f:
    data = json.load(f)

embeddings = np.load('output/embeddings_500tok.npy')
clusters = np.load('output/clusters_500tok.npy')

with open('output/level1_summaries_500tok.json', 'r') as f:
    level1_summaries = json.load(f)

# Prepare documents and metadata
documents = []
metadatas = []
ids = []

chunk_idx = 0
for filing in data:
    for chunk in filing['chunks']:
        documents.append(chunk['text'])
        metadatas.append({
            'company': filing['metadata']['company'],
            'cik': filing['metadata']['cik'],
            'form_type': filing['metadata']['form_type'],
            'filing_date': filing['metadata']['filing_date'],
            'file_name': filing['file_name'],
            'chunk_index': chunk['chunk_id'],
            'cluster_id': int(clusters[chunk_idx]),
            'level1_summary': level1_summaries[chunk_idx]
        })
        ids.append(f"chunk_{chunk_idx}")
        chunk_idx += 1

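# Add to ChromaDB (in batches for efficiency)
batch_size = 1000
n_batches = (len(documents) + batch_size - 1) // batch_size  # ceiling division
for i in range(0, len(documents), batch_size):
    print(f"Adding batch {i//batch_size + 1}/{n_batches}...")
    batch_end = min(i + batch_size, len(documents))
    
    collection.add(
        embeddings=embeddings[i:batch_end].tolist(),
        documents=documents[i:batch_end],
        metadatas=metadatas[i:batch_end],
        ids=ids[i:batch_end]
    )

print(f"Added {len(documents)} chunks to ChromaDB")

# Test query: embed it with the same model used in Step 3 so the query and
# the stored vectors share one embedding space
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
results = collection.query(
    query_embeddings=model.encode(["What are the company's main risks?"]).tolist(),
    n_results=5
)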

print("Top 5 results:")
for i, doc in enumerate(results['documents'][0]):
    meta = results['metadatas'][0][i]
    print(f"{i+1}. {meta['company']} ({meta['form_type']}) - {meta['filing_date']}")
    print(f"   {doc[:200]}...")
```

**Output**: ChromaDB database in `data/embeddings/chromadb/`

**Success Criteria**:
- All 153K chunks stored successfully in ChromaDB
- Semantic search returns relevant results (manual verification)
- Query time < 1 second for top-5 retrieval
- Metadata correctly attached to all chunks

**Estimated time**: 1-2 hours
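
Task 4 (storing the higher-level summaries as additional documents) is not covered by the implementation above, which only attaches Level 1 summaries as metadata. One possible approach is sketched here under assumptions: `build_summary_records`, the ID prefixes, and the `level` metadata field are all hypothetical, while the input shapes follow the Level 2 and Level 3 JSON files written in Step 5.

```python
def build_summary_records(level2_summaries, level3_summaries):
    """Return (ids, documents, metadatas) for Level 2 and Level 3 summaries."""
    ids, documents, metadatas = [], [], []
    # Level 2: dict mapping cluster ID (a string after the JSON round trip) to text
    for cluster_id, text in level2_summaries.items():
        ids.append(f"cluster_{cluster_id}")
        documents.append(text)
        metadatas.append({"level": 2, "cluster_id": int(cluster_id)})
    # Level 3: list of {'file_name', 'company', 'summary'} entries
    for entry in level3_summaries:
        ids.append(f"filing_{entry['file_name']}")
        documents.append(entry["summary"])
        metadatas.append({
            "level": 3,
            "company": entry["company"],
            "file_name": entry["file_name"],
        })
    return ids, documents, metadatas
```

The returned lists could then be embedded with the Step 3 model and passed to `collection.add(ids=..., documents=..., metadatas=..., embeddings=...)`, so that summaries and raw chunks are searchable in the same collection and can be told apart by the `level` field.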

---

### Step 7: RAG Query Interface ⏸️
**Notebook**: `07_rag_query_interface.ipynb` (to be created)

**Tasks**:
1. Build complete query pipeline: retrieve → augment → generate
2. Implement cluster-aware retrieval (RAPTOR's key differentiator)
3. Test with diverse sample queries
4. Evaluate answer quality (manual review)
5. Measure end-to-end latency
6. Compare RAPTOR vs. simple RAG (baseline)

**Implementation**:
```python
import chromadb
import ollama
import json
from sentence_transformers import SentenceTransformer

# Load ChromaDB
client = chromadb.PersistentClient(path="../../data/embeddings/chromadb")
collection = client.get_collection(name="sec_filings_raptor_500tok")

# Same embedding model as Step 3, so queries and stored chunks share one vector space
embedder = SentenceTransformer('all-MiniLM-L6-v2')

# Load cluster summaries
with open('output/level2_summaries_500tok.json', 'r') as f:
    cluster_summaries = json.load(f)

def query_raptor(query, top_k=5, use_cluster_context=True):
    """
    RAPTOR RAG query with cluster-aware retrieval
    """
    # Step 1: Retrieve relevant chunks from ChromaDB
    results = collection.query(
        query_embeddings=embedder.encode([query]).tolist(),
        n_results=top_k
    )
    
    retrieved_chunks = results['documents'][0]
    metadata = results['metadatas'][0]
    
    # Step 2: Get cluster summaries for retrieved chunks (RAPTOR advantage)
    cluster_context = []
    if use_cluster_context:
        # dict.fromkeys de-duplicates while keeping retrieval rank order,
        # so the "top 2" slice below really takes the best-ranked clusters
        cluster_ids = list(dict.fromkeys(m['cluster_id'] for m in metadata))
        cluster_context = [cluster_summaries[str(cid)] for cid in cluster_ids]
    
    # Step 3: Build context (chunks + cluster summaries)
    context_parts = []
    
    # Add retrieved chunks
    context_parts.append("[Retrieved Excerpts]")
    for i, chunk in enumerate(retrieved_chunks[:3]):  # Use top 3 chunks
        meta = metadata[i]
        context_parts.append(f"\nSource: {meta['company']} ({meta['form_type']}) {meta['filing_date']}")
        context_parts.append(chunk[:1000])  # Limit chunk size
    
    # Add cluster summaries (RAPTOR's hierarchical context)
    if use_cluster_context:
        context_parts.append("\n\n[Thematic Context from Clusters]")
        for summary in cluster_context[:2]:  # Use top 2 cluster summaries
            context_parts.append(summary)
    
    context = "\n".join(context_parts)
    
    # Step 4: Generate answer with gpt-oss
    prompt = f"""You are analyzing SEC filings. Use the context below to answer the question.
    Provide citations to specific companies and filing dates.
    
    Context:
    {context}
    
    Question: {query}
    
    Answer:"""
    
    response = ollama.chat(
        model="gpt-oss",
        messages=[{"role": "user", "content": prompt}]
    )
    
    return {
        "answer": response['message']['content'],
        "sources": [
            {
                "company": m['company'],
                "form_type": m['form_type'],
                "filing_date": m['filing_date'],
                "file_name": m['file_name']
            }
            for m in metadata
        ],
        "retrieved_chunks": retrieved_chunks,
        "cluster_context": cluster_context if use_cluster_context else []
    }

# Test queries
test_queries = [
    "What are the main risk factors disclosed in these filings?",
    "Summarize revenue trends across companies.",
    "What cybersecurity risks are mentioned?",
    "How do companies describe their competitive positions?",
    "What legal proceedings are disclosed?",
    "What are the main environmental and regulatory risks?",
    "How have companies disclosed their intellectual property?",
    "What supply chain risks are mentioned?",
    "Describe compensation structures for executives.",
    "What customer concentration risks exist?"
]

print("Testing RAPTOR RAG Query Interface\n")
print("="*80)

for i, query in enumerate(test_queries, 1):
    print(f"\n[Query {i}] {query}")
    print("-"*80)
    
    result = query_raptor(query, top_k=5)
    
    print(f"Answer: {result['answer'][:300]}...")
    print(f"\nSources ({len(result['sources'])} filings):")
    for source in result['sources'][:3]:
        print(f"  - {source['company']} ({source['form_type']}) {source['filing_date']}")
    print("="*80)
```

**Test Queries** (Diverse types):
1. "What are the main risk factors disclosed in these filings?"
2. "Summarize revenue trends across companies."
3. "What cybersecurity risks are mentioned?"
4. "How do companies describe their competitive positions?"
5. "What legal proceedings are disclosed?"
6. "What are the main environmental and regulatory risks?"
7. "How have companies disclosed their intellectual property?"
8. "What supply chain risks are mentioned?"
9. "Describe compensation structures for executives."
10. "What customer concentration risks exist?"

**Output**: Query results with answers, sources, and context

**Success Criteria**:
- Answers are factually accurate (90%+ accuracy on manual review)
- Citations correctly reference source filings (company, form, date)
- End-to-end query time < 10 seconds
- Cluster-aware retrieval provides better context than simple RAG (A/B test)
- No hallucinations (facts not in source documents)

**Estimated time**: 2-3 hours
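
The A/B comparison and the latency criterion can share one small harness. The sketch below assumes a query function with the same signature as `query_raptor` (a `use_cluster_context` flag and a result dict containing `"answer"`); `compare_modes` and `mean_latency` are hypothetical names.

```python
import time
from statistics import mean

def compare_modes(query_fn, queries):
    """Run each query with and without cluster context and record latency."""
    rows = []
    for q in queries:
        row = {"query": q}
        for label, flag in (("raptor", True), ("baseline", False)):
            start = time.perf_counter()
            result = query_fn(q, use_cluster_context=flag)
            row[f"{label}_seconds"] = time.perf_counter() - start
            row[f"{label}_answer"] = result["answer"]
        rows.append(row)
    return rows

def mean_latency(rows, label):
    """Average latency in seconds for one mode ('raptor' or 'baseline')."""
    return mean(r[f"{label}_seconds"] for r in rows)
```

Running `compare_modes(query_raptor, test_queries)` yields paired answers for manual scoring, and `mean_latency(rows, "raptor")` can be compared against the 10-second target.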

---

## Performance Metrics to Track

### Measure and Record:

1. **Data Processing** ✅ (Complete)
   - Text extraction: 1,375 filings processed
   - Chunking: 12 sizes tested (200-8000 tokens)
   - Total chunks: 153,207 (at 500 tokens)

2. **Embedding Generation** ⏳
   - Embeddings per second
   - Total embedding time
   - Memory usage peak

3. **RAPTOR Clustering** ⏸️
   - UMAP + GMM clustering time
   - Number of clusters formed
   - Cluster size distribution
   - Coherence score (manual)

4. **Summarization** ⏸️
   - Time per chunk summary (Level 1)
   - Time per cluster summary (Level 2)
   - Time per document summary (Level 3)
   - Total summarization time
   - gpt-oss tokens used

5. **Query Performance** ⏸️
   - Retrieval time (ChromaDB)
   - LLM generation time (gpt-oss)
   - End-to-end query latency
   - Answer quality (manual scoring 1-5)
   - RAPTOR vs simple RAG comparison

### Expected Performance (Sample Data):
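- Total processing time: 8-12 hours for 1,375 filings, assuming Level 1 summarization averages well under 1 second per chunk or runs on a subset (at 5 seconds per chunk, 153,207 chunks take about 213 hours)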
- Query response time: < 10 seconds
- Memory usage: < 16 GB RAM during processing

---

## Validation Checklist

### Before Moving to Full Dataset:

**Data Quality** ✅
- [x] Text extraction works across different filing formats (HTML/XML/SGML)
- [x] Metadata correctly parsed for all sample files
- [x] Chunks maintain semantic coherence (12 sizes tested)

**RAPTOR System** ⏸️
- [ ] Clustering produces interpretable topic groups
- [ ] Level 1-3 summaries are accurate and useful
- [ ] Hierarchical structure provides value over flat retrieval

**RAG Pipeline** ⏸️
- [ ] ChromaDB stores and retrieves embeddings correctly
- [ ] Similarity search returns relevant chunks
- [ ] gpt-oss generates accurate, citation-backed answers

**Performance** ⏸️
- [ ] Query latency acceptable (< 10 seconds)
- [ ] Memory usage within limits (< 16 GB for sample data)
- [ ] No crashes or errors during processing

**Quality** ⏸️
- [ ] Manually verify 10+ query responses for accuracy
- [ ] Test edge cases (very long filings, unusual formats)
- [ ] Compare RAPTOR vs. simple RAG on same queries (ablation study)

---

## Issues to Watch For

### Common Problems:

1. **Chunking Issues** ✅ (Resolved in Step 2)
   - Tested 12 chunk sizes to find optimal
   - 10% overlap preserves context
   - Contextual headers add document metadata

2. **Clustering Problems** ⏸️
   - Too many/too few clusters → Use BIC to optimize
   - Incoherent cluster themes → Manual review and iterate
   - Single dominant cluster → Adjust UMAP parameters

3. **Summarization Quality** ⏸️
   - Hallucinations (gpt-oss inventing facts) → Add "only use provided text" to prompts
   - Missing key information → Test different prompt templates
   - Summaries too generic → Adjust prompt specificity

4. **Retrieval Accuracy** ⏸️
   - Irrelevant chunks retrieved → Test different embedding models
   - Missing obvious relevant content → Check embedding quality
   - Poor semantic matching → Consider query expansion

5. **Performance Bottlenecks** ⏸️
   - Slow embedding generation → Use GPU if available
   - Slow LLM summarization → Consider batching or faster model
   - High memory usage → Process in smaller batches
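
For the hallucination mitigation in item 3, a grounded prompt can be paired with a cheap numeric check. Both pieces below are illustrative assumptions: the prompt wording has not been tested against gpt-oss, and `unsupported_numbers` is a rough heuristic that only flags figures in a summary that never appear in the source text.

```python
import re

# Matches integers, thousands-separated numbers, and decimals (e.g. 12, 1,250.5)
_NUMBER = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")

def build_grounded_prompt(chunk_text):
    """Build a summarization prompt that restricts the model to the given text."""
    return (
        "Summarize the following SEC filing excerpt in 2-3 sentences.\n"
        "Use only information stated in the text. Do not add figures, names, "
        "or dates that do not appear in it.\n\n"
        f"Text:\n{chunk_text}\n\nSummary:"
    )

def unsupported_numbers(summary, source):
    """Return numbers that occur in the summary but nowhere in the source."""
    source_numbers = set(_NUMBER.findall(source))
    return [n for n in _NUMBER.findall(summary) if n not in source_numbers]
```

Summaries for which `unsupported_numbers` returns a non-empty list are good candidates to prioritize in the manual review.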

---

## Success Definition

**Prototype is successful if:**

1. **End-to-end pipeline runs without errors** on 1,375 sample filings
2. **RAPTOR clustering produces coherent themes** (manual validation of 20+ clusters)
3. **Summaries accurately capture content** at all 3 levels (90%+ accuracy)
4. **Query responses are factually correct** and well-cited (90%+ accuracy, matching the Step 7 criterion)
5. **Performance is acceptable** (< 10 sec query time, < 16 GB RAM)
6. **System provides measurable value** over simple keyword search or basic RAG

**If successful → proceed to Phase 4 (full dataset + EC2 deployment)**

**If issues found → iterate on prototype until resolved**

---

## Next Steps After Prototype

### If Prototype Succeeds:

1. **Document lessons learned** and optimal parameters
   - Optimal chunk size: 500-1000 tokens (to be confirmed)
   - RAPTOR clustering params (UMAP dim, GMM threshold)
   - gpt-oss prompt templates that work best
   
2. **Set up Docker Compose** for local development
   - Ollama container with gpt-oss
   - RAPTOR API container (FastAPI)
   - Open WebUI container
   
3. **Containerize the pipeline** (Ollama + RAPTOR API + WebUI)
   - Create Dockerfile for RAPTOR service
   - Create docker-compose.dev.yml
   
4. **Test Docker setup** with sample data
   - Ensure reproducibility
   - Validate performance in containers
   
5. **Prepare for EC2 deployment** (provision instance, configure)
   - Provision r6i.4xlarge (128 GB RAM)
   - Attach 500 GB EBS volume
   
6. **Scale to full 51 GB dataset** on EC2
   - Process all filings (not just samples)
   - Generate embeddings for full dataset
   
7. **Deploy llama3-sec** for production quality
   - Pull 49 GB llama3-sec model
   - Re-run summarization with llama3-sec
   - Compare quality vs gpt-oss

### Timeline:
- **Step 3 (Embeddings)**: 1-2 hours
- **Step 4 (Clustering)**: 2-3 hours
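- **Step 5 (Summarization)**: 4-6 hours on a validation subset (most time-consuming; a full pass over 153K chunks takes far longer unless gpt-oss averages about 0.1 seconds per chunk)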
- **Step 6 (ChromaDB)**: 1-2 hours
- **Step 7 (Query Interface)**: 2-3 hours
- **Validation & Iteration**: 1-2 days
- **Total for Steps 3-7**: 2-3 days

**Then:**
- **Docker Setup**: 2-3 days
- **EC2 Deployment**: 1 week
- **Full Dataset Processing**: 1-2 weeks

**Total estimated time: 3-4 weeks from now to production**

---

## Resources

### Key Files:
- ✅ `02_text_processing.ipynb` - Data extraction and chunking (COMPLETE)
- ✅ `output/processed_samples_{size}tok.json` - 12 chunk size variants
- ✅ `output/chunk_size_comparison.csv` - Chunk size statistics
- ⏸️ `src/models/raptor.py` - RAPTOR implementation (to be created)
- ⏸️ `data/embeddings/chromadb/` - Vector database (to be created)

### Available Data:
- 1,375 processed SEC filings
- 12 chunk size variants (200-8000 tokens)
- Optimal range identified: 500-1000 tokens per FinGPT research

### References:
- FinGPT RAPTOR: https://github.com/AI4Finance-Foundation/FinGPT/blob/master/fingpt/FinGPT_FinancialReportAnalysis/utils/rag.py
- ChromaDB Docs: https://docs.trychroma.com/
- Sentence Transformers: https://www.sbert.net/
- Ollama Python: https://github.com/ollama/ollama-python

---

**Status**: ✅ Steps 1-2 Complete | ⏳ Ready for Step 3 (Embedding Generation)

**Last Updated**: 2025-10-14