# SEC 10-K Risk Factor Analysis - Project Plan

## Project Overview
AI-powered system for analyzing SEC 10-K and 10-Q filings using RAPTOR RAG (Recursive Abstractive Processing for Tree-Organized Retrieval). The system builds an enhanced knowledge base from financial filings that users can query interactively to identify year-over-year (YoY) changes, risk patterns, and potential fraud indicators.

---

## Core Architecture

### Infrastructure
- **Deployment**: AWS EC2 instance with GPU (in progress)
- **Model Hosting**: Ollama for local LLM deployment
- **User Interface**: Open WebUI for interactive queries
- **Data Storage**: Cloud-based storage for processed embeddings and knowledge base

### Architecture Diagram

```mermaid
graph TB
    subgraph "Data Sources"
        A[SEC EDGAR API]
        B[Downloaded ZIP Files<br/>10-K/10-Q Filings]
    end

    subgraph "AWS EC2 Instance with GPU"
        C[Filing Extractor]
        D[Text Processor<br/>Chunking & Cleaning]
        E[RAPTOR System]
        F[Embedding Generator<br/>Sentence Transformers]
        G[Knowledge Base<br/>ChromaDB]
        H[Ollama LLM<br/>FinGPT Model]
        I[Open WebUI]
    end

    subgraph "Storage"
        J[(Processed Data<br/>JSON/Parquet)]
        K[(Vector Embeddings)]
    end

    subgraph "Users"
        L[End Users]
    end

    A -->|Download Filings| B
    B -->|Extract| C
    C -->|Parse HTML/XML| D
    D -->|Chunks + Metadata| F
    F -->|Generate Embeddings| E
    E -->|Hierarchical Clustering<br/>& Summarization| G
    G -->|Store| K
    D -->|Store| J

    L -->|Submit Query| I
    I -->|Retrieve Context| G
    G -->|Relevant Chunks<br/>+ Summaries| H
    H -->|Generate Response| I
    I -->|Display Results| L

    style E fill:#ff9800,stroke:#f57c00,stroke-width:3px
    style H fill:#4caf50,stroke:#388e3c,stroke-width:2px
    style I fill:#2196f3,stroke:#1976d2,stroke-width:2px
```

### RAPTOR RAG System
Unlike traditional RAG systems that use simple similarity search, RAPTOR implements:
- **Hierarchical Clustering**: Multi-level organization (global + local) using UMAP and Gaussian Mixture Models
- **Recursive Summarization**: 3-level hierarchical summaries capturing both granular details and high-level themes
- **Enhanced Context Retrieval**: Cluster-aware retrieval providing richer context for LLM queries
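
One possible reading of cluster-aware retrieval is sketched here: rank every node (original chunks and summaries alike) by cosine similarity to the query, then pull in the parent summary of each hit so the LLM sees both the detail and its surrounding theme. The node layout (`text`, `vector`, `parent`) and the function names are assumptions made for illustration, not the FinGPT implementation; a real version would query the vector store rather than a Python list.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, nodes, top_k=3):
    """Return the top_k most similar nodes plus the parent summary of each hit.

    nodes: list of dicts with 'text', 'vector', and 'parent' (index into nodes, or None).
    """
    ranked = sorted(
        range(len(nodes)),
        key=lambda i: cosine(query_vec, nodes[i]["vector"]),
        reverse=True,
    )
    selected = []
    for i in ranked[:top_k]:
        if i not in selected:
            selected.append(i)
        parent = nodes[i]["parent"]
        # Attach the cluster summary that covers this hit, once.
        if parent is not None and parent not in selected:
            selected.append(parent)
    return [nodes[i] for i in selected]
```

Keeping chunks and summaries in one ranked pool lets a broad question match a high-level summary directly, while a narrow question still surfaces the specific chunk.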

---

## Technical Stack

### NLP & ML
- **Base Model**: FinGPT (Hugging Face compatible version) or alternative financial LLM
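- **Embeddings**: Sentence Transformers (`all-MiniLM-L6-v2`) for local, cost-free embedding generation. This model truncates input beyond 256 word pieces by default, so the ~2000-token chunk size below needs to be reduced, or a longer-context embedding model chosen, to avoid embedding only the start of each chunk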
- **Clustering**: UMAP (dimensionality reduction) + scikit-learn GMM
- **LLM Interface**: Ollama (local) or OpenAI API (for testing/comparison)

### Data Processing
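- **Chunking**: LangChain `RecursiveCharacterTextSplitter` (~2000 tokens/chunk). The splitter measures length in characters by default, so token-based sizing requires a tokenizer-backed length function (e.g. `RecursiveCharacterTextSplitter.from_tiktoken_encoder`)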
- **Vector Storage**: ChromaDB or similar for efficient retrieval
- **Data Format**: JSON/Parquet for structured storage
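
The chunking idea can be illustrated without any dependencies. The sketch below is a simplified stand-in for LangChain's recursive splitter, not a copy of it: it tries the coarsest separator first and only falls back to finer ones for pieces that are still too large. `count_tokens` uses a whitespace word count as a crude proxy for a real tokenizer such as tiktoken, and all names here are hypothetical. Chunk overlap is left out for brevity.

```python
def count_tokens(text):
    """Whitespace word count: a rough stand-in for a real tokenizer."""
    return len(text.split())


def chunk_text(text, max_tokens=2000, separators=("\n\n", "\n", ". ", " ")):
    """Split text into chunks of at most max_tokens, preferring coarse boundaries."""
    if count_tokens(text) <= max_tokens:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: hard-cut on word boundaries.
        words = text.split()
        return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if count_tokens(candidate) <= max_tokens:
            current = candidate          # piece still fits in the open chunk
            continue
        if current.strip():
            chunks.append(current)       # close the open chunk
        if count_tokens(piece) <= max_tokens:
            current = piece
        else:
            # Piece alone is too large: recurse with finer separators.
            chunks.extend(chunk_text(piece, max_tokens, finer))
            current = ""
    if current.strip():
        chunks.append(current)
    return chunks
```

Splitting on paragraph breaks first tends to keep an individual risk factor in one chunk, which matters later when clusters are summarized.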

### Libraries
- `langchain`, `langchain_community` - LLM orchestration
- `sentence-transformers` - Local embeddings
- `umap-learn` - Dimensionality reduction
- `scikit-learn` - Clustering algorithms
- `pandas`, `numpy` - Data manipulation
- `requests` - SEC EDGAR API access
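
For the EDGAR access item, a minimal sketch of the request setup is shown here. It assumes the `data.sec.gov` submissions endpoint, which takes a CIK zero-padded to 10 digits, and a descriptive `User-Agent` header, which SEC's fair-access guidance asks automated clients to send. The helper names and the contact string are placeholders, and the sketch only builds the URL and headers; the actual call would go through `requests.get(url, headers=...)`.

```python
def submissions_url(cik):
    """Return the EDGAR submissions endpoint for a company, zero-padding the CIK to 10 digits."""
    digits = str(cik).strip().lstrip("0") or "0"
    if not digits.isdigit() or len(digits) > 10:
        raise ValueError(f"not a valid CIK: {cik!r}")
    return f"https://data.sec.gov/submissions/CIK{digits.zfill(10)}.json"


def edgar_headers(contact):
    """Headers identifying the client; `contact` should name the project and an email address."""
    return {"User-Agent": contact, "Accept-Encoding": "gzip, deflate"}


# Example (not executed here, since it needs network access):
#   import requests
#   resp = requests.get(submissions_url(320193),
#                       headers=edgar_headers("Example Project admin@example.com"),
#                       timeout=30)
```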

---

## Data Scope
- **Current Holdings**: ZIP folders containing 10-K and 10-Q filings (already downloaded)
- **Target Sections**: 
  - Item 1A (Risk Factors) - primary focus
  - Other sections as needed for comprehensive analysis
- **Analysis Focus**: Year-over-year changes, new/removed risks, boilerplate vs. substantive disclosure
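
Since the filings already sit in ZIP folders, the first processing step is simply walking each archive. The sketch below assumes filings are stored as `.htm`, `.html`, `.xml`, or `.txt` members; the suffix list and the function name are assumptions and should be adjusted to whatever the downloaded archives actually contain.

```python
import zipfile

# Assumed file types for filing documents inside the archives.
FILING_SUFFIXES = (".htm", ".html", ".xml", ".txt")


def iter_filings(zip_path):
    """Yield (member_name, decoded_text) for each filing-like file in a ZIP archive."""
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir() or not info.filename.lower().endswith(FILING_SUFFIXES):
                continue
            raw = archive.read(info)
            # Older filings are not always valid UTF-8; replace bad bytes instead of failing.
            yield info.filename, raw.decode("utf-8", errors="replace")
```

Reading members directly from the archive avoids unpacking everything to disk before the 3-5 filing validation run.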

---

## RAPTOR Pipeline Flowchart

```mermaid
flowchart TD
    Start([Start: Raw 10-K/10-Q Filing]) --> Extract[Extract Text from HTML/XML]
    Extract --> Chunk[Chunk Document<br/>RecursiveCharacterTextSplitter<br/>2000 tokens/chunk]
    Chunk --> Embed1[Generate Embeddings<br/>Sentence Transformers]
    
    Embed1 --> GlobalCluster[Global Clustering<br/>UMAP + GMM]
    GlobalCluster --> LocalCluster[Local Clustering<br/>Refine within each cluster]
    
    LocalCluster --> Summarize1[Level 1 Summarization<br/>Cluster summaries]
    Summarize1 --> Embed2[Embed Level 1 Summaries]
    
    Embed2 --> Cluster2[Level 2 Clustering<br/>UMAP + GMM]
    Cluster2 --> Summarize2[Level 2 Summarization<br/>Summary of summaries]
    
    Summarize2 --> Embed3[Embed Level 2 Summaries]
    Embed3 --> Cluster3[Level 3 Clustering<br/>UMAP + GMM]
    Cluster3 --> Summarize3[Level 3 Summarization<br/>Highest abstraction]
    
    Summarize3 --> Combine[Combine All Levels<br/>Original chunks + L1/L2/L3 summaries]
    Combine --> Store[Store in Knowledge Base<br/>ChromaDB]
    Store --> End([Knowledge Base Ready])
    
    style GlobalCluster fill:#ffeb3b,stroke:#fbc02d,stroke-width:2px
    style LocalCluster fill:#ffeb3b,stroke:#fbc02d,stroke-width:2px
    style Summarize1 fill:#9c27b0,stroke:#7b1fa2,stroke-width:2px
    style Summarize2 fill:#9c27b0,stroke:#7b1fa2,stroke-width:2px
    style Summarize3 fill:#9c27b0,stroke:#7b1fa2,stroke-width:2px
    style Store fill:#4caf50,stroke:#388e3c,stroke-width:3px
```
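
The loop in the flowchart can be sketched as a small driver that takes the embedding, clustering, and summarization steps as injected callables. This is a structural sketch under stated assumptions, not the planned `Raptor` class: `Node`, `build_tree`, and the callable signatures are hypothetical, and the real pipeline would plug in Sentence Transformers, UMAP + GMM, and an LLM summarizer.

```python
from dataclasses import dataclass


@dataclass
class Node:
    text: str
    level: int       # 0 = original chunk, 1..n = summary levels
    children: list   # indices of the nodes this summary covers


def build_tree(chunks, embed, cluster, summarize, n_levels=3):
    """Return a flat list of nodes: original chunks first, then each summary level.

    embed(texts)     -> list of vectors, one per text
    cluster(vectors) -> list of groups, each a list of positions into `vectors`
    summarize(texts) -> one summary string for the group
    """
    nodes = [Node(text=c, level=0, children=[]) for c in chunks]
    current = list(range(len(nodes)))   # indices of the layer being clustered
    for level in range(1, n_levels + 1):
        if len(current) <= 1:
            break                       # a single node cannot be grouped further
        vectors = embed([nodes[i].text for i in current])
        next_layer = []
        for group in cluster(vectors):
            members = [current[pos] for pos in group]
            summary = summarize([nodes[i].text for i in members])
            nodes.append(Node(text=summary, level=level, children=members))
            next_layer.append(len(nodes) - 1)
        current = next_layer
    return nodes
```

Keeping every level in one flat list matches the "Combine All Levels" step: the whole list can be embedded and stored in the knowledge base as-is.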

---

## Data Processing Workflow

```mermaid
sequenceDiagram
    participant User
    participant FileSystem as File System<br/>(ZIP Archives)
    participant Extractor as Filing Extractor
    participant Parser as Text Processor
    participant Embedder as Embedding Generator
    participant RAPTOR as RAPTOR Engine
    participant KB as Knowledge Base
    participant Storage as Cloud Storage

    User->>FileSystem: Access ZIP files
    FileSystem->>Extractor: Read 10-K/10-Q archives
    Extractor->>Extractor: Unzip and extract filings
    Extractor->>Parser: Send raw HTML/XML
    
    Parser->>Parser: Parse Item 1A (Risk Factors)
    Parser->>Parser: Clean and normalize text
    Parser->>Parser: Chunk into 2000 token segments
    Parser->>Storage: Save chunks as JSON/Parquet
    
    Parser->>Embedder: Send text chunks
    Embedder->>Embedder: Generate embeddings<br/>(Sentence Transformers)
    
    Embedder->>RAPTOR: Send chunks + embeddings
    
    RAPTOR->>RAPTOR: Global clustering (UMAP + GMM)
    RAPTOR->>RAPTOR: Local clustering refinement
    RAPTOR->>RAPTOR: Level 1 summarization
    RAPTOR->>RAPTOR: Recursive clustering (L2)
    RAPTOR->>RAPTOR: Level 2 summarization
    RAPTOR->>RAPTOR: Recursive clustering (L3)
    RAPTOR->>RAPTOR: Level 3 summarization
    
    RAPTOR->>KB: Store enhanced knowledge base<br/>(chunks + summaries)
    KB->>Storage: Persist vector embeddings
    
    Storage-->>User: Processing complete
    
    Note over RAPTOR,KB: Knowledge base now contains:<br/>- Original chunks<br/>- L1/L2/L3 summaries<br/>- Hierarchical structure
```

---

## Implementation Phases

### Phase 1: Model Research & Setup (Week 1)
**Objectives:**
- [ ] Research FinGPT models on Hugging Face (avoid the outdated `fingpt-rag`, which is about 2 years old)
- [ ] Select model compatible with Ollama deployment
- [ ] Set up project structure (`src/`, `data/`, `notebooks/`, `dashboard/`)
- [ ] Initialize Git repository with proper `.gitignore`
- [ ] Create base `Raptor` class skeleton

**Deliverables:**
- Model selection document
- Project repository structure
- `Raptor` class foundation

---

### Phase 2: Data Processing Pipeline (Week 2)
**Objectives:**
- [ ] Extract filings from ZIP archives
- [ ] Parse 10-K/10-Q HTML/XML to extract Item 1A and other sections
- [ ] Implement document chunking (2000-token chunks, with token counts measured by tiktoken)
- [ ] Generate embeddings using local Sentence Transformers
- [ ] Store structured data (chunks + metadata) in JSON/Parquet

**Key Files:**
- `src/data/filing_extractor.py` - Unzip and parse filings
- `src/data/text_processor.py` - Chunking and cleaning
- `src/models/embedding_generator.py` - Embedding creation

**Validation:**
- Test on 3-5 sample filings before scaling
- Verify Item 1A extraction accuracy
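
For the Item 1A validation step, a heuristic extractor is sketched below using only the standard library. It strips HTML to text, then takes the longest span between an "Item 1A" heading and the next "Item 1B", "Item 1C", or "Item 2" heading; choosing the longest span is an assumption meant to skip the short table-of-contents entry. The names are hypothetical, and real filings vary enough that results should be spot-checked on the sample filings.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, inserting a newline at block-level tags."""

    BLOCK_TAGS = {"p", "div", "br", "tr", "li", "h1", "h2", "h3", "h4", "table"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Strip tags and normalize spaces, keeping line breaks at block boundaries."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts).replace("\xa0", " ")
    return re.sub(r"[ \t]+", " ", text)


START = re.compile(r"^\s*item\s+1a\b", re.IGNORECASE | re.MULTILINE)
END = re.compile(r"^\s*item\s+(1b|1c|2)\b", re.IGNORECASE | re.MULTILINE)


def extract_item_1a(text):
    """Return the longest 'Item 1A' section found in text, or None if there is none."""
    best = ""
    for start in START.finditer(text):
        end = END.search(text, start.end())
        stop = end.start() if end else len(text)
        span = text[start.start():stop].strip()
        if len(span) > len(best):
            best = span
    return best or None
```

A quick accuracy check is to compare the extracted section's first and last lines against the filing by eye for each of the 3-5 sample filings.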

---

### Phase 3: RAPTOR System Implementation (Week 3)
**Objectives:**
- [ ] Implement hierarchical clustering:
  - Global clustering (UMAP → GMM with BIC for optimal cluster count)
  - Local clustering (secondary refinement within global clusters)
- [ ] Build recursive summarization engine (3 levels deep)
- [ ] Create enhanced knowledge base combining:
  - Original document chunks
  - Level 1 summaries (cluster summaries)
  - Level 2 summaries (summary of summaries)
  - Level 3 summaries (highest abstraction)
- [ ] Implement cluster-aware retrieval mechanism

**Key Methods in `Raptor` class:**
```python
class Raptor:
    # Method stubs only; bodies are implemented during Phase 3.
    def global_cluster_embeddings(self, embeddings, dim, n_neighbors, metric="cosine"): ...
    def local_cluster_embeddings(self, embeddings, dim, num_neighbors=10): ...
    def get_optimal_clusters(self, embeddings, max_clusters=50): ...
    def GMM_cluster(self, embeddings, threshold, random_state=0): ...
    def perform_clustering(self, embeddings, dim, threshold): ...
    def recursive_embed_cluster_summarize(self, texts, level=1, n_levels=3): ...
```

**Testing:**
- Validate clustering quality on sample documents
- Review generated summaries for coherence

---

### Phase 4: LLM Integration & Deployment (Week 4)
**Objectives:**
- [ ] Set up Ollama on EC2 instance with selected FinGPT model
- [ ] Deploy Open WebUI for user interaction
- [ ] Integrate RAPTOR knowledge base with LLM query system
- [ ] Implement query handling:
  - YoY change detection queries
  - Risk classification questions
  - Boilerplate vs. substantive disclosure analysis
- [ ] Create sample query templates for common use cases

**Integration Workflow:**
1. User submits query via Open WebUI
2. RAPTOR retrieves relevant chunks + hierarchical summaries
3. Context passed to Ollama LLM
4. LLM generates response with supporting evidence
5. Results displayed in WebUI
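
Steps 2-4 of this workflow can be sketched as a prompt builder plus a call to Ollama's local REST endpoint (`/api/generate` on port 11434 by default). The prompt wording, the function names, and the model name passed in are assumptions; Open WebUI would normally sit in front of this, so the sketch is mainly useful for testing the retrieval-to-LLM handoff in isolation.

```python
import json
import urllib.request


def build_prompt(question, contexts):
    """Number the retrieved excerpts so the model can cite them in its answer."""
    numbered = "\n\n".join(f"[{i}] {text}" for i, text in enumerate(contexts, start=1))
    return (
        "Answer the question using only the numbered excerpts below. "
        "Cite excerpt numbers for each claim.\n\n"
        f"{numbered}\n\nQuestion: {question}\nAnswer:"
    )


def build_request(question, contexts, model, host="http://localhost:11434"):
    """Build (but do not send) a non-streaming generate request for Ollama."""
    payload = {"model": model, "prompt": build_prompt(question, contexts), "stream": False}
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_ollama(question, contexts, model):
    """Send the request to a running Ollama server and return the generated text."""
    request = build_request(question, contexts, model)
    with urllib.request.urlopen(request, timeout=120) as resp:
        return json.loads(resp.read())["response"]
```

Numbering the excerpts gives the "supporting evidence" in step 4 something concrete to point at, which also makes manual validation of answers easier.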

**Deliverables:**
- Functional Open WebUI interface
- End-to-end query processing pipeline
- Documentation for common queries

---

## RAPTOR vs. Traditional RAG Comparison

| Feature | Traditional RAG | RAPTOR RAG |
|---------|----------------|------------|
| Text Processing | Simple chunking | Recursive, hierarchical |
| Clustering | None or basic | Multi-level (global + local) |
| Summarization | None or single-level | Recursive, 3-level |
| Context Selection | Similarity-based only | Cluster-aware + similarity |
| Document Understanding | Flat representation | Hierarchical representation |
| Knowledge Integration | Direct chunks only | Chunks + multi-level summaries |

**Why RAPTOR for Financial Filings?**
- Financial documents have hierarchical structure (sections, subsections, themes)
- YoY analysis requires understanding both granular changes and high-level shifts
- Boilerplate detection benefits from cluster analysis (repetitive language clusters together)
- Complex queries need multi-level context (e.g., "How did cyber risk disclosures evolve?")

---

## Success Metrics
- [ ] Successfully process 90%+ of downloaded filings into knowledge base
- [ ] Clustering produces coherent, interpretable groups
- [ ] Generated summaries accurately capture document content at each level
- [ ] LLM queries return relevant, accurate responses with supporting evidence
- [ ] System responds to queries in <10 seconds (including retrieval + generation)
- [ ] Manual validation: Test 10 YoY comparison queries, verify accuracy

---

## Key Advantages of AI-First Approach
1. **No Manual Feature Engineering**: LLM infers patterns from enhanced context (vs. building YoY diff algorithms)
2. **Flexible Queries**: Users can ask arbitrary questions beyond predefined analyses
3. **Semantic Understanding**: Detects substantive changes even when wording differs
4. **Scalable**: Adding new filings only requires re-running the RAPTOR pipeline
5. **Explainable**: LLM can cite specific sections supporting its conclusions

---

## Technical Challenges & Mitigations

### Challenge 1: Embedding Generation at Scale
- **Issue**: Embedding hundreds of large filings is compute-intensive and slow on CPU alone
- **Solution**: Use EC2 GPU instance, batch processing, cache embeddings
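
The batching and caching part of this mitigation can be sketched as a thin wrapper around whatever embedding function is used. `EmbeddingCache` and its methods are hypothetical names, the cache here is an in-memory dict keyed by a SHA-256 hash of the text, and persisting it to `data/embeddings/` is left out.

```python
import hashlib


class EmbeddingCache:
    """Embed texts in batches, skipping any text that was embedded before."""

    def __init__(self, embed_batch, batch_size=64):
        self.embed_batch = embed_batch   # callable: list[str] -> list of vectors
        self.batch_size = batch_size
        self.store = {}                  # content hash -> vector

    @staticmethod
    def key(text):
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get_many(self, texts):
        # Collect texts not yet cached, de-duplicated, in first-seen order.
        missing, seen = [], set()
        for text in texts:
            k = self.key(text)
            if k not in self.store and k not in seen:
                seen.add(k)
                missing.append(text)
        # Embed only the missing texts, one batch at a time.
        for start in range(0, len(missing), self.batch_size):
            batch = missing[start:start + self.batch_size]
            for text, vector in zip(batch, self.embed_batch(batch)):
                self.store[self.key(text)] = vector
        return [self.store[self.key(text)] for text in texts]
```

Hashing the chunk text rather than its position means boilerplate paragraphs repeated across years are embedded once.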

### Challenge 2: Model Selection
- **Issue**: `fingpt-rag` is outdated (about 2 years old) and not available on Hugging Face
- **Solution**: Research alternative FinGPT models on Hugging Face with recent updates

### Challenge 3: Clustering Quality
- **Issue**: Poorly defined clusters reduce summary quality
- **Solution**: Use BIC for optimal cluster count, validate clusters manually on samples
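
BIC-based selection can be sketched with scikit-learn's `GaussianMixture`: fit one mixture per candidate component count and keep the count with the lowest BIC. The function name mirrors the planned `get_optimal_clusters` method, but the body is an assumption about how it might work, not the final implementation. It expects the embeddings to be reduced (for example by UMAP) before fitting.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def get_optimal_clusters(embeddings, max_clusters=50, random_state=0):
    """Return the number of GMM components (1..max_clusters) with the lowest BIC."""
    X = np.asarray(embeddings, dtype=float)
    upper = min(max_clusters, len(X))    # cannot have more components than samples
    counts = list(range(1, upper + 1))
    bics = []
    for n in counts:
        gmm = GaussianMixture(n_components=n, random_state=random_state).fit(X)
        bics.append(gmm.bic(X))          # lower BIC = better fit/complexity trade-off
    return counts[int(np.argmin(bics))]
```

BIC penalizes each extra component, so it guards against splitting one coherent group of risk factors into many tiny clusters. Manual review of a few clusters is still needed, because a low BIC does not guarantee the groups are interpretable.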

### Challenge 4: Context Window Limits
- **Issue**: LLMs have finite context windows and cannot ingest the entire knowledge base
- **Solution**: RAPTOR's hierarchical retrieval provides most relevant chunks + summaries
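
Whatever retrieval returns still has to fit the model's context window. A simple greedy packer is sketched below: take candidates in descending relevance order and skip any that would exceed the budget. The function name, the `(score, text)` input shape, and the whitespace-based token count are assumptions; a real version would use the LLM's own tokenizer and reserve room for the question and the answer.

```python
def pack_context(candidates, max_tokens, count_tokens=lambda text: len(text.split())):
    """Greedily select the highest-scoring texts that fit within max_tokens.

    candidates: list of (score, text) pairs. Returns (selected_texts, tokens_used).
    """
    chosen, used = [], 0
    for score, text in sorted(candidates, key=lambda pair: pair[0], reverse=True):
        cost = count_tokens(text)
        if used + cost > max_tokens:
            continue                     # too large for the remaining budget; try the next one
        chosen.append(text)
        used += cost
    return chosen, used
```

Because summaries are much shorter than the chunks they cover, a high-level summary can often be packed alongside one or two detailed chunks within the same budget.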

---

## Repository Structure
```
edgar_anomaly_detection/
├── data/
│   ├── raw/              # ZIP files of 10-K/10-Q (gitignored)
│   ├── processed/        # Extracted, chunked filings (gitignored)
│   └── embeddings/       # Generated embeddings (gitignored)
├── src/
│   ├── data/
│   │   ├── filing_extractor.py
│   │   └── text_processor.py
│   ├── models/
│   │   ├── raptor.py           # Main RAPTOR class
│   │   ├── embedding_generator.py
│   │   └── clustering.py
│   └── pipeline/
│       └── knowledge_base_builder.py
├── notebooks/
│   ├── 01_project_plan.ipynb   # This file
│   ├── 02_data_exploration.ipynb
│   └── 03_raptor_testing.ipynb
├── dashboard/
│   └── README.md               # Open WebUI setup instructions
├── .gitignore
├── requirements.txt
└── README.md
```

---

## Next Steps
1. Begin Phase 1: Research FinGPT models on Hugging Face
2. Create `src/models/raptor.py` skeleton
3. Test embedding generation on 1-2 sample filings
4. Coordinate with team on EC2 instance access and GPU availability

---

## References
- FinGPT Documentation: https://deepwiki.com/AI4Finance-Foundation/FinGPT/
- RAPTOR RAG System: https://deepwiki.com/AI4Finance-Foundation/FinGPT/5.1-raptor-rag-system
- SEC EDGAR API: https://www.sec.gov/edgar/sec-api-documentation
- Ollama: https://ollama.ai/
- Open WebUI: https://github.com/open-webui/open-webui