# SEC 10-K Risk Factor Analysis - Project Plan

## Project Overview
AI-powered system for analyzing SEC 10-K and 10-Q filings using RAPTOR RAG (Recursive Abstractive Processing for Tree-Organized Retrieval). The system will create an enhanced knowledge base from financial filings that users can query interactively to identify year-over-year changes, risk patterns, and potential fraud indicators.

**Data Coverage:** 1993-2024 (31 years of SEC EDGAR filings)

---

## Core Architecture

### Infrastructure
- **Deployment**: AWS EC2 instance with GPU (in progress)
- **Model Hosting**: Ollama for local LLM deployment
- **User Interface**: Open WebUI for interactive queries
- **Data Storage**: Cloud-based storage for processed embeddings and knowledge base

### Architecture Diagram

![System Architecture](diagrams/architecture.png)

### RAPTOR RAG System
Unlike traditional RAG systems that use simple similarity search, RAPTOR implements:
- **Hierarchical Clustering**: Multi-level organization (global + local) using UMAP and Gaussian Mixture Models
- **Recursive Summarization**: 3-level hierarchical summaries capturing both granular details and high-level themes
- **Enhanced Context Retrieval**: Cluster-aware retrieval providing richer context for LLM queries
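
The recursion behind these three points can be sketched independently of any particular model. In this minimal sketch, `embed`, `cluster`, and `summarize` are hypothetical callables standing in for the Sentence Transformers encoder, the UMAP + GMM step, and the LLM summarizer; the toy stand-ins exist only to make the example runnable and are not FinGPT's implementation.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def build_summary_tree(
    texts: Sequence[str],
    embed: Callable[[Sequence[str]], List[Vector]],
    cluster: Callable[[List[Vector]], List[int]],
    summarize: Callable[[List[str]], str],
    level: int = 1,
    n_levels: int = 3,
) -> Dict[int, List[str]]:
    """Return {level: summaries}; each level summarizes the one below it."""
    labels = cluster(embed(texts))
    # Group texts by cluster label, keeping first-seen cluster order.
    groups: Dict[int, List[str]] = {}
    for text, label in zip(texts, labels):
        groups.setdefault(label, []).append(text)
    summaries = [summarize(members) for members in groups.values()]
    tree = {level: summaries}
    # Recurse only while more than one cluster is left to merge.
    if level < n_levels and len(summaries) > 1:
        tree.update(
            build_summary_tree(summaries, embed, cluster, summarize, level + 1, n_levels)
        )
    return tree


# Toy stand-ins (assumptions for illustration only).
def toy_embed(texts):
    return [[float(len(t))] for t in texts]


def toy_cluster(vectors):
    # Pair up neighbours: items 0-1 go to cluster 0, items 2-3 to cluster 1, ...
    return [i // 2 for i in range(len(vectors))]


def toy_summarize(members):
    return " + ".join(members)


tree = build_summary_tree(["a", "b", "c", "d"], toy_embed, toy_cluster, toy_summarize)
```

The recursion stops early when a level collapses to a single cluster, so fewer than three levels may be produced for small inputs.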

---

## Technical Stack

### NLP & ML
- **LLM Model**: FinGPT-v3 (Llama2-based, fine-tuned for financial analysis)
  - Source: `AI4Finance-Foundation/FinGPT-v3` on Hugging Face
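  - Deployment: Via Ollama (`ollama pull hf.co/AI4Finance-Foundation/FinGPT-v3`); the `hf.co/` pull syntax works only for Hugging Face repositories that publish GGUF weights, so confirm the repository provides them first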
  - Alternative: Any Ollama-compatible model (Llama3, Mistral, etc.)
- **RAPTOR Implementation**: Adapted from FinGPT's `fingpt/FinGPT_FinancialReportAnalysis/utils/rag.py`
  - Source: https://github.com/AI4Finance-Foundation/FinGPT
  - Custom implementation in `src/models/raptor.py`
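- **Embeddings**: Sentence Transformers (`all-MiniLM-L6-v2`) for local, cost-free embedding generation. Note that this model truncates input beyond 256 word pieces by default, so chunk size and embedding model need to be reconciled before embedding ~2000-token chunks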
- **Clustering**: UMAP (dimensionality reduction) + scikit-learn GMM
- **LLM Interface**: Ollama (primary) or OpenAI API (testing/comparison)

### Data Processing
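- **Chunking**: LangChain `RecursiveCharacterTextSplitter` (~2000 tokens/chunk; the splitter counts characters by default, so token-based sizing requires `RecursiveCharacterTextSplitter.from_tiktoken_encoder`)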
- **Vector Storage**: ChromaDB for efficient retrieval
- **Data Format**: JSON/Parquet for structured storage
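
Token-window chunking with overlap can be sketched without any dependencies. This is a simplified illustration, not the LangChain splitter: whitespace-separated words stand in for tokenizer tokens, and the function name, `chunk_tokens`, and `overlap` defaults are assumptions.

```python
from typing import List


def chunk_by_tokens(text: str, chunk_tokens: int = 2000, overlap: int = 200) -> List[str]:
    """Split text into windows of at most `chunk_tokens` whitespace tokens.

    Whitespace tokens are a rough stand-in for tokenizer tokens (assumption).
    Consecutive windows share `overlap` tokens, so text near a boundary
    appears in both neighbouring chunks.
    """
    if not 0 <= overlap < chunk_tokens:
        raise ValueError("overlap must be >= 0 and smaller than chunk_tokens")
    tokens = text.split()
    step = chunk_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_tokens]))
        if start + chunk_tokens >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_by_tokens(sample, chunk_tokens=4, overlap=1)
```

With a real tokenizer, only the `text.split()` and `" ".join(...)` calls would change to encode and decode steps.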

### Libraries
- `langchain`, `langchain_community` - LLM orchestration
- `sentence-transformers` - Local embeddings
- `umap-learn` - Dimensionality reduction
- `scikit-learn` - Clustering algorithms (GMM)
- `pandas`, `numpy` - Data manipulation
- `requests` - SEC EDGAR API access

---

## Data Scope

### Current Data Holdings
- **Time Period:** 1993-2024 (31 years)
- **Data Location:** `data/external/`
- **Filing Types:** 10-K (annual reports) and 10-Q (quarterly filings)
- **Target Sections:** 
  - Item 1A (Risk Factors) - primary focus
  - MD&A (Management Discussion & Analysis)
  - Other disclosure sections as needed
- **Analysis Focus:** Year-over-year changes, new/removed risks, boilerplate vs. substantive disclosure

---

## RAPTOR Pipeline Flowchart

![RAPTOR Pipeline](diagrams/raptor_pipeline.png)

---

## Data Processing Workflow

![Data Processing Workflow](diagrams/data_processing_workflow.png)

---

## Implementation Phases

### Phase 1: Model Research & Setup (Week 1)
**Objectives:**
- [x] Clarify FinGPT components: FinGPT-v3 (LLM model) vs RAPTOR (Python implementation)
- [ ] Pull FinGPT-v3 model into Ollama for testing
- [ ] Copy and adapt RAPTOR class from FinGPT's `rag.py`
- [ ] Set up project structure (`src/`, `data/`, `notebooks/`, `dashboard/`)
- [ ] Create base `Raptor` class skeleton in `src/models/raptor.py`

**Deliverables:**
- FinGPT-v3 running in Ollama
- RAPTOR class adapted from FinGPT source
- Project repository structure

**Key Clarification:**
- ✅ **FinGPT-v3** = Fine-tuned LLM model (downloadable from Hugging Face, runnable in Ollama)
- ✅ **RAPTOR** = Python implementation for hierarchical clustering/summarization (we copy the code)
- ✅ **fingpt-rag** = Deprecated project name (not a model), replaced by newer implementations

---

### Phase 2: Data Processing Pipeline (Week 2)
**Objectives:**
- [ ] Extract filings from downloaded archives (1993-2024)
- [ ] Parse 10-K/10-Q HTML/XML to extract Item 1A and other sections
- [ ] Implement document chunking (2000-token chunks, counted with tiktoken)
- [ ] Generate embeddings using local Sentence Transformers
- [ ] Store structured data (chunks + metadata) in JSON/Parquet

**Key Files:**
- `src/data/filing_extractor.py` - Unzip and parse filings
- `src/data/text_processor.py` - Chunking and cleaning
- `src/models/embedding_generator.py` - Embedding creation

**Validation:**
- Test on 3-5 sample filings before scaling
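- Verify Item 1A extraction accuracy across different time periods (Item 1A became a required Form 10-K item in 2005, so earlier filings may disclose risks under other headings)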
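
For the sample-filing validation, a first-pass Item 1A extractor over already cleaned text might look like the sketch below. The heading patterns and the function name are assumptions; real filings vary in punctuation, casing, and layout, so this is a starting point for testing rather than a finished parser.

```python
import re
from typing import Optional

# Heading patterns are assumptions; real filings vary in punctuation and casing.
START = re.compile(
    r"^\s*item\s+1a\.?\s*[-:.]?\s*risk\s+factors", re.IGNORECASE | re.MULTILINE
)
END = re.compile(r"^\s*item\s+(1b|2)\b", re.IGNORECASE | re.MULTILINE)


def extract_item_1a(text: str) -> Optional[str]:
    """Return the longest span from an 'Item 1A' heading to the next Item 1B/2 heading.

    Taking the longest candidate skips the table-of-contents entry, which also
    matches the heading pattern but is followed almost immediately by the next item.
    """
    best = None
    for start in START.finditer(text):
        end = END.search(text, start.end())
        stop = end.start() if end else len(text)
        section = text[start.start():stop].strip()
        if best is None or len(section) > len(best):
            best = section
    return best


# Made-up sample text for illustration.
sample = """TABLE OF CONTENTS
Item 1A. Risk Factors
Item 1B. Unresolved Staff Comments

Item 1A. Risk Factors
Our business depends on a small number of customers.
Cybersecurity incidents could disrupt operations.
Item 1B. Unresolved Staff Comments
None.
"""
section = extract_item_1a(sample)
```

The function returns `None` when no heading matches, which is worth logging per filing year to see where extraction coverage drops.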

---

### Phase 3: RAPTOR System Implementation (Week 3)
**Objectives:**
- [ ] Implement hierarchical clustering (adapted from FinGPT's RAPTOR):
  - Global clustering (UMAP → GMM with BIC for optimal cluster count)
  - Local clustering (secondary refinement within global clusters)
- [ ] Build recursive summarization engine (3 levels deep)
- [ ] Create enhanced knowledge base combining:
  - Original document chunks
  - Level 1 summaries (cluster summaries)
  - Level 2 summaries (summary of summaries)
  - Level 3 summaries (highest abstraction)
- [ ] Implement cluster-aware retrieval mechanism

**Key Methods in `Raptor` class (adapted from FinGPT):**
```python
# Signatures only: bodies are elided with `...`; inside `Raptor` each method also takes `self`.
def global_cluster_embeddings(embeddings, dim, n_neighbors, metric="cosine"): ...
def local_cluster_embeddings(embeddings, dim, num_neighbors=10): ...
def get_optimal_clusters(embeddings, max_clusters=50): ...
def GMM_cluster(embeddings, threshold, random_state=0): ...
def perform_clustering(embeddings, dim, threshold): ...
def recursive_embed_cluster_summarize(texts, level=1, n_levels=3): ...
```

**Source Reference:**
- Original implementation: https://github.com/AI4Finance-Foundation/FinGPT/blob/master/fingpt/FinGPT_FinancialReportAnalysis/utils/rag.py
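
As a rough illustration of the BIC-based cluster count and thresholded soft assignment named in the method list, the sketch below uses scikit-learn's `GaussianMixture`. It is not copied from FinGPT's `rag.py`: the UMAP reduction is omitted, synthetic 2-D points stand in for reduced embeddings, and `gmm_soft_assign` is a hypothetical helper name.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def get_optimal_clusters(embeddings: np.ndarray, max_clusters: int = 50,
                         random_state: int = 0) -> int:
    """Return the component count with the lowest BIC (lower is better)."""
    max_clusters = min(max_clusters, len(embeddings))
    candidates = range(1, max_clusters + 1)
    bics = [
        GaussianMixture(n_components=n, random_state=random_state)
        .fit(embeddings)
        .bic(embeddings)
        for n in candidates
    ]
    return candidates[int(np.argmin(bics))]


def gmm_soft_assign(embeddings: np.ndarray, threshold: float,
                    max_clusters: int = 50, random_state: int = 0):
    """Assign each point to every cluster whose posterior exceeds `threshold`."""
    n = get_optimal_clusters(embeddings, max_clusters, random_state)
    gm = GaussianMixture(n_components=n, random_state=random_state).fit(embeddings)
    probs = gm.predict_proba(embeddings)
    labels = [np.where(p > threshold)[0] for p in probs]
    return labels, n


# Synthetic stand-in for UMAP-reduced embeddings: three well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(60, 2)) for c in centers])
labels, k = gmm_soft_assign(points, threshold=0.1, max_clusters=6)
```

Soft assignment lets a chunk that straddles two themes contribute to both cluster summaries; raising `threshold` makes assignments closer to hard clustering.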

**Testing:**
- Validate clustering quality on sample documents
- Review generated summaries for coherence

---

### Phase 4: LLM Integration & Deployment (Week 4)
**Objectives:**
- [ ] Set up Ollama on EC2 instance with FinGPT-v3 model
- [ ] Deploy Open WebUI for user interaction
- [ ] Integrate RAPTOR knowledge base with LLM query system
- [ ] Implement query handling:
  - YoY change detection queries
  - Risk classification questions
  - Boilerplate vs. substantive disclosure analysis
- [ ] Create sample query templates for common use cases

**Integration Workflow:**
1. User submits query via Open WebUI
2. RAPTOR retrieves relevant chunks + hierarchical summaries
3. Context passed to Ollama LLM (FinGPT-v3)
4. LLM generates response with supporting evidence
5. Results displayed in WebUI
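
Steps 2 to 4 of this workflow can be sketched as follows. This is a simplified version under stated assumptions: retrieval is a flat cosine search over chunks and summaries from all levels (not necessarily the cluster-aware mechanism planned here), the toy vectors and passages are made up, and the `llama3` model name and localhost address in `ask_ollama` are placeholders. Only the `/api/generate` endpoint and its `model`, `prompt`, `stream`, and `response` fields come from Ollama's REST API.

```python
import json
import math
import urllib.request
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: Vector, index: List[Tuple[str, Vector]], k: int = 3) -> List[str]:
    """Rank every entry (chunks and summaries alike) by cosine similarity."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, passages: List[str]) -> str:
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using only the numbered context and cite passage numbers.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


def ask_ollama(prompt: str, model: str = "llama3",
               host: str = "http://localhost:11434") -> str:
    """Send the prompt to a running Ollama server (model and host are assumptions)."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        f"{host}/api/generate", data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


# Toy index mixing raw chunks with a higher-level summary (made-up data).
index = [
    ("2023 chunk: ransomware attack disclosed", [0.9, 0.1]),
    ("Level 1 summary: cyber risk language expanded", [0.8, 0.3]),
    ("2023 chunk: foreign exchange exposure", [0.1, 0.9]),
]
passages = retrieve([1.0, 0.0], index, k=2)
prompt = build_prompt("How did cyber risk disclosures change?", passages)
# answer = ask_ollama(prompt)  # requires a running Ollama server
```

Numbering the passages in the prompt gives the model something concrete to cite, which supports the "response with supporting evidence" step.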

**Deliverables:**
- Functional Open WebUI interface
- End-to-end query processing pipeline
- Documentation for common queries

---

## RAPTOR vs. Traditional RAG Comparison

| Feature | Traditional RAG | RAPTOR RAG |
|---------|----------------|------------|
| Text Processing | Simple chunking | Recursive, hierarchical |
| Clustering | None or basic | Multi-level (global + local) |
| Summarization | None or single-level | Recursive, 3-level |
| Context Selection | Similarity-based only | Cluster-aware + similarity |
| Document Understanding | Flat representation | Hierarchical representation |
| Knowledge Integration | Direct chunks only | Chunks + multi-level summaries |

**Why RAPTOR for Financial Filings?**
- Financial documents have hierarchical structure (sections, subsections, themes)
- YoY analysis requires understanding both granular changes and high-level shifts
- Boilerplate detection benefits from cluster analysis (repetitive language clusters together)
- Complex queries need multi-level context (e.g., "How did cyber risk disclosures evolve?")
- Historical coverage (1993-2024) enables long-term trend analysis

---

## Success Metrics
- [ ] Successfully process 90%+ of downloaded filings (1993-2024) into knowledge base
- [ ] Clustering produces coherent, interpretable groups
- [ ] Generated summaries accurately capture document content at each level
- [ ] LLM queries return relevant, accurate responses with supporting evidence
- [ ] System responds to queries in <10 seconds (including retrieval + generation)
- [ ] Manual validation: Test 10 YoY comparison queries across different decades, verify accuracy

---

## Key Advantages of AI-First Approach
1. **No Manual Feature Engineering**: LLM infers patterns from enhanced context (vs. building YoY diff algorithms)
2. **Flexible Queries**: Users can ask arbitrary questions beyond predefined analyses
3. **Semantic Understanding**: Detects substantive changes even when wording differs
4. **Scalable**: Adding new filings requires only re-running the RAPTOR pipeline
5. **Explainable**: LLM can cite specific sections supporting its conclusions
6. **Historical Depth**: 31 years of data enables long-term trend analysis

---

## Technical Challenges & Mitigations

### Challenge 1: Embedding Generation at Scale
- **Issue**: Processing 31 years of large documents requires significant compute power
- **Solution**: Use EC2 GPU instance, batch processing, cache embeddings, process in chronological chunks
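
The batching and caching part of this mitigation might be sketched as below. The function name, the JSON cache file, and the batch size are assumptions for illustration; `embed_batch` is a hypothetical callable that would wrap the real encoder, and at full scale a columnar or vector store would likely replace the JSON file.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List, Sequence


def embed_with_cache(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    cache_path: Path,
    batch_size: int = 64,
) -> List[List[float]]:
    """Embed texts in batches, skipping any text whose hash is already cached."""
    cache: Dict[str, List[float]] = (
        json.loads(cache_path.read_text()) if cache_path.exists() else {}
    )
    keys = [hashlib.sha256(t.encode("utf-8")).hexdigest() for t in texts]
    # Collect uncached texts once each, preserving first-seen order.
    missing: Dict[str, str] = {}
    for key, text in zip(keys, texts):
        if key not in cache:
            missing.setdefault(key, text)
    items = list(missing.items())
    for i in range(0, len(items), batch_size):
        batch = items[i:i + batch_size]
        vectors = embed_batch([text for _, text in batch])
        for (key, _), vector in zip(batch, vectors):
            cache[key] = list(vector)
    if missing:
        cache_path.write_text(json.dumps(cache))
    return [cache[key] for key in keys]


# Fake encoder that records batch sizes, used only to show cache hits.
calls: List[int] = []


def fake_embed(batch: List[str]) -> List[List[float]]:
    calls.append(len(batch))
    return [[float(len(t))] for t in batch]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "embeddings.json"
    first = embed_with_cache(["risk a", "risk b", "risk a"], fake_embed, path)
    second = embed_with_cache(["risk a", "risk c"], fake_embed, path)
```

Keying the cache on a content hash means boilerplate paragraphs repeated across years are embedded only once.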

### Challenge 2: Model Selection & Deployment
- **Issue**: Need to clarify which FinGPT components to use (model vs. implementation)
- **Solution**: 
  - Use FinGPT-v3 LLM model via Ollama (Hugging Face → Ollama)
  - Copy RAPTOR implementation from FinGPT's `rag.py`
  - Maintain flexibility to swap models (Llama3, Mistral, etc.)

### Challenge 3: Clustering Quality
- **Issue**: Poorly defined clusters reduce summary quality
- **Solution**: Use BIC for optimal cluster count, validate clusters manually on samples

### Challenge 4: Context Window Limits
- **Issue**: LLMs have finite context windows and cannot ingest the entire knowledge base
- **Solution**: RAPTOR's hierarchical retrieval provides most relevant chunks + summaries
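
One way to keep retrieved context inside a token budget is greedy packing by relevance score. The sketch below is an assumption about how this could be done, not part of RAPTOR itself; `pack_context` is a hypothetical helper, the scores and passages are made up, and whitespace word counts approximate token counts.

```python
from typing import List, Tuple


def pack_context(scored_passages: List[Tuple[float, str]], max_tokens: int) -> List[str]:
    """Greedily keep the highest-scoring passages that still fit in the budget."""
    selected: List[str] = []
    used = 0
    for _, text in sorted(scored_passages, key=lambda p: p[0], reverse=True):
        cost = len(text.split())  # whitespace count approximates tokens (assumption)
        if used + cost <= max_tokens:
            selected.append(text)
            used += cost
    return selected


candidates = [
    (0.91, "Level 2 summary: cyber risk disclosures grew after 2017"),
    (0.88, "2019 chunk: we experienced a breach affecting customer data and incurred remediation costs"),
    (0.40, "2019 chunk: currency fluctuations"),
]
context = pack_context(candidates, max_tokens=15)
```

Because summaries are short, they tend to survive tight budgets, which is one practical reason to mix summary levels with raw chunks.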

### Challenge 5: Data Format Evolution
- **Issue**: SEC filing formats changed significantly between 1993 and 2024
- **Solution**: Build parsing logic that handles the plain-text/SGML submissions of the 1990s, HTML filings, and modern inline XBRL (XHTML) documents, and test it on samples from each era
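
A format-sniffing step could route each document to the right parser. The sketch below assumes EDGAR full-submission text files, which wrap each document in `<DOCUMENT>`, `<TYPE>`, and `<TEXT>` tags; the detection heuristics, function names, and sample content are illustrative assumptions and will need tuning against real filings.

```python
import re
from typing import Iterator, Tuple

DOC_RE = re.compile(r"<DOCUMENT>(.*?)</DOCUMENT>", re.DOTALL | re.IGNORECASE)
TYPE_RE = re.compile(r"<TYPE>([^\s<]+)", re.IGNORECASE)
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL | re.IGNORECASE)


def detect_format(body: str) -> str:
    """Heuristically classify a document body by inspecting its first characters."""
    head = body[:5000].lower()
    if "<ix:" in head or "xmlns:ix" in head:
        return "inline_xbrl"
    if "<html" in head or "<body" in head:
        return "html"
    return "plain_text"


def iter_documents(submission: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (form_type, format, body) for each <DOCUMENT> block in a submission."""
    for match in DOC_RE.finditer(submission):
        block = match.group(1)
        type_match = TYPE_RE.search(block)
        text_match = TEXT_RE.search(block)
        if not type_match or not text_match:
            continue  # skip malformed blocks rather than guessing
        body = text_match.group(1)
        yield type_match.group(1), detect_format(body), body


# Made-up miniature submission for illustration.
sample = """<SEC-DOCUMENT>sample.txt
<DOCUMENT>
<TYPE>10-K
<SEQUENCE>1
<TEXT>
ANNUAL REPORT plain text body
</TEXT>
</DOCUMENT>
<DOCUMENT>
<TYPE>EX-27
<TEXT>
<html><body>exhibit</body></html>
</TEXT>
</DOCUMENT>
"""
docs = [(form_type, fmt) for form_type, fmt, _ in iter_documents(sample)]
```

Filtering on the `<TYPE>` value before parsing also drops exhibits early, so only the main 10-K or 10-Q body reaches the section extractor.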

---

## Repository Structure
```
edgar_anomaly_detection/
├── data/
│   ├── external/         # Downloaded filings 1993-2024 (gitignored)
│   ├── processed/        # Extracted, chunked filings (gitignored)
│   └── embeddings/       # Generated embeddings (gitignored)
├── src/
│   ├── data/
│   │   ├── filing_extractor.py
│   │   └── text_processor.py
│   ├── models/
│   │   ├── raptor.py           # RAPTOR class (adapted from FinGPT)
│   │   ├── embedding_generator.py
│   │   └── clustering.py
│   └── pipeline/
│       └── knowledge_base_builder.py
├── notebooks/
│   ├── 01_project_plan.ipynb   # This file
│   ├── 02_data_collection.ipynb
│   └── 03_raptor_testing.ipynb
├── dashboard/
│   └── README.md               # Open WebUI setup instructions
├── .gitignore
├── requirements.txt
└── README.md
```

---

## Next Steps
1. Pull FinGPT-v3 model into Ollama: `ollama pull hf.co/AI4Finance-Foundation/FinGPT-v3`
2. Copy RAPTOR class from FinGPT GitHub to `src/models/raptor.py`
3. Test FinGPT-v3 on sample financial text queries
4. Test embedding generation on 1-2 sample filings from different time periods (e.g., 1995, 2010, 2024)
5. Coordinate with team on EC2 instance access and GPU availability

---

## References
- **FinGPT GitHub:** https://github.com/AI4Finance-Foundation/FinGPT
- **FinGPT RAPTOR Implementation:** https://github.com/AI4Finance-Foundation/FinGPT/blob/master/fingpt/FinGPT_FinancialReportAnalysis/utils/rag.py
- **FinGPT Models on Hugging Face:** https://huggingface.co/AI4Finance-Foundation
- **RAPTOR RAG Documentation:** https://deepwiki.com/AI4Finance-Foundation/FinGPT/5.1-raptor-rag-system
- **SEC EDGAR API:** https://www.sec.gov/edgar/sec-api-documentation
- **Ollama:** https://ollama.ai/
- **Open WebUI:** https://github.com/open-webui/open-webui