# Project Resources & Citations

**Purpose:** Centralized list of all research papers, tools, frameworks, and techniques used in this project

**Last Updated:** 2025-10-16

---

## Research Papers & Academic Sources

### RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval)
- **Paper:** "RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval"
- **Venue:** ICLR 2024
- **Link:** https://arxiv.org/abs/2401.18059
- **Used for:** Hierarchical clustering and recursive summarization approach
- **Key takeaway:** Uses SBERT embeddings and 100-token base chunks; retrieving across the summary tree improves performance over retrieving raw chunks alone

---

### Sentence-BERT (SBERT)
- **Paper:** "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks"
- **Authors:** Reimers & Gurevych (2019)
- **Link:** https://arxiv.org/abs/1908.10084
- **Used for:** Embedding model selection (all-MiniLM-L6-v2 is from the SBERT family)
- **Key takeaway:** Optimized for semantic similarity search; the paper reports finding the most similar pair among 10,000 sentences in ~5 seconds with SBERT versus ~65 hours with BERT

---

### MTEB (Massive Text Embedding Benchmark)
- **Paper:** "MTEB: Massive Text Embedding Benchmark"
- **Year:** 2022
- **Link:** https://arxiv.org/abs/2210.07316
- **Leaderboard:** https://huggingface.co/spaces/mteb/leaderboard
- **Used for:** Validating embedding model quality (all-MiniLM-L6-v2 scores 56.26/100, top 20%)
- **Key takeaway:** 8 embedding tasks spanning 58 datasets for comprehensive evaluation

---

### Anthropic Contextual Retrieval
- **Article:** "Introducing Contextual Retrieval"
- **Date:** September 2024
- **Link:** https://www.anthropic.com/news/contextual-retrieval
- **Used for:** Contextual chunking strategy (50-100 token context window)
- **Key takeaway:** Contextual embeddings reduce top-20 retrieval failures by 35%; combining them with contextual BM25 gives a 49% reduction, and adding reranking gives 67%

---

### RAGAS (Retrieval-Augmented Generation Assessment)
- **Paper:** "Ragas: Automated Evaluation of Retrieval Augmented Generation"
- **Venue:** EACL 2024 (Demo Track)
- **Link:** https://arxiv.org/abs/2309.15217
- **ACL Anthology:** https://aclanthology.org/2024.eacl-demo.16/
- **Docs:** https://docs.ragas.io/en/stable/
- **Used for:** Evaluating RAG system performance (faithfulness, relevancy, precision, recall)
- **Key takeaway:** LLM-based, reference-free evaluation; the paper's core metrics (faithfulness, answer relevance, context relevance) need no human-annotated ground truth

---

### Financial Report Chunking for RAG
- **Paper:** "Financial Report Chunking for Effective Retrieval Augmented Generation"
- **Authors:** Jimeno Yepes et al.
- **Year:** 2024
- **Link:** https://arxiv.org/html/2402.05131v3
- **Chunk sizes tested:** 128, 256, 512 tokens + element-based chunking
- **Key findings:** 512 tokens produces similar results to element-based chunking for financial documents
- **Used for:** Validating 500-token chunk size choice for SEC filings

---

### LlamaIndex Chunk Size Evaluation
- **Article:** "Evaluating the Ideal Chunk Size for a RAG System using LlamaIndex"
- **Author:** Ravi Theja
- **Date:** October 5, 2023
- **Link:** https://www.llamaindex.ai/blog/evaluating-the-ideal-chunk-size-for-a-rag-system-using-llamaindex-6207e5d3fec5
- **Chunk sizes tested:** 128, 256, 512, 1024, 2048 tokens
- **Key findings:** 1024 tokens optimal for faithfulness and relevancy; 512 second best
- **Used for:** Understanding chunk size vs performance tradeoffs

---

### NVIDIA Chunking Strategy Research
- **Article:** "Finding the Best Chunking Strategy for Accurate AI Responses"
- **Author:** Steve Han (NVIDIA)
- **Date:** June 18, 2024
- **Link:** https://developer.nvidia.com/blog/finding-the-best-chunking-strategy-for-accurate-ai-responses/
- **Chunk sizes tested:** 128, 256, 512, 1024, 2048 tokens with 10%, 15%, 20% overlap
- **Key findings:** 15% overlap best for FinanceBench; 256-512 tokens best for factoid queries; 1024 for complex queries
- **Used for:** Validating overlap strategy and chunk size for financial data
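
The overlap percentages above translate into a stride between consecutive chunk starts. A minimal sketch, assuming the document is already tokenized into a list; the function name `chunk_with_overlap` and the placeholder tokens are illustrative and not taken from the NVIDIA article:

```python
def chunk_with_overlap(tokens, chunk_size=512, overlap_pct=0.15):
    """Split a token list into fixed-size chunks sharing overlap_pct of their tokens."""
    if chunk_size <= 0 or not 0 <= overlap_pct < 1:
        raise ValueError("chunk_size must be positive and overlap_pct in [0, 1)")
    overlap = int(chunk_size * overlap_pct)
    stride = chunk_size - overlap  # distance between consecutive chunk starts
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reaches the end of the document
    return chunks


# Placeholder tokens t0..t999 stand in for a tokenized filing.
tokens = [f"t{i}" for i in range(1000)]
chunks = chunk_with_overlap(tokens, chunk_size=512, overlap_pct=0.15)
print(len(chunks), [len(c) for c in chunks])
```

With 512-token chunks and 15% overlap, each chunk repeats the last 76 tokens of the previous one.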

---

## Embedding Models

### all-MiniLM-L6-v2 (Sentence Transformers)
- **Model:** sentence-transformers/all-MiniLM-L6-v2
- **Link:** https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
- **Docs:** https://www.sbert.net/
- **Specifications:**
  - Dimensions: 384
  - Parameters: 22.7M
  - Speed: ~1000 sentences/second on CPU
  - Context window: 512 tokens (per the model card, input longer than 256 word pieces is truncated by default)
  - Training: 1B+ sentence pairs
- **Used for:** Generating embeddings for 2.7M SEC filing chunks
- **Why chosen:** Best performance-to-size ratio, RAPTOR paper uses SBERT family, 200M+ downloads

---

## LLM Models & Serving

### Ollama
- **Website:** https://ollama.com/
- **Alternate URL:** https://ollama.ai/
- **GitHub:** https://github.com/ollama/ollama
- **Python SDK:** https://github.com/ollama/ollama-python
- **Models Library:** https://ollama.com/library
- **Used for:** Local LLM serving and inference
- **Why chosen:** Easy local deployment, GGUF support, Docker-friendly

---

### gpt-oss (13 GB model)
- **Pulled via:** `ollama pull gpt-oss`
- **Size:** 13 GB
- **Used for:** LLM inference and summarization testing
- **Format:** GGUF (quantized)

---

### Arcee AI llama3-sec
- **Model Page:** https://ollama.com/arcee-ai/llama3-sec
- **Pulled via:** `ollama pull arcee-ai/llama3-sec`
- **Used for:** SEC filing-specific LLM inference
- **Note:** Fine-tuned for the SEC/financial domain

---

### GGUF (GPT-Generated Unified Format)
- **Description:** Quantized model format for efficient inference
- **Used by:** Ollama, llama.cpp
- **Quantization levels:**
  - q2_K: 2-bit (smallest, fastest, lower quality)
  - q4_K_M: 4-bit (balanced) ← Recommended
  - q8_0: 8-bit (best quality, largest)
- **Benefit:** ~8x smaller than 32-bit full precision at 4-bit (~4x smaller than 16-bit weights)
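
The size benefit follows directly from bits per weight. A rough estimate, assuming nominal bit widths (real GGUF files store some extra per-block data and metadata, so actual sizes are somewhat larger); the 7-billion-parameter count is a hypothetical example, not one of the models listed here:

```python
def model_size_gb(n_params, bits_per_weight):
    """Approximate weight storage in GB (10^9 bytes) at a given bit width."""
    return n_params * bits_per_weight / 8 / 1e9


n_params = 7_000_000_000  # hypothetical 7B-parameter model
sizes = {bits: model_size_gb(n_params, bits) for bits in (32, 16, 8, 4, 2)}
for bits, gb in sizes.items():
    print(f"{bits:>2}-bit: {gb:5.2f} GB")
```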

---

## Vector Database

### ChromaDB
- **Website:** https://www.trychroma.com/
- **Docs:** https://docs.trychroma.com/
- **GitHub:** https://github.com/chroma-core/chroma
- **Used for:** Storing and retrieving 2.7M chunk embeddings
- **Why chosen:** Python-native, easy local setup, supports metadata filtering

---

## Python Libraries & Frameworks

### LangChain
- **Docs:** https://python.langchain.com/
- **GitHub:** https://github.com/langchain-ai/langchain
- **Used for:** LLM orchestration and RAG pipeline building

---

### Sentence Transformers
- **Docs:** https://www.sbert.net/
- **GitHub:** https://github.com/UKPLab/sentence-transformers
- **PyPI:** https://pypi.org/project/sentence-transformers/
- **Used for:** Loading and running embedding models (all-MiniLM-L6-v2)

---

### tiktoken
- **GitHub:** https://github.com/openai/tiktoken
- **PyPI:** https://pypi.org/project/tiktoken/
- **Used for:** Token counting (cl100k_base encoding)
- **Why chosen:** Accurate token counts for GPT-style tokenization

---

### UMAP (Uniform Manifold Approximation and Projection)
- **Paper:** https://arxiv.org/abs/1802.03426
- **Docs:** https://umap-learn.readthedocs.io/
- **PyPI:** https://pypi.org/project/umap-learn/
- **Used for:** Dimensionality reduction before clustering (384 dims → 10 dims)
- **Why chosen:** RAPTOR paper uses UMAP for clustering

---

### scikit-learn (GMM - Gaussian Mixture Models)
- **Docs:** https://scikit-learn.org/
- **GMM Docs:** https://scikit-learn.org/stable/modules/mixture.html
- **Used for:** Clustering embeddings after UMAP reduction
- **Why chosen:** RAPTOR uses GMM for hierarchical clustering

---

### NumPy
- **Docs:** https://numpy.org/doc/
- **Used for:** Array operations and embedding storage (.npy files)

---

### pandas
- **Docs:** https://pandas.pydata.org/docs/
- **Used for:** Data manipulation and statistics

---

## Data Sources

### SEC EDGAR 10-X Files (Notre Dame SRAF)
- **Source:** Notre Dame Software Repository for Accounting and Finance (SRAF)
- **Link:** https://sraf.nd.edu/sec-edgar-data/cleaned-10x-files/
- **Data Cleaning Documentation:** https://sraf.nd.edu/sec-edgar-data/cleaned-10x-files/10x-stage-one-parsing-documentation/
- **Description:** Pre-cleaned SEC 10-K and 10-Q filings (1993-2024)
- **Format:** SRAF-XML wrapper around original HTML/XML/SGML filings
- **Coverage:** 32 filing years, 1993-2024 inclusive (~51 GB total)
- **Our dataset:** `10-X_C_2024.zip` (26,014 filings, 1.6 GB compressed)

---

### SEC EDGAR API
- **Documentation:** https://www.sec.gov/edgar/sec-api-documentation
- **Used for:** Understanding SEC filing structure and metadata
- **Note:** Raw API access (we use pre-cleaned SRAF data instead)

---

## Deployment & Infrastructure

### Docker
- **Website:** https://www.docker.com/
- **Docs:** https://docs.docker.com/
- **Used for:** Containerizing LLM services, ChromaDB, and Open WebUI

---

### Docker Compose
- **Docs:** https://docs.docker.com/compose/
- **Used for:** Multi-container orchestration (Ollama + ChromaDB + Open WebUI)

---

### AWS EC2
- **Docs:** https://docs.aws.amazon.com/ec2/
- **Instance Types:** https://aws.amazon.com/ec2/instance-types/
- **Planned instance:** r6i.4xlarge (128 GB RAM, 16 vCPUs)
- **Used for:** Production deployment of RAPTOR RAG system

---

### AWS S3
- **Docs:** https://docs.aws.amazon.com/s3/
- **Used for:** Storing GGUF models for EC2 deployment (free intra-region transfer)

---

### Open WebUI
- **GitHub:** https://github.com/open-webui/open-webui
- **Docs:** https://docs.openwebui.com/
- **Used for:** Interactive web interface for querying RAG system

---

### Open WebUI Custom Themes
- **Tutorial:** "How to Build Custom Open WebUI Themes"
- **Author:** Jonas Scholz (code42cate)
- **Link:** https://dev.to/code42cate/how-to-build-custom-open-webui-themes-55hh
- **Used for:** Customizing Open WebUI appearance and styling
- **Key features:** CSS customization, color schemes, layout modifications

---

## Implementation References

### FinGPT RAPTOR Implementation
- **GitHub:** https://github.com/AI4Finance-Foundation/FinGPT
- **RAPTOR Code:** https://github.com/AI4Finance-Foundation/FinGPT/blob/master/fingpt/FinGPT_FinancialReportAnalysis/utils/rag.py
- **Used for:** Reference implementation of RAPTOR for financial documents
- **Key insight:** Adapted for SEC 10-K/10-Q filings

---

### RAPTOR RAG Documentation (FinGPT)
- **Documentation:** https://deepwiki.com/AI4Finance-Foundation/FinGPT/5.1-raptor-rag-system
- **Used for:** Understanding FinGPT's RAPTOR implementation details
- **Key sections:** Hierarchical clustering, recursive summarization, query interface

---

## Techniques & Methodologies

### Contextual Chunking
- **Source:** Anthropic Contextual Retrieval (Sept 2024)
- **Link:** https://www.anthropic.com/news/contextual-retrieval
- **Description:** Embed extended chunks (with context) but store only core chunks
- **Our implementation:**
  - Core chunk: 500 tokens (stored)
  - Context window: 100 tokens (50 before + 50 after)
  - Extended chunk: ~600 tokens (500 core + 100 context; embedded)
  - Header: Company, form type, filing date, CIK
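- **Expected improvement:** 35-49% fewer retrieval failures, based on Anthropic's reported results; their method prepends LLM-generated context to each chunk, so gains with neighboring-token context may differ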
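
The core/extended split described above can be sketched as follows. This is a simplified illustration, assuming the filing is already a token list and the header is supplied as a short token list; the function name `build_chunks` and the header values are hypothetical, and the project's actual tokenizer (tiktoken) is not used here:

```python
def build_chunks(tokens, header, core_size=500, context=50):
    """Return (core, extended) token-list pairs for one filing.

    core     -> the part that is stored
    extended -> header + preceding context + core + following context,
                the part that is embedded
    """
    pairs = []
    for start in range(0, len(tokens), core_size):
        end = min(start + core_size, len(tokens))
        core = tokens[start:end]
        before = tokens[max(0, start - context):start]  # up to 50 tokens before
        after = tokens[end:end + context]               # up to 50 tokens after
        pairs.append((core, header + before + core + after))
    return pairs


# Toy usage: integers stand in for tokens; header fields are placeholders.
header = ["Company:", "ExampleCo", "Form:", "10-K"]
pairs = build_chunks(list(range(1200)), header)
for core, extended in pairs:
    print(len(core), len(extended))
```

The first and last chunks of a filing get context on one side only, so their extended length is shorter than that of interior chunks.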

---

### RAPTOR Hierarchical Clustering
- **Source:** RAPTOR Paper (ICLR 2024)
- **Link:** https://arxiv.org/abs/2401.18059
- **Description:** Multi-level clustering and recursive summarization
- **Steps:**
  1. Embed all chunks (SBERT)
  2. Reduce dimensions (UMAP: 384 → 10 dims)
  3. Cluster (GMM with BIC for optimal K)
  4. Summarize at 3 levels (chunk → cluster → document)
  5. Store summaries for hierarchical retrieval
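
The control flow of the steps above can be sketched with the embedding, clustering, and summarization stages passed in as callables. This is only a skeleton under stated assumptions: `build_tree` and the three stand-in functions are hypothetical, the stand-ins replace SBERT, UMAP + GMM, and the LLM so the example runs without ML dependencies, and it assigns each node to exactly one cluster, whereas RAPTOR uses soft clustering:

```python
def build_tree(chunks, embed, cluster, summarize, max_summary_levels=3):
    """Build summary levels bottom-up; levels[0] holds the input chunks."""
    levels = [list(chunks)]
    while len(levels) <= max_summary_levels and len(levels[-1]) > 1:
        nodes = levels[-1]
        vectors = [embed(node) for node in nodes]  # step 1: embed
        labels = cluster(vectors)                  # steps 2-3: reduce + cluster
        groups = {}
        for node, label in zip(nodes, labels):
            groups.setdefault(label, []).append(node)
        # step 4: one summary per cluster becomes a node on the next level
        levels.append([summarize(groups[label]) for label in sorted(groups)])
    return levels  # step 5: every level can be stored for retrieval


# Stand-ins for illustration only (not real models):
def fake_embed(text):
    return [float(len(text))]


def pair_cluster(vectors):
    return [i // 2 for i in range(len(vectors))]  # group neighbours in pairs


def join_summary(members):
    return " + ".join(members)


levels = build_tree(["a", "b", "c", "d"], fake_embed, pair_cluster, join_summary)
for depth, nodes in enumerate(levels):
    print(depth, nodes)
```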

---

### RAG (Retrieval-Augmented Generation)
- **Description:** Retrieve relevant context, augment prompt, generate answer
- **Pipeline:**
  1. User query → embed query
  2. Similarity search in ChromaDB → retrieve top-K chunks
  3. Augment prompt with retrieved context
  4. LLM generates answer based on context
- **Variants tested:**
  - **Baseline:** No RAG (LLM alone)
  - **Simple RAG:** Basic retrieval (no RAPTOR)
  - **RAPTOR RAG:** Hierarchical retrieval with cluster-aware context
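
The four pipeline steps can be illustrated with an in-memory index in place of ChromaDB. This is a toy sketch: the function names, the 3-dimensional vectors, and the example chunk texts are all made up for illustration, and the final LLM call is left out:

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, index, k=2):
    """Return the k chunk texts whose vectors are most similar to the query."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question, contexts):
    """Augment the question with numbered context passages."""
    block = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(contexts))
    return f"Answer using only the context below.\n\n{block}\n\nQuestion: {question}"


# Toy index of (chunk text, embedding); real embeddings here would have 384 dims.
index = [
    ("Revenue rose 12% year over year.", [0.9, 0.1, 0.0]),
    ("The company leases office space.", [0.1, 0.9, 0.0]),
    ("Net sales growth was driven by services.", [0.8, 0.2, 0.1]),
]
top = retrieve([1.0, 0.0, 0.0], index, k=2)
prompt = build_prompt("How did revenue change?", top)
print(prompt)
```

The resulting `prompt` string is what would be sent to the LLM in step 4.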

---

## Evaluation Frameworks

### RAGAS Metrics
- **Framework:** https://docs.ragas.io/en/stable/
- **Metrics used:**
  - **Faithfulness:** Are the answer's claims supported by the retrieved context?
  - **Answer Relevancy:** Does the answer address the question?
  - **Context Precision:** Are relevant chunks ranked higher?
  - **Context Recall:** Was all needed info retrieved?
- **Used for:** Comparing Baseline vs Simple RAG vs RAPTOR RAG
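
To make "ranked higher" concrete, the sketch below computes a rank-weighted precision in the spirit of the Context Precision definition in the RAGAS docs (mean of precision@k over the ranks that hold a relevant chunk). It is an illustration, not the library's implementation: RAGAS obtains the relevance judgments from an LLM, whereas here they are a hand-written list, and `context_precision` is a hypothetical helper name:

```python
def context_precision(relevance):
    """relevance[i] is 1 if the chunk at rank i + 1 is relevant, else 0."""
    hits = 0
    weighted = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            weighted += hits / rank  # precision@rank, counted at relevant ranks only
    return weighted / hits if hits else 0.0


good_order = context_precision([1, 1, 0, 0])  # relevant chunks ranked first
bad_order = context_precision([0, 0, 1, 1])   # same chunks, ranked last
print(good_order, bad_order)
```

The same two relevant chunks score lower when they appear at ranks 3 and 4 than at ranks 1 and 2, which is the behavior the metric is meant to capture.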

---

### Manual Evaluation
- **Method:** Human review of 5 test questions across 3 systems (15 answers total)
- **Criteria:**
  - Factual accuracy (verifiable from filings)
  - Citation correctness (references right documents)
  - Completeness (addresses all parts of question)
  - No hallucinations (all claims grounded in context)

---

## Project-Specific Decisions

### Chunk Size: 500 Tokens
- **Source:** Multi-year prototype testing (archive_v1_multi_year/) + 2024 research validation
- **Test range:** 200-8000 tokens (12 variants tested on 1,375 filings)
- **Empirical findings:** 500-1000 tokens optimal for SEC filings
- **Chosen:** 500 tokens (balance between granularity and context)

**Research validation:**
1. **Financial Report Chunking (2024):** 512 tokens produces similar results to element-based chunking
2. **NVIDIA FinanceBench (2024):** 256-512 tokens best for factoid queries; 1024 for complex queries
3. **LlamaIndex study (2023):** 512 tokens second best overall (1024 best, but slower)

**Why 500 over 512 or 1000?**
- **Granularity:** 104.8 chunks/filing on the 2024 dataset vs ~56 at 1000 tokens on the multi-year sample → better for RAPTOR clustering
- **Performance:** Close to research-validated 512 tokens
- **Storage:** Reasonable at 15 GB for 26K filings
- **Context boost:** With the 100-token context window, the effective embedded chunk is ~600 tokens
- **RAPTOR advantage:** More chunks = richer hierarchical tree structure

**Chunk size comparison (1,375 multi-year sample filings):**
- 200 tokens: 381,933 chunks (too granular, 2.2 GB storage)
- 500 tokens: 153,207 chunks (optimal balance, 897 MB)
- 1000 tokens: 76,953 chunks (less granular, 450 MB)
- 8000 tokens: 10,214 chunks (too coarse, 59 MB)
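
Dividing the chunk counts above by the 1,375 sample filings gives the granularity per filing. A short calculation using only the figures listed in this section:

```python
SAMPLE_FILINGS = 1_375
chunk_counts = {200: 381_933, 500: 153_207, 1000: 76_953, 8000: 10_214}

# Average number of chunks produced per filing at each chunk size.
per_filing = {size: count / SAMPLE_FILINGS for size, count in chunk_counts.items()}
for size, avg in per_filing.items():
    print(f"{size:>5} tokens: {avg:6.1f} chunks/filing")
```

On this sample, 500-token chunking yields about 111 chunks per filing and 1000-token chunking about 56.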

**Citations:**
- Jimeno Yepes et al. (2024): https://arxiv.org/html/2402.05131v3
- NVIDIA (2024): https://developer.nvidia.com/blog/finding-the-best-chunking-strategy-for-accurate-ai-responses/
- LlamaIndex (2023): https://www.llamaindex.ai/blog/evaluating-the-ideal-chunk-size-for-a-rag-system-using-llamaindex-6207e5d3fec5

---

### Context Window: 100 Tokens
- **Source:** Anthropic research + RAPTOR paper
- **Recommendation:** 50-100 tokens
- **Our choice:** 100 tokens (50 before + 50 after)
- **Rationale:** Matches the upper end of Anthropic's 50-100 token context and the 100-token chunk length used in RAPTOR

**Citations:**
- Anthropic Contextual Retrieval (Sept 2024): https://www.anthropic.com/news/contextual-retrieval
- RAPTOR Paper (ICLR 2024): https://arxiv.org/abs/2401.18059

---

### 2024 Data Only (vs Multi-Year)
- **Decision:** 26K filings from 2024 instead of 1,375 samples across 1993-2024
- **Rationale:**
  - ~19x more filings (26,014 vs 1,375)
  - Temporal consistency
  - Better for clustering
  - Cleaner baseline for prototyping
- **Source:** Coworker suggestion, validated by research

---

## Additional Resources

### SEC Filing Guides
- **How to Read 10-K/10-Q:** https://www.sec.gov/resources-for-investors/investor-alerts-bulletins/how-read-10-k10-q
- **SEC.gov:** https://www.sec.gov/

---

### Related Research
- **LLM Analysis of 10-K/10-Q:** https://www.researchgate.net/publication/377746616_LLM_Analysis_of_10-K_and_10-Q_Filings_RAG_Results
- **RAG Evaluation Study (Oct 2024):** https://www.mdpi.com/2076-3417/14/20/9318
- **Financial Chatbot with RAG:** https://medium.com/@RobuRishabh/financial-analysis-chatbot-for-10-q-10-k-reports-using-retrieval-augmented-generation-rag-ef3938892086

---

## Summary Statistics (Our Project)

### Data Processing
- **Filings processed:** 26,014 (100% success rate)
- **Total chunks:** 2,725,171
- **Processing time:** 42.1 minutes
- **Output size:** 14,957 MB (~15 GB)

### Token Statistics
- **Total tokens:** 1,356,067,955 (1.36 billion)
- **Core tokens (stored):** 1,356,067,955
- **Extended tokens (embedded):** 1,625,920,116
- **Context overhead:** 19.9%
- **Avg tokens/filing:** 52,128
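
The derived figures above can be reproduced from the raw counts. A quick check using the totals reported in this summary (variable names are illustrative):

```python
filings = 26_014
chunks = 2_725_171
core_tokens = 1_356_067_955
extended_tokens = 1_625_920_116

# Extra tokens embedded (context) relative to the stored core tokens.
overhead_pct = (extended_tokens - core_tokens) / core_tokens * 100
avg_tokens_per_filing = core_tokens / filings
avg_core_per_chunk = core_tokens / chunks
avg_extended_per_chunk = extended_tokens / chunks

print(f"Context overhead:      {overhead_pct:.1f}%")
print(f"Avg tokens/filing:     {avg_tokens_per_filing:,.0f}")
print(f"Avg core tokens/chunk: {avg_core_per_chunk:.1f}")
print(f"Avg extended/chunk:    {avg_extended_per_chunk:.1f}")
```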

### Expected Outputs
- **Embeddings:** ~4.2 GB (2.7M × 384 dims × 4 bytes)
- **ChromaDB:** ~10-15 GB (with metadata)
- **Total storage:** ~30 GB for complete system
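
The embedding estimate is a straight multiplication of chunk count, vector dimensions, and bytes per value. A small check, assuming float32 storage as implied by the 4 bytes above:

```python
chunks = 2_725_171
dims = 384
bytes_per_float32 = 4

embedding_bytes = chunks * dims * bytes_per_float32
embedding_gb = embedding_bytes / 1e9       # decimal gigabytes
embedding_gib = embedding_bytes / 2 ** 30  # binary gibibytes

print(f"{embedding_bytes:,} bytes = {embedding_gb:.2f} GB = {embedding_gib:.2f} GiB")
```

The ~4.2 GB figure is in decimal gigabytes; tools that report binary units will show roughly 3.9 GiB for the same array.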
