# SEC Filing Evaluation Tests

## Overview
This notebook contains standardized test questions for evaluating different system configurations:
- **Baseline (No RAG)**: Base LLM without access to SEC filing data
- **Simple RAG**: Traditional vector similarity retrieval
- **RAPTOR RAG**: Hierarchical clustering with recursive summarization

## Evaluation Approach
Each question tests different capabilities:
1. Specific fact retrieval from single documents
2. Multi-document synthesis and comparison
3. Cross-company pattern recognition
4. Temporal frequency analysis
5. Multi-hop financial reasoning

**Last Updated:** 2025-10-16

---

## Test Questions

### Question 1: Specific Fact Retrieval

**Question:**
> "What was Apple Inc.'s total revenue for fiscal year 2024 as reported in their 10-K filing?"

**Purpose:** Tests whether the system can locate and extract precise numerical data from a specific document.

**Expected Behavior:**
- Baseline: Will either refuse to answer or guess based on outdated training data
- Simple RAG: Should retrieve the correct figure
- RAPTOR RAG: Should retrieve the correct figure with additional context

---

### Question 2: Multi-Document Synthesis

**Question:**
> "Compare Microsoft's revenue growth strategy between Q1 and Q4 2024 based on their 10-Q filings. What changed?"

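**Purpose:** Requires pulling information from two separate periodic filings and identifying strategic differences. Note that 10-Qs are filed only for the first three fiscal quarters; fourth-quarter results are reported in the annual 10-K, so the Q4 side of the comparison has to come from the 10-K.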

**Expected Behavior:**
- Baseline: Cannot access these documents
- Simple RAG: Can retrieve relevant sections but may return disconnected chunks
- RAPTOR RAG: Summary layers should connect related content across the two filings, which may make changes between periods easier to surface

---

### Question 3: Cross-Company Pattern Recognition

**Question:**
> "What are the top 3 cybersecurity risks mentioned across all financial services companies in their 2024 10-K filings?"

**Purpose:** Tests semantic search across multiple companies plus the ability to aggregate similar concepts.

**Expected Behavior:**
- Baseline: Will produce generic cybersecurity talking points
- Simple RAG: Can find relevant sections but won't naturally group similar risks
- RAPTOR RAG: Clustering should automatically group conceptually similar risks

---

### Question 4: Temporal Frequency Analysis

**Question:**
> "Did Tesla's 'Management Discussion and Analysis' section mention supply chain issues more or less frequently in Q4 2024 compared to Q1 2024?"

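**Purpose:** Checks whether the system can track topic frequency across time periods. The Q1 MD&A comes from a 10-Q, while the Q4 period is covered by the MD&A in the annual 10-K, so the two sections differ in scope and length.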

**Expected Behavior:**
- Baseline: No data access
- Simple RAG: Can retrieve both sections but won't automatically count mentions
- RAPTOR RAG: Multi-level summaries might capture frequency shifts
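
Because neither retrieval approach counts mentions by itself, a deterministic count over the two MD&A sections can serve as a reference answer for this question. The following is a minimal sketch, assuming the section text has already been extracted. The term list, the function names, and the per-1,000-word normalization are illustrative assumptions, not part of the original methodology.

```python
import re

# Hypothetical term list; adjust to whatever should count as a supply chain mention.
SUPPLY_CHAIN_TERMS = ("supply chain", "supplier", "logistics")


def mentions_per_1000_words(text: str, terms=SUPPLY_CHAIN_TERMS) -> float:
    """Count case-insensitive term matches, normalized by section length."""
    words = re.findall(r"\w+", text)
    if not words:
        return 0.0
    lowered = text.lower()
    # The leading \b anchors the start of each term, so "suppliers" also matches "supplier".
    hits = sum(len(re.findall(r"\b" + re.escape(term), lowered)) for term in terms)
    return 1000.0 * hits / len(words)


def compare_periods(earlier: str, later: str) -> str:
    """Return 'more', 'less', or 'same' for the later period relative to the earlier one."""
    before = mentions_per_1000_words(earlier)
    after = mentions_per_1000_words(later)
    if after > before:
        return "more"
    if after < before:
        return "less"
    return "same"
```

Normalizing by word count keeps a longer section from looking more concerned with a topic simply because it contains more text.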

---

### Question 5: Multi-Hop Financial Reasoning

**Question:**
> "Based on 2024 10-K filings, which tech companies reported both increased R&D spending AND declining profit margins? Explain the relationship."

**Purpose:** Tests whether the system can find companies that meet two financial criteria and then explain the connection between them.

**Expected Behavior:**
- Baseline: Will make up plausible-sounding examples
- Simple RAG: Must separately retrieve R&D and margin data for multiple companies
- RAPTOR RAG: Hierarchical structure should make multi-part queries more manageable
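
To make sure every configuration receives identical prompts, the five questions above can be kept in one structure and reused by each test run. This is a sketch only: the `EvalQuestion` class and the `QUESTIONS` name are hypothetical, and the prompt text is copied verbatim from this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalQuestion:
    qid: str
    capability: str
    prompt: str


QUESTIONS = [
    EvalQuestion(
        "Q1", "Specific fact retrieval",
        "What was Apple Inc.'s total revenue for fiscal year 2024 as reported "
        "in their 10-K filing?",
    ),
    EvalQuestion(
        "Q2", "Multi-document synthesis",
        "Compare Microsoft's revenue growth strategy between Q1 and Q4 2024 "
        "based on their 10-Q filings. What changed?",
    ),
    EvalQuestion(
        "Q3", "Cross-company pattern recognition",
        "What are the top 3 cybersecurity risks mentioned across all financial "
        "services companies in their 2024 10-K filings?",
    ),
    EvalQuestion(
        "Q4", "Temporal frequency analysis",
        "Did Tesla's 'Management Discussion and Analysis' section mention supply "
        "chain issues more or less frequently in Q4 2024 compared to Q1 2024?",
    ),
    EvalQuestion(
        "Q5", "Multi-hop financial reasoning",
        "Based on 2024 10-K filings, which tech companies reported both increased "
        "R&D spending AND declining profit margins? Explain the relationship.",
    ),
]
```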

---

---

# Baseline Test Results (No RAG)

Testing base models without any access to SEC filing data.

**Date:** 2025-10-16  
**Models Tested:** gpt-oss (13GB), llama3.1:8b (4.7GB)
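
The model tags above look like Ollama model names, so the sketch below assumes both models are served by a local Ollama instance at its default address (`http://localhost:11434`). That setup is an assumption; this document does not state how the models are run. The helper names are hypothetical. Each question is sent with no retrieved context, which is what the baseline configuration requires.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed local Ollama server
BASELINE_MODELS = ["gpt-oss", "llama3.1:8b"]


def build_payload(model: str, question: str) -> dict:
    """Build a non-streaming generate request that carries only the question."""
    return {"model": model, "prompt": question, "stream": False}


def ask_baseline(model: str, question: str, timeout: float = 300.0) -> str:
    """Send one question to the model and return its raw text response."""
    data = json.dumps(build_payload(model, question)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # With streaming disabled, the reply is a single JSON object
        # whose "response" field holds the generated text.
        return json.loads(response.read())["response"]
```

Calling `ask_baseline(model, question)` for each model and question yields the text to paste into the response blocks below.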

## GPT-OSS (13GB)

### Question 1: Apple's 2024 Total Revenue

**Response:**
```
[Paste GPT-OSS response here]
```

**Analysis:**
- Factual Accuracy: [correct / incorrect / partially correct / refused to answer]
- Hallucination: [yes / no]
- Notes:

---

### Question 2: Microsoft Q1 vs Q4 2024 Strategy

**Response:**
```
[Paste GPT-OSS response here]
```

**Analysis:**
- Factual Accuracy:
- Hallucination:
- Notes:

---

### Question 3: Financial Services Cybersecurity Risks

**Response:**
```
[Paste GPT-OSS response here]
```

**Analysis:**
- Factual Accuracy:
- Hallucination:
- Notes:

---

### Question 4: Tesla Supply Chain Mentions

**Response:**
```
[Paste GPT-OSS response here]
```

**Analysis:**
- Factual Accuracy:
- Hallucination:
- Notes:

---

### Question 5: Tech Companies R&D + Margin Decline

**Response:**
```
[Paste GPT-OSS response here]
```

**Analysis:**
- Factual Accuracy:
- Hallucination:
- Notes:

---

## Llama 3.1:8b (4.7GB)

### Question 1: Apple's 2024 Total Revenue

**Response:**
```
[Paste Llama 3.1 response here]
```

**Analysis:**
- Factual Accuracy: [correct / incorrect / partially correct / refused to answer]
- Hallucination: [yes / no]
- Notes:

---

### Question 2: Microsoft Q1 vs Q4 2024 Strategy

**Response:**
```
[Paste Llama 3.1 response here]
```

**Analysis:**
- Factual Accuracy:
- Hallucination:
- Notes:

---

### Question 3: Financial Services Cybersecurity Risks

**Response:**
```
[Paste Llama 3.1 response here]
```

**Analysis:**
- Factual Accuracy:
- Hallucination:
- Notes:

---

### Question 4: Tesla Supply Chain Mentions

**Response:**
```
[Paste Llama 3.1 response here]
```

**Analysis:**
- Factual Accuracy:
- Hallucination:
- Notes:

---

### Question 5: Tech Companies R&D + Margin Decline

**Response:**
```
[Paste Llama 3.1 response here]
```

**Analysis:**
- Factual Accuracy:
- Hallucination:
- Notes:

---

## Baseline Summary

### GPT-OSS Performance
- Overall Accuracy:
- Hallucination Rate:
- Best Performance On:
- Worst Performance On:

### Llama 3.1:8b Performance
- Overall Accuracy:
- Hallucination Rate:
- Best Performance On:
- Worst Performance On:

### Key Findings
- [Add observations here]
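
The accuracy and hallucination fields in this summary can be derived from the per-question analysis labels instead of being filled in by hand. The sketch below is one possible aggregation, assuming each result is recorded as a small dictionary. The `summarize` function, the record layout, and the choice to count only fully correct answers toward accuracy are assumptions.

```python
from collections import Counter

# Labels taken from the "Factual Accuracy" options in the analysis template.
ACCURACY_LABELS = {"correct", "incorrect", "partially correct", "refused to answer"}


def summarize(results: list[dict]) -> dict:
    """Aggregate per-question labels into summary figures.

    Each result looks like:
    {"question": "Q1", "accuracy": "correct", "hallucination": False}
    """
    if not results:
        raise ValueError("no results to summarize")
    for result in results:
        if result["accuracy"] not in ACCURACY_LABELS:
            raise ValueError(f"unknown accuracy label: {result['accuracy']!r}")
    counts = Counter(result["accuracy"] for result in results)
    total = len(results)
    return {
        # Assumption: partially correct answers and refusals earn no credit.
        "overall_accuracy": counts["correct"] / total,
        "hallucination_rate": sum(bool(r["hallucination"]) for r in results) / total,
        "label_counts": dict(counts),
    }
```

Keeping refusals as their own label matters here: a model that refuses is not accurate, but it has not hallucinated either, and the two rates should be read together.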

---

---

# Simple RAG Test Results

**Status:** Not yet started  
**Prerequisites:** Embeddings complete, ChromaDB setup  
**Date:** TBD

Traditional vector similarity retrieval using 2024 SEC filings.
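
As a rough illustration of what vector similarity retrieval does, the sketch below ranks text chunks by cosine similarity to the query and returns the top `k`. It is deliberately self-contained: the bag-of-words `embed` function is a toy stand-in for a learned embedding model, and in the planned setup ChromaDB would store the vectors and run the query. All function names here are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real pipeline would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 3) -> list[tuple[float, str]]:
    """Return the k chunks most similar to the query, best first."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(chunk)), chunk) for chunk in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
```

Each chunk is scored independently of the others, which is why this configuration is expected to return relevant but disconnected passages on the multi-document questions.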

## GPT-OSS + Simple RAG

[Will add results after ChromaDB setup]

---

## Llama 3.1:8b + Simple RAG

[Will add results after ChromaDB setup]

---

---

# RAPTOR RAG Test Results

**Status:** Not yet started  
**Prerequisites:** RAPTOR clustering complete, hierarchical summaries generated  
**Date:** TBD

Hierarchical clustering with recursive summarization using 2024 SEC filings.
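
The recursive structure can be sketched as a loop that groups the current level, summarizes each group, and repeats on the summaries until one root remains. This is only a skeleton under stated assumptions: RAPTOR clusters chunks by embedding similarity and uses an LLM to write the summaries, whereas the placeholder `group_chunks` below simply groups adjacent items and `truncate_summary` just truncates text. All names are hypothetical.

```python
from typing import Callable


def group_chunks(texts: list[str], size: int) -> list[list[str]]:
    """Placeholder clustering: fixed-size groups of adjacent items."""
    return [texts[i:i + size] for i in range(0, len(texts), size)]


def truncate_summary(texts: list[str], max_words: int = 12) -> str:
    """Placeholder summarizer: keep the first words of the joined texts."""
    return " ".join(" ".join(texts).split()[:max_words])


def build_tree(
    chunks: list[str],
    group_size: int = 2,
    summarize: Callable[[list[str]], str] = truncate_summary,
) -> list[list[str]]:
    """Return the tree as levels: level 0 holds the raw chunks, the last level one root."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2 so each level shrinks")
    if not chunks:
        return []
    levels = [list(chunks)]
    while len(levels[-1]) > 1:
        groups = group_chunks(levels[-1], group_size)
        levels.append([summarize(group) for group in groups])
    return levels
```

Swapping in a real clustering step and an LLM-backed `summarize` callable changes the quality of each level but not the shape of the loop. Retrieval can then search the summaries alongside the raw chunks.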

## GPT-OSS + RAPTOR RAG

[Will add results after RAPTOR implementation]

---

## Llama 3.1:8b + RAPTOR RAG

[Will add results after RAPTOR implementation]

---

---

# Comparative Analysis

## Cross-Configuration Comparison

| Question | GPT-OSS Baseline | GPT-OSS Simple RAG | GPT-OSS RAPTOR | Llama Baseline | Llama Simple RAG | Llama RAPTOR |
|----------|------------------|--------------------|--------------------|----------------|------------------|------------------|
| Q1: Apple Revenue | TBD | TBD | TBD | TBD | TBD | TBD |
| Q2: Microsoft Strategy | TBD | TBD | TBD | TBD | TBD | TBD |
| Q3: Cyber Risks | TBD | TBD | TBD | TBD | TBD | TBD |
| Q4: Tesla Supply Chain | TBD | TBD | TBD | TBD | TBD | TBD |
| Q5: R&D + Margins | TBD | TBD | TBD | TBD | TBD | TBD |

## RAGAS Metrics (When Available)

| Model + Configuration | Context Precision | Context Recall | Faithfulness | Answer Relevancy |
|----------------------|-------------------|----------------|--------------|------------------|
| GPT-OSS Baseline | N/A | N/A | N/A | TBD |
| GPT-OSS Simple RAG | TBD | TBD | TBD | TBD |
| GPT-OSS RAPTOR | TBD | TBD | TBD | TBD |
| Llama 3.1 Baseline | N/A | N/A | N/A | TBD |
| Llama 3.1 Simple RAG | TBD | TBD | TBD | TBD |
| Llama 3.1 RAPTOR | TBD | TBD | TBD | TBD |

## Key Findings

[Will populate after all tests complete]

---

---

# Notes

## Testing Procedure

1. Run baseline tests first (no RAG) to establish a performance floor
2. Complete embedding generation (in progress)
3. Set up ChromaDB with 2024 filing embeddings
4. Run Simple RAG tests
5. Implement RAPTOR clustering and summarization
6. Run RAPTOR RAG tests
7. Apply RAGAS evaluation framework
8. Conduct human evaluation (2 analysts)
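
For step 8, with two analysts labeling the same responses, it can help to check how far their labels agree beyond chance before averaging them. Cohen's kappa is one common statistic for this; using it here is a suggestion, not part of the stated procedure, and the function name is hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters assigning one categorical label per item."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if each rater labeled at random with their own label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1.0 - expected)
```

A value of 1 means perfect agreement and 0 means agreement no better than chance. With only five questions per configuration the estimate is noisy, so it is best read across all model and configuration pairs together.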

## Adding New Models

To add a new model for testing:
1. Add a new section under each test configuration (Baseline, Simple RAG, RAPTOR RAG)
2. Use the same question structure
3. Add the model to the comparative analysis tables

## References

- RAGAS Framework: https://github.com/explodinggradients/ragas
- RAPTOR Paper: https://arxiv.org/abs/2401.18059
- Test questions were developed from the evaluation methodology in `edgar_ai_project.txt`

---