# SEC EDGAR Data Collection - Technical Summary

**Date:** October 8, 2025  
**Updated:** Scope expanded to full SRAF dataset (2001-2024)  
**Objective:** Document data collection approach and decisions for fraud detection project

## 1. Project Overview

### Goal
Develop an AI-powered fraud detection system using RAG (Retrieval-Augmented Generation) and optional classification models to analyze SEC EDGAR filings:
- **10-K/10-Q Analysis:** Year-over-year changes in risk disclosures, financial statements, MD&A
- **Focus:** All filing sections (not just Item 1A), so fraud signals are captured wherever they appear
- **Optional 8-K Analysis:** Event-driven anomalies for supplementary validation
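
As a rough illustration of what "year-over-year change" could mean in practice, the sketch below scores how much the wording of a section changed between two filings using Python's standard `difflib`. The function name `yoy_change`, the whitespace tokenization, and the two sample sentences are assumptions made up for illustration; the project may well use embeddings or another measure instead.

```python
import difflib


def yoy_change(prev_text: str, curr_text: str) -> float:
    """Return a change score in [0, 1]: 0.0 is identical wording, 1.0 is no overlap."""
    # Compare word sequences rather than characters so reflowed whitespace is ignored.
    prev_tokens = prev_text.lower().split()
    curr_tokens = curr_text.lower().split()
    matcher = difflib.SequenceMatcher(a=prev_tokens, b=curr_tokens, autojunk=False)
    return 1.0 - matcher.ratio()


# Made-up sentences standing in for the same section in two consecutive filings.
previous_year = "We face intense competition in our primary markets."
current_year = "We face intense competition and are subject to a regulatory inquiry."
print(f"Change score: {yoy_change(previous_year, current_year):.2f}")
```

A score like this only flags that the text moved; deciding whether the change is suspicious would still need the retrieval and review steps described later.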

### Known Fraud Cases (Validation Set)
1. **Under Armour (UAA)** - Accounting fraud 2015-2017 (revenue manipulation)
2. **Luckin Coffee (LK)** - Accounting fraud 2019-2020 (fabricated sales, delisted)
3. **Nikola Corporation (NKLA)** - Securities fraud 2019-2020 (false product claims)

### Final Dataset Scope (Updated October 8, 2025)
- **Companies:** ALL companies in SRAF dataset (thousands)
- **Time Period:** 2001-2024 (24 years)
- **Primary Source:** Notre Dame SRAF 10-X Cleaned Files
- **Supplementary Source:** Self-scraped 8-K filings (3,964 filings, 2015-2024)

---
## 2. Initial Data Collection Approach (Scrapers)

### 2.1 Technology Stack
```python
# Core libraries
import sec_edgar_downloader  # SEC API wrapper with rate limiting
import json
from pathlib import Path
from dotenv import load_dotenv
```

### 2.2 Custom Scrapers Built (Proof of Effort)

**Note:** These scrapers were built to understand the data structure and validate that we could access SEC filings programmatically. After evaluating external datasets, we switched to SRAF as the primary source (see Sections 4 and 5).

#### **10-K Scraper** (`src/data/edgar_10k_scraper.py`)
- **Purpose:** Download annual 10-K reports for 23 selected companies
- **Date Range:** 2015-2024 (captures Under Armour fraud period 2015-2017)
- **Rate Limiting:** Built-in via `sec-edgar-downloader` (10 req/sec)
- **Output:** `data/raw/sec-edgar-filings/{CIK}/10-K/{ACCESSION}/`
  - `full-submission.txt` - Complete filing
  - `primary-document.html` - Main 10-K document

**Results:**
- **230 filings** across 23 companies
- **Fraud case coverage:**
  - Under Armour (UAA): 11 filings
  - Nikola (NKLA): 6 filings
  - Luckin Coffee (LK): 0 filings (as a foreign private issuer, Luckin filed Forms 20-F and 6-K rather than 10-K; since delisted)
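
The download step described above might look roughly like the sketch below. It assumes `sec-edgar-downloader` 5.x, where `Downloader` takes a requester name, an email address, and a download folder; the helper names `year_window` and `download_10k` are hypothetical and are not taken from `edgar_10k_scraper.py`.

```python
from pathlib import Path


def year_window(first_year: int, last_year: int) -> tuple[str, str]:
    """Convert an inclusive year range into the ISO dates passed as after/before."""
    return f"{first_year}-01-01", f"{last_year}-12-31"


def download_10k(ticker: str, company_name: str, email: str,
                 root: Path = Path("data/raw")) -> int:
    """Download 10-K filings for one ticker and return the number fetched."""
    # Imported lazily so year_window() stays usable without the package installed.
    from sec_edgar_downloader import Downloader

    after, before = year_window(2015, 2024)
    # Assumption: 5.x constructor (name and email form the SEC User-Agent header).
    dl = Downloader(company_name, email, root)
    # download_details=True also saves the primary document next to full-submission.txt.
    return dl.get("10-K", ticker, after=after, before=before, download_details=True)
```

Calling `download_10k("UAA", "Example Org", "contact@example.com")` would contact EDGAR, so it is left out of the block; the name and email shown are placeholders.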

#### **8-K Scraper** (`src/data/edgar_8k_scraper.py`)
- **Purpose:** Download event-driven 8-K filings for fraud indicators
- **Target Items:**
  - **Item 4.02:** Non-reliance on financials (restatements)
  - **Item 5.02:** Departure of directors/officers
  - **Item 8.01:** Other events (investigations, lawsuits)
  - **Item 2.01:** Acquisition/disposition
- **Date Range:** 2015-2024
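
One way to check which of the target items a downloaded 8-K reports is to search its text for item headings. The sketch below is an assumption about how such a filter could work, not a description of `edgar_8k_scraper.py`; `CRITICAL_ITEMS`, `ITEM_PATTERN`, and `critical_items_in` are illustrative names.

```python
import re

# Item numbers the project treats as potential fraud indicators (from the list above).
CRITICAL_ITEMS = {"4.02", "5.02", "8.01", "2.01"}

# Matches headings such as "Item 5.02" or "ITEM 4.02." anywhere in the text.
ITEM_PATTERN = re.compile(r"\bItem\s+(\d\.\d{2})\b", re.IGNORECASE)


def critical_items_in(filing_text: str) -> set[str]:
    """Return the critical 8-K item numbers mentioned in the filing text."""
    found = set(ITEM_PATTERN.findall(filing_text))
    return found & CRITICAL_ITEMS


# Made-up excerpt in the style of an 8-K table of items.
sample = "Item 5.02 Departure of Directors or Certain Officers\nItem 9.01 Exhibits"
print(sorted(critical_items_in(sample)))
```

A text match like this can produce false positives when an item number is only cited in passing, so it is best treated as a first-pass filter.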

**Results:**
- **3,964 filings** across 23 companies
- **Fraud case coverage:**
  - Under Armour (UAA): 141 filings
  - Nikola (NKLA): 118 filings
- **Kept as supplementary data source**
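
The filing counts reported for both scrapers can be cross-checked against what is on disk. The sketch below walks the `{CIK}/{FORM}/{ACCESSION}/` layout under `data/raw/sec-edgar-filings/` and tallies accession folders; `count_filings` is a hypothetical helper and assumes every downloaded filing contains a `full-submission.txt`.

```python
from collections import Counter
from pathlib import Path


def count_filings(root: Path) -> Counter:
    """Count accession folders per (company, form) under sec-edgar-filings/."""
    counts: Counter = Counter()
    # Layout: {root}/{CIK}/{FORM}/{ACCESSION}/full-submission.txt
    for submission in root.glob("*/*/*/full-submission.txt"):
        accession_dir = submission.parent
        form = accession_dir.parent.name
        company = accession_dir.parent.parent.name
        counts[(company, form)] += 1
    return counts


root = Path("data/raw/sec-edgar-filings")
if root.exists():
    for (company, form), n in sorted(count_filings(root).items()):
        print(f"{company} {form}: {n}")
```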

### 2.3 Why We Built These Scrapers

1. **Validate SEC API access** - Confirmed we could programmatically download filings
2. **Understand data structure** - Learned about HTML/XML formatting variations
3. **Fraud case validation** - Confirmed UAA and NKLA data availability
4. **Backup data source** - 8-K filings provide supplementary event signals

### 2.4 File Structure

```
data/raw/sec-edgar-filings/
├── {CIK}/                          # 10-digit company identifier
│   ├── 10-K/
│   │   └── {ACCESSION_NUMBER}/     # e.g., 0000012927-24-000010
│   │       ├── full-submission.txt # Complete HTML/XML filing
│   │       └── primary-document.html
│   └── 8-K/
│       └── {ACCESSION_NUMBER}/
│           ├── full-submission.txt
│           └── primary-document.html
```

**Storage:** ~2-3 GB total (uncompressed)

### 2.5 Company Configuration

Companies selected for sector diversity and fraud case representation:

```python
# Load company list
import json
from collections import Counter

with open('../config/companies.json', 'r') as f:
    config = json.load(f)

print(f"Total companies: {len(config['companies'])}\n")

# Show fraud cases
fraud_cases = [c for c in config['companies'] if c.get('fraud_case', False)]
print("Fraud Cases:")
for company in fraud_cases:
    print(f"  - {company['ticker']}: {company['name']} ({company['sector']})")

# Sector breakdown
sectors = Counter(c['sector'] for c in config['companies'])
print("\nSector Distribution:")
for sector, count in sectors.most_common():
    print(f"  {sector}: {count}")
```

Output:

```
Total companies: 23

Fraud Cases:
  - LK: Luckin Coffee Inc. (Restaurants/Beverages)
  - NKLA: Nikola Corporation (Automotive/Electric Vehicles)
  - UAA: Under Armour Inc. (Consumer Goods/Apparel)

Sector Distribution:
  Technology: 3
  Financial Services: 3
  Healthcare/Pharma: 2
  Retail: 2
  Energy: 2
  Consumer Goods: 2
  E-commerce/Tech: 1
  Automotive/Tech: 1
  Healthcare: 1
  Aerospace/Defense: 1
  Entertainment/Media: 1
  Beverages: 1
  Restaurants/Beverages: 1
  Automotive/Electric Vehicles: 1
  Consumer Goods/Apparel: 1
```


---
## 3. Download Statistics

### 3.1 Load Download Summaries

```python
# Load 10-K summary
import pandas as pd

with open('../data/raw/download_summary.json', 'r') as f:
    summary_10k = json.load(f)

print("=== 10-K Download Summary ===")
print(f"Companies: {summary_10k['companies']}")
print(f"Target Years: {summary_10k['target_years']}")
print(f"Total Filings: {summary_10k['total_filings']}")
print("\nTop 5 by filing count:")
df_10k = pd.DataFrame(list(summary_10k['by_company'].items()), columns=['Ticker', 'Count'])
print(df_10k.sort_values('Count', ascending=False).head())
```

Output:

```
=== 10-K Download Summary ===
Companies: 23
Target Years: [2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023, 2024]
Total Filings: 230

Top 5 by filing count:
  Ticker  Count
1   MSFT     11
4   TSLA     11
3   AMZN     11
5    JPM     11
6    BAC     11
```


```python
# Load 8-K summary
with open('../data/raw/download_summary_8k.json', 'r') as f:
    summary_8k = json.load(f)

print("=== 8-K Download Summary ===")
print(f"Total Filings: {summary_8k['total_filings']}")
print(f"Critical Items: {summary_8k['critical_items']}")
print("\nTop 5 by filing count:")
df_8k = pd.DataFrame(list(summary_8k['by_company'].items()), columns=['Ticker', 'Count'])
print(df_8k.sort_values('Count', ascending=False).head())
```

Output:

```
=== 8-K Download Summary ===
Total Filings: 3967
Critical Items: ['4.02', '5.02', '8.01', '2.01']

Top 5 by filing count:
   Ticker  Count
7     WFC    870
5     JPM    344
19     PG    229
4    TSLA    184
10    UNH    171
```


### 3.2 Fraud Case Coverage Analysis

```python
# Compare 10-K and 8-K coverage for the three known fraud cases
fraud_tickers = ['UAA', 'NKLA', 'LK']
fraud_periods = {
    'UAA': '2015-2017 (accounting fraud - revenue manipulation)',
    'NKLA': '2019-2020 (securities fraud - false product claims)',
    'LK': '2019-2020 (accounting fraud - fabricated sales, delisted)'
}

print("Fraud Case Filing Coverage:\n")
for ticker in fraud_tickers:
    count_10k = summary_10k['by_company'].get(ticker, 0)
    count_8k = summary_8k['by_company'].get(ticker, 0)
    print(f"{ticker} - {fraud_periods[ticker]}")
    print(f"  10-K: {count_10k} filings")
    print(f"  8-K: {count_8k} filings")
    print()
```

Output:

```
Fraud Case Filing Coverage:

UAA - 2015-2017 (accounting fraud - revenue manipulation)
  10-K: 11 filings
  8-K: 141 filings

NKLA - 2019-2020 (securities fraud - false product claims)
  10-K: 6 filings
  8-K: 118 filings

LK - 2019-2020 (accounting fraud - fabricated sales, delisted)
  10-K: 0 filings
  8-K: 0 filings
```



---
## 4. Decision to Switch to SRAF Dataset

### 4.1 Problem with Scraper Approach

**Limitations:**
- **Limited scope:** Only 23 companies, too few to train or validate a robust model
- **Time period:** Only 2015-2024, so earlier historical context is missing
- **Manual selection:** Hand-picked companies introduce potential selection bias
- **Rate limiting:** Downloads are capped at 10 req/sec, which is slow at scale
- **Processing burden:** Raw HTML/XML filings must be parsed in-house

### 4.2 External Dataset Evaluation

Evaluated 4 datasets (see `external_datasets_evaluation.xlsx`):

1. **❌ Kaggle - SEC EDGAR Company Facts (Lang)** - 13.86 GB of XBRL numbers only
2. **❌ Kaggle - Parsed 10-Q Filings (Aenlle)** - Quarterly metrics only
3. **❌ Kaggle - Company Facts v2 (Vanak)** - 75.4M rows of numerical facts
4. **✅ Notre Dame SRAF 10-X Files** - **FULL TEXT** of 10-K/10-Q (1993-2024)

**Key Finding:** All three Kaggle datasets contain only structured XBRL numbers, with no narrative text. SRAF is the only evaluated source that provides the full filing text needed for NLP-based fraud detection.

### 4.3 SRAF Dataset Verification

**Test:** Extracted sample 10-K from 2016-2020 zip file

```bash
unzip -p "10-X_C_2016-2020.zip" "2016/QTR1/20160104_10-K_*.txt" | head -100
```

**Result:** Found actual Item 1A Risk Factors text, MD&A sections, full narrative content ✅
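
The same spot check can be done from Python without extracting the archive. The sketch below mirrors the `unzip -p ... | head -100` command using the standard `zipfile` module; `peek_member` is a hypothetical helper, and UTF-8 with replacement characters is an assumed encoding for the SRAF text files.

```python
import fnmatch
import zipfile
from itertools import islice


def peek_member(zip_path, pattern, max_lines=100):
    """Return the first max_lines lines of the first archive member matching pattern."""
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if fnmatch.fnmatchcase(name, pattern):
                with archive.open(name) as member:
                    # Decode leniently: the file encoding is an assumption here.
                    return [raw.decode("utf-8", errors="replace").rstrip("\r\n")
                            for raw in islice(member, max_lines)]
    return []
```

For example, `peek_member("data/external/10-X_C_2016-2020.zip", "2016/QTR1/20160104_10-K_*.txt")` would return the first 100 lines of the first matching filing, or an empty list if nothing matches.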

**Decision:** Use SRAF as primary source, keep scraped 8-K data as supplementary

---
## 5. SRAF Dataset - Final Data Source

### 5.1 Dataset Overview

**Source:** Notre Dame Software Repository for Accounting and Finance (SRAF)  
**URL:** https://sraf.nd.edu/sec-edgar-data/  
**Content:** Full text of 10-K and 10-Q filings (cleaned format)  
**Coverage:** 1993-2024 (32 years)

### 5.2 Downloaded Files

All files stored in `data/external/`:

| File | Period | Status |
|------|--------|--------|
| 10-X_C_2001-2005.zip | 2001-2005 | ✅ Downloaded |
| 10-X_C_2006-2010.zip | 2006-2010 | ✅ Downloaded |
| 10-X_C_2011-2015.zip | 2011-2015 | ✅ Downloaded |
| 10-X_C_2016-2020.zip | 2016-2020 | ✅ Downloaded |
| 10-X_C_2021.zip | 2021 | ✅ Downloaded |
| 10-X_C_2022.zip | 2022 | ✅ Downloaded |
| 10-X_C_2023.zip | 2023 | ✅ Downloaded |
| 10-X_C_2024.zip | 2024 | ✅ Downloaded |

**Total:** 8 files, ~42 GB compressed

### 5.3 File Structure

```
data/external/10-X_C_2016-2020.zip
    └── 2016/
        └── QTR1/
            └── 20160104_10-K_edgar_data_12345_0001234567-16-000123.txt
```

**Naming convention:** `YYYYMMDD_FORM_edgar_data_CIK_ACCESSION.txt`
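
Because the filename carries the filing date, form type, CIK, and accession number, that metadata can be recovered without opening the file. The sketch below parses the convention shown above; the `SrafFilename` type and `parse_sraf_filename` function are illustrative names, and the pattern assumes the accession number contains only digits and hyphens.

```python
import re
from datetime import date, datetime
from typing import NamedTuple, Optional


class SrafFilename(NamedTuple):
    filing_date: date
    form: str
    cik: int
    accession: str


# YYYYMMDD_FORM_edgar_data_CIK_ACCESSION.txt
_PATTERN = re.compile(
    r"^(?P<date>\d{8})_(?P<form>.+?)_edgar_data_(?P<cik>\d+)_(?P<accession>[\d-]+)\.txt$"
)


def parse_sraf_filename(name: str) -> Optional[SrafFilename]:
    """Parse an SRAF member name; return None if it does not follow the convention."""
    base = name.rsplit("/", 1)[-1]  # drop the YEAR/QTRn/ prefix if present
    match = _PATTERN.match(base)
    if match is None:
        return None
    return SrafFilename(
        filing_date=datetime.strptime(match["date"], "%Y%m%d").date(),
        form=match["form"],
        cik=int(match["cik"]),
        accession=match["accession"],
    )


print(parse_sraf_filename("2016/QTR1/20160104_10-K_edgar_data_12345_0001234567-16-000123.txt"))
```

Parsing names first makes it cheap to select, say, only 10-K filings for a given CIK before reading any text.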

---
## 6. Current Status

### ✅ Completed
- [x] Built 10-K scraper (230 filings, 23 companies)
- [x] Built 8-K scraper (3,964 filings, 23 companies)
- [x] Evaluated external datasets (Kaggle + SRAF)
- [x] Downloaded SRAF 10-X files (2001-2024, ~42 GB compressed)
- [x] Verified SRAF contains full text (not just numbers)

### 📋 Next Steps (See `01_project_plan.ipynb` for details)
1. **Text Extraction:** Extract full text from SRAF zip files
2. **Embeddings:** Create vector representations of filings
3. **Vector Database:** Store embeddings (ChromaDB or FAISS)
4. **RAG Pipeline:** Build retrieval system with FinGPT/Ollama
5. **Interface:** Set up Open WebUI for team interaction
6. **Deployment:** Deploy on AWS EC2 with GPU
7. **Optional:** Train fraud classifier for automated detection

---
## 7. Technical Notes

### SEC API Compliance (Scrapers)
- **User-Agent Required:** Set in `.env` as `SEC_USER_AGENT`
- **Rate Limit:** 10 requests/second (enforced by `sec-edgar-downloader`)
- **Retry Logic:** Built-in for 503 errors
- **Errors Encountered:** 3 temporary 503 errors during 8-K download
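
For readers unfamiliar with what the rate limit and retry behaviour amount to, the sketch below shows the general idea: space requests at least 0.1 s apart and retry when the server answers 503. This is an illustration only and not the internal implementation of `sec-edgar-downloader`; `Throttle`, `fetch_with_retry`, and the linear backoff are assumptions.

```python
import time


class Throttle:
    """Space out calls so that at most max_per_second are started each second."""

    def __init__(self, max_per_second=10, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0

    def wait(self):
        """Block until the next call is allowed, then reserve the following slot."""
        now = self._clock()
        if now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._clock()
        self._next_allowed = now + self.min_interval


def fetch_with_retry(fetch, throttle, max_attempts=3, backoff=1.0, sleep=time.sleep):
    """Call fetch() until it returns a status other than 503 or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        throttle.wait()
        status, body = fetch()
        if status != 503:
            return status, body
        if attempt < max_attempts:
            sleep(backoff * attempt)  # linear backoff before the next attempt
    return status, body
```

Here `fetch` is any zero-argument callable returning a `(status, body)` pair, which keeps the sketch independent of a particular HTTP client.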

### SRAF Data Quality
- **Format:** Pre-cleaned .txt files (HTML/XML already parsed)
- **Consistency:** Standardized structure across years
- **Coverage:** Comprehensive (all public companies filing 10-K/10-Q)
- **Advantages over raw SEC data:**
  - No HTML parsing needed
  - Consistent formatting
  - Bulk download (vs. API rate limits)
  - Historical data readily available

### Storage Considerations
- **Git exclusions:** `data/raw/`, `data/processed/`, `data/external/` in `.gitignore`
- **Current storage:** ~42 GB compressed, ~100+ GB uncompressed (estimated)
- **Processing approach:** Stream from zip files (avoid uncompressing all at once)
- **Embeddings storage:** ~1-2 GB for vector database (manageable)
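
The uncompressed size estimate can be checked without extracting anything, because each zip archive records both sizes in its central directory. The sketch below sums them per archive with the standard `zipfile` module; `archive_sizes` is a hypothetical helper, and the loop assumes the archives sit directly in `data/external/`.

```python
import zipfile
from pathlib import Path


def archive_sizes(zip_path):
    """Return (compressed_bytes, uncompressed_bytes) summed over all members."""
    compressed = uncompressed = 0
    with zipfile.ZipFile(zip_path) as archive:
        # infolist() reads only the central directory; no member is decompressed.
        for info in archive.infolist():
            compressed += info.compress_size
            uncompressed += info.file_size
    return compressed, uncompressed


# Prints nothing if the directory or the archives are absent.
for path in sorted(Path("data/external").glob("10-X_C_*.zip")):
    packed, unpacked = archive_sizes(path)
    print(f"{path.name}: {packed / 1e9:.1f} GB compressed, {unpacked / 1e9:.1f} GB uncompressed")
```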

---
## 8. References

### Fraud Cases
- **Under Armour:** SEC charges (2021) - https://www.sec.gov/newsroom/press-releases/2021-78
- **Luckin Coffee:** SEC settlement (2020) - https://www.sec.gov/newsroom/press-releases/2020-319
- **Nikola:** SEC fraud charges against founder Trevor Milton (2021) - https://www.sec.gov/newsroom/press-releases/2021-141

### Technical Resources
- SEC EDGAR API: https://www.sec.gov/edgar/sec-api-documentation
- `sec-edgar-downloader`: https://github.com/jadchaar/sec-edgar-downloader
- SRAF Datasets: https://sraf.nd.edu/sec-edgar-data/