# SEC EDGAR Data Collection - Project Summary

**Date:** October 6, 2025  
**Objective:** Build a fraud detection system using SEC EDGAR filings to identify anomalies in 10-K Risk Factors and 8-K event disclosures

---

## 1. Project Overview

### Goal
Develop an AI-powered fraud detection model that analyzes:
- **10-K Item 1A (Risk Factors):** Year-over-year changes indicating emerging risks or boilerplate language
- **8-K Event Filings:** Critical fraud indicators in Items 4.02, 5.02, 8.01, 2.01

### Known Fraud Cases (Baseline)
1. **Under Armour (UAA)** - Accounting fraud 2015-2017 (revenue manipulation)
2. **Luckin Coffee (LK)** - Accounting fraud 2019-2020 (fabricated sales)
3. **Nikola Corporation (NKLA)** - Securities fraud 2019-2020 (false product claims)

### Dataset Scope
- **Companies:** 23 (20 clean baseline + 3 fraud cases)
- **Time Period:** 2015-2024 (10 years)
- **Filing Types:** 10-K (annual) + 8-K (event-driven)
- **Total Filings:** 4,194 (230 × 10-K, 3,964 × 8-K)

---
## 2. Data Collection Implementation

### 2.1 Technology Stack
```python
# Core libraries
import sec_edgar_downloader  # SEC API wrapper with rate limiting
import json
from pathlib import Path
from dotenv import load_dotenv
```

### 2.2 Scrapers Built

#### **10-K Scraper** (`src/data/edgar_10k_scraper.py`)
- **Purpose:** Download annual 10-K reports for Item 1A risk factor analysis
- **Date Range:** 2015-2024 (captures Under Armour fraud period 2015-2017)
- **Rate Limiting:** Built-in via `sec-edgar-downloader` (10 req/sec)
- **Output:** `data/raw/sec-edgar-filings/{CIK}/10-K/{ACCESSION}/`
  - `full-submission.txt` - Complete filing
  - `primary-document.html` - Main 10-K document
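
A minimal sketch of how the download loop could be driven. It assumes `sec-edgar-downloader` v5, where `Downloader(company_name, email_address, download_folder)` exposes `get(form, ticker_or_cik, after=..., before=..., download_details=...)` and returns the number of filings saved. The names `FilingDownloader` and `download_10k` are illustrative and are not taken from `edgar_10k_scraper.py`. The downloader is passed in as an argument so the loop can be exercised without network access.

```python
from typing import Dict, Iterable, Protocol


class FilingDownloader(Protocol):
    """Structural type for the one method this sketch needs (assumed v5 API)."""

    def get(self, form: str, ticker_or_cik: str, **kwargs) -> int: ...


def download_10k(
    dl: FilingDownloader,
    tickers: Iterable[str],
    after: str = "2015-01-01",
    before: str = "2024-12-31",
) -> Dict[str, int]:
    """Download 10-K filings per ticker and return {ticker: filings_downloaded}."""
    counts: Dict[str, int] = {}
    for ticker in tickers:
        try:
            # download_details=True asks for the primary document as well
            counts[ticker] = dl.get(
                "10-K", ticker, after=after, before=before, download_details=True
            )
        except Exception as exc:  # e.g. unknown ticker or a transient HTTP error
            print(f"{ticker}: download failed ({exc})")
            counts[ticker] = 0
    return counts


# Real usage (performs network requests; the identity values are placeholders):
# from sec_edgar_downloader import Downloader
# dl = Downloader("MyCompany", "me@example.com", "data/raw")
# summary = download_10k(dl, ["UAA", "NKLA"])
```

Recording a zero count on failure, rather than aborting, keeps one bad ticker from stopping the whole batch and makes gaps visible in the summary.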

**Results:**
- **230 filings** across 23 companies
- **Per-company breakdown:**
  - Most companies: 11 filings (2015-2024, some filed multiple per year)
  - Apple (AAPL): 10 filings
  - Alphabet (GOOGL): 10 filings
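  - Disney (DIS): 6 filings (likely a CIK issue: the current registrant dates from the 2019 reorganization, so earlier 10-Ks sit under the predecessor entity's CIK)
  - Nikola (NKLA): 6 filings (listed via SPAC merger in 2020)
  - **Luckin Coffee (LK): 0 filings** (foreign private issuer: it reports on Forms 20-F and 6-K, not 10-K and 8-K)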

#### **8-K Scraper** (`src/data/edgar_8k_scraper.py`)
- **Purpose:** Download event-driven 8-K filings for fraud indicators
- **Target Items:**
  - **Item 4.02:** Non-reliance on financials (restatements) - **PRIMARY FRAUD INDICATOR**
  - **Item 5.02:** Departure of directors/officers - governance red flag
  - **Item 8.01:** Other events (investigations, lawsuits) - regulatory issues
  - **Item 2.01:** Acquisition/disposition - M&A anomalies
- **Date Range:** 2015-2024
- **Output:** `data/raw/sec-edgar-filings/{CIK}/8-K/{ACCESSION}/`

**Results:**
- **3,964 filings** across 23 companies
- **Fraud case coverage:**
  - Under Armour (UAA): 141 filings
  - Nikola (NKLA): 118 filings
  - Luckin Coffee (LK): 0 filings (foreign private issuer; it files 6-K rather than 8-K, and was delisted in 2020)
- **Highest volume:** Wells Fargo (869 filings - likely due to 2016 scandal)
- **Download errors:** 3 temporary 503 errors (server unavailable), filings skipped

### 2.3 File Structure

```
data/raw/sec-edgar-filings/
├── {CIK}/                          # 10-digit company identifier
│   ├── 10-K/
│   │   └── {ACCESSION_NUMBER}/     # e.g., 0000012927-24-000010
│   │       ├── full-submission.txt # Complete HTML/XML filing
│   │       └── primary-document.html
│   └── 8-K/
│       └── {ACCESSION_NUMBER}/
│           ├── full-submission.txt
│           └── primary-document.html
```

**Storage:** ~2-3 GB total (uncompressed)
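
As a quick sanity check on this layout, the sketch below walks the tree and counts accession-number directories per company and form type. The function name `count_filings` is hypothetical; it assumes only the directory structure shown above.

```python
from collections import Counter
from pathlib import Path
from typing import Dict


def count_filings(root: Path) -> Dict[str, Counter]:
    """Return {company_dir: Counter({form_type: n_accession_dirs})}."""
    counts: Dict[str, Counter] = {}
    if not root.is_dir():
        return counts
    for company_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        per_form: Counter = Counter()
        for form_dir in (p for p in company_dir.iterdir() if p.is_dir()):
            # Each accession-number directory corresponds to one filing
            per_form[form_dir.name] = sum(1 for a in form_dir.iterdir() if a.is_dir())
        counts[company_dir.name] = per_form
    return counts


# Example: count_filings(Path("data/raw/sec-edgar-filings"))
```

Comparing these counts with the download summaries is a cheap way to spot skipped filings.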

### 2.4 Company Configuration

Companies selected for sector diversity and fraud case representation:

```python
# Load company list
import json
from collections import Counter

with open('../config/companies.json', 'r') as f:
    config = json.load(f)

print(f"Total companies: {len(config['companies'])}\n")

# Show fraud cases
fraud_cases = [c for c in config['companies'] if c.get('fraud_case', False)]
print("Fraud Cases:")
for company in fraud_cases:
    print(f"  - {company['ticker']}: {company['name']} ({company['sector']})")

# Sector breakdown
sectors = Counter(c['sector'] for c in config['companies'])
print("\nSector Distribution:")
for sector, count in sectors.most_common():
    print(f"  {sector}: {count}")
```

---
## 3. Download Statistics

### 3.1 Load Download Summaries

```python
# Load 10-K summary
import pandas as pd

with open('../data/raw/download_summary.json', 'r') as f:
    summary_10k = json.load(f)

print("=== 10-K Download Summary ===")
print(f"Companies: {summary_10k['companies']}")
print(f"Target Years: {summary_10k['target_years']}")
print(f"Total Filings: {summary_10k['total_filings']}")
print("\nTop 5 by filing count:")
df_10k = pd.DataFrame(list(summary_10k['by_company'].items()), columns=['Ticker', 'Count'])
print(df_10k.sort_values('Count', ascending=False).head())
```

```python
# Load 8-K summary
with open('../data/raw/download_summary_8k.json', 'r') as f:
    summary_8k = json.load(f)

print("=== 8-K Download Summary ===")
print(f"Total Filings: {summary_8k['total_filings']}")
print(f"Critical Items: {summary_8k['critical_items']}")
print("\nTop 5 by filing count:")
df_8k = pd.DataFrame(list(summary_8k['by_company'].items()), columns=['Ticker', 'Count'])
print(df_8k.sort_values('Count', ascending=False).head())
```

### 3.2 Fraud Case Coverage Analysis

```python
# Compare 10-K and 8-K counts for each known fraud case
fraud_tickers = ['UAA', 'NKLA', 'LK']
fraud_periods = {
    'UAA': '2015-2017 (accounting fraud - revenue manipulation)',
    'NKLA': '2019-2020 (securities fraud - false product claims)',
    'LK': '2019-2020 (accounting fraud - fabricated sales, delisted)'
}

print("Fraud Case Filing Coverage:\n")
for ticker in fraud_tickers:
    count_10k = summary_10k['by_company'].get(ticker, 0)
    count_8k = summary_8k['by_company'].get(ticker, 0)
    print(f"{ticker} - {fraud_periods[ticker]}")
    print(f"  10-K: {count_10k} filings")
    print(f"  8-K: {count_8k} filings")
    print()
```

---
## 4. Next Steps: Extraction & Analysis Pipeline

### 4.1 Text Extraction (Week 1-2)
Build extractors to parse specific sections:

#### **10-K Risk Factor Extractor** (`src/data/risk_extractor.py`)
- **Input:** `full-submission.txt` or `primary-document.html`
- **Output:** Structured JSON with Item 1A text
- **Challenges:**
  - Inconsistent HTML/XML formatting across years
  - Section headers vary ("Item 1A", "ITEM 1A.", "Risk Factors")
  - Nested tables, multi-page risks
- **Strategy:**
  1. Try regex patterns for section headers
  2. Fallback to BeautifulSoup HTML parsing
  3. Manual validation on 10 sample filings
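
Step 1 of the strategy could start from something like the sketch below. It assumes the filing has already been reduced to plain text with one header per line, and that Item 1A ends at the next "Item 1B" or "Item 2" header. The names `START_RE`, `END_RE`, and `extract_item_1a` are illustrative. Because the table of contents also contains an "Item 1A" line, the sketch keeps the longest candidate span.

```python
import re
from typing import Optional

# Header variants such as "Item 1A", "ITEM 1A.", "Item 1A - Risk Factors"
START_RE = re.compile(
    r"^\s*item\s+1a\b[.:\-\s]*(risk\s+factors)?", re.IGNORECASE | re.MULTILINE
)
# Item 1A is assumed to end at Item 1B (Unresolved Staff Comments) or Item 2
END_RE = re.compile(r"^\s*item\s+(1b|2)\b", re.IGNORECASE | re.MULTILINE)


def extract_item_1a(text: str) -> Optional[str]:
    """Return the longest span between an Item 1A header and the next Item 1B/2 header."""
    best: Optional[str] = None
    for start in START_RE.finditer(text):
        end = END_RE.search(text, start.end())
        section = text[start.end(): end.start() if end else len(text)].strip()
        # The table-of-contents entry also matches; the real section is much longer
        if section and (best is None or len(section) > len(best)):
            best = section
    return best
```

Filings where this returns `None` or an implausibly short section are the natural candidates for the BeautifulSoup fallback and for manual validation.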

#### **8-K Item Extractor** (`src/data/item_8k_extractor.py`)
- **Input:** 8-K `full-submission.txt`
- **Target Items:** 4.02, 5.02, 8.01, 2.01
- **Output:** JSON with item type, text, filing date
- **Fraud Indicators to Flag:**
  - Item 4.02 keywords: "restatement", "non-reliance", "material weakness"
  - Item 5.02 keywords: "resignation", "terminated", "CFO", "CEO"
  - Item 8.01 keywords: "investigation", "SEC", "DOJ", "subpoena"
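
A rough sketch of the item splitting and keyword flagging described above, assuming plain-text input in which each item header ("Item 4.02 ...") starts its own line. The names `TARGET_ITEMS`, `KEYWORDS`, and `extract_items` are hypothetical, and the filing date is omitted because it would come from the submission header rather than the item text.

```python
import re
from typing import Dict, List

TARGET_ITEMS = {"4.02", "5.02", "8.01", "2.01"}
KEYWORDS: Dict[str, List[str]] = {
    "4.02": ["restatement", "non-reliance", "material weakness"],
    "5.02": ["resignation", "terminated", "cfo", "ceo"],
    "8.01": ["investigation", "sec", "doj", "subpoena"],
}
ITEM_RE = re.compile(r"^\s*item\s+(\d\.\d{2})\b", re.IGNORECASE | re.MULTILINE)


def extract_items(text: str) -> List[dict]:
    """Split 8-K text on 'Item X.XX' headers; keep target items and their keyword hits."""
    matches = list(ITEM_RE.finditer(text))
    results = []
    for i, m in enumerate(matches):
        item = m.group(1)
        if item not in TARGET_ITEMS:
            continue
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[m.end():end].strip()
        # Whole-word, case-insensitive match so "SEC" does not hit "securities"
        hits = [
            kw for kw in KEYWORDS.get(item, [])
            if re.search(rf"\b{re.escape(kw)}\b", body, re.IGNORECASE)
        ]
        results.append({"item": item, "text": body, "keywords": hits})
    return results
```

Whole-word matching matters most for the short tokens ("SEC", "CEO", "CFO"), which would otherwise match inside ordinary words.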

### 4.2 Feature Engineering (Week 2-3)
- **YoY Risk Factor Changes:**
  - Semantic similarity (sentence-transformers)
  - New/removed/modified risk detection
  - Boilerplate vs. substantive scoring
- **8-K Anomaly Scores:**
  - Frequency of critical items (Item 4.02 = highest weight)
  - Clustering of events (multiple 8-Ks in short period)
  - Sentiment analysis on Item 8.01 text
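
The new/removed/modified detection can be prototyped before any embedding model is wired in. The sketch below swaps in `difflib.SequenceMatcher` (character-level lexical similarity) as a stand-in for sentence-transformers cosine similarity; the thresholds `same` and `related` and the function name `diff_risk_factors` are assumptions chosen for illustration, not tuned values.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def _similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a lexical stand-in for embedding cosine."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def diff_risk_factors(
    prev: List[str], curr: List[str], same: float = 0.95, related: float = 0.6
) -> Dict[str, List[str]]:
    """Classify each current-year risk paragraph against its best prior-year match."""
    out: Dict[str, List[str]] = {"unchanged": [], "modified": [], "new": [], "removed": []}
    matched_prev = set()
    for para in curr:
        scores = [_similarity(para, p) for p in prev]
        best = max(range(len(prev)), key=scores.__getitem__) if prev else None
        score = scores[best] if best is not None else 0.0
        if score >= same:
            out["unchanged"].append(para)
        elif score >= related:
            out["modified"].append(para)
        else:
            out["new"].append(para)
            continue  # a new risk does not consume a prior-year paragraph
        matched_prev.add(best)
    # Prior-year paragraphs that nothing matched are treated as removed
    out["removed"] = [p for i, p in enumerate(prev) if i not in matched_prev]
    return out
```

Replacing `_similarity` with an embedding-based score would leave the classification logic unchanged, which keeps the lexical and semantic variants directly comparable.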

### 4.3 Modeling (Week 3-4)
- **Baseline Models:**
  - Logistic regression on engineered features
  - Random forest for feature importance
- **Advanced:**
  - Fine-tuned BERT for risk factor classification
  - Anomaly detection (isolation forest on 8-K patterns)
- **Validation:**
  - Train on 20 clean companies
  - Test on UAA, NKLA (LK excluded: no 10-K or 8-K filings were collected)
  - Success metric: Flag fraud cases 1-2 years before public disclosure
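
One way to make the success metric measurable is to compute the lead time between the first flagged filing and the public disclosure date. The sketch below assumes each filing has already been assigned a score in [0, 1] and that disclosure dates are supplied by the analyst; `lead_time_days`, `meets_target`, and the 0.5 threshold are hypothetical.

```python
from datetime import date
from typing import Iterable, Optional, Tuple


def lead_time_days(
    scored_filings: Iterable[Tuple[date, float]],
    disclosure_date: date,
    threshold: float = 0.5,
) -> Optional[int]:
    """Days between the first filing flagged before disclosure and the disclosure itself.

    Returns None when no filing dated before `disclosure_date` reaches `threshold`.
    """
    flagged = [
        d for d, score in scored_filings
        if score >= threshold and d < disclosure_date
    ]
    return (disclosure_date - min(flagged)).days if flagged else None


def meets_target(lead_days: Optional[int], min_days: int = 365) -> bool:
    """True when the earliest flag precedes disclosure by at least `min_days`."""
    return lead_days is not None and lead_days >= min_days
```

The same function applied to the clean companies, with an arbitrary reference date, gives a false-positive check at the chosen threshold.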

### 4.4 Dashboard (Week 4)
- **Streamlit app** (`dashboard/streamlit_app.py`)
- Features:
  - Company selector
  - Timeline of 8-K filings (color-coded by risk)
  - Risk factor diff viewer (YoY changes)
  - Fraud probability score
  - Download analysis report

---
## 5. External Datasets Explored (Optional Supplements)

### Kaggle & SRAF Datasets
- **Limitation:** None include 8-K filings (critical for fraud detection)
- **Use Case:** Supplement with parsed financial metrics (balance sheet, income statement)
- **Sources:**
  1. Kaggle: SEC EDGAR Company Facts (2009-2023)
  2. SRAF Notre Dame: 10-X Summaries (1993-2024)

**Decision:** Keep scraped data as primary source, optionally download external datasets to `data/external/` for enrichment.

---
## 6. Current Status & To-Do

### ✅ Completed
- [x] Project structure setup
- [x] Company selection (23 companies, 3 fraud cases)
- [x] 10-K scraper implementation
- [x] 8-K scraper implementation
- [x] Downloaded 4,194 filings (2015-2024)
- [x] Validated fraud case coverage (UAA and NKLA covered; LK has no 10-K/8-K filings)

### 🔄 In Progress
- [ ] Explore Kaggle datasets for supplementary data

### 📋 To-Do
1. **Week 1-2: Extraction**
   - [ ] Build 10-K Item 1A extractor
   - [ ] Build 8-K Item extractor (4.02, 5.02, 8.01, 2.01)
   - [ ] Validate extraction on 10 sample filings
   - [ ] Store extracted text in `data/processed/`

2. **Week 2-3: Feature Engineering**
   - [ ] YoY risk factor change detection
   - [ ] Boilerplate scoring
   - [ ] 8-K anomaly features (frequency, clustering, sentiment)

3. **Week 3-4: Modeling**
   - [ ] Train baseline classifiers
   - [ ] Fine-tune transformer models
   - [ ] Validate on UAA, NKLA fraud cases

4. **Week 4: Dashboard**
   - [ ] Build Streamlit interactive UI
   - [ ] Deploy locally for demo

---
## 7. Technical Notes

### SEC API Compliance
- **User-Agent Required:** Set in `.env` as `SEC_USER_AGENT`
- **Rate Limit:** 10 requests/second (enforced by `sec-edgar-downloader`)
- **Retry Logic:** Built-in for 503 errors
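
For any request made outside `sec-edgar-downloader` (for example, direct calls to the EDGAR API), the same constraints would have to be applied by hand. The sketch below is one possible approach, not the library's implementation: `Throttle` and `sec_headers` are hypothetical helpers, and the clock and sleep functions are injectable so the pacing logic can be tested without waiting.

```python
import os
import time
from typing import Callable, Dict


class Throttle:
    """Blocks so that successive calls to wait() are at least 1/max_per_sec apart."""

    def __init__(
        self,
        max_per_sec: float = 10.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._min_interval = 1.0 / max_per_sec
        self._clock = clock
        self._sleep = sleep
        self._last = float("-inf")  # first call never waits

    def wait(self) -> None:
        remaining = self._min_interval - (self._clock() - self._last)
        if remaining > 0:
            self._sleep(remaining)
        self._last = self._clock()


def sec_headers() -> Dict[str, str]:
    """Build request headers from SEC_USER_AGENT; fail fast when it is missing."""
    user_agent = os.environ.get("SEC_USER_AGENT", "").strip()
    if not user_agent:
        raise RuntimeError("SEC_USER_AGENT is not set (e.g. 'Name email@example.com')")
    return {"User-Agent": user_agent}
```

Calling `throttle.wait()` immediately before each request keeps the pace at or below the configured rate regardless of how long the requests themselves take.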

### Data Quality Issues Encountered
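1. **Luckin Coffee (LK):** No 10-K or 8-K filings. As a foreign private issuer it reports on Forms 20-F and 6-K, so covering it would require scraping those form types (it was delisted in 2020 after the fraud was exposed)
2. **Disney (DIS):** Only 6 10-Ks, most likely because the configured CIK belongs to the registrant created in the 2019 reorganization; earlier 10-Ks are filed under the predecessor entity's CIK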
3. **3 × 503 Errors:** Temporary SEC server unavailability during download

### Storage Optimization
- Exclude `data/raw/` from Git (added to `.gitignore`)
- Future: Compress old filings to `.gz` after extraction
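
The planned compression step could look like the sketch below, which uses only the standard library. The function name `compress_filings` and the default pattern are assumptions; the original file is deleted only after its `.gz` copy has been written.

```python
import gzip
import shutil
from pathlib import Path
from typing import List


def compress_filings(root: Path, pattern: str = "full-submission.txt") -> List[Path]:
    """Gzip every matching file under root, delete the original, return the new paths."""
    compressed: List[Path] = []
    for src in sorted(root.rglob(pattern)):
        dst = src.with_name(src.name + ".gz")
        with src.open("rb") as f_in, gzip.open(dst, "wb") as f_out:
            shutil.copyfileobj(f_in, f_out)
        src.unlink()  # remove only after the compressed copy has been written
        compressed.append(dst)
    return compressed


# Example: compress_filings(Path("data/raw/sec-edgar-filings"))
```

Compressed files can still be read in place with `gzip.open(path, "rt")`, so extractors would not need a separate decompression pass.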

---
## 8. References

### Fraud Cases
- **Under Armour:** SEC charges (2021) - https://www.sec.gov/newsroom/press-releases/2021-78
- **Luckin Coffee:** SEC settlement (2020) - https://www.sec.gov/newsroom/press-releases/2020-319
- **Nikola:** SEC charges against founder Trevor Milton (2021) - https://www.sec.gov/newsroom/press-releases/2021-141

### Technical Resources
- SEC EDGAR API: https://www.sec.gov/edgar/sec-api-documentation
- `sec-edgar-downloader`: https://github.com/jadchaar/sec-edgar-downloader
- SRAF Datasets: https://sraf.nd.edu/sec-edgar-data/