# SEC Filing Linguistic Features for Numerai

## Context
SEC filings (10-K annual reports, 10-Q quarterly, 8-K events) are the most structured, reliable public text data for US equities. Every public company must file, and filings are freely available on EDGAR.

## Why SEC Filings for Numerai
- **Point-in-time correct**: Every filing has an exact EDGAR filing date, so features dated by filing date (not fiscal period end) avoid look-ahead bias
- **Universal coverage**: Every US public company files (covers Numerai's entire US stock universe)
- **Linguistic changes are subtle signals**: When a company's 10-K suddenly becomes harder to read or more litigious, prior research (e.g., Cohen et al. 2020, "Lazy Prices") links such changes to lower subsequent returns
- **Orthogonal to standard factors**: Readability, sentiment shifts, and topic changes are not directly captured by Barra-style factors (momentum, value, size)

## Features We'll Extract
1. **Readability**: Gunning Fog Index, word/sentence complexity
2. **Loughran-McDonald Sentiment**: Finance-specific word lists (positive, negative, uncertainty, litigious)
3. **Filing-over-filing deltas**: Changes in features between consecutive filings
4. **Forward-looking language**: Ratio of forward-looking statements

In [None]:
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import requests
from collections import Counter

## 1. Download a Sample SEC Filing from EDGAR

EDGAR is the SEC's free filing database. We use its EFTS full-text search API.
Note: the SEC asks automated clients to send a User-Agent header that identifies you (name and email).

In [None]:
# SEC EDGAR requires a User-Agent header

# TODO: implement
...
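
One way to start on this cell is to build the search request separately from sending it, so the URL and header can be inspected offline. The sketch below reuses the EFTS endpoint and query parameters that appear in the TODO block later in this notebook. The function names `build_search_request` and `fetch` are hypothetical, and it uses the standard library's `urllib` instead of `requests` only so that it runs without extra dependencies.

```python
import urllib.request
from urllib.parse import urlencode

# Endpoint taken from the TODO block in section 6 of this notebook.
EFTS_SEARCH_URL = "https://efts.sec.gov/LATEST/search-index"


def build_search_request(phrase, form, start, end, contact):
    """Build (but do not send) a full-text search request with a User-Agent."""
    params = {
        "q": f'"{phrase}"',        # quoted, so it is searched as an exact phrase
        "dateRange": "custom",
        "startdt": start,
        "enddt": end,
        "forms": form,
    }
    url = f"{EFTS_SEARCH_URL}?{urlencode(params)}"
    return urllib.request.Request(url, headers={"User-Agent": contact})


def fetch(request, timeout=30):
    """Send the request and return the raw response body (needs network access)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Replace the contact string with your own name and email before calling fetch().
request = build_search_request(
    "10-K", "10-K", "2023-01-01", "2024-01-01", "YourName your@email.com"
)
print(request.full_url)
```

With `requests`, the equivalent call would be `requests.get(request.full_url, headers={"User-Agent": ...})`. The response format is not described in this notebook, so the sketch returns the raw body rather than guessing at its structure.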

## 2. Loughran-McDonald Financial Dictionary

General-purpose sentiment dictionaries (like VADER) often misclassify financial text: e.g., "liability" reads as negative in everyday English but is a neutral accounting term in a filing.

Loughran & McDonald (2011) created finance-specific word lists. We implement a subset here.

In [None]:
# Loughran-McDonald Financial Sentiment Dictionary (representative subset)
# Full lists available at: https://sraf.nd.edu/loughranmcdonald-master-dictionary/

# TODO: implement
...
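
A minimal sketch of what the subset could look like: one set of lowercase words per category, plus a lookup helper. The words below are a small illustrative selection recalled from the published lists and should be checked against the master dictionary linked above before use. The names `LM_SUBSET` and `categories_for` are hypothetical.

```python
# Illustrative subset only (assumption): verify every word against the
# official Loughran-McDonald master dictionary before relying on it.
LM_SUBSET = {
    "positive": {"achieve", "benefit", "favorable", "gain",
                 "improve", "profitable", "strong", "successful"},
    "negative": {"adverse", "decline", "default", "failure",
                 "impairment", "loss", "losses", "weak"},
    "uncertainty": {"approximately", "could", "may", "might",
                    "possible", "risk", "uncertain", "uncertainty"},
    "litigious": {"court", "defendant", "lawsuit", "legal",
                  "litigation", "plaintiff", "settlement", "tort"},
}


def categories_for(word):
    """Return the sorted category names whose word set contains `word`."""
    token = word.lower()
    return sorted(cat for cat, vocab in LM_SUBSET.items() if token in vocab)


print(categories_for("Litigation"))
print(categories_for("liability"))
```

In this subset "liability" matches no category, which mirrors the point above about general-purpose dictionaries. In the full dictionary a single word can belong to more than one list, which is why the helper returns a list rather than a single label.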

## 3. Feature Extraction Functions

In [None]:
def tokenize(text):
    """Simple word tokenization for filing text."""
    ...

def count_sentences(text):
    """Count sentences using simple regex."""
    ...

def count_complex_words(words):
    """Count words with 3+ syllables (simple heuristic: 7+ characters)."""
    ...

def gunning_fog_index(text):
    """
    Gunning Fog Index = 0.4 * (avg_words_per_sentence + 100 * complex_words / total_words)
    Higher = harder to read. Typical 10-K: 18-22.
    """
    ...

def lm_sentiment_scores(text):
    """Compute Loughran-McDonald sentiment proportions."""
    ...

def forward_looking_ratio(text):
    """Ratio of forward-looking language (expects, plans, believes, will, anticipates)."""
    ...

def extract_all_features(text, filing_name=""):
    """Extract all linguistic features from a filing."""
    ...
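
One possible way to fill in the stubs above, kept deliberately simple. It follows the docstrings (7+ characters as the complex-word heuristic, the five forward-looking verbs listed) but the tiny word lists, the regex tokenizer, and the feature names such as `negative_pct` are assumptions made for illustration, not a reference implementation.

```python
import re

# Tiny illustrative word lists (assumption); swap in the full dictionary.
LM_WORDS = {
    "positive": {"gain", "improve", "strong", "favorable"},
    "negative": {"loss", "losses", "decline", "adverse", "impairment"},
    "uncertainty": {"may", "could", "uncertain", "risk"},
    "litigious": {"litigation", "lawsuit", "plaintiff", "court"},
}
FORWARD_LOOKING = {"expects", "plans", "believes", "will", "anticipates"}


def tokenize(text):
    """Lowercase the text and return runs of letters as word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def count_sentences(text):
    """Count runs of sentence-ending punctuation; never return less than 1."""
    return max(1, len(re.findall(r"[.!?]+", text)))


def count_complex_words(words):
    """Approximate 3+ syllable words as words with 7+ characters."""
    return sum(1 for w in words if len(w) >= 7)


def gunning_fog_index(text):
    """0.4 * (words per sentence + 100 * share of complex words)."""
    words = tokenize(text)
    if not words:
        return 0.0
    words_per_sentence = len(words) / count_sentences(text)
    complex_share = count_complex_words(words) / len(words)
    return 0.4 * (words_per_sentence + 100 * complex_share)


def lm_sentiment_scores(text):
    """Share of tokens that fall in each word list."""
    words = tokenize(text)
    total = max(1, len(words))
    return {f"{cat}_pct": sum(w in vocab for w in words) / total
            for cat, vocab in LM_WORDS.items()}


def forward_looking_ratio(text):
    """Share of tokens that are forward-looking verbs."""
    words = tokenize(text)
    return sum(w in FORWARD_LOOKING for w in words) / max(1, len(words))


def extract_all_features(text, filing_name=""):
    """Collect all features for one filing into a flat dict."""
    features = {
        "filing": filing_name,
        "word_count": len(tokenize(text)),
        "fog_index": gunning_fog_index(text),
        "forward_looking_ratio": forward_looking_ratio(text),
    }
    features.update(lm_sentiment_scores(text))
    return features


sample = "The company expects revenue to improve. Litigation risk may cause losses."
features = extract_all_features(sample, "sample")
print(features)
```

The regex sentence counter overcounts on abbreviations and decimals ("Inc.", "3.5"), and the letter-only tokenizer splits tokens like "10-K", so real filings would likely need a sturdier tokenizer.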


## 4. Visualize Feature Comparison

In [None]:
# TODO: implement
...

## 5. Filing-Over-Filing Deltas (The Real Signal)

The absolute feature values are less interesting than **changes over time**. When Apple's 10-K suddenly becomes harder to read, or more litigious, that's a signal.

In [None]:
# Simulate filing features over 4 years (4 annual 10-K filings)

# TODO: implement
...
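
A minimal sketch of the delta computation for a single company, assuming the per-filing features have already been collected into a DataFrame indexed by fiscal year. The numbers are synthetic and hand-picked to show a jump in the last year. The column names and the `filing_deltas` helper are hypothetical.

```python
import pandas as pd

# Synthetic values for one hypothetical company; not real filing data.
filings = pd.DataFrame(
    {
        "fiscal_year": [2020, 2021, 2022, 2023],
        "fog_index": [19.0, 19.5, 19.0, 22.0],
        "negative_pct": [0.015, 0.016, 0.015, 0.024],
        "litigious_pct": [0.004, 0.004, 0.005, 0.009],
    }
).set_index("fiscal_year")


def filing_deltas(features):
    """Return filing-over-filing changes; the first filing has no prior and is dropped."""
    deltas = features.sort_index().diff().dropna()
    return deltas.add_suffix("_delta")


deltas = filing_deltas(filings)
print(deltas)
```

For a multi-company panel, the same idea would apply per ticker, for example with `groupby("ticker").diff()`, so that one company's first filing is never differenced against another company's last.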

## 6. Fetching Real Filings from EDGAR (TODO)

```python
# TODO: Replace synthetic filings with real ones from EDGAR
# Example using the SEC EDGAR full-text search API:
#
# import requests
# 
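# # Full-text search for the phrase "10-K" within 10-K filings dated 2023
# # (not Apple-specific; put a company name in q to narrow the text search)
# url = "https://efts.sec.gov/LATEST/search-index?q=%2210-K%22&dateRange=custom&startdt=2023-01-01&enddt=2024-01-01&forms=10-K"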
# headers = {"User-Agent": "YourName your@email.com"}
# response = requests.get(url, headers=headers)
#
# # Or use the sec-edgar-downloader package:
# # uv add sec-edgar-downloader
# # from sec_edgar_downloader import Downloader
# # dl = Downloader("YourCompany", "your@email.com")
# # dl.get("10-K", "AAPL", after="2023-01-01")
#
# # Variant of the full-text search above: plain keyword query, no end date
# url = "https://efts.sec.gov/LATEST/search-index?q=company&forms=10-K&dateRange=custom&startdt=2023-01-01"
```

## Discussion & Interview Talking Points

### Strengths
- **Point-in-time correct by design**: SEC filings have exact filing dates — no ambiguity
- **Universal US coverage**: Every public company must file
- **Well-studied academically**: Loughran-McDonald (2011) is the gold standard; many published papers validate these features
- **Low churn**: Filings change quarterly/annually — naturally stable features (Numerai penalizes high churn)
- **Deltas are the real signal**: Filing-over-filing changes capture management's evolving risk perception

### Weaknesses
- **Low frequency**: Only 4 filings per year (10-K + 3x 10-Q). 8-K events are more frequent but shorter.
- **Delayed**: Filing deadlines depend on filer size: 40-45 days after quarter-end for a 10-Q and 60-90 days after fiscal year-end for a 10-K, so features lag the period they describe
- **Boilerplate**: Much 10-K language is copy-pasted year-to-year. Must focus on CHANGED sections.
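
To make the boilerplate point concrete, here is a small sketch of isolating what changed between two consecutive filings: it keeps only sentences that did not appear verbatim in the prior filing and reports a word-set Jaccard similarity as one simple measure of overall change. The function names and the two example texts are hypothetical, and exact-match comparison is an assumption; lightly reworded sentences would count as changed.

```python
import re


def split_sentences(text):
    """Split on sentence-ending punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def changed_sentences(previous, current):
    """Return sentences in `current` that do not appear verbatim in `previous`."""
    seen = set(split_sentences(previous))
    return [s for s in split_sentences(current) if s not in seen]


def jaccard_similarity(previous, current):
    """Word-set overlap between two texts; 1.0 means identical vocabulary."""
    a = set(re.findall(r"[a-z]+", previous.lower()))
    b = set(re.findall(r"[a-z]+", current.lower()))
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Made-up example texts standing in for two consecutive filings.
last_year = "We face competition. Our supply chain is stable."
this_year = "We face competition. Our supply chain is subject to litigation."

new_text = changed_sentences(last_year, this_year)
similarity = jaccard_similarity(last_year, this_year)
print(new_text, similarity)
```

Running the feature extractors only on `new_text` is one way to keep copy-pasted language from diluting the signal.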

### Orthogonality Assessment
- **Readability metrics**: Probably somewhat orthogonal — Numerai is unlikely to have Fog Index
- **LM sentiment on filings**: Somewhat known, but filing-level is less common than news-level
- **Filing deltas**: Likely orthogonal — tracking linguistic CHANGES is non-standard
- **Best combined with**: FinBERT on the same filings (NB01), topic modeling (NB04), or embeddings (NB02)

### Key References
- Loughran & McDonald (2011): "When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks". Showed that general-purpose dictionaries misclassify many words in financial text.
- Li (2008): "Annual Report Readability, Current Earnings, and Earnings Persistence". Firms with lower earnings file harder-to-read annual reports, and easier-to-read reports go with more persistent positive earnings.
- Cohen et al. (2020): "Lazy Prices". Changes in 10-K/10-Q language predict future returns and earnings.

### Extensions (TODO)
- [ ] Download real filings from EDGAR using sec-edgar-downloader
- [ ] Implement full Loughran-McDonald dictionary (download from https://sraf.nd.edu/)
- [ ] Compare Item 1A (Risk Factors) across filings — most informative section
- [ ] Add Flesch-Kincaid readability score
- [ ] Track forward-looking statement ratio changes as a leading indicator
- [ ] Build pipeline for all S&P 500 companies