# SEC Filing Linguistic Features for Numerai

## Context
SEC filings (10-K annual reports, 10-Q quarterly, 8-K events) are the most structured, reliable public text data for US equities. Every public company must file, and filings are freely available on EDGAR.

## Why SEC Filings for Numerai
- **Point-in-time correct**: Each filing has an exact EDGAR filing date, so features keyed on that date (rather than the fiscal period end) avoid look-ahead bias
- **Broad US coverage**: Every US public company must file, which covers the US names in Numerai's stock universe
- **Linguistic changes are subtle signals**: When a company's 10-K suddenly becomes harder to read or more litigious, it has tended to precede weaker returns (see Key References)
- **Orthogonal to standard factors**: Readability, sentiment shifts, and topic changes are not directly captured by Barra-style factors (momentum, value, size)

## Features We'll Extract
1. **Readability**: Gunning Fog Index, word/sentence complexity
2. **Loughran-McDonald Sentiment**: Finance-specific word lists (positive, negative, uncertainty, litigious)
3. **Filing-over-filing deltas**: Changes in features between consecutive filings
4. **Forward-looking language**: Ratio of forward-looking statements
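
As a rough sketch of how these feature families are computed in Section 3, the definitions can be written compactly. The symbols $W$, $S$, $C$, $\mathcal{L}_c$ and $x_t$ are notation introduced here for illustration; they do not come from the cited papers.

$$
\text{Fog} = 0.4\left(\frac{W}{S} + 100\cdot\frac{C}{W}\right), \qquad
p_c = \frac{1}{W}\sum_{i=1}^{W} \mathbf{1}\left[w_i \in \mathcal{L}_c\right], \qquad
\Delta x_t = x_t - x_{t-1}
$$

Here $W$ is the number of words, $S$ the number of sentences, $C$ the number of complex words (three or more syllables, approximated in this notebook by word length), $\mathcal{L}_c$ the word list for category $c$ (positive, negative, uncertainty, litigious, or forward-looking), and $x_t$ any feature measured on the $t$-th filing of a company.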

In [None]:
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import requests
from collections import Counter

## 1. Download a Sample SEC Filing from EDGAR

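EDGAR is the SEC's free filing database. To keep this notebook runnable offline, the cell below works with synthetic 10-K-style excerpts; Section 6 outlines how to fetch real filings.
Note: EDGAR requires a User-Agent header that identifies you (name and email) on every request.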

In [None]:
# SEC EDGAR requires a User-Agent header
HEADERS = {"User-Agent": "NumeraiResearch research@example.com"}

# EDGAR company-browse page listing Apple's recent 10-K filings (CIK 0000320193).
# Kept for reference only: this notebook never requests it.
SAMPLE_URL = "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0000320193&type=10-K&dateb=&owner=include&count=5&search_text="

# To stay runnable without network access, we use synthetic filing excerpts
# written to resemble actual 10-K language
SYNTHETIC_FILING_APPLE = """
ITEM 1. BUSINESS

The Company designs, manufactures and markets smartphones, personal computers, tablets, wearables and accessories, 
and sells a variety of related services. The Company's fiscal year is the 52- or 53-week period that ends on the 
last Saturday of September.

Products
iPhone. iPhone is the Company's line of smartphones based on its iOS operating system. The current lineup includes 
iPhone 15 Pro, iPhone 15 Pro Max, iPhone 15, and iPhone 15 Plus. iPhone revenue decreased year-over-year during 2023 
primarily driven by different launch timing of different iPhone models.

Mac. Mac is the Company's line of personal computers based on its macOS operating system. The current lineup includes 
MacBook Air, MacBook Pro, iMac, Mac mini, Mac Studio and Mac Pro. Mac revenue decreased year-over-year during 2023 
driven by different launch timing and different models.

Services. Services includes advertising, AppleCare, cloud services, digital content, and payment services. 
Services revenue increased year-over-year during 2023 driven primarily by higher revenue from advertising, 
the App Store, and cloud services.

Competition
The markets for the Company's products and services are highly competitive. The Company faces substantial 
competition in all markets and product categories. The Company's competitors that sell mobile devices and 
personal computers based on other operating systems have aggressively cut prices and lowered product margins. 
The Company's financial condition and operating results can be adversely affected by these and other industry-wide 
downward pressures on gross margins.

Risk Factors
The Company's operations and performance depend significantly on worldwide economic conditions and their impact 
on levels of consumer spending. Markets for the Company's products and services are competitive and the Company 
faces aggressive competition in all areas of its business. The Company faces risks related to supply chain 
disruptions, including from pandemics, geopolitical tensions, and natural disasters. The Company may be subject to 
information technology system failures, network disruptions, and cybersecurity threats. Changes in tax rates, 
adoption of new tax legislation, or exposure to additional tax liabilities could adversely affect the Company's 
financial condition. The Company is subject to complex and evolving laws and regulations regarding privacy, data 
protection, and content.

Forward-Looking Statements
This Annual Report contains forward-looking statements, within the meaning of the Private Securities Litigation 
Reform Act of 1995, that involve risks and uncertainties. The Company believes that its existing balances of cash, 
cash equivalents and unrestricted marketable securities, along with commercial paper and other short-term liquidity 
arrangements, will be sufficient to satisfy its working capital needs, capital asset purchases, dividends, share 
repurchases, debt repayments and other liquidity requirements associated with its existing operations over the next 
12 months. The Company expects total revenue for the December quarter to be similar to the year-ago quarter. 
The Company expects to continue to invest in research and development and plans to increase its services offerings 
to drive long-term growth.
"""

SYNTHETIC_FILING_TSLA = """
ITEM 1. BUSINESS

We design, develop, manufacture, sell and lease high-performance fully electric vehicles and energy generation 
and storage systems, and offer services related to our sustainable energy products. We are the world's most 
valuable automotive company based on market capitalization.

Automotive
Our current vehicle line-up includes Model S, Model 3, Model X, Model Y, and Cybertruck. We delivered approximately 
1.8 million vehicles in fiscal year 2023, representing a slight miss versus market expectations. We continue to 
focus on ramping production capacity at our manufacturing facilities in Fremont, Shanghai, Austin, and Berlin.

We have experienced significant price reductions across our vehicle lineup in response to competitive pressures 
and to maintain demand. These price reductions have negatively impacted our automotive gross margins, which 
declined to approximately 18% from 26% in the prior year.

Energy Generation and Storage
We manufacture and sell solar energy systems and energy storage products for residential, commercial and 
industrial customers. Revenue from this segment grew substantially year-over-year.

Competition
The automotive industry is intensely competitive, and we expect it will become even more competitive in the 
future. Many major automobile manufacturers have announced plans to expand their electric vehicle offerings. 
We face increasing competition from both traditional manufacturers transitioning to electric and from new 
entrants. Chinese manufacturers, in particular, have introduced competitively priced electric vehicles.

Risk Factors
We may be unable to grow our business as planned. Our vehicles may experience quality issues or defects that 
could result in voluntary or involuntary recalls. We face risks related to our CEO's other business activities 
and public statements. We have significant debt and may need additional capital. Our future growth depends on 
demand for electric vehicles, which may not materialize at the rate we expect. Regulatory changes could 
adversely impact our business. We face litigation risks including securities class actions and product liability 
claims. Supply chain constraints could adversely impact our production.

Forward-Looking Statements
We believe we will be able to fund our operations for at least the next 12 months. We plan to significantly 
increase our vehicle production capacity. We expect to launch several new vehicle models including a more 
affordable model. We anticipate continued investment in autonomous driving technology and artificial intelligence 
capabilities. We expect our energy business to become an increasingly important revenue driver.
"""

# Store filings in a dict
filings = {
    "AAPL_2023": SYNTHETIC_FILING_APPLE.strip(),
    "TSLA_2023": SYNTHETIC_FILING_TSLA.strip(),
}

print(f"Loaded {len(filings)} filings")
for name, text in filings.items():
    print(f"  {name}: {len(text)} chars, {len(text.split())} words")

## 2. Loughran-McDonald Financial Dictionary

General-purpose sentiment dictionaries (like VADER) misclassify many finance terms. For example, "liability" reads as negative in general English but is a neutral accounting term in a filing.

Loughran & McDonald (2011) created finance-specific word lists. We implement a subset here.
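
To move from the hand-typed subset to the full lists, one option is to read the master dictionary CSV and build one set per category. The sketch below is hedged: it assumes a `Word` column plus one column per category whose value is positive (the year the word was added) for members and zero or negative otherwise. Check these assumptions against the file you actually download. `load_lm_word_lists` and `SAMPLE_CSV` are hypothetical names, and the sample rows are a stand-in, not real dictionary content.

```python
import io
import pandas as pd

# Category columns assumed to exist in the master dictionary CSV.
LM_CATEGORIES = ["Positive", "Negative", "Uncertainty", "Litigious"]

def load_lm_word_lists(csv_source):
    """Return {category: set of lowercase words} from an LM-style CSV.

    Assumes one row per word and that a positive value in a category
    column marks membership (0 or a negative value means not a member).
    """
    # keep_default_na=False stops words such as "NULL" or "NA" from being read as NaN
    df = pd.read_csv(csv_source, keep_default_na=False)
    words = df["Word"].astype(str).str.lower()
    return {
        cat.lower(): set(words[pd.to_numeric(df[cat]) > 0])
        for cat in LM_CATEGORIES
    }

# Tiny in-memory stand-in for the real file, for illustration only.
SAMPLE_CSV = io.StringIO(
    "Word,Negative,Positive,Uncertainty,Litigious\n"
    "LOSS,2009,0,0,0\n"
    "ACHIEVE,0,2009,0,0\n"
    "MAY,0,0,2009,0\n"
    "PLAINTIFF,0,0,0,2009\n"
    "LIABILITY,0,0,0,0\n"
    "NULL,0,0,0,0\n"
)
lm_lists = load_lm_word_lists(SAMPLE_CSV)
print({cat: sorted(ws) for cat, ws in lm_lists.items()})
```

With the real file, pass its local path instead of `SAMPLE_CSV`; the resulting sets can replace `LM_POSITIVE`, `LM_NEGATIVE`, `LM_UNCERTAINTY`, and `LM_LITIGIOUS` directly.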

In [None]:
# Loughran-McDonald-style sentiment word lists (small illustrative subset).
# Not every word below is guaranteed to match the official category assignments;
# for research use the full lists: https://sraf.nd.edu/loughranmcdonald-master-dictionary/
LM_POSITIVE = {
    "achieve", "achieved", "achievement", "benefit", "beneficial", "best", "better", "boost",
    "breakthrough", "creative", "deliver", "earn", "earnings", "efficient", "enhance", "excellent",
    "exceptional", "expand", "favorable", "gain", "great", "grew", "grow", "growth", "improve",
    "improved", "improvement", "increase", "increased", "innovative", "opportunity", "outperform",
    "positive", "profit", "profitable", "progress", "prosper", "record", "reward", "rewarding",
    "strength", "strong", "succeed", "success", "successful", "superior", "surpass", "upside",
}

# "liability" is deliberately absent: it is the motivating example of a word
# that is neutral in financial text.
LM_NEGATIVE = {
    "adverse", "adversely", "against", "closing", "concern", "concerned", "concerns",
    "critical", "decline", "declined", "decrease", "decreased", "default", "deficit", "delay",
    "delayed", "deteriorate", "difficulty", "diminish", "disappoint", "disappointing", "downgrade",
    "downturn", "drop", "failure", "fell", "force", "impair", "impairment", "inability",
    "inadequate", "investigation", "lawsuit", "layoff", "liquidation", "litigation",
    "loss", "losses", "miss", "missed", "negative", "penalty", "plummet", "problem", "recall",
    "recession", "restructuring", "risk", "risks", "severe", "shortage", "slowdown", "terminate",
    "threat", "unable", "uncertain", "unfavorable", "volatile", "warn", "warning", "weak", "weakness",
    "worsen",
}

LM_UNCERTAINTY = {
    "almost", "apparent", "apparently", "appear", "appeared", "appears", "approximate",
    "approximately", "assume", "assumed", "belief", "believe", "believed", "conceivable",
    "conditional", "contingent", "could", "depend", "depending", "depends", "doubt",
    "estimate", "estimated", "expect", "expected", "expose", "exposed", "exposure",
    "fluctuate", "fluctuation", "indicate", "indefinite", "likelihood", "may", "might",
    "nearly", "occasionally", "pending", "perhaps", "possible", "possibly", "predict",
    "prediction", "preliminary", "presume", "probable", "probably", "random", "risk",
    "roughly", "seem", "seemed", "seems", "somewhat", "suggest", "uncertain", "uncertainty",
    "unclear", "unpredictable", "unusual", "variable",
}

LM_LITIGIOUS = {
    "action", "adjudicate", "allegation", "allege", "alleged", "appeal", "arbitrate",
    "arbitration", "attorney", "claim", "claims", "claimant", "class", "complaint",
    "contend", "convicted", "counsel", "court", "crime", "criminal", "damages",
    "defendant", "defraud", "deposition", "dispute", "enforce", "enforcement",
    "fine", "fined", "fraud", "guilty", "incriminate", "indict", "indictment",
    "infraction", "infringe", "injunction", "judge", "judgment", "jury",
    "law", "laws", "lawsuit", "lawsuits", "lawyer", "legal", "legislate", "legislation",
    "liable", "liabilities", "litigate", "litigation", "penalty", "plaintiff",
    "plead", "prosecute", "prosecution", "regulation", "regulations", "regulatory",
    "ruling", "sanction", "sentence", "settlement", "statute", "subpoena",
    "sue", "suit", "testify", "testimony", "tribunal", "verdict", "violate", "violation",
}

print(f"Loughran-McDonald word lists:")
print(f"  Positive: {len(LM_POSITIVE)} words")
print(f"  Negative: {len(LM_NEGATIVE)} words")
print(f"  Uncertainty: {len(LM_UNCERTAINTY)} words")
print(f"  Litigious: {len(LM_LITIGIOUS)} words")

## 3. Feature Extraction Functions

In [None]:
def tokenize(text):
    """Simple word tokenization for filing text."""
    return re.findall(r'\b[a-z]+\b', text.lower())

def count_sentences(text):
    """Count sentences by splitting on ., ! or ? followed by whitespace or end of text.

    Requiring whitespace keeps decimals such as "1.8 million" in one sentence.
    """
    sentences = re.split(r'[.!?]+(?:\s+|$)', text)
    return len([s for s in sentences if len(s.strip()) > 0])

def count_complex_words(words):
    """Approximate the number of 3+ syllable words with a length proxy (7+ characters)."""
    return sum(1 for w in words if len(w) >= 7)

def gunning_fog_index(text):
    """
    Gunning Fog Index = 0.4 * (words_per_sentence + 100 * complex_words / words)
    Higher = harder to read. Typical 10-K: 18-22.
    """
    words = tokenize(text)
    n_sentences = count_sentences(text)
    if n_sentences == 0 or len(words) == 0:
        return 0.0
    avg_words_per_sentence = len(words) / n_sentences
    frac_complex = count_complex_words(words) / len(words)  # fraction in [0, 1]
    return 0.4 * (avg_words_per_sentence + 100 * frac_complex)

def lm_sentiment_scores(text):
    """Compute Loughran-McDonald sentiment proportions."""
    words = tokenize(text)
    n = max(len(words), 1)  # avoid division by zero on empty text
    
    pos_count = sum(1 for w in words if w in LM_POSITIVE)
    neg_count = sum(1 for w in words if w in LM_NEGATIVE)
    unc_count = sum(1 for w in words if w in LM_UNCERTAINTY)
    lit_count = sum(1 for w in words if w in LM_LITIGIOUS)
    
    return {
        "lm_positive_pct": pos_count / n,
        "lm_negative_pct": neg_count / n,
        "lm_uncertainty_pct": unc_count / n,
        "lm_litigious_pct": lit_count / n,
        "lm_net_sentiment": (pos_count - neg_count) / n,
        "lm_positive_count": pos_count,
        "lm_negative_count": neg_count,
        "lm_uncertainty_count": unc_count,
        "lm_litigious_count": lit_count,
    }

def forward_looking_ratio(text):
    """Ratio of forward-looking language (expects, plans, believes, will, anticipates)."""
    words = tokenize(text)
    n = len(words) if len(words) > 0 else 1
    fl_words = {"expect", "expects", "expected", "plan", "plans", "planned", "believe", "believes",
                "will", "would", "anticipate", "anticipates", "intend", "intends", "forecast",
                "project", "projects", "projected", "estimate", "estimates", "aim", "aims",
                "target", "targets", "continue", "continues", "future", "forward"}
    fl_count = sum(1 for w in words if w in fl_words)
    return fl_count / n

def extract_all_features(text, filing_name=""):
    """Extract all linguistic features from a filing."""
    words = tokenize(text)
    n_sentences = count_sentences(text)
    
    features = {
        "filing": filing_name,
        "word_count": len(words),
        "sentence_count": n_sentences,
        "avg_word_length": np.mean([len(w) for w in words]) if words else 0,
        "avg_sentence_length": len(words) / n_sentences if n_sentences > 0 else 0,
        "fog_index": gunning_fog_index(text),
        "forward_looking_ratio": forward_looking_ratio(text),
        "vocab_richness": len(set(words)) / len(words) if words else 0,
    }
    features.update(lm_sentiment_scores(text))
    return features

# Test on our filings
results = []
for name, text in filings.items():
    features = extract_all_features(text, name)
    results.append(features)

features_df = pd.DataFrame(results).set_index("filing")
print("Extracted features:")
print(features_df.T.to_string())

## 4. Visualize Feature Comparison

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Plot 1: Readability metrics
metrics1 = ["fog_index", "avg_sentence_length", "avg_word_length"]
x = np.arange(len(metrics1))
width = 0.35
axes[0, 0].bar(x - width/2, [features_df.loc["AAPL_2023", m] for m in metrics1], width, label="AAPL", color="steelblue")
axes[0, 0].bar(x + width/2, [features_df.loc["TSLA_2023", m] for m in metrics1], width, label="TSLA", color="coral")
axes[0, 0].set_xticks(x)
axes[0, 0].set_xticklabels(["Fog Index", "Avg Sent Len", "Avg Word Len"])
axes[0, 0].set_title("Readability Metrics")
axes[0, 0].legend()

# Plot 2: LM Sentiment proportions
metrics2 = ["lm_positive_pct", "lm_negative_pct", "lm_uncertainty_pct", "lm_litigious_pct"]
labels2 = ["Positive", "Negative", "Uncertainty", "Litigious"]
x = np.arange(len(metrics2))
axes[0, 1].bar(x - width/2, [features_df.loc["AAPL_2023", m] for m in metrics2], width, label="AAPL", color="steelblue")
axes[0, 1].bar(x + width/2, [features_df.loc["TSLA_2023", m] for m in metrics2], width, label="TSLA", color="coral")
axes[0, 1].set_xticks(x)
axes[0, 1].set_xticklabels(labels2)
axes[0, 1].set_title("Loughran-McDonald Sentiment")
axes[0, 1].legend()

# Plot 3: Word count and vocabulary
metrics3 = ["word_count", "sentence_count"]
x = np.arange(len(metrics3))
axes[1, 0].bar(x - width/2, [features_df.loc["AAPL_2023", m] for m in metrics3], width, label="AAPL", color="steelblue")
axes[1, 0].bar(x + width/2, [features_df.loc["TSLA_2023", m] for m in metrics3], width, label="TSLA", color="coral")
axes[1, 0].set_xticks(x)
axes[1, 0].set_xticklabels(["Word Count", "Sentence Count"])
axes[1, 0].set_title("Filing Size")
axes[1, 0].legend()

# Plot 4: Forward-looking ratio and net sentiment
metrics4 = ["forward_looking_ratio", "lm_net_sentiment", "vocab_richness"]
labels4 = ["Forward-Looking", "Net Sentiment", "Vocab Richness"]
x = np.arange(len(metrics4))
axes[1, 1].bar(x - width/2, [features_df.loc["AAPL_2023", m] for m in metrics4], width, label="AAPL", color="steelblue")
axes[1, 1].bar(x + width/2, [features_df.loc["TSLA_2023", m] for m in metrics4], width, label="TSLA", color="coral")
axes[1, 1].set_xticks(x)
axes[1, 1].set_xticklabels(labels4)
axes[1, 1].set_title("Language Characteristics")
axes[1, 1].legend()

plt.suptitle("SEC Filing Linguistic Feature Comparison", fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

## 5. Filing-Over-Filing Deltas (The Real Signal)

The absolute feature values are less interesting than **changes over time**. When Apple's 10-K suddenly becomes harder to read, or more litigious, that is a candidate signal worth testing.
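
For more than one company, the delta has to be taken within each ticker after sorting by filing date. A minimal sketch with pandas follows; the panel values and filing dates are made up for illustration, and `add_filing_deltas` is a hypothetical helper name.

```python
import pandas as pd

# Hypothetical panel: one row per (ticker, filing). All values are made up.
panel = pd.DataFrame({
    "ticker": ["AAPL", "AAPL", "AAPL", "TSLA", "TSLA", "TSLA"],
    "filing_date": pd.to_datetime([
        "2021-10-29", "2022-10-28", "2023-11-03",
        "2021-02-08", "2022-02-07", "2023-01-31",
    ]),
    "fog_index": [18.5, 18.8, 20.1, 17.0, 17.4, 17.1],
    "lm_negative_pct": [0.014, 0.016, 0.022, 0.020, 0.019, 0.025],
})

def add_filing_deltas(df, feature_cols):
    """Append delta_<col> = current filing minus the same ticker's previous filing."""
    out = df.sort_values(["ticker", "filing_date"]).copy()
    # diff() within each ticker; the first filing of every ticker gets NaN
    deltas = out.groupby("ticker")[feature_cols].diff()
    return out.join(deltas.add_prefix("delta_"))

panel_deltas = add_filing_deltas(panel, ["fog_index", "lm_negative_pct"])
print(panel_deltas.to_string(index=False))
```

Grouping before differencing matters: a plain `diff()` over the stacked panel would subtract one company's last filing from another company's first.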

In [None]:
# Hand-picked (not estimated) feature values for 4 consecutive annual 10-K filings
years = [2020, 2021, 2022, 2023]
# AAPL: gradually more complex, then a jump in negative and uncertainty language in 2023
simulated = pd.DataFrame({
    "year": years,
    "fog_index": [18.2, 18.5, 18.8, 20.1],  # spike in 2023
    "lm_negative_pct": [0.015, 0.014, 0.016, 0.022],  # spike in 2023  
    "lm_uncertainty_pct": [0.032, 0.030, 0.035, 0.048],  # big spike in 2023
    "forward_looking_ratio": [0.018, 0.020, 0.019, 0.015],  # drops in 2023
    "word_count": [45000, 47000, 48000, 52000],  # growing over time
})

# Compute year-over-year deltas
delta_cols = ["fog_index", "lm_negative_pct", "lm_uncertainty_pct", "forward_looking_ratio", "word_count"]
for col in delta_cols:
    simulated[f"delta_{col}"] = simulated[col].diff()

print("AAPL Simulated Filing Features Over Time:")
print(simulated.to_string(index=False))

# Plot the deltas
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

axes[0].plot(years, simulated["fog_index"], 'o-', color="steelblue", linewidth=2, markersize=8)
axes[0].set_title("Fog Readability Index")
axes[0].set_ylabel("Fog Index (higher = harder)")
axes[0].set_xlabel("Filing Year")

axes[1].plot(years, simulated["lm_negative_pct"], 'o-', color="red", linewidth=2, markersize=8, label="Negative")
axes[1].plot(years, simulated["lm_uncertainty_pct"], 's-', color="orange", linewidth=2, markersize=8, label="Uncertainty")
axes[1].set_title("Sentiment Word Proportions")
axes[1].set_ylabel("Word Proportion")
axes[1].set_xlabel("Filing Year")
axes[1].legend()

axes[2].bar(years[1:], simulated["delta_fog_index"].dropna(), color=["green", "green", "red"], edgecolor="black", alpha=0.7)
axes[2].set_title("Year-over-Year Fog Index Change")
axes[2].set_ylabel("Delta Fog Index")
axes[2].set_xlabel("Filing Year")
axes[2].axhline(y=0, color='black', linewidth=0.5)

plt.suptitle("Filing-Over-Filing Feature Changes (AAPL Simulated)", fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()

print("\nKey insight: The 2023 spike in fog index + uncertainty + negative language")
print("could signal upcoming challenges BEFORE they appear in stock returns.")

## 6. Fetching Real Filings from EDGAR (TODO)

```python
# TODO: Replace synthetic filings with real ones from EDGAR
#
# Option 1: the EDGAR full-text search (EFTS) endpoint.
# Note: this query matches the phrase "10-K" across 10-K filings filed in 2023;
# it is not restricted to Apple.
#
# import requests
#
# url = "https://efts.sec.gov/LATEST/search-index?q=%2210-K%22&dateRange=custom&startdt=2023-01-01&enddt=2024-01-01&forms=10-K"
# headers = {"User-Agent": "YourName your@email.com"}
# response = requests.get(url, headers=headers)
#
# Option 2: the sec-edgar-downloader package.
# uv add sec-edgar-downloader
#
# from sec_edgar_downloader import Downloader
# dl = Downloader("YourCompany", "your@email.com")
# dl.get("10-K", "AAPL", after="2023-01-01")
```
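
Another route is EDGAR's per-company submissions JSON, which lists a company's recent filings. The sketch below assumes the layout with parallel arrays under `filings` then `recent` (`form`, `filingDate`, `accessionNumber`, `primaryDocument`); verify this against a live response before relying on it. It uses `urllib` from the standard library so it has no extra dependencies. `submissions_url`, `list_filings`, and `fetch_filings` are hypothetical helper names, and the payload at the bottom is invented.

```python
import json
import urllib.request

# Replace with your own contact details before making requests.
SEC_HEADERS = {"User-Agent": "YourName your@email.com"}

def submissions_url(cik):
    """URL of the EDGAR submissions JSON for a CIK, zero-padded to 10 digits."""
    return f"https://data.sec.gov/submissions/CIK{int(cik):010d}.json"

def list_filings(submissions, form="10-K"):
    """Return [(filing_date, document_url), ...] for one form type.

    Assumes parallel arrays under submissions["filings"]["recent"].
    """
    recent = submissions["filings"]["recent"]
    cik = int(submissions["cik"])
    rows = []
    for f, date, acc, doc in zip(recent["form"], recent["filingDate"],
                                 recent["accessionNumber"], recent["primaryDocument"]):
        if f == form:
            folder = acc.replace("-", "")  # archive folders drop the dashes
            rows.append((date, f"https://www.sec.gov/Archives/edgar/data/{cik}/{folder}/{doc}"))
    return rows

def fetch_filings(cik, form="10-K", timeout=30):
    """Download the submissions JSON and list filings (performs a network call)."""
    req = urllib.request.Request(submissions_url(cik), headers=SEC_HEADERS)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return list_filings(json.load(resp), form)

# Offline illustration with a made-up payload in the assumed layout.
fake_submissions = {
    "cik": "1234567",
    "filings": {"recent": {
        "form": ["8-K", "10-K"],
        "filingDate": ["2023-12-01", "2023-11-03"],
        "accessionNumber": ["0000000000-23-000002", "0000000000-23-000001"],
        "primaryDocument": ["example-8k.htm", "example-10k.htm"],
    }},
}
print(submissions_url("320193"))
print(list_filings(fake_submissions))
```

Calling `fetch_filings("0000320193")` would hit the network, so it is left out of the example; keep request rates low and cache responses locally.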

## Discussion & Interview Talking Points

### Strengths
- **Point-in-time correct by design**: SEC filings have exact filing dates — no ambiguity
- **Universal US coverage**: Every public company must file
- **Well-studied academically**: Loughran-McDonald (2011) is the gold standard; many published papers validate these features
- **Low churn**: Filings change quarterly/annually — naturally stable features (Numerai penalizes high churn)
- **Deltas are the real signal**: Filing-over-filing changes capture management's evolving risk perception
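
Point-in-time correctness only holds if each prediction date sees the latest filing that was already public. One way to sketch that join is `pandas.merge_asof` with a backward match. The dates, the one-day availability lag, and the `align_point_in_time` helper are assumptions made for illustration, not part of Numerai's tooling.

```python
import pandas as pd

# Made-up filing features for one ticker.
filing_features = pd.DataFrame({
    "ticker": ["AAPL", "AAPL"],
    "filing_date": pd.to_datetime(["2022-10-28", "2023-11-03"]),
    "fog_index": [18.8, 20.1],
})

# Hypothetical weekly prediction dates.
eras = pd.DataFrame({
    "ticker": ["AAPL"] * 3,
    "era_date": pd.to_datetime(["2023-10-27", "2023-11-03", "2023-11-10"]),
})

def align_point_in_time(eras, features, lag_days=1):
    """Attach to each era the most recent filing available on or before it.

    lag_days is an assumed safety margin between filing and usability.
    """
    feats = features.copy()
    feats["available_date"] = feats["filing_date"] + pd.Timedelta(days=lag_days)
    # merge_asof needs both frames sorted on their join keys
    return pd.merge_asof(
        eras.sort_values("era_date"),
        feats.sort_values("available_date"),
        left_on="era_date",
        right_on="available_date",
        by="ticker",
        direction="backward",
    )

aligned = align_point_in_time(eras, filing_features)
print(aligned[["era_date", "filing_date", "fog_index"]].to_string(index=False))
```

In this example the era dated 2023-11-03 still carries the 2022 filing, because the new 10-K only becomes usable one day after it is filed. The same forward-filled value persisting between filings is what makes these features low churn.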

### Weaknesses
- **Low frequency**: Only 4 filings per year (10-K + 3x 10-Q). 8-K events are more frequent but shorter.
- **Delayed**: Filing deadlines run 60 to 90 days after fiscal year-end for a 10-K and 40 to 45 days after quarter-end for a 10-Q, depending on filer size, so features lag reality
- **Boilerplate**: Much 10-K language is copy-pasted year-to-year. Must focus on CHANGED sections.
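
To focus on changed sections, one simple approach is to compare each paragraph of the current filing with every paragraph of the previous one and keep those without a close match. The sketch below uses bag-of-words cosine similarity; the 0.8 threshold, the blank-line paragraph split, and the example texts are illustrative assumptions.

```python
import math
import re
from collections import Counter

def word_counts(text):
    """Lowercase bag-of-words counts."""
    return Counter(re.findall(r"\b[a-z]+\b", text.lower()))

def cosine_similarity(a, b):
    """Cosine similarity between the word-count vectors of two texts."""
    ca, cb = word_counts(a), word_counts(b)
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def changed_paragraphs(prev_text, curr_text, threshold=0.8):
    """Paragraphs of curr_text whose best match in prev_text falls below threshold."""
    prev_paras = [p for p in prev_text.split("\n\n") if p.strip()]
    changed = []
    for para in (p for p in curr_text.split("\n\n") if p.strip()):
        best = max((cosine_similarity(para, q) for q in prev_paras), default=0.0)
        if best < threshold:
            changed.append(para.strip())
    return changed

# Made-up risk-factor paragraphs for two consecutive filings.
prev_filing = (
    "We face intense competition in all markets.\n\n"
    "Our supply chain depends on a limited number of suppliers."
)
curr_filing = (
    "We face intense competition in all markets.\n\n"
    "We are subject to a pending securities class action lawsuit."
)
new_text = changed_paragraphs(prev_filing, curr_filing)
print(new_text)
```

The features from Section 3 can then be computed on the changed paragraphs alone, and the whole-document similarity between consecutive filings is itself a candidate feature in the spirit of "Lazy Prices".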

### Orthogonality Assessment
- **Readability metrics**: Probably somewhat orthogonal — Numerai is unlikely to have Fog Index
- **LM sentiment on filings**: Somewhat known, but filing-level is less common than news-level
- **Filing deltas**: Likely orthogonal — tracking linguistic CHANGES is non-standard
- **Best combined with**: FinBERT on the same filings (NB01), topic modeling (NB04), or embeddings (NB02)
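
These orthogonality claims are guesses until measured. A rough check is to regress a candidate feature on known factor exposures and see how much variance survives in the residual. The sketch below uses synthetic data: the three random columns merely stand in for factors such as momentum, value, and size, and `neutralize` is a hypothetical helper, not Numerai's own neutralization routine.

```python
import numpy as np

def neutralize(feature, factors):
    """Residual of feature after least-squares projection on factors plus an intercept."""
    X = np.column_stack([np.ones(len(feature)), factors])
    beta, *_ = np.linalg.lstsq(X, feature, rcond=None)
    return feature - X @ beta

rng = np.random.default_rng(0)
factors = rng.normal(size=(500, 3))  # synthetic stand-ins for factor exposures
# Synthetic feature that is partly explained by the first factor.
delta_fog = 0.5 * factors[:, 0] + rng.normal(size=500)

resid = neutralize(delta_fog, factors)
raw_corr = np.corrcoef(delta_fog, factors[:, 0])[0, 1]
resid_corr = np.corrcoef(resid, factors[:, 0])[0, 1]
surviving_share = resid.var() / delta_fog.var()

print(f"correlation with factor 0 before: {raw_corr:.3f}")
print(f"correlation with factor 0 after:  {resid_corr:.3e}")
print(f"share of variance left after neutralization: {surviving_share:.2f}")
```

A feature whose residual keeps most of its variance is adding information beyond the factors; one that collapses toward zero is largely a repackaged factor exposure.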

### Key References
- Loughran & McDonald (2011), "When Is a Liability Not a Liability?": showed that general-purpose dictionaries misclassify words in financial text
- Li (2008), "Annual Report Readability, Current Earnings, and Earnings Persistence": firms with lower earnings tend to file harder-to-read annual reports, and easier-to-read reports go with more persistent positive earnings
- Cohen, Malloy & Nguyen (2020), "Lazy Prices": changes in 10-K/10-Q language predict future returns and earnings

### Extensions (TODO)
- [ ] Download real filings from EDGAR using sec-edgar-downloader
- [ ] Implement full Loughran-McDonald dictionary (download from https://sraf.nd.edu/)
- [ ] Compare Item 1A (Risk Factors) across filings — most informative section
- [ ] Add Flesch-Kincaid readability score
- [ ] Track forward-looking statement ratio changes as a leading indicator
- [ ] Build pipeline for all S&P 500 companies
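
For the Flesch-Kincaid item, a minimal sketch follows. It uses the standard grade-level formula, 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59, with a vowel-group heuristic for syllables. The heuristic is an approximation chosen here for brevity (it undercounts some words), and `count_syllables` and `flesch_kincaid_grade` are hypothetical names.

```python
import re

def count_syllables(word):
    """Approximate syllables as vowel groups, dropping a trailing silent 'e'."""
    word = word.lower()
    if len(word) > 2 and word.endswith("e") and not word.endswith("le"):
        word = word[:-1]
    return max(1, len(re.findall(r"[aeiouy]+", word)))

def flesch_kincaid_grade(text):
    """Flesch-Kincaid grade level; higher means harder to read."""
    words = re.findall(r"\b[a-z]+\b", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59

simple = "The cat sat. The dog ran."
dense = "The Company is subject to complex and evolving laws and regulations regarding privacy."
print(round(flesch_kincaid_grade(simple), 2), round(flesch_kincaid_grade(dense), 2))
```

The same `count_syllables` helper could replace the 7+ character proxy in `count_complex_words` by counting words with three or more estimated syllables.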