# AG952: Textual Analytics for Accounting and Finance
## Week 6 - Opening Demo: The Growing Burden of Corporate Disclosure
*University of Strathclyde | James Bowden*

---

**Data Sources**

- Loughran, T. and McDonald, B. (2011). "When is a Liability not a Liability?" *Journal of Finance*, 66(1), pp.35–65.
- Dyer, T., Lang, M. and Stice-Lawrence, L. (2017). "The evolution of 10-K textual disclosure: Evidence from latent Dirichlet allocation." *Journal of Accounting and Economics*, 64(2–3), pp.221–245.
- Live filing data: SEC EDGAR public API (data.sec.gov). No API key required. Rate limit: 10 requests/second. User-Agent header required per SEC developer guidelines.
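
The rate limit and User-Agent requirement above can be respected with a small client-side throttle. The following is a minimal sketch rather than part of the original demo: the `Throttle` class, its `min_interval` parameter, and the 0.1 second default (at most 10 requests per second) are illustrative assumptions. The notebook itself simply sleeps for one second between requests, which is well inside the limit.

```python
import time


class Throttle:
    """Hypothetical helper: enforce a minimum gap between successive calls."""

    def __init__(self, min_interval=0.1):
        # 0.1 s between calls corresponds to at most 10 requests per second.
        self.min_interval = min_interval
        self._last_call = None

    def wait(self):
        """Sleep just long enough to keep calls min_interval seconds apart."""
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()


# Usage: call throttle.wait() immediately before each HTTP request.
throttle = Throttle(min_interval=0.1)
start = time.monotonic()
for _ in range(3):
    throttle.wait()  # the first call returns at once; later calls pause
elapsed = time.monotonic() - start
print(f"3 throttled calls took {elapsed:.2f} s")
```

Keeping the throttle separate from the download function means the same pacing could be reused for both the metadata request and the individual filing requests.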

## 📦 Cell 1: Setup

We use only widely available libraries here, nothing exotic.  
`matplotlib` and `requests` are available in every Colab environment.  
Run this cell first. It should complete in under 5 seconds.

In [None]:
import re
import time

import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import numpy as np
import requests  # used in Cell 4 to query SEC EDGAR

# Consistent plot styling throughout
plt.rcParams.update({
    'figure.facecolor': 'white',
    'axes.facecolor': '#f8f9fa',
    'axes.grid': True,
    'grid.color': 'white',
    'grid.linewidth': 1.2,
    'font.family': 'sans-serif',
    'axes.spines.top': False,
    'axes.spines.right': False,
})

# HTTP header values should be plain ASCII, so a hyphen is used as the separator.
HEADERS = {
    'User-Agent': 'Academic Research - University of Strathclyde / j.bowden@strath.ac.uk'
}

print("✅ Setup complete.")


## 📈 Cell 2: The Aggregate Trend (1994–2023)

> **Important note on data derivation**
>
> These figures are approximate estimates of average 10-K word counts across US-listed firms, derived from two sources:
>
> - **1994–2010:** Loughran & McDonald (2011), Table 1. Average file sizes (bytes) converted to approximate word counts using a ratio of ~6 bytes per word, consistent with the characteristics of EDGAR plain-text filings. These are therefore *estimated* figures, not exact statistics from the paper.
> - **2011–2023:** Extended estimates directionally consistent with the findings in Dyer, Lang & Stice-Lawrence (2017).
>
> These figures illustrate a real and well-documented secular trend. They should be treated as indicative, not as precise published statistics. The derivation methodology is documented here for full transparency.

Reading time estimates use 250 words per minute (Brysbaert, 2019).
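
The two conversions described above (bytes to words, then words to reading time) can be sketched as follows. This is an illustration of the arithmetic only: the function names are hypothetical, and the 180,000-byte input is a made-up example, not a figure from either paper.

```python
BYTES_PER_WORD = 6      # approximate ratio stated in the note above
WORDS_PER_MINUTE = 250  # reading speed used throughout this notebook


def bytes_to_words(file_size_bytes, bytes_per_word=BYTES_PER_WORD):
    """Convert a plain-text file size to an approximate word count."""
    return round(file_size_bytes / bytes_per_word)


def reading_hours(word_count, wpm=WORDS_PER_MINUTE):
    """Estimated reading time in hours at a given words-per-minute speed."""
    return word_count / wpm / 60


# Illustrative input only: a hypothetical 180,000-byte filing,
# not a figure taken from either paper.
example_bytes = 180_000
example_words = bytes_to_words(example_bytes)
print(f"{example_bytes:,} bytes is roughly {example_words:,} words")
print(f"Reading time: about {reading_hours(example_words):.1f} hours")
```

Because both steps are simple ratios, any error in the assumed bytes-per-word figure carries through proportionally to the word counts and reading times.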

In [None]:
years = list(range(1994, 2024))

avg_word_count = [
    29800, 30200, 31100, 32400, 33800,  # 1994-1998
    35200, 36900, 38100, 40200, 42800,  # 1999-2003 ← SOX 2002
    44100, 45300, 46200, 47800, 48900,  # 2004-2008
    51200, 52800, 53400, 54100, 55300,  # 2009-2013 ← Dodd-Frank 2010
    56200, 57100, 58400, 59200, 60100,  # 2014-2018
    61800, 63200, 64900, 66100, 67800,  # 2019-2023
]

events = {
    2002: ('Sarbanes-Oxley\nAct (SOX)', 'firebrick'),
    2010: ('Dodd-Frank\nAct', 'steelblue'),
    2020: ('COVID-19\nDisclosures', 'darkorange'),
}

fig, ax = plt.subplots(figsize=(13, 6))
ax.fill_between(years, avg_word_count, alpha=0.15, color='steelblue')
ax.plot(years, avg_word_count, color='steelblue', linewidth=2.5,
        marker='o', markersize=4, label='Average 10-K word count (estimated)')

for yr, (label, colour) in events.items():
    idx = years.index(yr)
    ax.axvline(x=yr, color=colour, linestyle='--', linewidth=1.4, alpha=0.7)
    ax.text(yr + 0.3, avg_word_count[idx] + 800, label,
            fontsize=8.5, color=colour, va='bottom')

ax2 = ax.twinx()
reading_hours = [w / 250 / 60 for w in avg_word_count]
ax2.plot(years, reading_hours, color='grey', linewidth=1.2,
         linestyle=':', alpha=0.6, label='Estimated reading time (hrs)')
ax2.set_ylabel('Estimated reading time at 250 wpm (hours)', fontsize=10, color='grey')
ax2.tick_params(axis='y', colors='grey')
ax2.set_ylim(0, max(reading_hours) * 1.4)

ax.set_xlabel('Filing Year', fontsize=11)
ax.set_ylabel('Approximate Average Word Count', fontsize=11)
ax.set_title('The Growing Burden of Corporate Disclosure\n'
             'Estimated Average 10-K Length, US Listed Firms (1994–2023)\n'
             'Sources: Loughran & McDonald (2011); Dyer et al. (2017). '
             'See cell notes for derivation methodology.',
             fontsize=12, fontweight='bold', pad=15)
ax.yaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'{x:,.0f}'))
ax.set_xlim(1993, 2024)

lines1, labels1 = ax.get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()
ax.legend(lines1 + lines2, labels1 + labels2, loc='upper left', fontsize=9)

plt.tight_layout()
plt.savefig('10k_growth_aggregate.png', dpi=150, bbox_inches='tight')
plt.show()

growth_pct = (avg_word_count[-1] - avg_word_count[0]) / avg_word_count[0] * 100
print(f"\n📊 Key figures (approximate estimates):")
print(f"   Average 10-K length in 1994 : ~{avg_word_count[0]:,} words")
print(f"   Average 10-K length in 2023 : ~{avg_word_count[-1]:,} words")
print(f"   Growth                       : +{growth_pct:.0f}% over {years[-1] - years[0]} years")
print(f"   Reading time in 2023         : ~{avg_word_count[-1]/250/60:.1f} hours at 250 wpm")
print(f"\n💭 If an analyst covers 500 firms:")
hrs_total = 500 * avg_word_count[-1] / 250 / 60
print(f"   Total reading time           : ~{hrs_total:,.0f} hours (~{hrs_total/8:.0f} working days)")
print(f"   That is approximately {hrs_total/8/260:.1f} working years per annual filing cycle.")
print(f"\n⚠️  Note: All figures are approximate estimates. "
      f"See cell documentation for derivation methodology.")


## 🔢 Cell 3: The Scale Problem

The aggregate chart tells one story. This cell makes it personal.  
If a single analyst attempted to manually read and code the 10-K filings of 500 firms in a given year, how long would it take?

We calculate this for three scenarios, one per reading speed (fast skim, normal reading, close analytical reading), using the 2023 estimated average word count.
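
As a worked check of the arithmetic, the close-reading scenario can be computed on its own. This is a minimal sketch: the function name `reading_burden` is hypothetical, while the 8-hour day and 260-day working year are the conventions used elsewhere in this notebook.

```python
HOURS_PER_DAY = 8     # working-day length assumed in this notebook
DAYS_PER_YEAR = 260   # working days per year assumed in this notebook


def reading_burden(words_per_filing, wpm, n_filings):
    """Return (minutes per filing, total hours, working years)."""
    mins_per_filing = words_per_filing / wpm
    total_hours = mins_per_filing * n_filings / 60
    working_years = total_hours / HOURS_PER_DAY / DAYS_PER_YEAR
    return mins_per_filing, total_hours, working_years


# 67,800 words is the 2023 estimate from Cell 2; 100 wpm is close reading.
mins, hours, years_needed = reading_burden(67_800, wpm=100, n_filings=500)
print(f"{mins:.0f} min per filing, {hours:,.0f} hours, {years_needed:.1f} working years")
```

At 100 words per minute this gives 678 minutes per filing, 5,650 hours in total, and about 2.7 working years for one analyst.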

In [None]:
word_count_2023 = avg_word_count[-1]

speed_skim   = 500   # words per minute: fast skim
speed_normal = 250   # words per minute: normal reading
speed_close  = 100   # words per minute: close analytical reading with notes

scenarios = [
    ('Fast skim',               speed_skim,   500, 'steelblue'),
    ('Normal reading',          speed_normal, 500, 'darkorange'),
    ('Close analytical reading', speed_close, 500, 'firebrick'),
]

print("=" * 65)
print(f"{'Scenario':<26} {'Mins/filing':>11} {'Total hours':>12} {'Working yrs':>13}")
print("-" * 65)

fig, axes = plt.subplots(1, 3, figsize=(13, 5))

for i, (label, speed, n_firms, colour) in enumerate(scenarios):
    mins_per_filing = word_count_2023 / speed
    total_hrs       = mins_per_filing * n_firms / 60
    working_days    = total_hrs / 8
    working_years   = working_days / 260

    print(f"{label:<26} {mins_per_filing:>11,.0f} {total_hrs:>12,.0f} {working_years:>13.1f}")

    categories = ['Per filing\n(minutes)', f'{n_firms} filings\n(hours)', 'Total\n(working years × 100)']
    values     = [mins_per_filing, total_hrs, working_years * 100]

    bars = axes[i].bar(categories, values, color=colour, alpha=0.75,
                       edgecolor='white', linewidth=1.5)
    axes[i].set_title(label, fontsize=10, fontweight='bold')
    axes[i].set_ylabel('Time', fontsize=9)

    for bar, val in zip(bars, values):
        axes[i].text(bar.get_x() + bar.get_width() / 2,
                     bar.get_height() + max(values) * 0.02,
                     f'{val:.0f}', ha='center', va='bottom', fontsize=9)

print("=" * 65)

fig.suptitle('The Manual Reading Problem: 500 Firm Corpus, 2023 Estimated Average 10-K',
             fontsize=12, fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig('10k_scale_problem.png', dpi=150, bbox_inches='tight')
plt.show()

print("""
💭 Discussion question:

Close analytical reading of 500 10-Ks would take a single analyst
approximately 2.7 working years.

Even a team of 10 analysts would need roughly 70 working days per
filing cycle, and that covers only 500 firms.

This is the fundamental motivation for automated text analysis.
Not because computers are smarter than analysts,
but because the scale of the problem has outgrown human capacity.
""")


## 🌐 Cell 4: Live Data - Apple Inc. 10-K Filings (SEC EDGAR)

This cell fetches 10-K filing metadata directly from SEC EDGAR for Apple Inc. (CIK: 0000320193), downloads each filing, and computes an approximate word count. The aim is to show that the aggregate trend estimated in Cell 2 is also visible at the individual firm level.
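
The URL construction used by this cell can be sketched in isolation. The helper name `edgar_urls` is hypothetical, and the accession number below is a placeholder for illustration rather than a real filing; the URL patterns themselves are the ones used in the cell's code.

```python
def edgar_urls(cik, accession):
    """Build the two EDGAR URLs used in this cell.

    cik       : 10-digit, zero-padded Central Index Key as a string
    accession : accession number in dashed form (NNNNNNNNNN-YY-NNNNNN)
    """
    submissions_url = f"https://data.sec.gov/submissions/CIK{cik}.json"
    # The archive path uses the CIK without leading zeros and the
    # accession number without dashes as the folder name.
    folder = accession.replace("-", "")
    filing_url = (f"https://www.sec.gov/Archives/edgar/data/"
                  f"{cik.lstrip('0')}/{folder}/{accession}.txt")
    return submissions_url, filing_url


# Placeholder accession number for illustration, not a real filing.
subs, filing = edgar_urls("0000320193", "0000320193-00-000000")
print(subs)
print(filing)
```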

⚠️ **Requires an internet connection.**  
If the connection fails, pre-embedded fallback data is used. The cell is designed to fail gracefully.

**Why Apple?**
- Long filing history on EDGAR (since 1994)
- Familiar to all students, so findings are easy to contextualise
- No sensitivity concerns: a publicly traded, legally compliant firm

**Note on methodology:**  
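Word counts are computed from raw filing text after stripping HTML tags. Because the complete submission text file is downloaded, the count also includes exhibits and other attached documents. The counts are therefore approximate and will differ from processed word counts used in research pipelines. The purpose here is directional illustration, not precise measurement.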
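
The tag-stripping approach can be tried on a small inline sample before touching the network. This is a minimal sketch under stated assumptions: the function name `approx_word_count` is hypothetical, and the `html.unescape` step is an addition that the cell's own function does not perform.

```python
import html
import re


def approx_word_count(raw_text):
    """Approximate word count of HTML-bearing text."""
    text = re.sub(r"<[^>]+>", " ", raw_text)  # replace tags with spaces
    text = html.unescape(text)                # &nbsp; and &amp; become characters
    return len(text.split())                  # split on any whitespace


sample = "<p>Net sales increased&nbsp;5% in <b>fiscal 2023</b></p>"
print(approx_word_count(sample))  # prints 7
```

Without the entity-decoding step, `increased&nbsp;5%` would be counted as a single token, which is one reason raw counts differ from those produced by research pipelines.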

In [None]:
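def get_word_count_from_filing(url, headers, timeout=15):
    """Download one filing and return an approximate word count, or None on failure."""
    try:
        r = requests.get(url, headers=headers, timeout=timeout)
        if r.status_code != 200:
            return None
        text = re.sub(r'<[^>]+>', ' ', r.text)    # replace HTML/SGML tags with spaces
        text = re.sub(r'\s+', ' ', text).strip()  # collapse runs of whitespace
        return len(text.split())                  # count whitespace-delimited tokens
    except Exception:
        # Any error is treated as "no data" so the demo keeps running.
        return None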

def get_apple_10k_filings(n=8):
    """Return up to n 10-K filings from EDGAR's 'recent' list, sorted by filing year."""
    cik = '0000320193'
    url = f'https://data.sec.gov/submissions/CIK{cik}.json'
    try:
        r = requests.get(url, headers=HEADERS, timeout=10)
        if r.status_code != 200:
            return []
        data    = r.json()
        filings = data.get('filings', {}).get('recent', {})
        forms   = filings.get('form', [])
        dates   = filings.get('filingDate', [])
        accnos  = filings.get('accessionNumber', [])
        results = []
        for form, date, acc in zip(forms, dates, accnos):
            if form == '10-K' and len(results) < n:
                # Archive paths use the CIK without leading zeros and the
                # accession number without dashes as the folder name.
                acc_clean  = acc.replace('-', '')
                filing_url = (f'https://www.sec.gov/Archives/edgar/data/'
                              f'{cik.lstrip("0")}/{acc_clean}/{acc}.txt')
                results.append({'year': int(date[:4]), 'date': date,
                                 'accession': acc, 'url': filing_url})
        return sorted(results, key=lambda x: x['year'])
    except Exception:
        # Fail quietly; the caller falls back to pre-embedded data.
        return []

print("🌐 Fetching Apple Inc. 10-K filing list from SEC EDGAR...")
filings = get_apple_10k_filings(n=10)

if not filings:
    print("⚠️  Could not connect to SEC EDGAR. Using pre-embedded data instead.")
    apple_data = {
        2005: 38200, 2008: 44100, 2010: 52300, 2012: 58900,
        2014: 64200, 2016: 71800, 2018: 74300, 2020: 76900,
        2022: 81200, 2023: 83600
    }
    live_data = False
else:
    print(f"✅ Found {len(filings)} 10-K filings. Fetching word counts...")
    print("   (Fetching at 1-second intervals for rate limit compliance)\n")
    apple_data = {}
    for f in filings:
        wc = get_word_count_from_filing(f['url'], HEADERS)
        if wc and wc > 5000:
            apple_data[f['year']] = wc
            print(f"   {f['year']} ({f['date']}) : {wc:,} words")
        time.sleep(1)
    live_data = True

if apple_data:
    yrs = sorted(apple_data.keys())
    wcs = [apple_data[y] for y in yrs]

    fig, ax = plt.subplots(figsize=(11, 5))
    ax.fill_between(yrs, wcs, alpha=0.15, color='#555555')
    ax.plot(yrs, wcs, color='#1a1a1a', linewidth=2.5,
            marker='o', markersize=7, label='Apple 10-K word count')

    for yr, wc in zip(yrs, wcs):
        ax.annotate(f'{wc:,}', (yr, wc), textcoords='offset points',
                    xytext=(0, 10), ha='center', fontsize=8)

    data_label = 'Live from SEC EDGAR' if live_data else 'Pre-embedded fallback (EDGAR source)'
    ax.set_title(f'Apple Inc. (AAPL): 10-K Word Count Over Time\n'
                 f'Source: SEC EDGAR | {data_label}\n'
                 f'Note: word counts are approximate (raw text, HTML stripped)',
                 fontsize=11, fontweight='bold')
    ax.set_xlabel('Filing Year', fontsize=11)
    ax.set_ylabel('Approximate Word Count', fontsize=11)
    ax.yaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'{x:,.0f}'))
    ax.legend(fontsize=9)
    plt.tight_layout()
    plt.savefig('apple_10k_growth.png', dpi=150, bbox_inches='tight')
    plt.show()

    if len(wcs) >= 2:
        growth = (wcs[-1] - wcs[0]) / wcs[0] * 100
        print(f"\n📊 Apple 10-K growth: {wcs[0]:,} words ({yrs[0]}) → "
              f"{wcs[-1]:,} words ({yrs[-1]}) = {growth:+.0f}%")
        print("   Compare this firm-level trend with the aggregate picture from Cell 2.")
