# AG952 Textual Analytics for Accounting and Finance
### Week 6: The Growing Burden of Corporate Disclosure

*Dr James Bowden, Strathclyde Business School*

---

Aggregate trend data are drawn from Loughran and McDonald (2011) and Dyer, Lang and Stice-Lawrence (2017). Live filing data are retrieved from the SEC EDGAR public API (data.sec.gov). The methodology used to derive word-count estimates from reported file sizes is documented in the cell notes below.

Loughran, T. and McDonald, B. (2011). When is a Liability not a Liability? Textual Analysis, Dictionaries, and 10-Ks. *Journal of Finance*, 66(1), 35–65.

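Dyer, T., Lang, M. and Stice-Lawrence, L. (2017). The evolution of 10-K textual disclosure: Evidence from latent Dirichlet allocation. *Journal of Accounting and Economics*, 64(2–3), 221–245.

Brysbaert, M. (2019). How many words do we read per minute? A review and meta-analysis of reading rate. *Journal of Memory and Language*, 109, 104047.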

### Setup

Library imports and plot configuration. Run this cell before any of the cells below.

In [None]:
import requests
import time
import re
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import numpy as np

# Consistent plot styling throughout
plt.rcParams.update({
    'figure.facecolor': 'white',
    'axes.facecolor': '#f8f9fa',
    'axes.grid': True,
    'grid.color': 'white',
    'grid.linewidth': 1.2,
    'font.family': 'sans-serif',
    'axes.spines.top': False,
    'axes.spines.right': False,
})

HEADERS = {
    'User-Agent': 'University of Strathclyde j.bowden@strath.ac.uk'
}

print("Setup complete.")


### The aggregate trend, 1994–2013

The figures below are estimated average word counts for 10-K annual reports filed by US-listed firms from 1994 to 2013. The series draws on Loughran and McDonald (2011, Table 1), which reports mean filing sizes in bytes, and on Dyer, Lang and Stice-Lawrence (2017). Word counts are approximated at roughly six bytes per word, consistent with the character density of EDGAR plain-text filings of that period. The series stops at 2013 because extending it further would require extrapolating beyond the period these sources cover; the live firm-level data in Cell 4 suggest that the trend continued in subsequent years.

Reading-time annotations assume 250 words per minute, following Brysbaert (2019).
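The two conversions described above (bytes to words, then words to reading time) can be sketched as follows. This is a minimal illustration rather than the derivation actually used for the series: the function names are hypothetical, and the 300,000-byte input is an invented example, not a figure taken from either source.

```python
BYTES_PER_WORD = 6       # approximation stated in the text above
WORDS_PER_MINUTE = 250   # reading speed assumed in the text above


def estimate_words_from_bytes(file_size_bytes, bytes_per_word=BYTES_PER_WORD):
    """Approximate a word count from a plain-text filing size in bytes."""
    return file_size_bytes / bytes_per_word


def estimate_reading_hours(word_count, words_per_minute=WORDS_PER_MINUTE):
    """Convert a word count into reading time in hours."""
    return word_count / words_per_minute / 60


# Hypothetical filing size, chosen only to keep the arithmetic easy to follow
example_bytes = 300_000
example_words = estimate_words_from_bytes(example_bytes)
example_hours = estimate_reading_hours(example_words)

print(f"{example_bytes:,} bytes ~ {example_words:,.0f} words "
      f"~ {example_hours:.1f} hours of reading")
```

Both constants are exposed as parameters so that the sensitivity of the estimates to the six-bytes-per-word and 250-wpm assumptions can be checked directly.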

In [None]:
years = list(range(1994, 2014))

avg_word_count = [
    29800, 30200, 31100, 32400, 33800,  # 1994-1998
    35200, 36900, 38100, 40200, 42800,  # 1999-2003 ← SOX 2002
    44100, 45300, 46200, 47800, 48900,  # 2004-2008
    51200, 52800, 53400, 54100, 55300,  # 2009-2013 ← Dodd-Frank 2010
]

events = {
    2002: ('Sarbanes-Oxley\nAct (SOX)', 'firebrick'),
    2010: ('Dodd-Frank\nAct', 'steelblue'),
}

fig, ax = plt.subplots(figsize=(13, 6))
ax.fill_between(years, avg_word_count, alpha=0.15, color='steelblue')
ax.plot(years, avg_word_count, color='steelblue', linewidth=2.5,
        marker='o', markersize=4, label='Average 10-K word count (estimated)')

for yr, (label, colour) in events.items():
    idx = years.index(yr)
    ax.axvline(x=yr, color=colour, linestyle='--', linewidth=1.4, alpha=0.7)
    ax.text(yr + 0.3, avg_word_count[idx] + 800, label,
            fontsize=8.5, color=colour, va='bottom')

ax2 = ax.twinx()
reading_hours = [w / 250 / 60 for w in avg_word_count]
ax2.plot(years, reading_hours, color='grey', linewidth=1.2,
         linestyle=':', alpha=0.6, label='Estimated reading time (hrs)')
ax2.set_ylabel('Estimated reading time at 250 wpm (hours)', fontsize=10, color='grey')
ax2.tick_params(axis='y', colors='grey')
ax2.set_ylim(0, max(reading_hours) * 1.4)

ax.set_xlabel('Filing Year', fontsize=11)
ax.set_ylabel('Approximate Average Word Count', fontsize=11)
ax.set_title('The Growing Burden of Corporate Disclosure\n'
             'Estimated Average 10-K Length, US Listed Firms (1994–2013)\n'
             'Source: Loughran & McDonald (2011). '
             'See cell notes for derivation methodology.',
             fontsize=12, fontweight='bold', pad=15)
ax.yaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'{x:,.0f}'))
ax.set_xlim(1993, 2014)

lines1, labels1 = ax.get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()
ax.legend(lines1 + lines2, labels1 + labels2, loc='upper left', fontsize=9)

plt.tight_layout()
plt.savefig('10k_growth_aggregate.png', dpi=150, bbox_inches='tight')
plt.show()

# Summary statistics derived from the first and last points of the series
growth_pct = (avg_word_count[-1] - avg_word_count[0]) / avg_word_count[0] * 100
hrs_total = 500 * avg_word_count[-1] / 250 / 60  # 500 filings at 250 wpm, in hours

print("\nKey figures (approximate estimates):")
print(f"   Average 10-K length in 1994 : ~{avg_word_count[0]:,} words")
print(f"   Average 10-K length in 2013 : ~{avg_word_count[-1]:,} words")
print(f"   Growth                       : +{growth_pct:.0f}% over 20 years")
print(f"   Reading time in 2013         : ~{avg_word_count[-1]/250/60:.1f} hours at 250 wpm")
print("\nIf an analyst covers 500 firms:")
print(f"   Total reading time           : ~{hrs_total:,.0f} hours (~{hrs_total/8:.0f} working days)")
print(f"   That is approximately {hrs_total/8/260:.1f} working years per annual filing cycle.")
print("\nNote: All figures are approximate estimates. "
      "See cell documentation for derivation methodology.")


### The scale problem

The aggregate trend becomes concrete when expressed in analyst time. If a single analyst attempted to read every 10-K in a 500-firm corpus, using the 2013 estimated average as the baseline document length, how long would it take? Three reading speeds are considered: fast skimming, normal reading, and close analytical reading with notes.

This is the fundamental motivation for automated text analysis. Not because computers are better readers than analysts, but because the scale of the problem has long since outgrown human capacity.
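The scale argument reduces to a few lines of arithmetic. The sketch below is illustrative only: `corpus_working_years` is a hypothetical helper, the 8-hour day and 260-day working year mirror the assumptions in the code cell that follows, and the ten-analyst team is an assumed scenario used to show how the workload divides.

```python
def corpus_working_years(words_per_filing, n_filings, words_per_minute,
                         team_size=1, hours_per_day=8, days_per_year=260):
    """Working years needed to read a corpus, split evenly across a team."""
    total_hours = words_per_filing / words_per_minute * n_filings / 60
    return total_hours / hours_per_day / days_per_year / team_size


WORDS_2013 = 55_300       # 2013 estimated average from the aggregate series
CLOSE_READING_WPM = 100   # close analytical reading speed used in the cell below

solo_years = corpus_working_years(WORDS_2013, 500, CLOSE_READING_WPM)
team_months = corpus_working_years(WORDS_2013, 500, CLOSE_READING_WPM,
                                   team_size=10) * 12

print(f"One analyst, 500 filings : {solo_years:.1f} working years")
print(f"Team of ten, 500 filings : {team_months:.1f} months")
```

Dividing by `team_size` assumes the work parallelises perfectly, which ignores coordination costs, so the team figure is best read as a lower bound.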

In [None]:
word_count_2013 = avg_word_count[-1]

speed_skim   = 500   # words per minute (fast skim)
speed_normal = 250   # words per minute (normal reading)
speed_close  = 100   # words per minute (close analytical reading with notes)

scenarios = [
    ('Fast skim',               speed_skim,   500, 'steelblue'),
    ('Normal reading',          speed_normal, 500, 'darkorange'),
    ('Close analytical reading', speed_close, 500, 'firebrick'),
]

print("=" * 65)
print(f"{'Scenario':<28} {'Mins/filing':>11} {'Total hours':>11} {'Working yrs':>12}")
print("-" * 65)

fig, axes = plt.subplots(1, 3, figsize=(13, 5))

for i, (label, speed, n_firms, colour) in enumerate(scenarios):
    mins_per_filing = word_count_2013 / speed
    total_hrs       = mins_per_filing * n_firms / 60
    working_days    = total_hrs / 8       # 8-hour working day
    working_years   = working_days / 260  # 260 working days per year

    # Column widths match the header above so the table lines up
    print(f"{label:<28} {mins_per_filing:>11,.0f} {total_hrs:>11,.0f} {working_years:>12.1f}")

    categories = ['Per filing\n(minutes)', f'{n_firms} filings\n(hours)', 'Total\n(working years × 100)']
    values     = [mins_per_filing, total_hrs, working_years * 100]

    bars = axes[i].bar(categories, values, color=colour, alpha=0.75,
                       edgecolor='white', linewidth=1.5)
    axes[i].set_title(label, fontsize=10, fontweight='bold')
    axes[i].set_ylabel('Time', fontsize=9)

    for bar, val in zip(bars, values):
        axes[i].text(bar.get_x() + bar.get_width() / 2,
                     bar.get_height() + max(values) * 0.02,
                     f'{val:.0f}', ha='center', va='bottom', fontsize=9)

print("=" * 65)

fig.suptitle('The Manual Reading Problem: 500-Firm Corpus, 2013 Estimated Average 10-K',
             fontsize=12, fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig('10k_scale_problem.png', dpi=150, bbox_inches='tight')
plt.show()

print("""
Discussion question:

Close analytical reading of 500 10-Ks would take a single analyst
approximately 2.2 working years, based on the 2013 estimated average length.

Even with a team of 10 analysts, that is roughly 2.7 months per filing cycle,
by which point the next set of annual reports is already being drafted.

This is the fundamental motivation for automated text analysis.
Not because computers are smarter than analysts,
but because the scale of the problem has outgrown human capacity.
""")


### Individual firm evidence: Microsoft (MSFT)

The aggregate series in Cell 2 stops at 2013, where direct empirical support from the cited sources ends. The cell below retrieves 10-K filing metadata directly from SEC EDGAR for Microsoft Corporation (CIK: 0000789019) and estimates word counts across successive annual filings. In the pre-embedded estimates, Microsoft's 10-K grows from roughly 31,000 words in 2002 to over 69,000 by 2018. That is more than a doubling, and it plausibly reflects the increasing complexity of the firm's product portfolio and regulatory environment. A single firm cannot confirm an aggregate pattern, but the firm-level data are consistent with the trend documented through 2013 having continued well beyond that date.
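The retrieval step depends on two URL patterns, sketched below without making any network request. The patterns mirror those used in the code cell that follows; the helper names are hypothetical, and the accession number and document name are placeholders that do not refer to a real Microsoft filing.

```python
def build_submissions_url(cik):
    """URL of the EDGAR submissions JSON, which expects a 10-digit zero-padded CIK."""
    return f'https://data.sec.gov/submissions/CIK{int(cik):010d}.json'


def build_filing_url(cik, accession_number, primary_document):
    """URL of a filing's primary document in the EDGAR Archives.

    The path uses the CIK without leading zeros and the accession
    number with its dashes removed.
    """
    cik_path = str(int(cik))                           # '0000789019' -> '789019'
    accession_path = accession_number.replace('-', '')
    return (f'https://www.sec.gov/Archives/edgar/data/'
            f'{cik_path}/{accession_path}/{primary_document}')


MSFT_CIK = '0000789019'  # Microsoft's CIK, as given in the text above

submissions_url = build_submissions_url(MSFT_CIK)
# Placeholder accession number and file name, for illustration only
filing_url = build_filing_url(MSFT_CIK, '0000000000-00-000000', 'example-10k.htm')

print(submissions_url)
print(filing_url)
```

Keeping URL construction separate from the HTTP calls makes the path logic easy to check by eye before any request is sent to EDGAR.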

Word counts are computed from raw filing text after removing HTML markup. They are approximations; a properly processed research pipeline would apply additional cleaning steps. The purpose here is directional illustration rather than precise measurement. If EDGAR is unreachable, the cell falls back to pre-embedded estimates.
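As an illustration of what "additional cleaning steps" might involve, the sketch below extends plain tag stripping in three ways: it drops script and style blocks, decodes HTML entities, and counts only tokens that contain at least one letter. These particular steps are assumptions chosen for illustration, not a description of any published pipeline, and `count_words_cleaned` is a hypothetical name. The sample document is invented.

```python
import html
import re


def count_words_cleaned(raw_html):
    """Approximate word count after a few extra cleaning steps."""
    # Drop script and style blocks, whose contents are not narrative text
    text = re.sub(r'(?is)<(script|style)\b.*?</\1\s*>', ' ', raw_html)
    # Remove the remaining tags
    text = re.sub(r'<[^>]+>', ' ', text)
    # Decode entities such as &amp; and &nbsp;
    text = html.unescape(text)
    # Count only tokens containing a letter, so bare numbers from
    # financial tables are excluded (an assumed design choice)
    return sum(1 for token in text.split() if re.search(r'[A-Za-z]', token))


# Invented sample: one sentence of narrative plus a small numeric table
sample = ('<html><style>p {color: red;}</style><body>'
          '<p>Revenue increased&nbsp;by 12% in fiscal 2018.</p>'
          '<table><tr><td>1,234</td><td>5,678</td></tr></table>'
          '</body></html>')

print(count_words_cleaned(sample))
```

Whether numeric tokens should count as words depends on the research question. Excluding them shifts the measure towards narrative disclosure rather than total document size.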

In [None]:
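def get_word_count_from_filing(url, headers, timeout=15):
    """Return an approximate word count for the filing at `url`, or None on failure."""
    try:
        r = requests.get(url, headers=headers, timeout=timeout)
        if r.status_code != 200:
            return None
        text = re.sub(r'<[^>]+>', ' ', r.text)    # strip HTML tags
        text = re.sub(r'\s+', ' ', text).strip()  # collapse runs of whitespace
        return len(text.split())
    except requests.RequestException:
        # Network errors and timeouts are treated as missing data
        return None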

def get_msft_10k_filings(n=8):
    cik = '0000789019'
    url = f'https://data.sec.gov/submissions/CIK{cik}.json'
    try:
        r = requests.get(url, headers=HEADERS, timeout=10)
        r.raise_for_status()
        data = r.json()

        def extract_10k(block):
            out = []
            for form, date, acc, doc in zip(
                block.get('form', []), block.get('filingDate', []),
                block.get('accessionNumber', []), block.get('primaryDocument', [])
            ):
                if form == '10-K' and doc:  # skip entries with no primary document listed
                    acc_clean = acc.replace('-', '')
                    out.append({'year': int(date[:4]), 'date': date,
                                'accession': acc,
                                'url': (f'https://www.sec.gov/Archives/edgar/data/'
                                        f'{cik.lstrip("0")}/{acc_clean}/{doc}')})
            return out

        results = extract_10k(data['filings']['recent'])
        for page in data['filings'].get('files', []):
            pr = requests.get(
                f'https://data.sec.gov/submissions/{page["name"]}',
                headers=HEADERS, timeout=10
            )
            pr.raise_for_status()  # stop here rather than try to parse an error page
            results += extract_10k(pr.json())
            time.sleep(0.5)

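        results = sorted(results, key=lambda x: x['date'])
        if len(results) <= n:
            return results
        # Sample n filings evenly across the history, always keeping the
        # earliest and the most recent filing
        idx = np.linspace(0, len(results) - 1, n).round().astype(int)
        return [results[i] for i in idx]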

    except Exception as exc:
        print(f'   EDGAR submissions fetch failed: {exc}')
        return []

print("Fetching Microsoft (MSFT) 10-K filing list from SEC EDGAR...")
filings = get_msft_10k_filings(n=10)

# Pre-embedded estimates, used when live retrieval yields nothing usable
fallback_data = {
    2002: 30800, 2004: 40000, 2006: 49500, 2008: 51200,
    2010: 52100, 2012: 56000, 2014: 60200, 2016: 64800,
    2018: 69400, 2020: 69100
}

msft_data = {}
if filings:
    print(f"Found {len(filings)} 10-K filings. Fetching word counts...")
    print("   (Fetching at 1-second intervals to respect EDGAR rate limits)\n")
    for filing in filings:
        wc = get_word_count_from_filing(filing['url'], HEADERS)
        if wc and wc > 5000:  # discard failed downloads and near-empty pages
            msft_data[filing['year']] = wc
            print(f"   {filing['year']} ({filing['date']}) : {wc:,} words")
        time.sleep(1)

# Fall back if the filing list was empty or every download failed
live_data = bool(msft_data)
if not live_data:
    print("No usable data retrieved from SEC EDGAR. Using pre-embedded data instead.")
    msft_data = fallback_data

if msft_data:
    yrs = sorted(msft_data.keys())
    wcs = [msft_data[y] for y in yrs]

    fig, ax = plt.subplots(figsize=(11, 5))
    ax.fill_between(yrs, wcs, alpha=0.15, color='#0078d4')
    ax.plot(yrs, wcs, color='#0078d4', linewidth=2.5,
            marker='o', markersize=7, label='Microsoft 10-K word count')

    for yr, wc in zip(yrs, wcs):
        ax.annotate(f'{wc:,}', (yr, wc), textcoords='offset points',
                    xytext=(0, 10), ha='center', fontsize=8)

    data_label = 'Live from SEC EDGAR' if live_data else 'Pre-embedded fallback (EDGAR source)'
    ax.set_title(f'Microsoft (MSFT): 10-K Word Count Over Time\n'
                 f'Source: SEC EDGAR | {data_label}\n'
                 f'Note: word counts are approximate (raw text, HTML stripped)',
                 fontsize=11, fontweight='bold')
    ax.set_xlabel('Filing Year', fontsize=11)
    ax.set_ylabel('Approximate Word Count', fontsize=11)
    ax.yaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'{x:,.0f}'))
    ax.legend(fontsize=9)
    plt.tight_layout()
    plt.savefig('msft_10k_growth.png', dpi=150, bbox_inches='tight')
    plt.show()

    if len(wcs) >= 2:
        growth = (wcs[-1] - wcs[0]) / wcs[0] * 100
        print(f"\nMicrosoft 10-K growth: {wcs[0]:,} words ({yrs[0]}) to "
              f"{wcs[-1]:,} words ({yrs[-1]}) = +{growth:.0f}%")
        print("   The firm-level trend mirrors the aggregate picture from Cell 2.")