# 10-K Filings: Data Extraction

This notebook demonstrates how to download 10-K filings from the SEC EDGAR database using the 10-K Analysis Toolkit.

In [None]:
# Import libraries
import sys
import os
import pandas as pd
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.notebook import tqdm

# Add project root to path for importing local modules
sys.path.append('..')

# Import the SECDataLoader class
from src.data.data_loader import SECDataLoader

# Set plot style
plt.style.use('fivethirtyeight')
sns.set_palette('Set2')

# Set pandas display options
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 100)

## Configure the Data Loader

First, we'll create an instance of the `SECDataLoader` class, which we'll use to download 10-K filings from the SEC EDGAR database.

In [None]:
# Create the data directories if they don't exist
os.makedirs('../data/cache', exist_ok=True)
os.makedirs('../data/raw', exist_ok=True)

# Initialize the data loader
loader = SECDataLoader(cache_dir='../data/cache')
print(f"Data loader initialized with cache directory: {loader.cache_dir}")

## Define Companies and Time Period

Now, let's define the list of companies (by ticker symbol) and the years for which we want to download 10-K filings.

In [None]:
# Define list of ticker symbols
tickers = ['AAPL', 'MSFT', 'GOOGL', 'AMZN', 'META']

# Define years to download
years = [2019, 2020, 2021, 2022, 2023]

print(f"Companies: {', '.join(tickers)}")
print(f"Years: {', '.join(map(str, years))}")

## Looking Up CIK Numbers

The SEC identifies companies by their Central Index Key (CIK). Let's look up the CIK numbers for our target companies.

In [None]:
# Look up CIK numbers for each ticker
cik_mapping = {}

for ticker in tqdm(tickers, desc="Looking up CIK numbers"):
    cik = loader.get_cik_for_ticker(ticker)
    cik_mapping[ticker] = cik
    
# Display the results
cik_df = pd.DataFrame(list(cik_mapping.items()), columns=['Ticker', 'CIK'])
cik_df
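
The internals of `get_cik_for_ticker` are not shown in this notebook. As a rough sketch of what such a lookup might do, the snippet below builds a ticker-to-CIK dictionary from a payload shaped like the SEC's `company_tickers.json` file and zero-pads each CIK to the 10 digits used in EDGAR API URLs. The `build_cik_lookup` helper and the inline sample payload are illustrative assumptions, not part of the toolkit.

```python
def build_cik_lookup(payload):
    """Map upper-case ticker -> 10-digit, zero-padded CIK string.

    `payload` is assumed to look like the SEC's company_tickers.json:
    a dict of records with "cik_str", "ticker" and "title" keys.
    """
    lookup = {}
    for entry in payload.values():
        lookup[entry["ticker"].upper()] = str(entry["cik_str"]).zfill(10)
    return lookup


# Small inline sample in the same shape, so the sketch runs offline.
sample_payload = {
    "0": {"cik_str": 320193, "ticker": "AAPL", "title": "Apple Inc."},
    "1": {"cik_str": 789019, "ticker": "MSFT", "title": "MICROSOFT CORP"},
}

cik_lookup = build_cik_lookup(sample_payload)
print(cik_lookup["AAPL"])  # 0000320193
```

Zero-padding matters mainly when a CIK is embedded in a URL or joined against other EDGAR data. If the loader returns integers or unpadded strings, normalizing them once up front avoids mismatches later.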

## Download 10-K Filings

Now, let's download the 10-K filings for our target companies and years. This may take some time, as we need to make multiple requests to the SEC EDGAR database.

In [None]:
# Download filings
print(f"Downloading 10-K filings for {len(tickers)} companies over {len(years)} years...")
print("This may take several minutes depending on the number of filings.")

filings_df = loader.load_filings(tickers, years=years)

print(f"Downloaded {len(filings_df)} filings.")
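
The SEC asks automated clients to stay at or below 10 requests per second and to send a descriptive `User-Agent` header. Whether `SECDataLoader` already throttles its requests is not shown here. If it does not, a minimal throttle along the following lines could be called before each request. The `Throttle` class is a hypothetical helper written for this sketch, not part of the toolkit.

```python
import time


class Throttle:
    """Enforce a minimum interval between calls (hypothetical helper)."""

    def __init__(self, max_per_second=10, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        """Block until at least `min_interval` has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Demo: three throttled "requests" at 10 per second.
throttle = Throttle(max_per_second=10)
start = time.monotonic()
for _ in range(3):
    throttle.wait()  # a real client would issue its HTTP request after this
print(f"Elapsed: {time.monotonic() - start:.2f} s")
```

The clock and sleep functions are injectable so the throttle can be tested without real delays.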

## Explore the Downloaded Data

Let's take a look at the downloaded filings to get a better understanding of the data.

In [None]:
# Check basic information
print(f"Downloaded {len(filings_df)} filings for {filings_df['ticker'].nunique()} companies.")
print(f"Years covered: {sorted(filings_df['filing_year'].unique())}")
print("\nColumns in the DataFrame:")
print(filings_df.columns.tolist())

In [None]:
# Show the first few rows (excluding HTML content for brevity)
display_cols = [col for col in filings_df.columns if col != 'filing_html']
filings_df[display_cols].head()

## Check Coverage

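Let's check the coverage of our downloaded data with a pivot table of filing counts by ticker and year. A count of 0 means a filing is missing for that company and year, and a count above 1 means more than one filing was captured.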

In [None]:
# Create a pivot table to check coverage
coverage = pd.pivot_table(
    filings_df,
    values='accession_number',
    index='ticker',
    columns='filing_year',
    aggfunc='count',
    fill_value=0
)

coverage
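
A pivot table only shows the years and tickers that appear in the data, so a company or year with no filings at all would be absent rather than shown as 0. One way to list the gaps explicitly is to reindex the counts against the full set of expected tickers and years. The sketch below uses a small toy DataFrame with the same column names as `filings_df`. The toy data and the `expected_tickers` and `expected_years` names are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for filings_df (column names match the notebook).
toy = pd.DataFrame({
    "ticker": ["AAPL", "AAPL", "MSFT"],
    "filing_year": [2022, 2023, 2023],
    "accession_number": ["a-1", "a-2", "m-1"],
})
expected_tickers = ["AAPL", "MSFT"]
expected_years = [2022, 2023]

# Count filings per (ticker, year), then force every expected row and column to exist.
counts = (
    toy.groupby(["ticker", "filing_year"])["accession_number"]
    .count()
    .unstack(fill_value=0)
    .reindex(index=expected_tickers, columns=expected_years, fill_value=0)
)

# Collect the (ticker, year) pairs with no filing.
missing = [
    (ticker, int(year))
    for ticker in counts.index
    for year in counts.columns
    if counts.loc[ticker, year] == 0
]
print(missing)  # [('MSFT', 2022)]
```

With the real data, `expected_tickers` and `expected_years` would be the `tickers` and `years` lists defined earlier.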

## Visualize Filing Dates

Let's visualize when the filings were submitted to the SEC.

In [None]:
# Convert filing_date to datetime if needed (also accepts timezone-aware columns)
if not pd.api.types.is_datetime64_any_dtype(filings_df['filing_date']):
    filings_df['filing_date'] = pd.to_datetime(filings_df['filing_date'])

# Plot filing dates
plt.figure(figsize=(12, 6))

# Get unique companies
companies = filings_df['ticker'].unique()

# Plot one horizontal row of points per company
for i, company in enumerate(companies):
    company_filings = filings_df[filings_df['ticker'] == company]
    plt.scatter(
        company_filings['filing_date'],
        [i] * len(company_filings),
        label=company,
        s=100
    )

# Set y-ticks to company names
plt.yticks(range(len(companies)), companies)

# Set labels and title
plt.xlabel('Filing Date', fontsize=14)
plt.title('10-K Filing Dates by Company', fontsize=16, fontweight='bold')

# Format x-axis with years
import matplotlib.dates as mdates
plt.gca().xaxis.set_major_locator(mdates.YearLocator())
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y'))

# Add grid for better readability
plt.grid(True, axis='x', linestyle='--', alpha=0.7)

plt.tight_layout()
plt.show()

## Check File Sizes

Let's check the sizes of the downloaded filings to get an idea of how much data we're working with.

In [None]:
# Approximate HTML content size in KB from the character count
# (a proxy for bytes on disk; non-string values count as 0)
filings_df['html_size_kb'] = filings_df['filing_html'].apply(
    lambda x: len(x) / 1024 if isinstance(x, str) else 0
)

# Plot file sizes
plt.figure(figsize=(12, 6))
sns.boxplot(x='ticker', y='html_size_kb', data=filings_df)
plt.title('10-K Filing Sizes by Company', fontsize=16, fontweight='bold')
plt.xlabel('Company', fontsize=14)
plt.ylabel('File Size (KB)', fontsize=14)
plt.grid(True, axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

## Sample HTML Content

Let's take a quick look at a sample of the HTML content to understand what we're working with.

In [None]:
# Get a sample filing
sample_filing = filings_df.iloc[0]
print(f"Sample filing: {sample_filing['ticker']} from {sample_filing['filing_date']}")

# Show first 1000 characters of HTML
html_preview = sample_filing['filing_html'][:1000] if isinstance(sample_filing['filing_html'], str) else ''
print(f"\nHTML preview (first 1000 characters):\n{html_preview}...")

## Save the Downloaded Data

Now that we have downloaded and explored the data, let's save it to disk for use in later analysis.

In [None]:
# Save to pickle file
output_file = '../data/raw/10k_filings.pkl'
filings_df.to_pickle(output_file)
print(f"Saved {len(filings_df)} filings to {output_file}")

# Save metadata only (without HTML content) to CSV for easy inspection
metadata_file = '../data/raw/10k_filings_metadata.csv'
filings_df[display_cols].to_csv(metadata_file, index=False)
print(f"Saved metadata to {metadata_file}")
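
Before relying on the saved file in later notebooks, it can be worth confirming that it reloads cleanly. The sketch below shows the round trip on a toy DataFrame written to a temporary directory. The toy data and file name are assumptions for illustration. With the real data, the same check would read `output_file` back and compare it with `filings_df`.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for filings_df.
toy = pd.DataFrame({
    "ticker": ["AAPL"],
    "filing_html": ["<html><body>10-K</body></html>"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_filings.pkl")
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# DataFrame.equals compares shape, dtypes and values.
round_trip_ok = restored.equals(toy)
print(f"Round trip OK: {round_trip_ok}")
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.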

## Next Steps

In the next notebook (`2_data_preprocessing.ipynb`), we'll preprocess the downloaded filings by:
1. Extracting individual sections (e.g., Risk Factors, MD&A, Financial Statements)
2. Cleaning the text for analysis
3. Extracting tables for financial data