# Phase 1: Data Acquisition & Exploration

## SEC 10-K Risk Factor Intelligence

This notebook covers:
1. Loading EDGAR-CORPUS from Hugging Face
2. Filtering to 10-K filings (2006-2020)
3. Extracting Item 1A (Risk Factors) sections
4. Manual sampling to understand document structure
5. Basic EDA: document lengths, word counts, industry distribution

In [None]:
# Install required packages if needed
# !pip install datasets pandas numpy matplotlib seaborn tqdm

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datasets import load_dataset
from tqdm import tqdm
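import os
import re
import warnings
warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')

# Make sure the figure output directory exists before any savefig call
os.makedirs('../outputs', exist_ok=True)

print('Libraries loaded successfully')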

## 1. Load EDGAR-CORPUS Dataset

The EDGAR-CORPUS dataset on Hugging Face contains annual reports (10-K filings) from 1993-2020.
We'll focus on filings from 2006-2020, since Item 1A has been required for fiscal years ending on or after December 1, 2005.

In [None]:
# Load the dataset - this may take a few minutes on first run
# The dataset is organized by year
print("Loading EDGAR-CORPUS dataset...")
print("Note: First load will download ~2GB of data")

# Load a single year first to understand the structure
sample_year = load_dataset(
    "eloukas/edgar-corpus",
    "year_2020",
    split="train",
    trust_remote_code=True
)

print(f"\nDataset loaded for 2020")
print(f"Number of filings: {len(sample_year)}")
print(f"\nColumns: {sample_year.column_names}")

In [None]:
# Examine sample record structure
sample_record = sample_year[0]
print("Sample record keys:")
for key in sample_record.keys():
    value = sample_record[key]
    if isinstance(value, str) and len(value) > 100:
        print(f"  {key}: {type(value).__name__} (length: {len(value)})")
    else:
        print(f"  {key}: {value}")

## 2. Load Multiple Years (2006-2020)

Item 1A (Risk Factors) became mandatory for fiscal years ending on or after December 1, 2005, so we'll use 2006-2020 data.

In [None]:
# Load all years from 2006-2020
years = range(2006, 2021)
all_filings = []

for year in tqdm(years, desc="Loading years"):
    try:
        year_data = load_dataset(
            "eloukas/edgar-corpus",
            f"year_{year}",
            split="train",
            trust_remote_code=True
        )
        # Convert to pandas and add year column
        df_year = year_data.to_pandas()
        df_year['filing_year'] = year
        all_filings.append(df_year)
        print(f"  {year}: {len(df_year)} filings")
    except Exception as e:
        print(f"  {year}: Error - {e}")

# Combine all years
df_all = pd.concat(all_filings, ignore_index=True)
print(f"\nTotal filings loaded: {len(df_all):,}")
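
The loop above prints an error and moves on when a year fails to load, so a partial download can go unnoticed in the combined frame. A minimal sketch of recording failures explicitly follows; `load_years` and its `loader` argument are hypothetical names introduced for illustration and are not part of the `datasets` API:

```python
def load_years(years, loader):
    """Call loader(year) for each year and record failures instead of only printing them.

    `loader` is any callable that returns the data for one year (hypothetical;
    in this notebook it would wrap the load_dataset call).
    Returns (results, failed): results maps year -> data, failed maps year -> error text.
    """
    results, failed = {}, {}
    for year in years:
        try:
            results[year] = loader(year)
        except Exception as exc:  # broad on purpose: any load error counts as a failure
            failed[year] = f"{type(exc).__name__}: {exc}"
    return results, failed


# Stand-in loader that fails for one year, to show the bookkeeping
def fake_loader(year):
    if year == 2008:
        raise ValueError("config not found")
    return [f"filing_{year}_a", f"filing_{year}_b"]


results, failed = load_years(range(2006, 2011), fake_loader)
print("Loaded years:", sorted(results))
print("Failed years:", failed)
```

Checking `failed` before concatenating makes it clear whether the 2006-2020 range is complete.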

In [None]:
# Check dataset structure
print("Dataset Info:")
print(f"Shape: {df_all.shape}")
print(f"\nColumns: {df_all.columns.tolist()}")
print(f"\nData types:")
print(df_all.dtypes)

## 3. Filter to 10-K Filings Only

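EDGAR-CORPUS is built from annual reports, so the form types it contains are 10-K variants (for example 10-K405 or 10-KSB alongside plain 10-K). Where a form-type field is available, we keep the 10-K forms and their 10-K/A amendments.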

In [None]:
# Check filing types if form_type column exists
if 'form_type' in df_all.columns:
    print("Filing types distribution:")
    print(df_all['form_type'].value_counts().head(20))
elif 'filename' in df_all.columns:
    # Try to extract form type from filename
    print("Sample filenames:")
    print(df_all['filename'].head(10))

In [None]:
# Filter to 10-K filings
# Adjust this based on actual column names in the dataset
if 'form_type' in df_all.columns:
    # Match at the start of the form type so that 'NT 10-K' (late-filing notice) is excluded;
    # variants such as '10-K/A' and '10-K405' still pass
    is_10k = df_all['form_type'].str.strip().str.match(r'10-K', case=False, na=False)
    df_10k = df_all[is_10k].copy()
else:
    # No form_type column: keep every row and rely on the dataset's own scope
    df_10k = df_all.copy()
    print("Note: form_type column not found, proceeding with all data")

print(f"\n10-K filings: {len(df_10k):,}")
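
Substring matching on a form type can let in forms that merely mention 10-K. A minimal sketch, assuming form types follow EDGAR's usual labels (such as `10-K`, `10-K/A`, `10-K405`, `NT 10-K`); the helper name `is_10k_form` and its `strict` flag are hypothetical:

```python
import re

# Strict: only the base form and its amendment
STRICT_10K = re.compile(r'^10-K(?:/A)?$', re.IGNORECASE)
# Loose: any form type that starts with 10-K (also 10-K405, 10-KSB, ...)
ANY_10K_VARIANT = re.compile(r'^10-K', re.IGNORECASE)


def is_10k_form(form_type, strict=True):
    """Return True if form_type denotes a 10-K annual report."""
    if not isinstance(form_type, str):
        return False
    pattern = STRICT_10K if strict else ANY_10K_VARIANT
    return pattern.search(form_type.strip()) is not None


for ft in ['10-K', '10-K/A', '10-K405', 'NT 10-K', '10-Q', None]:
    substring_hit = isinstance(ft, str) and '10-K' in ft.upper()
    print(repr(ft), substring_hit, is_10k_form(ft), is_10k_form(ft, strict=False))
```

The printed columns are the substring test, the strict test, and the variant test; `NT 10-K` passes only the substring test.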

## 4. Extract Item 1A (Risk Factors) Section

Item 1A contains the Risk Factors disclosure. We need to extract this section from each filing.

In [None]:
def extract_item_1a(text):
    """
    Extract Item 1A (Risk Factors) section from 10-K filing text.
    
    Returns:
        str: Extracted risk factors text, or None if not found
    """
    if not isinstance(text, str) or len(text) == 0:
        return None
    
    # Header variants: "Item 1A. Risk Factors", "Item 1A - Risk Factors", "Item 1A: Risk Factors"
    start_re = re.compile(r'item\s*1a\s*[.:]?\s*-?\s*risk\s*factors', re.IGNORECASE)
    
    # The section ends at the next item header (Item 1B or Item 2)
    end_re = re.compile(
        r'item\s*1b\.?\s*unresolved\s*staff\s*comments'
        r'|item\s*2\.?\s*properties'
        r'|item\s*1b\.'
        r'|item\s*2\.',
        re.IGNORECASE,
    )
    
    # Match case-insensitively on the original text: text.lower() can change the
    # string length for some Unicode characters and shift the offsets.
    for start_match in start_re.finditer(text):
        start_pos = start_match.start()
        # Look for the end marker after the header (100-character offset)
        end_match = end_re.search(text, start_pos + 100)
        end_pos = end_match.start() if end_match else len(text)
        item_1a_text = text[start_pos:end_pos].strip()
        
        # Validate: a real risk factors section should be at least 500 characters.
        # A shorter match is usually a table-of-contents entry or a cross-reference,
        # so skip it and try the next occurrence of the header.
        if len(item_1a_text) >= 500:
            return item_1a_text
    
    return None

print("Item 1A extraction function defined")
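
The header text usually occurs more than once in a filing: once in the table of contents and again where the section actually starts. A minimal sketch on a synthetic filing (the text is made up for illustration) lists every match offset, which is a quick way to see which occurrence an extractor picks up:

```python
import re

HEADER = re.compile(r'item\s*1a\.?\s*risk\s*factors', re.IGNORECASE)

# Synthetic filing: the header appears in the table of contents and again in the body
toc = (
    "TABLE OF CONTENTS\n"
    "Item 1. Business 3\n"
    "Item 1A. Risk Factors 12\n"
    "Item 1B. Unresolved Staff Comments 30\n"
    "Item 2. Properties 31\n"
)
body = (
    "Item 1. Business\nWe make widgets.\n"
    "ITEM 1A. RISK FACTORS\n"
    + "Our business is subject to many risks. " * 20
    + "\nItem 1B. Unresolved Staff Comments\nNone.\n"
)
filing = toc + body

positions = [m.start() for m in HEADER.finditer(filing)]
print("Header match offsets:", positions)
for pos in positions:
    # Show a short window after each match to tell TOC entries from the section body
    print(repr(filing[pos:pos + 40]))
```

Running the same loop on a handful of real filings is a cheap way to decide how the extractor should treat the first match.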

In [None]:
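# Identify the text column (varies by dataset structure)
text_col = None
for col in ['text', 'section_text', 'filing_text', 'content']:
    if col in df_10k.columns:
        text_col = col
        break

# Assumption: the Hugging Face release of EDGAR-CORPUS stores each filing already
# split by item, with Item 1A in a 'section_1A' column. Verify this against the
# column list printed earlier.
presplit_col = 'section_1A' if 'section_1A' in df_10k.columns else None

if text_col:
    print(f"Using text column: '{text_col}'")
elif presplit_col:
    print(f"Using pre-split Item 1A column: '{presplit_col}'")
else:
    print("Available columns:", df_10k.columns.tolist())
    print("\nPlease identify the correct text column")

In [None]:
# Extract Item 1A from all filings
if text_col:
    print("Extracting Item 1A sections...")
    tqdm.pandas(desc="Extracting")
    df_10k['item_1a'] = df_10k[text_col].progress_apply(extract_item_1a)
elif presplit_col:
    # Apply the same 500-character minimum that extract_item_1a uses
    stripped = df_10k[presplit_col].fillna('').str.strip()
    df_10k['item_1a'] = stripped.where(stripped.str.len() >= 500)
else:
    raise KeyError("No text column identified; set text_col before extracting Item 1A")

# Check extraction success rate
extracted_count = df_10k['item_1a'].notna().sum()
total_count = len(df_10k)
print(f"\nExtraction results:")
print(f"  Successfully extracted: {extracted_count:,} ({extracted_count/total_count*100:.1f}%)")
print(f"  Failed/Not found: {total_count - extracted_count:,}")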
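
The overall success rate can hide year-specific problems, for example a header format that the patterns miss in early filings. A minimal sketch of a per-year breakdown on a toy frame (the data is made up; the column names follow the ones used above):

```python
import pandas as pd

# Toy frame standing in for df_10k: item_1a is None where extraction failed
toy = pd.DataFrame({
    'filing_year': [2006, 2006, 2006, 2007, 2007, 2008],
    'item_1a': ['text', None, 'text', 'text', 'text', None],
})

by_year = (
    toy.assign(extracted=toy['item_1a'].notna())
       .groupby('filing_year')['extracted']
       .agg(total='size', extracted='sum', rate='mean')
)
print(by_year)
```

Years with a noticeably lower `rate` are good candidates for the manual review in the next section.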

In [None]:
# Filter to only filings with Item 1A extracted
df_risk = df_10k[df_10k['item_1a'].notna()].copy()
print(f"Final dataset size: {len(df_risk):,} filings with Risk Factors")

## 5. Sample Filings for Manual Review

Let's examine 3-5 sample filings to understand the structure of Item 1A.

In [None]:
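# Sample diverse filings (different years, sizes)
if len(df_risk) > 0:
    # Draw one random filing per year, then keep up to 5 of those years.
    # groupby(...).sample keeps the filing_year column, which the loop below prints.
    per_year = df_risk.groupby('filing_year').sample(n=1, random_state=42)
    samples = per_year.sample(n=min(5, len(per_year)), random_state=42).sort_values('filing_year')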
    
    print("Sample filings for manual review:")
    print("=" * 80)
    
    for idx, (_, row) in enumerate(samples.iterrows()):
        print(f"\n--- Sample {idx + 1} (Year: {row.get('filing_year', 'N/A')}) ---")
        
        # Print metadata if available
        for col in ['cik', 'company_name', 'ticker', 'sic_code', 'filename']:
            if col in row.index and pd.notna(row[col]):
                print(f"{col}: {row[col]}")
        
        # Print first 2000 characters of Item 1A
        item_1a_text = row['item_1a']
        print(f"\nItem 1A length: {len(item_1a_text):,} characters")
        print(f"\nFirst 2000 characters:")
        print("-" * 40)
        print(item_1a_text[:2000])
        print("\n" + "=" * 80)

## 6. Exploratory Data Analysis

In [None]:
# Calculate text statistics
df_risk['item_1a_length'] = df_risk['item_1a'].str.len()
df_risk['item_1a_word_count'] = df_risk['item_1a'].str.split().str.len()

print("Item 1A Text Statistics:")
print(f"\nCharacter count:")
print(df_risk['item_1a_length'].describe())
print(f"\nWord count:")
print(df_risk['item_1a_word_count'].describe())
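
`describe()` reports quartiles, but extraction errors tend to sit in the extreme tails: a section cut off early, or one that ran on to the end of the filing. A minimal sketch that flags both tails, assuming the 1st and 99th percentiles as cutoffs (an arbitrary choice for illustration) and toy word counts in place of `df_risk['item_1a_word_count']`:

```python
import pandas as pd

# Toy word counts standing in for df_risk['item_1a_word_count']
word_counts = pd.Series([90, 4200, 5100, 6100, 7300, 8000, 9500, 11000, 12500, 250000])

# Percentile cutoffs; anything outside them is worth a manual look
low, high = word_counts.quantile([0.01, 0.99])
too_short = word_counts < low
too_long = word_counts > high

print(f"Cutoffs: {low:,.0f} to {high:,.0f} words")
print("Suspiciously short:", word_counts[too_short].tolist())
print("Suspiciously long:", word_counts[too_long].tolist())
```

Flagged rows are candidates for inspection, not automatic removal; a long section can be legitimate.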

In [None]:
# EDA overview: 2x2 grid of plots
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Plot 1: Filings per year
filings_by_year = df_risk.groupby('filing_year').size()
axes[0, 0].bar(filings_by_year.index, filings_by_year.values, color='steelblue')
axes[0, 0].set_xlabel('Year')
axes[0, 0].set_ylabel('Number of Filings')
axes[0, 0].set_title('10-K Filings with Item 1A by Year')
axes[0, 0].tick_params(axis='x', rotation=45)

# Plot 2: Distribution of document lengths
axes[0, 1].hist(df_risk['item_1a_word_count'], bins=50, color='steelblue', edgecolor='white')
axes[0, 1].set_xlabel('Word Count')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].set_title('Distribution of Item 1A Word Counts')
axes[0, 1].axvline(df_risk['item_1a_word_count'].median(), color='red', linestyle='--', label=f"Median: {df_risk['item_1a_word_count'].median():,.0f}")
axes[0, 1].legend()

# Plot 3: Average word count by year
avg_words_by_year = df_risk.groupby('filing_year')['item_1a_word_count'].mean()
axes[1, 0].plot(avg_words_by_year.index, avg_words_by_year.values, marker='o', color='steelblue', linewidth=2)
axes[1, 0].set_xlabel('Year')
axes[1, 0].set_ylabel('Average Word Count')
axes[1, 0].set_title('Average Item 1A Length Over Time')
axes[1, 0].tick_params(axis='x', rotation=45)

# Plot 4: Word count by year (boxplot)
df_risk.boxplot(column='item_1a_word_count', by='filing_year', ax=axes[1, 1])
axes[1, 1].set_xlabel('Year')
axes[1, 1].set_ylabel('Word Count')
axes[1, 1].set_title('Item 1A Word Count Distribution by Year')
axes[1, 1].tick_params(axis='x', rotation=45)
plt.suptitle('')  # Remove automatic title

plt.tight_layout()
plt.savefig('../outputs/eda_overview.png', dpi=150, bbox_inches='tight')
plt.show()

print("\nFigure saved to ../outputs/eda_overview.png")

In [None]:
# Industry analysis (if SIC codes available)
sic_col = None
for col in ['sic_code', 'sic', 'industry_code']:
    if col in df_risk.columns:
        sic_col = col
        break

if sic_col:
    print(f"\nFilings by Industry (SIC Code - Top 20):")
    sic_counts = df_risk[sic_col].value_counts().head(20)
    print(sic_counts)
    
    # Plot
    fig, ax = plt.subplots(figsize=(12, 6))
    sic_counts.plot(kind='bar', ax=ax, color='steelblue')
    ax.set_xlabel('SIC Code')
    ax.set_ylabel('Number of Filings')
    ax.set_title('Top 20 Industries by Filing Count')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.savefig('../outputs/industry_distribution.png', dpi=150, bbox_inches='tight')
    plt.show()
else:
    print("SIC code column not found in dataset")
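
Raw four-digit SIC codes are fine-grained, so a top-20 list can split one sector across many bars. A minimal sketch of rolling codes up to SIC divisions by their first two digits; the helper name `sic_division` is hypothetical, and the label for codes 40-49 is shortened from the official "Transportation, Communications, Electric, Gas, and Sanitary Services":

```python
# SIC division ranges, keyed by the first two digits of the code (the major group)
SIC_DIVISIONS = [
    (1, 9, 'Agriculture, Forestry, Fishing'),
    (10, 14, 'Mining'),
    (15, 17, 'Construction'),
    (20, 39, 'Manufacturing'),
    (40, 49, 'Transportation, Communications, Utilities'),
    (50, 51, 'Wholesale Trade'),
    (52, 59, 'Retail Trade'),
    (60, 67, 'Finance, Insurance, Real Estate'),
    (70, 89, 'Services'),
    (91, 97, 'Public Administration'),
    (99, 99, 'Nonclassifiable'),
]


def sic_division(code):
    """Map a SIC code (int, float, or numeric string) to its division name."""
    try:
        # Go through float so values like 7372.0 or '0100' are handled, then pad to 4 digits
        digits = str(int(float(str(code).strip()))).zfill(4)
    except (ValueError, OverflowError):
        return 'Unknown'
    major_group = int(digits[:2])
    for first, last, name in SIC_DIVISIONS:
        if first <= major_group <= last:
            return name
    return 'Unknown'


for code in [7372, '2834', 100.0, 6022, None]:
    print(code, '->', sic_division(code))
```

With a SIC column present, something like `df_risk[sic_col].map(sic_division).value_counts()` would give the division-level distribution.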

## 7. Save Processed Dataset

In [None]:
# Select columns to save
# Adjust based on available columns
# Metadata columns come first, in the order listed; missing ones are dropped below
meta_cols = ['cik', 'company_name', 'ticker', 'sic_code', 'filename', 'accession_number']
cols_to_keep = meta_cols + ['filing_year', 'item_1a', 'item_1a_length', 'item_1a_word_count']

# Keep only existing columns
cols_to_keep = [c for c in cols_to_keep if c in df_risk.columns]

df_final = df_risk[cols_to_keep].copy()

print(f"Final dataset shape: {df_final.shape}")
print(f"Columns: {df_final.columns.tolist()}")

In [None]:
# Save to parquet (efficient for large text data)
import os

output_path = '../data/processed/risk_factors_2006_2020.parquet'
os.makedirs(os.path.dirname(output_path), exist_ok=True)  # create the folder if it is missing
df_final.to_parquet(output_path, index=False)
print(f"Dataset saved to: {output_path}")
print(f"File size: {os.path.getsize(output_path) / 1024 / 1024:.1f} MB")

In [None]:
# Also save a smaller CSV sample for quick inspection
sample_path = '../data/processed/risk_factors_sample_1000.csv'
df_final.sample(min(1000, len(df_final)), random_state=42).to_csv(sample_path, index=False)
print(f"Sample saved to: {sample_path}")

## Phase 1 Summary

### Dataset Overview
- **Total 10-K filings processed**: [to be filled after running]
- **Filings with Item 1A extracted**: [to be filled]
- **Date range**: 2006-2020

### Key Observations
1. Item 1A sections have grown longer over time (regulatory expansion)
2. Median word count: ~X,XXX words
3. Top industries: [to be filled]

### Next Steps (Phase 2)
- Build text preprocessing pipeline
- Handle HTML artifacts and legal boilerplate
- Implement sentence segmentation for classification