# Phase 2: Text Preprocessing & Risk Segmentation

## SEC 10-K Risk Factor Intelligence

This notebook covers:
1. Text cleaning (HTML, navigation text, normalization)
2. Sentence and paragraph segmentation
3. Risk paragraph extraction
4. Initial risk category identification
5. Building a reusable preprocessing pipeline

In [None]:
import pandas as pd
import numpy as np
import re
import spacy
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

# Load spaCy model for sentence segmentation
try:
    nlp = spacy.load('en_core_web_sm')
except OSError:
    print('Downloading spaCy model...')
    !python -m spacy download en_core_web_sm
    nlp = spacy.load('en_core_web_sm')

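# Cap document length for spaCy. Note that spaCy's default max_length is
# 1,000,000 characters, so this lowers the limit rather than raising it;
# nlp() raises a ValueError on texts longer than max_length.
nlp.max_length = 600000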

print('Libraries loaded successfully')

In [None]:
# Load the Phase 1 dataset
df = pd.read_parquet('../data/processed/risk_factors_2006_2020.parquet')
print(f'Loaded {len(df):,} filings')
print(f'Columns: {df.columns.tolist()}')

## 1. Text Cleaning Pipeline

Build functions to clean and normalize the risk factor text.

In [None]:
def clean_risk_text(text):
    """
    Clean and normalize risk factor text.
    
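    Steps:
    1. Remove HTML tags (rare but present)
    2. Replace common HTML entities
    3. Remove navigation text (Back to Table of Contents)
    4. Remove Item 1A header
    5. Normalize bullet characters
    6. Normalize whitespace (paragraph breaks are preserved)
    7. Remove standalone page numbers
    8. Remove common header/footer patterns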
    """
    if not isinstance(text, str):
        return ''
    
    # 1. Remove HTML tags
    text = re.sub(r'<[^>]+>', ' ', text)
    
    # 2. Replace HTML entities (decode &amp; to '&' so terms like "R&D" survive)
    text = re.sub(r'&nbsp;|&#160;|&quot;|&lt;|&gt;', ' ', text)
    text = text.replace('&amp;', '&')
    
    # 3. Remove navigation text
    text = re.sub(r'back to (table of )?contents', '', text, flags=re.IGNORECASE)
    
    # 4. Remove the Item 1A header itself (we know what section this is)
    text = re.sub(r'^\s*item\s*1a\.?\s*:?\s*risk\s*factors\s*', '', text, flags=re.IGNORECASE)
    
    # 5. Normalize bullet characters to standard bullet
    text = re.sub(r'[•●○◦▪▫◘►▸‣⁃]', '•', text)
    
    # 6. Normalize whitespace (but preserve paragraph breaks)
    text = re.sub(r'[ \t]+', ' ', text)  # Multiple spaces to single
    text = re.sub(r'\n{3,}', '\n\n', text)  # Max 2 newlines
    
    # 7. Remove standalone page numbers
    text = re.sub(r'^\s*\d+\s*$', '', text, flags=re.MULTILINE)
    
    # 8. Remove common header/footer patterns
    text = re.sub(r'\b\d+\s*of\s*\d+\b', '', text)  # "Page X of Y"
    text = re.sub(r'table of contents', '', text, flags=re.IGNORECASE)
    
    return text.strip()

# Test on a sample
sample_text = df['section_1A'].iloc[0]
cleaned = clean_risk_text(sample_text)
print(f'Original length: {len(sample_text):,}')
print(f'Cleaned length: {len(cleaned):,}')
print(f'\nFirst 500 chars of cleaned text:')
print(cleaned[:500])
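
The whitespace step in `clean_risk_text` collapses only spaces and tabs, so blank lines remain available as paragraph boundaries for the later `re.split(r'\n\s*\n', ...)` call. A minimal sketch of the difference, using an invented two-paragraph string (the sample text is an assumption for illustration and does not come from any filing):

```python
import re

raw = "We   depend on\tkey personnel.\n\n\n\nLoss of   key staff could harm us."

# Collapsing only spaces and tabs keeps newlines, so paragraph breaks survive.
kept = re.sub(r'[ \t]+', ' ', raw)
kept = re.sub(r'\n{3,}', '\n\n', kept)

# Collapsing all whitespace (\s includes newlines) merges everything into one paragraph.
flattened = re.sub(r'\s+', ' ', raw)

print(len(re.split(r'\n\s*\n', kept)))       # 2
print(len(re.split(r'\n\s*\n', flattened)))  # 1
```

With the aggressive `\s+` variant, the paragraph-based segmentation in the next section would see each document as a single block.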

In [None]:
# Apply cleaning to all documents
print('Cleaning all documents...')
tqdm.pandas(desc='Cleaning')
df['clean_text'] = df['section_1A'].progress_apply(clean_risk_text)

# Calculate new lengths
df['clean_length'] = df['clean_text'].str.len()

print(f'\nCleaning complete!')
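# Measure the raw text directly instead of relying on a precomputed length column
raw_length = df['section_1A'].str.len()
print(f'Average reduction: {(1 - df["clean_length"].mean() / raw_length.mean()) * 100:.1f}%')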

## 2. Risk Paragraph Segmentation

SEC filings typically organize risks as:
- **Risk Header** (bold or capitalized title)
- **Risk Description** (1-3 paragraphs explaining the risk)

We'll segment by identifying risk headers and their associated content.

In [None]:
def extract_risk_paragraphs(text):
    """
    Extract individual risk paragraphs from Item 1A text.
    
    Strategy:
    1. Look for risk headers (sentences ending with period, followed by paragraph)
    2. Common patterns: "We may...", "Our business could...", "Risk of..."
    3. Fall back to paragraph-based segmentation
    
    Returns:
        list: List of dicts with 'header' and 'content' keys
    """
    if not text or len(text) < 100:
        return []
    
    risks = []
    
    # Pattern 1: Look for risk headers (short sentences that introduce risks)
    # Common formats:
    # - "We depend on key personnel." followed by explanation
    # - "Competition" or "Competition." as a header
    # - Lines that are all caps or title case and short
    
    # Split into paragraphs first
    paragraphs = re.split(r'\n\s*\n', text)
    paragraphs = [p.strip() for p in paragraphs if p.strip()]
    
    current_header = None
    current_content = []
    
    for para in paragraphs:
        # Check if this looks like a header
        is_header = False
        
        # Short paragraph (< 200 chars) that could be a header
        if len(para) < 200:
            # All caps
            if para.isupper():
                is_header = True
            # Ends with period and is short (risk title)
            elif para.endswith('.') and len(para) < 150:
                # Check if it starts with risk-related words
                if re.match(r'^(we |our |the |if |there |risks? |loss |failure |changes? |inability )', para, re.I):
                    is_header = True
            # Title case short line
            elif para.istitle() and len(para) < 100:
                is_header = True
        
        if is_header:
            # Save previous risk if exists
            if current_header and current_content:
                risks.append({
                    'header': current_header,
                    'content': ' '.join(current_content)
                })
            current_header = para
            current_content = []
        else:
            current_content.append(para)
    
    # Don't forget the last risk
    if current_header and current_content:
        risks.append({
            'header': current_header,
            'content': ' '.join(current_content)
        })
    
    # If no headers found, fall back to paragraph-based segmentation
    if not risks and paragraphs:
        for i, para in enumerate(paragraphs):
            if len(para) > 100:  # Only substantial paragraphs
                risks.append({
                    'header': f'Risk {i+1}',
                    'content': para
                })
    
    return risks

# Test on a sample
sample_risks = extract_risk_paragraphs(df['clean_text'].iloc[100])
print(f'Found {len(sample_risks)} risk paragraphs')
print('\nFirst 3 risks:')
for i, risk in enumerate(sample_risks[:3]):
    print(f'\n--- Risk {i+1} ---')
    print(f'Header: {risk["header"][:100]}...' if len(risk['header']) > 100 else f'Header: {risk["header"]}')
    print(f'Content: {risk["content"][:200]}...')
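
To see how the header rules behave in isolation, the sketch below restates them as a standalone function. `looks_like_header` is a hypothetical helper written for this illustration (it is not part of the notebook's pipeline), and the sample strings are invented.

```python
import re

# Same opening words that extract_risk_paragraphs checks for sentence-style headers
HEADER_START = re.compile(
    r'^(we |our |the |if |there |risks? |loss |failure |changes? |inability )', re.I)

def looks_like_header(para: str) -> bool:
    """Hypothetical helper mirroring the header rules in extract_risk_paragraphs."""
    if len(para) >= 200:
        return False
    if para.isupper():                              # e.g. "COMPETITION"
        return True
    if para.endswith('.') and len(para) < 150:      # short sentence-style risk title
        return bool(HEADER_START.match(para))
    return para.istitle() and len(para) < 100       # short Title Case line

samples = [
    "COMPETITION",
    "We depend on key personnel.",
    "Competition is intense.",
    "Risks Related To Our Business",
    "Risks Related to Our Business",
]
for s in samples:
    print(f'{looks_like_header(s)!s:5}  {s}')
```

The last sample suggests one limitation: `str.istitle()` rejects headings that contain lowercase function words such as "to" or "of", so some real headers may be treated as body text.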

In [None]:
# Extract risk paragraphs from all documents
print('Extracting risk paragraphs from all documents...')
tqdm.pandas(desc='Extracting')
df['risk_paragraphs'] = df['clean_text'].progress_apply(extract_risk_paragraphs)
df['num_risks'] = df['risk_paragraphs'].apply(len)

print(f'\nExtraction complete!')
print(f'Total risk paragraphs extracted: {df["num_risks"].sum():,}')
print(f'Average risks per filing: {df["num_risks"].mean():.1f}')
print(f'Median risks per filing: {df["num_risks"].median():.0f}')

In [None]:
# Distribution of risks per filing
import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Histogram of risks per filing
axes[0].hist(df['num_risks'], bins=50, color='steelblue', edgecolor='white')
axes[0].axvline(df['num_risks'].median(), color='red', linestyle='--', label=f'Median: {df["num_risks"].median():.0f}')
axes[0].set_xlabel('Number of Risk Paragraphs')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of Risk Paragraphs per Filing')
axes[0].legend()

# Average risks by year
risks_by_year = df.groupby('filing_year')['num_risks'].mean()
axes[1].plot(risks_by_year.index, risks_by_year.values, marker='o', color='steelblue', linewidth=2)
axes[1].set_xlabel('Year')
axes[1].set_ylabel('Average Risk Paragraphs')
axes[1].set_title('Average Risk Paragraphs Over Time')

plt.tight_layout()
plt.savefig('../outputs/risk_segmentation.png', dpi=150, bbox_inches='tight')
plt.show()

## 3. Sentence-Level Segmentation

For fine-grained classification, we also need sentence-level segments.

In [None]:
def extract_sentences(text, max_sentences=500):
    """
    Extract sentences from text using spaCy.
    
    Args:
        text: Input text
        max_sentences: Maximum sentences to extract (for very long docs)
    
    Returns:
        list: List of sentence strings
    """
    if not text or len(text) < 50:
        return []
    
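    # Process with spaCy; truncate to its limit because nlp() raises
    # a ValueError on texts longer than nlp.max_length
    doc = nlp(text[:nlp.max_length])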
    
    sentences = []
    for sent in doc.sents:
        sent_text = sent.text.strip()
        # Filter out very short or very long sentences
        if 20 < len(sent_text) < 2000:
            sentences.append(sent_text)
        if len(sentences) >= max_sentences:
            break
    
    return sentences

# Test on a sample
sample_sentences = extract_sentences(df['clean_text'].iloc[0])
print(f'Found {len(sample_sentences)} sentences')
print('\nFirst 5 sentences:')
for i, sent in enumerate(sample_sentences[:5]):
    print(f'{i+1}. {sent[:100]}...' if len(sent) > 100 else f'{i+1}. {sent}')
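
Because `nlp()` rejects texts longer than `nlp.max_length`, very long sections may need to be split before sentence extraction. One possible approach is sketched below, assuming paragraphs are separated by exactly two newlines. `chunk_by_paragraph` is a hypothetical helper and is not part of the notebook's pipeline.

```python
def chunk_by_paragraph(text, max_chars=600_000):
    """Hypothetical helper: group paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    chunks, current, size = [], [], 0
    for para in text.split('\n\n'):
        # +2 accounts for the '\n\n' separator restored when joining
        added = len(para) + (2 if current else 0)
        if current and size + added > max_chars:
            chunks.append('\n\n'.join(current))
            current, size = [], 0
            added = len(para)
        current.append(para)
        size += added
    if current:
        chunks.append('\n\n'.join(current))
    return chunks

# Three 40-character paragraphs with a 90-character budget
demo = '\n\n'.join(['a' * 40, 'b' * 40, 'c' * 40])
demo_chunks = chunk_by_paragraph(demo, max_chars=90)
print([len(c) for c in demo_chunks])  # [82, 40]
```

Each chunk could then be passed to `extract_sentences` and the resulting lists concatenated.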

## 4. Initial Risk Category Identification

Based on the project plan, we'll identify these risk categories:
1. Regulatory/Legal
2. Cybersecurity/Data Privacy
3. Competitive/Market
4. Macroeconomic
5. Operational/Supply Chain
6. Financial/Liquidity
7. Environmental/Climate
8. Personnel/Labor
9. Reputational
10. Technology/Innovation

In [None]:
# Define keyword patterns for initial category identification
RISK_CATEGORIES = {
    'regulatory_legal': [
        r'regulat', r'compliance', r'legal', r'litigation', r'lawsuit',
        r'government', r'legislation', r'law ', r'court', r'patent',
        r'intellectual property', r'SEC', r'FDA', r'EPA', r'FTC'
    ],
    'cybersecurity': [
        r'cyber', r'data breach', r'security breach', r'hack', r'privacy',
        r'personal data', r'data protection', r'information security',
        r'GDPR', r'CCPA', r'ransomware', r'malware'
    ],
    'competitive_market': [
        r'compet', r'market share', r'pricing pressure', r'new entrants',
        r'industry consolidation', r'customer concentration', r'demand'
    ],
    'macroeconomic': [
        r'economic', r'recession', r'inflation', r'interest rate',
        r'currency', r'exchange rate', r'GDP', r'unemployment',
        r'global economy', r'trade war', r'tariff'
    ],
    'operational_supply': [
        r'supply chain', r'supplier', r'manufacturing', r'production',
        r'distribution', r'logistics', r'inventory', r'sourcing',
        r'operational', r'disruption'
    ],
    'financial_liquidity': [
        r'liquidity', r'cash flow', r'debt', r'credit', r'financing',
        r'capital', r'covenant', r'leverage', r'bankruptcy', r'insolven'
    ],
    'environmental_climate': [
        r'environment', r'climate', r'emission', r'carbon', r'pollution',
        r'sustainability', r'renewable', r'ESG', r'natural disaster',
        r'weather', r'flood', r'hurricane'
    ],
    'personnel_labor': [
        r'employee', r'personnel', r'labor', r'workforce', r'talent',
        r'key person', r'executive', r'management team', r'union',
        r'hiring', r'retention'
    ],
    'reputational': [
        r'reputation', r'brand', r'public perception', r'media',
        r'social media', r'negative publicity', r'trust'
    ],
    'technology_innovation': [
        r'technology', r'innovation', r'obsolete', r'R&D', r'research',
        r'product development', r'digital', r'AI ', r'artificial intelligence',
        r'automation', r'disrupt'
    ]
}

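def classify_risk_paragraph(text):
    """
    Classify a risk paragraph into categories using keyword matching.
    This is a preliminary classifier - will be improved with ML.
    
    Returns:
        dict: Category scores (fraction of each category's patterns that match, 0-1)
    """
    text_lower = text.lower()
    scores = {}
    
    for category, patterns in RISK_CATEGORIES.items():
        matches = 0
        for p in patterns:
            if p != p.lower():
                # Acronym patterns (SEC, GDPR, R&D, 'AI ') contain uppercase letters and
                # could never match lowercased text, so match them as whole words,
                # case-sensitively, against the original text
                hit = re.search(r'\b' + p.strip() + r'\b', text)
            else:
                hit = re.search(p, text_lower)
            if hit:
                matches += 1
        scores[category] = matches / len(patterns)
    
    return scores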

def get_primary_category(scores):
    """Get the primary category from scores dict."""
    if not scores or max(scores.values()) == 0:
        return 'other'
    return max(scores, key=scores.get)

# Test on sample risks
sample_risk = df['risk_paragraphs'].iloc[100][0] if df['risk_paragraphs'].iloc[100] else None
if sample_risk:
    scores = classify_risk_paragraph(sample_risk['content'])
    print(f'Risk: {sample_risk["header"]}')
    print(f'\nCategory scores:')
    for cat, score in sorted(scores.items(), key=lambda x: -x[1])[:5]:
        print(f'  {cat}: {score:.2f}')
    print(f'\nPrimary category: {get_primary_category(scores)}')
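
Case handling matters for acronym keywords such as `SEC` or `GDPR`. The sketch below uses an invented sentence to compare three ways of matching an acronym: a case-sensitive pattern against lowercased text, a lowercase pattern, and a word-bounded pattern against the original text.

```python
import re

text = "The SEC may investigate our security practices under GDPR."

# A case-sensitive acronym pattern never matches lowercased text.
print(bool(re.search(r'SEC', text.lower())))   # False

# A lowercase pattern also fires inside ordinary words such as "security".
print(re.findall(r'sec', text.lower()))        # ['sec', 'sec']

# Word boundaries on the original-case text match only the acronym itself.
print(re.findall(r'\bSEC\b', text))            # ['SEC']
```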

In [None]:
# Create flattened dataset of individual risk paragraphs
print('Creating flattened risk paragraph dataset...')

risk_records = []
for _, row in tqdm(df.iterrows(), total=len(df), desc='Flattening'):
    for risk in row['risk_paragraphs']:
        scores = classify_risk_paragraph(risk['content'])
        record = {
            'cik': row['cik'],
            'filing_year': row['filing_year'],
            'filename': row['filename'],
            'risk_header': risk['header'],
            'risk_content': risk['content'],
            'content_length': len(risk['content']),
            'primary_category': get_primary_category(scores),
            **{f'score_{k}': v for k, v in scores.items()}
        }
        risk_records.append(record)

df_risks = pd.DataFrame(risk_records)
print(f'\nCreated dataset with {len(df_risks):,} individual risk paragraphs')

In [None]:
# Distribution of preliminary categories
print('Preliminary category distribution:')
category_counts = df_risks['primary_category'].value_counts()
print(category_counts)

# Visualize
fig, ax = plt.subplots(figsize=(10, 6))
category_counts.plot(kind='bar', ax=ax, color='steelblue')
ax.set_xlabel('Risk Category')
ax.set_ylabel('Number of Risk Paragraphs')
ax.set_title('Preliminary Risk Category Distribution (Keyword-based)')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.savefig('../outputs/preliminary_categories.png', dpi=150, bbox_inches='tight')
plt.show()
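
When reading this distribution, note that each score is the fraction of a category's keyword list that matched, so categories with shorter lists gain more from a single hit. A small sketch with hypothetical match counts follows; the list sizes 15, 7 and 11 correspond to `regulatory_legal`, `reputational` and `macroeconomic` in `RISK_CATEGORIES`, while the match counts are invented.

```python
# Hypothetical result for one paragraph: (keywords matched, keywords in list)
hits = {
    'regulatory_legal': (2, 15),
    'reputational': (1, 7),
    'macroeconomic': (0, 11),
}

# Normalized score, as computed by classify_risk_paragraph
norm_scores = {cat: matched / total for cat, (matched, total) in hits.items()}
by_fraction = max(norm_scores, key=norm_scores.get)

# Alternative ranking by raw number of matched keywords
by_count = max(hits, key=lambda cat: hits[cat][0])

print(by_fraction)  # reputational
print(by_count)     # regulatory_legal
```

Here one reputational hit (1/7 ≈ 0.143) outranks two regulatory hits (2/15 ≈ 0.133). Ties are also resolved in dictionary order, because `max` returns the first maximal key.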

## 5. Save Preprocessed Datasets

In [None]:
# Save document-level dataset with clean text
df_docs = df[['cik', 'filename', 'filing_year', 'clean_text', 'clean_length', 'num_risks']].copy()
df_docs.to_parquet('../data/processed/risk_factors_cleaned.parquet', index=False)
print(f'Saved document-level dataset: {len(df_docs):,} documents')

# Save paragraph-level dataset
df_risks.to_parquet('../data/processed/risk_paragraphs.parquet', index=False)
print(f'Saved paragraph-level dataset: {len(df_risks):,} paragraphs')

# Save a sample for quick inspection
# Fixed random_state so the inspection sample is reproducible across runs
df_risks.sample(min(5000, len(df_risks)), random_state=42).to_csv('../data/processed/risk_paragraphs_sample.csv', index=False)
print('Saved sample CSV for inspection')

## 6. Create Reusable Preprocessing Module

In [None]:
# Write preprocessing functions to a Python module
preprocessing_code = '''
"""Text preprocessing utilities for SEC Risk Factor classification."""

import re
import spacy

# Load spaCy model
try:
    nlp = spacy.load('en_core_web_sm')
    nlp.max_length = 600000
except OSError:
    nlp = None


def clean_risk_text(text):
    """Clean and normalize risk factor text."""
    if not isinstance(text, str):
        return ''
    
    # Remove HTML tags
    text = re.sub(r'<[^>]+>', ' ', text)
    
    # Replace HTML entities (decode &amp; to '&' so terms like "R&D" survive)
    text = re.sub(r'&nbsp;|&#160;|&quot;|&lt;|&gt;', ' ', text)
    text = text.replace('&amp;', '&')
    
    # Remove navigation text
    text = re.sub(r'back to (table of )?contents', '', text, flags=re.IGNORECASE)
    
    # Remove Item 1A header
    text = re.sub(r'^\\s*item\\s*1a\\.?\\s*:?\\s*risk\\s*factors\\s*', '', text, flags=re.IGNORECASE)
    
    # Normalize bullet characters
    text = re.sub(r'[•●○◦▪▫◘►▸‣⁃]', '•', text)
    
    # Normalize whitespace
    text = re.sub(r'[ \\t]+', ' ', text)
    text = re.sub(r'\\n{3,}', '\\n\\n', text)
    
    # Remove page numbers
    text = re.sub(r'^\\s*\\d+\\s*$', '', text, flags=re.MULTILINE)
    text = re.sub(r'\\b\\d+\\s*of\\s*\\d+\\b', '', text)
    text = re.sub(r'table of contents', '', text, flags=re.IGNORECASE)
    
    return text.strip()


def extract_risk_paragraphs(text):
    """Extract individual risk paragraphs from Item 1A text."""
    if not text or len(text) < 100:
        return []
    
    risks = []
    paragraphs = re.split(r'\\n\\s*\\n', text)
    paragraphs = [p.strip() for p in paragraphs if p.strip()]
    
    current_header = None
    current_content = []
    
    for para in paragraphs:
        is_header = False
        
        if len(para) < 200:
            if para.isupper():
                is_header = True
            elif para.endswith('.') and len(para) < 150:
                if re.match(r'^(we |our |the |if |there |risks? |loss |failure |changes? |inability )', para, re.I):
                    is_header = True
            elif para.istitle() and len(para) < 100:
                is_header = True
        
        if is_header:
            if current_header and current_content:
                risks.append({
                    'header': current_header,
                    'content': ' '.join(current_content)
                })
            current_header = para
            current_content = []
        else:
            current_content.append(para)
    
    if current_header and current_content:
        risks.append({
            'header': current_header,
            'content': ' '.join(current_content)
        })
    
    if not risks and paragraphs:
        for i, para in enumerate(paragraphs):
            if len(para) > 100:
                risks.append({
                    'header': f'Risk {i+1}',
                    'content': para
                })
    
    return risks


def extract_sentences(text, max_sentences=500):
    """Extract sentences using spaCy."""
    if not text or len(text) < 50 or nlp is None:
        return []
    
    doc = nlp(text)
    sentences = []
    
    for sent in doc.sents:
        sent_text = sent.text.strip()
        if 20 < len(sent_text) < 2000:
            sentences.append(sent_text)
        if len(sentences) >= max_sentences:
            break
    
    return sentences
'''

# Write as UTF-8: the module source contains non-ASCII bullet characters
with open('../src/preprocessing.py', 'w', encoding='utf-8') as f:
    f.write(preprocessing_code)

print('Saved preprocessing module to src/preprocessing.py')

## Phase 2 Summary

### Completed
- ✅ Text cleaning pipeline (HTML, navigation, normalization)
- ✅ Risk paragraph segmentation
- ✅ Preliminary category identification (keyword-based)
- ✅ Created flattened paragraph-level dataset
- ✅ Saved reusable preprocessing module

### Key Statistics
- Documents cleaned: [X]
- Total risk paragraphs: [X]
- Average risks per filing: [X]

### Outputs
- `data/processed/risk_factors_cleaned.parquet` - Document-level
- `data/processed/risk_paragraphs.parquet` - Paragraph-level
- `src/preprocessing.py` - Reusable module

### Next Steps (Phase 3)
- Manual labeling of ~500 risk paragraphs
- Build TF-IDF baseline classifier
- Establish evaluation metrics