# FIXED: MD&A EXTRACTION WITH YEAR SUBFOLDER ORGANIZATION

**Purpose:** Extract MD&A sections from 10-K filings with Drive timeout fix

**CRITICAL FIX:** Reorganizes files into year subfolders so that no single folder grows past the ~10,000 files at which Google Drive operations start timing out

**Prerequisites:**
- Repository: `/content/drive/MyDrive/EDGAR_Project/edgar-crawler`
- Raw 10-K files downloaded
- **~75,000 extracted files causing Drive timeouts** ← This notebook fixes that!

**Instructions:**
- 🟢 **GREEN** = Run EVERY TIME
- 🟡 **YELLOW** = Run FIRST TIME ONLY
- 🔵 **BLUE** = Optional/conditional
- 🔴 **RED** = CRITICAL FIX - Run once to reorganize files

---

# SECTION 1: SETUP

In [None]:
## 🟢 Cell 1: Mount Google Drive
import os
from google.colab import drive

if os.path.exists('/content/drive/MyDrive'):
    print("✅ Drive already mounted")
else:
    drive.mount('/content/drive')
    print("✅ Drive mounted successfully")

In [None]:
## 🟢 Cell 2: Navigate to Repository
import os

REPO_DIR = '/content/drive/MyDrive/EDGAR_Project/edgar-crawler'

if os.path.exists(REPO_DIR):
    os.chdir(REPO_DIR)
    print(f"✅ Working directory: {os.getcwd()}")
else:
    print(f"❌ Repository not found at: {REPO_DIR}")

In [None]:
## 🟢 Cell 3: Install Dependencies
print("📦 Installing dependencies...")

!pip install -q 'dill<0.3.9' 'multiprocess<0.70.17'
!pip install -q pox ppft
!pip install -q --no-deps pathos
!pip install -q beautifulsoup4 lxml requests pandas tqdm click cssutils numpy pyarrow

print("✅ All dependencies installed")

In [None]:
## 🟢 Cell 4: Keep-Alive Script
from IPython.display import display, Javascript

display(Javascript('''
function KeepClicking(){
    console.log("Keeping session alive...");
    // Guard against the button being absent (e.g. after a Colab UI change)
    var button = document.querySelector("colab-connect-button");
    if (button) { button.click(); }
}
setInterval(KeepClicking, 60000);
'''))

print("✅ Keep-alive activated")

# SECTION 8: FIX DRIVE TIMEOUT ISSUE (RUN FIRST!)

**🔴 CRITICAL: Run this BEFORE resuming extraction!**

Your `datasets/EXTRACTED_FILINGS/10-K/` folder holds ~75,000 files.
Google Drive operations start timing out once a single folder contains more than about 10,000 files.

This section will:
1. Reorganize existing files into year subfolders (2000/, 2001/, etc.)
2. Patch extraction to write new files to year subfolders
3. Update progress checker to count across all subfolders
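
The filename-to-subfolder mapping that steps 1 and 2 rely on can be sketched as a small pure function. This is a minimal illustration, assuming filenames follow the `<id>_10K_<year>_<suffix>.json` shape shown in this notebook; the helper name `year_subfolder_path` and the file `no_year_here.json` are hypothetical and not part of edgar-crawler.

```python
import os
import re

# Same pattern the reorganization cell uses: a 4-digit year right after "_10K_".
YEAR_PATTERN = re.compile(r'_10K_(\d{4})_')

def year_subfolder_path(base_dir, filename):
    """Return the year-subfolder path for a file (hypothetical helper).

    Falls back to the flat path in base_dir when no year is found,
    mirroring the fallback branch of the patch in Step 2.
    """
    match = YEAR_PATTERN.search(filename)
    if match is None:
        return os.path.join(base_dir, filename)
    return os.path.join(base_dir, match.group(1), filename)

base = 'datasets/EXTRACTED_FILINGS/10-K'
print(year_subfolder_path(base, '1234567_10K_2015_xxx.json'))
print(year_subfolder_path(base, 'no_year_here.json'))  # stays in the root folder
```

Keeping the mapping in one function means the one-time move and the patched extractor cannot disagree about where a given file belongs.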

In [None]:
## 🔴 STEP 1: Reorganize Existing Files into Year Subfolders
## RUN: ONE TIME - Fixes Drive timeout by moving files
##
## What it does:
## - Moves all JSON files from datasets/EXTRACTED_FILINGS/10-K/ into year subfolders
## - Creates: 10-K/2000/, 10-K/2001/, ..., 10-K/2024/
## - Extracts year from filename (e.g., 1234567_10K_2015_xxx.json → 2015/)
##
## Expected time: 10-30 minutes for ~75,000 files
## SAFE: Only moves files, does not delete anything

import os
import re
from tqdm import tqdm
import time

print("🔧 REORGANIZING FILES TO FIX DRIVE TIMEOUT ISSUE")
print("=" * 60)
print("This will move ~75,000 files into year subfolders")
print("Expected time: 10-30 minutes\n")

base_dir = 'datasets/EXTRACTED_FILINGS/10-K'

if not os.path.exists(base_dir):
    print(f"❌ Directory not found: {base_dir}")
else:
    print("📊 Step 1/4: Scanning for files...")
    
    # Get all JSON files in root directory only
    try:
        all_items = os.listdir(base_dir)
        root_files = [f for f in all_items if os.path.isfile(os.path.join(base_dir, f)) and f.endswith('.json')]
    except Exception as e:
        print(f"❌ Error listing directory: {e}")
        print("   Drive may be temporarily unavailable - wait and retry")
        root_files = []
    
    print(f"   Found {len(root_files):,} files to reorganize")
    
    if len(root_files) == 0:
        print("\n✅ No files to reorganize - already organized!")
    else:
        print(f"\n📊 Step 2/4: Grouping by year...")
        
        # Extract years
        year_pattern = re.compile(r'_10K_(\d{4})_')
        year_groups = {}
        no_year = []
        
        for filename in root_files:
            match = year_pattern.search(filename)
            if match:
                year = match.group(1)
                if year not in year_groups:
                    year_groups[year] = []
                year_groups[year].append(filename)
            else:
                no_year.append(filename)
        
        print(f"   Years found: {len(year_groups)}")
        for year in sorted(year_groups.keys()):
            print(f"      {year}: {len(year_groups[year]):,} files")
        
        if no_year:
            print(f"      Unknown: {len(no_year)} files (will skip)")
        
        print(f"\n🚀 Step 3/4: Moving files to year subfolders...")
        print(f"   This will take 10-30 minutes\n")
        
        moved_count = 0
        error_count = 0
        
        for year in sorted(year_groups.keys()):
            year_dir = os.path.join(base_dir, year)
            
            # Create year folder
            try:
                os.makedirs(year_dir, exist_ok=True)
                time.sleep(0.1)  # Small delay to avoid API rate limits
            except Exception as e:
                print(f"❌ Error creating {year}/: {e}")
                continue
            
            # Move files with progress bar
            files = year_groups[year]
            pbar = tqdm(files, desc=f"Year {year}", leave=False)
            
            for filename in pbar:
                src = os.path.join(base_dir, filename)
                dst = os.path.join(year_dir, filename)
                
                try:
                    os.rename(src, dst)
                    moved_count += 1
                    
                    # Add small delay every 100 files to avoid overwhelming Drive
                    if moved_count % 100 == 0:
                        time.sleep(0.5)
                        
                except Exception as e:
                    error_count += 1
                    if error_count <= 5:
                        print(f"\n⚠️ Error moving {filename}: {e}")
        
        print(f"\n📊 Step 4/4: Verifying reorganization...")
        
        total_in_subfolders = 0
        for year in sorted(year_groups.keys()):
            year_dir = os.path.join(base_dir, year)
            if os.path.exists(year_dir):
                try:
                    count = len([f for f in os.listdir(year_dir) if f.endswith('.json')])
                    total_in_subfolders += count
                    print(f"   {year}/: {count:,} files")
                except OSError:  # listing can fail if Drive is temporarily unavailable
                    print(f"   {year}/: (error counting)")
        
        print(f"\n" + "=" * 60)
        print(f"✅ REORGANIZATION COMPLETE!")
        print(f"   Moved: {moved_count:,} files")
        print(f"   Verified: {total_in_subfolders:,} files in subfolders")
        if error_count > 0:
            print(f"   ⚠️ Errors: {error_count} files failed")
        if no_year:
            print(f"   ⚠️ Skipped: {len(no_year)} files (no year)")
        print(f"\n🎉 You can now resume extraction!")
        print(f"   New extractions will automatically use year subfolders")

In [None]:
## 🔴 STEP 2: Patch Extraction to Write to Year Subfolders
## RUN: ONE TIME - After reorganization
##
## What it does:
## - Modifies extract_items.py to organize future extractions by year
## - New files will automatically go to year subfolders
## - Prevents the same timeout issue from happening again

import os

file_path = 'extract_items.py'

print("🔧 Patching extract_items.py for year subfolder organization...\n")

with open(file_path, 'r') as f:
    content = f.read()

# Check if already patched
if 'year_subfolder' in content or '# Year-based organization' in content:
    print("✅ extract_items.py already patched for year subfolders")
    print("   Future extractions will use year organization")
else:
    print("Searching for patch location...")
    
    # Find where output filename is created
    old_code = '''        absolute_json_filename = os.path.join(
            filing_type_folder, json_filename
        )'''
    
    new_code = '''        # Year-based organization to avoid Drive folder limits
        import re
        year_match = re.search(r'_10K_(\\d{4})_', json_filename)
        if year_match:
            year_subfolder = year_match.group(1)
            year_folder = os.path.join(filing_type_folder, year_subfolder)
            os.makedirs(year_folder, exist_ok=True)
            absolute_json_filename = os.path.join(year_folder, json_filename)
        else:
            # Fallback: no year found, use root
            absolute_json_filename = os.path.join(filing_type_folder, json_filename)'''
    
    if old_code in content:
        content = content.replace(old_code, new_code)
        
        with open(file_path, 'w') as f:
            f.write(content)
        
        print("✅ extract_items.py successfully patched!")
        print("   Future extractions will write to year subfolders:")
        print("   - datasets/EXTRACTED_FILINGS/10-K/2024/")
        print("   - datasets/EXTRACTED_FILINGS/10-K/2023/")
        print("   - etc.")
    else:
        print("⚠️ Could not find target code section")
        print("   Code structure may have changed")
        print("   Extraction might still work, but won't organize by year")

# SECTION 5: CHECK PROGRESS (UPDATED)

**This version counts files across ALL year subfolders**
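
The counting idea can be shown in isolation. The sketch below is a simplified, one-level-deep variant using `os.scandir`, assuming the layout described above (JSON files either in the root folder or in immediate year subfolders); the function name `count_json_by_folder` and the demo file names are hypothetical.

```python
import os
import tempfile

def count_json_by_folder(extracted_dir):
    """Count .json files in the root folder and in each immediate subfolder."""
    root_count = 0
    year_counts = {}
    for entry in os.scandir(extracted_dir):
        if entry.is_file() and entry.name.endswith('.json'):
            root_count += 1  # file still in the old flat layout
        elif entry.is_dir():
            year_counts[entry.name] = sum(
                1 for name in os.listdir(entry.path) if name.endswith('.json')
            )
    return root_count, year_counts

# Demo on a throwaway directory with one flat file and one organized file
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, '2015'))
    for rel in ('a_10K_2015_x.json', os.path.join('2015', 'b_10K_2015_y.json')):
        open(os.path.join(tmp, rel), 'w').close()
    print(count_json_by_folder(tmp))  # (1, {'2015': 1})
```

A non-zero root count after reorganization points to files that were skipped or that failed to move.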

In [None]:
## 🔵 Check Extraction Progress (UPDATED FOR YEAR SUBFOLDERS)
## RUN: When extraction is stopped
##
## Now counts files in:
## - Root directory (old flat structure)
## - Year subfolders (new organized structure)

import os
import json
import pandas as pd

extracted_dir = 'datasets/EXTRACTED_FILINGS/10-K'

if os.path.exists(extracted_dir):
    print("📊 Scanning for extracted files (including year subfolders)...\n")
    
    # Count all JSON files recursively
    all_files = []
    year_counts = {}
    root_count = 0
    
    try:
        for root, dirs, files in os.walk(extracted_dir):
            json_files = [f for f in files if f.endswith('.json')]
            all_files.extend([os.path.join(root, f) for f in json_files])
            
            # Track by location
            if root == extracted_dir:
                root_count = len(json_files)
            else:
                year = os.path.basename(root)
                year_counts[year] = len(json_files)
    except Exception as e:
        print(f"⚠️ Error scanning: {e}")
        print("   Drive may be temporarily slow - retry in a moment")
    
    # Get expected total
    metadata = pd.read_csv('datasets/FILINGS_METADATA.csv')
    expected = len(metadata[metadata['Type'] == '10-K'])

    print(f"📊 Extraction Progress:")
    print(f"   Total Extracted: {len(all_files):,} files")
    print(f"   Expected: {expected:,} files")
    if expected > 0:  # avoid ZeroDivisionError when the metadata lists no 10-K filings
        print(f"   Progress: {len(all_files)/expected*100:.1f}%")
    print(f"   Remaining: {expected - len(all_files):,} files")
    
    # Show organization
    if year_counts or root_count > 0:
        print(f"\n📁 File organization:")
        
        if root_count > 0:
            print(f"   Root (not organized): {root_count:,} files")
            if root_count > 1000:
                print(f"      ⚠️ WARNING: Too many files in root!")
                print(f"      Run reorganization script to fix this")
        
        if year_counts:
            print(f"   Year subfolders: {sum(year_counts.values()):,} files")
            for year in sorted(year_counts.keys()):
                print(f"      {year}/: {year_counts[year]:,} files")
    
    # Sample quality check
    if len(all_files) > 0:
        print(f"\n📋 Sample Quality Check (3 random files):")
        import random
        sample = random.sample(all_files, min(3, len(all_files)))
        
        for fpath in sample:
            fname = os.path.basename(fpath)
            try:
                with open(fpath, 'r') as f:
                    data = json.load(f)
                    has_mda = 'item_7' in data and len(data.get('item_7', '')) > 100
                    mda_len = len(data.get('item_7', ''))
                    print(f"   {fname}: {'✅' if has_mda else '❌'} MD&A ({mda_len:,} chars)")
            except Exception as e:
                print(f"   {fname}: ⚠️ Error - {e}")
                
else:
    print("❌ No extraction directory found")
    print(f"   Expected: {extracted_dir}")

# SECTION 4: RESUME EXTRACTION

**After running reorganization above, resume here**
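
Before resuming, it may be worth confirming that the patch from Step 2 is actually in place. The sketch below looks for the same marker strings the patch cell checks (`year_subfolder` and `# Year-based organization`); the helper name `is_patched` is hypothetical, and the demo writes a stand-in file to a temporary directory instead of touching the real `extract_items.py`.

```python
import os
import tempfile

# Marker strings taken from the "already patched" check in the Step 2 cell.
PATCH_MARKERS = ('year_subfolder', '# Year-based organization')

def is_patched(script_path):
    """Return True if the script contains any of the patch markers."""
    with open(script_path, 'r', encoding='utf-8') as f:
        content = f.read()
    return any(marker in content for marker in PATCH_MARKERS)

# Demo on a stand-in file so the real extract_items.py is not touched
with tempfile.TemporaryDirectory() as tmp:
    demo_script = os.path.join(tmp, 'extract_items.py')
    with open(demo_script, 'w', encoding='utf-8') as f:
        f.write("year_subfolder = '2015'\n")
    print(is_patched(demo_script))
```

In the notebook itself the call would be `is_patched('extract_items.py')` from the repository directory. If it returns `False`, new files would still land in the flat root folder.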

In [None]:
## 🟢 Resume MD&A Extraction
## RUN: After reorganization is complete
##
## Now writes to year subfolders automatically
## Should not experience Drive timeouts anymore

print("🚀 Resuming MD&A extraction...")
print("   Files will be organized by year")
print("   Should avoid Drive timeout issues\n")

!python flexible_extractor.py --config extraction_configs/mda_only.json