# FIXED: MD&A EXTRACTION WITH YEAR SUBFOLDER ORGANIZATION

**Purpose:** Extract MD&A sections from 10-K filings with Drive timeout fix

**CRITICAL FIX:** Reorganizes files into year subfolders to keep each folder under ~10,000 files, the threshold this notebook uses for when Drive operations from Colab start timing out
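
As a minimal sketch of the target layout, the helper below maps a filename to its year subfolder. The name `year_subfolder_path` is hypothetical and used only for illustration; the `_10K_<year>_` filename pattern is the one the Strategy 1 cell in this notebook matches.

```python
import os
import re

# Same pattern the Strategy 1 cell uses to pull the year out of a filename.
YEAR_PATTERN = re.compile(r'_10K_(\d{4})_')

def year_subfolder_path(base_dir, filename):
    """Return base_dir/<year>/filename, or None if the name has no year tag.

    Hypothetical helper: it only builds the path and never touches the filesystem.
    """
    match = YEAR_PATTERN.search(filename)
    if match is None:
        return None
    return os.path.join(base_dir, match.group(1), filename)
```

A filename without the `_10K_<year>_` tag yields `None`, so callers can decide whether to leave such files in the root folder.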

**TWO FIX STRATEGIES:**
- **Strategy 1:** Try waiting 3-4 hours, then run standard reorganization
- **Strategy 2:** Progressive reorganization (moves files during extraction, one-by-one)

**Prerequisites:**
- Repository: `/content/drive/MyDrive/EDGAR_Project/edgar-crawler`
- Raw 10-K files downloaded
- **~75,000 extracted files causing Drive timeouts**

---

# SECTION 1: SETUP (Run Every Time)

In [None]:
## 🟢 Cell 1: Mount Google Drive
import os
from google.colab import drive

if os.path.exists('/content/drive/MyDrive'):
    print("✅ Drive already mounted")
else:
    drive.mount('/content/drive')
    print("✅ Drive mounted successfully")

In [None]:
## 🟢 Cell 2: Navigate to Repository
import os

REPO_DIR = '/content/drive/MyDrive/EDGAR_Project/edgar-crawler'

if os.path.exists(REPO_DIR):
    os.chdir(REPO_DIR)
    print(f"✅ Working directory: {os.getcwd()}")
else:
    print(f"❌ Repository not found at: {REPO_DIR}")

In [None]:
## 🟢 Cell 3: Install Dependencies
print("📦 Installing dependencies...")

!pip install -q 'dill<0.3.9' 'multiprocess<0.70.17'
!pip install -q pox ppft
!pip install -q --no-deps pathos
!pip install -q beautifulsoup4 lxml requests pandas tqdm click cssutils numpy pyarrow

print("✅ All dependencies installed")

In [None]:
## 🟢 Cell 4: Keep-Alive Script
from IPython.display import display, Javascript

display(Javascript('''
function KeepClicking(){
    console.log("Keeping session alive...");
    // Guard against the button being absent so the timer never throws
    var button = document.querySelector("colab-connect-button");
    if (button) { button.click(); }
}
setInterval(KeepClicking, 60000);
'''))

print("✅ Keep-alive activated")

# STRATEGY 1: Wait & Retry Standard Reorganization

**🟡 TRY THIS FIRST!**

If you got `[Errno 5] Input/output error` when listing the directory:
1. **Wait 3-4 hours** (don't access the folder)
2. **Restart the Colab runtime** (Runtime → Restart runtime)
3. **Re-run Cells 1-4** (setup)
4. **Try the reorganization cell below**

Drive's cache/locks often clear after a few hours.

In [None]:
## 🔴 STRATEGY 1: Standard Reorganization (Try After Waiting)
## 
## What it does:
## - Lists all files in root directory
## - Groups by year
## - Moves to year subfolders
##
## If this fails with [Errno 5], use Strategy 2 below!

import os
import re
from tqdm import tqdm
import time

print("🔧 STRATEGY 1: STANDARD REORGANIZATION")
print("=" * 60)

base_dir = 'datasets/EXTRACTED_FILINGS/10-K'

if not os.path.exists(base_dir):
    print(f"❌ Directory not found: {base_dir}")
else:
    print("📊 Attempting to list directory...")
    
    try:
        all_items = os.listdir(base_dir)
        # Test the cheap suffix first so only .json entries trigger a stat call
        root_files = [f for f in all_items if f.endswith('.json') and os.path.isfile(os.path.join(base_dir, f))]
        print(f"✅ Success! Found {len(root_files):,} files to reorganize\n")
        
        if len(root_files) == 0:
            print("✅ No files to reorganize - already organized!")
        else:
            # Group by year
            print("📊 Grouping by year...")
            year_pattern = re.compile(r'_10K_(\d{4})_')
            year_groups = {}
            unmatched = 0  # files whose name has no _10K_<year>_ tag
            
            for filename in root_files:
                match = year_pattern.search(filename)
                if match:
                    year_groups.setdefault(match.group(1), []).append(filename)
                else:
                    unmatched += 1
            
            for year in sorted(year_groups.keys()):
                print(f"   {year}: {len(year_groups[year]):,} files")
            if unmatched:
                print(f"   ⚠️ {unmatched:,} files have no year in their name and will stay in root")
            
            # Move files
            print("\n🚀 Moving files to year subfolders...\n")
            
            moved_count = 0
            failed_count = 0
            for year in sorted(year_groups.keys()):
                year_dir = os.path.join(base_dir, year)
                os.makedirs(year_dir, exist_ok=True)
                
                for filename in tqdm(year_groups[year], desc=f"Year {year}", leave=False):
                    try:
                        src = os.path.join(base_dir, filename)
                        dst = os.path.join(year_dir, filename)
                        os.rename(src, dst)
                        moved_count += 1
                        
                        # Brief pause every 100 moves to go easy on Drive
                        if moved_count % 100 == 0:
                            time.sleep(0.5)
                    except OSError as e:
                        failed_count += 1
                        print(f"\n⚠️ Error moving {filename}: {e}")
            
            if failed_count == 0:
                print("\n✅ REORGANIZATION COMPLETE!")
            else:
                print("\n⚠️ REORGANIZATION FINISHED WITH ERRORS")
            print(f"   Moved: {moved_count:,} files")
            print(f"   Failed: {failed_count:,} files")
            print("\n🎉 Now run the patch script below, then resume extraction!")
            
    except OSError as e:
        print(f"❌ FAILED: {e}")
        print("\n" + "=" * 60)
        print("⚠️  Directory listing failed!")
        print("   This means Drive is blocking access to the folder.")
        print("\n   → Use STRATEGY 2 (Progressive Reorganization) below")
        print("   → It moves files one-by-one during extraction")
        print("=" * 60)

# STRATEGY 2: Progressive Reorganization (USE IF STRATEGY 1 FAILS)

**🟠 USE THIS IF DIRECTORY LISTING FAILS**

**How it works:**
- Doesn't try to list all 75,000 files at once
- Checks each expected file one-by-one by its exact path, using filenames built from the metadata
- Moves old files AND extracts new files during extraction
- Works even when `os.listdir()` fails!
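
The per-file approach above can be sketched roughly as follows. This is an illustration only, not the contents of `progressive_reorganization_patch.py` (which is not shown here); the function name `place_in_year_folder` and its status strings are hypothetical.

```python
import os

def place_in_year_folder(base_dir, filename, year):
    """Decide what to do with one expected output file.

    Returns (path, status), where status is 'moved', 'organized',
    or 'needs_extraction'. Hypothetical helper for illustration.
    """
    old_path = os.path.join(base_dir, filename)
    year_dir = os.path.join(base_dir, str(year))
    new_path = os.path.join(year_dir, filename)

    # Probe by exact path; no directory listing is involved.
    if os.path.isfile(old_path):
        os.makedirs(year_dir, exist_ok=True)
        os.rename(old_path, new_path)
        return new_path, 'moved'
    if os.path.isfile(new_path):
        return new_path, 'organized'
    # Not found in either place: the caller should extract to new_path.
    return new_path, 'needs_extraction'
```

Because each call touches only two known paths, the helper never asks Drive to enumerate the oversized root folder.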

**What happens:**
- ~75,000 existing files: Moved to year folders
- ~8,000 remaining files: Extracted to year folders
- Total time: Same as normal extraction (~10-15 hours)

**Run cells below in order:**

In [None]:
## 🟠 STRATEGY 2 - Step 1: Apply Progressive Reorganization Patch
##
## This patches extract_items.py to:
## 1. Check if file exists in OLD location (root)
## 2. If yes: MOVE it to year folder, mark as done
## 3. If no: Check if in NEW location (year folder)
## 4. If no: Extract to NEW location

print("🔧 Applying progressive reorganization patch...\n")

!python progressive_reorganization_patch.py

In [None]:
## 🟠 STRATEGY 2 - Step 2: Test Progressive Reorganization
##
## This tests if individual file access works (even when listing fails)

import os
import pandas as pd

print("🧪 Testing progressive access...\n")

base_dir = 'datasets/EXTRACTED_FILINGS/10-K'

# Load metadata (this cell assumes 'Type', 'CIK', 'year' and 'accession_number' columns)
metadata = pd.read_csv('datasets/FILINGS_METADATA.csv')
metadata_10k = metadata[metadata['Type'] == '10-K']

print(f"📊 Total files in metadata: {len(metadata_10k):,}")

# Test: Can we check individual files?
print("\n🧪 Testing individual file access (first 10 files):\n")

def probe(path):
    """Return True/False for existence, or None if the check itself failed.

    os.path.exists() hides I/O errors by returning False, so use os.stat().
    """
    try:
        os.stat(path)
        return True
    except FileNotFoundError:
        return False
    except OSError:
        return None

test_count = 0
found_in_root = 0
found_in_subfolder = 0
not_found = 0
access_errors = 0

for _, row in metadata_10k.head(10).iterrows():
    # Generate expected filename
    cik = str(row['CIK'])
    year = row['year']
    accession = row['accession_number']
    filename = f"{cik}_10K_{year}_{accession}.json"
    
    # Check old location (root)
    in_root = probe(os.path.join(base_dir, filename))
    # Check new location (year subfolder)
    in_subfolder = probe(os.path.join(base_dir, str(year), filename))
    
    if in_root is None or in_subfolder is None:
        access_errors += 1
        status = "⚠️ I/O error"
    elif in_root:
        found_in_root += 1
        status = "📁 Root (will move)"
    elif in_subfolder:
        found_in_subfolder += 1
        status = f"📂 {year}/ (organized)"
    else:
        not_found += 1
        status = "❌ Not extracted"
    
    print(f"   {filename[:50]:50s} {status}")
    test_count += 1

print("\n" + "=" * 60)
if access_errors == 0:
    print("✅ Individual file access: WORKS!")
else:
    print(f"❌ Individual file access: {access_errors} of {test_count} files hit I/O errors")
print(f"   Tested: {test_count} files")
print(f"   In root: {found_in_root}")
print(f"   In year folders: {found_in_subfolder}")
print(f"   Not extracted: {not_found}")
if access_errors == 0:
    print("\n🎉 Progressive reorganization should work!")
    print("   Run extraction below to start moving + extracting")

# RESUME EXTRACTION

**Run this after:**
- Strategy 1 successful reorganization, OR
- Strategy 2 progressive reorganization patch applied

In [None]:
## 🟢 Resume MD&A Extraction
##
## With Strategy 1: Files already organized, extracts remaining files
## With Strategy 2: Moves old files + extracts new files during extraction

print("🚀 Resuming MD&A extraction...")
print("   Files will be organized by year")
print("   Should avoid Drive timeout issues\n")

!python flexible_extractor.py --config extraction_configs/mda_only.json

# CHECK PROGRESS (Works with Both Strategies)

In [None]:
## 🔵 Check Extraction Progress (UPDATED FOR YEAR SUBFOLDERS)

import os
import json
import pandas as pd

extracted_dir = 'datasets/EXTRACTED_FILINGS/10-K'

if os.path.exists(extracted_dir):
    print("📊 Scanning for extracted files (including year subfolders)...\n")
    
    # Count all JSON files recursively
    all_files = []
    year_counts = {}
    root_count = 0
    
    def report_scan_error(err):
        # Without onerror, os.walk silently skips directories it cannot list
        print(f"⚠️ Error scanning {err.filename}: {err}")
    
    try:
        for root, dirs, files in os.walk(extracted_dir, onerror=report_scan_error):
            json_files = [f for f in files if f.endswith('.json')]
            all_files.extend([os.path.join(root, f) for f in json_files])
            
            if root == extracted_dir:
                root_count = len(json_files)
            else:
                year = os.path.basename(root)
                year_counts[year] = len(json_files)
    except Exception as e:
        print(f"⚠️ Error scanning: {e}")
    
    # Get expected total
    metadata = pd.read_csv('datasets/FILINGS_METADATA.csv')
    expected = len(metadata[metadata['Type'] == '10-K'])
    
    print("📊 Extraction Progress:")
    print(f"   Total Extracted: {len(all_files):,} files")
    print(f"   Expected: {expected:,} files")
    if expected > 0:  # avoid dividing by zero when the metadata has no 10-K rows
        print(f"   Progress: {len(all_files)/expected*100:.1f}%")
    print(f"   Remaining: {expected - len(all_files):,} files")
    
    # Show organization
    if year_counts or root_count > 0:
        print("\n📁 File organization:")
        
        if root_count > 0:
            print(f"   Root (not organized): {root_count:,} files")
            if root_count > 10000:
                print("      ⚠️ WARNING: Too many files in root!")
                print("      Drive may timeout again")
            else:
                print("      ✅ Acceptable (will be moved progressively)")
        
        if year_counts:
            print(f"   Year subfolders: {sum(year_counts.values()):,} files")
            for year in sorted(year_counts.keys()):
                print(f"      {year}/: {year_counts[year]:,} files")
    
    # Sample quality check
    if len(all_files) > 0:
        print("\n📋 Sample Quality Check (3 random files):")
        import random
        sample = random.sample(all_files, min(3, len(all_files)))
        
        for fpath in sample:
            fname = os.path.basename(fpath)
            try:
                with open(fpath, 'r') as f:
                    data = json.load(f)
                    mda_len = len(data.get('item_7', ''))
                    has_mda = mda_len > 100
                    print(f"   {fname}: {'✅' if has_mda else '❌'} MD&A ({mda_len:,} chars)")
            except Exception as e:
                print(f"   {fname}: ⚠️ Error - {e}")

else:
    print("❌ No extraction directory found")
    print(f"   Expected: {extracted_dir}")