# Federal Account Collection by Agency

This notebook collects federal account data from the USASpending.gov API at **quarter-only** granularity, using previously collected agency files as the source of tasks. It builds (fiscal year, quarter, agency) tasks, fetches them concurrently, writes one file per fiscal year, and logs failed requests for a separate retry run.

## 🎯 **Collection Strategy**

**Data Flow**: Agency Files → Task Building → API Collection → Per-Year Files → Retry Logic

**Key Features**:
- **Quarter-Only Focus**: Converts periods to quarters, filters for valid quarters (1-4)
- **Defensive Programming**: Safe CSV reading, column normalization, duplicate handling
- **Concurrent Processing**: ThreadPoolExecutor with configurable workers (default: 24)
- **Retry Run**: Failed requests are logged to a CSV and re-fetched by a separate `retry` mode, deriving quarters from periods where needed
- **Per-Year Output**: Individual files for each fiscal year with merge capabilities

---

In [9]:
import os, re, time, pandas as pd, requests
from concurrent.futures import ThreadPoolExecutor, as_completed
from requests.adapters import HTTPAdapter, Retry

## 📦 Essential Imports & Dependencies

Core libraries for the federal account collection system:
- **`requests`**: HTTP API calls with session management
- **`pandas`**: Data manipulation and CSV operations  
- **`concurrent.futures`**: ThreadPoolExecutor for parallel processing
- **`HTTPAdapter/Retry`**: Custom retry logic and session configuration
- **`os, re, time`**: File operations, regex patterns, and timing controls

In [10]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## 💾 Google Drive Integration for Colab

**Purpose**: Mount Google Drive for data persistence in Colab environments

The cell above mounts Google Drive at `/content/drive`, and the paths defined in the next cell point into `/content/drive/MyDrive/`:
- **Colab**: Drive is mounted and all input and output paths resolve under `MyDrive/USASpendingResults/`
- **Local**: Not handled. Outside Colab the `google.colab` import fails, and the hard-coded paths must be changed by hand

The mounting process enables:
- ✅ Persistent data storage across Colab sessions
- ✅ Access to existing agency CSV files 
- ✅ Saving results to Google Drive for collaboration
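
The notebook hard-codes its Colab paths. A minimal sketch of one way to fall back to a local folder when Drive is not mounted follows; `resolve_base_dir` and its defaults are hypothetical and not part of the notebook.

```python
import os

def resolve_base_dir(drive_root="/content/drive/MyDrive",
                     local_root=".",
                     subdir="USASpendingResults"):
    """Return the results folder under Drive if it is mounted, else under local_root.

    Hypothetical helper: the notebook does not define it.
    """
    root = drive_root if os.path.isdir(drive_root) else local_root
    return os.path.join(root, subdir)

base = resolve_base_dir()
output_dir = os.path.join(base, "federal_accounts")
agency_dir = os.path.join(base, "agency")
print(output_dir, agency_dir)
```

With this approach the same cells could run in both environments, at the cost of one extra indirection when reading the path constants.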

In [11]:
SPENDING_URL = "https://api.usaspending.gov/api/v2/spending/"
OUTPUT_DIR   = "/content/drive/MyDrive/USASpendingResults/federal_accounts"
AGENCY_DIR   = "/content/drive/MyDrive/USASpendingResults/agency"


## ⚙️ API Configuration & Constants

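**Core Settings** for the USASpending.gov API integration:

- **`SPENDING_URL`**: USASpending.gov `/api/v2/spending/` endpoint, queried with `type` set to `federal_account`
- **`OUTPUT_DIR`**: Google Drive folder that receives the per-year federal account files and the failure log
- **`AGENCY_DIR`**: Google Drive folder holding the `agency_FY*.csv` files used to build tasks

Concurrency is set per call through `max_workers` (default: 24). The notebook defines no request delay or rate-limit constant.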

**Strategy**: Quarter-only collection approach
- Focus on Q1, Q2, Q3, Q4 data points
- Avoids monthly/period complexity while capturing seasonal patterns
- Reduces API load while maintaining comprehensive coverage

In [12]:
# ---------------- HTTP session (no retries) ----------------
def setup_session():
    s = requests.Session()
    s.mount("https://", HTTPAdapter(max_retries=Retry(
        total=0, backoff_factor=0.0, status_forcelist=[500,502,503,504], allowed_methods=["POST"]
    )))
    return s

## 🔗 Robust HTTP Session Configuration

**Shared session** with transport-level retries disabled:

**Retry Strategy:**
- **No automatic retries**: `Retry(total=0)` means each request is attempted exactly once
- **No backoff**: `backoff_factor=0.0`
- **Status codes**: 500, 502, 503, 504 are listed in `status_forcelist` but have no effect while `total=0`; 429 is not in the list
- **HTTPAdapter**: Mounted for `https://` only
- **Recovery**: Failed requests are logged and re-fetched later by the `retry` mode

**Session Notes:**
- ✅ Connection pooling for repeated requests to the same host
- ✅ One place to change retry settings later
- ⚠️ No timeout is passed to `session.post`, so a stalled connection can block a worker thread
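
The `setup_session()` cell above sets `total=0`, so the adapter never retries. For comparison, here is a sketch of a variant with transport-level retries turned on. The values (3 retries, 0.3 backoff factor, 429 added to the status list) and the name `setup_session_with_retries` are illustrative assumptions, not the notebook's settings.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def setup_session_with_retries(total=3, backoff_factor=0.3):
    """Hypothetical variant of setup_session() with retries enabled."""
    retry = Retry(
        total=total,
        backoff_factor=backoff_factor,
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["POST"],  # POST is not retried unless listed here
    )
    s = requests.Session()
    s.mount("https://", HTTPAdapter(max_retries=retry))
    return s

session = setup_session_with_retries()
```

Retrying a POST is only reasonable here on the assumption that the spending request is a read-only query. With retries enabled at this level, fewer requests would likely reach the failure log.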

In [13]:
# --- helper: period -> quarter (1..12 -> 1..4) ---
def period_to_quarter(p):
    try:
        p = int(p)
        if 1 <= p <= 12:
            return ((p - 1) // 3) + 1
    except Exception:
        pass
    return None


## 🛠️ Essential Helper Functions

**Core utilities** for data processing and API interaction:

### `period_to_quarter()` - Period Conversion
- **Purpose**: Maps a fiscal period (1-12) to a fiscal quarter (1-4)
- **Returns**: The quarter as an integer, or `None` for non-numeric or out-of-range input

### `safe_read_csv()` - Defensive File Reading
- **Purpose**: Reads a CSV file without raising
- **Protection**: Handles missing files, zero-byte files, files with no rows or columns, and parse errors
- **Returns**: DataFrame, or `None` on any failure
- **Critical**: Prevents crashes when agency files are missing/corrupted

### `fetch_federal_accounts_one()` - API Data Retrieval  
- **Purpose**: Fetches federal account data from USASpending.gov for one task
- **Input**: Session + fiscal year + agency ID + quarter
- **Output**: A `(rows, failure)` pair; `rows` is empty and `failure` is a dict when the request fails
- **Features**: JSON parsing, failure records with status and reason, no exception propagation
- **Rate Limiting**: None; no delay is applied between requests

In [14]:
# ---------------- Safe CSV read helper ----------------
def safe_read_csv(path):
    if not os.path.exists(path):
        return None
    if os.path.getsize(path) == 0:
        return None
    try:
        df = pd.read_csv(path)
        if df is None or df.empty or df.columns.size == 0:
            return None
        return df
    except pd.errors.EmptyDataError:
        return None
    except Exception:
        return None

## 📋 Dynamic Task Building from Agency Files

**Task generation** based on existing agency data files:

### File Discovery Process
1. **Scan Directory**: Looks for `agency_FY*.csv` files in `AGENCY_DIR`
2. **Read Agency Data**: Uses `safe_read_csv()` and skips unreadable or empty files
3. **Normalize Columns**: Strips and lowercases column names, then copies `fiscal_quarter`/`fiscal_period` to `quarter`/`period` when those are absent
4. **Extract Identifiers**: Takes the agency ID from the `id` column (digits only), falling back to `code` only when no `id` value is usable

### Period-to-Quarter Conversion Strategy
- **Monthly Data**: Converts periods 1-12 to quarters 1-4
- **Quarter Mapping**: P1-P3→Q1, P4-P6→Q2, P7-P9→Q3, P10-P12→Q4
- **Deduplication**: Ensures unique (agency, fiscal_year, quarter) combinations
- **Validation**: Filters out invalid quarters (outside 1-4 range)
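
The mapping above is integer division on the period number. This is a small self-contained check that restates the notebook's `period_to_quarter` formula:

```python
def period_to_quarter(p):
    """Map fiscal period 1..12 to quarter 1..4; return None for anything else."""
    try:
        p = int(p)
    except (TypeError, ValueError):
        return None
    return (p - 1) // 3 + 1 if 1 <= p <= 12 else None

mapping = {p: period_to_quarter(p) for p in range(1, 13)}
print(mapping)
```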

### Task Structure
Each task contains:
- `fy`: Fiscal year to collect
- `quarter`: Fiscal quarter (1-4)
- `agency_id`: Digits-only agency identifier
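
This sketch shows how raw agency rows reduce to unique tasks, using plain Python instead of pandas. The sample rows are made up for illustration, and `digits_only` is a simplified copy of the helper defined inside the notebook's task builder.

```python
import re

def digits_only(v):
    """Strip a trailing '.0' and all non-digits; return None if nothing is left."""
    if v is None:
        return None
    s = re.sub(r"\.0+$", "", str(v).strip())
    digits = re.sub(r"\D", "", s)
    return digits or None

# Made-up rows: (fy, period, raw agency id)
rows = [(2023, 1, "252.0"), (2023, 2, 252), (2023, 4, " 252 "), (2023, 4, "n/a")]

# A set removes duplicate (fy, quarter, agency_id) combinations
tasks = sorted({
    (fy, (period - 1) // 3 + 1, digits_only(agency))
    for fy, period, agency in rows
    if digits_only(agency) is not None
})
print(tasks)  # [(2023, 1, '252'), (2023, 2, '252')]
```

Periods 1 and 2 collapse into the same Q1 task, and the row with no usable agency ID is dropped.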

In [15]:
# ---------------- Build tasks from agency files (QUARTER-ONLY) ----------------
def build_tasks_from_agency_year_files(agency_dir):
    """
    Returns tasks df with: fy, quarter, agency_id
    - Reads agency_FY*.csv
    - Unifies fiscal_quarter/fiscal_period
    - If only period present, converts to quarter
    - Drops rows without a valid quarter (1..4) or agency_id
    """
    files = [f for f in os.listdir(agency_dir) if f.endswith(".csv") and f.startswith("agency_FY")]
    if not files:
        print(f"⚠️ No agency_FY*.csv in: {agency_dir}")
        return pd.DataFrame(columns=["fy","quarter","agency_id"])

    parts = []
    for f in sorted(files):
        p = os.path.join(agency_dir, f)
        df = safe_read_csv(p)
        if df is not None:
            df.columns = [c.strip() for c in df.columns]
            parts.append(df)
    if not parts:
        print("⚠️ Agency files unreadable/empty.")
        return pd.DataFrame(columns=["fy","quarter","agency_id"])

    all_agency = pd.concat(parts, ignore_index=True)
    all_agency = all_agency.rename(columns={c: c.lower() for c in all_agency.columns})

    # unify time columns
    if "fiscal_quarter" in all_agency.columns and "quarter" not in all_agency.columns:
        all_agency["quarter"] = all_agency["fiscal_quarter"]
    if "fiscal_period" in all_agency.columns and "period" not in all_agency.columns:
        all_agency["period"] = all_agency["fiscal_period"]

    # keep
    keep = [c for c in ["fy","quarter","period","id","code"] if c in all_agency.columns]
    all_agency = all_agency[keep].copy()

    # clean numerics
    all_agency["fy"] = pd.to_numeric(all_agency.get("fy"), errors="coerce").astype("Int64")
    if "quarter" in all_agency.columns:
        all_agency["quarter"] = pd.to_numeric(all_agency["quarter"], errors="coerce").astype("Int64")
    else:
        all_agency["quarter"] = pd.NA
    if "period" in all_agency.columns:
        all_agency["period"] = pd.to_numeric(all_agency["period"], errors="coerce").astype("Int64")
    else:
        all_agency["period"] = pd.NA

    # prefer agency id; fallback to code (digits only)
    def digits_only(v):
        if pd.isna(v): return None
        s = str(v).strip()
        if s.lower() in {"", "nan", "none"}: return None
        s = re.sub(r"\.0+$", "", s)
        digits = re.sub(r"\D", "", s)
        return digits if digits else None

    all_agency["agency_id"] = (
        (all_agency["id"] if "id" in all_agency.columns else pd.Series(dtype=object)).apply(digits_only)
    )
    if all_agency["agency_id"].isna().all() and "code" in all_agency.columns:
        all_agency["agency_id"] = all_agency["code"].apply(digits_only)

    # if quarter is NA but period exists, convert to quarter
    need_q = all_agency["quarter"].isna() & all_agency["period"].notna()
    if need_q.any():
        all_agency.loc[need_q, "quarter"] = all_agency.loc[need_q, "period"].apply(period_to_quarter)

    # keep only valid quarters 1..4
    all_agency["quarter"] = pd.to_numeric(all_agency["quarter"], errors="coerce").astype("Int64")
    all_agency = all_agency[all_agency["quarter"].isin([1,2,3,4])]

    # build tasks (quarter-only)
    tasks = (all_agency[
        all_agency["fy"].notna() & all_agency["agency_id"].notna() & all_agency["quarter"].notna()
    ][["fy","quarter","agency_id"]]
        .drop_duplicates()
        .sort_values(["fy","agency_id","quarter"])
        .reset_index(drop=True)
    )

    print(f"🧾 Built {len(tasks)} unique (fy, quarter, agency_id) tasks from {len(files)} agency files.")
    return tasks


## 🚀 Federal Account API Collection Engine

**Single-request collection** of federal account data; parallelism is added by `fetch_federal_accounts_for_tasks()`:

### Core Collection Function: `fetch_federal_accounts_one()`
**Input Parameters:**
- `session`: Shared `requests.Session`
- `fy`: Fiscal year for collection
- `agency_id`: Digits-only agency identifier
- `quarter`: Fiscal quarter (1-4)
- `period`: Optional fiscal period, used only to derive the quarter when `quarter` is missing

**API Request Structure:**
- **Endpoint**: USASpending.gov `/api/v2/spending/` 
- **Method**: POST with JSON payload
- **Payload**: `type` set to `federal_account`, with `fy`, `quarter`, and `agency` filters sent as strings
- **Fields kept**: `id`, `code`, `type`, `name`, `amount`, `account_number`

**Response Processing:**
- ✅ Rows are built from the `results` list of the JSON response
- ✅ Each row is tagged with `fy`, `fiscal_quarter`, and `agency`
- ✅ Non-2xx responses and exceptions return an empty list plus a failure record (status code and the first 500 characters of the body, or the exception message)
- ✅ A successful response with no results returns an empty list and no failure record

**Rate Limiting Strategy:**
- No sleep or delay is applied between requests
- Request volume is controlled only through `max_workers`
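
The request body can be sketched as a small builder. `build_payload` is a hypothetical helper (the notebook builds the same dictionary inline), and the agency value `"252"` is an arbitrary example.

```python
import json

def build_payload(fy, quarter, agency_id):
    """Build the JSON body sent to /api/v2/spending/ (filter values as strings)."""
    return {
        "type": "federal_account",
        "filters": {
            "fy": str(int(fy)),
            "quarter": str(int(quarter)),
            "agency": str(agency_id),
        },
    }

payload = build_payload(2023, 2, "252")
print(json.dumps(payload))
```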

In [16]:
# ---------------- Single API call (send QUARTER only) ----------------
def fetch_federal_accounts_one(session, fy, agency_id, quarter=None, period=None):
    """
    Always calls /spending with QUARTER (period is ignored).
    Payload: {"type":"federal_account", "filters":{"fy":..., "quarter":..., "agency": ...}}
    """
    if pd.isna(quarter):
        # try to derive from period if supplied, otherwise fail fast
        if pd.notna(period):
            quarter = period_to_quarter(period)
        if pd.isna(quarter):
            return [], {"fy": int(fy), "agency": str(agency_id), "reason": "no valid quarter"}

    filters = {"fy": str(int(fy)), "quarter": str(int(quarter)), "agency": str(agency_id)}
    payload = {"type": "federal_account", "filters": filters}

    try:
        r = session.post(SPENDING_URL, json=payload)
        if not r.ok:
            return [], {
                "fy": int(fy),
                "quarter": int(quarter),
                "agency": str(agency_id),
                "status": r.status_code,
                "reason": r.text[:500]
            }
        items = (r.json().get("results", []) or [])
        rows = [{
            "fy": int(fy),
            "fiscal_quarter": int(quarter),
            "fiscal_period": None,  # quarter-only workflow
            "agency": str(agency_id),
            "id": it.get("id"),
            "code": it.get("code"),
            "type": it.get("type"),
            "name": it.get("name"),
            "amount": it.get("amount"),
            "account_number": it.get("account_number"),
        } for it in items]
        return rows, None
    except Exception as e:
        return [], {
            "fy": int(fy),
            "quarter": int(quarter) if pd.notna(quarter) else None,
            "agency": str(agency_id),
            "reason": str(e)
        }


## 💾 Yearly Data Consolidation & Storage

**Per-year aggregation** and persistent storage:

### `merge_save_yearly()` Function
**Purpose**: Writes collected rows to one CSV per fiscal year, optionally merging with an existing file

**Process Flow:**
1. **Normalization**: Coerces `fy`, `fiscal_quarter`, and `fiscal_period` to nullable integers and `amount` to float
2. **Grouping**: Splits the rows by `fy`
3. **Merge or Overwrite**: With `overwrite=False`, appends to the existing file for that year; with `overwrite=True`, replaces it
4. **Deduplication**: Drops duplicates on (`fy`, `fiscal_quarter`, `agency`, `id`, `account_number`) using `.drop_duplicates()`
5. **CSV Export**: Sorts and saves to `federal_accounts_FY<year>.csv`
6. **Empty Files**: With `overwrite=True` and `write_empty_for_missing=True`, writes a header-only file for each expected year that returned no rows

**File Organization:**
- **Naming Convention**: `federal_accounts_FY2019.csv`, `federal_accounts_FY2020.csv`, etc.
- **Structure**: One row per federal account, agency, and quarter within that fiscal year
- **Columns**: `fy`, `fiscal_quarter`, `fiscal_period`, `agency`, `id`, `code`, `type`, `name`, `amount`, `account_number`
- **Storage**: Persists to `OUTPUT_DIR` for subsequent analysis

**Benefits:**
- ✅ Year-based organization for easy temporal analysis
- ✅ Deduplication on key columns limits repeated rows when runs are merged
- ✅ CSV format enables broad compatibility with analysis tools
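
This sketch isolates the group-and-deduplicate step using made-up rows. It keeps the frames in memory instead of writing files; the key columns and file-name pattern follow the notebook's `merge_save_yearly` cell.

```python
import pandas as pd

# Made-up rows; the second row repeats the first
rows = pd.DataFrame([
    {"fy": 2023, "fiscal_quarter": 1, "agency": "252", "id": 10,
     "account_number": "012-0100", "amount": 5.0},
    {"fy": 2023, "fiscal_quarter": 1, "agency": "252", "id": 10,
     "account_number": "012-0100", "amount": 5.0},
    {"fy": 2024, "fiscal_quarter": 2, "agency": "252", "id": 11,
     "account_number": "012-0200", "amount": 7.5},
])
keys = ["fy", "fiscal_quarter", "agency", "id", "account_number"]

# One deduplicated frame per fiscal year, keyed by its target file name
per_year = {
    f"federal_accounts_FY{int(fy)}.csv": grp.drop_duplicates(subset=keys, ignore_index=True)
    for fy, grp in rows.groupby("fy")
}
for name, frame in per_year.items():
    print(name, len(frame))
```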

In [17]:
def fetch_federal_accounts_for_tasks(tasks_df, max_workers=24):
    if tasks_df is None or tasks_df.empty:
        return pd.DataFrame(), pd.DataFrame()

    s = setup_session()
    results, fails = [], []

    def _run(row):
        # quarter-only; pass None for period
        return fetch_federal_accounts_one(
            s, row.fy, row.agency_id, row.quarter, None
        )

    with ThreadPoolExecutor(max_workers=max_workers) as ex:
        futs = [ex.submit(_run, r) for r in tasks_df.itertuples(index=False)]
        for fut in as_completed(futs):
            recs, fail = fut.result()
            if recs: results.extend(recs)
            if fail:  fails.append(fail)

    s.close()
    return pd.DataFrame(results), pd.DataFrame(fails)


## 📊 Comprehensive Failure Tracking System

**Logging** of collection failures for analysis and retry planning:

### `write_failures()` Function
**Purpose**: Saves the failure records from a run so that the `retry` mode can reprocess them

**Failure Logging Process:**
1. **Collection**: Each failed request yields one record from the fetch function
2. **CSV Logging**: All records are written to `failures_federal_accounts.csv` in `OUTPUT_DIR`, replacing any previous file
3. **Cleanup**: When a run has no failures, a stale failure file is deleted

**Critical Understanding**: 
**A failure record means the request itself failed (non-2xx HTTP status or an exception), not that the API returned no data**
- Agency/year/quarter combinations with no federal account data return an empty result and are not logged
- The `status` and `reason` columns indicate what went wrong for each failed request
- Logged failures are therefore candidates for a retry run

**Failure Log Structure:**
- **Columns**: `fy`, `quarter`, `agency`, `status` (HTTP failures only), `reason`
- **Uses**: Retry planning, data coverage analysis, reporting gaps
- **Format**: CSV for easy analysis and sharing with stakeholders
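
Failure records produced by the fetch cell carry `fy`, `quarter`, `agency`, and either an HTTP `status` or an exception message in `reason`. Below is a sketch of how such records could be grouped by cause; the sample records and the `classify` function are hypothetical.

```python
from collections import Counter

# Made-up failure records in the shape returned by the fetch function
failures = [
    {"fy": 2023, "quarter": 1, "agency": "252", "status": 429, "reason": "Too Many Requests"},
    {"fy": 2023, "quarter": 2, "agency": "252", "status": 500, "reason": "Server Error"},
    {"fy": 2024, "quarter": 1, "agency": "315", "reason": "Read timed out"},  # exception: no status
]

def classify(rec):
    """Bucket a failure record by HTTP status, or 'exception' when none was recorded."""
    status = rec.get("status")
    if status is None:
        return "exception"
    if status == 429:
        return "rate_limited"
    return "server_error" if status >= 500 else "client_error"

counts = Counter(classify(r) for r in failures)
print(dict(counts))
```

A breakdown like this could help decide whether a retry should use fewer workers (many `rate_limited` entries) or simply run again later.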

In [24]:
def merge_save_yearly(results_df,
                      output_dir=OUTPUT_DIR,
                      overwrite=False,
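                      expected_years=None,              # fiscal years that should have a file after the run
                      write_empty_for_missing=False):   # with overwrite=True, write header-only files for years with no rows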
    if results_df is None:
        results_df = pd.DataFrame()
    os.makedirs(output_dir, exist_ok=True)
    df = results_df.copy()

    # Normalize numerics
    for c in ("fy","fiscal_quarter","fiscal_period","amount"):
        if c in df.columns:
            df[c] = pd.to_numeric(df[c], errors="coerce").astype("Int64" if c!="amount" else float)

    keys = [k for k in ["fy","fiscal_quarter","agency","id","account_number"] if k in df.columns]
    seen_years = set()

    # Write years that have rows
    if not df.empty and "fy" in df.columns:
        for fy, grp in df.groupby("fy", dropna=True):
            fy = int(fy)
            seen_years.add(fy)
            out_path = os.path.join(output_dir, f"federal_accounts_FY{fy}.csv")

            if overwrite:
                combined = grp.copy()
            else:
                existing = safe_read_csv(out_path)
                combined = pd.concat([existing, grp], ignore_index=True) if (existing is not None and not existing.empty) else grp.copy()

            if keys:
                combined.drop_duplicates(subset=keys, inplace=True, ignore_index=True)

            order = [c for c in ["fy","fiscal_quarter","agency","code","id"] if c in combined.columns]
            if order:
                combined.sort_values(order, inplace=True, ignore_index=True)

            combined.to_csv(out_path, index=False)
            print(f"💾 Saved ({'overwrote' if overwrite else 'merged'}) FY {fy} → {out_path}  [{len(combined):,} rows]")

    # Overwrite missing years with an EMPTY file (clears stale data)
    if overwrite and write_empty_for_missing and expected_years:
        # define columns for empty file
        empty_cols = (list(df.columns) if not df.empty else
                      ["fy","fiscal_quarter","agency","id","code","type","name","amount","account_number"])
        for fy in sorted(set(int(y) for y in expected_years)):
            if fy not in seen_years:
                out_path = os.path.join(output_dir, f"federal_accounts_FY{fy}.csv")
                pd.DataFrame(columns=empty_cols).to_csv(out_path, index=False)
                print(f"💾 Saved (overwrote to empty) FY {fy} → {out_path}  [0 rows]")


## 🏃‍♂️ Initial Collection Run - Full Data Harvest

**Primary data collection phase**, run with `main("initial")`:

### Initial Run Strategy
**Scope**: Process ALL tasks generated from agency files
- **Task Source**: Every (fy, quarter, agency_id) combination from existing agency data
- **Approach**: Complete coverage collection across all available agencies
- **Workers**: 24 parallel threads by default, set through `max_workers`

### Execution Flow
1. **Task Building**: Generate the complete task list from the agency files
2. **Parallel Processing**: Launch a ThreadPoolExecutor with `max_workers` threads sharing one session
3. **Result Gathering**: Collect rows and failure records as futures complete
4. **Per-Year Save**: Overwrite each per-year file; expected years with no rows get a header-only file
5. **Failure Log**: Write failed requests to `failures_federal_accounts.csv`

### Reported Metrics
- **Tasks**: Number of unique tasks built
- **Rows**: Number of federal account records collected
- **Failures**: Number of failed requests
- **Per-File Rows**: Row count of each saved file

**Note**: This initial run establishes the baseline dataset and identifies the requests that need a retry. Timing and throughput are not measured.
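
The fan-out pattern in this run can be sketched without the API. `fake_fetch` is a stand-in that returns the same `(rows, failure)` pair shape as the notebook's fetch function; the tasks and the simulated Q4 failure are made up.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fake_fetch(task):
    """Stand-in for the API call: returns (rows, failure) like the notebook's fetch."""
    fy, quarter, agency = task
    if quarter == 4:  # pretend Q4 requests fail
        return [], {"fy": fy, "quarter": quarter, "agency": agency, "reason": "simulated"}
    return [{"fy": fy, "fiscal_quarter": quarter, "agency": agency}], None

tasks = [(2023, q, "252") for q in (1, 2, 3, 4)]
results, fails = [], []

with ThreadPoolExecutor(max_workers=4) as ex:
    futures = [ex.submit(fake_fetch, t) for t in tasks]
    # Results are appended only in this thread, so no lock is needed
    for fut in as_completed(futures):
        rows, fail = fut.result()
        results.extend(rows)
        if fail:
            fails.append(fail)

print(len(results), len(fails))
```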

In [19]:
# ---------------- Save/overwrite failures file ----------------
def write_failures(failures_df, output_dir=OUTPUT_DIR):
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, "failures_federal_accounts.csv")
    if failures_df is not None and not failures_df.empty:
        failures_df.to_csv(path, index=False)
        print(f"⚠️ Failures logged: {len(failures_df):,} → {path}")
    else:
        # delete stale failures if any
        if os.path.exists(path):
            os.remove(path)
            print(f"🗑️ Removed stale failures file: {path}")
        else:
            print("🎉 No failures.")

## 🔄 Intelligent Retry System - Failure Recovery

**Targeted retry mechanism** for failed collection attempts, run with `main("retry")`:

### Retry Run Strategy
**Purpose**: Recover data from requests that failed during an earlier run
- **Target**: Only the tasks listed in `failures_federal_accounts.csv`
- **Approach**: Same fetch path as the initial run; recovered rows are merged into the existing per-year files instead of overwriting them
- **Workers**: `run_retry_federal_accounts()` defaults to 20, while `main("retry")` passes its own default of 24

### Retry Logic Benefits
1. **Selective Processing**: Only attempts previously failed tasks
2. **Reduced Load**: A smaller task set puts less pressure on the API
3. **Improved Success**: Transient API conditions may have cleared
4. **Cost Efficiency**: Avoids re-processing successful collections

### Typical Retry Scenarios
- **Temporary API Issues**: Network timeouts, temporary server errors
- **Rate Limiting**: Many concurrent requests in the initial run may have triggered temporary limits

**Important Note**: Empty results are not logged as failures, so every retried task corresponds to a request that errored. Tasks that fail again are written back to the failure file and can be retried in a later run.
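
This sketch rebuilds retry tasks from a failure log using only the standard library. The log contents are made up, and `to_quarter` is a hypothetical helper that mirrors the notebook's rule of deriving a quarter from a period when the quarter is missing.

```python
import csv
import io

# Made-up failure log; the second row has a period but no quarter
log = io.StringIO(
    "fy,quarter,period,agency,reason\n"
    "2023,1,,252,timeout\n"
    "2023,,5,252,timeout\n"
    "2023,1,,252,HTTP 500\n"
    "2023,9,,252,bad quarter\n"
)

def to_quarter(row):
    """Use the logged quarter if present, else derive it from the period."""
    if row["quarter"]:
        return int(row["quarter"])
    if row["period"]:
        p = int(row["period"])
        return (p - 1) // 3 + 1 if 1 <= p <= 12 else None
    return None

# Keep valid quarters only; the set removes repeated tasks
retry_tasks = sorted({
    (int(r["fy"]), to_quarter(r), r["agency"])
    for r in csv.DictReader(log)
    if to_quarter(r) in (1, 2, 3, 4)
})
print(retry_tasks)  # [(2023, 1, '252'), (2023, 2, '252')]
```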

## 📈 Task Execution & Performance Monitoring

**Run-level reporting** printed by the collection functions:

### Reported Figures
**Printed during a run:**
- **Total Tasks**: Count of unique (fy, quarter, agency_id) tasks built
- **Rows**: Total federal account records returned by the fetch step
- **Failures**: Count of failed requests
- **Per-File Rows**: Row count of each saved per-year file
- **Failure File**: Path of the failure log, or a notice that a stale log was removed

### Not Tracked
1. **Live Progress**: No progress updates are printed while requests are in flight
2. **Execution Time**: Wall-clock time is not measured
3. **Throughput**: Tasks per minute is not computed
4. **Worker Utilization**: Thread efficiency is not tracked

### Collection Summary
**Available after a run:**
- ✅ Total tasks processed
- ✅ Rows collected and failures logged
- ✅ Failure reasons in `failures_federal_accounts.csv`

These figures let data analysts assess collection completeness and decide whether a retry run is needed.
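
Rates can be derived from the counts the notebook prints. `summarize_run` is a hypothetical helper. The task and failure counts below come from the initial run recorded at the end of this notebook, while the 600-second elapsed time is an assumed value for illustration only.

```python
def summarize_run(n_tasks, n_failures, elapsed_seconds):
    """Hypothetical helper: derive rates from counts the notebook already prints."""
    succeeded = n_tasks - n_failures
    return {
        "success_rate": succeeded / n_tasks if n_tasks else 0.0,
        "tasks_per_minute": n_tasks / (elapsed_seconds / 60) if elapsed_seconds else 0.0,
    }

# 3340 tasks and 2881 failures are from the recorded run; 600 s is assumed
summary = summarize_run(n_tasks=3340, n_failures=2881, elapsed_seconds=600)
print(summary)
```

A success here means only that the request did not error; it includes tasks that returned no rows.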

In [25]:
def run_initial_federal_accounts(agency_dir=AGENCY_DIR, output_dir=OUTPUT_DIR,
                                 max_workers=24,
                                 overwrite_years=True):
    tasks = build_tasks_from_agency_year_files(agency_dir)
    if tasks.empty:
        print("⚠️ No tasks built; aborting initial run.")
        return

    results_df, failures_df = fetch_federal_accounts_for_tasks(tasks, max_workers=max_workers)
    print(f"✅ Initial fetch: rows={len(results_df):,}  failures={len(failures_df):,}")

    expected_years = sorted(tasks["fy"].dropna().astype(int).unique().tolist())
    merge_save_yearly(
        results_df,
        output_dir=output_dir,
        overwrite=overwrite_years,
        expected_years=expected_years,          # every fiscal year present in the task list
        write_empty_for_missing=True            # clear stale files for years that returned no rows
    )
    write_failures(failures_df, output_dir=output_dir)


## 🔄 Results Processing & Data Consolidation

**Post-collection data processing** and organization:

### Results Processing Pipeline
1. **Data Aggregation**: Combines rows from all successful requests into one list
2. **DataFrame Conversion**: Transforms collected records into a pandas DataFrame
3. **Type Normalization**: Coerces `fy`, `fiscal_quarter`, and `fiscal_period` to nullable integers and `amount` to float
4. **Deduplication**: Removes duplicate rows on (`fy`, `fiscal_quarter`, `agency`, `id`, `account_number`)
5. **Sorting**: Orders rows by `fy`, `fiscal_quarter`, `agency`, `code`, `id` before saving

### Yearly Organization Strategy
**Purpose**: Group federal accounts by fiscal year for temporal analysis
- **File Structure**: Separate CSV files for each fiscal year
- **Benefits**: Enables year-over-year analysis and trend identification
- **Format**: `federal_accounts_FY<year>.csv` naming convention

### Data Output Features
- ✅ **Structured Export**: Clean CSV files ready for analysis
- ✅ **Metadata Preservation**: Maintains agency, year, quarter context
- ✅ **Scalable Format**: Handles large datasets efficiently
- ✅ **Analysis Ready**: Compatible with standard data science tools

## 🔄 Retry Execution Engine

**Targeted recovery system** for failed collection attempts:

### Retry Processing Logic
**Input**: `failures_federal_accounts.csv` from the previous run
**Strategy**: Re-fetch only the logged (fy, quarter, agency) combinations

### Retry Execution Flow
1. **Failure File Check**: Exits early when the file is missing or empty; deletes it when unreadable
2. **Quarter Derivation**: Derives the quarter from the period where it is missing and drops rows without a valid quarter
3. **Task Building**: Deduplicates the remaining rows into tasks
4. **Re-fetch**: Runs the same fetch path with `max_workers` threads and no added delay
5. **Updated Results**: Merges recovered rows into the per-year files and rewrites the failure file with whatever still failed

### Retry Benefits
- **Efficiency**: Only processes previously failed tasks
- **Data Completeness**: Raises the share of tasks that were fetched without error
- **Cost Reduction**: Avoids redundant successful task re-processing

### Expected Outcomes
- **Technical Recovery**: Resolves temporary API/network issues
- **Coverage Improvement**: Increases overall dataset completeness

In [21]:
# ---------------- Retry run (convert any period to QUARTER) ----------------
def run_retry_federal_accounts(output_dir=OUTPUT_DIR, max_workers=20):
    """
    Reads failures_federal_accounts.csv, converts any 'period' to 'quarter',
    retries with QUARTER ONLY, merges successes, overwrites failures.
    """
    fail_path = os.path.join(output_dir, "failures_federal_accounts.csv")

    if not os.path.exists(fail_path) or os.path.getsize(fail_path) == 0:
        print("🎉 No failure file to retry.")
        return

    df = safe_read_csv(fail_path)
    if df is None or df.empty or df.columns.size == 0:
        try:
            os.remove(fail_path)
            print(f"🗑️ Deleted invalid failure file: {fail_path}")
        except Exception:
            pass
        return

    # Normalize columns we need
    for col in ["fy","quarter","period","agency"]:
        if col not in df.columns:
            df[col] = pd.NA
    df["fy"] = pd.to_numeric(df["fy"], errors="coerce").astype("Int64")
    df["quarter"] = pd.to_numeric(df["quarter"], errors="coerce").astype("Int64")
    df["period"]  = pd.to_numeric(df["period"],  errors="coerce").astype("Int64")

    # derive quarter from period where missing
    need_q = df["quarter"].isna() & df["period"].notna()
    if need_q.any():
        df.loc[need_q, "quarter"] = df.loc[need_q, "period"].apply(period_to_quarter)
    df = df[df["quarter"].isin([1,2,3,4])]

    # Build quarter-only tasks
    tasks = (df[["fy","quarter","agency"]]
             .dropna(subset=["fy","agency","quarter"])
             .rename(columns={"agency":"agency_id"})
             .drop_duplicates()
             .reset_index(drop=True))

    if tasks.empty:
        print("🎉 No valid failure tasks to retry.")
        os.remove(fail_path)
        return

    # Retry with QUARTER only
    results_df, failures_df = fetch_federal_accounts_for_tasks(tasks, max_workers=max_workers)
    print(f"🔁 Retry fetch (quarter-only): rows={len(results_df):,}  failures={len(failures_df):,}")

    merge_save_yearly(results_df, output_dir=output_dir)
    write_failures(failures_df, output_dir=output_dir)


## 🎮 Master Controller - Complete Workflow Orchestration

**Orchestration** of the two run modes through `main(mode)`:

### Full Workflow Management
**`main("initial")`** calls `run_initial_federal_accounts()`:

1. **🏗️ Initialization Phase**
   - Task generation from agency files
   - Abort with a warning when no tasks can be built

2. **🚀 Primary Collection Phase**  
   - Parallel API data collection across all tasks
   - Rows and failure records gathered as futures complete

3. **💾 Data Processing Phase**
   - Per-year CSV export, overwriting existing files
   - Header-only files for expected years with no rows

4. **📊 Reporting Phase**
   - Row and failure counts printed
   - Failure log written, or a stale log removed

**`main("retry")`** calls `run_retry_federal_accounts()`, which re-fetches the logged failures and merges recovered rows into the existing per-year files.

### Master Controller Benefits
- ✅ **Single Entry Point**: One call per mode
- ✅ **Shared Defaults**: `AGENCY_DIR`, `OUTPUT_DIR`, and `max_workers=24` are passed to both modes
- ✅ **Mode Validation**: Any mode other than `"initial"` or `"retry"` raises an `AssertionError`

## 🎯 Execution Example & System Demo

**Example run** of the complete federal account collection system:

### Demo Execution Command
```python
main("initial")
```

### Expected Execution Flow
1. **📂 Task Discovery**: Scans agency files and builds collection tasks
2. **⚡ Parallel Processing**: Launches up to 24 worker threads for API collection  
3. **💾 Data Storage**: Automatic per-year CSV file generation
4. **📈 Reporting**: Prints row and failure counts and writes the failure log

### Output From the Recorded Run
- **Total Tasks**: 3,340 unique tasks built from 8 agency files
- **Rows Collected**: 8,693
- **Failures**: 2,881 failed requests logged
- **Data Files**: `federal_accounts_FY<year>.csv` files, header-only for years that returned no rows
- **Failure Log**: `failures_federal_accounts.csv`, used as input to the retry run

**Note**: These counts come from the run shown at the end of this notebook; other runs will differ with agency file coverage and API responsiveness.

In [22]:
# ---------------- Controller ----------------
def main(mode, agency_dir=AGENCY_DIR, output_dir=OUTPUT_DIR,
         max_workers=24):
    assert mode in {"initial","retry"}, "mode must be 'initial' or 'retry'"
    if mode == "initial":
        run_initial_federal_accounts(agency_dir, output_dir, max_workers)
    elif mode == "retry":
        run_retry_federal_accounts(output_dir, max_workers)
    print("✅ Done.")

# ===================== RUN =====================
# Initial run: build from agency files, fetch + save + log failures
# main("initial")

# Retry run: reprocess failures_federal_accounts.csv
# main("retry")

In [28]:
main("initial")

🧾 Built 3340 unique (fy, quarter, agency_id) tasks from 8 agency files.




✅ Initial fetch: rows=8,693  failures=2,881
💾 Saved (overwrote) FY 2023 → /content/drive/MyDrive/USASpendingResults/federal_accounts/federal_accounts_FY2023.csv  [1,009 rows]
💾 Saved (overwrote) FY 2024 → /content/drive/MyDrive/USASpendingResults/federal_accounts/federal_accounts_FY2024.csv  [7,684 rows]
💾 Saved (overwrote to empty) FY 2017 → /content/drive/MyDrive/USASpendingResults/federal_accounts/federal_accounts_FY2017.csv  [0 rows]
💾 Saved (overwrote to empty) FY 2018 → /content/drive/MyDrive/USASpendingResults/federal_accounts/federal_accounts_FY2018.csv  [0 rows]
💾 Saved (overwrote to empty) FY 2019 → /content/drive/MyDrive/USASpendingResults/federal_accounts/federal_accounts_FY2019.csv  [0 rows]
💾 Saved (overwrote to empty) FY 2020 → /content/drive/MyDrive/USASpendingResults/federal_accounts/federal_accounts_FY2020.csv  [0 rows]
💾 Saved (overwrote to empty) FY 2021 → /content/drive/MyDrive/USASpendingResults/federal_accounts/federal_accounts_FY2021.csv  [0 rows]
💾 Saved (overw

In [39]:
main("retry")

🔁 Retry fetch (quarter-only): rows=56  failures=0
💾 Saved (merged) FY 2020 → /content/drive/MyDrive/USASpendingResults/federal_accounts/federal_accounts_FY2020.csv  [7,425 rows]
💾 Saved (merged) FY 2021 → /content/drive/MyDrive/USASpendingResults/federal_accounts/federal_accounts_FY2021.csv  [7,667 rows]
🗑️ Removed stale failures file: /content/drive/MyDrive/USASpendingResults/federal_accounts/failures_federal_accounts.csv
✅ Done.
