# 🏆 Federal Awards Data Collection System

## 📋 Overview
**Advanced awards data collection** from USASpending.gov API using federal account foundation data.

This notebook implements **Stage 6** of the federal funding data pipeline, the final stage, which collects award data for each federal account slice. With the endpoint currently in use, the results are aggregated amounts per slice rather than transaction-level records (see the endpoint note below).

**Pipeline Position**: `Budget Function → Subfunction → Agency → Federal Account → Recipient → **Awards**`

**Input**: `federal_accounts_*.csv` files (from previous pipeline stages)  
**Output**: `awards_FY{YYYY}_Q{Q}.csv` + `failures_FY{YYYY}_Q{Q}.csv` files

## ⚠️ **Important API Endpoint Note**
**Current Implementation**: Uses `/api/v2/spending/` with `type="award"`
- **Purpose**: Aggregated award categories per federal account slice
- **Alternative**: For true award-level records, consider `/api/v2/search/spending_by_award/` with pagination
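
A minimal sketch of what the paginated alternative could look like. It assumes the search endpoint accepts `page` and `limit` in the request body and reports `page_metadata.hasNext` in the response; both should be checked against the current API documentation. `post_json` is a hypothetical callable (for example a thin wrapper around `session.post(...).json()`), injected so the loop can be run offline.

```python
from typing import Callable, Dict, Iterator, List

def iter_award_pages(
    post_json: Callable[[Dict], Dict],
    base_payload: Dict,
    limit: int = 100,
    max_pages: int = 1000,
) -> Iterator[List[Dict]]:
    """Yield one list of results per page until no next page is reported.

    Assumes the response carries `results` and `page_metadata.hasNext`.
    `max_pages` is a safety cap chosen for this sketch, not an API rule.
    """
    for page in range(1, max_pages + 1):
        payload = {**base_payload, "page": page, "limit": limit}
        data = post_json(payload)
        yield data.get("results", [])
        if not data.get("page_metadata", {}).get("hasNext", False):
            break


# Offline stand-in for the real endpoint: three pages of fake results.
def fake_post(payload: Dict) -> Dict:
    page = payload["page"]
    return {
        "results": [{"page": page}],
        "page_metadata": {"page": page, "hasNext": page < 3},
    }

pages = list(iter_award_pages(fake_post, {"filters": {}}))
```

Passing the request function in as a parameter keeps the paging logic independent of the session setup used elsewhere in this notebook.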

**Key Features:**
- ✅ **Type-specific API calls**: Uses `type="award"` filtering on spending endpoint
- ✅ **Composite key deduplication**: Prevents duplicate award records  
- ✅ **Period-scoped organization**: Separate files per fiscal year/quarter
- ✅ **Parallel processing**: 50 concurrent workers for high-performance collection
- ✅ **Intelligent retry system**: Targeted failure recovery with automatic cleanup

In [2]:
import requests
import time
import os
import pandas as pd
from concurrent.futures import ThreadPoolExecutor, as_completed
from requests.adapters import HTTPAdapter, Retry
import logging
import urllib3

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
logging.getLogger("urllib3.connectionpool").setLevel(logging.ERROR)

## 📦 Imports & Dependencies Setup

**Core libraries** for awards data collection system:
- **`requests`**: HTTP API calls with session management
- **`pandas`**: DataFrame operations and CSV processing  
- **`ThreadPoolExecutor`**: Parallel processing (50 concurrent workers)
- **`HTTPAdapter/Retry`**: Custom session configuration
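- **`logging/urllib3`**: Suppression of `InsecureRequestWarning` and connection-pool log noise

**Key Setup Actions:**
- ✅ **Suppresses urllib3 warnings**: Disables `InsecureRequestWarning`, which urllib3 emits only for unverified HTTPS requests; no call in this notebook passes `verify=False`, so this is precautionary
- ✅ **Configures logging levels**: Sets `urllib3.connectionpool` to `ERROR`, which hides pool warnings that can appear when more threads than pooled connections are active  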
- ✅ **Imports threading utilities**: Enables high-performance parallel API calls

**Purpose**: Clean execution environment with optimized HTTP handling for bulk API operations

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## 💾 Google Colab Drive Integration

**Purpose**: Mount Google Drive for persistent data storage across Colab sessions

**Functionality:**
- **Colab Environment**: Mounts Google Drive at `/content/drive`; files appear under `/content/drive/MyDrive/`
- **Data Persistence**: Ensures awards data survives session restarts
- **Input Access**: Enables access to federal account CSV files from previous pipeline stages
- **Output Storage**: Saves awards results to Google Drive for analysis and sharing

In [4]:
# ✅ Setup session (automatic retries disabled)
def setup_session():
    """
    Creates a requests session with an HTTPAdapter mounted on https://.
    Automatic retries are disabled (total=0); failed calls are recorded
    by the caller and retried later through the failures_*.csv files.
    """
    session = requests.Session()
    retries = Retry(
        total=0,  # no transport-level retries
        backoff_factor=1.0,
        status_forcelist=[500, 502, 503, 504],
        allowed_methods=["POST"]
    )
    adapter = HTTPAdapter(max_retries=retries)
    session.mount('https://', adapter)
    return session


## 🔗 HTTP Session Configuration

**Creates a shared HTTP session** with **deliberately disabled retries** (`total=0`):

### Session Strategy
**Important**: Automatic retries are **disabled with `total=0`**, so no request is retried at the transport level and `backoff_factor` never comes into play
- **No Automatic Retries**: Application handles retry logic explicitly at higher levels
- **Manual Control**: Enables precise control over retry behavior and failure handling
- **Status Force List**: Lists 500, 502, 503, 504, but no retry is attempted while `total=0`

### Performance Optimizations
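- **Connection Reuse**: One session is shared across all threads so connections can be pooled; the default `HTTPAdapter` pool holds 10 connections per host, so with more workers than that the surplus connections are opened and then discarded
- **HTTPAdapter**: Mounted on HTTPS for consistent adapter behavior
- **Thread Safety**: `requests` does not formally guarantee that a `Session` is thread-safe; sharing one is a common pattern when workers only send requests and never modify session state, as is the case here
- **Reduced Overhead**: Avoids connection establishment costs for requests that reuse a pooled connection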
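
If connection reuse across all workers matters, the adapter's pool can be sized to the worker count. The sketch below is one way to do that, not what this notebook currently runs; `setup_session_pooled` is a hypothetical name, while `pool_maxsize` and `max_retries` are existing `HTTPAdapter` parameters.

```python
import requests
from requests.adapters import HTTPAdapter

def setup_session_pooled(max_workers: int = 50) -> requests.Session:
    """Hypothetical variant of setup_session() whose pool matches the worker count."""
    session = requests.Session()
    adapter = HTTPAdapter(
        pool_maxsize=max_workers,  # keep up to one pooled connection per worker
        max_retries=0,             # retries stay at the application level
    )
    session.mount("https://", adapter)
    return session

session = setup_session_pooled(20)
adapter = session.get_adapter("https://api.usaspending.gov/")
```

The default `pool_maxsize` is 10, so this only changes behavior when more than 10 workers hit the same host at once.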

### Design Rationale
**Why Disable Retries**: Application-level retry logic provides:
- Better failure categorization and logging
- Period-scoped retry capabilities  
- Controlled retry timing and strategies
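
The application-level approach can be sketched without any network calls. The helper below is hypothetical (`retry_in_rounds` does not exist in this notebook); it re-runs only the rows that failed, which is the same idea the `failures_*.csv` retry mode applies across separate runs.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, str]

def retry_in_rounds(
    rows: List[Row],
    worker: Callable[[Row], Tuple[List[Dict], List[Row]]],
    max_rounds: int = 3,
) -> Tuple[List[Dict], List[Row]]:
    """Run `worker` over rows, then re-run only the rows that failed.

    `worker` returns (records, failures), the same shape fetch_award() uses.
    Stops early once a round produces no failures.
    """
    records: List[Dict] = []
    pending = rows
    for _ in range(max_rounds):
        failures: List[Row] = []
        for row in pending:
            ok, failed = worker(row)
            records.extend(ok)
            failures.extend(failed)
        pending = failures
        if not pending:
            break
    return records, pending


# Stand-in worker: account "B" fails on its first attempt only.
attempts: Dict[str, int] = {}

def flaky_worker(row: Row) -> Tuple[List[Dict], List[Row]]:
    code = row["federal_account_code"]
    attempts[code] = attempts.get(code, 0) + 1
    if code == "B" and attempts[code] == 1:
        return [], [row]
    return [{"federal_account_code": code, "award_id": 1}], []

records, remaining = retry_in_rounds(
    [{"federal_account_code": "A"}, {"federal_account_code": "B"}], flaky_worker
)
```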

In [5]:
# ✅ Fetch award data for a single row using quarters
def fetch_award(session, row):
    """
    Sends a POST request to the USAspending API to fetch awards data
    for a given federal account record using fiscal year and quarter.
    Returns successful records and failure logs.
    """
    time.sleep(0.3)

    fy = str(row['fy'])
    quarter = str(row['quarter'])
    function_code = str(row['budget_function_code']).zfill(3)
    subfunction_code = str(row['budget_subfunction_code']).zfill(3)
    federal_account_code = str(row['federal_account_code']).zfill(4)

    url = "https://api.usaspending.gov/api/v2/spending/"
    payload = {
        "type": "award",
        "filters": {
            "fy": fy,
            "quarter": quarter,
            "budget_function": function_code,
            "budget_subfunction": subfunction_code,
            "federal_account": federal_account_code
        }
    }

    all_records = []
    all_failures = []

    try:
        resp = session.post(url, json=payload, timeout=60)  # timeout in seconds (arbitrary choice) keeps a stalled call from blocking a worker
        resp.raise_for_status()
        data = resp.json()
        results = data.get("results", [])

        for item in results:
            # Using generalized fields; keys may vary by endpoint shape.
            all_records.append({
                "fy": fy,
                "quarter": quarter,
                "budget_function_code": function_code,
                "budget_subfunction_code": subfunction_code,
                "federal_account_code": federal_account_code,
                "award_id": item.get("id"),
                "award_name": item.get("name"),
                "award_code": item.get("code"),
                "obligated_amount": item.get("amount"),
                "total_amount": item.get("total")
            })

    except Exception as e:
        all_failures.append({
            "fy": fy,
            "quarter": quarter,
            "budget_function_code": function_code,
            "budget_subfunction_code": subfunction_code,
            "federal_account_code": federal_account_code,
            "reason": str(e)
        })

    return all_records, all_failures

## 🏆 Core Awards API Worker - Single Slice Processor

**Primary data collection function** for awards data at the **function → subfunction → federal account slice**:

### Input Parameters
**Expects row with complete hierarchical context:**
- `fy`: Fiscal year for temporal filtering
- `quarter`: Specific quarter (1-4) for time period constraint
- `budget_function_code`: 3-digit function code (zero-padded with `zfill(3)`)
- `budget_subfunction_code`: 3-digit subfunction code (zero-padded with `zfill(3)`)
- `federal_account_code`: 4-digit federal account code (zero-padded with `zfill(4)`)
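
One caveat with `str(value).zfill(n)`: if pandas has read a code column as float (which happens when the column contains missing values), `str(50.0)` is `"50.0"` and padding no longer produces the intended code. A small hypothetical helper, assuming codes are always whole numbers:

```python
def normalize_code(value, width: int) -> str:
    """Hypothetical helper: zero-pad a code, tolerating ints, floats and strings.

    A plain str(value).zfill(width) leaves 50.0 as "50.0"; this strips a
    trailing ".0" first so the padded result is "050".
    """
    text = str(value).strip()
    if text.endswith(".0"):
        text = text[:-2]
    return text.zfill(width)

padded = [normalize_code(v, 3) for v in (50, 50.0, "050", "7")]
naive = str(50.0).zfill(3)  # what the unguarded expression produces for a float
```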

### API Request Structure
**Endpoint**: `/api/v2/spending/` with `type="award"` filtering
```python
payload = {
  "type": "award",
  "filters": {
    "fy": fy, "quarter": quarter,
    "budget_function": function_code,
    "budget_subfunction": subfunction_code, 
    "federal_account": federal_account_code
  }
}
```

### Data Processing Strategy
**Output Records (success)**: Normalized award records with:
- **Partitioning Keys**: `(fy, quarter, budget_function_code, budget_subfunction_code, federal_account_code)`
- **Award Fields**: `award_id`, `award_name`, `award_code` from API response
- **Financial Data**: `obligated_amount` (from `item["amount"]`), `total_amount` (from `item["total"]`)

**Output Records (failure)**: Failure logs with same partitioning keys plus reason string

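**Rate Limiting**: `time.sleep(0.3)` runs at the start of every call, so it throttles each worker thread individually rather than the pool as a whole; with 50 workers the aggregate rate can still approach 50 / 0.3 ≈ 167 requests per second before network latency is counted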
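
Because the delay is per thread, the aggregate request rate scales with the worker count. A rough upper bound, where the latency figure is an assumed value for illustration only:

```python
def max_request_rate(workers: int, sleep_s: float, latency_s: float = 0.0) -> float:
    """Upper bound on requests per second when each worker sleeps, then calls.

    Each worker completes one request every (sleep_s + latency_s) seconds,
    so the pool as a whole peaks at workers / (sleep_s + latency_s).
    """
    return workers / (sleep_s + latency_s)

rate_50 = max_request_rate(50, 0.3)                 # default pool, latency ignored
rate_20 = max_request_rate(20, 0.3, latency_s=0.7)  # latency_s is an assumed figure
```

With the default 50 workers and latency ignored, the bound is about 167 requests per second; a pool-wide limiter would be needed to enforce a fixed global rate.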

In [6]:
# ✅ Step 1: Read and clean data
def read_and_filter_csv_awards(file_path):
    """
    Reads a federal accounts CSV and (optionally) filters rows.
    Mirrors recipients' loader for consistency.
    """
    df = pd.read_csv(file_path)
    # if "obligated_amount" in df.columns:
    #     df = df[df["obligated_amount"] > 0]
    return df

## 📂 Federal Accounts Input Reader

**CSV loading** for federal account foundation data (a thin wrapper around `pd.read_csv`, with an optional filter left commented out):

### Input Processing Strategy
**File Pattern**: Reads `federal_accounts_*.csv` files from previous pipeline stages
- **Source**: Generated by federal account collection systems (Stage 4)
- **Content**: Federal account records with fiscal year, quarter, and hierarchical codes
- **Consistency**: Mirrors recipient loader for uniform data handling

### Filtering Options
**Current Implementation**: Reads ALL federal account records without pre-filtering
```python
# Optional zero-amount filtering (currently commented out):
# if "obligated_amount" in df.columns:
#     df = df[df["obligated_amount"] > 0]
```

### Design Rationale
**Inclusive Approach**: Processes all federal accounts to maximize award data coverage
- **Complete Coverage**: Ensures no award data is missed due to zero-obligation accounts
- **API-Level Filtering**: Lets USASpending.gov API handle data availability
- **Downstream Analysis**: Enables comprehensive award pattern analysis

**Output**: Clean DataFrame ready for parallel awards collection processing

In [7]:
# ✅ Step 2: Fetch data from API using ThreadPoolExecutor (mirrors recipients)
def fetch_all_awards(df, max_workers=50):
    """
    Submits all API calls in parallel using a thread pool and returns combined results.
    """
    session = setup_session()
    results = []
    failures = []

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(fetch_award, session, row) for _, row in df.iterrows()]
        for future in as_completed(futures):
            res, fail = future.result()
            results.extend(res)
            failures.extend(fail)

    return pd.DataFrame(results), pd.DataFrame(failures)

## 🚀 Parallel Fetch Orchestrator - High-Performance Collection

**Massive parallelization** of awards data collection with **50 concurrent workers**:

### Parallel Processing Architecture
**ThreadPoolExecutor Strategy:**
- **Max Workers**: 50 concurrent threads (aggressive parallelization)
- **Task Distribution**: One `fetch_award()` call per federal account row
- **Session Sharing**: Single HTTP session across all worker threads
- **Result Aggregation**: Combines results and failures from all threads

### Execution Flow
1. **Session Creation**: Single `setup_session()` call shared across all threads
2. **Task Submission**: Each DataFrame row submitted as separate future
3. **Concurrent Processing**: Up to 50 simultaneous API calls to USASpending.gov
4. **Result Collection**: `as_completed()` gathers results as threads finish
5. **Data Separation**: Success records and failure logs collected separately

### Performance Benefits
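- ✅ **High Throughput**: Up to `max_workers` requests in flight at once; the actual speedup over sequential processing is bounded by API latency, the per-call `time.sleep(0.3)`, and any server-side throttling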
- ✅ **Network Latency Hiding**: Threads mask network wait times effectively
- ✅ **Connection Pooling**: Shared session reduces connection establishment overhead
- ✅ **Scalable Design**: Handles large federal account datasets efficiently

### Why Threads Work Well Here
- **I/O Bound**: Network requests benefit from concurrent execution
- **Small HTTP Calls**: Many short API requests ideal for thread-based parallelism
- **Shared Resources**: Session and memory efficiently shared across threads

**Output**: Two DataFrames - successful award records and detailed failure logs
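
The fan-out pattern can be tried without touching the API by swapping in a stand-in worker. In this sketch `fan_out` and `demo_worker` are hypothetical names; the worker returns the same `(records, failures)` pair that `fetch_award()` does.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fan_out(rows, worker, max_workers=4):
    """Submit one task per row and merge (records, failures) as tasks finish."""
    records, failures = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(worker, row) for row in rows]
        for future in as_completed(futures):
            ok, failed = future.result()
            records.extend(ok)
            failures.extend(failed)
    return records, failures

# Stand-in worker: odd codes "fail", even codes return one record.
def demo_worker(row):
    if row["code"] % 2:
        return [], [{"code": row["code"], "reason": "simulated failure"}]
    return [{"code": row["code"], "award_id": row["code"] * 10}], []

records, failures = fan_out([{"code": n} for n in range(6)], demo_worker)
```

Results arrive in completion order, not submission order, so anything that depends on row order should sort afterwards.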

In [8]:
# ✅ Save results & failures with empty-file safety + dedupe (mirrors recipients)
def save_award_results(results_df, failures_df, file_path, output_base_folder):
    """
    Saves awards results and failure logs.

    - Appends only unique award records (based on key columns).
    - Handles empty/corrupted existing result files safely.
    - Reports how many new records added vs duplicates skipped.
    - Overwrites the failures file each time.
    """

    base_filename = os.path.basename(file_path).replace(".csv", "")
    for prefix in ["federal_accounts_", "failures_"]:
        if base_filename.startswith(prefix):
            base_filename = base_filename.replace(prefix, "")
            break

    year_quarter = base_filename
    os.makedirs(output_base_folder, exist_ok=True)
    results_path = os.path.join(output_base_folder, f"awards_{year_quarter}.csv")
    failures_path = os.path.join(output_base_folder, f"failures_{year_quarter}.csv")

    unique_keys = [
        "fy", "quarter", "budget_function_code",
        "budget_subfunction_code", "federal_account_code", "award_id"
    ]

    # Deduplicate incoming new results
    if not results_df.empty:
        results_df.drop_duplicates(subset=unique_keys, inplace=True)

    # ✅ Try to read existing results safely
    existing_results = None
    if os.path.exists(results_path):
        try:
            existing_results = pd.read_csv(results_path)
            if existing_results.empty or existing_results.columns.size == 0:
                print(f"⚠️ Ignoring empty/corrupted results file: {results_path}")
                existing_results = None
        except pd.errors.EmptyDataError:
            print(f"⚠️ Ignoring EmptyDataError on: {results_path}")
            existing_results = None
        except Exception as e:
            print(f"⚠️ Skipped reading existing results ({e}): {results_path}")
            existing_results = None

    # ✅ Merge results
    if existing_results is not None:
        before_count = len(existing_results)
        combined = pd.concat([existing_results, results_df], ignore_index=True)
        combined.drop_duplicates(subset=unique_keys, inplace=True)
        added_count = len(combined) - before_count
        duplicate_count = max(len(results_df) - added_count, 0)
    else:
        combined = results_df.copy()
        added_count = len(combined)
        duplicate_count = 0

    combined.to_csv(results_path, index=False)
    print(f"✅ Awards saved: {added_count} new rows → {results_path} (Duplicates skipped: {duplicate_count})")

    # Always overwrite failures
    failures_df.to_csv(failures_path, index=False)
    print(f"⚠️ Failures overwritten: {len(failures_df)} → {failures_path}")



## 💾 Result Writer - Idempotent Award Storage

**Advanced save logic** with **composite key deduplication** and **safe file handling**:

### File Naming Strategy  
**Derives period tokens from input filenames:**
- **Input**: `federal_accounts_FY2024_Q1.csv` → **Token**: `FY2024_Q1`
- **Prefix Stripping**: Removes `federal_accounts_` or `failures_` prefixes automatically
- **Output Files**: 
  - `awards_FY2024_Q1.csv` (award records)
  - `failures_FY2024_Q1.csv` (failure logs)
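
The naming convention can also be parsed explicitly. The sketch below assumes every input follows the `<prefix>_FY<year>_Q<quarter>.csv` pattern shown above; `parse_period` is a hypothetical helper, not part of the notebook.

```python
import os
import re

# Matches names such as federal_accounts_FY2024_Q1.csv or failures_FY2024_Q1.csv.
PERIOD_RE = re.compile(r"^(?:federal_accounts|failures)_(FY(\d{4})_Q([1-4]))\.csv$")

def parse_period(path):
    """Return (token, fiscal_year, quarter), or None if the name does not match."""
    match = PERIOD_RE.match(os.path.basename(path))
    if match is None:
        return None
    token, fy, quarter = match.groups()
    return token, int(fy), int(quarter)

parsed = parse_period("/data/federal_accounts_FY2024_Q1.csv")
retry_parsed = parse_period("failures_FY2019_Q4.csv")
unmatched = parse_period("awards_FY2024_Q1.csv")
```

Returning `None` for unexpected names makes it easy to skip files that do not belong to the pipeline instead of deriving a wrong token from them.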

### Deduplication System  
**Composite Unique Key**: `[fy, quarter, budget_function_code, budget_subfunction_code, federal_account_code, award_id]`
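- **Idempotent Operations**: Re-runs are intended not to create duplicate records. This relies on key columns comparing equal across runs: `pd.read_csv` infers dtypes, so a zero-padded code such as `050` is read back as the integer `50` and may not match the string key in freshly fetched results unless both sides are cast to a common type
- **Merge Logic**: Combines new results with existing files using key-based deduplication
- **Conflict Resolution**: Existing records win. `drop_duplicates` keeps the first occurrence by default and existing rows are concatenated first, so a new record with the same composite key is dropped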
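
Which record survives a key collision depends on the `keep` argument of `drop_duplicates`. The sketch below uses a shortened key and made-up amounts purely for illustration:

```python
import pandas as pd

keys = ["fy", "quarter", "federal_account_code", "award_id"]  # shortened key for the demo

existing = pd.DataFrame(
    [{"fy": "2024", "quarter": "1", "federal_account_code": "0100",
      "award_id": 1, "obligated_amount": 100.0}]
)
incoming = pd.DataFrame(
    [{"fy": "2024", "quarter": "1", "federal_account_code": "0100",
      "award_id": 1, "obligated_amount": 250.0},   # same key, different amount
     {"fy": "2024", "quarter": "1", "federal_account_code": "0100",
      "award_id": 2, "obligated_amount": 75.0}]    # new key
)

combined = pd.concat([existing, incoming], ignore_index=True)

keep_existing = combined.drop_duplicates(subset=keys)               # default keep="first"
keep_incoming = combined.drop_duplicates(subset=keys, keep="last")  # newest row wins
```

With existing rows concatenated first, the default keeps the old amount for `award_id` 1; `keep="last"` would be the choice if refreshed values should replace stored ones.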

### Safe File Operations
**Defensive Handling**: 
1. **Existing File Reading**: Safely loads existing `awards_{FY_Q}.csv` files
2. **Corruption Protection**: Handles empty/corrupted files gracefully
3. **Smart Merge**: Combines new and existing data with deduplication applied
4. **Full Rewrite**: Writes the combined data back with a single `to_csv` call; this is not an atomic operation, so an interrupted run can leave a partial file
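
`DataFrame.to_csv` writes directly to the target path, so a run interrupted mid-write can leave a truncated file. If that matters, one common approach is to write to a temporary file and swap it into place. The helper below is a hypothetical sketch using only the standard library; atomicity of `os.replace` holds on a local filesystem and may not carry over to a mounted Google Drive.

```python
import os
import tempfile

def atomic_write_text(path: str, text: str) -> None:
    """Write text to a temp file in the target directory, then swap it into place.

    os.replace() is atomic when source and destination are on the same
    filesystem, so readers see either the old file or the new one.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8", newline="") as handle:
            handle.write(text)
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything went wrong before the swap.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

# Usage with a throwaway directory; with pandas the text would be df.to_csv(index=False).
with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "awards_FY2024_Q1.csv")
    atomic_write_text(target, "fy,quarter\n2024,1\n")
    with open(target, encoding="utf-8", newline="") as handle:
        written = handle.read()
    leftovers = [name for name in os.listdir(workdir) if name.endswith(".tmp")]
```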

### Failure File Management
**Always Overwrites**: `failures_{FY_Q}.csv` files completely replaced each run
- **Current State Focus**: Only shows failures from most recent execution
- **Clean Slate**: Each run starts with fresh failure tracking per period

**Reporting**: Detailed logging of "new rows added" vs "duplicates skipped"

In [9]:

# ✅ Full initial processing
def run_initial_federal_account_processing_awards(input_folder, output_folder, max_workers=50, start_fy=None):
    """
    Processes all federal_accounts_*.csv files for AWARDS.
    - Always writes awards_{FY_Q}.csv (even if empty)
    - Writes failures_{FY_Q}.csv ONLY if there are failures (deletes old one if it exists)
    - Optional: start_fy to filter files by fiscal year
    """
    for file in os.listdir(input_folder):
        if not (file.endswith(".csv") and file.startswith("federal_accounts")):
            continue

        # Optional FY filter from filename: federal_accounts_FY2024_Q1.csv
        if start_fy is not None:
            try:
                fy = int(file.split("_FY")[1].split("_Q")[0])
                if fy < start_fy:
                    continue
            except Exception:
                print(f"⚠️ Skipped (cannot parse FY): {file}")
                continue

        file_path = os.path.join(input_folder, file)
        print(f"🚀 [AWARDS] Starting initial load for: {file_path}")

        df = read_and_filter_csv_awards(file_path)
        if df.empty:
            print(f"⚠️ Skipped (no data): {file_path}")
            continue

        results_df, failures_df = fetch_all_awards(df, max_workers=max_workers)

        base_name = os.path.basename(file_path).replace(".csv", "")
        year_quarter = base_name.replace("federal_accounts_", "")
        results_path = os.path.join(output_folder, f"awards_{year_quarter}.csv")
        failures_path = os.path.join(output_folder, f"failures_{year_quarter}.csv")

        os.makedirs(output_folder, exist_ok=True)

        # ✅ Always save results
        results_df.to_csv(results_path, index=False)
        print(f"✅ [AWARDS] Saved: {len(results_df)} → {results_path}")

        # ⚠️ Only save failures if any; delete stale failures if none
        if failures_df is not None and not failures_df.empty:
            failures_df.to_csv(failures_path, index=False)
            print(f"⚠️ [AWARDS] Failures: {len(failures_df)} → {failures_path}")
        else:
            if os.path.exists(failures_path):
                os.remove(failures_path)
                print(f"🗑️ [AWARDS] Removed stale failures file: {failures_path}")
            print(f"🎉 [AWARDS] No failures for {year_quarter}")




## 🏃‍♂️ Initial Folder Runner - Complete Awards Processing

**Primary "fan-out" batch job** that transforms **federal accounts → awards** across all periods:

### Folder Processing Strategy
**File Discovery**: Scans input folder for `federal_accounts_*.csv` files  
- **Pattern Matching**: Only processes files with correct prefix and CSV extension
- **FY Filtering**: Optional `start_fy` parameter to skip older fiscal years
- **Comprehensive Coverage**: Processes every discovered federal account file

### Per-File Processing Flow
1. **File Validation**: Confirms CSV format and federal_accounts prefix
2. **FY Parsing**: Extracts fiscal year from filename for optional filtering (e.g., `federal_accounts_FY2024_Q1.csv`)
3. **Data Loading**: Uses `read_and_filter_csv_awards()` to read the CSV
4. **Parallel Collection**: Runs `fetch_all_awards()` with `max_workers` threads (default 50)
5. **Period-Scoped Storage**: Writes `awards_{FY_Q}.csv` directly, overwriting any existing file for that period (this mode does not merge)

### Output File Management
**Always Created**: `awards_{FY_Q}.csv` files
- **Consistent Output**: Every input file produces corresponding awards file (even if empty)
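- **Empty Handling**: When no awards are found, `pd.DataFrame([])` has no columns, so the resulting CSV contains no header row; later reads of such a file can raise `EmptyDataError`, which the save and retry functions handle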
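
A DataFrame built from an empty list has no columns, so the CSV written from it carries no header row. If headers are wanted in every output file, one option is to pass the column list explicitly. This is a sketch; `AWARD_COLUMNS` is a hypothetical constant that mirrors the record dict built in `fetch_award()`.

```python
import pandas as pd

# Column order mirrors the record dict built in fetch_award().
AWARD_COLUMNS = [
    "fy", "quarter", "budget_function_code", "budget_subfunction_code",
    "federal_account_code", "award_id", "award_name", "award_code",
    "obligated_amount", "total_amount",
]

no_records = []  # what fetch_all_awards() accumulates when nothing is returned

without_columns = pd.DataFrame(no_records)                      # zero columns
with_columns = pd.DataFrame(no_records, columns=AWARD_COLUMNS)  # zero rows, full schema

header_line = with_columns.to_csv(index=False).splitlines()[0]
```

With explicit columns, an empty period still produces a readable CSV whose header matches the non-empty files.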

**Conditionally Created**: `failures_{FY_Q}.csv` files
- **Only When Needed**: Created only if API failures occur during collection
- **Stale Cleanup**: Deletes existing failure files if no new failures occur
- **Period Isolation**: Each fiscal year/quarter has independent failure tracking

### Benefits
- ✅ **Complete Coverage**: Processes entire federal accounts dataset
- ✅ **Period Organization**: Separate files enable targeted analysis and retries
- ✅ **Scalable Design**: Handles large datasets with parallel processing

In [10]:
# ✅ Retry processing from failures folder (mirrors recipients retry)
def run_failure_retry_from_folder_awards(failure_folder, output_folder, max_workers=50):
    """
    Retries all failures_*.csv in a folder for AWARDS.
    - Deletes & skips empty/corrupt failure files BEFORE reading
    - Appends new successful results to awards_{FY_Q}.csv (via save_award_results)
    - Overwrites failures_{FY_Q}.csv with new failures
    - If new failures are empty, deletes failures_{FY_Q}.csv
    """
    for file in os.listdir(failure_folder):
        if not (file.endswith(".csv") and file.startswith("failures_")):
            continue

        file_path = os.path.join(failure_folder, file)
        print(f"🔁 [AWARDS] Retrying failures from: {file_path}")

        # ⛔ Delete 0-byte files up front
        if os.path.getsize(file_path) == 0:
            os.remove(file_path)
            print(f"🗑️ [AWARDS] Deleted empty failure file: {file_path}")
            continue

        # Try reading safely
        try:
            df = pd.read_csv(file_path)
        except pd.errors.EmptyDataError:
            os.remove(file_path)
            print(f"🗑️ [AWARDS] Deleted corrupt failure file (EmptyDataError): {file_path}")
            continue
        except Exception as e:
            print(f"⚠️ [AWARDS] Skipped (read error: {e}): {file_path}")
            continue

        # Delete files that load but have no usable rows/columns
        if df.empty or df.columns.size == 0:
            os.remove(file_path)
            print(f"🗑️ [AWARDS] Deleted invalid failure file (no rows/cols): {file_path}")
            continue

        # ✅ Retry valid failures
        results_df, failures_df = fetch_all_awards(df, max_workers=max_workers)
        save_award_results(results_df, failures_df, file_path, output_folder)

        # If the fresh failures are empty, remove the just-written failures file
        fyq = file.replace("failures_", "").replace(".csv", "")
        failures_out_path = os.path.join(output_folder, f"failures_{fyq}.csv")
        if failures_df is None or failures_df.empty:
            if os.path.exists(failures_out_path):
                os.remove(failures_out_path)
                print(f"🎉 [AWARDS] No remaining failures → deleted: {failures_out_path}")




## 🔄 Retry Folder Runner - Failure Recovery System

**Targeted retry mechanism** for failed awards collection with **defensive file cleanup**:

### Retry Processing Strategy
**File Discovery**: Scans folder for `failures_*.csv` files from previous runs
- **Selective Processing**: Only retries federal accounts that failed initially
- **Period-Scoped**: Each fiscal year/quarter has separate failure file
- **Tight Recovery Loop**: Focused re-processing without full period reload

### Defensive File Management
**Pre-Processing Cleanup** (prevents stuck pipelines):
1. **Zero-Byte Detection**: Automatically deletes empty failure files before processing
2. **EmptyDataError Protection**: Safely handles corrupted CSV files with automatic cleanup
3. **Invalid File Removal**: Deletes files with no usable rows/columns
4. **Read Error Handling**: Gracefully skips unreadable files with detailed logging
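
The same checks can be expressed as a small classifier that runs before any retry is attempted. This is a standard-library sketch with a hypothetical helper name; the notebook itself performs these checks inline with `os.path.getsize` and `pd.read_csv`.

```python
import csv
import os
import tempfile

def classify_failure_file(path: str) -> str:
    """Hypothetical helper: return "empty", "no_rows", or "ok" for a failures CSV."""
    if os.path.getsize(path) == 0:
        return "empty"
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, None)
        first_row = next(reader, None)
    if not header or first_row is None:
        return "no_rows"
    return "ok"

# Exercise the three cases with throwaway files.
with tempfile.TemporaryDirectory() as workdir:
    samples = {
        "zero_byte.csv": "",
        "header_only.csv": "fy,quarter,reason\n",
        "valid.csv": "fy,quarter,reason\n2019,2,timeout\n",
    }
    verdicts = {}
    for name, content in samples.items():
        path = os.path.join(workdir, name)
        with open(path, "w", encoding="utf-8", newline="") as handle:
            handle.write(content)
        verdicts[name] = classify_failure_file(path)
```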

### Retry Execution Flow
1. **File Validation**: Checks file size, readability, and data integrity
2. **Failed Record Loading**: Reads federal account records that previously failed
3. **Parallel Retry**: Uses ThreadPoolExecutor to re-attempt awards API calls
4. **Incremental Append**: Uses `save_award_results()` for idempotent merging with existing data
5. **Success Integration**: Merges retry successes with existing awards files

### Post-Retry Cleanup
**Smart Failure File Management:**
- **New Failures**: Overwrites failure file if retry attempts still fail
- **Complete Success**: Deletes failure file if all retries succeed  
- **Clean State**: Ensures only current failures remain tracked

**Benefits**: Maximizes data completeness through intelligent recovery without reprocessing successful collections

In [11]:
# ✅ Single-file processing
def process_single_federal_account_file_awards(file_path, output_folder, max_workers=50):
    """
    Processes one federal_accounts_*.csv file and saves awards results and failures.
    Results are merged into any existing awards CSV (deduplicated by key);
    the failures CSV is overwritten.
    """
    print(f"📄 [AWARDS] Processing single file: {file_path}")
    df = read_and_filter_csv_awards(file_path)
    if df.empty:
        print("⚠️ [AWARDS] Skipped: No rows to process")
        return
    results_df, failures_df = fetch_all_awards(df, max_workers=max_workers)

    # Reuse saving logic
    save_award_results(results_df, failures_df, file_path, output_folder)




## 📄 Single-File Processor - Targeted Awards Collection

**Streamlined processing** for individual federal account files:

### Use Cases & Applications
**Development & Testing**: Perfect for isolated testing of awards collection pipeline
- **Individual File Focus**: Process specific fiscal year/quarter combinations
- **Debug & Analysis**: Test awards collection for particular periods
- **Selective Processing**: Handle specific federal account files without full folder processing

### Processing Flow
1. **File Loading**: Uses `read_and_filter_csv_awards()` for consistent CSV handling
2. **Data Validation**: Checks for empty DataFrame after loading (skips if no data)
3. **Parallel Collection**: Launches ThreadPoolExecutor for awards data collection
4. **Result Storage**: Uses `save_award_results()` for consistent storage and deduplication

### Output Behavior
**Reuses Standard Logic**: Leverages existing `save_award_results()` function
- **Idempotent Append**: Merges with existing awards files if present
- **Composite Key Deduplication**: Applies same deduplication logic as folder processing
- **Failure Tracking**: Rewrites `failures_{FY_Q}.csv` on every run, even when there are no failures (the file is then empty and is not deleted in this mode)

### Integration Benefits
- ✅ **Consistent Output**: Produces identical format as folder processing functions
- ✅ **Reusable Logic**: Leverages same core functions for uniform behavior
- ✅ **Development Friendly**: Enables quick testing and validation workflows

In [12]:
# ✅ Unified controller
def main_controller(
    mode,
    input_folder=None,
    output_folder=None,
    single_file_path=None,
    max_workers=50
):
    """
    Unified entry point for AWARDS:
    - Run full initial load: mode='initial'
    - Retry from failures: mode='retry'
    - Run one specific file: mode='single'
    """
    # Explicit check instead of assert, which is skipped when Python runs with -O
    if mode not in {"initial", "retry", "single"}:
        raise ValueError("❌ Invalid mode. Choose: 'initial', 'retry', or 'single'")

    if mode == "initial":
        if not input_folder or not output_folder:
            raise ValueError("📂 Please provide both input_folder and output_folder for initial mode (awards).")
        run_initial_federal_account_processing_awards(input_folder, output_folder, max_workers=max_workers)

    elif mode == "retry":
        if not input_folder or not output_folder:
            raise ValueError("📂 Please provide both input_folder and output_folder for retry mode (awards).")
        run_failure_retry_from_folder_awards(input_folder, output_folder, max_workers=max_workers)

    elif mode == "single":
        if not single_file_path or not output_folder:
            raise ValueError("📄 Please provide both single_file_path and output_folder for single mode (awards).")
        process_single_federal_account_file_awards(single_file_path, output_folder, max_workers=max_workers)

    print("✅ [AWARDS] Done.")


## 🎮 Unified Controller - Complete Awards Pipeline Orchestration

**Single entry point** for all awards collection operations with **three distinct processing modes**:

### Controller Architecture  
**Centralized Management**: Unified interface for all awards processing scenarios
- **Mode-Based Routing**: Intelligent dispatch based on operational requirements
- **Parameter Validation**: Ensures required parameters for each processing mode
- **Consistent Interface**: Standardized function signature across all operations

### Processing Modes

#### 1. **"initial"** Mode - Complete Awards Harvest
```python
main_controller("initial", 
    input_folder="/USASpendingResults", 
    output_folder="/awards")
```
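- **Purpose**: Processes ALL `federal_accounts_*.csv` files in input folder (the `start_fy` filter is not exposed by the controller; call `run_initial_federal_account_processing_awards()` directly to use it)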
- **Scope**: Complete federal accounts → awards transformation
- **Output**: Comprehensive awards dataset organized by fiscal year/quarter

#### 2. **"retry"** Mode - Failure Recovery Processing  
```python
main_controller("retry", 
    input_folder="/awards", 
    output_folder="/awards")
```
- **Purpose**: Retries ALL `failures_*.csv` files in target folder
- **Strategy**: Targeted re-processing of previously failed federal accounts
- **Integration**: Incrementally appends retry successes to existing awards files

#### 3. **"single"** Mode - Individual File Processing
```python
main_controller("single", 
    single_file_path="/path/federal_accounts_FY2024_Q1.csv", 
    output_folder="/awards")
```
- **Purpose**: Process one specific federal account file  
- **Applications**: Development, testing, selective processing workflows

In [17]:
# Initial run (creates awards_FYxxxx_Qx.csv; failures only if any)
main_controller(
    mode="initial",
    input_folder="/content/drive/MyDrive/USASpendingResults",
    output_folder="/content/drive/MyDrive/USASpendingResults/awards",
    max_workers=20
)

🚀 [AWARDS] Starting initial load for: /content/drive/MyDrive/USASpendingResults/federal_accounts_FY2017_Q2.csv
✅ [AWARDS] Saved: 117025 → /content/drive/MyDrive/USASpendingResults/awards/awards_FY2017_Q2.csv
⚠️ [AWARDS] Failures: 326 → /content/drive/MyDrive/USASpendingResults/awards/failures_FY2017_Q2.csv
🚀 [AWARDS] Starting initial load for: /content/drive/MyDrive/USASpendingResults/federal_accounts_FY2017_Q3.csv
✅ [AWARDS] Saved: 52239 → /content/drive/MyDrive/USASpendingResults/awards/awards_FY2017_Q3.csv
⚠️ [AWARDS] Failures: 1495 → /content/drive/MyDrive/USASpendingResults/awards/failures_FY2017_Q3.csv
🚀 [AWARDS] Starting initial load for: /content/drive/MyDrive/USASpendingResults/federal_accounts_FY2017_Q4.csv
✅ [AWARDS] Saved: 0 → /content/drive/MyDrive/USASpendingResults/awards/awards_FY2017_Q4.csv
⚠️ [AWARDS] Failures: 1967 → /content/drive/MyDrive/USASpendingResults/awards/failures_FY2017_Q4.csv
🚀 [AWARDS] Starting initial load for: /content/drive/MyDrive/USASpendingResults/

## 🚀 Execution Example - Initial Awards Collection

**Primary awards data collection run** processing all federal account files:

### Configuration Details
- **Mode**: `"initial"` - Complete folder processing for awards collection
- **Input Folder**: `/content/drive/MyDrive/USASpendingResults`
  - Contains all `federal_accounts_*.csv` files from previous pipeline stages (Stage 4)
  - Source data for federal accounts → awards transformation
- **Output Folder**: `/content/drive/MyDrive/USASpendingResults/awards`
  - Will contain `awards_{FY_Q}.csv` and `failures_{FY_Q}.csv` files
- **Workers**: 20 concurrent threads (balanced for Colab environment)

### Expected Process Flow
1. **File Discovery**: Scans input folder for all `federal_accounts_*.csv` files
2. **Parallel Processing**: A 20-thread pool processes the rows of each federal account file, one file at a time
3. **API Collection**: Calls USASpending.gov API with `type="award"` filters
4. **Period Organization**: Creates `awards_FYxxxx_Qx.csv` files per fiscal year/quarter
5. **Failure Tracking**: Logs API failures in `failures_FYxxxx_Qx.csv` (only if failures occur)

### Typical Results
- **Files Created**: Multiple `awards_FY{YYYY}_Q{Q}.csv` files with award data
- **Failure Files**: `failures_FY{YYYY}_Q{Q}.csv` files (only when needed)
- **Data Scale**: Award records aggregated by federal account slice per period

In [13]:

# Retry from failures
main_controller(
    mode="retry",
    input_folder="/content/drive/MyDrive/USASpendingResults/awards",  # folder that contains failures_*.csv
    output_folder="/content/drive/MyDrive/USASpendingResults/awards",
    max_workers=20
)


🔁 [AWARDS] Retrying failures from: /content/drive/MyDrive/USASpendingResults/awards/failures_FY2022_Q3.csv
✅ Awards saved: 0 new rows → /content/drive/MyDrive/USASpendingResults/awards/awards_FY2022_Q3.csv (Duplicates skipped: 0)
⚠️ Failures overwritten: 2 → /content/drive/MyDrive/USASpendingResults/awards/failures_FY2022_Q3.csv
🔁 [AWARDS] Retrying failures from: /content/drive/MyDrive/USASpendingResults/awards/failures_FY2019_Q2.csv
✅ Awards saved: 0 new rows → /content/drive/MyDrive/USASpendingResults/awards/awards_FY2019_Q2.csv (Duplicates skipped: 0)
⚠️ Failures overwritten: 1 → /content/drive/MyDrive/USASpendingResults/awards/failures_FY2019_Q2.csv
🔁 [AWARDS] Retrying failures from: /content/drive/MyDrive/USASpendingResults/awards/failures_FY2019_Q4.csv
✅ Awards saved: 0 new rows → /content/drive/MyDrive/USASpendingResults/awards/awards_FY2019_Q4.csv (Duplicates skipped: 0)
⚠️ Failures overwritten: 1 → /content/drive/MyDrive/USASpendingResults/awards/failures_FY2019_Q4.csv
🔁 [AWAR

## 🔄 Execution Example - Awards Retry Processing

**Targeted retry processing** for previously failed awards collection attempts:

### Configuration Details
- **Mode**: `"retry"` - Failure recovery processing for awards
- **Input Folder**: `/content/drive/MyDrive/USASpendingResults/awards`
  - Same as output folder - scans for `failures_*.csv` files created during initial run
- **Output Folder**: `/content/drive/MyDrive/USASpendingResults/awards`
  - Updates existing awards files with retry successes
- **Workers**: 20 concurrent threads (conservative retry approach)

### Retry Process Flow
1. **Failure Discovery**: Scans awards folder for `failures_*.csv` files
2. **File Validation**: Automatically deletes empty/corrupted failure files
3. **Selective Retry**: Only processes federal accounts that failed initially
4. **Incremental Append**: Uses `save_award_results()` to merge successes with existing awards data
5. **Cleanup Logic**: Deletes failure files if no remaining failures after retry

### Expected Outcomes
- **Success Recovery**: Previously failed federal accounts may now succeed due to improved API conditions
- **Data Completeness**: Enhanced awards data coverage through failure recovery
- **Clean State**: Remaining failures represent legitimate data gaps or persistent issues
- **File Management**: Automatic cleanup of resolved failure files

**Best Practice**: Run retry after initial collection to maximize awards data completeness and identify persistent data gaps

In [None]:
# Single file
main_controller(
    mode="single",
    single_file_path="/content/drive/MyDrive/USASpendingResults/federal_accounts_FY2017_Q4.csv",
    output_folder="/content/drive/MyDrive/USASpendingResults/awards",
    max_workers=20
)