#  Federal Funding Recipients Data Collection System

##  Overview
**Advanced recipient data collection** from USASpending.gov API using federal account foundation data.

This notebook implements a sophisticated **4-stage data pipeline**:
1. **Federal Accounts → Recipients**: Expands each federal account into individual recipient records
2. **Parallel Processing**: `ThreadPoolExecutor` with a default of 50 concurrent workers (the example runs below use 20)
3. **Idempotent Storage**: Deduplication and incremental append capabilities
4. **Intelligent Retry System**: Period-scoped failure recovery with automatic cleanup

**Input**: `federal_accounts_*.csv` files (from previous pipeline stages)  
**Output**: `recipients_FY{YYYY}_Q{Q}.csv` + `failures_FY{YYYY}_Q{Q}.csv` files

**Key Features:**
-  **Type-specific API calls**: Uses `type="recipient"` filtering
-  **Composite key deduplication**: Prevents duplicate recipient records  
-  **Period-scoped organization**: Separate files per fiscal year/quarter
-  **Defensive programming**: Robust error handling and file corruption protection

In [None]:
#  Imports
import pandas as pd
import requests
import time
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry  # Retry is defined in urllib3, not requests
import logging
import urllib3

#  Suppress urllib3 warnings
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
logging.getLogger("urllib3.connectionpool").setLevel(logging.ERROR)


##  Setup & Dependencies

**Core libraries** for recipient data collection system:
- **`pandas`**: DataFrame operations and CSV processing
- **`requests`**: HTTP API calls with session management
- **`ThreadPoolExecutor`**: Parallel processing (50 concurrent workers)
- **`HTTPAdapter/Retry`**: Custom session configuration
- **`logging/urllib3`**: Noise suppression for cleaner output

**Key Setup Actions:**
-  **Suppresses urllib3 warnings**: Eliminates noisy HTTP connection logs
-  **Configures logging levels**: Reduces verbose connection pool messages
-  **Imports threading utilities**: Enables high-performance parallel API calls

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


##  Google Drive Integration

**Purpose**: Mount Google Drive for Colab environment data persistence

**Functionality:**
- **Colab Environment**: Mounts Google Drive at `/content/drive` (files appear under `/content/drive/MyDrive/`)
- **Data Access**: Enables access to federal account CSV files
- **Result Storage**: Saves recipient data to Google Drive for persistence
- **Cross-Session**: Maintains data across Colab session restarts

In [None]:
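# Setup session (automatic retries disabled)
def setup_session():
    """
    Creates a requests session with an HTTPAdapter mounted for HTTPS.
    Automatic retries are deliberately disabled (total=0), so backoff_factor and
    status_forcelist have no effect; failed calls are logged by fetch_recipient
    and re-attempted later through the 'retry' mode.
    """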
    session = requests.Session()
    retries = Retry(
        total=0,
        backoff_factor=1.0,
        status_forcelist=[500, 502, 503, 504],
        allowed_methods=["POST"]
    )
    adapter = HTTPAdapter(max_retries=retries)
    session.mount('https://', adapter)
    return session


##  HTTP Session Configuration

**Creates robust HTTP session** with **deliberately disabled retries** (total=0):

### Session Strategy
**Important**: Despite the docstring mentioning "retry logic," this session is configured with `total=0` retries
- **No Automatic Retries**: Relies on application-level retry mechanisms instead
- **Manual Control**: Enables precise control over retry behavior at the application layer
- **Status Force List**: Prepared to handle 500, 502, 503, 504 errors (but won't retry automatically)

### Configuration Details
- **HTTPAdapter**: Configured for HTTPS requests
- **Backoff Factor**: 1.0 second delays (unused due to total=0)
- **Allowed Methods**: POST requests for API calls
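- **Connection Pooling**: Reuses connections for better performance; note that `HTTPAdapter` defaults to `pool_maxsize=10`, so with more than 10 worker threads the surplus connections are opened and discarded rather than reused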

**Design Choice**: Application handles retries explicitly rather than relying on automatic session-level retries
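
In this notebook the explicit retry mechanism is the file-based `retry` mode described later. As a point of comparison, a per-call retry with exponential backoff could be sketched as below. The helper name `call_with_retries` and its parameters are hypothetical and do not appear in the notebook; the `sleep` argument exists only so the delay can be replaced in tests.

```python
import time


def call_with_retries(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn() until it succeeds or `attempts` calls have failed.

    Waits base_delay * 2**i seconds after the i-th failure (i = 0, 1, ...).
    Re-raises the last exception if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise  # out of attempts: surface the final error
            sleep(base_delay * 2 ** i)
```

A call such as `call_with_retries(lambda: session.post(url, json=payload, timeout=60))` would wrap a single request; whether in-call retries are preferable to the file-based retry pass depends on how transient the API errors are.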

In [None]:
#  Fetch recipient data for a single row using quarters
def fetch_recipient(session, row):
    """
    Sends a POST request to the USAspending API to fetch recipient data
    for a given federal account record using fiscal year and quarter.
    Returns successful records and failure logs.
    """
    time.sleep(0.3)

    fy = str(row['fy'])
    quarter = str(row['quarter'])
    function_code = str(row['budget_function_code']).zfill(3)
    subfunction_code = str(row['budget_subfunction_code']).zfill(3)
    federal_account_code = str(row['federal_account_code']).zfill(4)

    url = "https://api.usaspending.gov/api/v2/spending/"
    payload = {
        "type": "recipient",
        "filters": {
            "fy": fy,
            "quarter": quarter,
            "budget_function": function_code,
            "budget_subfunction": subfunction_code,
            "federal_account": federal_account_code
        }
    }

    all_records = []
    all_failures = []

    try:
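        # A timeout (60 s here, an arbitrary choice) keeps a stalled connection
        # from blocking a worker thread indefinitely
        resp = session.post(url, json=payload, timeout=60)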
        resp.raise_for_status()
        data = resp.json()
        results = data.get("results", [])

        for item in results:
            all_records.append({
                "fy": fy,
                "quarter": quarter,
                "budget_function_code": function_code,
                "budget_subfunction_code": subfunction_code,
                "federal_account_code": federal_account_code,
                "recipient_id": item.get("id"),
                "recipient_name": item.get("name"),
                "recipient_code": item.get("code"),
                "obligated_amount": item.get("amount"),
                "total_amount": item.get("total")
            })

    except Exception as e:
        all_failures.append({
            "fy": fy,
            "quarter": quarter,
            "budget_function_code": function_code,
            "budget_subfunction_code": subfunction_code,
            "federal_account_code": federal_account_code,
            "reason": str(e)
        })

    return all_records, all_failures


##  Core Recipient Fetcher - API Collection Engine

**Primary data collection function** for recipient data at the **function → subfunction → federal account slice**:

### Input Parameters
**Expects row with complete hierarchical context:**
- `fy`: Fiscal year for data collection
- `quarter`: Specific quarter (1-4) for temporal filtering  
- `budget_function_code`: 3-digit function code (zero-padded)
- `budget_subfunction_code`: 3-digit subfunction code (zero-padded)
- `federal_account_code`: 4-digit federal account code (zero-padded)

### API Request Structure
**Endpoint**: `/api/v2/spending/` with `type="recipient"` filtering
**Filters Applied:**
- **Temporal**: Fiscal year + quarter combination
- **Hierarchical**: Budget function + subfunction + federal account
- **Type-Specific**: Recipients only (not awards or other entities)

### Response Processing
**Builds structured records with composite keys:**
- **Primary Key**: `(fy, quarter, budget_function_code, budget_subfunction_code, federal_account_code, recipient_id)`
- **Recipient Data**: Name, code, ID from API response
- **Financial Data**: Obligated amount + optional total amount
- **Error Handling**: Graceful failure capture with detailed reason logging

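**Rate Limiting**: Each call sleeps 0.3 seconds before sending its request. The delay is per call, not global, so with N workers the aggregate rate can approach N / 0.3 requests per second (roughly 67 per second at 20 workers)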
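
The zero-padding matters because pandas typically parses code columns in a CSV as integers, which drops leading zeros. A minimal sketch of the payload construction, separated from the HTTP call so it can be inspected on its own, might look like this. The helper name `build_payload` and the row values are made up for illustration; the field names follow `fetch_recipient`.

```python
def build_payload(row):
    """Build the request body for one federal account row (mirrors fetch_recipient)."""
    return {
        "type": "recipient",
        "filters": {
            "fy": str(row["fy"]),
            "quarter": str(row["quarter"]),
            # zfill restores leading zeros lost when a CSV column is parsed as int
            "budget_function": str(row["budget_function_code"]).zfill(3),
            "budget_subfunction": str(row["budget_subfunction_code"]).zfill(3),
            "federal_account": str(row["federal_account_code"]).zfill(4),
        },
    }


# Example row with made-up values, as pandas might parse it from CSV (codes as integers)
row = {"fy": 2024, "quarter": 1, "budget_function_code": 50,
       "budget_subfunction_code": 51, "federal_account_code": 123}
payload = build_payload(row)
print(payload["filters"])
```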

In [None]:
#  Step 1: Read and clean data

def read_and_filter_csv(file_path):
    """
    Reads a federal accounts CSV and returns it as a DataFrame.
    The zero-obligation filter below is commented out, so every row is kept.
    """
    df = pd.read_csv(file_path)
    #if "obligated_amount" in df.columns:
        #df = df[df["obligated_amount"] > 0]
    return df

##  Input File Processing - Federal Accounts Reader

**Plain CSV reading** for federal account input files (no validation or filtering is currently applied):

### Input Source
**File Pattern**: `federal_accounts_*.csv` files from previous pipeline stages
- **Source**: Generated by federal account collection system
- **Content**: Federal account records with fiscal year, quarter, and hierarchical codes
- **Format**: Structured CSV with consistent column naming

### Data Processing Strategy
**Current Implementation**: Reads all federal account records without filtering
```python
# Optional zero-amount filtering (currently commented out):
# if "obligated_amount" in df.columns:
#     df = df[df["obligated_amount"] > 0]
```

### Design Rationale
**Inclusive Approach**: Processes all federal accounts regardless of obligation amounts
- **Complete Coverage**: Ensures no recipient data is missed due to zero-obligation federal accounts
- **Downstream Filtering**: Allows recipient-level filtering instead of account-level pre-filtering
- **Data Integrity**: Maintains complete federal account context for API calls

**Output**: Unfiltered DataFrame ready for parallel recipient data collection

In [None]:
#  Step 2: Fetch data from API using ThreadPoolExecutor

def fetch_all_recipients(df, max_workers=50):
    """
    Submits all API calls in parallel using a thread pool and returns combined results.
    """
    session = setup_session()
    results = []
    failures = []

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(fetch_recipient, session, row) for _, row in df.iterrows()]
        for future in as_completed(futures):
            res, fail = future.result()
            results.extend(res)
            failures.extend(fail)

    return pd.DataFrame(results), pd.DataFrame(failures)


##  High-Performance Parallel Processing Engine

**Massive parallelization** of recipient data collection with **50 concurrent workers**:

### Parallel Processing Strategy
**ThreadPoolExecutor Configuration:**
- **Max Workers**: 50 concurrent threads (aggressive parallelization)
- **Task Distribution**: One API call per federal account row
- **Session Sharing**: Single HTTP session across all threads for connection pooling
- **Result Aggregation**: Combines all thread results into unified DataFrames

### Execution Flow
1. **Session Creation**: Single `setup_session()` call for all threads
2. **Task Submission**: Each DataFrame row becomes a separate thread task
3. **Concurrent Execution**: Up to 50 simultaneous API calls to USASpending.gov
4. **Result Collection**: `as_completed()` aggregates results as threads finish
5. **Data Separation**: Successful records and failures collected separately

### Performance Benefits
- **High Throughput**: Up to `max_workers` requests in flight at once; the actual speedup over sequential processing depends on API latency and the per-call delay, and stays below the nominal 50x
- **Connection Reuse**: Session pooling reduces connection overhead
- **Resource Efficiency**: Threads are lightweight for I/O-bound work such as HTTP calls
- **Scalable Design**: Handles large federal account datasets efficiently

**Output**: Two DataFrames - successful recipient records and failure logs
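
The submit-then-collect pattern can be tried without touching the API. The sketch below swaps `fetch_recipient` for a stand-in named `fake_fetch` (hypothetical, as is the rule that every fifth account fails) and uses plain lists instead of DataFrames; the executor logic matches `fetch_all_recipients`.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_fetch(row):
    """Stand-in for fetch_recipient: returns (records, failures) with no network call."""
    time.sleep(0.01)  # simulate I/O latency
    code = row["federal_account_code"]
    if code % 5 == 0:  # arbitrary rule to produce some simulated failures
        return [], [{"federal_account_code": code, "reason": "simulated error"}]
    return [{"federal_account_code": code, "recipient_id": 1}], []


def fetch_all(rows, max_workers=4):
    """Run fake_fetch over all rows in a thread pool and combine the outputs."""
    results, failures = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(fake_fetch, row) for row in rows]
        # as_completed yields futures in completion order, not submission order
        for future in as_completed(futures):
            res, fail = future.result()
            results.extend(res)
            failures.extend(fail)
    return results, failures


rows = [{"federal_account_code": n} for n in range(1, 11)]
results, failures = fetch_all(rows)
print(len(results), len(failures))
```

Because results arrive in completion order, the combined output is not sorted by input row; any ordering requirement has to be applied afterwards.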

In [None]:
def save_recipient_results(results_df, failures_df, file_path, output_base_folder):
    """
    Saves recipient results and failure logs.

    - Appends only unique recipient records (based on key columns).
    - Handles empty/corrupted existing result files safely.
    - Reports how many new records added vs duplicates skipped.
    - Overwrites the failures file each time.
    """

    base_filename = os.path.basename(file_path).replace(".csv", "")
    for prefix in ["federal_accounts_", "failures_"]:
        if base_filename.startswith(prefix):
            base_filename = base_filename.replace(prefix, "")
            break

    year_quarter = base_filename
    os.makedirs(output_base_folder, exist_ok=True)
    results_path = os.path.join(output_base_folder, f"recipients_{year_quarter}.csv")
    failures_path = os.path.join(output_base_folder, f"failures_{year_quarter}.csv")

    unique_keys = [
        "fy", "quarter", "budget_function_code",
        "budget_subfunction_code", "federal_account_code", "recipient_id"
    ]

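    # Compare keys as strings: values fetched in this run are strings (e.g. "050"),
    # whereas pandas would parse the same values read back from CSV as integers (50),
    # so duplicates across runs would otherwise go undetected.
    if not results_df.empty:
        results_df = results_df.astype({k: str for k in unique_keys})

    # Deduplicate incoming new results (returns a new frame; the caller's is untouched)
    results_df = results_df.drop_duplicates(subset=unique_keys)

    #  Try to read existing results safely
    existing_results = None
    if os.path.exists(results_path):
        try:
            existing_results = pd.read_csv(results_path, dtype=str)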
            if existing_results.empty or existing_results.columns.size == 0:
                print(f" Ignoring empty/corrupted results file: {results_path}")
                existing_results = None
        except pd.errors.EmptyDataError:
            print(f" Ignoring EmptyDataError on: {results_path}")
            existing_results = None
        except Exception as e:
            print(f" Skipped reading existing results ({e}): {results_path}")
            existing_results = None

    #  Merge results
    if existing_results is not None:
        before_count = len(existing_results)
        combined = pd.concat([existing_results, results_df], ignore_index=True)
        combined.drop_duplicates(subset=unique_keys, inplace=True)
        added_count = len(combined) - before_count
        duplicate_count = len(results_df) - added_count
    else:
        combined = results_df.copy()
        added_count = len(combined)
        duplicate_count = 0

    combined.to_csv(results_path, index=False)
    print(f" Results saved: {added_count} new rows  {results_path} (Duplicates skipped: {duplicate_count})")

    # Always overwrite failures
    failures_df.to_csv(failures_path, index=False)
    print(f" Failures overwritten: {len(failures_df)}  {failures_path}")


##  Idempotent Storage System - Smart Deduplication

**Advanced save logic** with **idempotent append capabilities** and **composite key deduplication**:

### Core Storage Strategy
**Purpose**: Safely merge new recipient data with existing results while preventing duplicates

### Deduplication Logic
**Composite Key System**: 
- **Unique Keys**: `(fy, quarter, budget_function_code, budget_subfunction_code, federal_account_code, recipient_id)`
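- **Conflict Resolution**: Existing records win. `drop_duplicates` keeps the first occurrence by default, and existing rows precede new rows in the concatenation, so a new record whose composite key is already stored is dropped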
- **Data Integrity**: Ensures each recipient appears only once per federal account context
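
Two properties of this merge are easy to get wrong: the earlier record is the one kept, and keys only match when they agree in both value and type (a fiscal year stored as the integer `2024` does not equal the string `"2024"`). The pure-Python sketch below illustrates both without pandas. The names `composite_key` and `merge_keep_first` are hypothetical, and the sample records are made up.

```python
KEY_FIELDS = ("fy", "quarter", "budget_function_code",
              "budget_subfunction_code", "federal_account_code", "recipient_id")
# Zero-padding widths that fetch_recipient applies to the code columns
PAD = {"budget_function_code": 3, "budget_subfunction_code": 3, "federal_account_code": 4}


def composite_key(record):
    """Return the record's composite key with every part normalized to a string."""
    return tuple(str(record[f]).zfill(PAD.get(f, 0)) for f in KEY_FIELDS)


def merge_keep_first(existing, new):
    """Merge two lists of records; on a key collision the earlier record is kept."""
    merged = {}
    for record in list(existing) + list(new):
        merged.setdefault(composite_key(record), record)
    return list(merged.values())


# Made-up records: the stored row has integer keys, the fresh rows have string keys
existing = [{"fy": 2024, "quarter": 1, "budget_function_code": 50,
             "budget_subfunction_code": 51, "federal_account_code": 123,
             "recipient_id": "r1", "obligated_amount": 10.0}]
new = [{"fy": "2024", "quarter": "1", "budget_function_code": "050",
        "budget_subfunction_code": "051", "federal_account_code": "0123",
        "recipient_id": "r1", "obligated_amount": 99.0},
       {"fy": "2024", "quarter": "1", "budget_function_code": "050",
        "budget_subfunction_code": "051", "federal_account_code": "0123",
        "recipient_id": "r2", "obligated_amount": 5.0}]
merged = merge_keep_first(existing, new)
print(len(merged), merged[0]["obligated_amount"])
```

If updated amounts should replace stored ones instead, the merge order (or `keep="last"` in pandas) would need to change; that is a design decision the notebook does not currently make.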

### File Management Process
1. **Filename Derivation**: Extracts FY_Q pattern from input filename  
2. **Existing File Handling**: Safely reads existing `recipients_{FY_Q}.csv` files
3. **Corruption Protection**: Handles empty/corrupted existing files gracefully
4. **Smart Merge**: Combines new and existing data with deduplication
5. **Full Rewrite**: Writes the merged result back with a single `to_csv` call; this is not atomic, so an interrupted write can leave a partial file
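
A plain `to_csv(path)` call writes directly into the target file. If an all-or-nothing write is wanted, one common approach is to write to a temporary file in the same directory and then rename it over the target. The helper below is a sketch of that approach, not part of the notebook; `atomic_write_text` is a hypothetical name, and the guarantee depends on the filesystem (it may not hold on a mounted Google Drive).

```python
import os
import tempfile


def atomic_write_text(path, text):
    """Write text to path so readers see either the old file or the complete new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the target directory so the rename stays on one filesystem
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8", newline="") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)  # rename over the target; atomic on POSIX filesystems
    except BaseException:
        # Clean up the temp file if anything failed before the rename
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With pandas this could be used as `atomic_write_text(results_path, combined.to_csv(index=False))`, since `to_csv` returns the CSV text when no path is given.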

### Failure File Strategy
**Always Overwrites**: `failures_{FY_Q}.csv` files are completely replaced each run
- **Current State**: Only shows failures from the most recent execution
- **Historical Context**: Previous failure data is not preserved
- **Clean Slate**: Each execution starts with fresh failure tracking

In [None]:
def run_initial_federal_account_processing(input_folder, output_folder, max_workers=50, start_fy=None):
    """
    Processes all federal_accounts_*.csv files.
    - Always writes recipients_{FY_Q}.csv (even if empty)
    - Writes failures_{FY_Q}.csv ONLY if there are failures (deletes old one if it exists)
    - Optional: start_fy to filter files by fiscal year
    """
    for file in os.listdir(input_folder):
        if not (file.endswith(".csv") and file.startswith("federal_accounts")):
            continue

        # Optional FY filter from filename: federal_accounts_FY2024_Q1.csv
        if start_fy is not None:
            try:
                fy = int(file.split("_FY")[1].split("_Q")[0])
                if fy < start_fy:
                    continue
            except Exception:
                print(f" Skipped (cannot parse FY): {file}")
                continue

        file_path = os.path.join(input_folder, file)
        print(f" Starting initial load for: {file_path}")

        df = read_and_filter_csv(file_path)
        if df.empty:
            print(f" Skipped (no data): {file_path}")
            continue

        results_df, failures_df = fetch_all_recipients(df, max_workers=max_workers)

        base_name = os.path.basename(file_path).replace(".csv", "")
        year_quarter = base_name.replace("federal_accounts_", "")
        results_path = os.path.join(output_folder, f"recipients_{year_quarter}.csv")
        failures_path = os.path.join(output_folder, f"failures_{year_quarter}.csv")

        os.makedirs(output_folder, exist_ok=True)

        #  Always save results
        results_df.to_csv(results_path, index=False)
        print(f" Saved: {len(results_df)}  {results_path}")

        #  Only save failures if any; delete stale failures if none
        if failures_df is not None and not failures_df.empty:
            failures_df.to_csv(failures_path, index=False)
            print(f" Failures: {len(failures_df)}  {failures_path}")
        else:
            if os.path.exists(failures_path):
                os.remove(failures_path)
                print(f" Removed stale failures file: {failures_path}")
            print(f" No failures for {year_quarter}")


##  Initial Run - Complete Folder Processing

**Primary data collection pipeline** that processes **all federal_accounts_*.csv files** in a folder:

### Folder Processing Strategy
**File Discovery Pattern**: Scans input folder for `federal_accounts*.csv` files
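- **Automatic Detection**: Matches any `.csv` file whose name starts with `federal_accounts`
- **FY Filtering**: Optional `start_fy` parameter skips files from earlier fiscal years (files whose fiscal year cannot be parsed are skipped as well); `main_controller` does not currently pass this parameter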
- **Comprehensive Coverage**: Processes every discovered federal account file

### Per-File Processing Flow
1. **File Validation**: Confirms CSV format and federal_accounts prefix
2. **FY Extraction**: Parses fiscal year from filename for optional filtering
3. **Data Loading**: Reads the federal account CSV via `read_and_filter_csv()`
4. **Parallel Collection**: Launches 50-worker ThreadPoolExecutor for recipient collection
5. **Results Storage**: Saves recipients and failures with period-scoped naming

### Output File Management
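**Always Created**: `recipients_{FY_Q}.csv` files (even if empty)
- **Consistent Output**: Every non-empty input file produces a corresponding recipient file, overwriting any previous one (the initial run does not merge with existing results)
- **Empty Handling**: If no recipients are found, the DataFrame has no columns, so the file is written without a header row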

**Conditionally Created**: `failures_{FY_Q}.csv` files
- **Only When Needed**: Created only if failures occur during collection
- **Cleanup Logic**: Deletes existing failure files if no new failures occur
- **Period Isolation**: Each fiscal year/quarter has separate failure tracking
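
The file selection rules above can be expressed as a small pure function, which makes them easy to check against sample filenames. This is a sketch: `parse_period` and `should_process` are hypothetical names, and the regular expression is stricter than the notebook's `split`-based parsing (it assumes a four-digit year and a quarter from 1 to 4).

```python
import re

# Matches names such as federal_accounts_FY2024_Q1.csv or failures_FY2019_Q3.csv
PERIOD_RE = re.compile(r"_FY(\d{4})_Q([1-4])\.csv$")


def parse_period(filename):
    """Return (fiscal_year, quarter) parsed from a period-scoped filename, or None."""
    match = PERIOD_RE.search(filename)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def should_process(filename, start_fy=None):
    """Mirror the initial-run filter: prefix/suffix check, then fiscal year >= start_fy."""
    if not (filename.startswith("federal_accounts") and filename.endswith(".csv")):
        return False
    if start_fy is None:
        return True
    period = parse_period(filename)
    # As in the notebook, a file whose fiscal year cannot be parsed is skipped
    return period is not None and period[0] >= start_fy


print(parse_period("federal_accounts_FY2024_Q1.csv"))
print(should_process("federal_accounts_FY2018_Q4.csv", start_fy=2019))
```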

In [None]:
def run_failure_retry_from_folder(failure_folder, output_folder, max_workers=50):
    """
    Retries all failures_*.csv in a folder.
    - Deletes & skips empty/corrupt failure files BEFORE reading
    - Appends new successful results to recipients_{FY_Q}.csv (via save_recipient_results)
    - Overwrites failures_{FY_Q}.csv with new failures
    - If new failures are empty, deletes failures_{FY_Q}.csv
    """
    for file in os.listdir(failure_folder):
        if not (file.endswith(".csv") and file.startswith("failures_")):
            continue

        file_path = os.path.join(failure_folder, file)
        print(f" Retrying failures from: {file_path}")

        #  Delete 0-byte files up front
        if os.path.getsize(file_path) == 0:
            os.remove(file_path)
            print(f" Deleted empty failure file: {file_path}")
            continue

        # Try reading safely
        try:
            df = pd.read_csv(file_path)
        except pd.errors.EmptyDataError:
            os.remove(file_path)
            print(f" Deleted corrupt failure file (EmptyDataError): {file_path}")
            continue
        except Exception as e:
            print(f" Skipped (read error: {e}): {file_path}")
            continue

        # Delete files that load but have no usable rows/columns
        if df.empty or df.columns.size == 0:
            os.remove(file_path)
            print(f" Deleted invalid failure file (no rows/cols): {file_path}")
            continue

        #  Retry valid failures
        results_df, failures_df = fetch_all_recipients(df, max_workers=max_workers)
        save_recipient_results(results_df, failures_df, file_path, output_folder)

        # If the fresh failures are empty, remove the just-written failures file
        fyq = file.replace("failures_", "").replace(".csv", "")
        failures_out_path = os.path.join(output_folder, f"failures_{fyq}.csv")
        if failures_df is None or failures_df.empty:
            if os.path.exists(failures_out_path):
                os.remove(failures_out_path)
                print(f" No remaining failures  deleted: {failures_out_path}")


##  Intelligent Retry System - Failure Recovery

**Targeted retry mechanism** for failed recipient collection attempts with **defensive file handling**:

### Retry Processing Strategy
**File Discovery**: Scans folder for `failures_*.csv` files from previous runs
- **Period-Scoped**: Each fiscal year/quarter has separate failure file
- **Selective Processing**: Only retries previously failed federal account records
- **Incremental Recovery**: Appends successful retries to existing recipient files

### Defensive File Management
**Pre-Processing Cleanup:**
1. **Zero-Byte Detection**: Automatically deletes empty failure files before processing
2. **Corruption Handling**: Safely handles corrupted CSV files with `try`/`except` logic
3. **File Validation**: Confirms readable CSV format before retry attempts
4. **Pipeline Protection**: Prevents stuck pipelines from corrupt failure files

### Retry Execution Flow
1. **Failure File Validation**: Checks file size and readability
2. **Failed Record Loading**: Reads federal account records that previously failed
3. **Parallel Retry**: Uses ThreadPoolExecutor to re-attempt API calls
4. **Incremental Append**: Uses `save_recipient_results()` for idempotent merging
5. **Success Integration**: Merges retry successes with existing recipient data

### Post-Retry Cleanup
**Failure File Management:**
- **New Failures**: Overwrites failure file if retry attempts still fail
- **Complete Success**: Deletes failure file if all retries succeed
- **Clean State**: Ensures only current failures are tracked
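
The pre-processing checks distinguish three cases: a zero-byte file, a file that parses but has no usable rows, and a file worth retrying. A standard-library sketch of that classification is shown below; `failure_file_status` is a hypothetical helper that uses the `csv` module in place of pandas, so its edge-case behavior may differ slightly from `pd.read_csv`.

```python
import csv
import os


def failure_file_status(path):
    """Classify a failure file as 'zero_bytes', 'no_rows', or 'ok'."""
    if os.path.getsize(path) == 0:
        return "zero_bytes"
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, None)
        # A header counts only if at least one cell is non-blank
        has_header = header is not None and any(cell.strip() for cell in header)
        # A data row counts only if at least one cell is non-blank
        has_rows = any(any(cell.strip() for cell in row) for row in reader)
    if not has_header or not has_rows:
        return "no_rows"
    return "ok"
```

In the notebook's terms, the first two statuses correspond to files that are deleted and skipped, and `'ok'` to files that are passed on to `fetch_all_recipients`.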

In [None]:
def process_single_federal_account_file(file_path, output_folder, max_workers=50):
    """
    Processes one federal_accounts_*.csv file and saves results and failures.
    Merges results into any existing recipients CSV (deduplicated) and
    overwrites the failures CSV.
    """
    print(f" Processing single file: {file_path}")
    df = read_and_filter_csv(file_path)
    if df.empty:
        print(" Skipped: input file has no rows")
        return
    results_df, failures_df = fetch_all_recipients(df, max_workers=max_workers)

    # Reuse existing saving logic
    save_recipient_results(results_df, failures_df, file_path, output_folder)


##  Single-File Processing Helper

**Streamlined processing** for individual federal account files:

### Purpose & Use Cases
**Single File Focus**: Processes one `federal_accounts_*.csv` file at a time
- **Development/Testing**: Perfect for testing pipeline on individual files
- **Selective Processing**: Process specific fiscal year/quarter combinations
- **Debug/Analysis**: Isolate processing for troubleshooting specific periods

### Processing Flow
1. **File Loading**: Uses `read_and_filter_csv()` for defensive CSV reading
2. **Data Validation**: Checks for empty DataFrame after loading
3. **Parallel Collection**: Launches ThreadPoolExecutor for recipient data collection
4. **Result Storage**: Uses `save_recipient_results()` for consistent output formatting

### Output Behavior
**Reuses Standard Logic**: Leverages existing `save_recipient_results()` function
- **Idempotent Append**: Merges with existing recipient files if present
- **Deduplication**: Applies composite key deduplication logic
- **Failure Tracking**: Creates/updates failure files as needed

**Consistency**: Produces identical output format as folder processing functions

In [None]:
def main_controller(
    mode,
    input_folder=None,
    output_folder=None,
    single_file_path=None,
    max_workers=50
):
    """
    Unified entry point to:
    - Run full initial load: mode='initial'
    - Retry from failures: mode='retry'
    - Run one specific file: mode='single'

    Args:
        mode (str): 'initial', 'retry', or 'single'
        input_folder (str): Path to folder with input federal_accounts_*.csv files
        output_folder (str): Path where recipients_* and failures_* files are stored
        single_file_path (str): Path to one federal_accounts_*.csv file for single mode
        max_workers (int): Thread pool size
    """
    # Raise explicitly: assert statements are skipped when Python runs with -O
    if mode not in {"initial", "retry", "single"}:
        raise ValueError("Invalid mode. Choose: 'initial', 'retry', or 'single'")

    if mode == "initial":
        if not input_folder or not output_folder:
            raise ValueError(" Please provide both input_folder and output_folder for initial mode.")
        run_initial_federal_account_processing(input_folder, output_folder, max_workers=max_workers)

    elif mode == "retry":
        if not input_folder or not output_folder:
            raise ValueError(" Please provide both input_folder and output_folder for retry mode.")
        run_failure_retry_from_folder(input_folder, output_folder, max_workers=max_workers)

    elif mode == "single":
        if not single_file_path or not output_folder:
            raise ValueError(" Please provide both single_file_path and output_folder for single mode.")
        process_single_federal_account_file(single_file_path, output_folder, max_workers=max_workers)

    print(" Done.")


##  Master Controller - Unified Entry Point

**Comprehensive orchestration system** with **three distinct processing modes**:

### Controller Architecture
**Single Entry Point**: Unified interface for all recipient collection operations
- **Mode-Based Routing**: Intelligent dispatch based on processing requirements
- **Parameter Validation**: Ensures required parameters for each mode
- **Consistent Interface**: Standardized function signature across all modes

### Processing Modes

#### 1. **"initial"** Mode - Complete Data Harvest
```python
main_controller("initial", 
    input_folder="/USASpendingResults", 
    output_folder="/recipients")
```
- **Purpose**: Processes ALL `federal_accounts_*.csv` files in source folder
- **Input**: Federal account files from previous pipeline stages
- **Output**: Complete recipient dataset with period-scoped organization

#### 2. **"retry"** Mode - Failure Recovery
```python
main_controller("retry", 
    input_folder="/recipients", 
    output_folder="/recipients")
```
- **Purpose**: Retries ALL `failures_*.csv` files in target folder
- **Strategy**: Selective re-processing of previously failed federal accounts
- **Integration**: Incrementally appends successful retries to existing recipient files

#### 3. **"single"** Mode - Individual File Processing
```python
main_controller("single", 
    single_file_path="/path/to/federal_accounts_FY2024_Q1.csv", 
    output_folder="/recipients")
```
- **Purpose**: Process one specific federal account file
- **Use Cases**: Testing, debugging, selective processing

In [None]:
main_controller(
    mode="initial",
    input_folder="/content/drive/MyDrive/USASpendingResults",
    output_folder="/content/drive/MyDrive/USASpendingResults/recipients",
    max_workers=20
)

##  Execution Example - Initial Data Collection

**Primary data collection run** processing all federal account files:

### Configuration Details
- **Mode**: `"initial"` - Complete folder processing
- **Input Folder**: `/content/drive/MyDrive/USASpendingResults`
  - Contains all `federal_accounts_*.csv` files from previous pipeline stages
- **Output Folder**: `/content/drive/MyDrive/USASpendingResults/recipients`
  - Will contain `recipients_{FY_Q}.csv` and `failures_{FY_Q}.csv` files
- **Workers**: 20 concurrent threads (balanced performance for Colab)

### Expected Process Flow
1. **File Discovery**: Scans input folder for all `federal_accounts_*.csv` files
2. **Parallel Processing**: A 20-thread pool processes the rows of each federal account file, one file at a time
3. **API Collection**: Calls USASpending.gov API with `type="recipient"` filters
4. **Data Organization**: Creates period-scoped recipient files
5. **Failure Tracking**: Logs any API failures for retry processing

### Typical Results
- **Files Created**: Multiple `recipients_FY{YYYY}_Q{Q}.csv` files
- **Failure Files**: `failures_FY{YYYY}_Q{Q}.csv` files (if failures occur)
- **Data Scale**: Thousands of recipient records per federal account file

In [None]:
main_controller(
    mode="retry",
    input_folder="/content/drive/MyDrive/USASpendingResults/recipients",
    output_folder="/content/drive/MyDrive/USASpendingResults/recipients",
    max_workers=20
)


 Retrying failures from: /content/drive/MyDrive/USASpendingResults/recipients/failures_FY2019_Q1.csv
 Results saved: 0 new rows  /content/drive/MyDrive/USASpendingResults/recipients/recipients_FY2019_Q1.csv (Duplicates skipped: 0)
 Failures overwritten: 1  /content/drive/MyDrive/USASpendingResults/recipients/failures_FY2019_Q1.csv
 Retrying failures from: /content/drive/MyDrive/USASpendingResults/recipients/failures_FY2019_Q2.csv
 Results saved: 0 new rows  /content/drive/MyDrive/USASpendingResults/recipients/recipients_FY2019_Q2.csv (Duplicates skipped: 0)
 Failures overwritten: 1  /content/drive/MyDrive/USASpendingResults/recipients/failures_FY2019_Q2.csv
 Retrying failures from: /content/drive/MyDrive/USASpendingResults/recipients/failures_FY2019_Q3.csv
 Results saved: 0 new rows  /content/drive/MyDrive/USASpendingResults/recipients/recipients_FY2019_Q3.csv (Duplicates skipped: 0)
 Failures overwritten: 1  /content/drive/MyDrive/USASpendingResults/recipients/failures_FY2019_Q3.csv
 

##  Execution Example - Retry Failed Collections

**Targeted retry processing** for previously failed recipient collection attempts:

### Configuration Details  
- **Mode**: `"retry"` - Failure recovery processing
- **Input Folder**: `/content/drive/MyDrive/USASpendingResults/recipients`
  - Same as output folder - looks for `failures_*.csv` files created during initial run
- **Output Folder**: `/content/drive/MyDrive/USASpendingResults/recipients`
  - Updates existing recipient files with retry successes
- **Workers**: 20 concurrent threads (conservative retry approach)

### Retry Process Flow
1. **Failure Discovery**: Scans recipient folder for `failures_*.csv` files
2. **File Validation**: Deletes empty/corrupted failure files automatically
3. **Selective Retry**: Only processes federal accounts that failed initially  
4. **Incremental Append**: Merges successful retries with existing recipient data
5. **Cleanup Logic**: Deletes failure files if no remaining failures

### Expected Outcomes
- **Success Recovery**: Previously failed federal accounts may now succeed
- **Data Completeness**: Improved recipient data coverage
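- **Clean State**: Failure files that remain list requests that still fail; these may be persistent API errors or genuine data gaps, so check the `reason` column before treating them as final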
- **File Management**: Automatic cleanup of resolved failure files

**Best Practice**: Run retry after initial collection to maximize data completeness
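
Since a retry pass can itself leave failures behind (as in the output above, where each quarter still reports one), it may be useful to repeat the pass a bounded number of times. The loop below is a sketch of that idea; `count_failure_files` and `retry_until_stable` are hypothetical helpers, and the retry action is passed in as a callable so the loop can be tested without calling the API.

```python
import glob
import os


def count_failure_files(folder):
    """Count failures_*.csv files currently present in folder."""
    return len(glob.glob(os.path.join(folder, "failures_*.csv")))


def retry_until_stable(run_retry, folder, max_passes=3):
    """Call run_retry() until no failure files remain or max_passes is reached.

    Returns the number of passes performed.
    """
    passes = 0
    while passes < max_passes and count_failure_files(folder) > 0:
        run_retry()
        passes += 1
    return passes
```

In the notebook this might be invoked as `retry_until_stable(lambda: main_controller("retry", input_folder=folder, output_folder=folder, max_workers=20), folder)`, where `folder` is the recipients directory. The `max_passes` bound matters because failures that never resolve would otherwise loop forever.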