# Budget Function Amounts


In [None]:
import os
import time
import pandas as pd
import requests
from concurrent.futures import ThreadPoolExecutor, as_completed
from requests.adapters import HTTPAdapter, Retry
import logging, urllib3

# Quiet noisy warnings
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
logging.getLogger("urllib3.connectionpool").setLevel(logging.ERROR)

##  Environment Setup & Library Imports

This section imports the Python libraries the notebook relies on:
- **Data handling**: `pandas` for DataFrames
- **HTTP requests**: `requests`, plus `HTTPAdapter` and `Retry` for session-level retry configuration
- **Concurrent processing**: `ThreadPoolExecutor` and `as_completed` for parallel API calls
- **System operations**: `os` for file paths and directories, `time` for pacing delays
- **Logging control**: Silences urllib3's `InsecureRequestWarning` and connection-pool log messages below `ERROR` for cleaner output

##  Google Drive Integration

This section connects the notebook to Google Drive for data storage:
- **Google Colab**: Uses `google.colab.drive.mount()` to make Google Drive available at `/content/drive`
- **File Storage**: Outputs are written under `/content/drive/MyDrive/Federal Funding/`; the save functions create the folders as needed
- **Access Point**: The mount must succeed before any save cell runs

*Note: This cell is designed for the Google Colab environment. Outside Colab (for example in VS Code), `google.colab` is not available, so you would need a separate Google Drive API setup or a local output folder.*
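
One way to keep the same notebook usable outside Colab is to guard the import and fall back to a local folder. The sketch below is only an illustration: `resolve_base_dir` and the `./Federal Funding` fallback path are hypothetical names, not part of the original notebook.

```python
import os

def resolve_base_dir(local_fallback="./Federal Funding"):
    """Return the Drive folder in Colab, or a local folder elsewhere.

    `resolve_base_dir` and `local_fallback` are hypothetical names for this sketch.
    """
    try:
        from google.colab import drive  # only importable inside Colab
    except ImportError:
        # Not running in Colab: use (and create) a local output folder instead
        os.makedirs(local_fallback, exist_ok=True)
        return local_fallback
    drive.mount("/content/drive")
    base = "/content/drive/MyDrive/Federal Funding"
    os.makedirs(base, exist_ok=True)
    return base
```

The returned path could then be joined with `Budget Functions` or `Budget Subfunctions` in place of the hard-coded Drive paths used later.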

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# -------- Session (retries disabled as per your preference) --------
def setup_session():
    s = requests.Session()
    retries = Retry(
        total=0,
        backoff_factor=1.0,
        status_forcelist=[500, 502, 503, 504],
        allowed_methods=["POST"],
    )
    s.mount("https://", HTTPAdapter(max_retries=retries))
    return s

##  HTTP Session Configuration

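Sets up the HTTP session used for all API calls:
- **No Automatic Retries**: `total=0` disables urllib3's automatic retries, per user preference
- **Dormant Retry Settings**: `backoff_factor`, `status_forcelist` (500, 502, 503, 504), and `allowed_methods=["POST"]` describe how retries would behave, but with `total=0` no request is ever re-sent
- **Gentle API Approach**: Requests are paced with a short sleep instead; rate-limit responses (HTTP 429) are not retried and end up in the failure records
- **HTTPS Only**: The adapter is mounted on the `https://` prefix, which covers the USASpending.gov API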
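
The notebook itself does not re-send requests after an HTTP 429. If manual rate-limit handling were wanted, it might look like the sketch below. `post_with_429_backoff` is a hypothetical helper, and it only understands a numeric `Retry-After` header (the HTTP-date form is ignored and falls back to exponential backoff).

```python
import time

def post_with_429_backoff(session, url, payload, max_attempts=4,
                          base_delay=1.0, timeout=60, sleep=time.sleep):
    """POST once, then re-send only while the server answers HTTP 429.

    Hypothetical helper: not part of the notebook's pipeline.
    `session` is any object with a requests-style .post() method.
    """
    resp = None
    for attempt in range(max_attempts):
        resp = session.post(url, json=payload, timeout=timeout)
        if resp.status_code != 429 or attempt == max_attempts - 1:
            break
        # Use the Retry-After header when it is a whole number of seconds;
        # otherwise fall back to exponential backoff (1s, 2s, 4s, ...).
        hint = str(resp.headers.get("Retry-After", ""))
        delay = float(hint) if hint.isdigit() else base_delay * (2 ** attempt)
        sleep(delay)
    return resp
```

The `sleep` parameter is injectable so the helper can be exercised without real waiting; in normal use the default `time.sleep` applies.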

##  Stage 1: Budget Functions Data Fetcher

This function collects budget function data for a single fiscal year and quarter:

- **API Endpoint**: `https://api.usaspending.gov/api/v2/spending/`
- **Request Type**: POST with JSON payload
- **Data Filters**: Fiscal Year (FY) and Quarter (Q)

**Key Features**:
- **Gentle Pacing**: Each call sleeps `sleep_sec` (default 0.25 s) before sending its request
- **Data Standardization**: Ensures 3-digit budget function codes (e.g., '050' for National Defense)
- **Error Handling**: Returns `(records, failure)`; on any exception, `records` is empty and `failure` holds the FY, quarter, and error message
- **Timeout Protection**: 60-second timeout to prevent hanging requests
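
To make the data standardization step concrete, here is a small sketch of how one response body could be flattened into rows. The helper name `rows_from_spending_response` and the sample body are made up for illustration, and the sketch assumes a missing code should stay `None` rather than become a string.

```python
def rows_from_spending_response(data, fy, quarter):
    """Flatten one response body into row dicts (hypothetical helper name)."""
    rows = []
    for item in data.get("results", []):
        code = item.get("code")
        rows.append({
            "fy": int(fy),
            "quarter": int(quarter),
            # zfill pads on the left: "50" -> "050"; a missing code stays None
            "budget_function_code": None if code is None else str(code).zfill(3),
            "budget_function_name": item.get("name"),
            "amount": item.get("amount", 0.0),
        })
    return rows

# Made-up response body; names and amounts are placeholders, not real figures.
sample = {"results": [
    {"code": "50", "name": "National Defense", "amount": 123.0},
    {"code": None, "name": "Example row without a code", "amount": 4.5},
]}
rows = rows_from_spending_response(sample, fy="2020", quarter="3")
print([r["budget_function_code"] for r in rows])
```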

In [None]:
# -------- Fetch one FY/Q (type=budget_function) --------
def fetch_budget_functions_one_quarter(session, fy, quarter, sleep_sec=0.25):
    """
    Returns (records, failure) where:
      records = list of dict rows
      failure = dict with reason on error, else None
    """
    time.sleep(sleep_sec)  # gentle pacing
    url = "https://api.usaspending.gov/api/v2/spending/"
    payload = {"type": "budget_function", "filters": {"fy": str(fy), "quarter": str(quarter)}}

    try:
        r = session.post(url, json=payload, timeout=60)
        r.raise_for_status()
        data = r.json()
        rows = []
        for item in data.get("results", []):
            code = item.get("code")
            rows.append({
                "fy": int(fy),
                "quarter": int(quarter),
                # Pad to 3 digits; keep a missing code as None instead of the string "None"
                "budget_function_code": str(code).zfill(3) if code is not None else None,
                "budget_function_name": item.get("name"),
                "amount": item.get("amount", 0.0),
                # Keep an optional total if present; won't break anything if missing
                "total": item.get("total", None),
            })
        return rows, None
    except Exception as e:
        return [], {
            "fy": int(fy),
            "quarter": int(quarter),
            "reason": str(e),
        }

In [None]:
# -------- Orchestrator --------
def get_budget_functions_quarterly(start_fy, end_fy, max_workers=20):
    """
    Returns (df_all, df_failures).
    df_all columns: fy, quarter, budget_function_code, budget_function_name, amount, total
    """
    session = setup_session()
    tasks = [(fy, q) for fy in range(int(start_fy), int(end_fy) + 1) for q in (1, 2, 3, 4)]

    results = []
    failures = []

    with ThreadPoolExecutor(max_workers=max_workers) as ex:
        futs = {ex.submit(fetch_budget_functions_one_quarter, session, fy, q): (fy, q) for fy, q in tasks}
        for fut in as_completed(futs):
            fy, q = futs[fut]
            rows, fail = fut.result()
            if rows:
                results.extend(rows)
            if fail:
                failures.append(fail)

    df_all = pd.DataFrame(results)
    df_fail = pd.DataFrame(failures)
    return df_all, df_fail

##  Budget Functions Orchestrator

This is the main coordinator for collecting all budget functions data:

- **Scope**: Collects data for all fiscal years and quarters in the specified range
- **Parallelization**: Uses `ThreadPoolExecutor` for concurrent API calls (default: 20 workers)
- **Task Generation**: Creates (FY, Quarter) combinations for complete coverage

**Data Coverage**:
- **Fiscal Years**: User-defined range (e.g., 2017-2024)
- **Quarters**: All 4 quarters per fiscal year
- **Total Tasks**: 8 years × 4 quarters = 32 API calls (example)

**Output**: Returns two DataFrames - successful results and any failures
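
The task generation and fan-out pattern can be sketched without touching the network. In the example below, `build_tasks` and `fake_fetch` are hypothetical stand-ins: `fake_fetch` replaces the real API call and simply echoes its arguments.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def build_tasks(start_fy, end_fy):
    """All (fiscal year, quarter) pairs in the inclusive range."""
    return [(fy, q) for fy in range(start_fy, end_fy + 1) for q in (1, 2, 3, 4)]

def fake_fetch(fy, quarter):
    """Stand-in for the real API call: returns (rows, failure)."""
    return [{"fy": fy, "quarter": quarter}], None

tasks = build_tasks(2017, 2024)
print(len(tasks))  # 8 years x 4 quarters = 32

results, failures = [], []
with ThreadPoolExecutor(max_workers=4) as ex:
    futs = {ex.submit(fake_fetch, fy, q): (fy, q) for fy, q in tasks}
    for fut in as_completed(futs):  # yields futures in completion order
        rows, fail = fut.result()
        results.extend(rows)
        if fail:
            failures.append(fail)

# Completion order is not task order, so sort before relying on row order.
results.sort(key=lambda r: (r["fy"], r["quarter"]))
```

Because `as_completed` yields futures as they finish, the collected rows arrive in no particular order, which is why the save step sorts before writing.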

##  Data Storage & Organization

This function handles saving and organizing the collected budget functions data:

**File Structure**:
- **Master File**: `budget_functions_quarterly_all.csv` - Combined data for all years
- **Per-Year Files**: `budget_functions_FY{YYYY}.csv` - Individual files for each fiscal year
- **Failure Log**: `failures_budget_functions_quarterly.csv` - Records any failed API calls

**Data Processing**:
- **Sorting**: Organizes data by fiscal year, quarter, and budget function code
- **Directory Creation**: Automatically creates output folders if they don't exist
- **Progress Reporting**: Shows row counts and file paths for verification

In [None]:
# -------- Save helpers --------
def save_quarterwise_outputs(df_all, df_fail, output_folder, also_split_by_fy=True):
    os.makedirs(output_folder, exist_ok=True)

    # Combined master CSV (all FYs & quarters)
    combined_path = os.path.join(output_folder, "budget_functions_quarterly_all.csv")
    # Sort for readability
    sort_cols = [c for c in ["fy", "quarter", "budget_function_code"] if c in df_all.columns]
    if not df_all.empty and sort_cols:
        df_all = df_all.sort_values(sort_cols)
    df_all.to_csv(combined_path, index=False)
    print(f" Saved {len(df_all):,} rows  {combined_path}")

    # Failures
    if df_fail is not None and not df_fail.empty:
        fail_path = os.path.join(output_folder, "failures_budget_functions_quarterly.csv")
        df_fail.to_csv(fail_path, index=False)
        print(f" Failures logged: {len(df_fail)}  {fail_path}")
    else:
        print(" No failures reported.")

    # Optional: one CSV per FY
    if also_split_by_fy and not df_all.empty:
        for fy, grp in df_all.groupby("fy"):
            out_path = os.path.join(output_folder, f"budget_functions_FY{int(fy)}.csv")
            grp.to_csv(out_path, index=False)
            print(f" FY {fy}: {len(grp):,} rows  {out_path}")


In [None]:
OUTPUT_DIR = "/content/drive/MyDrive/Federal Funding/Budget Functions"

START_FY = 2017
END_FY   = 2024

df_all, df_fail = get_budget_functions_quarterly(START_FY, END_FY, max_workers=20)
save_quarterwise_outputs(df_all, df_fail, OUTPUT_DIR, also_split_by_fy=True)


 Saved 621 rows  /content/drive/MyDrive/Federal Funding/Budget Functions/budget_functions_quarterly_all.csv
 No failures reported.
 FY 2017: 61 rows  /content/drive/MyDrive/Federal Funding/Budget Functions/budget_functions_FY2017.csv
 FY 2018: 80 rows  /content/drive/MyDrive/Federal Funding/Budget Functions/budget_functions_FY2018.csv
 FY 2019: 80 rows  /content/drive/MyDrive/Federal Funding/Budget Functions/budget_functions_FY2019.csv
 FY 2020: 80 rows  /content/drive/MyDrive/Federal Funding/Budget Functions/budget_functions_FY2020.csv
 FY 2021: 80 rows  /content/drive/MyDrive/Federal Funding/Budget Functions/budget_functions_FY2021.csv
 FY 2022: 80 rows  /content/drive/MyDrive/Federal Funding/Budget Functions/budget_functions_FY2022.csv
 FY 2023: 80 rows  /content/drive/MyDrive/Federal Funding/Budget Functions/budget_functions_FY2023.csv
 FY 2024: 80 rows  /content/drive/MyDrive/Federal Funding/Budget Functions/budget_functions_FY2024.csv


##  Stage 1 Execution: Budget Functions Collection

**Configuration**:
- **Output Directory**: `/content/drive/MyDrive/Federal Funding/Budget Functions`
- **Date Range**: FY 2017 - 2024 (8 fiscal years)
- **Worker Threads**: 20 concurrent API calls for faster processing

**Expected Results**:
- **Total API Calls**: 32 requests (8 years × 4 quarters)
- **Processing Time**: ~5-15 minutes depending on API response times
- **Data Volume**: ~600-800 records (varies by actual budget function activity)

**This cell will execute the full budget functions data collection pipeline.**
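
In the run shown above, FY 2017 has fewer rows (61) than the later years (80 each), so a coverage check can be useful. The following sketch lists expected (FY, quarter) pairs that returned no rows at all; `missing_quarters` is a hypothetical helper and the input pairs are illustrative, not taken from the actual data.

```python
def missing_quarters(observed_pairs, start_fy, end_fy):
    """Return expected (fy, quarter) pairs that have no rows at all."""
    expected = {(fy, q) for fy in range(start_fy, end_fy + 1) for q in (1, 2, 3, 4)}
    seen = {(int(fy), int(q)) for fy, q in observed_pairs}
    return sorted(expected - seen)

# Illustrative input only; with the notebook's data you could pass
# zip(df_all["fy"], df_all["quarter"]) instead.
observed = [(2017, 2), (2017, 3), (2017, 4),
            (2018, 1), (2018, 2), (2018, 3), (2018, 4)]
print(missing_quarters(observed, 2017, 2018))
```

An empty list would mean every quarter in the range produced at least one row; it says nothing about whether each quarter is complete.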

In [None]:
import os, glob, re, pandas as pd

def _safe_read_csv(path):
    try:
        if not os.path.exists(path) or os.path.getsize(path) == 0:
            return None
        df = pd.read_csv(path)
        if df.empty or df.columns.size == 0:
            return None
        return df
    except Exception:
        return None

def debug_and_build_tasks(budget_functions_dir):
    """
    Loads your Budget Function files and builds unique (fy, quarter, budget_function_code, budget_function_name) tasks.
    Prints a small diagnostic so you can see why tasks may be empty.
    Cleans codes like '50', '050', '50.0', 'BF-050' -> '050'. Skips true blanks.
    """
    # 1) load
    combined = os.path.join(budget_functions_dir, "budget_functions_quarterly_all.csv")
    df = _safe_read_csv(combined)
    source = combined
    if df is None:
        files = sorted(glob.glob(os.path.join(budget_functions_dir, "budget_functions_FY*.csv")))
        parts = [_safe_read_csv(f) for f in files]
        parts = [p for p in parts if p is not None]
        if not parts:
            print(f" No readable files found in: {budget_functions_dir}")
            return pd.DataFrame(), pd.DataFrame()
        df = pd.concat(parts, ignore_index=True)
        source = f"{len(parts)} FY files"

    print(f" Loaded from: {source}   rows: {len(df)}")
    needed = {"fy", "quarter", "budget_function_code"}
    missing = needed - set(df.columns)
    if missing:
        print(f" Missing required columns: {missing}")
        return pd.DataFrame(), pd.DataFrame()

    # 2) normalize FY/Q
    df = df.copy()
    df["fy"] = pd.to_numeric(df["fy"], errors="coerce").astype("Int64")
    df["quarter"] = pd.to_numeric(df["quarter"], errors="coerce").astype("Int64")

    # 3) clean codes & collect diagnostics
    diag = []
    total = len(df)

    # a) blank / null
    null_mask = df["budget_function_code"].isna()
    diag.append(("null_code", int(null_mask.sum())))

    # b) normalized digits (keep last 3 digits if available)
    def clean_code(v):
        if pd.isna(v): return None
        s = str(v).strip()
        if s.lower() in {"", "nan", "none"}: return None
        s = re.sub(r"\.0+$", "", s)          # '50.0' -> '50'
        digits = re.sub(r"\D", "", s)        # keep only digits
        if digits == "": return None
        # take last 3 digits (handles '0050' -> '050', '50' -> '50')
        digits = digits[-3:]
        return digits.zfill(3)

    df["bf_code_clean"] = df["budget_function_code"].apply(clean_code)
    cleaned_null = df["bf_code_clean"].isna()
    diag.append(("non_digits_or_empty_after_clean", int(cleaned_null.sum())))

    # rows to keep
    keep = (~null_mask) & (~cleaned_null) & df["fy"].notna() & df["quarter"].notna()
    kept = df.loc[keep].copy()
    kept["budget_function_code"] = kept["bf_code_clean"]
    if "budget_function_name" not in kept.columns:
        kept["budget_function_name"] = None

    # 4) build tasks
    task_df = (
        kept[["fy", "quarter", "budget_function_code", "budget_function_name"]]
        .dropna()
        .astype({"fy":"int", "quarter":"int"})
        .drop_duplicates()
        .sort_values(["fy","quarter","budget_function_code"])
        .reset_index(drop=True)
    )
    print(" Diagnostic:")
    print(f"  total_rows: {total}")
    for k, v in diag:
        print(f"  {k}: {v}")
    print(f"  kept_rows_after_clean: {len(kept)}")
    print(f"  unique_tasks: {len(task_df)}")
    if not task_df.empty:
        print(task_df.head(10))

    # Also return a small diagnostics dataframe you can inspect
    diag_df = pd.DataFrame(diag, columns=["reason","count"])
    return task_df, diag_df


##  Stage 2 Preparation: Task Building & Data Validation

This section prepares for Stage 2 (subfunctions collection) by analyzing Stage 1 results:

**Data Loading**: 
- Reads the combined master CSV from the Stage 1 output directory, falling back to the per-FY files if the master is missing, empty, or unreadable
- Performs data quality checks and diagnostics

**Code Cleaning**:
- Handles various budget function code formats ('50', '050', '50.0', 'BF-050')
- Standardizes all codes to 3-digit format (e.g., '050')
- Removes null/invalid entries with detailed reporting

**Task Generation**:
- Creates unique (FY, Quarter, Function Code, Function Name) combinations
- Each task represents one API call for subfunctions data
- Provides diagnostic information about data quality and task count
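
The code cleaning rules can be tried in isolation. This is a standalone sketch of the same normalization using only the standard library (it checks for `None` directly instead of using `pd.isna`):

```python
import re

def clean_code(value):
    """Normalize a budget function code to 3 digits, or None if unusable."""
    if value is None:
        return None
    s = str(value).strip()
    if s.lower() in {"", "nan", "none"}:
        return None
    s = re.sub(r"\.0+$", "", s)      # '50.0' -> '50'
    digits = re.sub(r"\D", "", s)    # drop every non-digit character
    if not digits:
        return None
    return digits[-3:].zfill(3)      # keep the last 3 digits, left-pad with zeros

for raw in ["50", "050", "50.0", "BF-050", "0050", "", None]:
    print(repr(raw), "->", repr(clean_code(raw)))
```

One caveat of keeping only the last three digits: a value such as `'1050'` is silently truncated to `'050'`, and `'50.5'` becomes `'505'`, so unexpected inputs are worth inspecting before trusting the cleaned column.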

##  Task Execution Setup

**Configuration**:
- **Source Directory**: `/content/drive/MyDrive/Federal Funding/Budget Functions`
- **Task Building**: Converts Stage 1 results into actionable subfunctions collection tasks
- **Task Objects**: Builds a list of lightweight task objects alongside `tasks_df`; the Stage 2 fetcher itself consumes `tasks_df` directly

**Diagnostic Output**:
- Shows total rows processed, null/invalid codes removed
- Displays sample tasks to verify data quality
- Reports final task count ready for Stage 2 execution

**Expected Output**: 500-600 tasks (depends on actual budget function coverage across years)

In [None]:
# Example: build tasks, then run your threadpool fetcher
BUDGET_FUNCTIONS_DIR = "/content/drive/MyDrive/Federal Funding/Budget Functions"

tasks_df, diag_df = debug_and_build_tasks(BUDGET_FUNCTIONS_DIR)

# If tasks_df is empty, you now know *why* from the printed diagnostic.
# The Stage 2 fetcher takes tasks_df directly. If you also want lightweight task
# objects (for logging or a custom executor), build them as namedtuples:
from collections import namedtuple

Task = namedtuple("Task", ["fy", "quarter", "budget_function_code", "budget_function_name"])

tasks = [Task(int(r.fy), int(r.quarter), r.budget_function_code, r.budget_function_name)
         for r in tasks_df.itertuples(index=False)]

print(f" Will run {len(tasks)} tasks.")


 Loaded from: /content/drive/MyDrive/Federal Funding/Budget Functions/budget_functions_quarterly_all.csv   rows: 621
 Diagnostic:
  total_rows: 621
  null_code: 32
  non_digits_or_empty_after_clean: 32
  kept_rows_after_clean: 589
  unique_tasks: 589
     fy  quarter budget_function_code                    budget_function_name
0  2017        2                  000                   Governmental Receipts
1  2017        2                  050                        National Defense
2  2017        2                  150                   International Affairs
3  2017        2                  250  General Science, Space, and Technology
4  2017        2                  270                                  Energy
5  2017        2                  300       Natural Resources and Environment
6  2017        2                  350                             Agriculture
7  2017        2                  370             Commerce and Housing Credit
8  2017        2                  400          

In [None]:
import time, pandas as pd, requests
from concurrent.futures import ThreadPoolExecutor, as_completed
from requests.adapters import HTTPAdapter, Retry

SPENDING_URL = "https://api.usaspending.gov/api/v2/spending/"

def fetch_subfunctions_spending_for_tasks(tasks_df, max_workers=24, pause=0.15, timeout=60):
    """
    tasks_df columns required:
      - fy (int), quarter (int), budget_function_code (str/num), budget_function_name (optional)
    Returns: results_df, failures_df
    """
    if tasks_df is None or tasks_df.empty:
        return pd.DataFrame(), pd.DataFrame()

    df = tasks_df.copy()

    # Normalize types + codes
    df["fy"] = pd.to_numeric(df["fy"], errors="coerce").astype("Int64")
    df["quarter"] = pd.to_numeric(df["quarter"], errors="coerce").astype("Int64")
    if "budget_function_name" not in df.columns:
        df["budget_function_name"] = None

    # Require non-null fy/q/code
    df = df[df["fy"].notna() & df["quarter"].notna() & df["budget_function_code"].notna()].copy()
    if df.empty:
        return pd.DataFrame(), pd.DataFrame()

    # Clean + pad function codes (handles '50', '050', '50.0', etc.)
    def _pad3(x):
        s = str(x).strip()
        s = s.split(".")[0] if s.endswith(".0") else s
        digits = "".join(ch for ch in s if ch.isdigit())
        return digits[-3:].zfill(3) if digits else None

    df["budget_function_code"] = df["budget_function_code"].apply(_pad3)
    df = df[df["budget_function_code"].notna()].copy()

    # Unique tasks
    df = (df[["fy","quarter","budget_function_code","budget_function_name"]]
            .drop_duplicates()
            .astype({"fy":"int","quarter":"int"}))

    # Session (no retries, as requested)
    s = requests.Session()
    s.mount("https://", HTTPAdapter(max_retries=Retry(
        total=0, backoff_factor=1.0, status_forcelist=[500,502,503,504], allowed_methods=["POST"]
    )))

    def _fetch_one(fy, q, func_code, func_name):
        time.sleep(pause)
        payload = {
            "type": "budget_subfunction",
            "filters": {"fy": str(fy), "quarter": str(q), "budget_function": str(func_code)}
        }
        try:
            r = s.post(SPENDING_URL, json=payload, timeout=timeout)
            if not r.ok:
                return [], {
                    "fy": fy, "quarter": q, "budget_function_code": func_code,
                    "status": r.status_code,
                    "reason": r.text[:500] if isinstance(r.text, str) else str(r.content)[:500]
                }
            items = (r.json().get("results", []) or [])
            rows = [{
                "fy": fy,
                "quarter": q,
                "budget_function_code": func_code,
                "budget_function_name": func_name,
                "budget_subfunction_code": str(it.get("code","")).zfill(3),
                "budget_subfunction_name": it.get("name"),
                "amount": it.get("amount", 0.0),
                "total": it.get("total")
            } for it in items]
            return rows, None
        except Exception as e:
            return [], {"fy": fy, "quarter": q, "budget_function_code": func_code, "reason": str(e)}

    results, failures = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as ex:
        futs = [ex.submit(_fetch_one, r.fy, r.quarter, r.budget_function_code, r.budget_function_name)
                for r in df.itertuples(index=False)]
        for fut in as_completed(futs):
            rows, fail = fut.result()
            if rows: results.extend(rows)
            if fail: failures.append(fail)

    s.close()

    results_df = pd.DataFrame(results)
    failures_df = pd.DataFrame(failures)

    # Tidy/sort (no dedupe of content)
    if not results_df.empty:
        results_df["budget_function_code"] = results_df["budget_function_code"].astype(str).str.zfill(3)
        results_df["budget_subfunction_code"] = results_df["budget_subfunction_code"].astype(str).str.zfill(3)
        results_df.sort_values(
            ["fy","quarter","budget_function_code","budget_subfunction_code"],
            inplace=True, ignore_index=True
        )

    return results_df, failures_df


##  Stage 2: Budget Subfunctions Data Fetcher

This is the core function for collecting detailed budget subfunctions data:

**API Configuration**:
- **Endpoint**: Same USASpending.gov API but with `type="budget_subfunction"`
- **Filtering**: Each request is scoped by FY, Quarter, AND specific Budget Function
- **Workers**: 24 parallel threads (higher than Stage 1 due to smaller response sizes)

**Data Processing**:
- **Code Standardization**: Ensures 3-digit codes for both functions and subfunctions
- **Hierarchical Structure**: Maintains the parent-child relationship (function → subfunction)
- **Quality Control**: Drops tasks with a missing FY, quarter, or function code and de-duplicates the task list before any request is sent

**Performance Features**:
- **Gentle Pacing**: Each worker sleeps 0.15 seconds (`pause`) before sending its request
- **Session Management**: One shared `requests.Session` reuses connections and is closed when all tasks finish
- **Error Tracking**: Detailed failure logging with status codes and reasons
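
The only difference between the Stage 1 and Stage 2 requests is the `type` value and the extra `budget_function` filter. A minimal sketch of both payloads, using a hypothetical helper name `build_spending_payload`:

```python
def build_spending_payload(fy, quarter, budget_function=None):
    """Build the JSON body for one request (hypothetical helper name).

    Without `budget_function` this mirrors the Stage 1 payload; with it,
    the Stage 2 payload that asks for that function's subfunctions.
    """
    filters = {"fy": str(fy), "quarter": str(quarter)}
    if budget_function is None:
        return {"type": "budget_function", "filters": filters}
    filters["budget_function"] = str(budget_function)
    return {"type": "budget_subfunction", "filters": filters}

print(build_spending_payload(2020, 3))
print(build_spending_payload(2020, 3, "050"))
```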

##  Stage 2 Execution: Subfunctions Collection

**This cell executes the complete subfunctions data collection:**

**Input**: Uses `tasks_df` generated from Stage 1 budget functions data
**Processing**: 
- ~500-600 API calls (one per task from Stage 1 results)
- 24 concurrent workers for optimal performance
- Each worker pauses 0.15 seconds before its request, as a courtesy to the API

**Expected Results**:
- **Processing Time**: 10-20 minutes (depends on task count and API response)
- **Data Volume**: 2,000-3,000 subfunctions records
- **Coverage**: Detailed breakdown of each budget function into subfunctions

**Output Preview**: Shows first few records and summary statistics

In [None]:
# You already built tasks_df with debug_and_build_tasks(...)
# tasks_df has fy, quarter, budget_function_code, budget_function_name

results_df, failures_df = fetch_subfunctions_spending_for_tasks(tasks_df, max_workers=24, pause=0.15)
print(results_df.head())
print(f"results: {len(results_df)}  failures: {len(failures_df)}")


     fy  quarter budget_function_code   budget_function_name  \
0  2017        2                  000  Governmental Receipts   
1  2017        2                  050       National Defense   
2  2017        2                  050       National Defense   
3  2017        2                  050       National Defense   
4  2017        2                  150  International Affairs   

  budget_subfunction_code                            budget_subfunction_name  \
0                     000                              Governmental Receipts   
1                     051                     Department of Defense-Military   
2                     053                   Atomic energy defense activities   
3                     054                         Defense-related activities   
4                     151  International development and humanitarian ass...   

         amount total  
0  0.000000e+00  None  
1  3.893964e+11  None  
2  1.245135e+10  None  
3  9.173390e+10  None  
4  1.134137e+1

In [None]:
results_df.head()

| Unnamed: 0 | fy | quarter | budget_function_code | budget_function_name | budget_subfunction_code | budget_subfunction_name | amount | total |
|---|---|---|---|---|---|---|---|---|
| 0 | 2017 | 2 | 0 | Governmental Receipts | 0 | Governmental Receipts | 0.0 | |
| 1 | 2017 | 2 | 50 | National Defense | 51 | Department of Defense-Military | 389396400000.0 | |
| 2 | 2017 | 2 | 50 | National Defense | 53 | Atomic energy defense activities | 12451350000.0 | |
| 3 | 2017 | 2 | 50 | National Defense | 54 | Defense-related activities | 91733900000.0 | |
| 4 | 2017 | 2 | 150 | International Affairs | 151 | International development and humanitarian ass... | 11341370000.0 | |


##  Results Preview & Verification

**Data Inspection**: This cell displays the first few rows of collected subfunctions data for verification.

**What to Look For**:
- **Hierarchical Structure**: Each row shows the function → subfunction relationship
- **Code Formatting**: Both function and subfunction codes should be in 3-digit format
- **Data Completeness**: Amount values are populated, and function and subfunction names are present
- **Time Coverage**: Records spanning multiple fiscal years and quarters

**Sample Output Structure**:
- `fy`, `quarter`: Time identifiers
- `budget_function_code`, `budget_function_name`: Parent category (e.g., '050', 'National Defense')
- `budget_subfunction_code`, `budget_subfunction_name`: Detailed breakdown
- `amount`: Dollar amount for that subfunction in that quarter
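
If a preview shows codes such as `0` or `50` rather than `000` or `050`, one possible cause is that the data was reloaded from CSV with default type inference, which turns digit-only strings into integers. The sketch below uses a tiny made-up CSV to show the effect and one way to check the format; `all_three_digits` is a hypothetical helper.

```python
import io
import pandas as pd

# Tiny stand-in for a saved CSV; the amounts are placeholders.
csv_text = (
    "fy,quarter,budget_function_code,budget_subfunction_code,amount\n"
    "2017,2,000,000,0.0\n"
    "2017,2,050,051,1.0\n"
)

code_cols = ["budget_function_code", "budget_subfunction_code"]

# Default parsing infers integers, so '050' comes back as 50.
as_numbers = pd.read_csv(io.StringIO(csv_text))

# Forcing the code columns to str keeps the leading zeros.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={c: str for c in code_cols})

def all_three_digits(df, cols):
    """True if every value in `cols` is a 3-character digit string."""
    return all(df[c].astype(str).str.fullmatch(r"\d{3}").all() for c in cols)

print(all_three_digits(as_numbers, code_cols), all_three_digits(as_text, code_cols))
```

Passing `dtype` for the code columns whenever the saved files are read back avoids having to re-pad the codes afterwards.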

##  Stage 2 Data Storage & Final Organization

This function saves the subfunctions data and applies light cleanup before writing:

**File Structure**:
- **Master File**: `budget_subfunctions_quarterly_all.csv` - Complete dataset
- **Per-Year Splits**: `budget_subfunctions_FY{YYYY}.csv` - Individual fiscal year files
- **Failure Log**: `failures_budget_subfunctions_quarterly.csv` - Records any failed API calls for retry or analysis (written only when failures exist)

**Data Quality Features**:
- **Deduplication**: Removes duplicate records based on (FY, Quarter, Function, Subfunction), keeping the first occurrence
- **Code Standardization**: Zero-pads function and subfunction codes to 3 digits
- **Sorting**: Orders rows by FY, quarter, function code, and subfunction code
- **Type Safety**: Coerces `fy` and `quarter` to nullable integers (`Int64`); unparseable values become `<NA>`

**Output Organization**: Creates clean, analysis-ready datasets in Google Drive
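
The sort-then-deduplicate step can be illustrated on a few made-up rows. This is a sketch under the assumption that the first row per key should win; the amounts are placeholders.

```python
import pandas as pd

key = ["fy", "quarter", "budget_function_code", "budget_subfunction_code"]

# Illustrative rows; the second and third share the same key.
df = pd.DataFrame([
    {"fy": 2017, "quarter": 2, "budget_function_code": "050",
     "budget_subfunction_code": "053", "amount": 2.0},
    {"fy": 2017, "quarter": 2, "budget_function_code": "050",
     "budget_subfunction_code": "051", "amount": 1.0},
    {"fy": 2017, "quarter": 2, "budget_function_code": "050",
     "budget_subfunction_code": "051", "amount": 9.0},
])

clean = (
    df.sort_values(key, kind="stable")   # stable sort keeps ties in input order
      .drop_duplicates(subset=key)       # keep="first" is the default
      .reset_index(drop=True)
)
print(clean[["budget_subfunction_code", "amount"]])
```

Since only the key columns are compared, two rows with the same key but different amounts collapse to one, and which one survives depends on row order. If that matters, it may be worth checking for conflicting duplicates before dropping them.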

In [None]:
# ============================
# Save Budget Subfunctions results to Drive
# ============================

import os
import pandas as pd

def save_subfunction_results_to_drive(
    results_df: pd.DataFrame,
    failures_df: pd.DataFrame = None,
    out_dir: str = "/content/drive/MyDrive/Federal Funding/Budget Subfunctions",
    write_master: bool = True,
    split_by_fy: bool = True,
):
    """
    Saves:
      - Master file:  budget_subfunctions_quarterly_all.csv   (if write_master=True)
      - One file per FY: budget_subfunctions_FY{YYYY}.csv     (if split_by_fy=True)
      - Optional failures file: failures_budget_subfunctions_quarterly.csv (if failures_df provided)
    """
    os.makedirs(out_dir, exist_ok=True)

    if results_df is None or results_df.empty:
        print(" Nothing to save (results_df is empty).")
        return

    df = results_df.copy()

    # Normalize types and codes
    if "fy" in df.columns:
        df["fy"] = pd.to_numeric(df["fy"], errors="coerce").astype("Int64")
    if "quarter" in df.columns:
        df["quarter"] = pd.to_numeric(df["quarter"], errors="coerce").astype("Int64")
    if "budget_function_code" in df.columns:
        df["budget_function_code"] = df["budget_function_code"].astype(str).str.zfill(3)
    if "budget_subfunction_code" in df.columns:
        df["budget_subfunction_code"] = df["budget_subfunction_code"].astype(str).str.zfill(3)

    # Sort & (light) dedupe for clean outputs
    sort_cols = [c for c in ["fy","quarter","budget_function_code","budget_subfunction_code"] if c in df.columns]
    if sort_cols:
        df.sort_values(sort_cols, inplace=True, ignore_index=True)
    if {"fy","quarter","budget_function_code","budget_subfunction_code"}.issubset(df.columns):
        df.drop_duplicates(subset=["fy","quarter","budget_function_code","budget_subfunction_code"], inplace=True)

    # Master CSV
    if write_master:
        master_path = os.path.join(out_dir, "budget_subfunctions_quarterly_all.csv")
        df.to_csv(master_path, index=False)
        print(f" Saved {len(df):,} rows  {master_path}")

    # Per-FY CSVs
    if split_by_fy and "fy" in df.columns:
        for fy, grp in df.groupby("fy", dropna=True):
            out_path = os.path.join(out_dir, f"budget_subfunctions_FY{int(fy)}.csv")
            grp.to_csv(out_path, index=False)
            print(f" FY {int(fy)}: {len(grp):,} rows  {out_path}")

    # Failures (optional)
    if failures_df is not None and not failures_df.empty:
        fail_path = os.path.join(out_dir, "failures_budget_subfunctions_quarterly.csv")
        failures_df.to_csv(fail_path, index=False)
        print(f" Failures logged: {len(failures_df)}  {fail_path}")


In [None]:
# You already have: results_df, failures_df
save_subfunction_results_to_drive(
    results_df,
    failures_df,  # or None
    out_dir="/content/drive/MyDrive/Federal Funding/Budget Subfunctions",
    write_master=True,
    split_by_fy=True
)


 Saved 2,092 rows  /content/drive/MyDrive/Federal Funding/Budget Subfunctions/budget_subfunctions_quarterly_all.csv
 FY 2017: 201 rows  /content/drive/MyDrive/Federal Funding/Budget Subfunctions/budget_subfunctions_FY2017.csv
 FY 2018: 268 rows  /content/drive/MyDrive/Federal Funding/Budget Subfunctions/budget_subfunctions_FY2018.csv
 FY 2019: 268 rows  /content/drive/MyDrive/Federal Funding/Budget Subfunctions/budget_subfunctions_FY2019.csv
 FY 2020: 268 rows  /content/drive/MyDrive/Federal Funding/Budget Subfunctions/budget_subfunctions_FY2020.csv
 FY 2021: 271 rows  /content/drive/MyDrive/Federal Funding/Budget Subfunctions/budget_subfunctions_FY2021.csv
 FY 2022: 272 rows  /content/drive/MyDrive/Federal Funding/Budget Subfunctions/budget_subfunctions_FY2022.csv
 FY 2023: 272 rows  /content/drive/MyDrive/Federal Funding/Budget Subfunctions/budget_subfunctions_FY2023.csv
 FY 2024: 272 rows  /content/drive/MyDrive/Federal Funding/Budget Subfunctions/budget_subfunctions_FY2024.csv


##  Final Execution: Save Subfunctions Data

**This cell completes the data collection pipeline:**

**Storage Configuration**:
- **Directory**: `/content/drive/MyDrive/Federal Funding/Budget Subfunctions`
- **Master File**: Complete dataset with all fiscal years
- **FY Splits**: Individual files for focused analysis by year

**Final Output Summary**:
- **Total Records**: Row count of the master file (2,092 in the run above)
- **File Locations**: Paths to all files created in Google Drive
- **Per-Year Counts**: Row counts for each fiscal year file
- **Next Step**: The saved files are ready for analysis and integration with other federal funding datasets

**Success Indicators**: File creation confirmations and row counts for verification