# Federal Funding Data Consolidation Pipeline

This notebook consolidates federal funding data: it combines quarterly CSV files into yearly files, then merges the yearly files into ALL-years datasets.

## Workflow Overview

The pipeline consists of four main phases:

1. **Quarter → Year Combiner**: Consolidates quarterly CSVs into yearly files for each dataset type
2. **Year → ALL-years Combiner**: Merges yearly files into comprehensive multi-year datasets
3. **Special Federal Accounts with Agency Merge**: Handles specialized federal accounts with agency information
4. **Data Validation & Quality Checks**: Performs sanity checks on the consolidated outputs

## Key Features

- **Robust Error Handling**: Safely handles corrupt, empty, or unreadable files
- **Schema Preservation**: Maintains natural column ordering across different file versions
- **Duplicate Detection**: Automatically removes exact duplicate rows
- **Manifest Generation**: Creates detailed logs of which input files were used
- **Timestamped Outputs**: Generates timestamped run folders for tracking
- **In-Memory Processing**: Loads each dataset fully into memory with pandas, which is adequate at the current data scale

---

In [None]:
# ==== Combine quarterly CSVs into year-wise CSVs ====

## Phase 1: Quarter → Year Combiner (Per Dataset)

### Purpose
This phase takes multiple quarterly CSV files (e.g., `federal_accounts_FY2019_Q1.csv`, `federal_accounts_FY2019_Q2.csv`, etc.) and combines them into single yearly CSV files (e.g., `federal_accounts_FY2019.csv`).

### Supported Dataset Types
- **federal_accounts**: Federal account information by quarter
- **recipients**: Grant/contract recipient data by quarter  
- **awards**: Award/contract details by quarter

### Key Processing Steps
1. **Pattern Matching**: Uses regex to identify quarterly files with pattern `<stem>_FY<year>_Q<quarter>.csv`
2. **Safe Reading**: Handles corrupt, empty, or unreadable files gracefully
3. **Metadata Addition**: Adds `fy` and `quarter` columns if missing from source data
4. **Chronological Ordering**: Processes quarters in order (Q1, Q2, Q3, Q4) for each fiscal year
5. **Output Generation**: Creates consolidated yearly files with row count reporting
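
The pattern-matching and ordering steps above can be tried in isolation. The following is a minimal sketch using only the standard library; `QUARTER_RX` is a hard-coded variant of the pattern for the `federal_accounts` stem, and the names in `sample_files` are made up for illustration.

```python
import re

# Hard-coded variant of the quarterly pattern for one stem (illustrative).
QUARTER_RX = re.compile(r"^federal_accounts_FY(\d{4})_Q([1-4])\.csv$", re.IGNORECASE)

# Hypothetical directory listing, deliberately out of order.
sample_files = [
    "federal_accounts_FY2019_Q3.csv",
    "federal_accounts_FY2018_Q4.csv",
    "failures_federal_accounts_FY2019_Q1.csv",  # no match: wrong prefix
    "federal_accounts_FY2019_Q1.csv",
    "federal_accounts_FY2019.csv",              # no match: yearly file
]

# Group matching files by fiscal year as (quarter, filename) tuples.
by_year = {}
for name in sample_files:
    m = QUARTER_RX.match(name)
    if m:
        by_year.setdefault(int(m.group(1)), []).append((int(m.group(2)), name))

# Sorting the tuples yields quarter order within each year.
ordered = {fy: sorted(items) for fy, items in sorted(by_year.items())}
print(ordered)
```

Because the quarter number comes first in each tuple, a plain `sorted()` is enough to get chronological order within a fiscal year.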

---

In [1]:
import os
import re
import pandas as pd
from datetime import datetime

In [2]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


### Google Colab Setup

Mounts Google Drive so the notebook can read the quarterly data files and write output files to the designated folders.

**Note**: This cell is specific to the Google Colab environment and can be skipped when running locally.

### Utility Functions for Quarter-to-Year Processing

The following functions handle the core logic for identifying, parsing, and safely reading quarterly data files:

#### Key Functions:
- **`build_rx(stem)`**: Creates regex pattern to match quarterly files like `<stem>_FY<year>_Q<quarter>.csv`
- **`list_quarter_files(input_dir, stem)`**: Finds all quarterly files for a given dataset stem, excluding failure files
- **`parse_fy_q(stem, filename)`**: Extracts fiscal year and quarter from filename
- **`safe_read(path)`**: Safely reads CSV files, handling empty, corrupt, or unreadable files
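
To make the "safe reading" behavior concrete, here is a small sketch that runs the same checks against three temporary files. The reader is reproduced without its print statements so the example is self-contained; the file names and contents are invented for the demonstration.

```python
import os
import tempfile

import pandas as pd

def safe_read(path):
    """Same checks as the notebook's reader: return a DataFrame or None."""
    try:
        if os.path.getsize(path) == 0:
            return None
        df = pd.read_csv(path)
        if df.empty or df.columns.size == 0:
            return None
        return df
    except Exception:
        return None

# Hypothetical test files: zero bytes, header without rows, and one valid row.
cases = {
    "zero_byte.csv": "",
    "header_only.csv": "a,b\n",
    "valid.csv": "a,b\n1,2\n",
}

results = {}
with tempfile.TemporaryDirectory() as tmp:
    for name, content in cases.items():
        path = os.path.join(tmp, name)
        with open(path, "w") as fh:
            fh.write(content)
        results[name] = safe_read(path)

for name, df in results.items():
    print(name, "->", "skipped" if df is None else f"{len(df)} row(s)")
```

Note that a file containing only a header row is treated as empty and skipped, because `df.empty` is true for a frame with columns but no rows.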

In [3]:
def build_rx(stem):
    # Matches: <stem>_FY2019_Q1.csv (case-insensitive)
    # Note: braces are doubled (\d{{4}}) so the f-string emits a literal {4}
    return re.compile(rf'^{re.escape(stem)}_FY(\d{{4}})_Q([1-4])\.csv$', re.IGNORECASE)

In [4]:
def list_quarter_files(input_dir, stem):
    rx = build_rx(stem)
    files = []
    for f in os.listdir(input_dir):
        if not f.lower().endswith(".csv"):  # case-insensitive, consistent with the regex
            continue
        if f.lower().startswith("failures_"):  # skip failure files
            continue
        if not f.lower().startswith(stem.lower() + "_"):
            continue
        if rx.match(f):
            files.append(f)
    return files

In [None]:
def parse_fy_q(stem, filename):
    rx = build_rx(stem)
    m = rx.match(filename)
    if not m:
        return None, None
    return int(m.group(1)), int(m.group(2))

In [None]:
def safe_read(path):
    try:
        if os.path.getsize(path) == 0:
            print(f" 0-byte file, skipped: {path}")
            return None
        df = pd.read_csv(path)
        if df.empty or df.columns.size == 0:
            print(f" Empty/corrupt file, skipped: {path}")
            return None
        return df
    except Exception as e:
        print(f" Read error ({e}), skipped: {path}")
        return None


In [None]:
def combine_quarters_to_years(input_dir, output_dir, stem):
    """
    input_dir: folder containing quarterly files like '<stem>_FY2019_Q1.csv'
    output_dir: where to write '<stem>_FY2019.csv'
    stem: 'federal_accounts' | 'recipients' | 'awards'
    """
    os.makedirs(output_dir, exist_ok=True)
    quarter_files = list_quarter_files(input_dir, stem)
    if not quarter_files:
        print(f" No quarterly files for '{stem}' in {input_dir}")
        return

    # Group by FY
    by_year = {}
    for f in quarter_files:
        fy, q = parse_fy_q(stem, f)
        if fy is None:
            continue
        by_year.setdefault(fy, []).append((q, f))

    # Combine and write per FY
    for fy in sorted(by_year.keys()):
        chunks = []
        for q, fname in sorted(by_year[fy]):  # sort by quarter
            path = os.path.join(input_dir, fname)
            df = safe_read(path)
            if df is None:
                continue
            if "fy" not in df.columns:
                df["fy"] = fy
            if "quarter" not in df.columns:
                df["quarter"] = q
            chunks.append(df)

        if not chunks:
            print(f" FY {fy}: nothing to write for stem '{stem}'")
            continue

        combined = pd.concat(chunks, ignore_index=True)
        out_path = os.path.join(output_dir, f"{stem}_FY{fy}.csv")
        combined.to_csv(out_path, index=False)
        print(f" Wrote {len(combined):,} rows  {out_path}")


### Main Quarter Combining Function

The `combine_quarters_to_years()` function is the core processor that:

1. **Groups quarterly files by fiscal year**
2. **Processes quarters in chronological order** (Q1 → Q2 → Q3 → Q4)
3. **Adds missing metadata columns** (`fy`, `quarter`) if not present in source data
4. **Concatenates quarterly data** into yearly datasets
5. **Writes consolidated output** with detailed logging

#### Function Parameters:
- `input_dir`: Directory containing quarterly CSV files
- `output_dir`: Directory for yearly output files
- `stem`: Dataset identifier ('federal_accounts', 'recipients', 'awards')

#### Error Handling:
- Skips 0-byte or corrupt files
- Continues processing if individual quarters are missing
- Prints a message for each skipped file and the row count for each yearly file written
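
The metadata step can be illustrated with two toy quarters, one lacking the `fy` and `quarter` columns and one that already has them. This is a sketch with invented data, assuming fiscal year 2019; it mirrors the conditional column assignment in `combine_quarters_to_years()` but works on in-memory frames rather than files.

```python
import pandas as pd

# Toy quarterly frames (invented values). q1 lacks the metadata columns.
q1 = pd.DataFrame({"account": ["A", "B"], "amount": [10.0, 20.0]})
q2 = pd.DataFrame({"account": ["A"], "amount": [5.0], "fy": [2019], "quarter": [2]})

chunks = []
for quarter, df in [(1, q1), (2, q2)]:
    df = df.copy()                  # avoid mutating the caller's frame
    if "fy" not in df.columns:
        df["fy"] = 2019             # taken from the filename in the real pipeline
    if "quarter" not in df.columns:
        df["quarter"] = quarter
    chunks.append(df)

combined = pd.concat(chunks, ignore_index=True)
print(combined)
```

Existing `fy` and `quarter` values are never overwritten, so a source file that carries its own metadata takes precedence over the filename.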

### Path Configuration & Execution

#### Input Paths (Quarterly Data):
- **`ACCOUNTS_QTR_DIR`**: Contains `federal_accounts_FYyyyy_Qq.csv` files
- **`RECIPIENTS_QTR_DIR`**: Contains `recipients_FYyyyy_Qq.csv` files  
- **`AWARDS_QTR_DIR`**: Contains `awards_FYyyyy_Qq.csv` files

#### Output Paths (Yearly Data):
- **`ACCOUNTS_YEARLY_DIR`**: `/Federal Funding/Federal Accounts/`
- **`RECIPIENTS_YEARLY_DIR`**: `/Federal Funding/recipient/` *(note: singular naming)*
- **`AWARDS_YEARLY_DIR`**: `/Federal Funding/Awards/`

#### Important Notes:
- **Path Consistency**: Phase 1 writes to `/content/drive/MyDrive/Federal Funding` (with a space), while later phases read from `/content/drive/MyDrive/FederalFunding` (no space), so the Phase 1 outputs must be available under the second path before Phase 2 runs
- **Naming Convention**: Recipients yearly folder is named `recipient` (singular), but stem is `recipients` (plural)
- **File Validation**: Process will skip corrupt, empty, or failure files automatically

In [None]:
# --- Your paths ---
ACCOUNTS_QTR_DIR   = "/content/drive/MyDrive/USASpendingResults"               # federal_accounts_FYyyyy_Qq.csv
RECIPIENTS_QTR_DIR = "/content/drive/MyDrive/USASpendingResults/recipients"    # recipients_FYyyyy_Qq.csv
AWARDS_QTR_DIR     = "/content/drive/MyDrive/USASpendingResults/awards"        # awards_FYyyyy_Qq.csv

BASE_YEARLY_DIR = "/content/drive/MyDrive/Federal Funding"
ACCOUNTS_YEARLY_DIR   = os.path.join(BASE_YEARLY_DIR, "Federal Accounts")
RECIPIENTS_YEARLY_DIR = os.path.join(BASE_YEARLY_DIR, "recipient")  # per your casing
AWARDS_YEARLY_DIR     = os.path.join(BASE_YEARLY_DIR, "Awards")

# --- Run ---
combine_quarters_to_years(ACCOUNTS_QTR_DIR,   ACCOUNTS_YEARLY_DIR,   "federal_accounts")
combine_quarters_to_years(RECIPIENTS_QTR_DIR, RECIPIENTS_YEARLY_DIR, "recipients")
combine_quarters_to_years(AWARDS_QTR_DIR,     AWARDS_YEARLY_DIR,     "awards")

 Wrote 5,772 rows  /content/drive/MyDrive/Federal Funding/Federal Accounts/federal_accounts_FY2017.csv
 Wrote 7,633 rows  /content/drive/MyDrive/Federal Funding/Federal Accounts/federal_accounts_FY2018.csv
 Wrote 7,488 rows  /content/drive/MyDrive/Federal Funding/Federal Accounts/federal_accounts_FY2019.csv
 Wrote 7,433 rows  /content/drive/MyDrive/Federal Funding/Federal Accounts/federal_accounts_FY2020.csv
 Wrote 7,675 rows  /content/drive/MyDrive/Federal Funding/Federal Accounts/federal_accounts_FY2021.csv
 Wrote 7,633 rows  /content/drive/MyDrive/Federal Funding/Federal Accounts/federal_accounts_FY2022.csv
 Wrote 7,758 rows  /content/drive/MyDrive/Federal Funding/Federal Accounts/federal_accounts_FY2023.csv
 Wrote 7,692 rows  /content/drive/MyDrive/Federal Funding/Federal Accounts/federal_accounts_FY2024.csv
 Wrote 615,291 rows  /content/drive/MyDrive/Federal Funding/recipient/recipients_FY2017.csv
 Wrote 838,643 rows  /content/drive/MyDrive/Federal Funding/recipient/recipients_FY2

---

## Phase 2: Year → ALL-years Combiner (Multi-Category)

### Purpose
Takes yearly files (e.g., `agency_FY2018.csv`, `agency_FY2019.csv`, etc.) and merges them into comprehensive ALL-years files per category (e.g., `agency_ALL_FY.csv`). Also generates detailed manifests listing which input files were used.

### Key Features
- **Multi-year Consolidation**: Combines all available years for each dataset type
- **Schema Evolution Handling**: Preserves natural column order, adds new columns as they appear
- **Duplicate Removal**: Automatically removes exact duplicate rows
- **Manifest Generation**: Creates detailed logs of input files used for each output
- **Timestamped Runs**: Generates timestamped folders for tracking different runs
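
Exact-duplicate removal only drops rows that match in every column. The toy example below contrasts it with a key-based variant; the choice of `fy` and `award_id` as key columns is an assumption made for illustration, not something the pipeline does.

```python
import pandas as pd

# Three rows for the same award; the first two are identical in every column.
df = pd.DataFrame({
    "fy": [2017, 2017, 2017],
    "award_id": [1, 1, 1],
    "obligated_amount": [100.0, 100.0, 250.0],
})

# What the pipeline does: drop rows that are identical across all columns.
exact = df.drop_duplicates()

# Hypothetical alternative: keep one row per (fy, award_id), preferring the last.
by_key = df.drop_duplicates(subset=["fy", "award_id"], keep="last")

print(len(exact), len(by_key))
```

Key-based deduplication discards rows that differ in non-key columns, so it is only appropriate when the key is known to identify a record uniquely.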

### Base Directory Configuration
- **BASE_YEARLY_DIR**: `/content/drive/MyDrive/FederalFunding` *(note: no space)*
- **OUTPUT_BASE_DIR**: `/content/drive/MyDrive/FederalFunding/All Years Combined`

### Category Mappings
- **federal_accounts** → `/FederalFunding/Federal Accounts`
- **recipients** → `/FederalFunding/recipient`
- **awards** → `/FederalFunding/Awards`
- **agency** → `/FederalFunding/agency`

### Utility Functions for Year-to-ALL Processing

#### Core Functions:
- **`list_year_files(input_dir, stem)`**: Lists files matching pattern `<stem>_FY<year>.csv`
- **`safe_read_csv(path)`**: Safe CSV reader with comprehensive error handling
- **`infer_fy_from_name(filename)`**: Extracts fiscal year from filename when `fy` column is missing
- **`ensure_out_dir(base_dir, timestamped=True)`**: Creates timestamped output directories

#### Advanced Processing:
- **`combine_years_to_one(input_dir, stem, output_dir)`**: Main consolidation function that:
  - Gathers all yearly files for a dataset stem
  - Builds stable column order by preserving natural column evolution
  - Adds `fy` column from filename if missing in data
  - Concatenates all years, removes duplicates
  - Writes consolidated output and detailed manifest

#### Schema Evolution Strategy:
The system preserves "natural" column order by:
1. Starting with columns from the first file processed
2. Appending new columns as they appear in subsequent files
3. Reindexing all DataFrames to the unified column list, so columns missing from a file are filled with NaN

This approach handles schema drift without forcing a rigid schema.
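
The strategy above can be seen on three toy frames whose columns differ from year to year. This is a minimal sketch with invented data; the `seen_cols` loop and the `reindex` call follow the same approach as `combine_years_to_one`.

```python
import pandas as pd

# Toy yearly frames with drifting schemas (invented values).
fy2017 = pd.DataFrame({"fy": [2017], "code": [5], "amount": [100.0]})
fy2018 = pd.DataFrame({"fy": [2018], "code": [5], "amount": [120.0], "link": [True]})
fy2019 = pd.DataFrame({"name": ["GAO"], "fy": [2019], "amount": [130.0]})
frames = (fy2017, fy2018, fy2019)

# Build the unified column order: first file's columns, then new ones as seen.
seen_cols = []
for df in frames:
    for col in df.columns:
        if col not in seen_cols:
            seen_cols.append(col)

# Reindex every frame to the unified order; absent columns become NaN.
aligned = [df.reindex(columns=seen_cols) for df in frames]
combined = pd.concat(aligned, ignore_index=True)

print(seen_cols)
print(combined)
```

The resulting order depends on which file is read first, so sorting the input files (as `list_year_files` does) keeps the output layout stable between runs.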

In [None]:
import os, re, pandas as pd
from datetime import datetime

# -----------------------------
# Paths discovered from your check
# -----------------------------
BASE_YEARLY_DIR = "/content/drive/MyDrive/FederalFunding"  # <- no space
OUTPUT_BASE_DIR = os.path.join(BASE_YEARLY_DIR, "All Years Combined")
RUN_TAG_SUBFOLDER = True   # set to False to always overwrite the same folder

# Folder -> stem (stem must match filename prefix before _FYyyyy.csv)
CATEGORY_FOLDERS = {
    "federal_accounts": os.path.join(BASE_YEARLY_DIR, "Federal Accounts"),
    "recipients":       os.path.join(BASE_YEARLY_DIR, "recipient"),
    "awards":           os.path.join(BASE_YEARLY_DIR, "Awards"),
    "agency":           os.path.join(BASE_YEARLY_DIR, "agency"),
}

YEARLY_RX = re.compile(r'^(.+)_FY(\d{4})\.csv$', re.IGNORECASE)

# -----------------------------
# Helpers
# -----------------------------
def list_year_files(input_dir: str, stem: str):
    """List files like '<stem>_FYyyyy.csv' in input_dir."""
    if not os.path.isdir(input_dir):
        print(f" Folder not found: {input_dir}")
        return []
    rx = re.compile(rf'^{re.escape(stem)}_FY(\d{{4}})\.csv$', re.IGNORECASE)
    return sorted([f for f in os.listdir(input_dir) if f.lower().endswith(".csv") and rx.match(f)])

def safe_read_csv(path: str):
    """Safely read CSV; skip empty/0-byte/bad files."""
    try:
        if os.path.getsize(path) == 0:
            print(f" 0-byte file, skipped: {path}")
            return None
        df = pd.read_csv(path)
        if df.empty or df.columns.size == 0:
            print(f" Empty/corrupt file, skipped: {path}")
            return None
        return df
    except Exception as e:
        print(f" Read error ({e}), skipped: {path}")
        return None

def infer_fy_from_name(filename: str):
    m = re.search(r"FY(\d{4})", filename, flags=re.IGNORECASE)
    return int(m.group(1)) if m else None

def ensure_out_dir(base_dir: str, timestamped=True):
    os.makedirs(base_dir, exist_ok=True)
    if timestamped:
        tag = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
        out_dir = os.path.join(base_dir, f"run_{tag}")
        os.makedirs(out_dir, exist_ok=True)
        return out_dir
    return base_dir

# -----------------------------
# Core merge
# -----------------------------
def combine_years_to_one(input_dir: str, stem: str, output_dir: str):
    """
    Merge all '<stem>_FYyyyy.csv' into '<stem>_ALL_FY.csv' in output_dir.
    - Adds 'fy' from filename if missing
    - Keeps natural column order (no forced reordering)
    - Drops exact duplicate rows
    """
    files = list_year_files(input_dir, stem)
    if not files:
        print(f" No yearly files in {input_dir} for '{stem}'")
        return

    chunks, inputs_used = [], []
    # Build "natural" unified column order: start from first file, append new cols as seen
    seen_cols = []

    for fname in files:
        fp = os.path.join(input_dir, fname)
        df = safe_read_csv(fp)
        if df is None:
            continue
        if "fy" not in df.columns:
            fy = infer_fy_from_name(fname)
            if fy is not None:
                df["fy"] = fy
        for c in list(df.columns):
            if c not in seen_cols:
                seen_cols.append(c)
        chunks.append(df)
        inputs_used.append(fp)

    if not chunks:
        print(f" No valid data for '{stem}'")
        return

    combined = pd.concat([c.reindex(columns=seen_cols) for c in chunks], ignore_index=True)
    combined.drop_duplicates(inplace=True)

    os.makedirs(output_dir, exist_ok=True)
    out_path = os.path.join(output_dir, f"{stem}_ALL_FY.csv")
    combined.to_csv(out_path, index=False)

    # Manifest of inputs
    pd.DataFrame({"input_files": inputs_used}).to_csv(out_path.replace(".csv", "_manifest.csv"), index=False)

    print(f" {stem}: merged {len(combined):,} rows  {out_path}")

# -----------------------------
# Run for your four categories
# -----------------------------
final_out_dir = ensure_out_dir(OUTPUT_BASE_DIR, RUN_TAG_SUBFOLDER)
print(f" Writing merged outputs to: {final_out_dir}\n")

for stem, folder in CATEGORY_FOLDERS.items():
    print(f" Processing: stem='{stem}'")
    print(f"   Input : {folder}")
    print(f"   Output: {final_out_dir}")
    combine_years_to_one(folder, stem, final_out_dir)


### Multi-Category Processing Execution

This section processes all four main dataset categories in sequence:

#### Output Generation:
- **`agency_ALL_FY.csv`**: Combined agency data across all years
- **`awards_ALL_FY.csv`**: Combined awards data across all years  
- **`federal_accounts_ALL_FY.csv`**: Combined federal accounts data across all years
- **`recipients_ALL_FY.csv`**: Combined recipients data across all years

#### For Each Dataset:
1. **File Discovery**: Locates all yearly files matching the stem pattern
2. **Safe Processing**: Handles corrupt or missing files gracefully
3. **Metadata Addition**: Adds fiscal year from filename if missing
4. **Schema Unification**: Creates consistent column structure across years
5. **Deduplication**: Removes exact duplicate rows
6. **Output Writing**: Saves consolidated data and manifest files

#### Manifest Files:
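Each `*_ALL_FY.csv` file gets a corresponding `*_ALL_FY_manifest.csv` with a single `input_files` column that lists:
- All input files used in the consolidation
- Full file paths for traceability

The manifest itself does not record a timestamp; the run time appears only in the `run_<timestamp>` folder name when `RUN_TAG_SUBFOLDER` is enabled.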
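
As a quick illustration of the manifest naming and layout, the sketch below writes and reloads a manifest in a temporary folder. It mirrors the `to_csv` call in `combine_years_to_one`; the input paths in `inputs_used` are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical list of yearly files that fed one consolidated output.
inputs_used = ["/data/agency_FY2018.csv", "/data/agency_FY2019.csv"]

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "agency_ALL_FY.csv")
    # Manifest name is derived from the output name, as in the pipeline.
    manifest_path = out_path.replace(".csv", "_manifest.csv")
    pd.DataFrame({"input_files": inputs_used}).to_csv(manifest_path, index=False)

    manifest = pd.read_csv(manifest_path)
    manifest_name = os.path.basename(manifest_path)

print(manifest_name)
print(manifest)
```

Since `str.replace` substitutes every occurrence of `.csv`, a folder name containing `.csv` would also be rewritten; building the name from the stem avoids that edge case.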

In [None]:
import os, re, pandas as pd
from datetime import datetime

TARGET_DIR = "/content/drive/MyDrive/FederalFunding/federal_accounts_with_agency"
OUTPUT_BASE_DIR = "/content/drive/MyDrive/FederalFunding/All Years Combined"
os.makedirs(OUTPUT_BASE_DIR, exist_ok=True)
OUT_DIR = os.path.join(OUTPUT_BASE_DIR, f"run_{datetime.now().strftime('%Y-%m-%d_%H-%M-%S')}")
os.makedirs(OUT_DIR, exist_ok=True)

STEM = "federal_accounts"  # filenames like federal_accounts_FY2017.csv
rx = re.compile(rf'^{re.escape(STEM)}_FY(\d{{4}})\.csv$', re.IGNORECASE)

def safe_read_csv(path):
    try:
        if os.path.getsize(path) == 0:
            print(f" 0-byte file, skipped: {path}")
            return None
        df = pd.read_csv(path)
        if df.empty or df.columns.size == 0:
            print(f" Empty/corrupt file, skipped: {path}")
            return None
        return df
    except Exception as e:
        print(f" Read error ({e}), skipped: {path}")
        return None

files = sorted([f for f in os.listdir(TARGET_DIR) if f.lower().endswith(".csv") and rx.match(f)])
if not files:
    print(f" No yearly files like {STEM}_FYyyyy.csv found in {TARGET_DIR}")
else:
    seen_cols, chunks, used = [], [], []
    for fname in files:
        fp = os.path.join(TARGET_DIR, fname)
        df = safe_read_csv(fp)
        if df is None:
            continue
        if "fy" not in df.columns:
            m = re.search(r"FY(\d{4})", fname, flags=re.IGNORECASE)
            if m: df["fy"] = int(m.group(1))
        for c in list(df.columns):
            if c not in seen_cols:
                seen_cols.append(c)
        chunks.append(df); used.append(fp)

    if chunks:
        combined = pd.concat([c.reindex(columns=seen_cols) for c in chunks], ignore_index=True)
        combined.drop_duplicates(inplace=True)
        out_path = os.path.join(OUT_DIR, f"{STEM}_ALL_FY.csv")
        combined.to_csv(out_path, index=False)
        pd.DataFrame({"input_files": used}).to_csv(out_path.replace(".csv", "_manifest.csv"), index=False)
        print(f" Merged {len(combined):,} rows  {out_path}")
        print(f" Manifest  {out_path.replace('.csv', '_manifest.csv')}")


---

## Phase 3: Special Merge for "federal_accounts_with_agency"

### Purpose
This is a specialized processing step for federal accounts data that has already been enriched with agency information. It follows the same consolidation logic as Phase 2 but targets a specific directory structure.

### Target Directory
- **Input**: `/content/drive/MyDrive/FederalFunding/federal_accounts_with_agency`
- **Output**: `/content/drive/MyDrive/FederalFunding/All Years Combined/run_<timestamp>/`

### Processing Logic
This section mirrors the `combine_years_to_one` function but is specifically configured for:
- **Files matching**: `federal_accounts_FYyyyy.csv` pattern
- **Enhanced data**: Federal accounts already joined with agency information
- **Same safety measures**: Handles corrupt files, adds missing `fy` columns, preserves schema evolution
- **Timestamped output**: Creates new run folder with timestamp

### Key Differences from Phase 2
- **Single dataset focus**: Only processes federal accounts (not multi-category)
- **Pre-enriched data**: Input files already contain agency information
- **Dedicated directory**: Uses separate input directory for processed accounts
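- **Independent processing**: Runs separately from main multi-category workflow
- **Output name**: The code writes `federal_accounts_ALL_FY.csv` into its own run folder, while Phase 4 loads `federal_accounts_agency_ALL_FY.csv`, so the file is presumably renamed between the two steps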

In [6]:

path = "/content/drive/MyDrive/FederalFunding/All Years Combined/agency_ALL_FY.csv"
df_agency = pd.read_csv(path)

print(f"\n Loaded agency_ALL_FY.csv  {len(df_agency)} rows, {len(df_agency.columns)} columns\n")
df_agency.head()


 Loaded agency_ALL_FY.csv  6440 rows, 12 columns



Unnamed: 0,fy,time_granularity,fiscal_quarter,fiscal_period,fyq,fyp,id,code,type,name,amount,link
0,2017,quarter,2,,FY2017-Q2,,11.0,5.0,agency,Government Accountability Office,275517900.0,True
1,2017,quarter,2,,FY2017-Q2,,15.0,9.0,agency,Legislative Branch Boards and Commissions,1167124.0,False
2,2017,quarter,2,,FY2017-Q2,,28.0,10.0,agency,The Judicial Branch,249131600.0,False
3,2017,quarter,2,,FY2017-Q2,,95.0,12.0,agency,Department of Agriculture,81294280000.0,True
4,2017,quarter,2,,FY2017-Q2,,183.0,13.0,agency,Department of Commerce,7266541000.0,True


---

## Phase 4: Data Validation & Quality Checks

### Purpose
Loads each consolidated ALL-years dataset as a sanity check that the consolidation worked and that the data has the expected structure. Note that the cells read the files directly from `All Years Combined/`, not from a `run_<timestamp>` subfolder.

### Validation Steps
For each consolidated dataset, the cells below perform these checks:

1. **Load Verification**: Confirms the file loads with `pd.read_csv`
2. **Dimension Reporting**: Prints total rows and columns
3. **Sample Data Review**: Displays the first five rows with `head()`, which also shows the column names

Data types and data-quality issues are not checked programmatically; they are judged by inspecting the preview.

### Datasets Validated
- **`agency_ALL_FY.csv`**: Combined agency information across all fiscal years
- **`awards_ALL_FY.csv`**: Combined awards/contracts data across all fiscal years
- **`federal_accounts_ALL_FY.csv`**: Combined federal accounts data across all fiscal years
- **`federal_accounts_agency_ALL_FY.csv`**: Combined federal accounts with agency details
- **`recipients_ALL_FY.csv`**: Combined recipient information across all fiscal years

### Quality Indicators
- **Successful Load**: File loads without errors
- **Row Counts**: Total records in consolidated dataset
- **Column Counts**: Total fields available
- **Sample Preview**: Representative data structure
- **Potential Issues**: Missing data, unexpected values, or schema problems
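
The indicators above could be gathered by a small helper instead of being read off the preview. The following is a sketch only: `summarize` is a hypothetical function that is not part of the notebook, and the sample frame uses invented values.

```python
import pandas as pd

def summarize(df, name):
    """Return basic quality indicators for a loaded dataset (illustrative helper)."""
    return {
        "name": name,
        "rows": len(df),
        "columns": len(df.columns),
        "dtypes": df.dtypes.astype(str).to_dict(),
        "null_counts": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
    }

# Toy frame with one missing recipient_id (invented values).
sample = pd.DataFrame({
    "fy": [2017, 2017, 2018],
    "recipient_id": ["a1", None, "c3"],
    "obligated_amount": [100.0, 50.0, 75.0],
})

report = summarize(sample, "sample")
print(report)
```

Calling the helper on each `*_ALL_FY.csv` frame would give comparable numbers across datasets, including null counts that `head()` alone does not reveal.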

### Agency Data Validation

In [7]:
path = "/content/drive/MyDrive/FederalFunding/All Years Combined/awards_ALL_FY.csv"
df_awards = pd.read_csv(path)

print(f"\n Loaded awards_ALL_FY.csv  {len(df_awards)} rows, {len(df_awards.columns)} columns\n")
df_awards.head()



 Loaded awards_ALL_FY.csv  8333844 rows, 10 columns



Unnamed: 0,fy,quarter,budget_function_code,budget_subfunction_code,federal_account_code,award_id,award_name,award_code,obligated_amount,total_amount
0,2017,2,150,153,4644,128686659.0,ITCCN140006,ITCCN140006,136047.56,136047.56
1,2017,2,150,153,4644,128686663.0,ITCCN160004,ITCCN160004,132645.0,132645.0
2,2017,2,150,153,4644,128687697.0,ITCTO160002,ITCTO160002,102055.52,102055.52
3,2017,2,150,153,4644,128687512.0,ITCPO160018,ITCPO160018,81962.0,81962.0
4,2017,2,150,153,4644,128687517.0,ITCPO160027,ITCPO160027,68526.96,68526.96


### Awards Data Validation

In [8]:
path = "/content/drive/MyDrive/FederalFunding/All Years Combined/federal_accounts_ALL_FY.csv"
df_accounts = pd.read_csv(path)

print(f"\n Loaded federal_accounts_ALL_FY.csv  {len(df_accounts)} rows, {len(df_accounts.columns)} columns\n")
df_accounts.head()



 Loaded federal_accounts_ALL_FY.csv  59084 rows, 7 columns



Unnamed: 0,fy,quarter,budget_function_code,budget_subfunction_code,federal_account_code,federal_account_name,obligated_amount
0,2017,2,150,155,3405,"Advances, Foreign Military Sales, Funds Approp...",12198710000.0
1,2017,2,150,155,5650,"Administration Expenses, Export-Import Bank of...",99214440.0
2,2017,2,150,155,3387,"Special Defense Acquisition Fund, Funds Approp...",48534650.0
3,2017,2,150,155,5651,"Inspector General, Export-Import Bank of the U...",6354166.0
4,2017,2,150,155,4398,"Exchange Stabilization Fund, Office of the Sec...",0.0


### Federal Accounts Data Validation

In [9]:
path = "/content/drive/MyDrive/FederalFunding/All Years Combined/federal_accounts_agency_ALL_FY.csv"
df_accounts_agency = pd.read_csv(path)

print(f"\n Loaded federal_accounts_agency_ALL_FY.csv  {len(df_accounts_agency)} rows, {len(df_accounts_agency.columns)} columns\n")
df_accounts_agency.head()



 Loaded federal_accounts_agency_ALL_FY.csv  58830 rows, 10 columns



Unnamed: 0,fy,fiscal_quarter,fiscal_period,agency,id,code,type,name,amount,account_number
0,2017,2,,1067,5861,400,federal_account,"Salaries and Expenses, Selective Service System",9733364.0,090-0400
1,2017,2,,1068,5865,13,federal_account,"Hurricane Education Recovery, Office of Elemen...",0.0,091-0013
2,2017,2,,1068,5866,101,federal_account,"Indian Education, Office of Elementary and Sec...",122442.3,091-0101
3,2017,2,,1068,5867,102,federal_account,"Impact AID, Education",1083142000.0,091-0102
4,2017,2,,1068,5873,200,federal_account,"Student Financial Assistance, Education",3358375000.0,091-0200


### Federal Accounts with Agency Data Validation

In [10]:
path = "/content/drive/MyDrive/FederalFunding/All Years Combined/recipients_ALL_FY.csv"
df_recipients = pd.read_csv(path)

print(f"\n Loaded recipients_ALL_FY.csv  {len(df_recipients)} rows, {len(df_recipients.columns)} columns\n")
df_recipients.head()


 Loaded recipients_ALL_FY.csv  12502803 rows, 10 columns



Unnamed: 0,fy,quarter,budget_function_code,budget_subfunction_code,federal_account_code,recipient_id,recipient_name,recipient_code,obligated_amount,total_amount
0,2017,2,150,153,4250,ef67c4ac-d2a3-7968-a7f4-e9048d0dfddb,MISCELLANEOUS FOREIGN AWARDEES,MISCELLANEOUS FOREIGN AWARDEES,17273810.0,17273810.0
1,2017,2,150,153,4250,4d7011b0-f2c1-f534-0fb1-39cf16fbd9d3,PAN AMERICAN HEALTH ORGANIZATION,PAN AMERICAN HEALTH ORGANIZATION,15818415.0,15818415.0
2,2017,2,150,153,4250,3a2d5ddb-5dc5-0a56-2466-86838c24c75f,GENERAL SECRETARIAT OF THE ORGANIZATION OF AME...,GENERAL SECRETARIAT OF THE ORGANIZATION OF AME...,12687693.0,12687693.0
3,2017,2,150,153,4250,,Blank Recipient,,12268825.0,12268825.0
4,2017,2,150,153,4250,a9333d81-c8a4-846e-63d8-31eb47e5056a,INTERNATIONAL UNION FOR CONSERVATION OF NATURE...,INTERNATIONAL UNION FOR CONSERVATION OF NATURE...,260000.0,260000.0


### Recipients Data Validation

---

## Summary & Key Considerations

### Pipeline Achievements
- **Complete Data Consolidation**: Transforms quarterly data → yearly data → ALL-years datasets
- **Robust Error Handling**: Skips corrupt, empty, or missing files without stopping the pipeline
- **Schema Evolution Support**: Accommodates column structures that change over time
- **Manifests and Logging**: Writes a manifest of input files for each output and prints progress messages for traceability
- **Quality Validation**: Load-and-preview checks confirm that each output can be read and has the expected shape

### Important Gotchas & Improvements

#### Path Consistency Issues
**Mixed Path Conventions**:
- Phase 1 output: `/content/drive/MyDrive/Federal Funding` (with space)
- Later phases: `/content/drive/MyDrive/FederalFunding` (no space)
- **Recommendation**: Standardize on one path convention throughout

#### Naming Conventions
**Stem vs Folder Naming**:
- Recipients stem: `recipients` (plural)
- Recipients folder: `recipient` (singular)
- **Status**: This works fine as long as filenames start with the stem

#### Schema Management
- **Current Approach**: Natural column order preservation (columns added as they appear)
- **Alternative**: Define explicit schemas per dataset for more rigorous data validation
- **Trade-off**: Current approach is flexible but less strict about data structure
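
An explicit-schema check might look like the sketch below. The column names in `EXPECTED_COLUMNS` are taken from the federal accounts preview in this notebook, but treating them as a fixed schema is an assumption, and `check_schema` is a hypothetical helper.

```python
import pandas as pd

# Assumed schema for the federal accounts dataset (names from the preview).
EXPECTED_COLUMNS = [
    "fy", "quarter", "budget_function_code", "budget_subfunction_code",
    "federal_account_code", "federal_account_name", "obligated_amount",
]

def check_schema(df, expected):
    """Return (missing, unexpected) column lists relative to the expected schema."""
    missing = [c for c in expected if c not in df.columns]
    unexpected = [c for c in df.columns if c not in expected]
    return missing, unexpected

# Toy frame with one expected column dropped and one extra column added.
df = pd.DataFrame({
    "fy": [2017], "quarter": [2], "budget_function_code": [150],
    "budget_subfunction_code": [155], "federal_account_code": [3405],
    "obligated_amount": [0.0], "agency": [1067],
})

missing, unexpected = check_schema(df, EXPECTED_COLUMNS)
print("missing:", missing)
print("unexpected:", unexpected)
```

A check like this reports drift without blocking the merge, which keeps the flexibility of the current approach while making schema changes visible.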

#### Memory Optimization
- **Current Scale**: Works well for current CSV-scale data
- **Future Scaling**: Consider chunked processing or Parquet format for very large datasets
- **Deduplication**: Currently uses exact row matching; could implement dataset-specific deduplication keys
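
Chunked processing could be sketched as follows, using the `chunksize` option of `pd.read_csv` to stream rows into the output file. This is an illustration under simplifying assumptions: all inputs share the same columns, no deduplication is performed, and `append_in_chunks` is a hypothetical helper.

```python
import os
import tempfile

import pandas as pd

def append_in_chunks(src_paths, out_path, chunksize=2):
    """Stream each source CSV into out_path without loading it whole."""
    first = True
    total = 0
    for src in src_paths:
        for chunk in pd.read_csv(src, chunksize=chunksize):
            # Write the header once, then append without it.
            chunk.to_csv(out_path, mode="w" if first else "a", header=first, index=False)
            first = False
            total += len(chunk)
    return total

with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for fy in (2017, 2018):
        path = os.path.join(tmp, f"demo_FY{fy}.csv")
        pd.DataFrame({"fy": [fy] * 3, "amount": [1.0, 2.0, 3.0]}).to_csv(path, index=False)
        paths.append(path)

    out = os.path.join(tmp, "demo_ALL_FY.csv")
    rows_written = append_in_chunks(paths, out)
    result = pd.read_csv(out)

print(rows_written, len(result))
```

Streaming trades away the schema unification and exact-duplicate removal that the in-memory version provides, so those steps would need a separate pass.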

#### Performance Considerations
- **Parallel Processing**: Could parallelize file reading within each category
- **Incremental Updates**: Could implement delta processing for new data additions
- **Caching**: Could cache intermediate results for repeated runs
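
Parallel file reading could be sketched with a thread pool from the standard library. `read_many` is a hypothetical helper, and whether it is faster than sequential reads depends on how much of the time is spent on I/O rather than parsing, which this example does not measure.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

def read_many(paths, max_workers=4):
    """Read several CSVs concurrently; results keep the order of `paths`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(pd.read_csv, paths))

with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for fy in (2017, 2018, 2019):
        path = os.path.join(tmp, f"demo_FY{fy}.csv")
        pd.DataFrame({"fy": [fy], "amount": [float(fy)]}).to_csv(path, index=False)
        paths.append(path)

    frames = read_many(paths)

years = [int(df["fy"].iloc[0]) for df in frames]
print(years)
```

Because `Executor.map` returns results in input order, the unified column order built from the frames stays the same as in a sequential run.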

### Output Files Generated
- `agency_ALL_FY.csv` + manifest
- `awards_ALL_FY.csv` + manifest  
- `federal_accounts_ALL_FY.csv` + manifest
- `federal_accounts_agency_ALL_FY.csv` + manifest
- `recipients_ALL_FY.csv` + manifest

All outputs include detailed manifests showing exactly which input files contributed to each consolidated dataset.