# 🏛️🌍 Agency-Geography Federal Spending Data Collection

## 📋 System Overview

**Comprehensive multi-dimensional data collection** combining agency filtering with geographic granularity across 17 years of federal spending data:

### 🎯 Core Purpose
- **Agency-Filtered Geographic Analysis**: Pull spending data by geography for each federal agency
- **Dual Agency Perspectives**: Separate collection for funding vs awarding agency roles
- **Complete Temporal Coverage**: FY2008-2024 with quarterly granularity
- **Multi-Layer Geography**: Country/State/County/Congressional District breakdowns

### 🏗️ Architecture Highlights
- **📊 Dimensional Matrix**: Agency × Geography × Time cube structure
- **🔄 Dual-Mode Processing**: Initial clean rebuild + surgical retry recovery
- **🏢 Agency Roster Management**: Cached top-tier agency directory with fallback
- **⚡ Smart Concurrency**: Per-type and per-layer worker optimization
- **🛡️ Production Resilience**: Exponential backoff, connection pooling, failure tracking
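The retry half of the dual-mode design depends on retried rows replacing whatever is already stored for the same key. The notebook does this in `save_year_merge_year` with pandas `drop_duplicates(..., keep="last")`; the block below is only a pure-Python sketch of the same idea. The names `merge_keep_last` and `make_row` and the sample amounts are illustrative assumptions, not part of the pipeline.

```python
KEYS = ("agency_type", "agency_code", "geo_layer", "fy", "quarter", "code")

def merge_keep_last(existing, retried):
    """Merge two lists of row dicts; on a key collision the retried row wins."""
    merged = {}
    for row in [*existing, *retried]:
        merged[tuple(row[k] for k in KEYS)] = row  # later rows overwrite earlier ones
    return [merged[key] for key in sorted(merged)]

def make_row(code, amount, quarter=1):
    # Hypothetical sample row for one funding/state slice of FY2024
    return {"agency_type": "funding", "agency_code": "012", "geo_layer": "state",
            "fy": 2024, "quarter": quarter, "code": code, "amount": amount}

existing = [make_row("CA", 1.0), make_row("TX", 5.0)]
retried = [make_row("CA", 2.0), make_row("NY", 3.0)]
result = merge_keep_last(existing, retried)
print([(r["code"], r["amount"]) for r in result])
```

Keying on the full partition tuple means a retry can be run repeatedly without duplicating rows.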

### 📈 Output Structure
**Per-FY CSV files** organized by agency type and geographic layer:
```
geography_by_agency/
├── funding/
│   ├── country/geo_country_funding_FY2024.csv
│   ├── state/geo_state_funding_FY2024.csv
│   ├── county/geo_county_funding_FY2024.csv
│   └── district/geo_district_funding_FY2024.csv
└── awarding/
    └── [same structure]
```
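As a rough sketch, the layout above can be enumerated to check which files a complete run should produce. The helper name `expected_outputs` and the relative root are assumptions made for illustration; the file-name pattern follows the tree shown above.

```python
from pathlib import PurePosixPath

AGENCY_TYPES = ["funding", "awarding"]
GEO_LAYERS = ["country", "state", "county", "district"]

def expected_outputs(root="geography_by_agency", start_fy=2008, end_fy=2024):
    """Hypothetical helper: list every per-FY CSV a complete run would write."""
    return [
        PurePosixPath(root) / t / layer / f"geo_{layer}_{t}_FY{fy}.csv"
        for t in AGENCY_TYPES
        for layer in GEO_LAYERS
        for fy in range(start_fy, end_fy + 1)
    ]

paths = expected_outputs()
print(len(paths))  # 2 types x 4 layers x 17 fiscal years
print(paths[0])
```

Comparing this list against the files actually on disk is one way to spot a slice that never ran.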

In [1]:
# USAspending geography-by-agency, ALL 4 layers, funding+awarding, FY2008–2024
# - Initial run: overwrite results & failures
# - Retry run:   merge/append results & overwrite failures
# - Parallel, resilient to socket drops

## 📦 Dependencies & Core Configuration

**Essential imports and global configuration** for agency-geography data collection pipeline:

### Core Libraries
- **`requests`**: HTTP client for USASpending.gov API integration
- **`pandas`**: DataFrame operations and CSV file management
- **`concurrent.futures`**: ThreadPoolExecutor for parallel agency processing
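- **`os`**: Directory creation and output path construction
- **`time`, `random`**: Inter-request pauses and jittered retry backoff
- **`requests.adapters.HTTPAdapter`, `urllib3.util.retry.Retry`**: Connection pool sizing with adapter-level retries disabled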

### Key Configuration Variables
- **`BASE_DIR`**: Root directory for all geography-by-agency outputs
- **`CACHE_DIR`**: Agency roster caching location
- **`START_FY`, `END_FY`**: 2008 and 2024, the temporal coverage window
- **`DEFAULT_TYPES`**: ["funding", "awarding"] - dual agency perspective
- **`DEFAULT_LAYERS`**: ["country", "state", "county", "district"] - geographic granularity
- **`MAX_WORKERS`**: Default thread pool size (16), capped per type and per layer by `TYPE_WORKERS` and `LAYER_WORKERS`

### API Configuration
- **`SESSION`**: Shared requests session with connection pooling
- **`MAX_ATTEMPTS_EXC`**: Maximum attempts per task when a network exception occurs (default: 2, i.e. one retry)
- **`PAUSE`**: Inter-request delay for API rate limiting

In [1]:
# === USASpending Geography-by-Agency Runner (single entrypoint with selectors) ===
import os, time, json, random, requests, pandas as pd
from concurrent.futures import ThreadPoolExecutor, as_completed
from http.client import RemoteDisconnected
from requests.exceptions import ConnectionError, ReadTimeout, ChunkedEncodingError
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


## ⚙️ Concurrency Strategy & Worker Configuration

**Intelligent worker sizing** to optimize throughput while respecting API limits:

### Per-Type Worker Limits
- **Funding Agencies**: Capped at 12 workers (`TYPE_WORKERS["funding"]`)
- **Awarding Agencies**: Capped at 12 workers (`TYPE_WORKERS["awarding"]`)
- **Fallback**: Types missing from `TYPE_WORKERS` fall back to the `max_workers` argument (default `MAX_WORKERS` = 16)

### Per-Layer Worker Limits
- **Country Layer**: 12 workers (lightweight, minimal data volume)
- **State Layer**: 12 workers (moderate complexity, 50 states + territories)
- **County Layer**: 6 workers (high volume, 3000+ counties with FIPS processing)
- **District Layer**: 4 workers (congressional districts)

### Worker Selection Logic
```python
def get_workers(agency_type: str, layer: str, max_workers: int) -> int:
    type_limit = TYPE_WORKERS.get(agency_type, max_workers)
    layer_limit = LAYER_WORKERS.get(layer, max_workers)
    return min(type_limit, layer_limit)
```

### Concurrency Benefits
- **Resource Management**: Prevents heavy layers from overwhelming the API
- **Fair Processing**: Ensures all agency-layer combinations get adequate resources
- **Rate Limit Compliance**: Distributes load to stay within API constraints
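To make the selection rule concrete, here is a small worked example using the values from the notebook's CONFIG cell. Giving `max_workers` a default is an assumption added so the snippet runs on its own; in the notebook it is a required argument.

```python
MAX_WORKERS = 16
TYPE_WORKERS = {"funding": 12, "awarding": 12}
LAYER_WORKERS = {"country": 12, "state": 12, "county": 6, "district": 4}

def get_workers(agency_type: str, layer: str, max_workers: int = MAX_WORKERS) -> int:
    # The stricter of the two caps wins; unknown keys fall back to max_workers.
    return min(TYPE_WORKERS.get(agency_type, max_workers),
               LAYER_WORKERS.get(layer, max_workers))

for layer in LAYER_WORKERS:
    print(layer, get_workers("funding", layer))
```

Because both type caps are 12, the layer cap decides the pool size for the county and district layers.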

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## 📅 Fiscal Year & Quarter Date Mapping

**Precise temporal filtering** with government fiscal year calendar conversion:

### Fiscal Year Calendar
- **FY 2024**: October 1, 2023 → September 30, 2024
- **Quarter Breakdown**:
  - Q1: Oct 1 - Dec 31 (previous calendar year)
  - Q2: Jan 1 - Mar 31 (current calendar year)
  - Q3: Apr 1 - Jun 30 (current calendar year)  
  - Q4: Jul 1 - Sep 30 (current calendar year)

### Date Conversion Logic
```python
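def fyq_dates(fy: int, q: int) -> tuple[str, str]:
    # Converts FY/Quarter to exact ISO start_date/end_date strings for API filtering
    bounds = {1: (f"{fy-1}-10-01", f"{fy-1}-12-31"), 2: (f"{fy}-01-01", f"{fy}-03-31"),
              3: (f"{fy}-04-01", f"{fy}-06-30"), 4: (f"{fy}-07-01", f"{fy}-09-30")}
    return bounds[q]  # KeyError for quarters outside 1..4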
```

### API Integration
- **Time Period Filter**: `"time_period": [{"start_date": "2023-10-01", "end_date": "2023-12-31"}]` (FY2024 Q1), nested under `filters` in the request body
- **Date Type**: `"action_date"` for obligation-based filtering
- **Precision**: Exact calendar date boundaries prevent data gaps or overlaps
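The no-gaps, no-overlaps claim can be checked mechanically. The sketch below re-declares the quarter boundaries from the calendar above and verifies that each quarter starts the day after the previous one ends, including across the fiscal-year boundary. The helper `check_contiguous` is a hypothetical name used only for this illustration.

```python
from datetime import date, timedelta

def fyq_dates(fy: int, q: int) -> tuple[str, str]:
    # Same boundaries as the fiscal calendar listed above.
    bounds = {
        1: (f"{fy-1}-10-01", f"{fy-1}-12-31"),
        2: (f"{fy}-01-01", f"{fy}-03-31"),
        3: (f"{fy}-04-01", f"{fy}-06-30"),
        4: (f"{fy}-07-01", f"{fy}-09-30"),
    }
    if q not in bounds:
        raise ValueError("quarter must be 1..4")
    return bounds[q]

def check_contiguous(start_fy: int, end_fy: int) -> bool:
    """True when every quarter begins exactly one day after the previous one ends."""
    spans = [
        tuple(date.fromisoformat(d) for d in fyq_dates(fy, q))
        for fy in range(start_fy, end_fy + 1)
        for q in (1, 2, 3, 4)
    ]
    one_day = timedelta(days=1)
    return all(nxt[0] - prev[1] == one_day for prev, nxt in zip(spans, spans[1:]))

print(fyq_dates(2024, 1))
print(check_contiguous(2008, 2024))
```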

### Temporal Coverage
- **17-Year Span**: FY2008 through FY2024 comprehensive collection
- **Quarterly Granularity**: 68 quarters × 4 layers × 2 agency types × ~110 agencies (the roster size reported by `toptier_lookup` in this notebook) ≈ 60K API calls
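The call volume is simple arithmetic, sketched below so it can be recomputed for other roster sizes. The figure of 110 agencies is taken from the `toptier_lookup` output printed later in this notebook; the roster may differ on another day, so treat it as an assumption. The function name `total_calls` is illustrative.

```python
FISCAL_YEARS = range(2008, 2025)   # FY2008..FY2024 inclusive
QUARTERS = len(FISCAL_YEARS) * 4   # 17 fiscal years x 4 quarters
N_TYPES, N_LAYERS = 2, 4

def total_calls(n_agencies: int) -> int:
    """One request per agency x quarter x layer x agency type."""
    return n_agencies * QUARTERS * N_TYPES * N_LAYERS

print(QUARTERS)
print(110 * QUARTERS)     # tasks submitted for a single type/layer pair
print(total_calls(110))   # full matrix with the roster size seen in this notebook
print(total_calls(200))   # full matrix if the roster held about 200 agencies
```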

In [3]:


# ===================== CONFIG =====================
BASE_DIR    = "/content/drive/MyDrive/USASpendingResults/geography/geography_by_agency"
CACHE_DIR   = os.path.join(BASE_DIR, "_cache")
SCOPE       = "place_of_performance"   # or "recipient_location"
START_FY, END_FY = 2008, 2024

# default pools; you can override via args to run_geography()
DEFAULT_TYPES  = ["funding", "awarding"]
DEFAULT_LAYERS = ["country", "state", "county", "district"]

MAX_WORKERS     = 16
TYPE_WORKERS    = {"funding": 12, "awarding": 12}
LAYER_WORKERS   = {"country": 12, "state": 12, "county": 6, "district": 4}
TIMEOUT_S       = 120
PAUSE           = 0.05
MAX_ATTEMPTS_EXC = 2  # network retry attempts for a single task (>=1)
BACKOFF_BASE    = 0.35

URL_GEO = "https://api.usaspending.gov/api/v2/search/spending_by_geography/"
URL_TOP = "https://api.usaspending.gov/api/v2/references/toptier_agencies/"

RETRY_EXC = (RemoteDisconnected, ConnectionError, ReadTimeout, ChunkedEncodingError)

EMPTY_COLS = ["code","name","amount","population","fy","quarter",
              "agency_type","agency_code","agency_name","geo_layer"]

# ===================== SETUP =====================
def setup_session(pool_maxsize=None):
    if pool_maxsize is None:
        pool_maxsize = MAX_WORKERS + 8
    s = requests.Session()
    adapter = HTTPAdapter(
        pool_connections=pool_maxsize,
        pool_maxsize=pool_maxsize,
        max_retries=Retry(total=0, connect=0, read=0, redirect=0, status=0),
        pool_block=True,
    )
    s.mount("https://", adapter)
    s.headers.update({"Connection": "keep-alive"})
    return s

SESSION = setup_session()

def ensure_dirs(types, layers):
    os.makedirs(BASE_DIR, exist_ok=True)
    os.makedirs(CACHE_DIR, exist_ok=True)
    for t in types:
        for layer in layers:
            os.makedirs(os.path.join(BASE_DIR, t, layer), exist_ok=True)

def get_workers(agency_type:str, layer:str, max_workers:int) -> int:
    return min(TYPE_WORKERS.get(agency_type, max_workers), LAYER_WORKERS.get(layer, max_workers))

# ===================== TOPTIER ROSTER (cached) =====================
def _cache_fp():
    return os.path.join(CACHE_DIR, "toptier_agencies.csv")

def toptier_lookup(max_attempts=3, use_cache=True):
    """
    Returns list of dicts: [{'code': '020', 'name': 'Department of the Treasury'}, ...]
    Tries API a few times; if it fails and use_cache=True, falls back to cache (if available).
    """
    last_err = None
    for i in range(1, max_attempts+1):
        try:
            r = SESSION.get(URL_TOP, timeout=60)
            r.raise_for_status()
            rows = r.json().get("results", []) or []
            out = []
            for rec in rows:
                code = (
                    rec.get("toptier_code")
                    or (rec.get("toptier_agency") or {}).get("toptier_code")
                    or rec.get("cgac_code")
                    or ""
                )
                name = (
                    rec.get("name")
                    or rec.get("agency_name")
                    or (rec.get("toptier_agency") or {}).get("name")
                    or ""
                )
                code = str(code).zfill(3) if code else ""
                if code and name:
                    out.append({"code": code, "name": name})
            if out:
                # refresh cache
                try:
                    pd.DataFrame(out).to_csv(_cache_fp(), index=False)
                except Exception:
                    pass
                print(f"✅ toptier_lookup: {len(out)} agencies (sample: {out[:3]})")
                return out
            else:
                print(f"⚠️ toptier_lookup parsed 0 agencies (attempt {i}); raw count={len(rows)}")
        except Exception as e:
            last_err = e
            print(f"↻ roster fetch retry {i}/{max_attempts} after error: {e}")
            time.sleep(1.25 * i)

    if use_cache and os.path.exists(_cache_fp()):
        try:
            df = pd.read_csv(_cache_fp(), dtype=str)
            if not df.empty and {"code","name"}.issubset(df.columns):
                out = df.assign(code=df["code"].astype(str).str.zfill(3)).to_dict("records")
                print(f"✅ toptier_lookup fallback to cache: {len(out)} agencies")
                return out
        except Exception as e:
            print(f"⚠️ cache read failed: {e}")

    raise RuntimeError(f"Failed to fetch/parse toptier roster: {last_err}")

# ===================== DATES & PAYLOAD =====================
def fyq_dates(fy:int, q:int):
    if   q == 1: return f"{fy-1}-10-01", f"{fy-1}-12-31"
    elif q == 2: return f"{fy}-01-01",   f"{fy}-03-31"
    elif q == 3: return f"{fy}-04-01",   f"{fy}-06-30"
    elif q == 4: return f"{fy}-07-01",   f"{fy}-09-30"
    raise ValueError("quarter must be 1..4")

def payload_for(agency_type:str, agency_name:str, layer:str, fy:int, q:int):
    start, end = fyq_dates(fy, q)
    return {
        "filters": {
            "date_type": "action_date",
            "time_period": [{"start_date": start, "end_date": end}],
            "agencies": [{"type": agency_type, "tier": "toptier", "name": agency_name}],
        },
        "scope": SCOPE,
        "geo_layer": layer,
        "subawards": False
    }

# ===================== PATHS =====================
def year_path_year(agency_type:str, layer:str, fy:int):
    # one CSV per FY per (type×layer)
    return os.path.join(BASE_DIR, agency_type, layer, f"geo_{layer}_{agency_type}_FY{fy}.csv")

def failures_path(agency_type:str, layer:str):
    return os.path.join(BASE_DIR, agency_type, layer, f"failures_{agency_type}_{layer}.csv")

# ===================== FETCH ONE =====================
def fetch_one(agency_type:str, agency_code:str, agency_name:str, layer:str, fy:int, q:int, attempts=MAX_ATTEMPTS_EXC):
    """
    Fetch one (type, agency, layer, fy, quarter).
    Returns a DataFrame with standardized columns (never None).
    Raises on non-200 responses and, once retries are exhausted, on network
    errors, so callers can record the task in the failures file.
    """
    attempts = max(1, attempts)
    for attempt in range(1, attempts+1):
        try:
            r = SESSION.post(URL_GEO, json=payload_for(agency_type, agency_name, layer, fy, q), timeout=TIMEOUT_S)
            # parse
            try:
                data = r.json()
            except ValueError:
                data = {}
            if r.status_code != 200:
                raise RuntimeError(f"HTTP {r.status_code} {str(data or {'raw': r.text[:180]})[:300]}")

            rows = (data or {}).get("results")
            if rows is None:
                return pd.DataFrame(columns=EMPTY_COLS)

            df = pd.DataFrame(rows)
            if df.empty:
                return pd.DataFrame(columns=EMPTY_COLS)

            df = df.rename(columns={"shape_code":"code","display_name":"name","aggregated_amount":"amount"})
            keep = [c for c in ("code","name","amount","population") if c in df.columns]
            df = df[keep].copy()  # copy so the column assignments below do not write to a view
            df["amount"] = pd.to_numeric(df.get("amount"), errors="coerce")
            df["fy"], df["quarter"] = fy, q
            df["agency_type"], df["agency_code"], df["agency_name"] = agency_type, agency_code, agency_name
            df["geo_layer"] = layer
            if layer == "county" and "code" in df.columns:
                df["state_code"] = df["code"].astype(str).str[:2]
            return df

        except RETRY_EXC as e:
            if attempt == attempts:
                # Out of retries: re-raise so the caller logs this task as a failure
                # instead of recording a dropped connection as "0 rows".
                raise
            backoff = BACKOFF_BASE * (2 ** (attempt - 1)) + random.uniform(0, BACKOFF_BASE)
            print(f"↻ retry {layer} {agency_type} {agency_code} FY{fy} Q{q} ({attempt}/{attempts}) after {type(e).__name__}: {e}")
            time.sleep(backoff)
        finally:
            time.sleep(PAUSE)

# ===================== SAVE HELPERS =====================
def write_failures_overwrite(agency_type:str, layer:str, failures:list[tuple[str,str,int,int,str]]):
    """
    failures: list of (agency_type, agency_code, fy, q, reason)
    """
    fp = failures_path(agency_type, layer)
    if failures:
        df = pd.DataFrame(failures, columns=["agency_type","agency_code","fy","quarter","reason"])
        df.to_csv(fp, index=False)
        print(f"📝 updated failures for {agency_type}/{layer}: {len(df)} rows → {fp}")
    else:
        if os.path.exists(fp):
            os.remove(fp)
        print(f"🧹 no failures remain for {agency_type}/{layer}; cleared {fp}")

def read_failures(agency_type:str, layer:str):
    fp = failures_path(agency_type, layer)
    if not os.path.exists(fp) or os.path.getsize(fp) == 0:
        return pd.DataFrame(columns=["agency_type","agency_code","fy","quarter","reason"])
    try:
        return pd.read_csv(fp)
    except pd.errors.EmptyDataError:
        return pd.DataFrame(columns=["agency_type","agency_code","fy","quarter","reason"])

def save_year_overwrite_year(agency_type:str, layer:str, fy:int, parts:list[pd.DataFrame]):
    # Drop empty frames: concatenating them can trigger a pandas FutureWarning
    parts = [p for p in (parts or []) if isinstance(p, pd.DataFrame) and not p.empty]
    df_year = pd.concat(parts, ignore_index=True) if parts else pd.DataFrame(columns=EMPTY_COLS)
    keys = [k for k in ["agency_type","agency_code","geo_layer","fy","quarter","code"] if k in df_year.columns]
    if keys:
        df_year = df_year.drop_duplicates(subset=keys).sort_values(keys)
    out = year_path_year(agency_type, layer, fy)
    df_year.to_csv(out, index=False)
    return out, len(df_year)

def save_year_merge_year(agency_type:str, layer:str, fy:int, parts:list[pd.DataFrame]):
    # Drop empty frames: concatenating them can trigger a pandas FutureWarning
    parts = [p for p in (parts or []) if isinstance(p, pd.DataFrame) and not p.empty]
    new_chunk = pd.concat(parts, ignore_index=True) if parts else pd.DataFrame()
    out = year_path_year(agency_type, layer, fy)

    if os.path.exists(out) and os.path.getsize(out) > 0:
        try:
            base = pd.read_csv(out, dtype=str)
        except pd.errors.EmptyDataError:
            base = pd.DataFrame()
    else:
        base = pd.DataFrame()

    merged = pd.concat([base, new_chunk], ignore_index=True) if not new_chunk.empty else base
    if merged.empty:
        merged = pd.DataFrame(columns=EMPTY_COLS)
    if "amount" in merged.columns:
        merged["amount"] = pd.to_numeric(merged["amount"], errors="coerce")

    keys = [k for k in ["agency_type","agency_code","geo_layer","fy","quarter","code"] if k in merged.columns]
    if keys:
        merged = merged.drop_duplicates(subset=keys, keep="last").sort_values(keys)

    merged.to_csv(out, index=False)
    return out, len(merged)

# ===================== MAIN ENTRY =====================
def run_geography(
    mode:str = "initial",
    types:list[str] = None,
    layers:list[str] = None,
    start_fy:int = START_FY,
    end_fy:int = END_FY,
    max_workers:int = MAX_WORKERS,
    attempts_per_task:int = MAX_ATTEMPTS_EXC,
):
    """
    mode: "initial" (overwrite) or "retry" (append/merge only failed tasks)
    types:  subset of ["funding","awarding"]
    layers: subset of ["country","state","county","district"]
    """
    if types is None:  types = DEFAULT_TYPES[:]
    if layers is None: layers = DEFAULT_LAYERS[:]

    ensure_dirs(types, layers)

    # Roster once per run
    agencies = toptier_lookup()
    code2name = {a["code"]: a["name"] for a in agencies}

    for agency_type in types:
        for layer in layers:
            workers = get_workers(agency_type, layer, max_workers)

            if mode == "initial":
                # reset failures
                write_failures_overwrite(agency_type, layer, failures=[])
                # Build all tasks across agencies × FY × Q
                tasks = [(agency_type, a["code"], a["name"], layer, fy, q)
                         for a in agencies
                         for fy in range(start_fy, end_fy+1)
                         for q in (1,2,3,4)]
                print(f"Submitting {len(tasks)} tasks for {agency_type}/{layer} with max_workers={workers} …")

                by_year, failures = {}, []
                with ThreadPoolExecutor(max_workers=workers) as ex:
                    fut2task = {ex.submit(fetch_one, *t, attempts=attempts_per_task): t for t in tasks}
                    for fut in as_completed(fut2task):
                        agency_type_, agency_code, agency_name, layer_, fy, q = fut2task[fut]
                        try:
                            dfq = fut.result()
                            by_year.setdefault(fy, []).append(dfq)
                            print(f"✅ {layer_} {agency_type_} {agency_code} FY{fy} Q{q}: {len(dfq)} rows")
                        except Exception as e:
                            print(f"⚠️ {layer_} {agency_type_} {agency_code} FY{fy} Q{q} failed: {e}")
                            failures.append((agency_type_, agency_code, fy, q, str(e)))
                        finally:
                            time.sleep(PAUSE)

                # Write one file per FY (overwrite)
                for fy, parts in sorted(by_year.items()):
                    out, n = save_year_overwrite_year(agency_type, layer, fy, parts)
                    print(f"📦 {layer} {agency_type} FY{fy}: {n} rows → {out}")

                # Failures file (overwrite with current failures)
                write_failures_overwrite(agency_type, layer, failures)

            elif mode == "retry":
                f = read_failures(agency_type, layer)
                f = f[pd.notna(pd.to_numeric(f.get("fy", pd.Series()), errors="coerce")) &
                      pd.notna(pd.to_numeric(f.get("quarter", pd.Series()), errors="coerce"))]
                if f.empty:
                    print(f"✅ No failures to retry for {agency_type}/{layer}.")
                    write_failures_overwrite(agency_type, layer, failures=[])
                    continue

                todo_raw = sorted(set((agency_type, str(row.agency_code).zfill(3), int(row.fy), int(row.quarter))
                                      for _, row in f.iterrows()))
                print(f"🔁 Retrying {len(todo_raw)} failed tasks for {agency_type}/{layer} with max_workers={workers} …")

                expanded, new_failures = [], []
                for (_, code, fy, q) in todo_raw:
                    nm = code2name.get(code)
                    if not nm:
                        new_failures.append((agency_type, code, fy, q, "Unknown toptier code"))
                        continue
                    expanded.append((agency_type, code, nm, layer, fy, q))

                by_year = {}
                with ThreadPoolExecutor(max_workers=workers) as ex:
                    fut2task = {ex.submit(fetch_one, *t, attempts=attempts_per_task): t for t in expanded}
                    for fut in as_completed(fut2task):
                        _, agency_code, agency_name, layer_, fy, q = fut2task[fut]
                        try:
                            dfq = fut.result()
                            by_year.setdefault(fy, []).append(dfq)
                            print(f"✅ retry OK: {layer_} {agency_type} {agency_code} FY{fy} Q{q} ({len(dfq)} rows)")
                        except Exception as e:
                            print(f"❌ retry failed: {layer_} {agency_type} {agency_code} FY{fy} Q{q}: {e}")
                            new_failures.append((agency_type, agency_code, fy, q, str(e)))
                        finally:
                            time.sleep(PAUSE)

                # Merge-append successes into per-FY files
                for fy, parts in sorted(by_year.items()):
                    out, n = save_year_merge_year(agency_type, layer, fy, parts)
                    print(f"📦 merged {layer} {agency_type} FY{fy}: {n} rows → {out}")

                # Overwrite failures with remaining
                write_failures_overwrite(agency_type, layer, new_failures)

            else:
                raise ValueError("mode must be 'initial' or 'retry'")

# ===================== EXAMPLES =====================
# Example 1: Initial run, just funding/state for FY 2018–2020
# run_geography(mode="initial", types=["funding"], layers=["state"], start_fy=2018, end_fy=2020)

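# Example 2: Retry run for the same type/layer. Note that retry mode replays every row in the
#            failures file; start_fy/end_fy are not applied as a filter in this mode.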
# run_geography(mode="retry", types=["funding"], layers=["state"], start_fy=2018, end_fy=2020)

# Example 3: Initial run across all 4 layers but only awarding type
# run_geography(mode="initial", types=["awarding"], layers=["country","state","county","district"], start_fy=2008, end_fy=2024)

## 🏢 Agency Roster Management & Caching

**Robust agency directory** with cached fallback for production reliability:

### Top-Tier Agency Discovery
- **API Endpoint**: `GET /api/v2/references/toptier_agencies/`
- **Data Retrieved**: Complete roster of federal top-tier agencies with codes and names
- **Caching Strategy**: Local CSV cache with automatic refresh and fallback

### Agency Data Structure
Each agency record contains:
- **`toptier_code`**: 3-digit numeric agency identifier (e.g., "012")
- **`name`**: Official agency name (e.g., "Department of Agriculture")
- **`abbreviation`**: Short agency code (e.g., "USDA")

### Caching & Resilience
- **Cache Location**: `{CACHE_DIR}/toptier_agencies.csv`
- **Live-First Strategy**: Attempts fresh API call, falls back to cached data
- **Error Handling**: Graceful degradation to cached roster on API failures
- **Cache Validation**: Uses the cached file only if it is non-empty and contains `code` and `name` columns

### Agency Filtering Integration
- **Funding Agencies**: Agencies that provide the funding for obligations
- **Awarding Agencies**: Agencies that execute the spending/contracts
- **Dual Collection**: Same geographic data collected from both perspectives
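The live-first, cache-fallback behaviour described above can be sketched without any network access by injecting the fetch step. Everything here (`load_roster`, `ok_fetch`, `bad_fetch`, the temporary cache path) is a hypothetical stand-in for the notebook's `toptier_lookup`, which additionally retries the API call before falling back.

```python
import csv
import os
import tempfile

def load_roster(fetch, cache_fp):
    """Try fetch() first; on an error or empty result, fall back to the CSV cache."""
    try:
        rows = [{"code": str(r["code"]).zfill(3), "name": r["name"]} for r in fetch()]
        if rows:
            with open(cache_fp, "w", newline="", encoding="utf-8") as fh:
                writer = csv.DictWriter(fh, fieldnames=["code", "name"])
                writer.writeheader()
                writer.writerows(rows)  # refresh the cache on every successful fetch
            return rows, "live"
    except Exception:
        pass  # fall through to the cache
    if os.path.exists(cache_fp):
        with open(cache_fp, newline="", encoding="utf-8") as fh:
            rows = [{"code": r["code"].zfill(3), "name": r["name"]} for r in csv.DictReader(fh)]
        if rows:
            return rows, "cache"
    raise RuntimeError("roster unavailable: live fetch failed and no usable cache")

def ok_fetch():
    return [{"code": 12, "name": "Department of Agriculture"}]

def bad_fetch():
    raise ConnectionError("simulated outage")

cache = os.path.join(tempfile.mkdtemp(), "toptier_agencies.csv")
live = load_roster(ok_fetch, cache)     # writes the cache
cached = load_roster(bad_fetch, cache)  # served from the cache
print(live)
print(cached)
```

Zero-padding the code on both paths keeps values such as `012` stable even if a CSV reader or the API returns them as integers.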

In [5]:
run_geography(mode="initial", types=["funding"], layers=["state"], start_fy=2008, end_fy=2024)

Streaming output truncated to the last 5000 lines.
↻ retry state funding 088 FY2016 Q3 (1/2) after ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
↻ retry state funding 088 FY2016 Q4 (1/2) after ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
✅ state funding 088 FY2015 Q1: 0 rows
↻ retry state funding 088 FY2017 Q1 (1/2) after ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
↻ retry state funding 088 FY2017 Q2 (1/2) after ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
↻ retry state funding 088 FY2017 Q3 (1/2) after ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
↻ retry state funding 088 FY2017 Q4 (1/2) after ConnectionError: ('Connection aborted.', RemoteDisconnected

📦 state funding FY2008: 693 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/state/geo_state_funding_FY2008.csv


📦 state funding FY2009: 711 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/state/geo_state_funding_FY2009.csv
📦 state funding FY2010: 867 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/state/geo_state_funding_FY2010.csv


📦 state funding FY2011: 892 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/state/geo_state_funding_FY2011.csv
📦 state funding FY2012: 884 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/state/geo_state_funding_FY2012.csv


📦 state funding FY2013: 912 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/state/geo_state_funding_FY2013.csv


📦 state funding FY2014: 901 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/state/geo_state_funding_FY2014.csv
📦 state funding FY2015: 855 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/state/geo_state_funding_FY2015.csv


📦 state funding FY2016: 742 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/state/geo_state_funding_FY2016.csv
📦 state funding FY2017: 749 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/state/geo_state_funding_FY2017.csv


📦 state funding FY2018: 810 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/state/geo_state_funding_FY2018.csv


📦 state funding FY2019: 833 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/state/geo_state_funding_FY2019.csv
📦 state funding FY2020: 825 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/state/geo_state_funding_FY2020.csv


📦 state funding FY2021: 959 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/state/geo_state_funding_FY2021.csv
📦 state funding FY2022: 1095 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/state/geo_state_funding_FY2022.csv


📦 state funding FY2023: 1077 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/state/geo_state_funding_FY2023.csv
📦 state funding FY2024: 1081 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/state/geo_state_funding_FY2024.csv
📝 updated failures for funding/state: 4 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/state/failures_funding_state.csv


## 🎯 API Payload Construction

**Dynamic request payload generation** for spending-by-geography API integration:

### Core Payload Structure
```json
{
  "filters": {
    "agencies": [{"type": "funding", "tier": "toptier", "name": "Department of Defense"}],
    "time_period": [{"start_date": "2023-10-01", "end_date": "2023-12-31"}],
    "date_type": "action_date"
  },
  "scope": "place_of_performance",
  "geo_layer": "county",
  "subawards": false
}
```

### Filter Components
- **Agency Filter**: Single agency per request for precise attribution
  - `type`: "funding" or "awarding" - determines agency role perspective
  - `tier`: "toptier" - top-tier agencies only (departments as well as independent agencies, boards, and commissions), not sub-tier offices
  - `name`: Exact agency name from cached roster
- **Time Filter**: Specific FY/Quarter date boundaries
- **Geographic Scope**: "place_of_performance" (where work happens) vs "recipient_location"
- **Layer Selection**: Country/State/County/District granularity

### API Endpoint
- **Primary**: `POST https://api.usaspending.gov/api/v2/search/spending_by_geography/`
- **Method**: POST with JSON payload for complex filtering
- **Response**: Geographic spending data aggregated by specified parameters
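The dual funding/awarding collection comes down to changing one field in the request body. The sketch below mirrors the shape built by the notebook's `payload_for`, but takes explicit dates so it stays self-contained; that signature change and the default `scope` argument are assumptions made for illustration.

```python
import json

def payload_for(agency_type, agency_name, layer, start, end, scope="place_of_performance"):
    # Same structure as the notebook's payload_for: filters are nested, scope and layer are top-level.
    return {
        "filters": {
            "date_type": "action_date",
            "time_period": [{"start_date": start, "end_date": end}],
            "agencies": [{"type": agency_type, "tier": "toptier", "name": agency_name}],
        },
        "scope": scope,
        "geo_layer": layer,
        "subawards": False,
    }

funding = payload_for("funding", "Department of Defense", "county", "2023-10-01", "2023-12-31")
awarding = payload_for("awarding", "Department of Defense", "county", "2023-10-01", "2023-12-31")
print(json.dumps(funding, indent=2))
```

Apart from `filters.agencies[0].type`, the two payloads are identical, which is why both perspectives can share one fetch function.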

In [7]:
run_geography(mode="retry", types=["funding"], layers=["state"], start_fy=2008, end_fy=2024)

✅ toptier_lookup: 110 agencies (sample: [{'code': '247', 'name': '400 Years of African-American History Commission'}, {'code': '310', 'name': 'Access Board'}, {'code': '302', 'name': 'Administrative Conference of the U.S.'}])
✅ No failures to retry for funding/state.
🧹 no failures remain for funding/state; cleared /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/state/failures_funding_state.csv


## 🔄 Core Fetch Function & Data Normalization

**Production-grade API request handling** with intelligent retry and data standardization:

### Request Execution Flow
1. **API Call**: POST to spending-by-geography endpoint with constructed payload
2. **Response Validation**: Check HTTP status and JSON structure
3. **Data Extraction**: Parse geographic spending results from API response
4. **Normalization**: Standardize field names and data types
5. **Enhancement**: Add partition columns and derived fields
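When the API call in step 1 fails with a network exception, the notebook waits before trying again, using an exponential delay plus random jitter. The sketch below isolates that delay formula with the notebook's `BACKOFF_BASE` value; the function name `backoff_delay` is an assumption introduced for this example.

```python
import random

BACKOFF_BASE = 0.35  # seconds, as in the CONFIG cell

def backoff_delay(attempt: int, base: float = BACKOFF_BASE) -> float:
    """Delay before the retry that follows failed attempt number `attempt` (1-based)."""
    return base * (2 ** (attempt - 1)) + random.uniform(0, base)

for attempt in (1, 2, 3):
    low = BACKOFF_BASE * 2 ** (attempt - 1)
    print(f"attempt {attempt}: between {low:.2f}s and {low + BACKOFF_BASE:.2f}s")
```

The jitter term spreads out retries from parallel workers so they are less likely to hit the API at the same instant.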

### Data Normalization Pipeline
- **Field Mapping**: 
  - `shape_code` → `code` (geographic identifier)
  - `display_name` → `name` (readable location name)
  - `aggregated_amount` → `amount` (spending totals)
- **Type Coercion**: Convert amount to numeric, handle nulls gracefully
- **Schema Consistency**: Empty results use the fixed `EMPTY_COLS` schema; non-empty results share the same columns, with `population` kept only when the API returns it
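The mapping and coercion steps above can be tried on a couple of hand-made rows. The values below are invented for illustration, and `per_capita` stands in for any extra response field that the pipeline does not keep.

```python
import pandas as pd

# Made-up rows using the field names listed above.
sample_results = [
    {"shape_code": "CA", "display_name": "California",
     "aggregated_amount": "1250.50", "population": 39000000, "per_capita": 0.1},
    {"shape_code": "TX", "display_name": "Texas",
     "aggregated_amount": None, "population": 29000000, "per_capita": None},
]

df = pd.DataFrame(sample_results).rename(
    columns={"shape_code": "code", "display_name": "name", "aggregated_amount": "amount"}
)
keep = [c for c in ("code", "name", "amount", "population") if c in df.columns]
df = df[keep].copy()
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")  # None becomes NaN
print(df.dtypes)
```

With `errors="coerce"`, unparseable or missing amounts become NaN instead of raising, which keeps one bad value from failing a whole quarter.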

### Partition Column Addition
Each row gets enhanced with:
- **`fy`**: Fiscal year (2008-2024)
- **`quarter`**: Quarter number (1-4)
- **`agency_type`**: "funding" or "awarding"
- **`agency_code`**: 3-digit top-tier agency code
- **`agency_name`**: Official agency name
- **`geo_layer`**: "country"|"state"|"county"|"district"

### County-Specific Enhancement
For county-layer data:
- **`state_code`**: First 2 digits of county FIPS code
- **`state_name`**: Not produced by `fetch_one`; adding it requires a separate FIPS-to-state mapping
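A minimal sketch of the county enrichment follows, assuming county codes are 5-digit FIPS strings (2 state digits followed by 3 county digits). The `STATE_FIPS` table is deliberately partial and `add_state_fields` is a hypothetical helper; the zero-padding is a precaution added here and is not something the notebook does.

```python
STATE_FIPS = {"06": "California", "48": "Texas"}  # partial, illustrative mapping

def add_state_fields(row: dict) -> dict:
    """Return a copy of the row with state_code and (when known) state_name."""
    code = str(row["code"]).zfill(5)  # restore leading zeros lost to integer parsing
    state_code = code[:2]
    return {**row, "state_code": state_code, "state_name": STATE_FIPS.get(state_code)}

rows = [
    {"code": "06037", "name": "Los Angeles County"},
    {"code": 48201, "name": "Harris County"},
    {"code": "99999", "name": "Unknown"},
]
enriched = [add_state_fields(r) for r in rows]
for r in enriched:
    print(r["code"], r["state_code"], r["state_name"])
```

Codes without a match keep `state_name` as `None`, so gaps in the lookup table stay visible downstream.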

In [8]:
run_geography(mode="initial", types=["funding"], layers=["county"], start_fy=2008, end_fy=2024)

Streaming output truncated to the last 5000 lines.
✅ county funding 389 FY2021 Q2: 0 rows
↻ retry county funding 389 FY2022 Q3 (1/2) after ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
✅ county funding 389 FY2021 Q3: 0 rows
✅ county funding 389 FY2021 Q4: 0 rows
✅ county funding 389 FY2022 Q1: 0 rows
↻ retry county funding 389 FY2022 Q4 (1/2) after ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
↻ retry county funding 389 FY2023 Q1 (1/2) after ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
✅ county funding 389 FY2022 Q2: 0 rows
↻ retry county funding 389 FY2023 Q2 (1/2) after ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
↻ retry county funding 389 FY2023 Q3 (1/2) after ConnectionError: ('Connection aborted.', RemoteDisconne

  df_year = pd.concat(parts, ignore_index=True) if parts else pd.DataFrame(columns=EMPTY_COLS)


📦 county funding FY2008: 4716 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/county/geo_county_funding_FY2008.csv


📦 county funding FY2009: 4694 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/county/geo_county_funding_FY2009.csv
📦 county funding FY2010: 11529 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/county/geo_county_funding_FY2010.csv


📦 county funding FY2011: 14033 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/county/geo_county_funding_FY2011.csv


📦 county funding FY2012: 13852 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/county/geo_county_funding_FY2012.csv


📦 county funding FY2013: 13628 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/county/geo_county_funding_FY2013.csv


📦 county funding FY2014: 14208 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/county/geo_county_funding_FY2014.csv


📦 county funding FY2015: 14535 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/county/geo_county_funding_FY2015.csv


📦 county funding FY2016: 14938 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/county/geo_county_funding_FY2016.csv


📦 county funding FY2017: 28690 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/county/geo_county_funding_FY2017.csv


📦 county funding FY2018: 25304 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/county/geo_county_funding_FY2018.csv


📦 county funding FY2019: 33014 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/county/geo_county_funding_FY2019.csv


📦 county funding FY2020: 34594 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/county/geo_county_funding_FY2020.csv


📦 county funding FY2021: 36030 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/county/geo_county_funding_FY2021.csv


📦 county funding FY2022: 36063 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/county/geo_county_funding_FY2022.csv


📦 county funding FY2023: 36038 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/county/geo_county_funding_FY2023.csv


📦 county funding FY2024: 35998 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/county/geo_county_funding_FY2024.csv
📝 updated failures for funding/county: 7 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/county/failures_funding_county.csv


## 🛡️ Network Resilience & Retry Strategy

**Enterprise-grade error handling** with exponential backoff and intelligent recovery:

### Network Exception Handling
Targeted retry on transport-level failures:
- **`ConnectionError`**: Network connectivity issues
- **`ReadTimeout`**: Server response timeout
- **`ChunkedEncodingError`**: HTTP transfer encoding problems
- **`RemoteDisconnected`**: Server closed connection unexpectedly (surfaces wrapped in a `ConnectionError`, as in the run logs)

### Exponential Backoff Algorithm
```python
backoff = (0.4 * (2 ** (attempt - 1))) + random.uniform(0, 0.4)
```
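- **Attempt 1**: Initial request, sent immediately (no delay)
- **Attempt 2**: Sent ~0.4-0.8 seconds after the first failure (0.4 s base plus up to 0.4 s jitter)
- **Attempt 3**: Sent ~0.8-1.2 seconds after the second failure (only reached if `MAX_ATTEMPTS_EXC` is raised above its default of 2)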

### Retry Logic Design
- **HTTP Status Handling**: 4xx/5xx responses are NOT retried (by design)
- **Network-Only Retries**: Only retry on connection/transport failures
- **Jitter**: Random component spreads out retries to reduce thundering-herd effects
- **Attempt Limit**: Configurable `MAX_ATTEMPTS_EXC` (default: 2)
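
A minimal sketch of this retry policy, assuming the formula's `attempt` is the 1-based number of the attempt that just failed. `backoff_delay` and `call_with_retries` are hypothetical names. The built-in `ConnectionError` and `TimeoutError` are stand-ins to keep the example dependency-free; with `requests` one would pass `requests.exceptions.ConnectionError`, `ReadTimeout`, and `ChunkedEncodingError` via `retry_on`.

```python
import random
import time


def backoff_delay(attempt, base=0.4, jitter=0.4):
    """Seconds to wait after failed attempt number `attempt` (1-based)."""
    return base * (2 ** (attempt - 1)) + random.uniform(0, jitter)


def call_with_retries(fn, max_attempts=2,
                      retry_on=(ConnectionError, TimeoutError),
                      sleep=time.sleep):
    """Call fn(); retry only on the listed exception types, then re-raise.

    Any other exception (for example one raised for a 4xx/5xx response)
    propagates immediately without a retry.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # attempts exhausted: let the caller record the failure
            sleep(backoff_delay(attempt))


# Demo: the first call drops the connection, the second succeeds.
calls = {"n": 0}
delays = []


def flaky():
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("Remote end closed connection without response")
    return "ok"


result = call_with_retries(flaky, sleep=delays.append)
```

Injecting `sleep` makes the delay observable in tests without actually waiting.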

### Graceful Failure Handling
- **Empty DataFrame Return**: Failed requests return proper schema with zero rows
- **Failure Tracking**: Records exact (agency, FY, quarter, layer) combination for retry
- **No Crash Guarantee**: File writers never break on empty results

In [10]:
run_geography(mode="retry", types=["funding"], layers=["county"], start_fy=2008, end_fy=2024)

✅ toptier_lookup: 110 agencies (sample: [{'code': '247', 'name': '400 Years of African-American History Commission'}, {'code': '310', 'name': 'Access Board'}, {'code': '302', 'name': 'Administrative Conference of the U.S.'}])
✅ No failures to retry for funding/county.
🧹 no failures remain for funding/county; cleared /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/county/failures_funding_county.csv


## 🔄 Dual-Mode Processing System

**Two distinct execution modes** for comprehensive data management and recovery:

### Mode 1: "Initial" (Clean Rebuild)
**Complete fresh collection** for new or full refresh scenarios:

#### Workflow Steps:
1. **Failure File Reset**: Clear existing failure tracking files
2. **Task Matrix Generation**: Build all agency × FY × quarter × layer combinations
3. **Parallel Execution**: Run thread pools with optimized worker counts
4. **Success Aggregation**: Group successful results by fiscal year
5. **File Overwrite**: Replace existing FY CSV files with new merged data
6. **Failure Recording**: Save failed tasks for potential retry
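
Step 2 (task matrix generation) can be sketched with `itertools.product`. The function name and the task dictionary layout are hypothetical; only the dimensions (agency × FY × quarter × layer) come from the description above.

```python
from itertools import product


def build_tasks(agencies, agency_type, layers, start_fy, end_fy):
    """Enumerate one task per (layer, agency, fiscal year, quarter)."""
    fiscal_years = range(start_fy, end_fy + 1)
    quarters = (1, 2, 3, 4)
    return [
        {
            "agency_type": agency_type,
            "agency_code": agency["code"],
            "agency_name": agency["name"],
            "geo_layer": layer,
            "fy": fy,
            "quarter": quarter,
        }
        for layer, agency, fy, quarter in product(layers, agencies, fiscal_years, quarters)
    ]


# Two agencies taken from the roster sample printed in the run logs.
agencies = [
    {"code": "310", "name": "Access Board"},
    {"code": "302", "name": "Administrative Conference of the U.S."},
]
tasks = build_tasks(agencies, "funding", ["county"], 2008, 2024)
```

With the 110-agency roster reported in the logs, one type/layer pair over FY2008-2024 yields 110 × 17 × 4 = 7,480 tasks.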

#### When to Use:
- First-time data collection
- Complete data refresh requirements
- When existing data integrity is questioned
- After major API or schema changes

### Mode 2: "Retry" (Surgical Recovery)
**Targeted failure recovery** for production resilience:

#### Workflow Steps:
1. **Failure Analysis**: Load existing failure CSV files per type/layer
2. **Task Reconstruction**: Rebuild failed tasks using cached agency roster
3. **Selective Execution**: Only retry previously failed requests
4. **Incremental Merge**: Append successful results to existing FY files
5. **Deduplication**: Remove duplicates based on composite keys
6. **Failure Update**: Maintain only currently failing tasks
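
Steps 1 and 2 (loading failures and rebuilding tasks from the roster) could look roughly like this. The failure-file columns (`agency_code`, `fy`, `quarter`) and the function name are assumptions, since the notebook's actual failure schema is not shown; the agency name below is a placeholder.

```python
import csv
import io


def tasks_from_failures(failure_csv_text, roster, agency_type, geo_layer):
    """Rebuild retry tasks from a failure CSV, skipping duplicates and unknown agencies."""
    names_by_code = {agency["code"]: agency["name"] for agency in roster}
    tasks, seen = [], set()
    for rec in csv.DictReader(io.StringIO(failure_csv_text)):
        # Agency codes are 3-digit strings; restore leading zeros if a CSV
        # round-trip turned "088" into "88".
        code = rec["agency_code"].strip().zfill(3)
        key = (code, int(rec["fy"]), int(rec["quarter"]))
        if key in seen or code not in names_by_code:
            continue
        seen.add(key)
        tasks.append({
            "agency_type": agency_type,
            "agency_code": code,
            "agency_name": names_by_code[code],
            "geo_layer": geo_layer,
            "fy": key[1],
            "quarter": key[2],
        })
    return tasks


roster = [{"code": "088", "name": "Example Agency (placeholder name)"}]
failures = "agency_code,fy,quarter\n88,2017,1\n88,2017,1\n999,2020,2\n"
retry_tasks = tasks_from_failures(failures, roster, "funding", "country")
```

Looking names up in the cached roster avoids storing the full agency record in every failure row.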

#### When to Use:
- After initial runs with some failures
- Network interruption recovery
- Periodic maintenance to resolve transient issues
- Production environments requiring minimal downtime

In [12]:
run_geography(mode="initial", types=["funding"], layers=["district"], start_fy=2008, end_fy=2024)

Streaming output truncated to the last 5000 lines.
✅ district funding 389 FY2021 Q2: 0 rows
✅ district funding 389 FY2021 Q3: 0 rows
↻ retry district funding 389 FY2022 Q1 (1/2) after ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
✅ district funding 389 FY2021 Q4: 0 rows
↻ retry district funding 389 FY2022 Q2 (1/2) after ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
↻ retry district funding 389 FY2022 Q3 (1/2) after ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
↻ retry district funding 389 FY2022 Q4 (1/2) after ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
✅ district funding 389 FY2022 Q1: 0 rows
✅ district funding 389 FY2022 Q2: 0 rows
✅ district funding 389 FY2022 Q3: 0 rows
↻ retry district funding 389 FY2023 Q1 (1/2) 

📦 district funding FY2009: 5130 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/district/geo_district_funding_FY2009.csv
📦 district funding FY2010: 6965 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/district/geo_district_funding_FY2010.csv


📦 district funding FY2011: 7365 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/district/geo_district_funding_FY2011.csv


📦 district funding FY2012: 7263 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/district/geo_district_funding_FY2012.csv
📦 district funding FY2013: 7162 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/district/geo_district_funding_FY2013.csv


📦 district funding FY2014: 7704 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/district/geo_district_funding_FY2014.csv
📦 district funding FY2015: 8718 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/district/geo_district_funding_FY2015.csv


📦 district funding FY2016: 8770 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/district/geo_district_funding_FY2016.csv
📦 district funding FY2017: 9342 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/district/geo_district_funding_FY2017.csv


📦 district funding FY2018: 9333 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/district/geo_district_funding_FY2018.csv


📦 district funding FY2019: 10927 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/district/geo_district_funding_FY2019.csv


📦 district funding FY2020: 10605 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/district/geo_district_funding_FY2020.csv


📦 district funding FY2021: 10017 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/district/geo_district_funding_FY2021.csv


📦 district funding FY2022: 8901 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/district/geo_district_funding_FY2022.csv


📦 district funding FY2023: 9096 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/district/geo_district_funding_FY2023.csv
📦 district funding FY2024: 8934 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/district/geo_district_funding_FY2024.csv
📝 updated failures for funding/district: 9 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/district/failures_funding_district.csv


## 💾 File Management & Deduplication Strategy

**Robust data persistence** with intelligent duplicate handling and file organization:

### File Naming Convention
```
geography_by_agency/
├── funding/
│   ├── country/geo_country_funding_FY2024.csv
│   ├── state/geo_state_funding_FY2024.csv
│   ├── county/geo_county_funding_FY2024.csv
│   └── district/geo_district_funding_FY2024.csv
└── awarding/
    └── [mirror structure]
```
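
The naming convention can be expressed as two small path helpers. The helper names are hypothetical; the patterns themselves match the file paths printed in the run logs.

```python
from pathlib import PurePosixPath

ROOT = PurePosixPath("geography_by_agency")
AGENCY_TYPES = ("funding", "awarding")
GEO_LAYERS = ("country", "state", "county", "district")


def year_file(agency_type, geo_layer, fy):
    """Path of the per-FY results CSV for one agency type and layer."""
    return ROOT / agency_type / geo_layer / f"geo_{geo_layer}_{agency_type}_FY{fy}.csv"


def failure_file(agency_type, geo_layer):
    """Path of the failure-tracking CSV for one agency type and layer."""
    return ROOT / agency_type / geo_layer / f"failures_{agency_type}_{geo_layer}.csv"


all_year_files = {
    str(year_file(agency_type, layer, fy))
    for agency_type in AGENCY_TYPES
    for layer in GEO_LAYERS
    for fy in range(2008, 2025)
}
```

Enumerating every type, layer, and fiscal year gives 2 × 4 × 17 = 136 distinct result files.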

### Deduplication Logic
**Composite Key Strategy** prevents duplicate records:
- **Primary Keys**: `(agency_type, agency_code, geo_layer, fy, quarter, code)`
- **Geographic Code**: Country code, state or county FIPS code, or district identifier
- **Temporal Keys**: Fiscal year and quarter combination
- **Agency Keys**: Type and code for precise attribution

### File Writing Modes

#### Initial Mode: `save_year_overwrite_year()`
- **Complete Replacement**: Overwrites entire FY file with merged results
- **Year Aggregation**: Combines all quarterly data for fiscal year
- **Duplicate Removal**: Drops duplicates on composite keys
- **Clean State**: Ensures file reflects only current collection run

#### Retry Mode: `save_year_merge_year()`
- **Incremental Addition**: Appends new results to existing FY file
- **Merge Strategy**: Combines new and existing data before deduplication
- **Preservation**: Maintains existing successful data while adding recoveries
- **Final Deduplication**: Ensures no duplicates in final merged file
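
A plain-Python sketch of the merge-then-deduplicate step on the composite key. The keep-last policy (a recovered row replaces an older row with the same key) is an assumption, as the text does not say which duplicate wins; in pandas the analogous call is `drop_duplicates(subset=..., keep="last")` after concatenating old and new frames. `merge_dedup` and `make_row` are hypothetical names.

```python
KEY_COLS = ("agency_type", "agency_code", "geo_layer", "fy", "quarter", "code")


def merge_dedup(existing_rows, new_rows):
    """Combine old and new rows, keeping the last row seen for each composite key."""
    merged = {}
    for row in list(existing_rows) + list(new_rows):
        # str() so that 2024 read back from CSV as "2024" still matches.
        key = tuple(str(row.get(col)) for col in KEY_COLS)
        merged[key] = row
    return list(merged.values())


def make_row(code, quarter, amount, fy=2024):
    """Build a demo county row for one agency."""
    return {"agency_type": "funding", "agency_code": "012", "geo_layer": "county",
            "fy": fy, "quarter": quarter, "code": code, "amount": amount}


existing = [make_row("01001", "1", 1.0, fy="2024"), make_row("01003", "1", 2.0, fy="2024")]
recovered = [make_row("01001", 1, 9.0), make_row("01001", 2, 3.0)]
merged_rows = merge_dedup(existing, recovered)
```

Because the key includes `quarter` and `code`, the same county can appear once per quarter without being treated as a duplicate.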

### Failure File Management
- **Per-Type-Layer Files**: One failure file per type/layer pair (e.g. `failures_funding_county.csv`) for granular tracking
- **Overwrite Strategy**: `write_failures_overwrite()` maintains current failures only
- **Clean Completion**: Delete failure files when all tasks succeed
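
The overwrite-or-clear behavior might be sketched as below. `write_failures_sketch` is a hypothetical stand-in, not the notebook's `write_failures_overwrite()`, and the failure columns are assumed. The demo runs in a temporary directory so nothing is written to a real results folder.

```python
import csv
import tempfile
from pathlib import Path

FAIL_COLS = ["agency_code", "fy", "quarter"]  # assumed failure schema


def write_failures_sketch(path, failures):
    """Overwrite the failure file with current failures, or delete it if none remain.

    Returns True if a file was written, False if it was cleared.
    """
    path = Path(path)
    if not failures:
        path.unlink(missing_ok=True)  # requires Python 3.8+
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FAIL_COLS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(failures)
    return True


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "funding" / "county" / "failures_funding_county.csv"
    wrote = write_failures_sketch(target, [{"agency_code": "389", "fy": 2022, "quarter": 3}])
    lines = target.read_text(encoding="utf-8").splitlines()
    cleared = (not write_failures_sketch(target, [])) and (not target.exists())
```

Opening in `"w"` mode means the file always reflects only the tasks that are still failing after the latest run.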

In [14]:
run_geography(mode="retry", types=["funding"], layers=["district"], start_fy=2008, end_fy=2024)

✅ toptier_lookup: 110 agencies (sample: [{'code': '247', 'name': '400 Years of African-American History Commission'}, {'code': '310', 'name': 'Access Board'}, {'code': '302', 'name': 'Administrative Conference of the U.S.'}])
✅ No failures to retry for funding/district.
🧹 no failures remain for funding/district; cleared /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/district/failures_funding_district.csv


## 🚀 Production Execution Sequence

**Systematic execution workflow** for comprehensive agency-geography data collection:

### Phase 1: Funding Agency Collection
Sequential processing, one geographic layer at a time:

#### 1.1 State Layer (Moderate Complexity)
```python
run_geography(mode="initial", types=["funding"], layers=["state"])
run_geography(mode="retry", types=["funding"], layers=["state"])
```
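- **One request per agency per quarter**: 110 agencies (per the `toptier_lookup` roster in the run logs) × 68 quarters = 7,480 requests, each returning up to ~50 states/territories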
- **Moderate API load** with manageable response sizes

#### 1.2 County Layer (High Complexity)
```python
run_geography(mode="initial", types=["funding"], layers=["county"])
run_geography(mode="retry", types=["funding"], layers=["county"])
```
- **~3,000 counties** × agencies × quarters = **highest volume collection**
- **FIPS processing** with state code derivation
- **Memory intensive** due to large result sets

#### 1.3 District Layer (Moderate-High Complexity)
```python
run_geography(mode="initial", types=["funding"], layers=["district"])
run_geography(mode="retry", types=["funding"], layers=["district"])
```
- **~435 congressional districts** with redistricting complexity
- **Political geography** requiring careful temporal alignment

#### 1.4 Country Layer (Low Complexity)
```python
run_geography(mode="initial", types=["funding"], layers=["country"])
run_geography(mode="retry", types=["funding"], layers=["country"])
```
- **International spending** with limited geographic entities
- **Fastest processing** due to small result sets

In [15]:
run_geography(mode="initial", types=["funding"], layers=["country"], start_fy=2008, end_fy=2024)

Streaming output truncated to the last 5000 lines.
↻ retry country funding 088 FY2016 Q4 (1/2) after ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
✅ country funding 088 FY2015 Q1: 0 rows
↻ retry country funding 088 FY2017 Q1 (1/2) after ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
✅ country funding 088 FY2015 Q2: 0 rows
↻ retry country funding 088 FY2017 Q2 (1/2) after ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
↻ retry country funding 088 FY2017 Q3 (1/2) after ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
✅ country funding 088 FY2015 Q3: 0 rows
↻ retry country funding 088 FY2017 Q4 (1/2) after ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
✅ country fundi

📦 country funding FY2008: 778 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/country/geo_country_funding_FY2008.csv
📦 country funding FY2009: 754 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/country/geo_country_funding_FY2009.csv


📦 country funding FY2010: 744 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/country/geo_country_funding_FY2010.csv
📦 country funding FY2011: 284 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/country/geo_country_funding_FY2011.csv
📦 country funding FY2012: 378 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/country/geo_country_funding_FY2012.csv


📦 country funding FY2013: 387 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/country/geo_country_funding_FY2013.csv
📦 country funding FY2014: 344 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/country/geo_country_funding_FY2014.csv


📦 country funding FY2015: 425 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/country/geo_country_funding_FY2015.csv
📦 country funding FY2016: 413 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/country/geo_country_funding_FY2016.csv


📦 country funding FY2017: 534 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/country/geo_country_funding_FY2017.csv
📦 country funding FY2018: 491 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/country/geo_country_funding_FY2018.csv


📦 country funding FY2019: 503 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/country/geo_country_funding_FY2019.csv
📦 country funding FY2020: 461 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/country/geo_country_funding_FY2020.csv


📦 country funding FY2021: 471 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/country/geo_country_funding_FY2021.csv
📦 country funding FY2022: 518 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/country/geo_country_funding_FY2022.csv


📦 country funding FY2023: 549 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/country/geo_country_funding_FY2023.csv
📦 country funding FY2024: 626 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/country/geo_country_funding_FY2024.csv
📝 updated failures for funding/country: 4 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/country/failures_funding_country.csv


### Phase 2: Awarding Agency Collection
**Complete awarding agency perspective** across all geographic layers:

#### 2.1 All Layers Simultaneous Collection
```python
run_geography(mode="initial", types=["awarding"], layers=["country","state","county","district"])
run_geography(mode="retry", types=["awarding"], layers=["country","state","county","district"])
```

#### Strategic Considerations:
- **Dual Perspective Completion**: Funding vs awarding agency roles captured
- **Parallel Layer Processing**: All 4 layers processed simultaneously for efficiency
- **Resource Intensive**: Peak memory and API usage during this phase
- **Final Data Completeness**: Ensures comprehensive agency-geography matrix

### Total Collection Scope
**Comprehensive data matrix dimensions:**
- **Agency Types**: 2 (funding, awarding)
- **Geographic Layers**: 4 (country, state, county, district)  
- **Fiscal Years**: 17 (FY2008-FY2024)
- **Quarters per Year**: 4
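- **Agencies**: 110 top-tier federal agencies (per the `toptier_lookup` roster in the run logs)
- **Total API Requests**: ~60K calls across the full collection (110 agencies × 68 quarters × 4 layers × 2 agency types = 59,840), before retries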

### Expected Outputs
Upon completion, the system generates:
- **136 CSV files**: 17 years × 4 layers × 2 agency types
- **Failure tracking files**: Per type-layer combination for any remaining issues
- **Total data volume**: Dominated by the county layer, whose funding files reach ~36K rows per fiscal year in the logged run

In [17]:
run_geography(mode="retry", types=["funding"], layers=["country"], start_fy=2008, end_fy=2024)

✅ toptier_lookup: 110 agencies (sample: [{'code': '247', 'name': '400 Years of African-American History Commission'}, {'code': '310', 'name': 'Access Board'}, {'code': '302', 'name': 'Administrative Conference of the U.S.'}])
✅ No failures to retry for funding/country.
🧹 no failures remain for funding/country; cleared /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/funding/country/failures_funding_country.csv


## 📊 Data Output Structure & Standardized Schema

**Consistent data format** across all agency-geography combinations:

### Standardized Row Schema
Each output row contains the following standardized columns:

#### Geographic Identifiers
- **`code`**: Geographic code (country code, state FIPS, county FIPS, district ID)
- **`name`**: Display name for the geographic entity
- **`population`**: Population data when provided by API

#### Financial Data
- **`amount`**: Federal obligations/spending amount (numeric, coerced)

#### Temporal Dimensions
- **`fy`**: Fiscal year (2008-2024)
- **`quarter`**: Quarter number (1-4)

#### Agency Attribution
- **`agency_type`**: "funding" or "awarding" - agency role perspective
- **`agency_code`**: 3-digit top-tier agency code
- **`agency_name`**: Official federal agency name

#### Geographic Classification
- **`geo_layer`**: "country"|"state"|"county"|"district"

#### County-Specific Enhancements
For county-layer records only:
- **`state_code`**: First 2 digits of county FIPS code
- **`state_name`**: State name derived from FIPS mapping

### Example Output Row
```csv
code,name,amount,population,fy,quarter,agency_type,agency_code,agency_name,geo_layer,state_code,state_name
01001,Autauga County,1250000.50,58539,2024,1,funding,012,Department of Agriculture,county,01,Alabama
```
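
When reading these files back, codes should stay strings so leading zeros are preserved. The sketch below parses the example row with the standard `csv` module; `read_rows` is a hypothetical helper. With pandas, passing `dtype={"code": str, "agency_code": str, "state_code": str}` to `read_csv` serves the same purpose.

```python
import csv
import io

SAMPLE = (
    "code,name,amount,population,fy,quarter,agency_type,agency_code,"
    "agency_name,geo_layer,state_code,state_name\n"
    "01001,Autauga County,1250000.50,58539,2024,1,funding,012,"
    "Department of Agriculture,county,01,Alabama\n"
)


def read_rows(text):
    """Parse output CSV text, converting only the numeric columns."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        rec["amount"] = float(rec["amount"]) if rec["amount"] else None
        rec["fy"] = int(rec["fy"])
        rec["quarter"] = int(rec["quarter"])
        # code, agency_code, and state_code stay as strings on purpose.
        rows.append(rec)
    return rows


row = read_rows(SAMPLE)[0]
```

A quick consistency check on county rows is that `row["code"][:2]` equals `row["state_code"]`.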

In [18]:
run_geography(mode="initial", types=["awarding"], layers=["country","state","county","district"], start_fy=2008, end_fy=2024)

Streaming output truncated to the last 5000 lines.
↻ retry country awarding 088 FY2016 Q3 (1/2) after ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
✅ country awarding 088 FY2014 Q3: 0 rows
↻ retry country awarding 088 FY2017 Q1 (1/2) after ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
↻ retry country awarding 088 FY2016 Q4 (1/2) after ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
✅ country awarding 088 FY2015 Q1: 0 rows
✅ country awarding 088 FY2015 Q2: 0 rows
✅ country awarding 088 FY2015 Q4: 0 rows
✅ country awarding 088 FY2015 Q3: 0 rows
↻ retry country awarding 088 FY2017 Q2 (1/2) after ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
↻ retry country awarding 088 FY2017 Q3 (1/2) after ConnectionError: ('Connection abort

📦 country awarding FY2009: 934 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/country/geo_country_awarding_FY2009.csv
📦 country awarding FY2010: 957 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/country/geo_country_awarding_FY2010.csv
📦 country awarding FY2011: 936 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/country/geo_country_awarding_FY2011.csv


📦 country awarding FY2012: 949 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/country/geo_country_awarding_FY2012.csv
📦 country awarding FY2013: 412 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/country/geo_country_awarding_FY2013.csv
📦 country awarding FY2014: 199 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/country/geo_country_awarding_FY2014.csv


📦 country awarding FY2015: 218 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/country/geo_country_awarding_FY2015.csv


📦 country awarding FY2016: 216 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/country/geo_country_awarding_FY2016.csv
📦 country awarding FY2017: 204 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/country/geo_country_awarding_FY2017.csv
📦 country awarding FY2018: 261 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/country/geo_country_awarding_FY2018.csv


📦 country awarding FY2019: 363 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/country/geo_country_awarding_FY2019.csv
📦 country awarding FY2020: 365 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/country/geo_country_awarding_FY2020.csv
📦 country awarding FY2021: 372 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/country/geo_country_awarding_FY2021.csv
📦 country awarding FY2022: 412 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/country/geo_country_awarding_FY2022.csv
📦 country awarding FY2023: 409 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/country/geo_country_awarding_FY2023.csv


Streaming output truncated to the last 5000 lines.
↻ retry state awarding 487 FY2020 Q4 (1/2) after ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
↻ retry state awarding 487 FY2021 Q1 (1/2) after ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
✅ state awarding 487 FY2019 Q1: 0 rows
✅ state awarding 487 FY2018 Q2: 0 rows
✅ state awarding 487 FY2018 Q4: 0 rows
✅ state awarding 487 FY2018 Q3: 0 rows
✅ state awarding 487 FY2019 Q3: 0 rows
✅ state awarding 487 FY2019 Q2: 0 rows
✅ state awarding 487 FY2019 Q4: 0 rows
↻ retry state awarding 487 FY2021 Q2 (1/2) after ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
↻ retry state awarding 487 FY2021 Q3 (1/2) after ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
✅ state awarding 487 FY202



📦 state awarding FY2008: 671 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/state/geo_state_awarding_FY2008.csv
📦 state awarding FY2009: 736 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/state/geo_state_awarding_FY2009.csv
📦 state awarding FY2010: 785 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/state/geo_state_awarding_FY2010.csv




📦 state awarding FY2011: 800 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/state/geo_state_awarding_FY2011.csv
📦 state awarding FY2012: 813 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/state/geo_state_awarding_FY2012.csv
📦 state awarding FY2013: 828 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/state/geo_state_awarding_FY2013.csv




📦 state awarding FY2014: 847 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/state/geo_state_awarding_FY2014.csv
📦 state awarding FY2015: 872 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/state/geo_state_awarding_FY2015.csv




📦 state awarding FY2016: 953 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/state/geo_state_awarding_FY2016.csv
📦 state awarding FY2017: 1012 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/state/geo_state_awarding_FY2017.csv




📦 state awarding FY2018: 974 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/state/geo_state_awarding_FY2018.csv
📦 state awarding FY2019: 1003 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/state/geo_state_awarding_FY2019.csv




📦 state awarding FY2020: 993 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/state/geo_state_awarding_FY2020.csv
📦 state awarding FY2021: 948 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/state/geo_state_awarding_FY2021.csv




📦 state awarding FY2022: 957 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/state/geo_state_awarding_FY2022.csv




Streaming output truncated to the last 5000 lines.
✅ county awarding 487 FY2017 Q4: 0 rows
✅ county awarding 487 FY2018 Q1: 0 rows
✅ county awarding 487 FY2018 Q3: 0 rows
↻ retry county awarding 487 FY2019 Q1 (1/2) after ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
↻ retry county awarding 487 FY2019 Q2 (1/2) after ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
✅ county awarding 487 FY2018 Q4: 0 rows
↻ retry county awarding 487 FY2019 Q3 (1/2) after ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
↻ retry county awarding 487 FY2019 Q4 (1/2) after ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
↻ retry county awarding 487 FY2020 Q1 (1/2) after ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connec



📦 county awarding FY2008: 29732 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/county/geo_county_awarding_FY2008.csv




📦 county awarding FY2009: 45469 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/county/geo_county_awarding_FY2009.csv




📦 county awarding FY2010: 47559 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/county/geo_county_awarding_FY2010.csv




📦 county awarding FY2011: 46795 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/county/geo_county_awarding_FY2011.csv




📦 county awarding FY2012: 45948 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/county/geo_county_awarding_FY2012.csv




📦 county awarding FY2013: 46911 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/county/geo_county_awarding_FY2013.csv




📦 county awarding FY2014: 39794 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/county/geo_county_awarding_FY2014.csv




📦 county awarding FY2015: 40902 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/county/geo_county_awarding_FY2015.csv




📦 county awarding FY2016: 41017 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/county/geo_county_awarding_FY2016.csv




📦 county awarding FY2017: 48187 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/county/geo_county_awarding_FY2017.csv




📦 county awarding FY2018: 48259 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/county/geo_county_awarding_FY2018.csv




📦 county awarding FY2019: 48321 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/county/geo_county_awarding_FY2019.csv




📦 county awarding FY2020: 61572 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/county/geo_county_awarding_FY2020.csv




📦 county awarding FY2021: 61134 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/county/geo_county_awarding_FY2021.csv




📦 county awarding FY2022: 60441 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/county/geo_county_awarding_FY2022.csv




📦 county awarding FY2023: 58334 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/county/geo_county_awarding_FY2023.csv




Streaming output truncated to the last 5000 lines.
✅ district awarding 3301 FY2024 Q3: 0 rows
✅ district awarding 3301 FY2024 Q2: 0 rows
✅ district awarding 3301 FY2024 Q4: 0 rows
↻ retry district awarding 387 FY2008 Q2 (1/2) after ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
✅ district awarding 387 FY2008 Q1: 0 rows
↻ retry district awarding 387 FY2008 Q3 (1/2) after ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
↻ retry district awarding 387 FY2008 Q4 (1/2) after ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
↻ retry district awarding 387 FY2009 Q1 (1/2) after ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
✅ district awarding 387 FY2008 Q3: 0 rows
✅ district awarding 387 FY2008 Q2: 0 rows
✅ district awarding 387 FY2008 Q



📦 district awarding FY2008: 12099 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/district/geo_district_awarding_FY2008.csv




📦 district awarding FY2009: 12028 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/district/geo_district_awarding_FY2009.csv




📦 district awarding FY2010: 14146 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/district/geo_district_awarding_FY2010.csv
📦 district awarding FY2011: 12134 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/district/geo_district_awarding_FY2011.csv




📦 district awarding FY2012: 12764 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/district/geo_district_awarding_FY2012.csv
📦 district awarding FY2013: 12706 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/district/geo_district_awarding_FY2013.csv




📦 district awarding FY2014: 12695 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/district/geo_district_awarding_FY2014.csv
📦 district awarding FY2015: 11236 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/district/geo_district_awarding_FY2015.csv




📦 district awarding FY2016: 11397 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/district/geo_district_awarding_FY2016.csv




📦 district awarding FY2017: 12890 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/district/geo_district_awarding_FY2017.csv




📦 district awarding FY2018: 12828 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/district/geo_district_awarding_FY2018.csv




📦 district awarding FY2019: 12641 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/district/geo_district_awarding_FY2019.csv
📦 district awarding FY2020: 12712 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/district/geo_district_awarding_FY2020.csv




📦 district awarding FY2021: 12837 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/district/geo_district_awarding_FY2021.csv




📦 district awarding FY2022: 12605 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/district/geo_district_awarding_FY2022.csv




📦 district awarding FY2023: 12556 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/district/geo_district_awarding_FY2023.csv
📦 district awarding FY2024: 13054 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/district/geo_district_awarding_FY2024.csv
📝 updated failures for awarding/district: 9 rows → /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/district/failures_awarding_district.csv




## 🎯 Final Execution & System Completion

**Final collection pass** that completes the agency-geography matrix and validates the result:

### Final Collection Commands
The notebook finishes with an initial pass and then a retry pass over all four awarding-agency layers:

#### All Awarding Agency Layers (Simultaneous)
```python
run_geography(mode="initial", types=["awarding"], 
              layers=["country","state","county","district"])
run_geography(mode="retry", types=["awarding"], 
              layers=["country","state","county","district"])
```

### Collection Validation Checkpoints
After completion, verify:

#### File Structure Completeness
- **136 CSV files**: 17 years × 4 layers × 2 agency types
- **Failure file status**: No `failures_*.csv` files should remain; a retry run clears each one once its layer has no outstanding failures
- **Directory structure**: Proper funding/awarding subdivision
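
The file-count check above can be automated. The following is a minimal sketch, assuming the `geo_{layer}_{type}_FY{year}.csv` naming and the `failures_{type}_{layer}.csv` location seen in the run logs; the helper names `expected_files` and `check_completeness` are hypothetical and not part of the notebook.

```python
from pathlib import Path

AGENCY_TYPES = ["funding", "awarding"]
LAYERS = ["country", "state", "county", "district"]
FISCAL_YEARS = range(2008, 2025)  # FY2008 through FY2024 inclusive


def expected_files(root):
    """Yield every per-FY CSV path the collection should have produced."""
    root = Path(root)
    for agency_type in AGENCY_TYPES:
        for layer in LAYERS:
            for fy in FISCAL_YEARS:
                name = f"geo_{layer}_{agency_type}_FY{fy}.csv"
                yield root / agency_type / layer / name


def check_completeness(root):
    """Return (missing result files, leftover failure files) under root."""
    missing = [p for p in expected_files(root) if not p.is_file()]
    leftovers = sorted(Path(root).glob("*/*/failures_*.csv"))
    return missing, leftovers
```

Calling `check_completeness(...)` on the `geography_by_agency/` directory should return two empty lists when all 136 files exist and every failure file has been cleared.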

#### Data Quality Validation
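- **Row counts**: Volumes in line with the awarding run logs (a few hundred rows per FY for country, roughly 700-1,000 for state, about 11,000-14,000 for district, and about 30,000-62,000 for county)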
- **Schema consistency**: All files have standardized columns
- **Temporal coverage**: Complete FY2008-2024 representation
- **Agency coverage**: Every top-tier agency in the roster accounted for (`toptier_lookup` reports 110 agencies in this run)
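
The row-count and schema checks can be scripted without loading whole files into memory. This is a rough sketch using only the standard library; `audit_csvs` and `schema_outliers` are hypothetical names, and treating the most common header as the reference schema is an assumption rather than the notebook's own rule.

```python
import csv
from pathlib import Path


def audit_csvs(root):
    """Map each result CSV to (header tuple, number of data rows)."""
    report = {}
    for path in sorted(Path(root).glob("*/*/geo_*.csv")):
        with path.open(newline="", encoding="utf-8") as fh:
            reader = csv.reader(fh)
            header = tuple(next(reader, ()))  # empty tuple for an empty file
            rows = sum(1 for _ in reader)
        report[path] = (header, rows)
    return report


def schema_outliers(report):
    """Return files whose header differs from the most common header."""
    headers = [header for header, _ in report.values()]
    if not headers:
        return []
    reference = max(set(headers), key=headers.count)
    return [p for p, (header, _) in report.items() if header != reference]
```

Files with zero data rows, or any path returned by `schema_outliers`, are the ones worth opening by hand.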

### Production Readiness Indicators
- ✅ **Zero failure files remaining**
- ✅ **Complete temporal coverage (17 years)**
- ✅ **All geographic layers populated**
- ✅ **Both agency perspectives captured**
- ✅ **Standardized schema across all outputs**

### Next Steps: Analysis & Integration
The completed dataset enables:
- **Multi-dimensional analysis**: Agency × Geography × Time cube queries
- **Comparative studies**: Funding vs awarding agency spending patterns
- **Geographic analysis**: Spending distribution across jurisdictions
- **Temporal trends**: 17-year spending evolution by location and agency

In [20]:
run_geography(mode="retry", types=["awarding"], layers=["country","state","county","district"], start_fy=2008, end_fy=2024)

✅ toptier_lookup: 110 agencies (sample: [{'code': '247', 'name': '400 Years of African-American History Commission'}, {'code': '310', 'name': 'Access Board'}, {'code': '302', 'name': 'Administrative Conference of the U.S.'}])
✅ No failures to retry for awarding/country.
🧹 no failures remain for awarding/country; cleared /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/country/failures_awarding_country.csv
✅ No failures to retry for awarding/state.
🧹 no failures remain for awarding/state; cleared /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/state/failures_awarding_state.csv
✅ No failures to retry for awarding/county.
🧹 no failures remain for awarding/county; cleared /content/drive/MyDrive/USASpendingResults/geography/geography_by_agency/awarding/county/failures_awarding_county.csv
✅ No failures to retry for awarding/district.
🧹 no failures remain for awarding/district; cleared /content/drive/MyDrive/USASpendingResults/g

## ⚡ Performance Optimization & Troubleshooting

**Production deployment considerations** for large-scale agency-geography collection:

### Performance Tuning Guidelines

#### Worker Optimization
- **Heavy Layers**: Reduce county/district workers if hitting API rate limits
- **Light Layers**: Increase country/state workers for faster completion
- **Memory Pressure**: Lower concurrent workers to reduce memory usage
- **API Rate Limits**: Monitor 429 responses and adjust accordingly

#### Connection Pool Tuning
```python
# Optimize for high-concurrency workloads
SESSION = setup_session(pool_maxsize=MAX_WORKERS + 10)
```

#### Request Throttling
```python
# Add throttling for rate-limited environments
time.sleep(0.05)  # caps each worker thread at ~20 requests/second
```
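
A fixed `time.sleep` inside each worker limits that worker only, so the aggregate rate still scales with the number of threads. If a global cap is needed, one option is a small shared limiter like the sketch below. `RateLimiter` and `LIMITER` are hypothetical names, not part of the notebook, and the 20 requests/second figure is carried over from the snippet above rather than from any documented API limit.

```python
import threading
import time


class RateLimiter:
    """Space out calls across all threads to at most `rate` per second."""

    def __init__(self, rate):
        self._interval = 1.0 / rate
        self._lock = threading.Lock()
        self._next_slot = time.monotonic()

    def wait(self):
        # Reserve the next free time slot under the lock, then sleep outside
        # it so other threads can reserve their own slots in the meantime.
        with self._lock:
            now = time.monotonic()
            slot = max(now, self._next_slot)
            self._next_slot = slot + self._interval
        delay = slot - now
        if delay > 0:
            time.sleep(delay)


LIMITER = RateLimiter(rate=20)  # one shared instance for every worker
```

Each worker would call `LIMITER.wait()` immediately before issuing its request.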

### Common Issues & Solutions

#### API Rate Limiting (429 Errors)
- **Reduce worker counts** in TYPE_WORKERS/LAYER_WORKERS
- **Add request delays** between submissions
- **Monitor failure files** for systematic 429 patterns
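
For the request delays mentioned above, a common pattern is exponential backoff with jitter that defers to a `Retry-After` header when the server sends one. The sketch below assumes nothing about whether USASpending.gov actually returns that header; `backoff_delay` and its `base` and `cap` defaults are illustrative only.

```python
import random


def backoff_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait before retry number `attempt` (1-based)."""
    if retry_after is not None:
        try:
            # Retry-After given as a number of seconds
            return min(cap, max(0.0, float(retry_after)))
        except ValueError:
            pass  # header held an HTTP date; fall back to computed backoff
    ceiling = min(cap, base * 2 ** (attempt - 1))
    return random.uniform(0, ceiling)  # full jitter spreads out retries
```

A worker that receives a 429 could then sleep for `backoff_delay(attempt, response.headers.get("Retry-After"))` before trying again.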

#### Memory Issues
- **Process layers sequentially** instead of in parallel
- **Implement streaming writes** for large datasets
- **Clear DataFrames** after each FY processing
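
As an illustration of streaming writes, rows can be appended to the per-FY CSV as each batch arrives instead of being accumulated in memory for a single write. This is a standard-library sketch, not the notebook's writer; `append_rows` is a hypothetical helper, and an initial (overwrite) run would need to delete the target file first.

```python
import csv
from pathlib import Path


def append_rows(path, rows, fieldnames):
    """Append dict rows to a CSV, writing the header only for a new file."""
    path = Path(path)
    is_new = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as fh:
        # extrasaction="ignore" drops keys outside the standardized schema
        writer = csv.DictWriter(fh, fieldnames=fieldnames, extrasaction="ignore")
        if is_new:
            writer.writeheader()
        writer.writerows(rows)
```

Because each call holds only one batch of rows, peak memory stays roughly constant regardless of how large the county-level files grow.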

#### Network Timeouts
- **Increase TIMEOUT_S** for slow network conditions
- **Adjust MAX_ATTEMPTS_EXC** for retry-heavy environments
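- **Monitor connection pool exhaustion** (urllib3 logs `Connection pool is full, discarding connection` when `pool_maxsize` is too small for the worker count)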

### Monitoring & Alerting
- **Progress tracking**: Monitor console output for completion rates
- **Failure analysis**: Regular review of failure CSV patterns
- **Resource monitoring**: Track memory and CPU usage during execution
- **API health**: Monitor USASpending.gov status and response times
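
Progress tracking and failure analysis can be partly automated by parsing the console output. The sketch below assumes the `✅ <layer> <type> <agency> FY<year> Q<n>: <k> rows` and `↻ retry ... after <ErrorName>` line formats shown in the run logs; `summarize_log` is a hypothetical helper, and lines cut off by output truncation are simply skipped.

```python
import re
from collections import Counter

RETRY_RE = re.compile(
    r"↻ retry (?P<layer>\w+) (?P<type>\w+) (?P<agency>\S+) "
    r"FY(?P<fy>\d{4}) Q(?P<q>[1-4]) \(\d+/\d+\) after (?P<error>\w+)"
)
DONE_RE = re.compile(
    r"✅ (?P<layer>\w+) (?P<type>\w+) \S+ FY\d{4} Q[1-4]: \d+ rows"
)


def summarize_log(lines):
    """Count completed requests and retries per layer and agency type."""
    done, retries = Counter(), Counter()
    for line in lines:
        match = RETRY_RE.search(line)
        if match:
            retries[(match["layer"], match["type"], match["error"])] += 1
            continue
        match = DONE_RE.search(line)
        if match:
            done[(match["layer"], match["type"])] += 1
    return done, retries
```

A retry counter dominated by one error type on one layer, such as `ConnectionError` on `county`, suggests lowering that layer's worker count before rerunning.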