# USDA Ingestion Pipeline - Complete Testing

This notebook tests the complete USDA ETL pipeline, one step at a time:
1. **Environment Setup**: Configure PYTHONPATH and imports
2. **Database Connection**: Verify connectivity
3. **Commodity Mapper**: Test USDA code lookups
4. **Extract**: Fetch data from USDA NASS API
5. **Transform**: Clean and normalize data
6. **Load**: Insert into database
7. **Verification**: Query and confirm results

**Goal**: Demonstrate a full working USDA ingestion pipeline with output ✓

## Step 1: Environment Setup

In [1]:
import os
import sys
from pathlib import Path
import pandas as pd
from datetime import datetime

# Configure PYTHONPATH for namespace packages
workspace_root = Path(r'c:\Users\meili\forked\ca-biositing')
sys.path.insert(0, str(workspace_root / 'src' / 'ca_biositing' / 'pipeline'))
sys.path.insert(0, str(workspace_root / 'src' / 'ca_biositing' / 'datamodels'))
sys.path.insert(0, str(workspace_root / 'src' / 'ca_biositing' / 'webservice'))
os.chdir(str(workspace_root))

# Load environment variables
from dotenv import load_dotenv
load_dotenv(workspace_root / '.env')

print("✓ Environment configured")
print(f"✓ Working directory: {os.getcwd()}")
print(f"✓ DATABASE_URL loaded: {bool(os.getenv('DATABASE_URL'))}")
print(f"✓ USDA_NASS_API_KEY loaded: {bool(os.getenv('USDA_NASS_API_KEY'))}")

✓ Environment configured
✓ Working directory: c:\Users\meili\forked\ca-biositing
✓ DATABASE_URL loaded: True
✓ USDA_NASS_API_KEY loaded: True
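
The flags above only report whether each variable is set. A minimal sketch of a stricter check follows; `missing_env_vars` is a hypothetical helper (not part of the pipeline) that lists which required names are absent or blank, so the notebook can stop before any database or API call is attempted:

```python
import os

REQUIRED_VARS = ("DATABASE_URL", "USDA_NASS_API_KEY")  # names used in this notebook


def missing_env_vars(names, env=None):
    """Return the names that are unset or blank in the given mapping (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in names if not (env.get(name) or "").strip()]


# Example with an explicit mapping, so the result does not depend on the real environment
example_env = {"DATABASE_URL": "postgresql://example", "USDA_NASS_API_KEY": ""}
print(missing_env_vars(REQUIRED_VARS, example_env))
```

Calling `missing_env_vars(REQUIRED_VARS)` with no second argument checks the real environment after `load_dotenv` has run.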


## Step 2: Test Database Connection

In [2]:
from sqlalchemy import create_engine, text

engine = create_engine(os.getenv('DATABASE_URL'))

try:
    with engine.connect() as conn:
        result = conn.execute(text("SELECT version();"))
        version = result.fetchone()[0]
        print("✓ Database connected")
        print(f"  PostgreSQL version: {version[:60]}...")
except Exception as e:
    print(f"✗ Database connection failed: {e}")
    raise

✓ Database connected
  PostgreSQL version: PostgreSQL 13.5 (Debian 13.5-1.pgdg110+1) on x86_64-pc-linux...


## Step 3: Test Commodity Mapper

In [3]:
from ca_biositing.pipeline.utils.commodity_mapper import get_mapped_commodity_ids

print("Testing Commodity Mapper:")
print("="*50)

try:
    commodity_codes = get_mapped_commodity_ids()
    print(f"✓ Retrieved {len(commodity_codes)} commodity codes:")
    for idx, code in enumerate(commodity_codes[:5]):
        print(f"  - Code {idx + 1}: {code}")
except Exception as e:
    print(f"✗ Commodity mapper failed: {e}")
    raise

Testing Commodity Mapper:
✓ Retrieved 4 commodity codes:
  - Code 1: 11199199
  - Code 2: 37899999
  - Code 3: 10199999
  - Code 4: 26199999


## Step 4: Test USDA Extract (Fetch from API)

In [11]:

import requests
import time
from ca_biositing.pipeline.utils.nass_config import PRIORITY_COUNTIES

print("Testing USDA API - North San Joaquin Valley County-Level Data:")
print("="*60)

api_key = os.getenv('USDA_NASS_API_KEY')

# Map FIPS codes to 3-digit county codes (API requires separate state + county)
fips_to_county_code = {
    "06077": "077",  # San Joaquin
    "06099": "099",  # Stanislaus  
    "06047": "047",  # Merced
}

results_by_county = {}

for county_name, fips_code in PRIORITY_COUNTIES.items():
    county_code = fips_to_county_code[fips_code]
    print(f"\n[{county_name}] FIPS: {fips_code} → County Code: {county_code}")
    
    # Use state_alpha + county_code (confirmed working from R package docs)
    params = {
        "key": api_key,
        "state_alpha": "CA",
        "county_code": county_code,  # 3-digit county code (077, 099, 047)
        "format": "JSON",
        "year": 2022  # Using 2022 since 2023 may not have complete data yet
    }
    
    try:
        resp = requests.get("https://quickstats.nass.usda.gov/api/api_GET", params=params, timeout=30)
        print(f"  Status: {resp.status_code}")
        
        data = resp.json()
        if isinstance(data, dict) and "data" in data:
            records = data["data"]
            print(f"  Records: {len(records)}")
            
            if len(records) > 0:
                results_by_county[county_name] = records
                commodities = {r['commodity_desc'] for r in records if r.get('commodity_desc')}
                print(f"  Commodities available: {', '.join(sorted(commodities)[:5])}...")
                
                # Show a sample (short_desc may be missing, so fall back to an empty string)
                sample = records[0]
                print(f"  Sample: {sample.get('commodity_desc')} - {(sample.get('short_desc') or '')[:50]}...")
        elif "error" in data:
            print(f"  Error: {data['error']}")
        else:
            print(f"  No data returned")
    except Exception as e:
        print(f"  Exception: {e}")
    
    time.sleep(1)

print(f"\n{'='*60}")
print("✓ County-level exploration complete!")
print(f"  Counties with data: {len(results_by_county)}")

# Combine all results into a single DataFrame
if results_by_county:
    all_records = []
    for county_name, records in results_by_county.items():
        all_records.extend(records)
    
    raw_data = pd.DataFrame(all_records)
    print(f"  Total records: {len(raw_data)}")
    print(f"  Unique commodities: {raw_data['commodity_desc'].nunique()}")
    
    print(f"\n  Sample:")
    print(raw_data[['year', 'county_name', 'commodity_desc', 'short_desc']].drop_duplicates().head(3).to_string(index=False))
else:
    print("  ⚠ No data found in any county")
    raw_data = pd.DataFrame()


Testing USDA API - North San Joaquin Valley County-Level Data:

[San Joaquin] FIPS: 06077 → County Code: 077
  Status: 200
  Records: 2233
  Commodities available: AG LAND, AG SERVICES, ALMONDS, ALPACAS, ANIMAL TOTALS...
  Sample: ANIMAL TOTALS - ANIMAL TOTALS, INCL PRODUCTS - SALES, MEASURED IN ...

[Stanislaus] FIPS: 06099 → County Code: 099
  Status: 200
  Records: 2102
  Commodities available: AG LAND, AG SERVICES, ALMONDS, ALPACAS, ANIMAL TOTALS...
  Sample: ANIMAL TOTALS - ANIMAL TOTALS, INCL PRODUCTS - SALES, MEASURED IN ...

[Merced] FIPS: 06047 → County Code: 047
  Status: 200
  Records: 2229
  Commodities available: AG LAND, AG SERVICES, ALMONDS, ALPACAS, ANIMAL TOTALS...
  Sample: ANIMAL TOTALS - ANIMAL TOTALS, INCL PRODUCTS - SALES, MEASURED IN ...

✓ County-level exploration complete!
  Counties with data: 3
  Total records: 6564
  Unique commodities: 191

  Sample:
 year county_name     commodity_desc                                               short_desc
 2022 
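
The extract cell hard-codes the FIPS-to-county-code mapping for three counties. A 5-digit county FIPS code is the 2-digit state code followed by the 3-digit county code, so the split can be derived rather than listed by hand. A minimal sketch, where `split_fips` and `join_fips` are hypothetical helper names that are not part of the pipeline:

```python
def split_fips(fips):
    """Split a 5-digit county FIPS string into (state_fips, county_code)."""
    fips = str(fips).strip()
    if len(fips) != 5 or not fips.isdigit():
        raise ValueError(f"Expected a 5-digit FIPS string, got {fips!r}")
    return fips[:2], fips[2:]


def join_fips(state_fips, county_code):
    """Rebuild the 5-digit geoid, zero-padding each part."""
    return str(state_fips).zfill(2) + str(county_code).zfill(3)


# The three priority counties used in this notebook
for fips in ("06077", "06099", "06047"):
    state, county = split_fips(fips)
    print(fips, "->", state, county, "->", join_fips(state, county))
```

Both parts stay strings throughout, which preserves the leading zero in `06` that an integer conversion would drop.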

### Inspect raw data from API

In [12]:
print("="*80)
print("Inspecting Raw Data from USDA API")
print("="*80)

if 'raw_data' in locals() and len(raw_data) > 0:
    # CRITICAL: Filter to only the counties we requested
    # NOTE: API returns uppercase county names, so we need case-insensitive comparison
    priority_county_names = [name.upper() for name in PRIORITY_COUNTIES.keys()]
    print(f"\nüîç Filtering to priority counties (case-insensitive): {priority_county_names}")
    print(f"   Before filter: {len(raw_data)} records from counties: {raw_data['county_name'].unique().tolist()}")
    
    # Convert county_name to uppercase for comparison, then filter
    raw_data = raw_data[raw_data['county_name'].str.upper().isin(priority_county_names)].copy()
    print(f"   After filter: {len(raw_data)} records from counties: {raw_data['county_name'].unique().tolist()}")
    
    if len(raw_data) == 0:
        print("\n⚠️ WARNING: No records found for priority counties after filtering!")
        print("   This means the API returned data for different counties than requested.")
        print("   The NASS API state_alpha + county_code parameters may not be working as expected.")
    
    print(f"\n📊 DataFrame Shape: {raw_data.shape}")
    print(f"   Rows: {len(raw_data)}, Columns: {len(raw_data.columns)}")
    
    print(f"\n📋 Column Information:")
    raw_data.info()  # info() prints directly and returns None, so it is not wrapped in print()
    
    print(f"\n🔍 First 5 Rows:")
    pd.set_option('display.max_columns', None)
    pd.set_option('display.width', None)
    pd.set_option('display.max_colwidth', 50)
    print(raw_data.head())
    
    print(f"\nüìà Data Types:")
    print(raw_data.dtypes)
    
    print(f"\n‚ùå Missing Values:")
    missing = raw_data.isnull().sum()
    print(missing[missing > 0] if missing.sum() > 0 else "No missing values")
    
    print(f"\nüè∑Ô∏è Unique Values (key columns):")
    key_cols = ['commodity_desc', 'county_name', 'year', 'short_desc']
    for col in key_cols:
        if col in raw_data.columns:
            unique_count = raw_data[col].nunique()
            print(f"   {col}: {unique_count} unique values")
            if unique_count <= 10:
                print(f"      Values: {raw_data[col].unique().tolist()}")
    
    print(f"\nüìä Sample Value Ranges:")
    numeric_cols = raw_data.select_dtypes(include=['number']).columns
    for col in numeric_cols:
        print(f"   {col}: min={raw_data[col].min()}, max={raw_data[col].max()}")
    
    print(f"\n‚úÖ Sample Full Record (first row, all columns):")
    print(raw_data.iloc[0].to_string())
    
else:
    print("‚ö†Ô∏è No raw_data available to inspect")


DEBUG: Inspecting Raw Data from USDA API

🔍 Filtering to priority counties (case-insensitive): ['SAN JOAQUIN', 'STANISLAUS', 'MERCED']
   Before filter: 6564 records from counties: ['SAN JOAQUIN', 'STANISLAUS', 'MERCED']
   After filter: 6564 records from counties: ['SAN JOAQUIN', 'STANISLAUS', 'MERCED']

📊 DataFrame Shape: (6564, 39)
   Rows: 6564, Columns: 39

📋 Column Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6564 entries, 0 to 6563
Data columns (total 39 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   state_name             6564 non-null   object
 1   congr_district_code    6564 non-null   object
 2   end_code               6564 non-null   object
 3   county_ansi            6564 non-null   object
 4   reference_period_desc  6564 non-null   object
 5   group_desc             6564 non-null   object
 6   year                   6564 non-null   int64 
 7   unit_desc              6564 non-n

In [29]:
# Verify raw_data is ready for transform
print("Data ready for transform:")
print(f"  Rows: {len(raw_data)}")
print(f"  Columns: {list(raw_data.columns)}")
print(f"  Counties: {raw_data['county_name'].unique().tolist() if 'county_name' in raw_data.columns else 'N/A'}")

# The Data Wrangler will be opened with the variable below
raw_data

Data ready for transform:
  Rows: 6564
  Columns: ['state_name', 'congr_district_code', 'end_code', 'county_ansi', 'reference_period_desc', 'group_desc', 'year', 'unit_desc', 'domain_desc', 'Value', 'agg_level_desc', 'prodn_practice_desc', 'util_practice_desc', 'week_ending', 'state_alpha', 'load_time', 'zip_5', 'domaincat_desc', 'location_desc', 'class_desc', 'state_fips_code', 'freq_desc', 'commodity_desc', 'county_code', 'begin_code', 'source_desc', 'short_desc', 'country_name', 'state_ansi', 'CV (%)', 'watershed_code', 'watershed_desc', 'statisticcat_desc', 'asd_desc', 'county_name', 'region_desc', 'country_code', 'sector_desc', 'asd_code']
  Counties: ['SAN JOAQUIN', 'STANISLAUS', 'MERCED']


Unnamed: 0,state_name,congr_district_code,end_code,county_ansi,reference_period_desc,group_desc,year,unit_desc,domain_desc,Value,agg_level_desc,prodn_practice_desc,util_practice_desc,week_ending,state_alpha,load_time,zip_5,domaincat_desc,location_desc,class_desc,state_fips_code,freq_desc,commodity_desc,county_code,begin_code,source_desc,short_desc,country_name,state_ansi,CV (%),watershed_code,watershed_desc,statisticcat_desc,asd_desc,county_name,region_desc,country_code,sector_desc,asd_code
0,CALIFORNIA,,00,077,YEAR,ANIMAL TOTALS,2022,$,TOTAL,910695000,COUNTY,ALL PRODUCTION PRACTICES,ALL UTILIZATION PRACTICES,,CA,2024-07-02 12:00:00.000,,NOT SPECIFIED,"CALIFORNIA, SAN JOAQUIN VALLEY, SAN JOAQUIN",INCL PRODUCTS,06,ANNUAL,ANIMAL TOTALS,077,00,CENSUS,"ANIMAL TOTALS, INCL PRODUCTS - SALES, MEASURED...",UNITED STATES,06,(L),00000000,,SALES,SAN JOAQUIN VALLEY,SAN JOAQUIN,,9000,ANIMALS & PRODUCTS,51
1,CALIFORNIA,,00,077,YEAR,ANIMAL TOTALS,2022,OPERATIONS,TOTAL,560,COUNTY,ALL PRODUCTION PRACTICES,ALL UTILIZATION PRACTICES,,CA,2024-07-02 12:00:00.000,,NOT SPECIFIED,"CALIFORNIA, SAN JOAQUIN VALLEY, SAN JOAQUIN",INCL PRODUCTS,06,ANNUAL,ANIMAL TOTALS,077,00,CENSUS,"ANIMAL TOTALS, INCL PRODUCTS - OPERATIONS WITH...",UNITED STATES,06,14.7,00000000,,SALES,SAN JOAQUIN VALLEY,SAN JOAQUIN,,9000,ANIMALS & PRODUCTS,51
2,CALIFORNIA,,00,077,YEAR,AQUACULTURE,2022,$,TOTAL,(D),COUNTY,ALL PRODUCTION PRACTICES,ALL UTILIZATION PRACTICES,,CA,2024-07-02 12:00:00.000,,NOT SPECIFIED,"CALIFORNIA, SAN JOAQUIN VALLEY, SAN JOAQUIN",ALL CLASSES,06,ANNUAL,AQUACULTURE TOTALS,077,00,CENSUS,"AQUACULTURE TOTALS - SALES & DISTRIBUTION, MEA...",UNITED STATES,06,(D),00000000,,SALES & DISTRIBUTION,SAN JOAQUIN VALLEY,SAN JOAQUIN,,9000,ANIMALS & PRODUCTS,51
3,CALIFORNIA,,00,077,YEAR,AQUACULTURE,2022,OPERATIONS,TOTAL,2,COUNTY,ALL PRODUCTION PRACTICES,ALL UTILIZATION PRACTICES,,CA,2024-07-02 12:00:00.000,,NOT SPECIFIED,"CALIFORNIA, SAN JOAQUIN VALLEY, SAN JOAQUIN",ALL CLASSES,06,ANNUAL,AQUACULTURE TOTALS,077,00,CENSUS,AQUACULTURE TOTALS - OPERATIONS WITH SALES & D...,UNITED STATES,06,(L),00000000,,SALES & DISTRIBUTION,SAN JOAQUIN VALLEY,SAN JOAQUIN,,9000,ANIMALS & PRODUCTS,51
4,CALIFORNIA,,00,077,YEAR,AQUACULTURE,2022,$,TOTAL,(D),COUNTY,ALL PRODUCTION PRACTICES,ALL UTILIZATION PRACTICES,,CA,2024-07-02 12:00:00.000,,NOT SPECIFIED,"CALIFORNIA, SAN JOAQUIN VALLEY, SAN JOAQUIN",CATFISH,06,ANNUAL,FOOD FISH,077,00,CENSUS,"FOOD FISH, CATFISH - SALES & DISTRIBUTION, MEA...",UNITED STATES,06,(D),00000000,,SALES & DISTRIBUTION,SAN JOAQUIN VALLEY,SAN JOAQUIN,,9000,ANIMALS & PRODUCTS,51
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6559,CALIFORNIA,,00,047,YEAR,VEGETABLES,2022,ACRES,TOTAL,24700,COUNTY,IN THE OPEN,PROCESSING,,CA,2024-03-08 15:00:00.000,,NOT SPECIFIED,"CALIFORNIA, SAN JOAQUIN VALLEY, MERCED",ALL CLASSES,06,ANNUAL,TOMATOES,047,00,SURVEY,"TOMATOES, IN THE OPEN, PROCESSING - ACRES PLANTED",UNITED STATES,06,,00000000,,AREA PLANTED,SAN JOAQUIN VALLEY,MERCED,,9000,CROPS,51
6560,CALIFORNIA,,00,047,YEAR,VEGETABLES,2022,TONS / ACRE,TOTAL,44.15,COUNTY,IN THE OPEN,PROCESSING,,CA,2024-03-08 15:00:00.000,,NOT SPECIFIED,"CALIFORNIA, SAN JOAQUIN VALLEY, MERCED",ALL CLASSES,06,ANNUAL,TOMATOES,047,00,SURVEY,"TOMATOES, IN THE OPEN, PROCESSING - YIELD, MEA...",UNITED STATES,06,,00000000,,YIELD,SAN JOAQUIN VALLEY,MERCED,,9000,CROPS,51
6561,CALIFORNIA,,00,047,YEAR,VEGETABLES,2022,TONS,TOTAL,1086000,COUNTY,IN THE OPEN,"PROCESSING, UTILIZED",,CA,2024-03-08 15:00:00.000,,NOT SPECIFIED,"CALIFORNIA, SAN JOAQUIN VALLEY, MERCED",ALL CLASSES,06,ANNUAL,TOMATOES,047,00,SURVEY,"TOMATOES, IN THE OPEN, PROCESSING, UTILIZED - ...",UNITED STATES,06,,00000000,,PRODUCTION,SAN JOAQUIN VALLEY,MERCED,,9000,CROPS,51
6562,CALIFORNIA,,00,047,YEAR,EXPENSES,2022,$ / ACRE,TOTAL,325,COUNTY,IRRIGATED,ALL UTILIZATION PRACTICES,,CA,2022-08-26 15:00:22.000,,NOT SPECIFIED,"CALIFORNIA, SAN JOAQUIN VALLEY, MERCED","CASH, CROPLAND",06,ANNUAL,RENT,047,00,SURVEY,"RENT, CASH, CROPLAND, IRRIGATED - EXPENSE, MEA...",UNITED STATES,06,7.6,00000000,,EXPENSE,SAN JOAQUIN VALLEY,MERCED,,9000,ECONOMICS,51


## Step 5: Test USDA Transform (Clean Data)

### Transform Step: Map API data to database format

**What this does:**
1. Maps commodity names → commodity_code IDs (from usda_commodity table)
2. Creates Parameter records if they don't exist (YIELD, PRODUCTION, etc.)
3. Creates Unit records if they don't exist (BUSHELS, TONS, etc.)
4. Creates a single transformed DataFrame with all columns needed for both tables
5. Load step routes the data to two tables

**Output:** Single `transformed_data` DataFrame that load step uses to populate:
   - `UsdaCensusRecord` table (one per geoid+year+commodity)
   - `Observation` table (one per measurement)
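
The routing described above can be sketched in plain Python. This is an illustration under assumptions, not the pipeline's load step: plain dictionaries stand in for the two tables, and `route_rows` and `record_key` are hypothetical names. The sample rows reuse values that appear in the transformed data preview.

```python
def route_rows(rows):
    """Split flat transformed rows into parent records and child observations.

    One parent record is kept per (geoid, year, commodity_id); every input row
    becomes one observation that points at its parent's key.
    """
    records = {}       # (geoid, year, commodity_id) -> parent record fields
    observations = []  # one entry per measurement
    for row in rows:
        key = (row["geoid"], row["year"], row["commodity_id"])
        records.setdefault(key, {"geoid": key[0], "year": key[1], "commodity_id": key[2]})
        observations.append({
            "record_key": key,
            "parameter_id": row["parameter_id"],
            "unit_id": row["unit_id"],
            "value": row["value"],
        })
    return records, observations


rows = [
    {"geoid": "06077", "year": 2022, "commodity_id": 5, "parameter_id": 3, "unit_id": 3, "value": 14503.0},
    {"geoid": "06077", "year": 2022, "commodity_id": 5, "parameter_id": 2, "unit_id": 2, "value": 1345187.0},
    {"geoid": "06047", "year": 2022, "commodity_id": 8, "parameter_id": 3, "unit_id": 3, "value": 8528.0},
]
records, observations = route_rows(rows)
print(len(records), "records,", len(observations), "observations")
```

In a real load step the tuple key would presumably be replaced by the database-generated primary key of the parent row; that detail is outside what this sketch covers.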

In [25]:
from sqlalchemy import text
import pandas as pd
from sqlmodel import Session, select
from ca_biositing.datamodels.database import engine
from ca_biositing.datamodels.schemas.generated.ca_biositing import Parameter, Unit

print("Transform Step: Mapping API data to database schema")
print("="*70)

if 'raw_data' not in locals() or len(raw_data) == 0:
    print("⚠ No raw_data - run API extraction first")
else:
    # Print actual columns to debug
    print(f"Debug: Available columns in raw_data: {list(raw_data.columns)[:10]}...")
    
    # Define parameter/unit configurations (will be keyed by name for DB inserts)
    PARAMETER_CONFIGS = {
        'YIELD': 'Yield per unit area',
        'PRODUCTION': 'Total production quantity',
        'AREA HARVESTED': 'Area harvested',
        'PRICE RECEIVED': 'Price received by farmer',
    }
    
    UNIT_CONFIGS = {
        'BUSHELS': 'US bushels',
        'TONS': 'Short tons (US)',
        'ACRES': 'US acres',
        'DOLLARS': 'US dollars',
    }
    
    # Step 1: Ensure Parameter/Unit records exist (following coworker's pattern)
    print("Step 1: Creating Parameter/Unit records if needed...")
    with Session(engine) as session:
        # Get existing parameters
        existing_params = session.exec(select(Parameter.name)).all()
        existing_param_names = set(existing_params)
        
        # Add only new parameters
        params_to_add = []
        for param_name, param_desc in PARAMETER_CONFIGS.items():
            if param_name not in existing_param_names:
                param = Parameter(name=param_name, description=param_desc, calculated=False)
                params_to_add.append(param)
                existing_param_names.add(param_name)
        
        if params_to_add:
            session.add_all(params_to_add)
            print(f"  Adding {len(params_to_add)} new parameters")
        else:
            print(f"  All {len(PARAMETER_CONFIGS)} parameters already exist")
        
        # Get existing units
        existing_units = session.exec(select(Unit.name)).all()
        existing_unit_names = set(existing_units)
        
        # Add only new units
        units_to_add = []
        for unit_name, unit_desc in UNIT_CONFIGS.items():
            if unit_name not in existing_unit_names:
                unit = Unit(name=unit_name, description=unit_desc)
                units_to_add.append(unit)
                existing_unit_names.add(unit_name)
        
        if units_to_add:
            session.add_all(units_to_add)
            print(f"  Adding {len(units_to_add)} new units")
        else:
            print(f"  All {len(UNIT_CONFIGS)} units already exist")
        
        # Commit only if we added anything
        if params_to_add or units_to_add:
            session.commit()
            print(f"  ✓ Committed {len(params_to_add)} parameters, {len(units_to_add)} units")
    
    # Step 2: Map commodity names to IDs from database
    print("\nStep 2: Mapping commodity names to database IDs...")
    commodity_map = {}
    with engine.connect() as conn:
        result = conn.execute(text("SELECT id, name FROM usda_commodity"))
        for row in result:
            commodity_map[row.name.upper()] = row.id
    print(f"  Found {len(commodity_map)} commodities in database")
    
    # Step 3: Look up parameter_id and unit_id from database (by name)
    print("\nStep 3: Looking up parameter and unit IDs...")
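    from sqlalchemy import bindparam  # imported here so this step is self-contained

    parameter_id_map = {}
    unit_id_map = {}
    with engine.connect() as conn:
        # Bind the names as an expanding parameter instead of formatting them into the SQL string
        param_query = text("SELECT id, name FROM parameter WHERE name IN :names").bindparams(
            bindparam("names", expanding=True)
        )
        param_result = conn.execute(param_query, {"names": list(PARAMETER_CONFIGS.keys())})
        for row in param_result:
            parameter_id_map[row.name.upper()] = row.id
        
        unit_query = text("SELECT id, name FROM unit WHERE name IN :names").bindparams(
            bindparam("names", expanding=True)
        )
        unit_result = conn.execute(unit_query, {"names": list(UNIT_CONFIGS.keys())})
        for row in unit_result:
            unit_id_map[row.name.upper()] = row.id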
    print(f"  Found {len(parameter_id_map)} parameters, {len(unit_id_map)} units")
    
    # Step 4: Create single transformed dataframe
    print("\nStep 4: Creating transformed dataframe...")
    
    transformed_data = raw_data.copy()
    
    # Map NASS API columns to our schema
    # API returns: county_code (3-digit), state_fips_code (2-digit), statisticcat_desc, unit_desc, commodity_desc, Value, etc.
    # Do NOT rename county_code to geoid here; we'll construct geoid next as full 5-digit FIPS string.
    column_mapping = {
        # 'county_code': 'geoid',  # removed: county_code is 3-digit; geoid must be 5-digit (state+county)
        'commodity_desc': 'commodity',
        'statisticcat_desc': 'statistic',
        'unit_desc': 'unit',
        'Value': 'observation',
        'county_name': 'county',
        'short_desc': 'description',
        'year': 'year'
    }
    
    # Rename columns that exist
    rename_dict = {k: v for k, v in column_mapping.items() if k in transformed_data.columns}
    transformed_data = transformed_data.rename(columns=rename_dict)
    
    print(f"  Renamed columns: {rename_dict}")
    
    # Construct 5-digit FIPS geoid from state + county codes (keep as string)
    # Prefer state_fips_code (2-digit) + county_code (3-digit). Fallback to CA ('06') if only state_alpha is present.
    state_fips_default = '06'  # California
    if 'state_fips_code' in transformed_data.columns and 'county_code' in transformed_data.columns:
        transformed_data['geoid'] = transformed_data['state_fips_code'].astype(str).str.zfill(2) + \
                                    transformed_data['county_code'].astype(str).str.zfill(3)
    elif 'state_alpha' in transformed_data.columns and 'county_code' in transformed_data.columns:
        state_alpha_to_fips = {'CA': '06'}  # Extend if querying other states
        transformed_data['geoid'] = transformed_data['state_alpha'].map(
            lambda x: state_alpha_to_fips.get(str(x).upper(), state_fips_default)
        ).astype(str) + transformed_data['county_code'].astype(str).str.zfill(3)
    elif 'county_code' in transformed_data.columns:
        # Fallback: assume CA and just pad county_code
        transformed_data['geoid'] = state_fips_default + transformed_data['county_code'].astype(str).str.zfill(3)
    else:
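        print("  ⚠ Warning: 'county_code' not found; cannot construct geoid")
        transformed_data['geoid'] = None
    
    # Ensure geoid is a 5-character string; leave missing values as None so dropna() below removes those rows
    has_geoid = transformed_data['geoid'].notna()
    transformed_data.loc[has_geoid, 'geoid'] = transformed_data.loc[has_geoid, 'geoid'].astype(str).str.zfill(5)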
    
    # Map commodity names to IDs
    def get_commodity_id(name):
        if pd.isna(name):
            return None
        if name.upper() in commodity_map:
            return commodity_map[name.upper()]
        # Try partial match
        for db_name, db_id in commodity_map.items():
            if name.upper() in db_name or db_name in name.upper():
                return db_id
        return None
    
    if 'commodity' in transformed_data.columns:
        transformed_data['commodity_id'] = transformed_data['commodity'].apply(get_commodity_id)
    else:
        print("  ⚠ Warning: 'commodity' column not found")
        transformed_data['commodity_id'] = None
    
    # Map to parameter_id and unit_id from database (by name lookup)
    if 'statistic' in transformed_data.columns:
        transformed_data['parameter_id'] = transformed_data['statistic'].map(
            lambda x: parameter_id_map.get(x.upper()) if pd.notna(x) else None
        )
    
    if 'unit' in transformed_data.columns:
        transformed_data['unit_id'] = transformed_data['unit'].map(
            lambda x: unit_id_map.get(x.upper()) if pd.notna(x) else None
        )
    
    # Add metadata columns
    transformed_data['source_reference'] = 'USDA NASS QuickStats API'
    transformed_data['record_type'] = 'USDA'
    
    # Convert observation strings (with commas/decimals) to numeric float
    if 'observation' in transformed_data.columns:
        transformed_data['value'] = transformed_data['observation'].astype(str).str.replace(',', '')
        transformed_data['value'] = pd.to_numeric(transformed_data['value'], errors='coerce')
    
    # Coerce all ID columns to integers (nullable Int64 type)
    id_columns = ['commodity_id', 'parameter_id', 'unit_id']
    for col in id_columns:
        if col in transformed_data.columns:
            transformed_data[col] = pd.to_numeric(transformed_data[col], errors='coerce').astype('Int64')
    
    # Create note field
    transformed_data['note'] = transformed_data.apply(
        lambda row: f"{row.get('statistic', 'N/A')} in {row.get('unit', 'N/A')} for {row.get('commodity', 'N/A')} in {row.get('county', 'N/A')}", 
        axis=1
    )
    
    # Keep relevant columns (load step will create record_id FK)
    final_columns = [
        'geoid', 'year', 'commodity_id', 'source_reference',  # For UsdaCensusRecord
        'record_type', 'parameter_id', 'value', 'unit_id', 'note',  # For Observation
        'commodity', 'statistic', 'unit', 'county', 'description'  # Original for reference
    ]
    
    # Only include columns that exist
    final_columns = [col for col in final_columns if col in transformed_data.columns]
    transformed_data = transformed_data[final_columns]
    
    # Drop rows with missing required fields
    required_fields = ['geoid', 'year', 'commodity_id', 'parameter_id', 'unit_id', 'value']
    required_fields = [col for col in required_fields if col in transformed_data.columns]
    transformed_data = transformed_data.dropna(subset=required_fields)
    
    print("\n✓ Transform complete!")
    print(f"  Total rows: {len(transformed_data)}")
    print(f"  Columns: {list(transformed_data.columns)}")
    
    # Show data types for ID columns
    print(f"\nData types for ID columns:")
    for col in ['commodity_id', 'parameter_id', 'unit_id', 'value']:
        if col in transformed_data.columns:
            print(f"  {col}: {transformed_data[col].dtype}")
    
    print(f"\nSample record:")
    if len(transformed_data) > 0:
        sample = transformed_data.head(1).to_dict('records')[0]
        for key, val in sample.items():
            print(f"  {key}: {val} (type: {type(val).__name__})")
    else:
        print("  ⚠ No valid records after transformation")

Transform Step: Mapping API data to database schema
Debug: Available columns in raw_data: ['state_name', 'congr_district_code', 'end_code', 'county_ansi', 'reference_period_desc', 'group_desc', 'year', 'unit_desc', 'domain_desc', 'Value']...
Step 1: Creating Parameter/Unit records if needed...
  All 4 parameters already exist
  All 4 units already exist

Step 2: Mapping commodity names to database IDs...
  Found 4 commodities in database

Step 3: Looking up parameter and unit IDs...
  Found 4 parameters, 4 units

Step 4: Creating transformed dataframe...
  Renamed columns: {'commodity_desc': 'commodity', 'statisticcat_desc': 'statistic', 'unit_desc': 'unit', 'Value': 'observation', 'county_name': 'county', 'short_desc': 'description', 'year': 'year'}

✓ Transform complete!
  Total rows: 65
  Columns: ['geoid', 'year', 'commodity_id', 'source_reference', 'record_type', 'parameter_id', 'value', 'unit_id', 'note', 'commodity', 'statistic', 'unit', 'county', 'description']

Data types fo

In [26]:
# Display transformed_data in Data Wrangler
print("Preparing to display transformed_data in Data Wrangler...")
print(f"Shape: {transformed_data.shape}")
print(f"\nPreview (first 5 rows):")
print(transformed_data.head().to_string())

# The Data Wrangler will be opened with the variable below
transformed_data

Preparing to display transformed_data in Data Wrangler...
Shape: (65, 14)

Preview (first 5 rows):
     geoid  year  commodity_id          source_reference record_type  parameter_id      value  unit_id                                             note commodity       statistic   unit       county                                  description
310  06077  2022             5  USDA NASS QuickStats API        USDA             3    14503.0        3  AREA HARVESTED in ACRES for CORN in SAN JOAQUIN      CORN  AREA HARVESTED  ACRES  SAN JOAQUIN                CORN, GRAIN - ACRES HARVESTED
318  06077  2022             5  USDA NASS QuickStats API        USDA             3    51836.0        3  AREA HARVESTED in ACRES for CORN in SAN JOAQUIN      CORN  AREA HARVESTED  ACRES  SAN JOAQUIN               CORN, SILAGE - ACRES HARVESTED
326  06077  2022             5  USDA NASS QuickStats API        USDA             2  1345187.0        2       PRODUCTION in TONS for CORN in SAN JOAQUIN      CORN      PRODU

Unnamed: 0,geoid,year,commodity_id,source_reference,record_type,parameter_id,value,unit_id,note,commodity,statistic,unit,county,description
310,06077,2022,5,USDA NASS QuickStats API,USDA,3,14503.0,3,AREA HARVESTED in ACRES for CORN in SAN JOAQUIN,CORN,AREA HARVESTED,ACRES,SAN JOAQUIN,"CORN, GRAIN - ACRES HARVESTED"
318,06077,2022,5,USDA NASS QuickStats API,USDA,3,51836.0,3,AREA HARVESTED in ACRES for CORN in SAN JOAQUIN,CORN,AREA HARVESTED,ACRES,SAN JOAQUIN,"CORN, SILAGE - ACRES HARVESTED"
326,06077,2022,5,USDA NASS QuickStats API,USDA,2,1345187.0,2,PRODUCTION in TONS for CORN in SAN JOAQUIN,CORN,PRODUCTION,TONS,SAN JOAQUIN,"CORN, SILAGE - PRODUCTION, MEASURED IN TONS"
327,06077,2022,5,USDA NASS QuickStats API,USDA,3,14503.0,3,AREA HARVESTED in ACRES for CORN in SAN JOAQUIN,CORN,AREA HARVESTED,ACRES,SAN JOAQUIN,"CORN, GRAIN, IRRIGATED - ACRES HARVESTED"
329,06077,2022,5,USDA NASS QuickStats API,USDA,3,51644.0,3,AREA HARVESTED in ACRES for CORN in SAN JOAQUIN,CORN,AREA HARVESTED,ACRES,SAN JOAQUIN,"CORN, SILAGE, IRRIGATED - ACRES HARVESTED"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6031,06047,2022,8,USDA NASS QuickStats API,USDA,3,8528.0,3,AREA HARVESTED in ACRES for TOMATOES in MERCED,TOMATOES,AREA HARVESTED,ACRES,MERCED,"TOMATOES, IN THE OPEN - ACRES HARVESTED"
6525,06047,2022,5,USDA NASS QuickStats API,USDA,3,63200.0,3,AREA HARVESTED in ACRES for CORN in MERCED,CORN,AREA HARVESTED,ACRES,MERCED,"CORN, SILAGE - ACRES HARVESTED"
6526,06047,2022,5,USDA NASS QuickStats API,USDA,2,1612000.0,2,PRODUCTION in TONS for CORN in MERCED,CORN,PRODUCTION,TONS,MERCED,"CORN, SILAGE - PRODUCTION, MEASURED IN TONS"
6558,06047,2022,8,USDA NASS QuickStats API,USDA,3,24600.0,3,AREA HARVESTED in ACRES for TOMATOES in MERCED,TOMATOES,AREA HARVESTED,ACRES,MERCED,"TOMATOES, IN THE OPEN, PROCESSING - ACRES HARV..."
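
The `get_commodity_id` function in the transform cell falls back to substring matching, which can pair unrelated names (for example, `CORN` is a substring of `POPCORN`). A hedged alternative is sketched below: `match_commodity` is a hypothetical helper, not part of the pipeline, that accepts a fallback only when every word of one name appears as a whole word in the other. The map is illustrative; its IDs follow the preview above.

```python
import re


def _words(name):
    """Upper-case a name and split it into a set of alphabetic words."""
    return set(re.findall(r"[A-Z]+", name.upper()))


def match_commodity(name, commodity_map):
    """Return the ID for name: exact match first, then whole-word containment."""
    if not name:
        return None
    key = name.upper()
    if key in commodity_map:
        return commodity_map[key]
    name_words = _words(name)
    for db_name, db_id in commodity_map.items():
        db_words = _words(db_name)
        if db_words and name_words and (db_words <= name_words or name_words <= db_words):
            return db_id
    return None


commodity_map = {"CORN": 5, "TOMATOES": 8}  # illustrative; keys are upper-cased names
print(match_commodity("corn", commodity_map))         # exact match after upper-casing
print(match_commodity("CORN, GRAIN", commodity_map))  # whole-word fallback
print(match_commodity("POPCORN", commodity_map))      # substring only, so no match
```

Whole-word matching is still a heuristic. Logging every name that resolves through the fallback makes unexpected pairings easier to spot.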


In [32]:
# Check actual data types from NASS API for Value column
print("Investigating NASS API data types:")
print("="*60)

# 1) Raw Value dtype and sample values
print("\n1. Raw 'Value' column dtype and samples:")
try:
    raw_dtype = raw_data['Value'].dtype
    print(f"  Raw dtype: {raw_dtype}")
    print(f"  Sample values (first 10):")
    for idx, val in enumerate(raw_data['Value'].head(10)):
        print(f"    [{idx}] {repr(val)} (type: {type(val).__name__})")
except Exception as e:
    print(f"  ⚠ Unable to inspect raw_data['Value']: {e}")

# 2) String formatting patterns: commas, decimals, whitespace
print("\n2. String formatting patterns in 'Value':")
try:
    value_str = raw_data['Value'].astype(str)
    has_commas = value_str.str.contains(',').sum()
    has_decimals = value_str.str.contains(r'\.').sum()
    has_whitespace = value_str.str.contains(r'\s').sum()
    total = len(value_str)
    print(f"  With commas: {has_commas}/{total}")
    print(f"  With decimal point: {has_decimals}/{total}")
    print(f"  With whitespace: {has_whitespace}/{total}")
except Exception as e:
    print(f"  ⚠ Unable to analyze string patterns: {e}")

# 3) Coerce to numeric: remove commas, convert to float
print("\n3. Coercion to numeric float (remove commas, handle decimals):")
try:
    value_num = pd.to_numeric(value_str.str.replace(',', ''), errors='coerce')
    non_null = value_num.notna().sum()
    nulls = value_num.isna().sum()
    pct_numeric = round(100 * non_null / (non_null + nulls), 2) if (non_null + nulls) > 0 else 0.0
    print(f"  Converted dtype: {value_num.dtype}")
    print(f"  Numeric rows: {non_null}, Non-numeric (NaN): {nulls}, % numeric: {pct_numeric}%")
    if non_null > 0:
        print(f"  Range: min={value_num.min()}, max={value_num.max()}")
    # Show a few rows that failed conversion, if any
    if nulls > 0:
        failed_samples = value_str[value_num.isna()].head(5).tolist()
        print(f"  Samples that failed conversion: {failed_samples}")
except Exception as e:
    print(f"  ⚠ Unable to convert 'Value' to numeric: {e}")

# 4) Reference: usda_commodity table entries
print("\n4. usda_commodity table (reference):")
try:
    with engine.connect() as conn:
        result = conn.execute(text("SELECT id, name, usda_code FROM usda_commodity ORDER BY id"))
        print(f"  Total commodities in database:")
        for row in result:
            print(f"    ID: {row.id}, Name: {row.name}, USDA Code: {row.usda_code}")
except Exception as e:
    print(f"  ⚠ Unable to query usda_commodity: {e}")

Investigating NASS API data types:

1. Raw 'Value' column dtype and samples:
  Raw dtype: object
  Sample values (first 10):
    [0] '910,695,000' (type: str)
    [1] '560' (type: str)
    [2] '                 (D)' (type: str)
    [3] '2' (type: str)
    [4] '                 (D)' (type: str)
    [5] '1' (type: str)
    [6] '                 (D)' (type: str)
    [7] '1' (type: str)
    [8] '                 (D)' (type: str)
    [9] '1' (type: str)

2. String formatting patterns in 'Value':
  With commas: 2043/6564
  With decimal point: 30/6564
  With whitespace: 785/6564

3. Coercion to numeric float (remove commas, handle decimals):
  Converted dtype: float64
  Numeric rows: 5779, Non-numeric (NaN): 785, % numeric: 88.04%
  Range: min=-999000.0, max=17806949000.0
  Samples that failed conversion: ['                 (D)', '                 (D)', '                 (D)', '                 (D)', '                 (D)']

4. usda_commodity table (reference):
  Total commodities in database
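
The output above shows that `Value` arrives as strings with thousands separators and a whitespace-padded `(D)` marker. The following is a minimal sketch of a per-value parser, assuming that any parenthesised code should be treated as missing. `parse_nass_value` is a hypothetical helper, not part of the pipeline, and it passes negative numbers through unchanged.

```python
from typing import Optional


def parse_nass_value(raw: str) -> Optional[float]:
    """Convert a raw NASS 'Value' string to float, or None if it is not numeric."""
    cleaned = raw.strip().replace(",", "")
    # Assumption: codes wrapped in parentheses, such as "(D)", mean "no numeric value".
    if not cleaned or cleaned.startswith("("):
        return None
    try:
        return float(cleaned)
    except ValueError:
        return None


# Samples copied from the cell output above.
samples = ["910,695,000", "560", "                 (D)", "2"]
parsed = [parse_nass_value(s) for s in samples]
print(parsed)
```

Returning `None` rather than `0.0` keeps suppressed values distinguishable from true zeros when the column is later loaded.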

## Step 6: Test USDA Load (Insert to Database)

In [33]:
from sqlalchemy import text
from sqlmodel import Session
from ca_biositing.datamodels.schemas.generated.ca_biositing import UsdaCensusRecord, Observation

print("Load Step: Insert to UsdaCensusRecord and Observation tables")
print("="*70)

if 'transformed_data' not in locals() or transformed_data is None or len(transformed_data) == 0:
    print("⚠ No transformed_data - run transform step first")
else:
    try:
        session = Session(engine)
        
        try:
            # Step 1: Extract unique census records
            print("\nStep 1: Extracting unique census records...")
            census_df = transformed_data.groupby(['geoid', 'year', 'commodity_code']).first().reset_index()
            census_df = census_df[['geoid', 'year', 'commodity_code', 'source_reference']].drop_duplicates()
            print(f"  Found {len(census_df)} unique records to create")
            
            # Build lookup: (geoid, year, commodity_code) → UsdaCensusRecord.id
            census_id_map = {}
            census_inserted = 0
            census_skipped = 0
            
            # Step 2: Insert census records and collect their IDs
            print(f"Step 2: Loading {len(census_df)} unique census records...")
            for _, row in census_df.iterrows():
                try:
                    # Check if record already exists
                    existing = session.query(UsdaCensusRecord).filter(
                        UsdaCensusRecord.geoid == row['geoid'],
                        UsdaCensusRecord.year == int(row['year']),
                        UsdaCensusRecord.commodity_code == int(row['commodity_code'])
                    ).first()
                    
                    if existing:
                        census_id_map[(row['geoid'], int(row['year']), int(row['commodity_code']))] = existing.id
                        census_skipped += 1
                    else:
                        # Create new record
                        census_record = UsdaCensusRecord(
                            geoid=row['geoid'],
                            year=int(row['year']),
                            commodity_code=int(row['commodity_code']),
                            source_reference=row['source_reference'],
                            dataset_id=None,
                            etl_run_id=None,
                            lineage_group_id=None
                        )
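                        # Savepoint: a failed flush rolls back only this row, not the whole session
                        with session.begin_nested():
                            session.add(census_record)
                            session.flush()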
                        census_id_map[(row['geoid'], int(row['year']), int(row['commodity_code']))] = census_record.id
                        census_inserted += 1
                except Exception as e:
                    print(f"  ⚠ Error on census record: {e}")
                    census_skipped += 1
                    continue
            
            session.commit()
            print(f"  ✓ Census records: {census_inserted} inserted, {census_skipped} skipped")
            
            # Step 3: Load observations
            print(f"\nStep 3: Loading {len(transformed_data)} observation records...")
            obs_inserted = 0
            obs_skipped = 0
            
            for _, row in transformed_data.iterrows():
                try:
                    key = (row['geoid'], int(row['year']), int(row['commodity_code']))
                    if key not in census_id_map:
                        obs_skipped += 1
                        continue
                    
                    census_id = census_id_map[key]
                    # Don't set 'id' - let database auto-generate it
                    observation = Observation(
                        record_id=census_id,
                        record_type=row['record_type'],
                        parameter_id=int(row['parameter_id']),
                        value=float(row['value']),
                        unit_id=int(row['unit_id']),
                        note=row['note']
                    )
                    session.add(observation)
                    obs_inserted += 1
                except Exception as e:
                    print(f"  ⚠ Error on observation: {e}")
                    obs_skipped += 1
                    continue
            
            session.commit()
            print(f"  ✓ Observations: {obs_inserted} inserted, {obs_skipped} skipped")
            
            # Step 4: Verify
            print(f"\n✓ Load complete!")
            with engine.connect() as conn:
                census_count = conn.execute(text("""
                    SELECT COUNT(*) FROM usda_census_record 
                    WHERE source_reference = 'USDA NASS QuickStats API'
                """)).scalar()
                
                obs_count = conn.execute(text("""
                    SELECT COUNT(*) FROM observation o
                    JOIN usda_census_record c ON o.record_id = c.id
                    WHERE c.source_reference = 'USDA NASS QuickStats API'
                """)).scalar()
                
                print(f"  Total census records: {census_count}")
                print(f"  Total observations: {obs_count}")
        
        finally:
            session.close()
        
    except Exception as e:
        print(f"✗ Load failed: {e}")
        import traceback
        traceback.print_exc()

Load Step: Insert to UsdaCensusRecord and Observation tables

Step 1: Extracting unique census records...
✗ Load failed: 'commodity_code'


Traceback (most recent call last):
  File "C:\Users\meili\AppData\Local\Temp\ipykernel_34552\2462884913.py", line 17, in <module>
    census_df = transformed_data.groupby(['geoid', 'year', 'commodity_code']).first().reset_index()
                ~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\meili\forked\ca-biositing\.pixi\envs\default\Lib\site-packages\pandas\core\frame.py", line 9210, in groupby
    return DataFrameGroupBy(
        obj=self,
    ...<7 lines>...
        dropna=dropna,
    )
  File "c:\Users\meili\forked\ca-biositing\.pixi\envs\default\Lib\site-packages\pandas\core\groupby\groupby.py", line 1331, in __init__
    grouper, exclusions, obj = get_grouper(
                               ~~~~~~~~~~~^
        obj,
        ^^^^
    ...<5 lines>...
        dropna=self.dropna,
        ^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "c:\Users\meili\forked\ca-biositing\.pixi\envs\default\Lib\site-packages\pandas\core\groupby\grouper.py", line 1043, in get_gr
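
The `KeyError` above means that `groupby` could not find a `commodity_code` column in `transformed_data`. A minimal sketch of a pre-flight check that reports missing columns before any database work starts is shown below. The required list is taken from the columns the load cell selects for census records; `missing_columns` and `example_columns` are hypothetical names used only for illustration.

```python
from typing import Iterable, List

# Columns the load cell reads when building census records.
REQUIRED_LOAD_COLUMNS = ["geoid", "year", "commodity_code", "source_reference"]


def missing_columns(available: Iterable[str], required: Iterable[str]) -> List[str]:
    """Return the required column names that are absent, in their original order."""
    present = set(available)
    return [name for name in required if name not in present]


# Hypothetical stand-in for list(transformed_data.columns).
example_columns = ["geoid", "year", "source_reference", "value"]
missing = missing_columns(example_columns, REQUIRED_LOAD_COLUMNS)
if missing:
    print(f"Cannot load: missing columns {missing}")
```

In the notebook, the same check could run on `transformed_data.columns` at the top of the load cell so that a schema mismatch produces a one-line message instead of a pandas traceback.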

## Step 7: Run Complete End-to-End Flow

In [None]:
from ca_biositing.pipeline.flows.usda_etl import usda_etl_flow

print("Running Complete USDA ETL Flow (Extract → Transform → Load):")
print("="*60)

try:
    success = usda_etl_flow()
    
    if success:
        print("\n✓ USDA ETL FLOW COMPLETED SUCCESSFULLY!")
    else:
        print("\n⚠ USDA ETL flow returned False (check logs for details)")
except Exception as e:
    print(f"✗ Flow execution failed: {e}")
    raise

## Step 8: Verify Data in Database

In [None]:
print("Querying Recent USDA Records from Database:")
print("="*60)

try:
    with engine.connect() as conn:
        result = pd.read_sql(
            text("""
                SELECT 
                    id, geoid, commodity_name, year, total_records, created_at
                FROM usda_census_record
                ORDER BY created_at DESC
                LIMIT 10
            """),
            conn
        )
        
        if len(result) > 0:
            print(f"✓ Found {len(result)} recent USDA records:")
            print(f"\n{result.to_string(index=False)}")
        else:
            print("⚠ No USDA records in database (this may be expected for test)")
            
except Exception as e:
    print(f"✗ Database query failed: {e}")
    raise

## Step 9: Summary Report

In [None]:
print("\n" + "="*60)
print("USDA INGESTION PIPELINE - TEST SUMMARY")
print("="*60)

checks = {
    "Environment Setup": True,
    "Database Connection": "engine" in locals(),
    "Commodity Mapper": "commodity_codes" in locals() and len(commodity_codes) > 0,
    "USDA API Extract": "raw_data" in locals() and len(raw_data) > 0,
    "Transform Task": "transformed_data" in locals() and len(transformed_data) > 0,
    "Load Task": "etl_run_name" in locals(),
    "Database Records": len(result) > 0 if "result" in locals() else False,
}

print("\nTest Results:")
for test_name, passed in checks.items():
    status = "✓ PASS" if passed else "✗ FAIL"
    print(f"  {status}: {test_name}")

all_passed = all(checks.values())
print(f"\n{'='*60}")
if all_passed:
    print("🎉 ALL TESTS PASSED - USDA INGESTION WORKING!")
else:
    print("⚠ Some tests failed - see above for details")
print(f"{'='*60}")

# USDA Ingestion Pipeline Testing

This notebook walks through the complete USDA ETL pipeline:
1. **Extract**: Fetch data from USDA NASS QuickStats API
2. **Transform**: Clean and normalize the data
3. **Load**: Insert records into the database
4. **Verify**: Query results to confirm success

**Goal**: Test all components and show working output by 5pm today.

## Step 1: Environment Setup

In [None]:
import os
import sys
from pathlib import Path

# Configure PYTHONPATH for namespace packages
workspace_root = Path(r'c:\Users\meili\forked\ca-biositing')
sys.path.insert(0, str(workspace_root / 'src' / 'ca_biositing' / 'pipeline'))
sys.path.insert(0, str(workspace_root / 'src' / 'ca_biositing' / 'datamodels'))
sys.path.insert(0, str(workspace_root / 'src' / 'ca_biositing' / 'webservice'))

os.chdir(str(workspace_root))

print("✓ PYTHONPATH configured")
print(f"✓ Working directory: {os.getcwd()}")

## Step 2: Load Environment Variables

In [None]:
from dotenv import load_dotenv
import os

# Load .env file
env_path = workspace_root / '.env'
load_dotenv(env_path)

# Verify critical environment variables
db_url = os.getenv('DATABASE_URL')
usda_api_key = os.getenv('USDA_NASS_API_KEY')

print("Environment Variables Loaded:")
print(f"  DATABASE_URL: {db_url[:50]}..." if db_url else "  DATABASE_URL: NOT SET")
print(f"  USDA_NASS_API_KEY: {usda_api_key[:20]}..." if usda_api_key else "  USDA_NASS_API_KEY: NOT SET")

if not db_url or not usda_api_key:
    raise ValueError("Missing required environment variables in .env")

print("\n✓ All required environment variables loaded")

## Step 3: Verify Database Connection

In [None]:
from sqlalchemy import create_engine, text

# Create database connection
engine = create_engine(os.getenv('DATABASE_URL'))

try:
    with engine.connect() as conn:
        result = conn.execute(text("SELECT version();"))
        version = result.fetchone()[0]
        print(f"✓ Database connected: {version[:60]}...")
except Exception as e:
    print(f"✗ Database connection failed: {e}")
    raise

# Check if USDA tables exist
try:
    with engine.connect() as conn:
        result = conn.execute(text(
            "SELECT table_name FROM information_schema.tables "
            "WHERE table_schema = 'public' AND table_name LIKE 'usda%'"
        ))
        tables = [row[0] for row in result.fetchall()]
        print(f"\n✓ USDA tables found: {tables}")
except Exception as e:
    print(f"⚠ Could not query tables: {e}")

## Step 4: Verify Commodity Mapper

In [None]:
# Test commodity mapper
from ca_biositing.pipeline.utils.commodity_mapper import get_mapped_commodity_ids

print("Testing Commodity Mapper:")
print("="*50)

try:
    commodity_codes = get_mapped_commodity_ids()
    print(f"✓ Retrieved {len(commodity_codes)} commodity codes from database:")
    for idx, code in enumerate(commodity_codes[:5]):
        print(f"  - Commodity {idx + 1}: {code}")
except Exception as e:
    print(f"✗ Commodity mapper failed: {e}")
    raise

print(f"\n✓ Commodity mapper working correctly")

## Step 5: Test Extract Task (USDA API)

In [None]:
import pandas as pd
from ca_biositing.pipeline.utils.usda_nass_to_pandas import usda_nass_to_df
from ca_biositing.pipeline.utils.nass_config import PRIORITY_COUNTIES

print("Testing USDA API Extract - County Level Data:")
print("="*60)

# Get commodity codes
commodity_codes = get_mapped_commodity_ids()
if commodity_codes:
    commodity_ids = commodity_codes[:1]
else:
    print("No commodity codes found!")
    raise ValueError("No commodity codes mapped in database")

print(f"North San Joaquin Valley Priority Counties:")
for county_name, fips_code in PRIORITY_COUNTIES.items():
    print(f"  {county_name} (FIPS: {fips_code})")

print(f"\nQuerying with specific parameters:")
print(f"  - Commodity ID: {commodity_ids[0]}")
print(f"  - Year: 2023")
print(f"  - Agg Level: COUNTY (county-level detail)")
print(f"  - Statistic: YIELD (bushels per acre)")
print(f"  - Unit: BUSHELS")
print(f"  - Domain: TOTAL (all operations)")
print("\nThis query respects the 50k record limit...\n")

try:
    raw_data = usda_nass_to_df(
        commodity_ids=commodity_ids,
        api_key=os.getenv('USDA_NASS_API_KEY'),
        year=2023,
        agg_level_desc="COUNTY",
        statisticcat_desc="YIELD",
        unit_desc="BUSHELS",
        domain_desc="TOTAL"
    )
    
    if len(raw_data) > 0:
        print(f"✓ Extract successful!")
        print(f"  Columns: {list(raw_data.columns)[:7]}...")
        print(f"\n  First row sample:")
        print(raw_data.iloc[0].to_string()[:300])
    else:
        print("⚠ No data returned - commodity may not have yield data")
except Exception as e:
    print(f"✗ Extract failed: {e}")
    raise

## Step 6: Test Transform Task

In [None]:
from ca_biositing.pipeline.etl.transform.usda.usda_census_survey import validate_and_clean_usda_data

print("Testing Transform Task:")
print("="*50)

# Use the raw_data from extract (if available)
if 'raw_data' in locals() and len(raw_data) > 0:
    try:
        transformed_data = validate_and_clean_usda_data(raw_data.copy())
        
        print(f"✓ Transform successful!")
        print(f"  Records after transform: {len(transformed_data)}")
        print(f"  Columns after transform: {list(transformed_data.columns)}")
        print(f"\n  Sample transformed record:")
        if len(transformed_data) > 0:
            print(transformed_data.iloc[0].to_string()[:300])
    except Exception as e:
        print(f"✗ Transform failed: {e}")
        raise
else:
    print("⚠ Skipping transform test (no data from extract)")

## Step 7: Test Load Task

In [None]:
from ca_biositing.pipeline.etl.load.usda.usda_census_survey import load_usda_data
from datetime import datetime

print("Testing Load Task:")
print("="*50)

if 'transformed_data' in locals() and len(transformed_data) > 0:
    try:
        # Create ETL run metadata
        etl_run_name = f"test_run_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
        
        print(f"Loading {len(transformed_data)} records...")
        print(f"ETL Run: {etl_run_name}\n")
        
        load_result = load_usda_data(
            transformed_data=transformed_data.copy(),
            etl_run_name=etl_run_name,
            engine=engine
        )
        
        print(f"✓ Load successful!")
        print(f"  Result: {load_result}")
    except Exception as e:
        print(f"✗ Load failed: {e}")
        raise
else:
    print("⚠ Skipping load test (no data from transform)")

## Step 8: Run Complete End-to-End Flow

In [None]:
from ca_biositing.pipeline.flows.usda_etl import usda_etl_flow

print("Running Complete USDA ETL Flow (Extract → Transform → Load):")
print("="*60)

try:
    # Run the Prefect flow
    success = usda_etl_flow()
    
    if success:
        print("\n✓ USDA ETL FLOW COMPLETED SUCCESSFULLY!")
    else:
        print("\n⚠ USDA ETL flow returned False (check logs for details)")
except Exception as e:
    print(f"✗ Flow execution failed: {e}")
    raise

## Step 9: Verify Data in Database

In [None]:
import pandas as pd
from sqlalchemy import text

print("Verifying Data in Database:")
print("="*50)

try:
    with engine.connect() as conn:
        # Query recent USDA records
        query = text("""
            SELECT 
                id, 
                geoid, 
                commodity_name,
                year,
                total_records,
                created_at
            FROM usda_census_record
            ORDER BY created_at DESC
            LIMIT 10
        """)
        
        result = pd.read_sql(query, conn)
        
        if len(result) > 0:
            print(f"✓ Found {len(result)} recent USDA records in database:")
            print(f"\n{result.to_string(index=False)}")
        else:
            print("⚠ No USDA records found in database")
            
except Exception as e:
    print(f"✗ Database query failed: {e}")
    raise

## Step 10: Summary Report

In [None]:
print("\n" + "="*60)
print("USDA INGESTION PIPELINE - TEST SUMMARY")
print("="*60)

checks = {
    "Environment Setup": True,
    "Database Connection": "db_url" in locals(),
    "Commodity Mapper": "commodity_codes" in locals() and len(commodity_codes) > 0,
    "USDA API Extract": "raw_data" in locals() and len(raw_data) > 0,
    "Transform Task": "transformed_data" in locals() and len(transformed_data) > 0,
    "Load Task": "etl_run_name" in locals(),
    "Database Records": len(result) > 0 if "result" in locals() else False,
}

print("\nTest Results:")
for test_name, passed in checks.items():
    status = "✓ PASS" if passed else "✗ FAIL"
    print(f"  {status}: {test_name}")

all_passed = all(checks.values())
print(f"\n{'='*60}")
if all_passed:
    print("🎉 ALL TESTS PASSED - USDA INGESTION WORKING!")
else:
    print("⚠ Some tests failed - see above for details")
print(f"{'='*60}")

## Production-Ready API Template

### Template for USDA NASS API Data Ingestion

This template provides a reusable pattern for extracting, transforming, and preparing USDA agricultural data for database ingestion. It is configuration-driven: running a different query means editing settings, not code.

### How to Use This Template

1. **Configuration Section** (lines 1-30): Adjust these settings for different queries
   - `SELECTED_STATISTICS`: Choose which statistics to retrieve (default: YIELD)
   - `COUNTIES_TO_QUERY`: Add/remove counties with their FIPS and NASS codes
   - `YEAR`: Change data year if needed

2. **No Code Changes Needed**: The template handles everything else automatically
   - Database commodity mapping
   - County iteration with proper code conversion
   - API response parsing and error handling
   - Data transformation to output schema
   - Results summary
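
As a rough illustration of the county iteration described above, the sketch below expands the configuration into one set of query parameters per county and statistic. The configuration names come from this template, but the values, the `build_query_params` helper, and the `county_code` parameter name are assumptions made for the example rather than the template's actual code.

```python
from typing import Dict, List

# Illustrative configuration values; adjust to match the real template.
SELECTED_STATISTICS = ["YIELD"]
COUNTIES_TO_QUERY = {
    "San Joaquin": {"fips": "06077", "nass_code": "077"},
    "Merced": {"fips": "06047", "nass_code": "047"},
}
YEAR = 2022


def build_query_params(commodity: str) -> List[Dict[str, object]]:
    """Return one parameter dict per (county, statistic) pair."""
    queries: List[Dict[str, object]] = []
    for codes in COUNTIES_TO_QUERY.values():
        for statistic in SELECTED_STATISTICS:
            queries.append({
                "state_alpha": "CA",
                "county_code": codes["nass_code"],  # assumed parameter name
                "year": YEAR,
                "commodity_desc": commodity,
                "agg_level_desc": "COUNTY",
                "statisticcat_desc": statistic,
                "format": "JSON",
            })
    return queries


queries = build_query_params("CORN")
print(f"{len(queries)} queries prepared")
```

The API key is deliberately left out of the generated dictionaries so they can be printed or logged safely; it would be added just before each request is sent.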




## Quick Reference: Template Usage Examples

### Example 1: Query Multiple Statistics

```python
# In configuration section, change:
SELECTED_STATISTICS = ['YIELD', 'PRODUCTION', 'AREA HARVESTED']

# Then run the template - it will iterate all three automatically
# Output: DataFrame with rows for each statistic per commodity
```

### Example 2: Add a New County

```python
# In configuration section, add to COUNTIES_TO_QUERY:
COUNTIES_TO_QUERY = {
    'San Joaquin': {'fips': '06077', 'nass_code': '077'},
    'Merced': {'fips': '06047', 'nass_code': '047'},
    'Kern': {'fips': '06029', 'nass_code': '029'},  # Add this line
}

# Template will automatically query the new county
```

### Example 3: Query Different Year

```python
# In configuration section, change:
YEAR = 2023  # or 2021, 2020, etc.

# Run template - will query the new year
```

### Example 4: Get All Commodity Statistics

```python
# In configuration section, change:
SELECTED_STATISTICS = list(STATISTICS_OPTIONS.keys())
# = ['AREA HARVESTED', 'PRODUCTION', 'YIELD', 'PRICE RECEIVED']

# Run template - will retrieve all four statistics for each commodity
```

### Example 5: Pipeline Integration

```python
# After running template, use output_df in pipeline tasks:

# 1. Transform (rename columns, clean data, etc.)
# transformed_df = transform(output_df)

# 2. Load to database
# from ca_biositing.pipeline.etl.load.usda.usda_census_survey import load
# success = load(transformed_df)
# Result: Records inserted with auto-generated timestamps
```

### Common Tasks

| Task | How To |
|------|--------|
| Get statistics for specific commodities only | Filter `output_df` before load: `output_df[output_df['commodity'].isin(['wheat', 'corn'])]` |
| Change default statistic | Modify `SELECTED_STATISTICS = ['PRODUCTION']` (default is ['YIELD']) |
| Skip a county temporarily | Remove it from `COUNTIES_TO_QUERY` or comment it out |
| Get raw API response | Check intermediate `records` variable or add `print(raw_record)` |
| Check for missing data | Run debug cell #VSC-cd5623db (DataFrame inspection) |

# Debug: Check what the API is actually returning
import os
import requests
from urllib.parse import urlencode
import time
import json

print("="*60)
print("DEBUG: Inspecting Actual API Response Data")
print("="*60)

api_key = os.getenv('USDA_NASS_API_KEY')

# Query that returns 200
test_params = {
    "key": api_key,
    "state_alpha": "CA",
    "format": "JSON",
    "year": 2023,
    "commodity_desc": "CORN",
    "agg_level_desc": "COUNTY",
    "statisticcat_desc": "YIELD",
}

print("\nMaking request with parameters:")
for k, v in test_params.items():
    if k != "key":
        print(f"  {k}: {v}")

resp = requests.get("https://quickstats.nass.usda.gov/api/api_GET", params=test_params, timeout=30)
print(f"\nStatus: {resp.status_code}")

data = resp.json()
print(f"Response type: {type(data)}")
print(f"Response is dict: {isinstance(data, dict)}")
print(f"Response is list: {isinstance(data, list)}")

# Show the response
if isinstance(data, dict):
    print(f"\nDictionary keys: {list(data.keys())}")
    print(f"Dictionary content (first 500 chars):")
    print(json.dumps(data, indent=2)[:500])
elif isinstance(data, list):
    print(f"\nList length: {len(data)}")
    if len(data) > 0:
        print(f"First item: {data[0]}")
else:
    print(f"\nResponse (raw): {str(data)[:200]}")

# Also check if there's a special error or message field
if isinstance(data, dict):
    if "error" in data:
        print(f"\n⚠ ERROR in response: {data['error']}")
    if "message" in data:
        print(f"⚠ MESSAGE: {data['message']}")
    if "records" in data:
        print(f"Found 'records' key: {len(data['records'])} records")
        if len(data['records']) > 0:
            print(f"  First record: {data['records'][0]}")
