# USDA Ingestion Pipeline - Complete Testing

This notebook walks through end-to-end testing of the USDA ETL pipeline:
1. **Environment Setup**: Configure PYTHONPATH and imports
2. **Database Connection**: Verify connectivity
3. **Commodity Mapper**: Test USDA code lookups
4. **Extract**: Fetch data from USDA NASS API
5. **Transform**: Clean and normalize data
6. **Load**: Insert into database
7. **Verification**: Query and confirm results

**Goal**: Demonstrate a fully working USDA ingestion pipeline, with output ✓

## Step 1: Environment Setup

In [104]:
import os
import sys
from pathlib import Path
import pandas as pd
from datetime import datetime

# Configure PYTHONPATH for namespace packages
workspace_root = Path(r'c:\Users\meili\forked\ca-biositing')
sys.path.insert(0, str(workspace_root / 'src' / 'ca_biositing' / 'pipeline'))
sys.path.insert(0, str(workspace_root / 'src' / 'ca_biositing' / 'datamodels'))
sys.path.insert(0, str(workspace_root / 'src' / 'ca_biositing' / 'webservice'))
os.chdir(str(workspace_root))

# Load environment variables
from dotenv import load_dotenv
load_dotenv(workspace_root / '.env')

print("✓ Environment configured")
print(f"✓ Working directory: {os.getcwd()}")
print(f"✓ DATABASE_URL loaded: {bool(os.getenv('DATABASE_URL'))}")
print(f"✓ USDA_NASS_API_KEY loaded: {bool(os.getenv('USDA_NASS_API_KEY'))}")

✓ Environment configured
✓ Working directory: c:\Users\meili\forked\ca-biositing
✓ DATABASE_URL loaded: True
✓ USDA_NASS_API_KEY loaded: True
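
The hard-coded `workspace_root` in this cell only resolves on one machine. A minimal sketch of a more portable alternative, assuming the repository root can be recognised by a marker file such as `.env` (the helper name `find_workspace_root` and the choice of marker are illustrative, not part of the project):

```python
from pathlib import Path
from typing import Optional


def find_workspace_root(start: Path, marker: str = ".env") -> Optional[Path]:
    """Walk upward from `start` until a directory containing `marker` is found.

    Hypothetical helper: the marker file name is an assumption, not project code.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    return None  # no marker found anywhere above `start`


# Usage: resolve the root relative to wherever the notebook is running.
workspace_root = find_workspace_root(Path.cwd())
```

If the function returns `None`, falling back to an explicit path (as the cell does today) keeps the notebook usable.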


## Step 2: Test Database Connection

In [105]:
from sqlalchemy import create_engine, text

engine = create_engine(os.getenv('DATABASE_URL'))

try:
    with engine.connect() as conn:
        result = conn.execute(text("SELECT version();"))
        version = result.fetchone()[0]
        print(f"✓ Database connected")
        print(f"  PostgreSQL version: {version[:60]}...")
except Exception as e:
    print(f"✗ Database connection failed: {e}")
    raise

✓ Database connected
  PostgreSQL version: PostgreSQL 13.5 (Debian 13.5-1.pgdg110+1) on x86_64-pc-linux...


## Step 3: Test Commodity Mapper

In [109]:
from ca_biositing.pipeline.utils.commodity_mapper import get_mapped_commodity_ids

print("Testing Commodity Mapper:")
print("="*50)

try:
    commodity_codes = get_mapped_commodity_ids()
    print(f"✓ Retrieved {len(commodity_codes)} commodity codes:")
    for idx, code in enumerate(commodity_codes[:5]):
        print(f"  - Code {idx + 1}: {code}")
except Exception as e:
    print(f"✗ Commodity mapper failed: {e}")
    raise

Testing Commodity Mapper:
✓ Retrieved 4 commodity codes:
  - Code 1: 11199199
  - Code 2: 37899999
  - Code 3: 10199999
  - Code 4: 26199999


## Step 4: Test USDA Extract (Fetch from API)

In [110]:

import requests
import time
from ca_biositing.pipeline.utils.nass_config import PRIORITY_COUNTIES

print("Testing USDA API - North San Joaquin Valley County-Level Data:")
print("="*60)

api_key = os.getenv('USDA_NASS_API_KEY')

# Map FIPS codes to 3-digit county codes (API requires separate state + county)
fips_to_county_code = {
    "06077": "077",  # San Joaquin
    "06099": "099",  # Stanislaus  
    "06047": "047",  # Merced
}

results_by_county = {}

for county_name, fips_code in PRIORITY_COUNTIES.items():
    county_code = fips_to_county_code[fips_code]
    print(f"\n[{county_name}] FIPS: {fips_code} → County Code: {county_code}")
    
    # Use state_alpha + county_code (confirmed working from R package docs)
    params = {
        "key": api_key,
        "state_alpha": "CA",
        "county_code": county_code,  # 3-digit county code (077, 099, 047)
        "format": "JSON",
        "year": 2022  # Using 2022 since 2023 may not have complete data yet
    }
    
    try:
        resp = requests.get("https://quickstats.nass.usda.gov/api/api_GET", params=params, timeout=30)
        print(f"  Status: {resp.status_code}")
        
        data = resp.json()
        if isinstance(data, dict) and "data" in data:
            records = data["data"]
            print(f"  Records: {len(records)}")
            
            if len(records) > 0:
                results_by_county[county_name] = records
                commodities = {r.get('commodity_desc') for r in records if r.get('commodity_desc')}
                print(f"  Commodities available: {', '.join(sorted(commodities)[:5])}...")
                
                # Show a sample; fall back to '' so slicing never hits None
                sample = records[0]
                print(f"  Sample: {sample.get('commodity_desc')} - {(sample.get('short_desc') or '')[:50]}...")
        elif "error" in data:
            print(f"  Error: {data['error']}")
        else:
            print(f"  No data returned")
    except Exception as e:
        print(f"  Exception: {e}")
    
    time.sleep(1)

print(f"\n{'='*60}")
print(f"✓ County-level exploration complete!")
print(f"  Counties with data: {len(results_by_county)}")

# Combine all results into a single DataFrame
if results_by_county:
    all_records = []
    for county_name, records in results_by_county.items():
        all_records.extend(records)
    
    raw_data = pd.DataFrame(all_records)
    print(f"  Total records: {len(raw_data)}")
    print(f"  Unique commodities: {raw_data['commodity_desc'].nunique()}")
    
    print(f"\n  Sample:")
    print(raw_data[['year', 'county_name', 'commodity_desc', 'short_desc']].drop_duplicates().head(3).to_string(index=False))
else:
    print("  ⚠ No data found in any county")
    raw_data = pd.DataFrame()


Testing USDA API - North San Joaquin Valley County-Level Data:

[San Joaquin] FIPS: 06077 → County Code: 077
  Status: 200
  Records: 2233
  Commodities available: AG LAND, AG SERVICES, ALMONDS, ALPACAS, ANIMAL TOTALS...
  Sample: ANIMAL TOTALS - ANIMAL TOTALS, INCL PRODUCTS - SALES, MEASURED IN ...

[Stanislaus] FIPS: 06099 → County Code: 099
  Status: 200
  Records: 2102
  Commodities available: AG LAND, AG SERVICES, ALMONDS, ALPACAS, ANIMAL TOTALS...
  Sample: ANIMAL TOTALS - ANIMAL TOTALS, INCL PRODUCTS - SALES, MEASURED IN ...

[Merced] FIPS: 06047 → County Code: 047
  Status: 200
  Records: 2229
  Commodities available: AG LAND, AG SERVICES, ALMONDS, ALPACAS, ANIMAL TOTALS...
  Sample: ANIMAL TOTALS - ANIMAL TOTALS, INCL PRODUCTS - SALES, MEASURED IN ...

✓ County-level exploration complete!
  Counties with data: 3
  Total records: 6564
  Unique commodities: 191

  Sample:
 year county_name     commodity_desc                                               short_desc
 2022 
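
The `fips_to_county_code` lookup in this step hard-codes three counties. Since each 5-digit code in it is the 2-digit state FIPS code followed by the 3-digit county code, the same mapping can be derived by slicing. A small sketch (the helper names `split_fips` and `build_geoid` are illustrative, not part of the pipeline):

```python
from typing import Tuple


def split_fips(fips: str) -> Tuple[str, str]:
    """Split a 5-digit county FIPS code into (state_fips, county_code)."""
    if len(fips) != 5 or not fips.isdigit():
        raise ValueError(f"Expected a 5-digit FIPS string, got {fips!r}")
    return fips[:2], fips[2:]


def build_geoid(state_fips: str, county_code: str) -> str:
    """Inverse of split_fips: zero-pad both parts and join them."""
    return str(state_fips).zfill(2) + str(county_code).zfill(3)


state, county = split_fips("06077")   # San Joaquin
print(state, county)                   # 06 077
print(build_geoid("6", "77"))          # 06077
```

Keeping the codes as strings throughout matters here: converting to integers would drop the leading zero in `06`.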

### Inspect raw data from API

In [111]:
print("="*80)
print("Inspecting Raw Data from USDA API")
print("="*80)

if 'raw_data' in locals() and len(raw_data) > 0:
    # CRITICAL: Filter to only the counties we requested
    # NOTE: API returns uppercase county names, so we need case-insensitive comparison
    priority_county_names = [name.upper() for name in PRIORITY_COUNTIES.keys()]
    print(f"\n🔍 Filtering to priority counties (case-insensitive): {priority_county_names}")
    print(f"   Before filter: {len(raw_data)} records from counties: {raw_data['county_name'].unique().tolist()}")
    
    # Convert county_name to uppercase for comparison, then filter
    raw_data = raw_data[raw_data['county_name'].str.upper().isin(priority_county_names)].copy()
    print(f"   After filter: {len(raw_data)} records from counties: {raw_data['county_name'].unique().tolist()}")
    
    if len(raw_data) == 0:
        print("\n⚠️ WARNING: No records found for priority counties after filtering!")
        print("   This means the API returned data for different counties than requested.")
        print("   The NASS API state_alpha + county_code parameters may not be working as expected.")
    
    print(f"\n📊 DataFrame Shape: {raw_data.shape}")
    print(f"   Rows: {len(raw_data)}, Columns: {len(raw_data.columns)}")
    
    print(f"\n📋 Column Information:")
    raw_data.info()  # info() prints its report itself and returns None
    
    print(f"\n🔍 First 5 Rows:")
    pd.set_option('display.max_columns', None)
    pd.set_option('display.width', None)
    pd.set_option('display.max_colwidth', 50)
    print(raw_data.head())
    
    print(f"\n📈 Data Types:")
    print(raw_data.dtypes)
    
    print(f"\n❌ Missing Values:")
    missing = raw_data.isnull().sum()
    print(missing[missing > 0] if missing.sum() > 0 else "No missing values")
    
    print(f"\n🏷️ Unique Values (key columns):")
    key_cols = ['commodity_desc', 'county_name', 'year', 'short_desc']
    for col in key_cols:
        if col in raw_data.columns:
            unique_count = raw_data[col].nunique()
            print(f"   {col}: {unique_count} unique values")
            if unique_count <= 10:
                print(f"      Values: {raw_data[col].unique().tolist()}")
    
    print(f"\n📊 Sample Value Ranges:")
    numeric_cols = raw_data.select_dtypes(include=['number']).columns
    for col in numeric_cols:
        print(f"   {col}: min={raw_data[col].min()}, max={raw_data[col].max()}")
    
    print(f"\n✅ Sample Full Record (first row, all columns):")
    print(raw_data.iloc[0].to_string())
    
else:
    print("⚠️ No raw_data available to inspect")


Inspecting Raw Data from USDA API

🔍 Filtering to priority counties (case-insensitive): ['SAN JOAQUIN', 'STANISLAUS', 'MERCED']
   Before filter: 6564 records from counties: ['SAN JOAQUIN', 'STANISLAUS', 'MERCED']
   After filter: 6564 records from counties: ['SAN JOAQUIN', 'STANISLAUS', 'MERCED']

📊 DataFrame Shape: (6564, 39)
   Rows: 6564, Columns: 39

📋 Column Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6564 entries, 0 to 6563
Data columns (total 39 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   year                   6564 non-null   int64 
 1   reference_period_desc  6564 non-null   object
 2   county_name            6564 non-null   object
 3   unit_desc              6564 non-null   object
 4   freq_desc              6564 non-null   object
 5   Value                  6564 non-null   object
 6   state_ansi             6564 non-null   object
 7   watershed_desc         6564 non-null   o

In [112]:
# Verify raw_data is ready for transform
print("Data ready for transform:")
print(f"  Rows: {len(raw_data)}")
print(f"  Columns: {list(raw_data.columns)}")
print(f"  Counties: {raw_data['county_name'].unique().tolist() if 'county_name' in raw_data.columns else 'N/A'}")

# The Data Wrangler will be opened with the variable below
raw_data

Data ready for transform:
  Rows: 6564
  Columns: ['year', 'reference_period_desc', 'county_name', 'unit_desc', 'freq_desc', 'Value', 'state_ansi', 'watershed_desc', 'agg_level_desc', 'prodn_practice_desc', 'class_desc', 'asd_desc', 'sector_desc', 'state_alpha', 'county_code', 'end_code', 'asd_code', 'load_time', 'short_desc', 'commodity_desc', 'domaincat_desc', 'week_ending', 'state_fips_code', 'domain_desc', 'statisticcat_desc', 'CV (%)', 'county_ansi', 'zip_5', 'util_practice_desc', 'country_code', 'state_name', 'begin_code', 'region_desc', 'watershed_code', 'source_desc', 'country_name', 'congr_district_code', 'location_desc', 'group_desc']
  Counties: ['SAN JOAQUIN', 'STANISLAUS', 'MERCED']


Unnamed: 0,year,reference_period_desc,county_name,unit_desc,freq_desc,Value,state_ansi,watershed_desc,agg_level_desc,prodn_practice_desc,class_desc,asd_desc,sector_desc,state_alpha,county_code,end_code,asd_code,load_time,short_desc,commodity_desc,domaincat_desc,week_ending,state_fips_code,domain_desc,statisticcat_desc,CV (%),county_ansi,zip_5,util_practice_desc,country_code,state_name,begin_code,region_desc,watershed_code,source_desc,country_name,congr_district_code,location_desc,group_desc
0,2022,YEAR,SAN JOAQUIN,$,ANNUAL,910695000,06,,COUNTY,ALL PRODUCTION PRACTICES,INCL PRODUCTS,SAN JOAQUIN VALLEY,ANIMALS & PRODUCTS,CA,077,00,51,2024-07-02 12:00:00.000,"ANIMAL TOTALS, INCL PRODUCTS - SALES, MEASURED...",ANIMAL TOTALS,NOT SPECIFIED,,06,TOTAL,SALES,(L),077,,ALL UTILIZATION PRACTICES,9000,CALIFORNIA,00,,00000000,CENSUS,UNITED STATES,,"CALIFORNIA, SAN JOAQUIN VALLEY, SAN JOAQUIN",ANIMAL TOTALS
1,2022,YEAR,SAN JOAQUIN,OPERATIONS,ANNUAL,560,06,,COUNTY,ALL PRODUCTION PRACTICES,INCL PRODUCTS,SAN JOAQUIN VALLEY,ANIMALS & PRODUCTS,CA,077,00,51,2024-07-02 12:00:00.000,"ANIMAL TOTALS, INCL PRODUCTS - OPERATIONS WITH...",ANIMAL TOTALS,NOT SPECIFIED,,06,TOTAL,SALES,14.7,077,,ALL UTILIZATION PRACTICES,9000,CALIFORNIA,00,,00000000,CENSUS,UNITED STATES,,"CALIFORNIA, SAN JOAQUIN VALLEY, SAN JOAQUIN",ANIMAL TOTALS
2,2022,YEAR,SAN JOAQUIN,$,ANNUAL,(D),06,,COUNTY,ALL PRODUCTION PRACTICES,ALL CLASSES,SAN JOAQUIN VALLEY,ANIMALS & PRODUCTS,CA,077,00,51,2024-07-02 12:00:00.000,"AQUACULTURE TOTALS - SALES & DISTRIBUTION, MEA...",AQUACULTURE TOTALS,NOT SPECIFIED,,06,TOTAL,SALES & DISTRIBUTION,(D),077,,ALL UTILIZATION PRACTICES,9000,CALIFORNIA,00,,00000000,CENSUS,UNITED STATES,,"CALIFORNIA, SAN JOAQUIN VALLEY, SAN JOAQUIN",AQUACULTURE
3,2022,YEAR,SAN JOAQUIN,OPERATIONS,ANNUAL,2,06,,COUNTY,ALL PRODUCTION PRACTICES,ALL CLASSES,SAN JOAQUIN VALLEY,ANIMALS & PRODUCTS,CA,077,00,51,2024-07-02 12:00:00.000,AQUACULTURE TOTALS - OPERATIONS WITH SALES & D...,AQUACULTURE TOTALS,NOT SPECIFIED,,06,TOTAL,SALES & DISTRIBUTION,(L),077,,ALL UTILIZATION PRACTICES,9000,CALIFORNIA,00,,00000000,CENSUS,UNITED STATES,,"CALIFORNIA, SAN JOAQUIN VALLEY, SAN JOAQUIN",AQUACULTURE
4,2022,YEAR,SAN JOAQUIN,$,ANNUAL,(D),06,,COUNTY,ALL PRODUCTION PRACTICES,CATFISH,SAN JOAQUIN VALLEY,ANIMALS & PRODUCTS,CA,077,00,51,2024-07-02 12:00:00.000,"FOOD FISH, CATFISH - SALES & DISTRIBUTION, MEA...",FOOD FISH,NOT SPECIFIED,,06,TOTAL,SALES & DISTRIBUTION,(D),077,,ALL UTILIZATION PRACTICES,9000,CALIFORNIA,00,,00000000,CENSUS,UNITED STATES,,"CALIFORNIA, SAN JOAQUIN VALLEY, SAN JOAQUIN",AQUACULTURE
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6559,2022,YEAR,MERCED,ACRES,ANNUAL,24700,06,,COUNTY,IN THE OPEN,ALL CLASSES,SAN JOAQUIN VALLEY,CROPS,CA,047,00,51,2024-03-08 15:00:00.000,"TOMATOES, IN THE OPEN, PROCESSING - ACRES PLANTED",TOMATOES,NOT SPECIFIED,,06,TOTAL,AREA PLANTED,,047,,PROCESSING,9000,CALIFORNIA,00,,00000000,SURVEY,UNITED STATES,,"CALIFORNIA, SAN JOAQUIN VALLEY, MERCED",VEGETABLES
6560,2022,YEAR,MERCED,TONS / ACRE,ANNUAL,44.15,06,,COUNTY,IN THE OPEN,ALL CLASSES,SAN JOAQUIN VALLEY,CROPS,CA,047,00,51,2024-03-08 15:00:00.000,"TOMATOES, IN THE OPEN, PROCESSING - YIELD, MEA...",TOMATOES,NOT SPECIFIED,,06,TOTAL,YIELD,,047,,PROCESSING,9000,CALIFORNIA,00,,00000000,SURVEY,UNITED STATES,,"CALIFORNIA, SAN JOAQUIN VALLEY, MERCED",VEGETABLES
6561,2022,YEAR,MERCED,TONS,ANNUAL,1086000,06,,COUNTY,IN THE OPEN,ALL CLASSES,SAN JOAQUIN VALLEY,CROPS,CA,047,00,51,2024-03-08 15:00:00.000,"TOMATOES, IN THE OPEN, PROCESSING, UTILIZED - ...",TOMATOES,NOT SPECIFIED,,06,TOTAL,PRODUCTION,,047,,"PROCESSING, UTILIZED",9000,CALIFORNIA,00,,00000000,SURVEY,UNITED STATES,,"CALIFORNIA, SAN JOAQUIN VALLEY, MERCED",VEGETABLES
6562,2022,YEAR,MERCED,$ / ACRE,ANNUAL,325,06,,COUNTY,IRRIGATED,"CASH, CROPLAND",SAN JOAQUIN VALLEY,ECONOMICS,CA,047,00,51,2022-08-26 15:00:22.000,"RENT, CASH, CROPLAND, IRRIGATED - EXPENSE, MEA...",RENT,NOT SPECIFIED,,06,TOTAL,EXPENSE,7.6,047,,ALL UTILIZATION PRACTICES,9000,CALIFORNIA,00,,00000000,SURVEY,UNITED STATES,,"CALIFORNIA, SAN JOAQUIN VALLEY, MERCED",EXPENSES
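
In the rows above, `Value` and `CV (%)` are text: some entries are plain numbers, while others are codes in parentheses such as `(D)` and `(L)`. A sketch of a parser that separates the two cases, assuming thousands separators are commas and that any parenthesised token is a non-numeric code (`parse_nass_value` is a hypothetical helper; the meaning of each code should be checked against the NASS documentation):

```python
from typing import Optional, Tuple


def parse_nass_value(raw: Optional[str]) -> Tuple[Optional[float], Optional[str]]:
    """Return (number, code): at most one is set; both are None for blanks."""
    if raw is None:
        return None, None
    text = raw.strip()
    if not text:
        return None, None
    if text.startswith("(") and text.endswith(")"):
        return None, text            # e.g. "(D)" stays a code, not a number
    try:
        return float(text.replace(",", "")), None
    except ValueError:
        return None, text            # unknown token: keep it for inspection


for raw in ["910695000", "14,503", "44.15", "(D)", ""]:
    print(repr(raw), "->", parse_nass_value(raw))
```

Keeping the code alongside the number lets a later step report how many rows were dropped for each reason, rather than collapsing every non-numeric entry into a single missing value.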


## Step 5: Test USDA Transform (Clean Data)

### Transform Step: Map API data to database format

**What this does:**
1. Maps commodity names → `commodity_code` IDs (from the `usda_commodity` table)
2. Creates `Parameter` records if they don't exist (YIELD, PRODUCTION, etc.)
3. Creates `Unit` records if they don't exist (BUSHELS, TONS, etc.)
4. Builds a single transformed DataFrame with all columns needed for both tables
5. Leaves routing to the load step, which splits the data across the two tables

**Output:** A single `transformed_data` DataFrame that the load step uses to populate:
   - `UsdaCensusRecord` table (one per geoid+year+commodity)
   - `Observation` table (one per measurement)
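
The routing in item 5 can be sketched as follows. This is an illustration only, using plain dictionaries rather than the project's models; the table names come from the `record_type` mapping in the transform cell, and `route_by_record_type` is a hypothetical helper:

```python
from collections import defaultdict
from typing import Dict, List

# Mapping taken from the transform cell: NASS source_desc -> record table name.
RECORD_TABLES = {
    "CENSUS": "usda_census_record",
    "SURVEY": "usda_survey_record",
}


def route_by_record_type(rows: List[dict]) -> Dict[str, List[dict]]:
    """Group rows by destination table; unknown sources go under 'unrouted'."""
    routed: Dict[str, List[dict]] = defaultdict(list)
    for row in rows:
        source = str(row.get("source_type", "")).upper()
        routed[RECORD_TABLES.get(source, "unrouted")].append(row)
    return dict(routed)


rows = [
    {"geoid": "06077", "source_type": "CENSUS"},
    {"geoid": "06047", "source_type": "SURVEY"},
    {"geoid": "06099", "source_type": "OTHER"},
]
for table, group in route_by_record_type(rows).items():
    print(table, len(group))
```

Collecting unmatched rows under an explicit key makes it visible when a source type other than CENSUS or SURVEY appears, instead of silently producing a null `record_type`.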

In [None]:
from sqlalchemy import text
import pandas as pd
import numpy as np
from sqlmodel import Session, select
from ca_biositing.datamodels.database import engine
from ca_biositing.datamodels.schemas.generated.ca_biositing import Parameter, Unit

print("Transform Step: Mapping API data to database schema")
print("="*70)

if 'raw_data' not in locals() or len(raw_data) == 0:
    print("⚠ No raw_data - run API extraction first")
else:
    # Print actual columns to debug
    print(f"Debug: Available columns in raw_data: {list(raw_data.columns)[:10]}...")
    
    # Define parameter/unit configurations (will be keyed by name for DB inserts)
    # Note: Keys are in CAPS for config readability, but will be lowercased when stored in DB
    PARAMETER_CONFIGS = {
        'YIELD': 'Yield per unit area',
        'PRODUCTION': 'Total production quantity',
        'AREA HARVESTED': 'Area harvested',
        'AREA PLANTED': 'Area planted',
        'PRICE RECEIVED': 'Price received by farmer',
        'PRICE PAID': 'Price paid by farmer',
    }
    
    UNIT_CONFIGS = {
        'BUSHELS': 'US bushels',
        'TONS': 'Short tons (US)',
        'ACRES': 'US acres',
        'DOLLARS': 'US dollars',
        'DOLLARS PER BUSHEL': 'US dollars per bushel',
        'DOLLARS PER TON': 'US dollars per ton',
    }
    
    # Step 1: Ensure Parameter/Unit records exist (following coworker's pattern)
    print("Step 1: Creating Parameter/Unit records if needed...")
    with Session(engine) as session:
        # Get existing parameters
        existing_params = session.exec(select(Parameter.name)).all()
        existing_param_names = set(existing_params)
        
        # Add only new parameters (lowercase names for consistency)
        params_to_add = []
        for param_name, param_desc in PARAMETER_CONFIGS.items():
            param_name_lower = param_name.lower()
            if param_name_lower not in existing_param_names:
                param = Parameter(name=param_name_lower, description=param_desc, calculated=False)
                params_to_add.append(param)
                existing_param_names.add(param_name_lower)
        
        if params_to_add:
            session.add_all(params_to_add)
            print(f"  Adding {len(params_to_add)} new parameters")
        else:
            print(f"  All {len(PARAMETER_CONFIGS)} parameters already exist")
        
        # Get existing units
        existing_units = session.exec(select(Unit.name)).all()
        existing_unit_names = set(existing_units)
        
        # Add only new units (lowercase names for consistency)
        units_to_add = []
        for unit_name, unit_desc in UNIT_CONFIGS.items():
            unit_name_lower = unit_name.lower()
            if unit_name_lower not in existing_unit_names:
                unit = Unit(name=unit_name_lower, description=unit_desc)
                units_to_add.append(unit)
                existing_unit_names.add(unit_name_lower)
        
        if units_to_add:
            session.add_all(units_to_add)
            print(f"  Adding {len(units_to_add)} new units")
        else:
            print(f"  All {len(UNIT_CONFIGS)} units already exist")
        
        # Commit only if we added anything
        if params_to_add or units_to_add:
            session.commit()
            print(f"  ✓ Committed {len(params_to_add)} parameters, {len(units_to_add)} units")
    
    # Step 2: Map commodity names to IDs from database
    print("\nStep 2: Mapping commodity names to database IDs...")
    commodity_map = {}
    with engine.connect() as conn:
        result = conn.execute(text("SELECT id, name FROM usda_commodity"))
        for row in result:
            commodity_map[row.name.upper()] = row.id
    print(f"  Found {len(commodity_map)} commodities in database")
    
    # Step 3: Look up parameter_id and unit_id from database (by name, lowercased)
    print("\nStep 3: Looking up parameter and unit IDs...")
    parameter_id_map = {}
    unit_id_map = {}
    with engine.connect() as conn:
        # Query for lowercase parameter names
        param_names_lower = [p.lower() for p in PARAMETER_CONFIGS.keys()]
        param_result = conn.execute(text("SELECT id, name FROM parameter WHERE name IN ({})".format(
            ','.join(f"'{p}'" for p in param_names_lower)
        )))
        for row in param_result:
            parameter_id_map[row.name.upper()] = row.id
        
        # Query for lowercase unit names
        unit_names_lower = [u.lower() for u in UNIT_CONFIGS.keys()]
        unit_result = conn.execute(text("SELECT id, name FROM unit WHERE name IN ({})".format(
            ','.join(f"'{u}'" for u in unit_names_lower)
        )))
        for row in unit_result:
            unit_id_map[row.name.upper()] = row.id
    print(f"  Found {len(parameter_id_map)} parameters, {len(unit_id_map)} units")
    
    # Step 4: Create single transformed dataframe
    print("\nStep 4: Creating transformed dataframe...")
    
    transformed_data = raw_data.copy()
    
    # Map NASS API columns to our schema
    column_mapping = {
        'commodity_desc': 'commodity',
        'statisticcat_desc': 'statistic',
        'unit_desc': 'unit',
        'Value': 'observation',
        'county_name': 'county',
        'short_desc': 'description',
        'year': 'year',
        # Survey-specific fields
        'freq_desc': 'survey_period',           # ANNUAL, MONTHLY, etc.
        'reference_period_desc': 'reference_month',  # MAY, END OF DEC, etc.
        'begin_code': 'begin_code',
        'end_code': 'end_code'
    }
    
    # Rename columns that exist
    rename_dict = {k: v for k, v in column_mapping.items() if k in transformed_data.columns}
    transformed_data = transformed_data.rename(columns=rename_dict)
    
    print(f"  Renamed columns: {list(rename_dict.keys())}")
    
    # Construct 5-digit FIPS geoid from state + county codes (keep as string)
    state_fips_default = '06'  # California
    if 'state_fips_code' in transformed_data.columns and 'county_code' in transformed_data.columns:
        transformed_data['geoid'] = transformed_data['state_fips_code'].astype(str).str.zfill(2) + \
                                    transformed_data['county_code'].astype(str).str.zfill(3)
    elif 'state_alpha' in transformed_data.columns and 'county_code' in transformed_data.columns:
        state_alpha_to_fips = {'CA': '06'}
        transformed_data['geoid'] = transformed_data['state_alpha'].map(
            lambda x: state_alpha_to_fips.get(str(x).upper(), state_fips_default)
        ).astype(str) + transformed_data['county_code'].astype(str).str.zfill(3)
    elif 'county_code' in transformed_data.columns:
        transformed_data['geoid'] = state_fips_default + transformed_data['county_code'].astype(str).str.zfill(3)
    else:
        print("  ⚠ Warning: 'county_code' not found; cannot construct geoid")
        transformed_data['geoid'] = None
    
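    # Zero-pad only non-null values; astype(str) on a missing geoid would yield 'None'
    has_geoid = transformed_data['geoid'].notna()
    transformed_data.loc[has_geoid, 'geoid'] = transformed_data.loc[has_geoid, 'geoid'].astype(str).str.zfill(5)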
    
    # Map commodity names to IDs → stored as commodity_code for database consistency
    def get_commodity_id(name):
        if pd.isna(name):
            return None
        if name.upper() in commodity_map:
            return commodity_map[name.upper()]
        for db_name, db_id in commodity_map.items():
            if name.upper() in db_name or db_name in name.upper():
                return db_id
        return None
    
    if 'commodity' in transformed_data.columns:
        transformed_data['commodity_code'] = transformed_data['commodity'].apply(get_commodity_id)
    else:
        print("  ⚠ Warning: 'commodity' column not found")
        transformed_data['commodity_code'] = None
    
    # Map to parameter_id and unit_id from database (by name lookup)
    if 'statistic' in transformed_data.columns:
        transformed_data['parameter_id'] = transformed_data['statistic'].map(
            lambda x: parameter_id_map.get(x.upper()) if pd.notna(x) else None
        )
    
    if 'unit' in transformed_data.columns:
        transformed_data['unit_id'] = transformed_data['unit'].map(
            lambda x: unit_id_map.get(x.upper()) if pd.notna(x) else None
        )
    
    # Add metadata columns
    transformed_data['source_reference'] = 'USDA NASS QuickStats API'
    
    # Capture source type (CENSUS vs SURVEY) for routing to correct table
    if 'source_desc' in transformed_data.columns:
        transformed_data['source_type'] = transformed_data['source_desc']
        print(f"  Captured source_type: {transformed_data['source_type'].value_counts().to_dict()}")
    else:
        print("  ⚠ Warning: source_desc not found - defaulting to CENSUS")
        transformed_data['source_type'] = 'CENSUS'
    
    # Set record_type for polymorphic relationship (table name for discriminator)
    transformed_data['record_type'] = transformed_data['source_type'].map({
        'CENSUS': 'usda_census_record',
        'SURVEY': 'usda_survey_record'
    })
    print(f"  Set record_type: {transformed_data['record_type'].value_counts().to_dict()}")
    
    # ====================================================================
    # APPLY COWORKER CLEANING FUNCTIONS (from cleaning_functions.ipynb)
    # ====================================================================
    print("\n🧹 Step 4b: Apply cleaning functions from coworker pattern...")
    
    # Import cleaning functions (functions are embedded here for notebook portability)
    # In production, these would be imported from ca_biositing.pipeline.utils.cleaning_functions
    
    def replace_empty_with_na(df, columns=None, regex=r'^\s*$'):
        """Replace empty/whitespace-only strings with NaN"""
        if columns is None:
            return df.replace(regex, np.nan, regex=True)
        df = df.copy()
        cols = [c for c in columns if c in df.columns]
        if cols:
            df[cols] = df[cols].replace(regex, np.nan, regex=True)
        return df
    
    def to_lowercase_df(df, columns=None):
        """Lowercase string columns to reduce variations (e.g., 'Corn' vs 'corn')"""
        df = df.copy()
        if columns is None:
            str_cols = df.select_dtypes(include=['object', 'string']).columns
        else:
            str_cols = [c for c in columns if c in df.columns]
        for c in str_cols:
            df[c] = df[c].astype('string').str.lower().where(df[c].notna(), df[c])
        return df
    
    # Apply cleaning: replace empty strings with NaN
    string_cols = ['commodity', 'statistic', 'unit', 'county', 'description', 'survey_period', 'reference_month']
    string_cols = [c for c in string_cols if c in transformed_data.columns]
    transformed_data = replace_empty_with_na(transformed_data, columns=string_cols)
    print(f"  ✓ Replaced empty strings with NaN in {len(string_cols)} columns")
    
    # Apply cleaning: lowercase all string columns for consistency
    transformed_data = to_lowercase_df(transformed_data, columns=string_cols)
    print(f"  ✓ Lowercased {len(string_cols)} string columns for consistency")
    
    # Convert observation strings (e.g. '14,503') to float; non-numeric codes such as '(D)' become NaN
    if 'observation' in transformed_data.columns:
        transformed_data['value_numeric'] = transformed_data['observation'].astype(str).str.replace(',', '')
        transformed_data['value_numeric'] = pd.to_numeric(transformed_data['value_numeric'], errors='coerce')
        transformed_data['value_text'] = transformed_data['observation'].astype(str)
    
    # Handle CV% field
    if 'CV (%)' in transformed_data.columns:
        transformed_data['cv_pct'] = pd.to_numeric(transformed_data['CV (%)'], errors='coerce')
    else:
        transformed_data['cv_pct'] = None
    
    # Coerce all ID columns to integers (nullable Int64 type)
    id_columns = ['commodity_code', 'parameter_id', 'unit_id']
    for col in id_columns:
        if col in transformed_data.columns:
            transformed_data[col] = pd.to_numeric(transformed_data[col], errors='coerce').astype('Int64')
    
    # Create note field
    transformed_data['note'] = transformed_data.apply(
        lambda row: f"{row.get('statistic', 'N/A')} in {row.get('unit', 'N/A')} for {row.get('commodity', 'N/A')} in {row.get('county', 'N/A')}", 
        axis=1
    )
    
    # Keep relevant columns (load step will create record_id FK)
    final_columns = [
        # Record fields (for UsdaCensusRecord/UsdaSurveyRecord)
        'geoid', 'year', 'commodity_code', 'source_reference', 'source_type', 'record_type',
        # Survey-specific fields
        'survey_period', 'reference_month', 'begin_code', 'end_code',
        # Observation fields
        'parameter_id', 'value_numeric', 'value_text', 'cv_pct', 'unit_id', 'note',
        # Original for reference
        'commodity', 'statistic', 'unit', 'county', 'description'
    ]
    
    # Only include columns that exist
    final_columns = [col for col in final_columns if col in transformed_data.columns]
    transformed_data = transformed_data[final_columns]
    
    # Drop rows with missing required fields
    required_fields = ['geoid', 'year', 'commodity_code', 'parameter_id', 'unit_id', 'value_numeric']
    required_fields = [col for col in required_fields if col in transformed_data.columns]
    transformed_data = transformed_data.dropna(subset=required_fields)
    
    print(f"\n✓ Transform complete!")
    print(f"  Total rows: {len(transformed_data)}")
    print(f"  Columns: {list(transformed_data.columns)}")
    
    # Show data types for ID columns
    print(f"\nData types for ID columns:")
    for col in ['commodity_code', 'parameter_id', 'unit_id', 'value_numeric']:
        if col in transformed_data.columns:
            print(f"  {col}: {transformed_data[col].dtype}")
    
    # Show survey-specific fields captured
    print(f"\nSurvey-specific fields captured:")
    for col in ['survey_period', 'reference_month']:
        if col in transformed_data.columns:
            unique_vals = transformed_data[col].dropna().unique()
            print(f"  {col}: {len(unique_vals)} unique values - {unique_vals[:5].tolist()}")
    
    print(f"\nSample record:")
    if len(transformed_data) > 0:
        sample = transformed_data.head(1).to_dict('records')[0]
        for key, val in sample.items():
            print(f"  {key}: {val} (type: {type(val).__name__})")
    else:
        print("  ⚠ No valid records after transformation")

Transform Step: Mapping API data to database schema
Debug: Available columns in raw_data: ['year', 'reference_period_desc', 'county_name', 'unit_desc', 'freq_desc', 'Value', 'state_ansi', 'watershed_desc', 'agg_level_desc', 'prodn_practice_desc']...
Step 1: Creating Parameter/Unit records if needed...
  All 6 parameters already exist
  All 6 units already exist

Step 2: Mapping commodity names to database IDs...
  Found 4 commodities in database

Step 3: Looking up parameter and unit IDs...
  Found 6 parameters, 6 units

Step 4: Creating transformed dataframe...
  Renamed columns: ['commodity_desc', 'statisticcat_desc', 'unit_desc', 'Value', 'county_name', 'short_desc', 'year', 'freq_desc', 'reference_period_desc', 'begin_code', 'end_code']
  Captured source_type: {'CENSUS': 6496, 'SURVEY': 68}
  Set record_type: {'usda_census_record': 6496, 'usda_survey_record': 68}

🧹 Step 4b: Apply cleaning functions from coworker pattern...
  ✓ Replaced empty strings with NaN in 7 columns
  ‚ú
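
Step 3 of the transform cell builds its `IN (...)` lists with string formatting. That is safe there because the names come from constants in the notebook, but bound parameters avoid quoting problems if the names ever come from data. A minimal sketch of the idea using the standard-library `sqlite3` driver and an in-memory table (a stand-in for illustration, not the project's PostgreSQL/SQLAlchemy setup; `lookup_ids` and the table contents are made up):

```python
import sqlite3
from typing import Dict, List


def lookup_ids(conn: sqlite3.Connection, names: List[str]) -> Dict[str, int]:
    """Fetch {NAME: id} for the given names using one '?' placeholder per value."""
    if not names:
        return {}
    placeholders = ",".join("?" for _ in names)
    # Only the placeholder string is formatted in; values are bound by the driver.
    sql = f"SELECT id, name FROM parameter WHERE name IN ({placeholders})"
    return {name.upper(): row_id for row_id, name in conn.execute(sql, names)}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parameter (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO parameter (name) VALUES (?)",
    [("yield",), ("production",), ("farmer's price",)],
)
print(lookup_ids(conn, ["yield", "farmer's price"]))
```

The name containing an apostrophe would break a hand-quoted `IN ('...')` list but passes through a bound parameter unchanged. In SQLAlchemy, an expanding `bindparam` plays the same role for `text()` queries.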

In [151]:
# Display transformed_data in Data Wrangler
print("Preparing to display transformed_data in Data Wrangler...")
print(f"Shape: {transformed_data.shape}")
print(f"\nPreview (first 5 rows):")
print(transformed_data.head().to_string())

# The Data Wrangler will be opened with the variable below
transformed_data

Preparing to display transformed_data in Data Wrangler...
Shape: (69, 21)

Preview (first 5 rows):
     geoid  year  commodity_code          source_reference source_type         record_type survey_period reference_month begin_code end_code  parameter_id  value_numeric value_text  cv_pct  unit_id                                             note commodity       statistic   unit       county                                  description
310  06077  2022               1  USDA NASS QuickStats API      CENSUS  usda_census_record        annual            year         00       00            19        14503.0     14,503    62.8        6  area harvested in acres for corn in san joaquin      corn  area harvested  acres  san joaquin                corn, grain - acres harvested
318  06077  2022               1  USDA NASS QuickStats API      CENSUS  usda_census_record        annual            year         00       00            19        51836.0     51,836    22.0        6  area harvested in acres fo

Unnamed: 0,geoid,year,commodity_code,source_reference,source_type,record_type,survey_period,reference_month,begin_code,end_code,parameter_id,value_numeric,value_text,cv_pct,unit_id,note,commodity,statistic,unit,county,description
310,06077,2022,1,USDA NASS QuickStats API,CENSUS,usda_census_record,annual,year,00,00,19,14503.0,14503,62.8,6,area harvested in acres for corn in san joaquin,corn,area harvested,acres,san joaquin,"corn, grain - acres harvested"
318,06077,2022,1,USDA NASS QuickStats API,CENSUS,usda_census_record,annual,year,00,00,19,51836.0,51836,22.0,6,area harvested in acres for corn in san joaquin,corn,area harvested,acres,san joaquin,"corn, silage - acres harvested"
326,06077,2022,1,USDA NASS QuickStats API,CENSUS,usda_census_record,annual,year,00,00,18,1345187.0,1345187,21.1,5,production in tons for corn in san joaquin,corn,production,tons,san joaquin,"corn, silage - production, measured in tons"
327,06077,2022,1,USDA NASS QuickStats API,CENSUS,usda_census_record,annual,year,00,00,19,14503.0,14503,63.5,6,area harvested in acres for corn in san joaquin,corn,area harvested,acres,san joaquin,"corn, grain, irrigated - acres harvested"
329,06077,2022,1,USDA NASS QuickStats API,CENSUS,usda_census_record,annual,year,00,00,19,51644.0,51644,21.8,6,area harvested in acres for corn in san joaquin,corn,area harvested,acres,san joaquin,"corn, silage, irrigated - acres harvested"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6525,06047,2022,1,USDA NASS QuickStats API,SURVEY,usda_survey_record,annual,year,00,00,19,63200.0,63200,11.2,6,area harvested in acres for corn in merced,corn,area harvested,acres,merced,"corn, silage - acres harvested"
6526,06047,2022,1,USDA NASS QuickStats API,SURVEY,usda_survey_record,annual,year,00,00,18,1612000.0,1612000,12.9,5,production in tons for corn in merced,corn,production,tons,merced,"corn, silage - production, measured in tons"
6558,06047,2022,4,USDA NASS QuickStats API,SURVEY,usda_survey_record,annual,year,00,00,19,24600.0,24600,,6,area harvested in acres for tomatoes in merced,tomatoes,area harvested,acres,merced,"tomatoes, in the open, processing - acres harv..."
6559,06047,2022,4,USDA NASS QuickStats API,SURVEY,usda_survey_record,annual,year,00,00,21,24700.0,24700,,6,area planted in acres for tomatoes in merced,tomatoes,area planted,acres,merced,"tomatoes, in the open, processing - acres planted"
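
The `geoid` values above (`06077`, `06047`) are five-character strings with a leading zero, which is lost if the column is ever cast to an integer. A minimal sketch of building such an identifier from a 2-digit state code and a 3-digit county code follows. The `build_geoid` helper and the `county_ansi` input name are assumptions for illustration; only `state_ansi` appears in the raw column list shown earlier, and the pipeline's own implementation is not shown in this notebook.

```python
def build_geoid(state_ansi, county_ansi) -> str:
    """Return a 5-character GEOID: 2-digit state code + 3-digit county code.

    Hypothetical helper; accepts ints or strings and restores leading zeros.
    """
    state = str(state_ansi).strip().zfill(2)
    county = str(county_ansi).strip().zfill(3)
    return state + county


# California (06) + San Joaquin (077) and Merced (047), as seen in the table above
print(build_geoid("06", "077"))
print(build_geoid(6, 47))
```

Keeping the result as a string (rather than an integer) is what preserves the leading zero through the load step.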


In [152]:

print("="*80)
print("🔍 DIAGNOSTIC: Data Reduction Breakdown")
print("="*80)

if 'raw_data' in locals() and 'transformed_data' in locals():
    print(f"\nStarting point:")
    print(f"  raw_data rows: {len(raw_data)}")
    
    # Check commodity mapping
    commodity_valid = transformed_data['commodity_code'].notna().sum()
    commodity_invalid = transformed_data['commodity_code'].isna().sum()
    print(f"\n1️⃣  After commodity mapping:")
    print(f"  Valid commodity_code: {commodity_valid} ({100*commodity_valid/len(transformed_data):.1f}%)")
    print(f"  Invalid/NULL commodity_code: {commodity_invalid} ({100*commodity_invalid/len(transformed_data):.1f}%)")
    
    # Check parameter mapping
    parameter_valid = transformed_data['parameter_id'].notna().sum()
    parameter_invalid = transformed_data['parameter_id'].isna().sum()
    print(f"\n2️⃣  After parameter mapping:")
    print(f"  Valid parameter_id: {parameter_valid} ({100*parameter_valid/len(transformed_data):.1f}%)")
    print(f"  Invalid/NULL parameter_id: {parameter_invalid} ({100*parameter_invalid/len(transformed_data):.1f}%)")
    
    # Check unit mapping
    unit_valid = transformed_data['unit_id'].notna().sum()
    unit_invalid = transformed_data['unit_id'].isna().sum()
    print(f"\n3️⃣  After unit mapping:")
    print(f"  Valid unit_id: {unit_valid} ({100*unit_valid/len(transformed_data):.1f}%)")
    print(f"  Invalid/NULL unit_id: {unit_invalid} ({100*unit_invalid/len(transformed_data):.1f}%)")
    
    # Check value conversion
    value_valid = transformed_data['value_numeric'].notna().sum()
    value_invalid = transformed_data['value_numeric'].isna().sum()
    print(f"\n4️⃣  After value_numeric conversion:")
    print(f"  Valid value_numeric: {value_valid} ({100*value_valid/len(transformed_data):.1f}%)")
    print(f"  Invalid/NULL value_numeric: {value_invalid} ({100*value_invalid/len(transformed_data):.1f}%)")
    
    # Check all required fields
    required_fields = ['geoid', 'year', 'commodity_code', 'parameter_id', 'unit_id', 'value_numeric']
    all_valid = transformed_data.dropna(subset=required_fields)
    print(f"\n5️⃣  After dropna (ALL required fields non-NULL):")
    print(f"  Records with all fields: {len(all_valid)} ({100*len(all_valid)/len(transformed_data):.1f}%)")
    print(f"  Records dropped: {len(transformed_data) - len(all_valid)}")
    
    # Show which single filter is most aggressive
    print(f"\n📊 Most aggressive single filters:")
    print(f"  Commodity mapping loses: {commodity_invalid} records ({100*commodity_invalid/len(raw_data):.1f}% of original)")
    print(f"  Parameter mapping loses: {parameter_invalid} records ({100*parameter_invalid/len(raw_data):.1f}% of original)")
    print(f"  Unit mapping loses: {unit_invalid} records ({100*unit_invalid/len(raw_data):.1f}% of original)")
    print(f"  Value conversion loses: {value_invalid} records ({100*value_invalid/len(raw_data):.1f}% of original)")
    
    # Show some examples of dropped records
    print(f"\n📋 Sample of DROPPED records (missing required fields):")
    dropped = transformed_data[transformed_data[required_fields].isna().any(axis=1)]
    print(f"  Total dropped: {len(dropped)}")
    if len(dropped) > 0:
        print(f"\n  Reasons for dropping (top 5):")
        reasons = []
        for _, row in dropped.head(5).iterrows():
            missing = []
            if pd.isna(row['commodity_code']):
                missing.append("commodity_code")
            if pd.isna(row['parameter_id']):
                missing.append("parameter_id")
            if pd.isna(row['unit_id']):
                missing.append("unit_id")
            if pd.isna(row['value_numeric']):
                missing.append("value_numeric")
            reasons.append(missing)
            print(f"    - {row.get('commodity', 'N/A')} / {row.get('statistic', 'N/A')} / {row.get('unit', 'N/A')}: Missing {', '.join(missing)}")
    
    print(f"\n{'='*80}")
else:
    print("⚠️  raw_data or transformed_data not available - run Extract and Transform first")


🔍 DIAGNOSTIC: Data Reduction Breakdown

Starting point:
  raw_data rows: 6564

1️⃣  After commodity mapping:
  Valid commodity_code: 69 (100.0%)
  Invalid/NULL commodity_code: 0 (0.0%)

2️⃣  After parameter mapping:
  Valid parameter_id: 69 (100.0%)
  Invalid/NULL parameter_id: 0 (0.0%)

3️⃣  After unit mapping:
  Valid unit_id: 69 (100.0%)
  Invalid/NULL unit_id: 0 (0.0%)

4️⃣  After value_numeric conversion:
  Valid value_numeric: 69 (100.0%)
  Invalid/NULL value_numeric: 0 (0.0%)

5️⃣  After dropna (ALL required fields non-NULL):
  Records with all fields: 69 (100.0%)
  Records dropped: 0

📊 Most aggressive single filters:
  Commodity mapping loses: 0 records (0.0% of original)
  Parameter mapping loses: 0 records (0.0% of original)
  Unit mapping loses: 0 records (0.0% of original)
  Value conversion loses: 0 records (0.0% of original)

📋 Sample of DROPPED records (missing required fields):
  Total dropped: 0
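
The diagnostic above inspects `transformed_data` after it has already been filtered, so every check reports 100% even though the row count fell from 6564 to 69. One way to see where rows disappear is to apply each filter in sequence to the raw rows and record the count after every stage. The sketch below does this with plain dicts standing in for DataFrame rows. The `reduction_funnel` helper, the stage names, the county set, and the sample rows are all hypothetical; the filters the transform step really applies are not shown in this notebook.

```python
from typing import Callable, Iterable

Row = dict
Stage = tuple[str, Callable[[Row], bool]]


def reduction_funnel(rows: Iterable[Row], stages: list[Stage]) -> list[tuple[str, int, int]]:
    """Apply each filter in order and return (stage name, rows kept, rows lost)."""
    current = list(rows)
    report = []
    for name, keep in stages:
        survivors = [row for row in current if keep(row)]
        report.append((name, len(survivors), len(current) - len(survivors)))
        current = survivors  # the next stage only sees rows that survived this one
    return report


# Hypothetical stand-ins for the real transform filters
TARGET_COUNTIES = {"SAN JOAQUIN", "MERCED"}
sample_rows = [
    {"county_name": "SAN JOAQUIN", "Value": "14,503"},
    {"county_name": "MERCED", "Value": "                 (D)"},
    {"county_name": "FRESNO", "Value": "560"},
    {"county_name": "MERCED", "Value": "63,200"},
]
stages = [
    ("county in target set", lambda r: r["county_name"] in TARGET_COUNTIES),
    ("Value is numeric", lambda r: r["Value"].replace(",", "").strip().isdigit()),
]

for name, kept, lost in reduction_funnel(sample_rows, stages):
    print(f"{name}: kept {kept}, lost {lost}")
```

Because each stage runs on the survivors of the previous one, the "lost" counts add up to the total reduction, which the per-column `notna()` checks on the final frame cannot show.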



In [117]:
# Check actual data types from NASS API for Value column
print("Investigating NASS API data types:")
print("="*60)

# 1) Raw Value dtype and sample values
print("\n1. Raw 'Value' column dtype and samples:")
try:
    raw_dtype = raw_data['Value'].dtype
    print(f"  Raw dtype: {raw_dtype}")
    print(f"  Sample values (first 10):")
    for idx, val in enumerate(raw_data['Value'].head(10)):
        print(f"    [{idx}] {repr(val)} (type: {type(val).__name__})")
except Exception as e:
    print(f"  ⚠ Unable to inspect raw_data['Value']: {e}")

# 2) String formatting patterns: commas, decimals, whitespace
print("\n2. String formatting patterns in 'Value':")
try:
    value_str = raw_data['Value'].astype(str)
    has_commas = value_str.str.contains(',').sum()
    has_decimals = value_str.str.contains(r'\.').sum()
    has_whitespace = value_str.str.contains(r'\s').sum()
    total = len(value_str)
    print(f"  With commas: {has_commas}/{total}")
    print(f"  With decimal point: {has_decimals}/{total}")
    print(f"  With whitespace: {has_whitespace}/{total}")
except Exception as e:
    print(f"  ⚠ Unable to analyze string patterns: {e}")

# 3) Coerce to numeric: remove commas, convert to float
print("\n3. Coercion to numeric float (remove commas, handle decimals):")
try:
    value_num = pd.to_numeric(value_str.str.replace(',', ''), errors='coerce')
    non_null = value_num.notna().sum()
    nulls = value_num.isna().sum()
    pct_numeric = round(100 * non_null / (non_null + nulls), 2) if (non_null + nulls) > 0 else 0.0
    print(f"  Converted dtype: {value_num.dtype}")
    print(f"  Numeric rows: {non_null}, Non-numeric (NaN): {nulls}, % numeric: {pct_numeric}%")
    if non_null > 0:
        print(f"  Range: min={value_num.min()}, max={value_num.max()}")
    # Show a few rows that failed conversion, if any
    if nulls > 0:
        failed_samples = value_str[value_num.isna()].head(5).tolist()
        print(f"  Samples that failed conversion: {failed_samples}")
except Exception as e:
    print(f"  ⚠ Unable to convert 'Value' to numeric: {e}")


Investigating NASS API data types:

1. Raw 'Value' column dtype and samples:
  Raw dtype: object
  Sample values (first 10):
    [0] '910,695,000' (type: str)
    [1] '560' (type: str)
    [2] '                 (D)' (type: str)
    [3] '2' (type: str)
    [4] '                 (D)' (type: str)
    [5] '1' (type: str)
    [6] '                 (D)' (type: str)
    [7] '1' (type: str)
    [8] '                 (D)' (type: str)
    [9] '1' (type: str)

2. String formatting patterns in 'Value':
  With commas: 2043/6564
  With decimal point: 30/6564
  With whitespace: 785/6564

3. Coercion to numeric float (remove commas, handle decimals):
  Converted dtype: float64
  Numeric rows: 5779, Non-numeric (NaN): 785, % numeric: 88.04%
  Range: min=-999000.0, max=17806949000.0
  Samples that failed conversion: ['                 (D)', '                 (D)', '                 (D)', '                 (D)', '                 (D)']
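
The 785 rows that fail conversion are the same 785 that contain whitespace: they hold the padded code `(D)`, which NASS uses for values withheld to avoid disclosing data for individual operations. Coercing them to NaN discards the reason the value is missing. A minimal sketch of a parser that keeps the code alongside the number is shown below. The `parse_nass_value` helper is hypothetical and is not the pipeline's own cleaning function; it treats any parenthesised run of capital letters as a footnote code.

```python
import re
from typing import Optional, Tuple

# Matches footnote codes such as "(D)" or "(Z)" once padding is stripped
_FOOTNOTE = re.compile(r"^\(([A-Z]+)\)$")


def parse_nass_value(raw: str) -> Tuple[Optional[float], Optional[str]]:
    """Return (number, None) for numeric strings and (None, code) for footnote codes.

    Anything else (for example an empty string) yields (None, None).
    """
    text = raw.strip()
    match = _FOOTNOTE.match(text)
    if match:
        return None, match.group(1)
    try:
        # Thousands separators are removed before conversion
        return float(text.replace(",", "")), None
    except ValueError:
        return None, None


for sample in ["910,695,000", "560", "                 (D)"]:
    print(repr(sample), "->", parse_nass_value(sample))
```

Storing the code in a separate column would let later queries distinguish suppressed values from genuinely unparseable ones.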


## Step 6: Test USDA Load (Insert to Database)

### Reset database to test load code (optional)


In [None]:
print("="*80)
print("⚠️  CLEANUP: Delete USDA data from database (for fresh testing)")
print("="*80)

from sqlalchemy import text

# First, check how many USDA observations exist
with engine.connect() as conn:
    usda_obs_count = conn.execute(text("""
        SELECT COUNT(*) FROM observation 
        WHERE record_type IN ('usda_census_record', 'usda_survey_record')
    """)).scalar()
    
    census_count = conn.execute(text("SELECT COUNT(*) FROM usda_census_record")).scalar()
    survey_count = conn.execute(text("SELECT COUNT(*) FROM usda_survey_record")).scalar()

print(f"\n📊 USDA data in database:")
print(f"  - USDA observations (record_type='usda_census_record'/'usda_survey_record'): {usda_obs_count}")
print(f"  - Census records: {census_count}")
print(f"  - Survey records: {survey_count}")

if usda_obs_count == 0 and census_count == 0 and survey_count == 0:
    print("\n✓ No USDA data to delete - database is already clean")
else:
    confirm = input(f"\n⚠️  Delete the {usda_obs_count} USDA observations, {census_count} census records, and {survey_count} survey records?\nType 'YES' to confirm, anything else to cancel: ").strip()
    
    if confirm == 'YES':
        with engine.begin() as conn:
            print("\n🗑️  Deleting USDA observations (record_type='usda_census_record' or 'usda_survey_record')...")
            # Plain DELETE (not TRUNCATE) so that non-USDA observations are preserved
            result = conn.execute(text("""
                DELETE FROM observation 
                WHERE record_type IN ('usda_census_record', 'usda_survey_record')
            """))
            print(f"  ✓ Deleted {result.rowcount} USDA observations")
            
            # RESTART IDENTITY is needed to reset the ID sequence; PostgreSQL's default is CONTINUE IDENTITY.
            # CASCADE also truncates every table that has a foreign key to the truncated table.
            print("🗑️  Truncating survey records (resets auto-increment)...")
            conn.execute(text("TRUNCATE TABLE usda_survey_record RESTART IDENTITY CASCADE"))
            print(f"  ✓ Truncated usda_survey_record and reset sequence")
            
            print("🗑️  Truncating census records (resets auto-increment)...")
            conn.execute(text("TRUNCATE TABLE usda_census_record RESTART IDENTITY CASCADE"))
            print(f"  ✓ Truncated usda_census_record and reset sequence")
        
        print("\n✅ CLEANUP COMPLETE - USDA data removed, ID sequences reset, other data preserved")
    else:
        print("\n❌ Cleanup cancelled")


⚠️  CLEANUP: Delete USDA data from database (for fresh testing)

📊 USDA data in database:
  - USDA observations (record_type='usda_census_record'/'usda_survey_record'): 182
  - Census records: 46
  - Survey records: 13



🗑️  Deleting USDA observations only (record_type='usda_census_record' or 'usda_survey_record')...
  ✓ Deleted 182 USDA observations
🗑️  Deleting survey records...
  ✓ Deleted 13 survey records
🗑️  Deleting census records...
  ✓ Deleted 46 census records

✅ CLEANUP COMPLETE - USDA data removed, other data preserved


### Load to database

In [None]:

print(f"  ✅ Inserted {obs_inserted} observations")

# ============================================================================
# STEP 7: Link legacy records to datasets (backfill any that were created before)
# ============================================================================
print("\n💾 STEP 7: Link legacy records to datasets...")

# Build reproducible mapping from dataset table
dataset_link_map = {}
with engine.connect() as conn:
    result = conn.execute(text("""
        SELECT id, name FROM dataset 
        WHERE name LIKE 'USDA_%'
        ORDER BY id
    """))
    for row in result:
        dataset_id, name = row
        try:
            year = int(name.split('_')[-1])
            source_type = 'CENSUS' if 'CENSUS' in name else 'SURVEY'
            dataset_link_map[(year, source_type)] = dataset_id
        except (ValueError, IndexError):
            pass

# Link census records (bound parameters instead of f-string interpolation)
with engine.begin() as conn:
    for (year, source_type), dataset_id in dataset_link_map.items():
        if source_type == 'CENSUS':
            result = conn.execute(
                text("""
                UPDATE usda_census_record 
                SET dataset_id = :dataset_id
                WHERE year = :year AND dataset_id IS NULL
                """),
                {"dataset_id": dataset_id, "year": year},
            )
            if result.rowcount > 0:
                print(f"  Linked {result.rowcount} legacy census records for year {year}")

# Link survey records (same pattern as census)
with engine.begin() as conn:
    for (year, source_type), dataset_id in dataset_link_map.items():
        if source_type == 'SURVEY':
            result = conn.execute(
                text("""
                UPDATE usda_survey_record 
                SET dataset_id = :dataset_id
                WHERE year = :year AND dataset_id IS NULL
                """),
                {"dataset_id": dataset_id, "year": year},
            )
            if result.rowcount > 0:
                print(f"  Linked {result.rowcount} legacy survey records for year {year}")

# Link observations
with engine.begin() as conn:
    result = conn.execute(text("""
        UPDATE observation 
        SET dataset_id = d.id
        FROM dataset d
        WHERE observation.dataset_id IS NULL
        AND observation.record_type = 'usda_census_record'
        AND d.name LIKE 'USDA_CENSUS_%'
    """))
    if result.rowcount > 0:
        print(f"  Linked {result.rowcount} legacy census observations")
    
    result = conn.execute(text("""
        UPDATE observation 
        SET dataset_id = d.id
        FROM dataset d
        WHERE observation.dataset_id IS NULL
        AND observation.record_type = 'usda_survey_record'
        AND d.name LIKE 'USDA_SURVEY_%'
    """))
    if result.rowcount > 0:
        print(f"  Linked {result.rowcount} legacy survey observations")

print("\n" + "="*80)
print("✅ LOAD COMPLETE")
print(f"Census Records:    0 inserted, {census_skipped} skipped (linked to dataset_id)")
print(f"Survey Records:    0 inserted, {survey_skipped} skipped (linked to dataset_id)")
print(f"Observations:      {obs_inserted} inserted (all with dataset_id)")
print("="*80)


USDA DATA LOAD - Extract → Transform → Load

📦 STEP 0: Ensure Source and Dataset entries exist...
  ✓ Source 'USDA NASS API' already exists (ID: 1)

  Found 2 dataset configurations needed:
    ✓ Dataset 'USDA_CENSUS_2022' exists (ID: 2)
    ✓ All datasets ready. Mapping: {(2022, 'CENSUS'): 2}
    ✓ Dataset 'USDA_SURVEY_2022' exists (ID: 3)
    ✓ All datasets ready. Mapping: {(2022, 'CENSUS'): 2, (2022, 'SURVEY'): 3}

📦 STEP 1: Load existing records from database...
  ✓ Found 9 census records
  ✓ Found 7 survey records

📋 STEP 2: Prepare new records for insertion...
  Census: 0 new, 52 already exist
  Survey: 0 new, 17 already exist

💾 STEP 3-6: Insert observations...


  conn.execute(insert(Observation).values(obs_records))


CompileError: Unconsumed column names: geoid, value_numeric
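
SQLAlchemy raises `CompileError: Unconsumed column names` when a dict passed to `insert(...).values(...)` contains keys that are not columns of the target table. Here `geoid` and `value_numeric` come from the transformed DataFrame but evidently are not columns of the `observation` table. A minimal sketch of checking records against an allowed column list before inserting is shown below. The `split_unknown_keys` helper and the `OBSERVATION_COLUMNS` list are hypothetical, since the real schema is not shown in this notebook; if `Observation` is a declarative model, `Observation.__table__.columns.keys()` could supply the list instead.

```python
def split_unknown_keys(records, allowed_columns):
    """Return (cleaned records, sorted list of keys not in allowed_columns)."""
    allowed = set(allowed_columns)
    cleaned = [{k: v for k, v in rec.items() if k in allowed} for rec in records]
    unknown = sorted({k for rec in records for k in rec if k not in allowed})
    return cleaned, unknown


# Hypothetical column list, for illustration only
OBSERVATION_COLUMNS = ["record_id", "record_type", "dataset_id", "parameter_id", "unit_id", "value"]

obs_records = [
    {"record_id": 1, "record_type": "usda_census_record", "geoid": "06077", "value_numeric": 14503.0},
]
cleaned, unknown = split_unknown_keys(obs_records, OBSERVATION_COLUMNS)
print("Unknown keys:", unknown)
print("Cleaned:", cleaned)
```

Silently dropping keys hides the mismatch, so it may be preferable to rename them to the real column names (or raise on a non-empty `unknown` list) once the intended mapping is known.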

## Step 7: Verification

In [155]:
print("="*80)
print("✅ VERIFICATION: Check data in database")
print("="*80)

with engine.connect() as conn:
    # Total counts
    census_count = conn.execute(text("SELECT COUNT(*) FROM usda_census_record")).scalar()
    survey_count = conn.execute(text("SELECT COUNT(*) FROM usda_survey_record")).scalar()
    obs_count = conn.execute(text("SELECT COUNT(*) FROM observation")).scalar()
    
    print(f"\n📊 Total records in database:")
    print(f"  Census records: {census_count}")
    print(f"  Survey records: {survey_count}")
    print(f"  Observations:   {obs_count}")
    
    # Check timestamp coverage
    obs_with_timestamps = conn.execute(text("""
        SELECT COUNT(created_at), COUNT(updated_at) 
        FROM observation
    """)).fetchone()
    
    print(f"\n⏱️  Observation timestamps:")
    print(f"  With created_at: {obs_with_timestamps[0]}")
    print(f"  With updated_at: {obs_with_timestamps[1]}")
    
    # Show sample of newest observations with timestamps
    print(f"\n📋 Sample of newest observations (with timestamps):")
    result = conn.execute(text("""
        SELECT id, record_id, created_at, updated_at
        FROM observation
        WHERE created_at IS NOT NULL
        ORDER BY id DESC LIMIT 3
    """))
    for row in result:
        print(f"  ID {row[0]}: created={row[2]}, updated={row[3]}")

✅ VERIFICATION: Check data in database

📊 Total records in database:
  Census records: 52
  Survey records: 13
  Observations:   121

⏱️  Observation timestamps:
  With created_at: 121
  With updated_at: 121

📋 Sample of newest observations (with timestamps):
  ID 303: created=2026-01-27 23:46:51.155016, updated=2026-01-27 23:46:51.155016
  ID 302: created=2026-01-27 23:46:51.155016, updated=2026-01-27 23:46:51.155016
  ID 301: created=2026-01-27 23:46:51.155016, updated=2026-01-27 23:46:51.155016
