# CMIP6 ET₀ Calculation - Reference Evapotranspiration

## Overview

This notebook calculates **daily reference evapotranspiration (ET₀)** from CMIP6 climate data using the **FAO-56 Penman-Monteith** method.

## Purpose

ET₀ represents atmospheric evaporative demand: a physically based evaporation proxy calculated from CMIP6 meteorological variables. It can then serve as the input for calibration to pan evaporation in a separate process.

## Method

From CMIP6 variables:
- **tasmax, tasmin** → temperature (used to calculate mean temperature)
- **hurs** → mean relative humidity (used to calculate vapor pressure)
- **rsds** → incoming solar radiation (converted to MJ/m²/day)
- **sfcWind** → wind speed at 10m (adjusted to 2m height)

These variables are used to compute **reference evapotranspiration (ET₀)** using the **FAO-56 Penman-Monteith** formulation.

**ET₀ Definition:**
- ET₀ (mm/day) = physically based evaporation driven by radiation + temperature + humidity + wind
- ET₀ represents the atmospheric evaporative demand
- ET₀ is derived entirely from a single model's own variables, so it stays internally consistent with that model's climate
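
As a rough illustration of how these inputs combine, the sketch below evaluates the formulation for one hypothetical day using only the standard library. The input values are made up for illustration, and the function names are not part of the notebook. The simplifications (net radiation taken as 0.77 × Rs, zero soil heat flux, a sea-level psychrometric constant, and a log-profile wind adjustment with z0 = 0.0002 m) mirror those used in the calculation functions later in this notebook.

```python
import math

def saturation_vapor_pressure(t_c):
    """Saturation vapor pressure (kPa) at air temperature t_c (°C)."""
    return 0.611 * math.exp(17.27 * t_c / (t_c + 237.3))

def eto_single_day(tasmax, tasmin, hurs, rsds_wm2, wind_10m):
    """ET0 (mm/day) for one day, using the notebook's simplifications."""
    tmean = (tasmax + tasmin) / 2.0                      # mean temperature (°C)
    es = saturation_vapor_pressure(tmean)                # saturation vapor pressure (kPa)
    ea = es * hurs / 100.0                               # actual vapor pressure (kPa)
    delta = 4098.0 * es / (tmean + 237.3) ** 2           # slope of the curve (kPa/°C)
    gamma = 0.665e-3 * 101.3                             # psychrometric constant at sea level
    rn = 0.77 * rsds_wm2 * 0.0864                        # simplified net radiation (MJ/m²/day)
    z0 = 0.0002                                          # roughness length (m)
    u2 = wind_10m * math.log(2.0 / z0) / math.log(10.0 / z0)  # wind at 2 m (m/s)
    numerator = 0.408 * delta * rn + gamma * (900.0 / (tmean + 273.0)) * u2 * (es - ea)
    denominator = delta + gamma * (1.0 + 0.34 * u2)
    return max(numerator / denominator, 0.0)

# Hypothetical warm, dry day (illustrative values only)
eto = eto_single_day(tasmax=30.0, tasmin=15.0, hurs=50.0, rsds_wm2=250.0, wind_10m=3.0)
print(f"ET0 = {eto:.2f} mm/day")
```

Raising the humidity or lowering the radiation in the call above reduces the result, which is a quick way to sanity-check the sign of each term.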

## Input Variables (CMIP6)

- **tasmax**: Daily maximum temperature (°C) - required
- **tasmin**: Daily minimum temperature (°C) - required
- **hurs**: Daily mean relative humidity (%) - required
- **rsds**: Surface downwelling shortwave radiation (W/m²) - required
- **sfcWind**: Wind speed at 10m (m/s) - required
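
The units listed above matter because the Penman-Monteith formula expects radiation as a daily energy total rather than a mean flux. The sketch below shows the conversion; the function names are hypothetical. The kelvin helper is included as an assumption only: raw CMIP6 temperature fields are commonly distributed in kelvin, whereas this notebook expects °C, so inputs may need converting before use.

```python
SECONDS_PER_DAY = 86_400

def wm2_to_mj_per_m2_day(rsds_wm2):
    """Convert a daily-mean flux (W/m²) to a daily total (MJ/m²/day)."""
    # 1 W = 1 J/s, so multiply by seconds per day and divide by 1e6 J per MJ.
    # This is the same factor as the 0.0864 used in the calculation cell.
    return rsds_wm2 * SECONDS_PER_DAY / 1e6

def kelvin_to_celsius(temp_k):
    """Convert a temperature from kelvin to degrees Celsius."""
    return temp_k - 273.15

print(wm2_to_mj_per_m2_day(250.0))   # daily total for a 250 W/m² mean flux
print(kelvin_to_celsius(303.15))     # a 303.15 K maximum temperature in °C
```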

## Output

- **eto**: Daily reference evapotranspiration (mm/day) - calculated using FAO-56 Penman-Monteith

## Section 1: Imports and Configuration

In [1]:
import pandas as pd
import numpy as np
import xarray as xr
import glob
import os
import time
import re
import math
from pathlib import Path
from datetime import datetime
from tqdm import tqdm

print("Libraries imported successfully")

Libraries imported successfully


## Section 2.5: Caching Functions

To optimize performance, extracted NetCDF data is cached to CSV files. On subsequent runs, 
cached data is loaded automatically instead of re-extracting from NetCDF files.

**Cache Location:** Cached files are saved in the output directory with naming convention:
`{model_scenario}_{lat_str}_{lon_str}_{variable}.csv`

**Cache Behavior:**
- If cache exists and is valid → Load from cache (fast)
- If cache doesn't exist or is invalid → Extract from NetCDF files and save to cache
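
The cache behavior above amounts to a load-or-extract pattern. The sketch below shows it in isolation, with a stand-in function in place of the slow NetCDF extraction; `load_or_extract` and `fake_extract` are hypothetical names, not functions defined in this notebook, and the data values are invented.

```python
import os
import tempfile
import pandas as pd

def load_or_extract(cache_path, extract_fn):
    """Return (DataFrame, source): use a valid cache file, else extract and cache."""
    if os.path.exists(cache_path):
        df = pd.read_csv(cache_path)
        # A cache is treated as valid only if it has both columns and some rows
        if {"date", "value"}.issubset(df.columns) and len(df) > 0:
            df["date"] = pd.to_datetime(df["date"])
            return df, "cache"
    df = extract_fn()
    df[["date", "value"]].to_csv(cache_path, index=False)
    return df, "extracted"

def fake_extract():
    # Stand-in for the NetCDF extraction step (hypothetical data)
    return pd.DataFrame({
        "date": pd.date_range("2035-01-01", periods=3, freq="D"),
        "value": [30.1, 31.4, 29.8],
    })

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "ACCESS_CM2_SSP245_-31.75_117.60_tasmax.csv")
    _, first_source = load_or_extract(path, fake_extract)           # no cache yet
    cached_df, second_source = load_or_extract(path, fake_extract)  # cache now exists
    print(first_source, second_source)
```

Because the cache is keyed only by filename, a stale or partial CSV is reused until it is deleted by hand.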


In [2]:
def get_cached_variable_path(output_dir, model_scenario, lat_str, lon_str, variable):
    """
    Generate the path for a cached variable CSV file.
    
    Parameters:
    -----------
    output_dir : str
        Output directory
    model_scenario : str
        Model and scenario string (e.g., "ACCESS_CM2_SSP245")
    lat_str : str
        Latitude formatted as string (e.g., "-31.75")
    lon_str : str
        Longitude formatted as string (e.g., "117.60")
    variable : str
        Variable name (e.g., "tasmax", "hurs", "sfcWind")
    
    Returns:
    --------
    str
        Path to cached CSV file
    """
    cache_filename = f"{model_scenario}_{lat_str}_{lon_str}_{variable}.csv"
    cache_path = os.path.join(output_dir, cache_filename)
    return cache_path


def load_cached_variable(cache_path):
    """
    Load cached variable data from CSV file.
    
    Parameters:
    -----------
    cache_path : str
        Path to cached CSV file
    
    Returns:
    --------
    pd.DataFrame or None
        DataFrame with date and value columns if file exists and is valid, None otherwise
    """
    if not os.path.exists(cache_path):
        return None
    
    try:
        df = pd.read_csv(cache_path)
        if 'date' not in df.columns or 'value' not in df.columns:
            print(f"  [WARNING] Cached file missing required columns, will re-extract")
            return None
        
        df['date'] = pd.to_datetime(df['date'])
        
        # Basic validation - check if file has data
        if len(df) == 0:
            print(f"  [WARNING] Cached file is empty, will re-extract")
            return None
        
        return df
    
    except Exception as e:
        print(f"  [WARNING] Error loading cached file: {e}, will re-extract")
        return None


def save_cached_variable(df, cache_path):
    """
    Save extracted variable data to CSV cache file.
    
    Parameters:
    -----------
    df : pd.DataFrame
        DataFrame with date and value columns
    cache_path : str
        Path to save cached CSV file
    """
    try:
        df[['date', 'value']].to_csv(
            cache_path,
            index=False,
            encoding='utf-8',
            float_format='%.6f'
        )
        print(f"  [INFO] Saved to cache: {os.path.basename(cache_path)}")
    except Exception as e:
        print(f"  [WARNING] Failed to save cache: {e}")



## Section 2: NetCDF Data Extraction Function

In [3]:
def extract_daily_data_from_netcdf(netcdf_dir, variable, target_lat, target_lon, tolerance=0.01):
    """
    Extract daily time series data for a specific coordinate from NetCDF files.
    
    Parameters:
    -----------
    netcdf_dir : str
        Directory containing NetCDF files for the variable
    variable : str
        Variable name (tasmax, tasmin, hurs, rsds, sfcWind, etc.)
    target_lat : float
        Target latitude
    target_lon : float
        Target longitude
    tolerance : float
        Coordinate matching tolerance in degrees
    
    Returns:
    --------
    pd.DataFrame
        DataFrame with columns: date, value
    """
    start_time = time.time()
    
    # Pattern 1: Files directly in the directory with the variable name in the filename
    nc_files = sorted(glob.glob(os.path.join(netcdf_dir, f"*{variable}*.nc")))
    
    # Pattern 2: Files in subdirectories named {variable}_*
    if len(nc_files) == 0:
        var_subdirs = glob.glob(os.path.join(netcdf_dir, f"{variable}_*"))
        for var_subdir in var_subdirs:
            if os.path.isdir(var_subdir):
                found_files = sorted(glob.glob(os.path.join(var_subdir, "*.nc")))
                if found_files:
                    nc_files.extend(found_files)
                    print(f"  Found files in subdirectory: {os.path.basename(var_subdir)}/")
                    break
    
    # For rsds, also check rad_* folders
    if len(nc_files) == 0 and variable == 'rsds':
        rad_subdirs = glob.glob(os.path.join(netcdf_dir, "rad_*"))
        for rad_subdir in rad_subdirs:
            if os.path.isdir(rad_subdir):
                found_files = sorted(glob.glob(os.path.join(rad_subdir, "*rsds*.nc")))
                if found_files:
                    nc_files.extend(found_files)
                    print(f"  Found files in subdirectory: {os.path.basename(rad_subdir)}/")
                    break
    
    # For sfcWind, also check wind_* folders
    if len(nc_files) == 0 and variable == 'sfcWind':
        wind_subdirs = glob.glob(os.path.join(netcdf_dir, "wind_*"))
        for wind_subdir in wind_subdirs:
            if os.path.isdir(wind_subdir):
                found_files = sorted(glob.glob(os.path.join(wind_subdir, "*sfcWind*.nc")))
                # If no sfcWind-specific files, try all .nc files in wind folder
                if not found_files:
                    found_files = sorted(glob.glob(os.path.join(wind_subdir, "*.nc")))
                if found_files:
                    nc_files.extend(found_files)
                    print(f"  Found files in subdirectory: {os.path.basename(wind_subdir)}/")
                    break
    
    if len(nc_files) == 0:
        print(f"  ERROR: No NetCDF files found in {netcdf_dir}")
        return None
    
    print(f"  Found {len(nc_files)} NetCDF files")
    
    # Cache coordinate information from first file
    lat_name = None
    lon_name = None
    time_name = None
    lat_idx = None
    lon_idx = None
    var_name = None
    
    # List to store daily data
    all_data = []
    
    # Process first file to get coordinate structure
    if len(nc_files) > 0:
        try:
            ds_sample = xr.open_dataset(nc_files[0], decode_times=False)
            
            # Get variable name
            for v in ds_sample.data_vars:
                if variable.lower() in v.lower() or v.lower() in variable.lower():
                    var_name = v
                    break
            
            if var_name is None:
                possible_names = [variable, variable.upper(), f'{variable}_day']
                if variable == 'rsds':
                    possible_names.extend(['rad', 'RAD', 'rad_day'])
                elif variable == 'sfcWind':
                    possible_names.extend(['wind', 'WIND', 'wind_day', 'sfcwind', 'SFCWIND'])
                for name in possible_names:
                    if name in ds_sample.data_vars:
                        var_name = name
                        break
            
            # Get coordinate names
            for coord in ds_sample.coords:
                coord_lower = coord.lower()
                if 'lat' in coord_lower:
                    lat_name = coord
                elif 'lon' in coord_lower:
                    lon_name = coord
                elif 'time' in coord_lower:
                    time_name = coord
            
            if lat_name and lon_name:
                # Find nearest grid point
                lat_idx = np.abs(ds_sample[lat_name].values - target_lat).argmin()
                lon_idx = np.abs(ds_sample[lon_name].values - target_lon).argmin()
                
                actual_lat = float(ds_sample[lat_name].values[lat_idx])
                actual_lon = float(ds_sample[lon_name].values[lon_idx])
                
                # Check if within tolerance
                if abs(actual_lat - target_lat) > tolerance or abs(actual_lon - target_lon) > tolerance:
                    print(f"  Warning: Nearest point ({actual_lat:.4f}, {actual_lon:.4f}) is outside tolerance")
                else:
                    print(f"  Using grid point: ({actual_lat:.4f}, {actual_lon:.4f})")
            
            ds_sample.close()
            
        except Exception as e:
            print(f"  Warning: Could not read sample file: {e}")
    
    if var_name is None or lat_idx is None or lon_idx is None:
        print(f"  ERROR: Could not determine coordinate structure")
        return None
    
    # Process all files with progress bar
    print(f"  Processing files...")
    for nc_file in tqdm(nc_files, desc=f"  {variable}", unit="file"):
        try:
            ds = xr.open_dataset(nc_file, decode_times=False)
            
            # Extract data using cached indices
            data = ds[var_name].isel({lat_name: lat_idx, lon_name: lon_idx})
            
            # Convert to numpy array
            values = data.values
            if values.ndim > 1:
                values = values.flatten()
            
            # Get time values - extract year from filename
            year = None
            filename = os.path.basename(nc_file)
            year_match = re.search(r'(\d{4})', filename)
            if year_match:
                year = int(year_match.group(1))
                # Create daily dates for the year (handles leap years automatically)
                time_values = pd.date_range(start=f'{year}-01-01', end=f'{year}-12-31', freq='D')
            else:
                time_values = pd.date_range(start='2035-01-01', periods=len(values), freq='D')
            
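            # Ensure the number of dates matches the number of values. Rebuilding the
            # range from the same start date truncates it when the file holds fewer
            # values than calendar days and extends it when the file holds more.
            if len(time_values) != len(values):
                time_values = pd.date_range(start=time_values[0], periods=len(values), freq='D')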
            
            # Create DataFrame for this file
            if len(values) > 0:
                df_file = pd.DataFrame({
                    'date': time_values[:len(values)],
                    'value': values
                })
                all_data.append(df_file)
            
            ds.close()
            
        except Exception as e:
            tqdm.write(f"    Error processing {os.path.basename(nc_file)}: {e}")
            continue
    
    if len(all_data) == 0:
        print(f"  ERROR: No data extracted")
        return None
    
    # Combine all data
    print(f"  Combining data from {len(all_data)} files...")
    combined_df = pd.concat(all_data, ignore_index=True)
    
    # Sort by date
    combined_df = combined_df.sort_values('date').reset_index(drop=True)
    
    # Remove duplicate dates (keep first occurrence)
    combined_df = combined_df.drop_duplicates(subset='date', keep='first')
    
    # Check for missing years and fill them
    if len(combined_df) > 0:
        min_date = combined_df['date'].min()
        max_date = combined_df['date'].max()
        
        # Create complete date range from min to max
        expected_dates = pd.date_range(start=min_date, end=max_date, freq='D')
        expected_dates_df = pd.DataFrame({'date': expected_dates})
        
        # Find missing dates
        combined_df['date'] = pd.to_datetime(combined_df['date'])
        merged_df = expected_dates_df.merge(combined_df, on='date', how='left')
        missing_dates = merged_df[merged_df['value'].isna()].copy()  # copy so a 'year' column can be added safely
        
        if len(missing_dates) > 0:
            print(f"  [WARNING] Found {len(missing_dates)} missing dates - attempting to fill...")
            
            # Identify missing years
            missing_dates['year'] = missing_dates['date'].dt.year
            missing_years = sorted(missing_dates['year'].unique())
            
            for missing_year in missing_years:
                year_dates = missing_dates[missing_dates['year'] == missing_year]
                print(f"  [INFO] Missing year {missing_year}: {len(year_dates)} days")
                
                # Try to find a similar year to use as reference (use previous year if available, else next year)
                ref_year = None
                ref_data = None
                
                # Try previous year
                prev_year = missing_year - 1
                prev_year_data = combined_df[combined_df['date'].dt.year == prev_year]
                if len(prev_year_data) >= 365:
                    ref_year = prev_year
                    ref_data = prev_year_data.copy()
                    print(f"    Using {prev_year} as reference year")
                
                # If previous year not available, try next year
                if ref_data is None:
                    next_year = missing_year + 1
                    next_year_data = combined_df[combined_df['date'].dt.year == next_year]
                    if len(next_year_data) >= 365:
                        ref_year = next_year
                        ref_data = next_year_data.copy()
                        print(f"    Using {next_year} as reference year")
                
                # Fill missing year data
                if ref_data is not None:
                    # Create data for missing year by adjusting day-of-year from reference year
                    filled_data = []
                    for missing_date in year_dates['date']:
                        # Find the corresponding calendar day in the reference year.
                        # Feb 29 does not exist in a non-leap year, so fall back to Feb 28
                        # before building the Timestamp (building Feb 29 there raises ValueError).
                        ref_day = missing_date.day
                        ref_is_leap = pd.Timestamp(year=ref_year, month=1, day=1).is_leap_year
                        if missing_date.month == 2 and missing_date.day == 29 and not ref_is_leap:
                            ref_day = 28
                        ref_date = pd.Timestamp(year=ref_year, month=missing_date.month, day=ref_day)
                        
                        # Find matching date in reference data
                        ref_match = ref_data[ref_data['date'].dt.month == ref_date.month]
                        ref_match = ref_match[ref_match['date'].dt.day == ref_date.day]
                        
                        if len(ref_match) > 0:
                            filled_value = ref_match['value'].iloc[0]
                            filled_data.append({
                                'date': missing_date,
                                'value': filled_value
                            })
                    
                    if len(filled_data) > 0:
                        filled_df = pd.DataFrame(filled_data)
                        combined_df = pd.concat([combined_df, filled_df], ignore_index=True)
                        print(f"    Filled {len(filled_data)} days for year {missing_year}")
                    else:
                        print(f"    [WARNING] Could not fill data for year {missing_year}")
                else:
                    print(f"    [ERROR] No suitable reference year found for {missing_year}")
            
            # Re-sort after filling
            combined_df = combined_df.sort_values('date').reset_index(drop=True)
            combined_df = combined_df.drop_duplicates(subset='date', keep='first')
    
    # Check for constant values by year (indicates missing/corrupted data in source files)
    combined_df['year'] = pd.to_datetime(combined_df['date']).dt.year
    for year in combined_df['year'].unique():
        year_data = combined_df[combined_df['year'] == year]['value']
        non_null_data = year_data.dropna()
        if len(non_null_data) > 10:  # Only check if we have enough data points
            if non_null_data.nunique() == 1:
                constant_value = non_null_data.iloc[0]
                num_days = len(non_null_data)
                print(f"  [WARNING] Year {year} has constant {variable} value: {constant_value:.2f} for {num_days} days (likely missing/corrupted source data)")
    combined_df = combined_df.drop(columns=['year'])
    
    elapsed_time = time.time() - start_time
    print(f"  ✓ Extracted {len(combined_df):,} daily records in {elapsed_time:.1f} seconds")
    print(f"  Date range: {combined_df['date'].min()} to {combined_df['date'].max()}")
    
    return combined_df
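
The gap-filling step above copies values from the same calendar day in a neighbouring year. One detail worth isolating is Feb 29: that date cannot be constructed in a non-leap year, so the day has to be adjusted before the date is built. A minimal standard-library sketch follows; the helper name `reference_date` is hypothetical and not part of the notebook.

```python
import calendar
from datetime import date

def reference_date(missing, ref_year):
    """Map a missing date to the same calendar day in ref_year."""
    day = missing.day
    # Feb 29 has no counterpart in a non-leap reference year; fall back to Feb 28
    if missing.month == 2 and missing.day == 29 and not calendar.isleap(ref_year):
        day = 28
    return date(ref_year, missing.month, day)

# A missing leap day mapped onto a non-leap and a leap reference year
print(reference_date(date(2036, 2, 29), 2035))
print(reference_date(date(2036, 2, 29), 2040))
```

Note that with this mapping a missing Feb 29 and a missing Feb 28 both receive the Feb 28 value when the reference year is not a leap year.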

In [4]:
# CONFIGURATION - CHANGE VALUES BELOW AS NEEDED
# ============================================================================
# All other settings will automatically adjust based on these values

# Output Directory - OPTIONAL: Set to None to auto-generate, or specify a custom path
OUTPUT_DIR_MANUAL = r"C:\Users\ibian\Desktop\ClimAdapt\Anameka\Anameka_South_16_226042"  # Set to None for auto-generation

# Model (usually doesn't need to change)
MODEL = "ACCESS CM2"  # e.g., "ACCESS CM2"

# Scenario - CHANGE THIS
SCENARIO = "SSP585"   # Options: "SSP245", "SSP585", etc.

# Coordinates - CHANGE THESE
LATITUDE = -31.75   # Target latitude in decimal degrees (-90 to 90)
LONGITUDE = 117.5999984741211  # Target longitude in decimal degrees (-180 to 180)

# ============================================================================
# AUTOMATIC SETTINGS (derived from above - no need to change)
# ============================================================================

# Base directories
CMIP6_BASE_DIR = r"C:\Users\ibian\Desktop\ClimAdapt\CMIP6"
base_output_dir = r"C:\Users\ibian\Desktop\ClimAdapt\Anameka"
COORD_TOLERANCE = 0.01  # degrees (approximately 1.1 km)

# Auto-generate output directory and filename components based on scenario and coordinates
# For directory names, use underscore format (filesystem-friendly)
lat_str_dir = f"{LATITUDE:.2f}".replace('.', '_').replace('-', 'neg')
lon_str_dir = f"{LONGITUDE:.2f}".replace('.', '_').replace('-', 'neg')
# For output filenames, use decimal format (keep dots and minus signs)
lat_str = f"{LATITUDE:.2f}"
lon_str = f"{LONGITUDE:.2f}"
model_scenario = f"{MODEL.replace(' ', '_')}_{SCENARIO}"
model_scenario_dir = f"{model_scenario}_{lat_str_dir}_{lon_str_dir}"

# Use manual output directory if specified, otherwise auto-generate
if OUTPUT_DIR_MANUAL is not None and OUTPUT_DIR_MANUAL != "":
    OUTPUT_DIR = OUTPUT_DIR_MANUAL
    print(f"  [INFO] Using manual output directory: {OUTPUT_DIR}")
else:
    OUTPUT_DIR = os.path.join(base_output_dir, model_scenario_dir)
    print(f"  [INFO] Auto-generated output directory: {OUTPUT_DIR}")

# Variables required for ET₀ calculation (FAO-56 Penman-Monteith)
REQUIRED_VARIABLES = ['tasmax', 'tasmin', 'hurs', 'rsds', 'sfcWind']

# Ensure output directory exists
os.makedirs(OUTPUT_DIR, exist_ok=True)

print("="*70)
print("CONFIGURATION")
print("="*70)
print(f"  Model: {MODEL}")
print(f"  Scenario: {SCENARIO}")
print(f"  Coordinates: ({LATITUDE:.6f}, {LONGITUDE:.6f})")
print(f"  CMIP6 Base Directory: {CMIP6_BASE_DIR}")
print(f"  Output Directory: {OUTPUT_DIR}")
print(f"  Required Variables: {', '.join(REQUIRED_VARIABLES)}")
print("="*70)
print("\nAll paths and filenames will automatically use the above settings.\n")

  [INFO] Using manual output directory: C:\Users\ibian\Desktop\ClimAdapt\Anameka\Anameka_South_16_226042
CONFIGURATION
  Model: ACCESS CM2
  Scenario: SSP585
  Coordinates: (-31.750000, 117.599998)
  CMIP6 Base Directory: C:\Users\ibian\Desktop\ClimAdapt\CMIP6
  Output Directory: C:\Users\ibian\Desktop\ClimAdapt\Anameka\Anameka_South_16_226042
  Required Variables: tasmax, tasmin, hurs, rsds, sfcWind

All paths and filenames will automatically use the above settings.



## Section 3: FAO-56 Penman-Monteith Calculation Functions

In [5]:
def calculate_saturation_vapor_pressure(temperature_c):
    """
    Calculate saturation vapor pressure (kPa) at temperature T (°C).
    Using SILO formula: e_s(T) = 0.611 × exp(17.27 × T / (T + 237.3))
    """
    return 0.611 * np.exp(17.27 * temperature_c / (temperature_c + 237.3))


def calculate_actual_vapor_pressure(hurs_df, tasmax_df, tasmin_df):
    """
    Calculate actual vapor pressure (kPa) from mean relative humidity.
    Using mean humidity: ea = es(Tmean) × hurs / 100
    """
    # Normalize dates to date-only (remove time component) to handle different time stamps
    # Make copies to avoid modifying original dataframes
    tasmax_df_norm = tasmax_df.copy()
    tasmin_df_norm = tasmin_df.copy()
    hurs_df_norm = hurs_df.copy()
    
    # Normalize dates to date-only
    tasmax_df_norm['date'] = pd.to_datetime(tasmax_df_norm['date']).dt.date
    tasmin_df_norm['date'] = pd.to_datetime(tasmin_df_norm['date']).dt.date
    hurs_df_norm['date'] = pd.to_datetime(hurs_df_norm['date']).dt.date
    
    # Merge all dataframes
    merged = tasmax_df_norm.merge(tasmin_df_norm, on='date', suffixes=('_max', '_min'))
    merged = merged.rename(columns={'value_max': 'tasmax', 'value_min': 'tasmin'})
    
    # Calculate mean temperature
    merged['tmean'] = (merged['tasmax'] + merged['tasmin']) / 2.0
    
    # Merge with hurs (mean relative humidity)
    merged = merged.merge(hurs_df_norm[['date', 'value']], on='date')
    merged = merged.rename(columns={'value': 'hurs'})
    
    # Calculate saturation vapor pressure at mean temperature
    merged['es'] = calculate_saturation_vapor_pressure(merged['tmean'])
    
    # Calculate actual vapor pressure using mean relative humidity
    merged['ea'] = merged['es'] * merged['hurs'] / 100.0
    
    # Return DataFrame with date and ea columns
    ea_df = merged[['date', 'ea']].copy()
    ea_df = ea_df.rename(columns={'ea': 'value'})
    
    # Convert date back to datetime for proper CSV export (from date object to datetime)
    ea_df['date'] = pd.to_datetime(ea_df['date'])
    
    return ea_df


def convert_wind_10m_to_2m(wind_10m):
    """
    Convert wind speed from 10m height to 2m height.
    Using logarithmic wind profile: u2 = u10 × ln(2/0.0002) / ln(10/0.0002)
    """
    z0 = 0.0002  # Roughness length (m) - typical for open water/pan
    u2 = wind_10m * np.log(2.0 / z0) / np.log(10.0 / z0)
    return u2


def calculate_eto_fao56_penman_monteith(tasmax_df, tasmin_df, hurs_df, 
                                        rsds_df, sfcWind_df, latitude):
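    """
    Calculate reference evapotranspiration (ET₀) using FAO-56 Penman-Monteith method.
    
    Formula: ET₀ = (0.408 × Δ × (Rn - G) + γ × (900/(T+273)) × u₂ × (es - ea)) / (Δ + γ × (1 + 0.34 × u₂))
    
    Simplifications: Rn is approximated as 0.77 × Rs (net longwave radiation is
    neglected), G = 0, and γ is fixed at its sea-level value. The `latitude`
    argument is accepted but not used by this simplified radiation term.
    """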
    # Normalize dates to date-only (remove time component) to handle different time stamps
    # Make copies to avoid modifying original dataframes
    tasmax_df_norm = tasmax_df.copy()
    tasmin_df_norm = tasmin_df.copy()
    rsds_df_norm = rsds_df.copy()
    sfcWind_df_norm = sfcWind_df.copy()
    
    # Normalize dates to date-only
    tasmax_df_norm['date'] = pd.to_datetime(tasmax_df_norm['date']).dt.date
    tasmin_df_norm['date'] = pd.to_datetime(tasmin_df_norm['date']).dt.date
    rsds_df_norm['date'] = pd.to_datetime(rsds_df_norm['date']).dt.date
    sfcWind_df_norm['date'] = pd.to_datetime(sfcWind_df_norm['date']).dt.date
    
    # Merge all dataframes
    merged = tasmax_df_norm.merge(tasmin_df_norm, on='date', suffixes=('_max', '_min'))
    merged = merged.rename(columns={'value_max': 'tasmax', 'value_min': 'tasmin'})
    
    # Calculate mean temperature
    merged['tmean'] = (merged['tasmax'] + merged['tasmin']) / 2.0
    
    # Calculate actual vapor pressure (this function also normalizes dates internally)
    ea_df = calculate_actual_vapor_pressure(hurs_df, tasmax_df, tasmin_df)
    # Normalize ea_df date as well
    ea_df_norm = ea_df.copy()
    ea_df_norm['date'] = pd.to_datetime(ea_df_norm['date']).dt.date
    merged = merged.merge(ea_df_norm, on='date')
    merged = merged.rename(columns={'value': 'ea'})
    
    # Calculate saturation vapor pressure at mean temperature
    merged['es'] = calculate_saturation_vapor_pressure(merged['tmean'])
    
    # Calculate slope of vapor pressure curve (Δ) in kPa/°C
    merged['delta'] = 4098 * merged['es'] / ((merged['tmean'] + 237.3) ** 2)
    
    # Psychrometric constant (γ) in kPa/°C (at sea level, 101.3 kPa)
    gamma = 0.665e-3 * 101.3  # ≈ 0.0674 kPa/°C
    
    # Convert rsds from W/m² to MJ/m²/day
    merged = merged.merge(rsds_df_norm[['date', 'value']], on='date')
    merged = merged.rename(columns={'value': 'rsds'})
    merged['rsds_mj'] = merged['rsds'] * 0.0864  # W/m² to MJ/m²/day
    
    # Calculate net radiation (Rn) - simplified: Rn ≈ 0.77 × Rs (assuming albedo = 0.23)
    # For daily calculations, soil heat flux (G) is assumed to be 0
    merged['rn'] = 0.77 * merged['rsds_mj']  # MJ/m²/day
    merged['g'] = 0.0  # Soil heat flux assumed 0 for daily calculations
    
    # Convert wind speed from 10m to 2m
    merged = merged.merge(sfcWind_df_norm[['date', 'value']], on='date')
    merged = merged.rename(columns={'value': 'wind_10m'})
    merged['wind_2m'] = convert_wind_10m_to_2m(merged['wind_10m'])
    
    # Calculate vapor pressure deficit (es - ea) in kPa
    merged['vpd'] = merged['es'] - merged['ea']
    
    # FAO-56 Penman-Monteith formula
    numerator = (0.408 * merged['delta'] * (merged['rn'] - merged['g']) + 
                 gamma * (900.0 / (merged['tmean'] + 273.0)) * merged['wind_2m'] * merged['vpd'])
    denominator = merged['delta'] + gamma * (1.0 + 0.34 * merged['wind_2m'])
    
    merged['eto'] = numerator / denominator
    
    # Ensure non-negative values
    merged['eto'] = merged['eto'].clip(lower=0.0)
    
    # Return DataFrame with date and eto columns
    eto_df = merged[['date', 'eto']].copy()
    eto_df = eto_df.rename(columns={'eto': 'value'})
    
    # Convert date back to datetime for proper CSV export (from date object to datetime)
    eto_df['date'] = pd.to_datetime(eto_df['date'])
    
    return eto_df
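
The wind adjustment in `convert_wind_10m_to_2m` uses a logarithmic profile with a roughness length of 0.0002 m. FAO-56 itself gives a different adjustment for a grass reference surface (Eq. 47), which yields a smaller factor at 10 m. The sketch below compares the two so the effect on u₂ can be judged; the function names are hypothetical, and which variant suits a given calibration target is left as an open choice rather than a recommendation.

```python
import math

def wind_2m_log_profile(u10, z0=0.0002):
    """Log-profile adjustment with roughness length z0, as used in this notebook."""
    return u10 * math.log(2.0 / z0) / math.log(10.0 / z0)

def wind_2m_fao56(uz, z=10.0):
    """FAO-56 Eq. 47: u2 = uz * 4.87 / ln(67.8 * z - 5.42), z = measurement height (m)."""
    return uz * 4.87 / math.log(67.8 * z - 5.42)

u10 = 3.0  # hypothetical 10 m wind speed (m/s)
print(f"log profile (z0 = 0.0002 m): {wind_2m_log_profile(u10):.2f} m/s")
print(f"FAO-56 Eq. 47:               {wind_2m_fao56(u10):.2f} m/s")
```

The log-profile version keeps roughly 85% of the 10 m wind speed, while Eq. 47 keeps roughly 75%, so the aerodynamic term of ET₀ is somewhat larger with the notebook's choice.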

## Section 5: Main Processing - Extract CMIP6 Data and Calculate ET₀

In [6]:
# Construct data directory path
data_dir = os.path.join(CMIP6_BASE_DIR, f"{MODEL} {SCENARIO}")

if not os.path.exists(data_dir):
    raise ValueError(f"Data directory not found: {data_dir}")

print("="*70)
print(f"Processing Coordinate: ({LATITUDE:.6f}, {LONGITUDE:.6f})")
print(f"Model: {MODEL}, Scenario: {SCENARIO}")
print("="*70)
print(f"\nData directory: {data_dir}\n")

# Extract data for all required variables
extracted_data = {}

for variable in REQUIRED_VARIABLES:
    print(f"\n{'='*70}")
    print(f"Processing variable: {variable}")
    print(f"{'='*70}")
    
    # Check for cached data first
    cache_path = get_cached_variable_path(OUTPUT_DIR, model_scenario, lat_str, lon_str, variable)
    print(f"  [DEBUG] Cache path: {cache_path}")
    df = load_cached_variable(cache_path)
    
    if df is not None:
        # Use cached data
        extracted_data[variable] = df
        print(f"  [OK] Loaded from cache: {len(df):,} records for {variable}")
        print(f"  [INFO] Date range: {df['date'].min()} to {df['date'].max()}")
    else:
        # Extract from NetCDF files
        df = extract_daily_data_from_netcdf(
            data_dir, 
            variable, 
            LATITUDE, 
            LONGITUDE, 
            tolerance=COORD_TOLERANCE
        )
        
        if df is not None and len(df) > 0:
            extracted_data[variable] = df
            print(f"  [OK] Extracted {len(df):,} records for {variable}")
            # Save to cache for future runs
            save_cached_variable(df, cache_path)
        else:
            raise ValueError(f"Failed to extract data for required variable: {variable}")

# Check if all required variables are available
missing_vars = [v for v in REQUIRED_VARIABLES if v not in extracted_data]

if missing_vars:
    raise ValueError(f"Missing required variables: {missing_vars}")

print(f"\n{'='*70}")
print("Calculating ET₀ using FAO-56 Penman-Monteith method...")
print(f"{'='*70}\n")

# Calculate ET₀ using FAO-56 Penman-Monteith
eto_df = calculate_eto_fao56_penman_monteith(
    extracted_data['tasmax'],
    extracted_data['tasmin'],
    extracted_data['hurs'],
    extracted_data['rsds'],
    extracted_data['sfcWind'],
    LATITUDE
)

print(f"  [OK] Calculated ET₀ for {len(eto_df):,} days")
print(f"  Date range: {eto_df['date'].min()} to {eto_df['date'].max()}")
print(f"  ET₀ range: {eto_df['value'].min():.2f} to {eto_df['value'].max():.2f} mm/day")
print(f"  ET₀ mean: {eto_df['value'].mean():.2f} mm/day")

# Save ET₀ to CSV (using auto-generated filename components from configuration)
# Format lat_str to use 'neg' prefix instead of '-' for filenames
lat_str_filename = lat_str.replace('-', 'neg')
output_filename = f"{model_scenario}_{lat_str_filename}_{lon_str}_eto.csv"
output_path = os.path.join(OUTPUT_DIR, output_filename)

eto_df.to_csv(output_path, index=False, encoding='utf-8', float_format='%.2f')
print(f"\n  [OK] Saved ET₀ data to: {output_filename}")

print(f"\n{'='*70}")
print("[SUCCESS] ET₀ CALCULATION COMPLETED!")
print(f"{'='*70}")

Processing Coordinate: (-31.750000, 117.599998)
Model: ACCESS CM2, Scenario: SSP585

Data directory: C:\Users\ibian\Desktop\ClimAdapt\CMIP6\ACCESS CM2 SSP585


Processing variable: tasmax
  [DEBUG] Cache path: C:\Users\ibian\Desktop\ClimAdapt\Anameka\Anameka_South_16_226042\ACCESS_CM2_SSP585_-31.75_117.60_tasmax.csv
  [OK] Loaded from cache: 10,957 records for tasmax
  [INFO] Date range: 2035-01-01 00:00:00 to 2064-12-30 00:00:00

Processing variable: tasmin
  [DEBUG] Cache path: C:\Users\ibian\Desktop\ClimAdapt\Anameka\Anameka_South_16_226042\ACCESS_CM2_SSP585_-31.75_117.60_tasmin.csv
  [OK] Loaded from cache: 10,957 records for tasmin
  [INFO] Date range: 2035-01-01 00:00:00 to 2064-12-30 00:00:00

Processing variable: hurs
  [DEBUG] Cache path: C:\Users\ibian\Desktop\ClimAdapt\Anameka\Anameka_South_16_226042\ACCESS_CM2_SSP585_-31.75_117.60_hurs.csv
  [OK] Loaded from cache: 10,957 records for hurs
  [INFO] Date range: 2035-01-01 00:00:00 to 2064-12-30 00:00:00

Processing variable: 