# Grid Point CMIP6 Retrieval and MET Conversion Notebook (v2)

## Overview

This notebook extracts CMIP6 climate data for a single grid point from NetCDF (NC) files and converts it to APSIM MET format. It reads NC files from `C:\Users\ibian\Desktop\ClimAdapt\CMIP6\{Model} {Scenario}` folders, extracts the daily time series at the specified coordinates, caches each variable as a CSV file, and generates a MET file for APSIM simulations.

## File Structure

- **Input**: NC files in `C:\Users\ibian\Desktop\ClimAdapt\CMIP6\{Model} {Scenario}\` (e.g., `ACCESS CM2 SSP245`)
- **User Input**: Model, scenario, and latitude/longitude coordinates (decimal degrees)
- **Output**: MET files, CSV cache files, and combined CSV files for the specified coordinate

## Variables Processed

The notebook processes 4 climate variables:

- **tasmax**: Daily maximum temperature (°C) → **maxt** in MET format
- **tasmin**: Daily minimum temperature (°C) → **mint** in MET format
- **pr**: Daily precipitation (mm) → **rain** in MET format
- **rsds**: Daily surface downwelling shortwave radiation (W/m²) → **radn** (MJ/m²) in MET format

**Note**: The vp (vapor pressure) and evap (evaporation) fields are left blank in the MET file. The code field is hardcoded to "222222".

## MET Format Specifications

The MET format is used by APSIM for weather data input. It includes:

- **Required fields**: year, day, maxt (from tasmax), mint (from tasmin), rain (from pr), radn (from rsds)
- **radn field**: Required, converted from rsds (W/m² to MJ/m²)
- **Blank fields**: evap (evaporation), vp (vapor pressure) - left blank
- **Code field**: Hardcoded to "222222" for all rows
- **Metadata**: latitude, longitude, tav (annual average temperature), amp (annual amplitude)
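
The `year` and `day` fields are derived from each record's date, with `day` being the day of the year (1 to 365, or 366 in leap years). A minimal sketch of that derivation, using a small made-up date range rather than real CMIP6 data:

```python
import pandas as pd

# Illustrative dates spanning a year boundary (not taken from any NC file)
dates = pd.date_range("2023-12-30", "2024-01-02", freq="D")

met = pd.DataFrame({"date": dates})
met["year"] = met["date"].dt.year
met["day"] = met["date"].dt.dayofyear  # day of year, restarts at 1 each January

print(met[["year", "day"]].to_string(index=False))

# In a leap year the last day of the year is day 366
last_day_2024 = pd.Timestamp("2024-12-31").dayofyear
print(last_day_2024)
```

Because the day counter comes straight from the calendar date, leap days need no special handling as long as the dates themselves are correct.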

## File Naming Conventions

- **Input NC files**: `{Model} {Scenario}\*{variable}*.nc` (e.g., `ACCESS CM2 SSP245\*tasmax*.nc`)
- **Output CSV cache files**: `{Model}_{Scenario}_{Lat}_{Lon}_{variable}.csv` (e.g., `ACCESS_CM2_SSP245_-31.75_117.60_tasmax.csv`)
- **Output MET files**: `{Model}_{Scenario}_{Lat}_{Lon}.met` (e.g., `ACCESS_CM2_SSP245_-31.75_117.60.met`)
- **Output CSV files**: `{Model}_{Scenario}_{Lat}_{Lon}.csv` (e.g., `ACCESS_CM2_SSP245_-31.75_117.60.csv`)
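
The naming patterns above can be sketched as a small helper. The function name `build_output_names` is hypothetical and not part of the notebook; it assumes spaces in the model name become underscores and that coordinates are formatted to two decimals, as in the examples.

```python
def build_output_names(model, scenario, latitude, longitude, variables):
    """Return the cache, MET, and combined CSV file names for one grid point."""
    # "ACCESS CM2" + "SSP245" -> "ACCESS_CM2_SSP245"
    model_scenario = f"{model.replace(' ', '_')}_{scenario}"
    stem = f"{model_scenario}_{latitude:.2f}_{longitude:.2f}"
    return {
        "cache": {var: f"{stem}_{var}.csv" for var in variables},
        "met": f"{stem}.met",
        "csv": f"{stem}.csv",
    }


names = build_output_names("ACCESS CM2", "SSP245", -31.75, 117.6, ["tasmax", "pr"])
print(names["met"])              # ACCESS_CM2_SSP245_-31.75_117.60.met
print(names["cache"]["tasmax"])  # ACCESS_CM2_SSP245_-31.75_117.60_tasmax.csv
```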

## Usage Instructions

1. Set the configuration parameters (model, scenario, date range, output directory) in Section 1
2. Provide your target latitude and longitude coordinates in Section 2
3. Run all cells sequentially to extract data from NetCDF files and create MET files
4. Output files (CSV cache and MET) will be saved with coordinate-based naming

## Coordinate Matching

The notebook selects the grid point nearest to your specified coordinates and checks it against a tolerance (default: 0.01 degrees, about 1.1 km in latitude). If the nearest point lies outside the tolerance, a warning is displayed and extraction continues with that point.
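
A minimal sketch of this matching step, assuming one-dimensional latitude and longitude arrays. The helper name `nearest_grid_point` and the small grid are illustrative only. Latitude and longitude offsets are compared separately against the tolerance; no great-circle distance is computed.

```python
import numpy as np


def nearest_grid_point(lat_values, lon_values, target_lat, target_lon, tolerance=0.01):
    """Return (lat_idx, lon_idx, actual_lat, actual_lon, within_tolerance)."""
    lat_values = np.asarray(lat_values, dtype=float)
    lon_values = np.asarray(lon_values, dtype=float)

    # Index of the smallest absolute offset along each axis
    lat_idx = int(np.abs(lat_values - target_lat).argmin())
    lon_idx = int(np.abs(lon_values - target_lon).argmin())

    actual_lat = float(lat_values[lat_idx])
    actual_lon = float(lon_values[lon_idx])
    within = (abs(actual_lat - target_lat) <= tolerance
              and abs(actual_lon - target_lon) <= tolerance)
    return lat_idx, lon_idx, actual_lat, actual_lon, within


# Made-up 0.05-degree grid around the example site
lats = np.array([-31.55, -31.50, -31.45, -31.40])
lons = np.array([117.45, 117.50, 117.55, 117.60])

on_grid = nearest_grid_point(lats, lons, -31.45, 117.55)
off_grid = nearest_grid_point(lats, lons, -31.47, 117.55)
print(on_grid)
print(off_grid)
```

In the second call the nearest latitude is still -31.45, but it is 0.02 degrees from the target, so the point falls outside the default tolerance and would trigger the warning.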

## Notes

- **CMIP6 data structure**: Data is organized by Model and Scenario in separate folders
- **Coordinate format**: Provide coordinates in decimal degrees (latitude: -90 to 90, longitude: -180 to 180)
- **Variable extraction**: 4 variables (tasmax, tasmin, pr, rsds) are extracted from NC files and saved as individual CSV cache files
- **MET conversion**: tasmax→maxt, tasmin→mint, pr→rain, rsds→radn
- **Unit conversions**: rsds (W/m²) is converted to radn (MJ/m²) by multiplying by 0.0864
- **Blank fields**: vp and evap are left blank in MET format
- **Code field**: Hardcoded to "222222" for all rows
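
The 0.0864 factor in the unit-conversion note follows from integrating a daily-mean flux over one day: 1 W/m² sustained for 86,400 seconds delivers 86,400 J/m², which is 0.0864 MJ/m². A minimal sketch, assuming rsds is a daily-mean flux; the function name `rsds_to_radn` is illustrative and not part of the notebook:

```python
SECONDS_PER_DAY = 24 * 60 * 60        # 86,400 s
JOULES_PER_MEGAJOULE = 1_000_000


def rsds_to_radn(rsds_w_m2):
    """Convert a daily-mean shortwave flux (W/m²) to a daily total (MJ/m²)."""
    return rsds_w_m2 * SECONDS_PER_DAY / JOULES_PER_MEGAJOULE


# A daily mean of 250 W/m² corresponds to 21.6 MJ/m² for that day
print(round(rsds_to_radn(250.0), 4))
```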

## Section 1: Imports and Configuration

In [1]:
import pandas as pd
import numpy as np
import xarray as xr
import glob
import os
import time
import re
from pathlib import Path
from datetime import datetime
from tqdm import tqdm

print("Libraries imported successfully")

Libraries imported successfully


In [2]:
# Configuration
CMIP6_BASE_DIR = r"C:\Users\ibian\Desktop\ClimAdapt\CMIP6"
OUTPUT_DIR = r"C:\Users\ibian\Desktop\ClimAdapt\Anameka\-31.45_117.55"  # Output directory for MET files and CSV cache
COORD_TOLERANCE = 0.01  # degrees (approximately 1.1 km)

# Model and Scenario - UPDATE THESE AS NEEDED
MODEL = "ACCESS CM2"  # e.g., "ACCESS CM2"
SCENARIO = "obs"  # e.g., "obs", "SSP245", or "SSP585"

# Date Range - UPDATE THESE AS NEEDED
START_YEAR = 1986  # Start year for MET file (None = use all available data)
END_YEAR = 2014  # End year for MET file (None = use all available data)

# Variables to process (4 variables)
# MET format mapping:
# - tasmax → maxt (maximum temperature)
# - tasmin → mint (minimum temperature)
# - pr → rain (precipitation)
# - rsds → radn (radiation, converted from W/m² to MJ/m²)
VARIABLES = ['tasmax', 'tasmin', 'pr', 'rsds']

# Ensure output directory exists
os.makedirs(OUTPUT_DIR, exist_ok=True)

print(f"Configuration loaded:")
print(f"  - CMIP6 Base Directory: {CMIP6_BASE_DIR}")
print(f"  - Output Directory: {OUTPUT_DIR}")
print(f"  - Model: {MODEL}")
print(f"  - Scenario: {SCENARIO}")
print(f"  - Date Range: {START_YEAR if START_YEAR is not None else 'all available'} to {END_YEAR if END_YEAR is not None else 'all available'}")
print(f"  - Coordinate Tolerance: {COORD_TOLERANCE} degrees")
print(f"  - Variables: {', '.join(VARIABLES)}")

Configuration loaded:
  - CMIP6 Base Directory: C:\Users\ibian\Desktop\ClimAdapt\CMIP6
  - Output Directory: C:\Users\ibian\Desktop\ClimAdapt\Anameka\-31.45_117.55
  - Model: ACCESS CM2
  - Scenario: obs
  - Date Range: 1986 to 2014
  - Coordinate Tolerance: 0.01 degrees
  - Variables: tasmax, tasmin, pr, rsds


## Section 2: User Coordinate Input

Provide the latitude and longitude coordinates for the grid point you want to process.

In [3]:
# USER INPUT: Provide your target coordinates here
LATITUDE = -31.45  # Target latitude in decimal degrees (-90 to 90)
LONGITUDE = 117.55  # Target longitude in decimal degrees (-180 to 180)

# Validate coordinates
if not (-90 <= LATITUDE <= 90):
    raise ValueError(f"Latitude must be between -90 and 90. Provided: {LATITUDE}")
if not (-180 <= LONGITUDE <= 180):
    raise ValueError(f"Longitude must be between -180 and 180. Provided: {LONGITUDE}")

print(f"Target Coordinate:")
print(f"  Latitude: {LATITUDE:.6f}°")
print(f"  Longitude: {LONGITUDE:.6f}°")
print(f"  Model: {MODEL}")
print(f"  Scenario: {SCENARIO}")
print(f"  Tolerance: {COORD_TOLERANCE} degrees")

Target Coordinate:
  Latitude: -31.450000°
  Longitude: 117.550000°
  Model: ACCESS CM2
  Scenario: obs
  Tolerance: 0.01 degrees


## Section 3: NetCDF Extraction Function

In [4]:
def get_cached_variable_path(output_dir, model_scenario, lat_str, lon_str, variable):
    """
    Generate the path for a cached variable CSV file.
    
    Parameters:
    -----------
    output_dir : str
        Output directory
    model_scenario : str
        Model and scenario string (e.g., "ACCESS_CM2_SSP245")
    lat_str : str
        Latitude formatted as string (e.g., "-31.75")
    lon_str : str
        Longitude formatted as string (e.g., "117.60")
    variable : str
        Variable name (e.g., "tasmax", "tasmin", "pr", "rsds")
    
    Returns:
    --------
    str
        Path to cached CSV file
    """
    cache_filename = f"{model_scenario}_{lat_str}_{lon_str}_{variable}.csv"
    cache_path = os.path.join(output_dir, cache_filename)
    return cache_path


def load_cached_variable(cache_path):
    """
    Load cached variable data from CSV file.
    
    Parameters:
    -----------
    cache_path : str
        Path to cached CSV file
    
    Returns:
    --------
    pd.DataFrame or None
        DataFrame with date and value columns if file exists and is valid, None otherwise
    """
    print(f"  [INFO] Checking cache: {os.path.basename(cache_path)}")
    
    if not os.path.exists(cache_path):
        print(f"  [INFO] Cache file not found, will extract from NetCDF")
        return None
    
    try:
        df = pd.read_csv(cache_path)
        if 'date' not in df.columns or 'value' not in df.columns:
            print(f"  [WARNING] Cached file missing required columns, will re-extract")
            return None
        
        df['date'] = pd.to_datetime(df['date'])
        
        # Basic validation - check if file has data
        if len(df) == 0:
            print(f"  [WARNING] Cached file is empty, will re-extract")
            return None
        
        print(f"  [INFO] Cache file found and valid - loaded {len(df):,} records")
        print(f"  [INFO] Date range: {df['date'].min()} to {df['date'].max()}")
        return df
    
    except Exception as e:
        print(f"  [WARNING] Error loading cached file: {e}, will re-extract")
        return None


def save_cached_variable(df, cache_path):
    """
    Save extracted variable data to CSV cache file.
    
    Parameters:
    -----------
    df : pd.DataFrame
        DataFrame with date and value columns
    cache_path : str
        Path to save cached CSV file
    """
    try:
        df[['date', 'value']].to_csv(
            cache_path,
            index=False,
            encoding='utf-8',
            float_format='%.6f'
        )
        print(f"  [INFO] Saved to cache: {os.path.basename(cache_path)}")
    except Exception as e:
        print(f"  [WARNING] Failed to save cache: {e}")

In [5]:
def extract_daily_data_from_netcdf(netcdf_dir, variable, target_lat, target_lon, tolerance=0.01):
    """
    Extract daily time series data for a specific coordinate from NetCDF files.
    
    Parameters:
    -----------
    netcdf_dir : str
        Directory containing NetCDF files for the variable
    variable : str
        Variable name (tasmax, tasmin, pr, rsds)
    target_lat : float
        Target latitude
    target_lon : float
        Target longitude
    tolerance : float
        Coordinate matching tolerance in degrees
    
    Returns:
    --------
    pd.DataFrame
        DataFrame with columns: date, value
    """
    start_time = time.time()
    
    # Find all NetCDF files in the directory
    # Pattern 1: Files directly in the directory matching *{variable}*.nc
    nc_files = sorted(glob.glob(os.path.join(netcdf_dir, f"*{variable}*.nc")))
    
    # Pattern 2: Files in subdirectories named {variable}_* (e.g., pr_ACCESS CM2 SSP245)
    if len(nc_files) == 0:
        var_subdirs = glob.glob(os.path.join(netcdf_dir, f"{variable}_*"))
        for var_subdir in var_subdirs:
            if os.path.isdir(var_subdir):
                found_files = sorted(glob.glob(os.path.join(var_subdir, "*.nc")))
                if found_files:
                    nc_files.extend(found_files)
                    print(f"  Found files in subdirectory: {os.path.basename(var_subdir)}/")
                    break
    
    # Pattern 2b: For rsds, also check for "rad_" folder (folder named "rad_" but files contain "rsds")
    # Example: rad_ACCESS CM2 SSP245/ contains files named *rsds*.nc
    if len(nc_files) == 0 and variable == 'rsds':
        rad_subdirs = glob.glob(os.path.join(netcdf_dir, "rad_*"))
        for rad_subdir in rad_subdirs:
            if os.path.isdir(rad_subdir):
                # Search for files containing "rsds" in the rad_ folder
                found_files = sorted(glob.glob(os.path.join(rad_subdir, "*rsds*.nc")))
                if found_files:
                    nc_files.extend(found_files)
                    print(f"  Found files in subdirectory: {os.path.basename(rad_subdir)}/")
                    break
                # Fallback: if no rsds files found, try all .nc files
                if len(nc_files) == 0:
                    found_files = sorted(glob.glob(os.path.join(rad_subdir, "*.nc")))
                    if found_files:
                        nc_files.extend(found_files)
                        print(f"  Found files in subdirectory: {os.path.basename(rad_subdir)}/")
                        break
    
    # Pattern 3: Check subdirectory named exactly after the variable
    if len(nc_files) == 0:
        var_dir = os.path.join(netcdf_dir, variable)
        if os.path.exists(var_dir) and os.path.isdir(var_dir):
            nc_files = sorted(glob.glob(os.path.join(var_dir, "*.nc")))
            if len(nc_files) > 0:
                print(f"  Found files in subdirectory: {variable}/")
    
    if len(nc_files) == 0:
        print(f"  ERROR: No NetCDF files found in {netcdf_dir}")
        print(f"  Searched patterns:")
        print(f"    - {netcdf_dir}/*{variable}*.nc")
        print(f"    - {netcdf_dir}/{variable}_*/*.nc")
        if variable == 'rsds':
            print(f"    - {netcdf_dir}/rad_*/*rsds*.nc")
            print(f"    - {netcdf_dir}/rad_*/*.nc")
        print(f"    - {netcdf_dir}/{variable}/*.nc")
        return None
    
    print(f"  Found {len(nc_files)} NetCDF files")
    
    # Cache coordinate information from first file
    lat_name = None
    lon_name = None
    time_name = None
    lat_idx = None
    lon_idx = None
    actual_lat = None
    actual_lon = None
    var_name = None
    
    # List to store daily data
    all_data = []
    
    # Process first file to get coordinate structure
    if len(nc_files) > 0:
        try:
            ds_sample = xr.open_dataset(nc_files[0], decode_times=False)
            
            # Get variable name
            for v in ds_sample.data_vars:
                if variable in v.lower() or v.lower() in variable.lower():
                    var_name = v
                    break
            
            # For rsds, also check for "rad" variable name (some datasets use "rad" instead of "rsds")
            if var_name is None and variable == 'rsds':
                for v in ds_sample.data_vars:
                    if 'rad' in v.lower() and 'rsds' not in v.lower():
                        var_name = v
                        break
            
            if var_name is None:
                possible_names = [variable, variable.upper(), f'{variable}_day']
                # For rsds, also try "rad" as variable name
                if variable == 'rsds':
                    possible_names.extend(['rad', 'RAD', 'rad_day'])
                for name in possible_names:
                    if name in ds_sample.data_vars:
                        var_name = name
                        break
            
            # Get coordinate names
            for coord in ds_sample.coords:
                coord_lower = coord.lower()
                if 'lat' in coord_lower:
                    lat_name = coord
                elif 'lon' in coord_lower:
                    lon_name = coord
                elif 'time' in coord_lower:
                    time_name = coord
            
            if lat_name and lon_name:
                # Check coordinate bounds of the NetCDF file
                lat_values = ds_sample[lat_name].values
                lon_values = ds_sample[lon_name].values
                
                lat_min = float(np.min(lat_values))
                lat_max = float(np.max(lat_values))
                lon_min = float(np.min(lon_values))
                lon_max = float(np.max(lon_values))
                
                # Check if target coordinate is within file bounds
                lat_in_bounds = lat_min <= target_lat <= lat_max
                lon_in_bounds = lon_min <= target_lon <= lon_max
                
                print(f"  Grid bounds: Lat [{lat_min:.4f}, {lat_max:.4f}], Lon [{lon_min:.4f}, {lon_max:.4f}]")
                print(f"  Target coordinate: ({target_lat:.4f}, {target_lon:.4f})")
                
                if not lat_in_bounds:
                    print(f"  [WARNING] Target latitude {target_lat:.4f} is OUTSIDE file bounds [{lat_min:.4f}, {lat_max:.4f}]")
                    print(f"  [WARNING] This may cause all values to be zero or incorrect!")
                if not lon_in_bounds:
                    print(f"  [WARNING] Target longitude {target_lon:.4f} is OUTSIDE file bounds [{lon_min:.4f}, {lon_max:.4f}]")
                    print(f"  [WARNING] This may cause all values to be zero or incorrect!")
                
                if lat_in_bounds and lon_in_bounds:
                    print(f"  [OK] Target coordinate is within file bounds")
                
                # Find nearest grid point (cache indices)
                lat_idx = np.abs(lat_values - target_lat).argmin()
                lon_idx = np.abs(lon_values - target_lon).argmin()
                
                actual_lat = float(lat_values[lat_idx])
                actual_lon = float(lon_values[lon_idx])
                
                # Check if within tolerance
                lat_diff = abs(actual_lat - target_lat)
                lon_diff = abs(actual_lon - target_lon)
                
                if lat_diff > tolerance or lon_diff > tolerance:
                    print(f"  [WARNING] Nearest point ({actual_lat:.4f}, {actual_lon:.4f}) is outside tolerance")
                    print(f"  [WARNING] Distance: {lat_diff:.4f}° lat, {lon_diff:.4f}° lon (tolerance: {tolerance:.4f}°)")
                else:
                    print(f"  [OK] Using grid point: ({actual_lat:.4f}, {actual_lon:.4f})")
                    print(f"  [OK] Distance from target: {lat_diff:.4f}° lat, {lon_diff:.4f}° lon")
            
            ds_sample.close()
            
        except Exception as e:
            print(f"  Warning: Could not read sample file: {e}")
    
    if var_name is None or lat_idx is None or lon_idx is None:
        print(f"  ERROR: Could not determine coordinate structure")
        return None
    
    # Process all files with progress bar
    print(f"  Processing files...")
    for nc_file in tqdm(nc_files, desc=f"  {variable}", unit="file"):
        try:
            # Open NetCDF file with minimal decoding for speed
            ds = xr.open_dataset(nc_file, decode_times=False)
            
            # Extract data using cached indices
            data = ds[var_name].isel({lat_name: lat_idx, lon_name: lon_idx})
            
            # Convert to numpy array (load into memory)
            values = data.values
            if values.ndim > 1:
                values = values.flatten()
            
            # Get time values - try multiple methods to ensure accuracy and handle leap years (366 days)
            time_values = None
            
            # Method 1: Try to use time coordinate from NetCDF file (most reliable)
            if time_name and time_name in ds.coords:
                try:
                    time_coord = ds[time_name]
                    if len(time_coord) == len(values):
                        # Try to decode times
                        try:
                            # Decode time coordinate
                            time_decoded = xr.decode_cf(ds[[time_name]])[time_name]
                            time_values = pd.to_datetime(time_decoded.values)
                            if len(time_values) == len(values):
                                pass  # Success - using decoded time coordinate
                        except Exception:
                            # If decoding fails, try manual conversion
                            if hasattr(time_coord, 'units') and 'days since' in time_coord.units.lower():
                                base_date_str = time_coord.units.split('since')[1].strip().split()[0]
                                base_date = pd.to_datetime(base_date_str)
                                time_values = base_date + pd.to_timedelta(time_coord.values, unit='D')
                                if len(time_values) != len(values):
                                    time_values = None
                except Exception as e:
                    pass  # Fall back to other methods
            
            # Method 2: Extract year from filename and create date range
            # This method automatically handles leap years (366 days) correctly
            if time_values is None:
                year = None
                filename = os.path.basename(nc_file)
                all_years = re.findall(r'\d{4}', filename)
                for year_str in all_years:
                    year_candidate = int(year_str)
                    if 1850 <= year_candidate <= 2100:  # include historical/observed years (e.g., 1986-2014)
                        year = year_candidate
                        break
                
                if year:
                    # Create dates based on ACTUAL data length
                    # pd.date_range with freq='D' automatically handles leap years
                    # For leap years (e.g., 2024, 2028), it will include Feb 29 (366 days)
                    # For non-leap years, it will have 365 days
                    time_values = pd.date_range(start=f'{year}-01-01', periods=len(values), freq='D')
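                else:
                    # Fallback: no usable year in the filename, so the dates are synthetic
                    # and arbitrarily start at 2035-01-01. Warn so this is not silent.
                    tqdm.write(f"    [WARNING] No year found in {filename}; assuming dates start at 2035-01-01")
                    time_values = pd.date_range(start='2035-01-01', periods=len(values), freq='D')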
            
            # Ensure we have the correct number of dates matching the data
            # This handles edge cases where time coordinate might not match exactly
            if len(time_values) != len(values):
                if len(time_values) > len(values):
                    time_values = time_values[:len(values)]
                else:
                    # Extend if needed (shouldn't happen normally, but handle it)
                    additional_days = len(values) - len(time_values)
                    last_date = time_values[-1]
                    additional_dates = pd.date_range(start=last_date + pd.Timedelta(days=1), periods=additional_days, freq='D')
                    time_values = pd.concat([pd.Series(time_values), pd.Series(additional_dates)]).values
            
            # Create DataFrame for this file
            # Use actual data length to ensure all days are included (365 or 366 for leap years)
            if len(values) > 0:
                df_file = pd.DataFrame({
                    'date': time_values[:len(values)],
                    'value': values
                })
                all_data.append(df_file)
            
            ds.close()
            
        except Exception as e:
            tqdm.write(f"    Error processing {os.path.basename(nc_file)}: {e}")
            continue
    
    if len(all_data) == 0:
        print(f"  ERROR: No data extracted")
        return None
    
    # Combine all data
    print(f"  Combining data from {len(all_data)} files...")
    combined_df = pd.concat(all_data, ignore_index=True)
    
    # Sort by date
    combined_df = combined_df.sort_values('date').reset_index(drop=True)
    
    # Remove duplicate dates (keep first occurrence)
    combined_df = combined_df.drop_duplicates(subset='date', keep='first')
    
    elapsed_time = time.time() - start_time
    print(f"  ✓ Extracted {len(combined_df):,} daily records in {elapsed_time:.1f} seconds")
    print(f"  Date range: {combined_df['date'].min()} to {combined_df['date'].max()}")
    
    return combined_df
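
When a file's time coordinate cannot be decoded, the function above falls back to reading a year from the file name and generating consecutive daily dates. A minimal sketch of that fallback in isolation: `dates_from_filename` is a hypothetical helper, the file names are made up, and the accepted year range is a parameter of this sketch rather than a value taken from the notebook.

```python
import re

import pandas as pd


def dates_from_filename(filename, n_days, min_year=1850, max_year=2100):
    """Build n_days daily dates from 1 January of the first plausible year in filename.

    Returns None when no four-digit token falls inside [min_year, max_year].
    """
    for token in re.findall(r"\d{4}", filename):
        year = int(token)
        if min_year <= year <= max_year:
            # freq='D' follows the calendar, so leap days are included automatically
            return pd.date_range(start=f"{year}-01-01", periods=n_days, freq="D")
    return None


leap = dates_from_filename("tasmax_day_example_2024.nc", 366)
print(leap[0].date(), leap[-1].date())

historical = dates_from_filename("pr_19860101-19861231.nc", 365)
print(historical[0].date(), historical[-1].date())
```

Note that this approach assumes each file starts on 1 January and uses a standard calendar; files on a 360-day or no-leap calendar would drift against these generated dates.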

## Section 4: MET Conversion Functions

In [6]:
def calculate_tav_amp(df):
    """
    Calculate annual average temperature (tav) and annual amplitude (amp).
    
    Parameters:
    -----------
    df : pd.DataFrame
        DataFrame with 'date' as index and 'maxt' and 'mint' columns
    
    Returns:
    --------
    tuple: (tav, amp)
        tav: Annual average ambient temperature
        amp: Annual amplitude in mean monthly temperature
    """
    # Calculate daily mean temperature
    df = df.copy()
    df['tmean'] = (df['maxt'] + df['mint']) / 2.0
    
    # Calculate monthly means
    df['year'] = df.index.year
    df['month'] = df.index.month
    monthly_means = df.groupby(['year', 'month'])['tmean'].mean()
    
    # Calculate overall annual average (tav)
    tav = df['tmean'].mean()
    
    # Calculate annual amplitude (amp) as APSIM's tav_amp utility defines it:
    # average each calendar month across all years (12 values), then take the
    # warmest monthly mean minus the coolest. This holds in either hemisphere.
    climatological_means = monthly_means.groupby(level='month').mean()
    amp = climatological_means.max() - climatological_means.min()
    
    return tav, amp

In [7]:
def create_met_file(tasmax_df, tasmin_df, pr_df, rsds_df, scenario=None, 
                    output_dir=None, latitude=None, longitude=None, model=None, 
                    start_year=None, end_year=None):
    """
    Create MET format file from tasmax, tasmin, pr, rsds DataFrames.
    
    Parameters:
    -----------
    tasmax_df : pd.DataFrame
        DataFrame with date and value columns for maximum temperature
    tasmin_df : pd.DataFrame
        DataFrame with date and value columns for minimum temperature
    pr_df : pd.DataFrame
        DataFrame with date and value columns for precipitation
    rsds_df : pd.DataFrame
        DataFrame with date and value columns for surface downwelling shortwave radiation (W/m²) - REQUIRED
    scenario : str
        Scenario name (e.g., SSP585 or SSP245)
    output_dir : str
        Output directory path
    latitude : float
        Latitude in decimal degrees
    longitude : float
        Longitude in decimal degrees
    model : str
        Model name (e.g., "ACCESS CM2")
    start_year : int, optional
        Start year for MET file. Only data from this year onwards will be included.
        If None, uses the minimum date from the data.
    end_year : int, optional
        End year for MET file. Only data up to this year will be included.
        If None, uses the maximum date from the data.
    
    Returns:
    --------
    tuple: (tav, amp, num_rows, final_date_range)
        tav: Annual average ambient temperature
        amp: Annual amplitude in mean monthly temperature
        num_rows: Number of rows in the MET file
        final_date_range: Dict with 'start' and 'end' dates from the final data
    """
    # Merge all dataframes on date
    merged = tasmax_df.copy()
    merged = merged.rename(columns={'value': 'maxt'})
    merged['date'] = pd.to_datetime(merged['date'])
    
    # Merge tasmin
    tasmin_df = tasmin_df.copy()  # avoid modifying the caller's DataFrame
    tasmin_df['date'] = pd.to_datetime(tasmin_df['date'])
    merged = merged.merge(tasmin_df[['date', 'value']], on='date', how='outer')
    merged = merged.rename(columns={'value': 'mint'})
    
    # Merge pr (precipitation/rain)
    pr_df = pr_df.copy()  # avoid modifying the caller's DataFrame
    pr_df['date'] = pd.to_datetime(pr_df['date'])
    merged = merged.merge(pr_df[['date', 'value']], on='date', how='outer')
    merged = merged.rename(columns={'value': 'rain'})
    
    # Merge rsds (radiation) - REQUIRED
    # rsds is in W/m², convert to MJ/m² by multiplying by 0.0864 (seconds per day / 1e6)
    if rsds_df is None or len(rsds_df) == 0:
        raise ValueError("rsds_df is required but is None or empty")
    
    rsds_df = rsds_df.copy()  # avoid adding 'value_mj' to the caller's DataFrame
    rsds_df['date'] = pd.to_datetime(rsds_df['date'])
    rsds_df['date'] = rsds_df['date'].dt.normalize()  # Remove time component for proper date matching
    # Convert W/m² to MJ/m² (multiply by seconds per day / 1e6)
    rsds_df['value_mj'] = rsds_df['value'] * 0.0864
    # Debug: Check rsds data before merge
    print(f"  [DEBUG] rsds_df: {len(rsds_df)} rows, value range: {rsds_df['value'].min():.2f} to {rsds_df['value'].max():.2f} W/m²")
    print(f"  [DEBUG] After conversion: value_mj range: {rsds_df['value_mj'].min():.2f} to {rsds_df['value_mj'].max():.2f} MJ/m²")
    # Normalize merged dates before merge (in case they have time components)
    merged['date'] = merged['date'].dt.normalize()
    # Debug: Check date ranges before merge
    print(f"  [DEBUG] merged date range: {merged['date'].min()} to {merged['date'].max()}")
    print(f"  [DEBUG] rsds_df date range: {rsds_df['date'].min()} to {rsds_df['date'].max()}")
    print(f"  [DEBUG] rsds_df sample values (first 5): {rsds_df[['date', 'value_mj']].head().to_dict('records')}")
    merged = merged.merge(rsds_df[['date', 'value_mj']], on='date', how='outer')
    merged = merged.rename(columns={'value_mj': 'radn'})
    # Debug: Check radn after merge
    radn_nonzero = (merged['radn'].notna() & (merged['radn'] != 0)).sum()
    radn_total = merged['radn'].notna().sum()
    radn_zeros = (merged['radn'] == 0).sum()
    radn_nan = merged['radn'].isna().sum()
    print(f"  [DEBUG] After merge: {radn_nonzero} non-zero, {radn_zeros} zeros, {radn_nan} NaN out of {len(merged)} total rows")
    if radn_nonzero == 0 and radn_total > 0:
        print(f"  [WARNING] All radn values are zero after merge! Checking date overlap...")
        merged_dates = set(merged['date'].dt.date)
        rsds_dates = set(rsds_df['date'].dt.date)
        matching_dates = merged_dates.intersection(rsds_dates)
        print(f"  [WARNING] Date overlap: {len(matching_dates)} matching dates out of {len(merged_dates)} merged dates and {len(rsds_dates)} rsds dates")
        if len(matching_dates) == 0:
            print(f"  [ERROR] NO DATE OVERLAP! Dates don't match between merged and rsds dataframes!")
    
    # vp and evap are left blank, code is hardcoded to '222222'
    merged['vp'] = ''
    merged['evap'] = ''
    merged['code'] = '222222'  # Hardcoded code value for all rows
    
    # Sort by date
    merged = merged.sort_values('date').reset_index(drop=True)
    
    # Filter data to only include dates from start_year onwards and up to end_year
    if start_year is not None:
        start_date = pd.Timestamp(year=start_year, month=1, day=1)
        merged = merged[merged['date'] >= start_date].copy()
    
    if end_year is not None:
        end_date = pd.Timestamp(year=end_year, month=12, day=31)
        merged = merged[merged['date'] <= end_date].copy()
    
    if len(merged) == 0:
        raise ValueError("No data remaining after filtering! Check START_YEAR and END_YEAR settings.")
    
    # Normalize dates to remove time components
    merged['date'] = pd.to_datetime(merged['date']).dt.normalize()
    
    # Remove duplicate dates (keep first occurrence)
    merged = merged.drop_duplicates(subset='date', keep='first').reset_index(drop=True)
    
    # Get actual min and max dates from filtered data
    actual_min_date = merged['date'].min()
    actual_max_date = merged['date'].max()
    
    # Use the actual data range, rounded to full years
    min_date = pd.Timestamp(year=actual_min_date.year, month=1, day=1)
    max_date = pd.Timestamp(year=actual_max_date.year, month=12, day=31)
    
    # Create complete date range (includes all days, including day 366 for leap years)
    complete_date_range = pd.date_range(start=min_date, end=max_date, freq='D')
    complete_date_range = pd.to_datetime(complete_date_range).normalize()
    
    # Set date as index for reindexing
    merged = merged.set_index('date')
    
    # Reindex to include all days in the complete range
    merged = merged.reindex(complete_date_range)
    
    # Fill missing values for numeric columns using forward fill then backward fill
    numeric_cols = ['maxt', 'mint', 'rain', 'radn']
    for col in numeric_cols:
        if col in merged.columns:
            merged[col] = merged[col].ffill().bfill()
            if merged[col].notna().any():
                mean_val = merged[col].mean()
                if pd.notna(mean_val):
                    merged[col] = merged[col].fillna(mean_val)
                else:
                    merged[col] = merged[col].fillna(0.0)
            else:
                merged[col] = merged[col].fillna(0.0)
    
    # Handle vp, evap - ensure they remain blank (empty string)
    for col in ['vp', 'evap']:
        if col in merged.columns:
            merged[col] = merged[col].fillna('')
    
    # Handle code - ensure it's hardcoded to '222222'
    if 'code' in merged.columns:
        merged['code'] = '222222'
    
    # Reset index to get date back as a column
    merged = merged.reset_index()
    merged = merged.rename(columns={'index': 'date'})
    
    # Calculate tav and amp
    merged_temp = merged[['date', 'maxt', 'mint']].copy()
    merged_temp = merged_temp.set_index('date')
    merged_temp.index = pd.to_datetime(merged_temp.index)
    tav, amp = calculate_tav_amp(merged_temp)
    
    # Create year and day columns
    merged['year'] = merged['date'].dt.year
    merged['day'] = merged['date'].dt.dayofyear
    
    met_data = merged[['year', 'day', 'radn', 'maxt', 'mint', 'rain', 'evap', 'vp', 'code']].copy()
    
    # Prepare header
    current_date = datetime.now().strftime('%Y%m%d')
    model_scenario = f"{model.replace(' ', '_')}_{scenario}" if model and scenario else "CMIP6"
    
    header = f"""[weather.met.weather]
!Your Ref:  "
latitude = {latitude:.2f}  (DECIMAL DEGREES)
longitude =  {longitude:.2f}  (DECIMAL DEGREES)
tav = {tav:.2f} (oC) ! Annual average ambient temperature.
amp = {amp:.2f} (oC) ! Annual amplitude in mean monthly temperature.
!Data Extracted from CMIP6 {model} {scenario} dataset on {current_date} for APSIM
!As evaporation is read at 9am, it has been shifted to day before
!ie The evaporation measured on 20 April is in row for 19 April
!The 6 digit code indicates the source of the 6 data columns
!0 actual observation, 1 actual observation composite station
!2 interpolated from daily observations
!3 interpolated from daily observations using anomaly interpolation method for CLIMARC data
!6 synthetic pan
!7 interpolated long term averages
!more detailed two digit codes are available in SILO's 'Standard' format files
!
!For further information see the documentation on the datadrill
!  http://www.longpaddock.qld.gov.au/silo
!
year  day radn  maxt   mint  rain  evap    vp   code
 ()   () (MJ/m^2) (oC)  (oC)  (mm)  (mm) (hPa)     ()
"""
    
    # Create output filename
    lat_str = f"{latitude:.2f}"
    lon_str = f"{longitude:.2f}"
    output_filename = f"{model_scenario}_{lat_str}_{lon_str}.met"
    output_path = os.path.join(output_dir, output_filename)
    
    # Write MET file
    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(header)
        # Write data rows
        def format_optional(value):
            """Return a 6-wide float string, or 6 spaces when the value is blank or missing."""
            if value == '' or pd.isna(value):
                return " " * 6
            return f"{float(value):6.1f}"

        for _, row in met_data.iterrows():
            # Optional fields are left blank (6 spaces) when no value is available
            radn_str = format_optional(row['radn'])
            evap_str = format_optional(row['evap'])
            vp_str = format_optional(row['vp'])

            # Code is hardcoded to '222222' for all rows
            code_str = "222222"
            
            # Handle NaN values for maxt, mint, rain - use 0.0 as default
            maxt_val = row['maxt'] if pd.notna(row['maxt']) else 0.0
            mint_val = row['mint'] if pd.notna(row['mint']) else 0.0
            rain_val = row['rain'] if pd.notna(row['rain']) else 0.0
            
            # Format with proper column widths
            line = f"{int(row['year']):4d} {int(row['day']):4d} {radn_str} {maxt_val:6.1f} {mint_val:6.1f} {rain_val:6.1f} {evap_str} {vp_str} {code_str}\n"
            f.write(line)
    
    print(f"  [OK] Created MET file: {output_filename}")
    
    # Also create CSV version
    csv_filename = f"{model_scenario}_{lat_str}_{lon_str}.csv"
    csv_path = os.path.join(output_dir, csv_filename)
    
    # Write CSV (without header comments, just data)
    met_data.to_csv(csv_path, index=False, encoding='utf-8', float_format='%.1f')
    print(f"  [OK] Created CSV file: {csv_filename}")
    
    # Get date range
    if 'date' in merged.columns and len(merged) > 0:
        final_date_range = {
            'start': merged['date'].min(),
            'end': merged['date'].max()
        }
    else:
        final_date_range = {'start': None, 'end': None}
    
    return tav, amp, len(met_data), final_date_range
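
Because `evap` and `vp` are written as blank fields, a data row cannot be split reliably on whitespace: the blank columns collapse and `code` shifts left. The sketch below is illustrative and not part of the notebook. It rebuilds one row with the same field widths used in `create_met_file` and reads it back by character position. The helper names `format_met_row` and `parse_met_row` are hypothetical.

```python
# Column names in the order written by create_met_file.
COLUMNS = ["year", "day", "radn", "maxt", "mint", "rain", "evap", "vp", "code"]
# Field widths matching the row f-string: 4, 4, six 6-wide fields, then the 6-digit code.
WIDTHS = [4, 4, 6, 6, 6, 6, 6, 6, 6]


def format_met_row(year, day, radn, maxt, mint, rain, evap=None, vp=None, code="222222"):
    """Format one data row; None in an optional field becomes six spaces."""
    def opt(value):
        return " " * 6 if value is None else f"{value:6.1f}"
    return (f"{year:4d} {day:4d} {opt(radn)} {maxt:6.1f} {mint:6.1f} "
            f"{rain:6.1f} {opt(evap)} {opt(vp)} {code}")


def parse_met_row(line):
    """Slice a row by position; fields are separated by a single space."""
    fields, pos = {}, 0
    for name, width in zip(COLUMNS, WIDTHS):
        fields[name] = line[pos:pos + width].strip() or None
        pos += width + 1  # skip the single-space separator
    return fields


row = format_met_row(1985, 1, 28.4, 35.2, 18.9, 0.0)
parsed = parse_met_row(row)
print(len(row.split()))  # fewer than 9 tokens: the blank columns are lost
print(parsed["evap"], parsed["vp"], parsed["code"])
```

Positional slicing only works while every value fits its field width, so this is a check for files produced by this notebook rather than a general MET parser.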

## Section 5: Main Processing

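This section defines `process_coordinate`. For each variable it loads the CSV cache file when a valid one exists and otherwise extracts the data from the NetCDF files and saves a new cache file. It then creates the MET file from the four required variables.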
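
The cache-then-extract flow applied to each variable can be sketched on its own as follows. This is a minimal illustration using only the standard library. `load_or_extract` and the stand-in `fake_extract` are hypothetical names. The notebook's own helpers (`get_cached_variable_path`, `load_cached_variable`, `save_cached_variable`) are defined earlier and may behave differently, for example in how they validate a cache file.

```python
import csv
import os
import tempfile


def load_or_extract(cache_path, extract):
    """Return (rows, source): rows from cache_path if it exists, else from extract()."""
    if os.path.exists(cache_path):
        with open(cache_path, newline="", encoding="utf-8") as f:
            return list(csv.DictReader(f)), "cache"
    rows = extract()
    if rows:  # only cache successful, non-empty extractions
        with open(cache_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
    return rows, "extracted"


def fake_extract():
    """Stand-in for the slow NetCDF extraction step (hypothetical data)."""
    return [{"date": "1985-01-01", "value": "35.2"}]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "ACCESS_CM2_SSP245_-31.75_117.60_tasmax.csv")
    first_rows, first_source = load_or_extract(path, fake_extract)
    second_rows, second_source = load_or_extract(path, fake_extract)

print(first_source, second_source)
```

The first call extracts and writes the cache; the second call reads the cache and never invokes the extraction function.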

In [8]:
def process_coordinate(model, scenario, latitude, longitude, variables, cmip6_base_dir, output_dir, tolerance=0.01):
    """
    Main processing function for user-provided coordinate.
    Extract all variables from NC files and convert to MET format.
    
    Parameters:
    -----------
    model : str
        Model name (e.g., "ACCESS CM2")
    scenario : str
        Scenario name (e.g., "SSP245")
    latitude : float
        Target latitude
    longitude : float
        Target longitude
    variables : list
        List of variable names to extract
    cmip6_base_dir : str
        Base directory containing Model Scenario folders
    output_dir : str
        Output directory for results
    tolerance : float
        Coordinate matching tolerance
    
    Returns:
    --------
    dict or None: Summary statistics, or None if the data directory or a
        required variable is missing
    """
    print("="*70)
    print(f"Processing Coordinate: ({latitude:.6f}, {longitude:.6f})")
    print(f"Model: {model}, Scenario: {scenario}")
    print("="*70)
    
    # Construct data directory path
    data_dir = os.path.join(cmip6_base_dir, f"{model} {scenario}")
    
    if not os.path.exists(data_dir):
        print(f"ERROR: Data directory not found: {data_dir}")
        return None
    
    print(f"\nData directory: {data_dir}")
    
    
    # Prepare coordinate strings for cache file naming
    lat_str = f"{latitude:.2f}"
    lon_str = f"{longitude:.2f}"
    model_scenario = f"{model.replace(' ', '_')}_{scenario}"
    
    # Extract data for all variables (check cache first)
    extracted_data = {}
    
    for variable in variables:
        print(f"\n{'='*70}")
        print(f"Processing variable: {variable}")
        print(f"{'='*70}")
        
        # Check if cached CSV file exists
        cache_path = get_cached_variable_path(output_dir, model_scenario, lat_str, lon_str, variable)
        df = load_cached_variable(cache_path)
        
        # If cache doesn't exist or is invalid, extract from NetCDF
        if df is None:
            # Extract data from NetCDF files
            df = extract_daily_data_from_netcdf(
                data_dir, 
                variable, 
                latitude, 
                longitude, 
                tolerance=tolerance
            )
            
            # Save to cache if extraction was successful
            if df is not None and len(df) > 0:
                save_cached_variable(df, cache_path)
        
        # Add to extracted_data if we have valid data
        if df is not None and len(df) > 0:
            extracted_data[variable] = df
        else:
            print(f"  WARNING: No data available for {variable}")
    
    # Check that all variables required for MET conversion are available
    # (rsds is mandatory because radn is a required MET column)
    required_vars = ['tasmax', 'tasmin', 'pr', 'rsds']
    missing_vars = [v for v in required_vars if v not in extracted_data]
    
    if missing_vars:
        print(f"\nERROR: Missing required variables for MET conversion: {missing_vars}")
        return None
    
    # Create MET file
    print(f"\n{'='*70}")
    print("Creating MET file...")
    print(f"{'='*70}")
    
    # Get required variables
    tasmax_df = extracted_data['tasmax']
    tasmin_df = extracted_data['tasmin']
    pr_df = extracted_data['pr']
    
    # rsds is guaranteed to be present by the required_vars check above
    rsds_df = extracted_data['rsds']
    
    # Note: code is hardcoded to '222222', vp and evap are left blank
    
    tav, amp, num_rows, final_date_range = create_met_file(
        tasmax_df=tasmax_df,
        tasmin_df=tasmin_df,
        pr_df=pr_df,
        rsds_df=rsds_df,
        scenario=scenario,
        output_dir=output_dir,
        latitude=latitude,
        longitude=longitude,
        model=model,
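        start_year=START_YEAR,  # notebook-level global, not a parameter of this function
        end_year=END_YEAR       # notebook-level global, not a parameter of this function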
    )
    
    # Summary
    summary = {
        'latitude': latitude,
        'longitude': longitude,
        'model': model,
        'scenario': scenario,
        'variables_extracted': list(extracted_data.keys()),
        'num_variables': len(extracted_data),
        'tav': tav,
        'amp': amp,
        'num_rows': num_rows,
        'date_range': final_date_range if final_date_range['start'] is not None else {
            'start': tasmax_df['date'].min(),
            'end': tasmax_df['date'].max()
        }
    }
    
    print(f"\n{'='*70}")
    print("Processing Summary")
    print(f"{'='*70}")
    print(f"  Variables extracted: {len(extracted_data)}")
    print(f"    - {', '.join(extracted_data.keys())}")
    print(f"  MET file rows: {num_rows}")
    print(f"  Date range: {summary['date_range']['start']} to {summary['date_range']['end']}")
    print(f"  tav (annual average temp): {tav:.2f} °C")
    print(f"  amp (annual amplitude): {amp:.2f} °C")
    print(f"  Output directory: {output_dir}")
    print(f"{'='*70}")
    
    return summary
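
The `tav` and `amp` values in the summary come from `calculate_tav_amp`, which is defined earlier in the notebook and not shown in this section. As a rough guide to what those numbers mean, the sketch below assumes the usual APSIM convention: average the daily mean temperature `(maxt + mint) / 2` by calendar month across all years, then take the mean of the twelve monthly values as `tav` and their range as `amp`. The function name `tav_amp_sketch` is hypothetical, and the notebook's implementation may differ in detail.

```python
from collections import defaultdict
from datetime import date, timedelta


def tav_amp_sketch(records):
    """records: iterable of (date, maxt, mint). Returns (tav, amp) in degrees C."""
    monthly = defaultdict(list)
    for day, maxt, mint in records:
        monthly[day.month].append((maxt + mint) / 2.0)
    # One long-term mean per calendar month present in the data
    monthly_means = [sum(v) / len(v) for _, v in sorted(monthly.items())]
    tav = sum(monthly_means) / len(monthly_means)
    amp = max(monthly_means) - min(monthly_means)
    return tav, amp


# Synthetic year: January days are 30/20, July days are 16/4, all other days 22/12.
records = []
day = date(1985, 1, 1)
while day.year == 1985:
    if day.month == 1:
        records.append((day, 30.0, 20.0))
    elif day.month == 7:
        records.append((day, 16.0, 4.0))
    else:
        records.append((day, 22.0, 12.0))
    day += timedelta(days=1)

tav, amp = tav_amp_sketch(records)
print(f"tav = {tav:.2f}, amp = {amp:.2f}")
```

In this synthetic case the monthly means are 25 for January, 10 for July, and 17 elsewhere, so `amp` is the 15-degree spread between the warmest and coolest months.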

## Section 6: Execute Processing

In [None]:
# Execute main processing
print("\n" + "="*70)
print("STARTING PROCESSING")
print("="*70)
print(f"Model: {MODEL}")
print(f"Scenario: {SCENARIO}")
print(f"Coordinates: ({LATITUDE:.6f}, {LONGITUDE:.6f})")
print(f"Variables to process: {len(VARIABLES)} ({', '.join(VARIABLES)})")
print(f"CMIP6 Base Directory: {CMIP6_BASE_DIR}")
print(f"Output directory: {OUTPUT_DIR}")
print("="*70 + "\n")

summary = process_coordinate(
    model=MODEL,
    scenario=SCENARIO,
    latitude=LATITUDE,
    longitude=LONGITUDE,
    variables=VARIABLES,
    cmip6_base_dir=CMIP6_BASE_DIR,
    output_dir=OUTPUT_DIR,
    tolerance=COORD_TOLERANCE
)

if summary:
    print("\n" + "="*70)
    print("✓ PROCESSING COMPLETED SUCCESSFULLY!")
    print("="*70)
    print("\nOutput files created:")
    print(f"  - MET file: {MODEL.replace(' ', '_')}_{SCENARIO}_{LATITUDE:.2f}_{LONGITUDE:.2f}.met")
    print(f"  - CSV file: {MODEL.replace(' ', '_')}_{SCENARIO}_{LATITUDE:.2f}_{LONGITUDE:.2f}.csv")
    print(f"  - Individual variable CSV cache files: {len(summary.get('variables_extracted', []))} files")
    print(f"\nAll files saved to: {OUTPUT_DIR}")
else:
    print("\n" + "="*70)
    print("✗ PROCESSING FAILED")
    print("="*70)
    print("Please check error messages above.")


STARTING PROCESSING
Model: ACCESS CM2
Scenario: obs
Coordinates: (-31.450000, 117.550000)
Variables to process: 4 (tasmax, tasmin, pr, rsds)
CMIP6 Base Directory: C:\Users\ibian\Desktop\ClimAdapt\CMIP6
Output directory: C:\Users\ibian\Desktop\ClimAdapt\Anameka\-31.45_117.55

Processing Coordinate: (-31.450000, 117.550000)
Model: ACCESS CM2, Scenario: obs

Data directory: C:\Users\ibian\Desktop\ClimAdapt\CMIP6\ACCESS CM2 obs

Processing variable: tasmax
  [INFO] Checking cache: ACCESS_CM2_obs_-31.45_117.55_tasmax.csv
  [INFO] Cache file found and valid - loaded 10,957 records
  [INFO] Date range: 1985-01-01 09:00:00 to 2014-12-31 09:00:00

Processing variable: tasmin
  [INFO] Checking cache: ACCESS_CM2_obs_-31.45_117.55_tasmin.csv
  [INFO] Cache file found and valid - loaded 10,957 records
  [INFO] Date range: 1985-01-01 09:00:00 to 2014-12-31 09:00:00

Processing variable: pr
  [INFO] Checking cache: ACCESS_CM2_obs_-31.45_117.55_pr.csv
  [INFO] Cache file found and valid - loaded 10,9