# 02 | Neural Hydrology CARAVAN Setup

## Purpose
Configure and launch LSTM training on the CARAVAN dataset using snow-dominated basins from around the world.
This notebook covers:
1. Identifying snow-dominated basins from CARAVAN attributes
2. Verifying the timeseries data format required by neuralhydrology
3. Writing the neuralhydrology config (`.yml`)
4. Writing and submitting the SLURM training job

**Outputs produced by this notebook:**
- `data/caravan_snow_basins.txt` — basin ID list for training/validation/test
- `configs/caravan_snow_scenario1.yml` — neuralhydrology training config
- `train_caravan_snow.slurm` — SLURM job script

**Shared storage:**
- CARAVAN data: `/uufs/chpc.utah.edu/common/home/johnsonrc-group1/CARAVAN/CARAVAN_data`
- Project directory: `/uufs/chpc.utah.edu/common/home/civil-group1/Meyer/neuralhydrology_project`

**Note for new users:**
> Any path containing `Meyer` should be updated to your own
> project directory before running this notebook. The one path you will need to change
> is `PROJECT_DIR` in Cell 3; all other paths are derived from it automatically.
> Also replace the `--mail-user` address in the SLURM script (Step 4) with your own.
> The shared CARAVAN data directory (`johnsonrc-group1`) does not need to change (assuming you are a UofU user).

## About CARAVAN

CARAVAN (Kratzert et al., 2023) is a global hydrology dataset designed specifically for large-sample,
data-driven modeling. Key characteristics:

| Property | Details |
|---|---|
| **Version used** | v1.6 |
| **Total basins** | 16,299 across 7 regions |
| **Regions** | CAMELS (US), CAMELS-AUS, CAMELS-BR, CAMELS-CL, CAMELS-GB, HYSETS (Canada), LamaH (Europe) |
| **Forcing data** | ERA5-Land reanalysis (1981–2020; the v1.6 sample file inspected in Step 2 extends from 1951 to 2023) |
| **Basin attributes** | HydroATLAS catchment characteristics |
| **Streamflow** | Observed daily discharge (mm/day, normalized by area) |
| **Format** | One CSV per basin, neuralhydrology-native |

### Why CARAVAN for Alaska transfer learning?
- ERA5-Land forcing is globally consistent — the **same variables and units** we will extract for Alaska basins
- Training on diverse global snow-dominated catchments exposes the model to a wide range of snow accumulation and melt regimes
- neuralhydrology has built-in CARAVAN support (`dataset: caravan`), simplifying the data loading pipeline

### Snow-dominated basin selection strategy
We use the CARAVAN attribute `frac_snow` (fraction of precipitation falling as snow) with a threshold
of **> 0.30**, consistent with our earlier CAMELS-US analysis. This yields ~3,297 basins globally,
providing a rich and geographically diverse training set for snow-driven streamflow prediction.

In [6]:
import pandas as pd
import numpy as np
from pathlib import Path

# ── Project paths ──────────────────────────────────────────────────────────────
PROJECT_DIR      = Path("/uufs/chpc.utah.edu/common/home/civil-group1/Meyer/neuralhydrology_project")
CARAVAN_DATA_DIR = Path("/uufs/chpc.utah.edu/common/home/johnsonrc-group1/CARAVAN/CARAVAN_data")

# ── Verify directories exist ───────────────────────────────────────────────────
assert PROJECT_DIR.exists(),      f"Project directory not found: {PROJECT_DIR}"
assert CARAVAN_DATA_DIR.exists(), f"CARAVAN data not found: {CARAVAN_DATA_DIR}"

# ── Ensure output directories exist ───────────────────────────────────────────
(PROJECT_DIR / "data").mkdir(exist_ok=True)
(PROJECT_DIR / "configs").mkdir(exist_ok=True)
(PROJECT_DIR / "results").mkdir(exist_ok=True)

print("✓ Paths configured")
print(f"  Project    : {PROJECT_DIR}")
print(f"  CARAVAN    : {CARAVAN_DATA_DIR}")

✓ Paths configured
  Project    : /uufs/chpc.utah.edu/common/home/civil-group1/Meyer/neuralhydrology_project
  CARAVAN    : /uufs/chpc.utah.edu/common/home/johnsonrc-group1/CARAVAN/CARAVAN_data


## Step 1: Identify Snow-Dominated Basins

CARAVAN attributes are organized by region, each with several thematic CSV files.
The file we need is `attributes_caravan_{region}.csv`, which contains ERA5-Land-derived
climate indices including `frac_snow`, the long-term mean fraction of precipitation
falling as snow.

### Attribute file locations
```
CARAVAN_data/
└── attributes/
    ├── camels/
    │   ├── attributes_caravan_camels.csv       ← contains frac_snow
    │   ├── attributes_hydroatlas_camels.csv    ← basin geomorphology
    │   └── attributes_other_camels.csv         ← gauge metadata, country, etc.
    ├── camelsaus/
    ├── camelsbr/
    ├── camelscl/
    ├── camelsgb/
    ├── hysets/
    └── lamah/
```

### Threshold selection
We use `frac_snow > 0.30`, meaning more than 30% of annual precipitation falls as snow
on average. This is the same threshold used in our CAMELS-US preliminary analysis and
ensures the model sees basins where snow accumulation and melt are a primary driver
of the hydrograph. Basins below this threshold are excluded from training — they would
dilute the snow signal and are not representative of Alaska hydrology.
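
To get a feel for how sensitive the selection is to the threshold, a minimal sketch along these lines could be used. The `frac_snow` values are made-up toy numbers (not CARAVAN attributes), and `select_above` is a hypothetical helper that mirrors the strict `>` comparison used in this notebook.

```python
# Toy frac_snow values keyed by basin ID (illustrative only, not real CARAVAN data).
toy_frac_snow = {
    "camels_A": 0.05,
    "camels_B": 0.30,
    "hysets_C": 0.31,
    "hysets_D": 0.62,
    "lamah_E": 0.45,
}

def select_above(frac_snow_by_basin, threshold):
    """Return sorted basin IDs whose frac_snow is strictly greater than the threshold."""
    return sorted(b for b, v in frac_snow_by_basin.items() if v > threshold)

# Sweep a few candidate thresholds and report how many basins survive each one.
for threshold in (0.20, 0.30, 0.40):
    selected = select_above(toy_frac_snow, threshold)
    print(f"frac_snow > {threshold:.2f}: {len(selected)} basins")
```

Note that a basin sitting exactly at 0.30 is excluded by the strict comparison.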

In [7]:
# ── Settings ───────────────────────────────────────────────────────────────────
SNOW_THRESHOLD = 0.30
REGIONS = ['camels', 'camelsaus', 'camelsbr', 'camelscl', 'camelsgb', 'hysets', 'lamah']
attributes_dir = CARAVAN_DATA_DIR / 'attributes'

# ── Filter snow-dominated basins by region ─────────────────────────────────────
all_snow_basins = []

print(f"Filtering basins with frac_snow > {SNOW_THRESHOLD}")
print("=" * 60)

for region in REGIONS:
    attr_file = attributes_dir / region / f"attributes_caravan_{region}.csv"
    df = pd.read_csv(attr_file, index_col=0)

    snow_basins = df[df['frac_snow'] > SNOW_THRESHOLD].copy()
    snow_basins['region'] = region

    pct = len(snow_basins) / len(df) * 100
    print(f"  {region.upper():12s}: {len(snow_basins):4d} / {len(df):5d} basins ({pct:4.1f}%)")

    all_snow_basins.append(snow_basins)

# ── Combine all regions ────────────────────────────────────────────────────────
snow_dominated = pd.concat(all_snow_basins)

print("=" * 60)
print(f"  {'TOTAL':12s}: {len(snow_dominated):4d} snow-dominated basins\n")

print("frac_snow statistics across selected basins:")
print(f"  Mean   : {snow_dominated['frac_snow'].mean():.3f}")
print(f"  Median : {snow_dominated['frac_snow'].median():.3f}")
print(f"  Min    : {snow_dominated['frac_snow'].min():.3f}")
print(f"  Max    : {snow_dominated['frac_snow'].max():.3f}")

# ── Save basin ID list (plain text, one ID per line) ──────────────────────────
basin_list_file = PROJECT_DIR / 'data' / 'caravan_snow_basins.txt'

with open(basin_list_file, 'w') as f:
    for basin_id in sorted(snow_dominated.index):
        f.write(f"{basin_id}\n")

print(f"\n✓ Basin list saved: {basin_list_file}")
print(f"  {len(snow_dominated):,} basin IDs written (format: region_gageID)")
print(f"  Example IDs: {sorted(snow_dominated.index)[:3]}")

Filtering basins with frac_snow > 0.3
  CAMELS      :  128 /   671 basins (19.1%)
  CAMELSAUS   :    0 /   561 basins ( 0.0%)
  CAMELSBR    :    0 /   870 basins ( 0.0%)
  CAMELSCL    :  144 /   505 basins (28.5%)
  CAMELSGB    :    0 /   671 basins ( 0.0%)
  HYSETS      : 2803 / 12162 basins (23.0%)
  LAMAH       :  222 /   859 basins (25.8%)
  TOTAL       : 3297 snow-dominated basins

frac_snow statistics across selected basins:
  Mean   : 0.453
  Median : 0.437
  Min    : 0.300
  Max    : 0.903

✓ Basin list saved: /uufs/chpc.utah.edu/common/home/civil-group1/Meyer/neuralhydrology_project/data/caravan_snow_basins.txt
  3,297 basin IDs written (format: region_gageID)
  Example IDs: ['camels_01013500', 'camels_01022500', 'camels_01030500']


## Step 2: Verify CARAVAN Timeseries Format

Before writing the neuralhydrology config, we need to confirm the exact column names
and date format in the CARAVAN timeseries files. This is a one-time sanity check —
neuralhydrology is strict about variable names and will throw an error at training time
if any `dynamic_inputs` or `target_variables` are missing or misnamed.

### Timeseries file locations
```
CARAVAN_data/
└── timeseries/
    └── csv/
        ├── camels/
        │   ├── camels_01013500.csv    ← one file per basin
        │   ├── camels_01022500.csv
        │   └── ...
        ├── camelsaus/
        ├── camelsbr/
        ├── camelscl/
        ├── camelsgb/
        ├── hysets/
        └── lamah/
```
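
Given this layout, the path to a basin's file can be derived from its ID alone. The sketch below assumes IDs follow the `<region>_<gauge>` pattern seen in the basin list (for example `camels_01013500`); `timeseries_path` is a hypothetical helper, not part of neuralhydrology.

```python
from pathlib import Path

def timeseries_path(data_dir, basin_id):
    """Build the expected CSV path for a basin, assuming IDs look like '<region>_<gauge>'."""
    # Everything before the first underscore is taken to be the region folder name.
    region = basin_id.split("_", 1)[0]
    return Path(data_dir) / "timeseries" / "csv" / region / f"{basin_id}.csv"

print(timeseries_path("CARAVAN_data", "camels_01013500"))
```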

Each file contains daily time steps as rows and ERA5-Land forcing variables + streamflow as columns.
We will verify the following variables are present, as these are what we will pass to the model:

| Role | Variable name |
|---|---|
| Forcing | `total_precipitation_sum` |
| Forcing | `temperature_2m_max` |
| Forcing | `temperature_2m_min` |
| Forcing | `surface_net_solar_radiation_mean` |
| Target | `streamflow` |
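
Checking a single sample file is quick, but the same test could be applied to many basins by reading only each file's header row. The following is a rough sketch under that assumption; `missing_columns` is a hypothetical helper and the in-memory CSV stands in for a real basin file.

```python
import csv
import io

REQUIRED = [
    "total_precipitation_sum",
    "temperature_2m_max",
    "temperature_2m_min",
    "surface_net_solar_radiation_mean",
    "streamflow",
]

def missing_columns(csv_file, required):
    """Read only the header row of an open CSV file and return the required names that are absent."""
    header = next(csv.reader(csv_file))
    return [name for name in required if name not in header]

# Stand-in for a basin file; a real check would open a CSV under timeseries/csv/<region>/.
fake_file = io.StringIO(
    "date,streamflow,temperature_2m_max,total_precipitation_sum\n"
    "1990-01-01,1.2,3.4,0.0\n"
)
print(missing_columns(fake_file, REQUIRED))
```

An empty list means every required variable is present in that file.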

In [8]:
# ── Load a sample CAMELS basin timeseries ─────────────────────────────────────
timeseries_dir = CARAVAN_DATA_DIR / 'timeseries' / 'csv'

sample_file = next((timeseries_dir / 'camels').glob('*.csv'))
df_sample = pd.read_csv(sample_file)

print(f"Sample file: {sample_file.name}")
print(f"Shape      : {df_sample.shape}  ({df_sample.shape[0]} days × {df_sample.shape[1]} columns)")
print(f"\nDate column:")
print(f"  dtype  : {df_sample['date'].dtype}")
print(f"  format : {df_sample['date'].iloc[0]}  (YYYY-MM-DD)")
print(f"  range  : {df_sample['date'].iloc[0]}  →  {df_sample['date'].iloc[-1]}")

print(f"\nAll columns ({len(df_sample.columns)}):")
for col in df_sample.columns:
    print(f"  {col}")

# ── Verify required variables are present ─────────────────────────────────────
REQUIRED_INPUTS = [
    'total_precipitation_sum',
    'temperature_2m_max',
    'temperature_2m_min',
    'surface_net_solar_radiation_mean',
]
REQUIRED_TARGET = ['streamflow']

print(f"\n{'='*60}")
print("Required variable check:")
print(f"{'='*60}")

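all_present = True
for var in REQUIRED_INPUTS:
    present = var in df_sample.columns
    all_present = all_present and present  # track failures so the summary below is accurate
    status = '✓' if present else '✗  MISSING'
    print(f"  {status}  {var}")
for var in REQUIRED_TARGET:
    present = var in df_sample.columns
    all_present = all_present and present
    status = '✓' if present else '✗  MISSING'
    print(f"  {status}  {var}  (target)")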
if all_present:
    print(f"\n✓ All required variables confirmed — safe to proceed with config")
else:
    print(f"\n✗ Missing variables detected — update dynamic_inputs before continuing")

Sample file: camels_05495500.csv
Shape      : (26662, 41)  (26662 days × 41 columns)

Date column:
  dtype  : object
  format : 1951-01-01  (YYYY-MM-DD)
  range  : 1951-01-01  →  2023-12-30

All columns (41):
  date
  dewpoint_temperature_2m_max
  dewpoint_temperature_2m_mean
  dewpoint_temperature_2m_min
  potential_evaporation_sum_ERA5_LAND
  potential_evaporation_sum_FAO_PENMAN_MONTEITH
  snow_depth_water_equivalent_max
  snow_depth_water_equivalent_mean
  snow_depth_water_equivalent_min
  streamflow
  surface_net_solar_radiation_max
  surface_net_solar_radiation_mean
  surface_net_solar_radiation_min
  surface_net_thermal_radiation_max
  surface_net_thermal_radiation_mean
  surface_net_thermal_radiation_min
  surface_pressure_max
  surface_pressure_mean
  surface_pressure_min
  temperature_2m_max
  temperature_2m_mean
  temperature_2m_min
  total_precipitation_sum
  u_component_of_wind_10m_max
  u_component_of_wind_10m_mean
  u_component_of_wind_10m_min
  v_component_of_wind_10m_ma

## Step 3: Write the neuralhydrology Training Config

The config is a `.yml` file that fully specifies a neuralhydrology experiment — data, model
architecture, training hyperparameters, and logging. Every training run is reproducible from
its config alone. Below is a section-by-section explanation of what we are setting and why.

---

### Experiment + paths
- `experiment_name` — used as the run folder name prefix under `run_dir`
- `run_dir` — where all outputs (weights, metrics, figures) are written
- `train/validation/test_basin_file` — all three point to our 3,297-basin snow list;
  this config splits by **time**, not by basin, so the same basins appear in all three periods

### Training / validation / test periods
We use a classic hydrology split on the ERA5-Land record (1981–2020):

| Split | Period | Purpose |
|---|---|---|
| Train | 1990–2010 | Model learning |
| Validation | 2011–2015 | Epoch selection, early stopping |
| Test | 2016–2020 | Final unbiased evaluation |

We start in 1990 (not 1981) to avoid the ERA5-Land spin-up period and ensure
soil moisture and snow states are well-initialized before training begins.
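
Before committing to a split, it can help to confirm that the three periods parse in the day/month/year format used in the config and do not overlap. This is a small sanity-check sketch; `parse_periods` and `periods_are_disjoint` are hypothetical helpers, not neuralhydrology functions.

```python
from datetime import datetime

DATE_FORMAT = "%d/%m/%Y"  # day/month/year, matching the date strings used in the config

periods = {
    "train": ("01/01/1990", "31/12/2010"),
    "validation": ("01/01/2011", "31/12/2015"),
    "test": ("01/01/2016", "31/12/2020"),
}

def parse_periods(raw):
    """Convert (start, end) date strings into datetime pairs."""
    return {
        name: tuple(datetime.strptime(d, DATE_FORMAT) for d in dates)
        for name, dates in raw.items()
    }

def periods_are_disjoint(parsed):
    """True if each period starts before it ends and ends before the next one begins."""
    ordered = sorted(parsed.values())
    each_valid = all(start < end for start, end in ordered)
    no_overlap = all(prev[1] < nxt[0] for prev, nxt in zip(ordered, ordered[1:]))
    return each_valid and no_overlap

parsed = parse_periods(periods)
for name, (start, end) in parsed.items():
    print(f"{name:10s}: {(end - start).days + 1} days")
print("disjoint:", periods_are_disjoint(parsed))
```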

### Dataset
- `dataset: caravan` — activates neuralhydrology's built-in CARAVAN loader
- `data_dir` — must point to the root `CARAVAN_data/` folder; the loader expects
  the `timeseries/csv/{region}/` and `attributes/{region}/` subdirectory structure

### Model architecture
- `model: cudalstm` — LSTM built on PyTorch's cuDNN-backed `nn.LSTM`; the same architecture as `customlstm` but faster
- `hidden_size: 128` — number of LSTM memory cells; balances capacity vs. overfitting
- `initial_forget_bias: 3` — initializes the forget gate to stay open, helping the LSTM
  retain long-term snow accumulation signals early in training
- `output_dropout: 0.4` — dropout applied to the LSTM output before the regression head, for regularization
- `head: regression` + `output_activation: linear` — direct continuous streamflow prediction
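
To see why a forget-gate bias of 3 keeps the gate "open", consider the gate's sigmoid activation at the moment the weighted inputs contribute roughly zero. The arithmetic below is a simplified illustration of that one case, not a description of neuralhydrology's internals.

```python
import math

def sigmoid(x):
    """Logistic function used by LSTM gates."""
    return 1.0 / (1.0 + math.exp(-x))

# Gate value when only the bias contributes (weighted inputs assumed to be ~0).
for bias in (0, 1, 3):
    print(f"forget-gate bias {bias}: initial gate value ~ {sigmoid(bias):.3f}")
```

With a bias of 3 the gate starts near 0.95, so about 95% of the cell state carries over per step, compared with 50% for a bias of 0.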

### Dynamic inputs + target
- Four ERA5-Land variables chosen for physical relevance to snow-dominated hydrology:
  precipitation, min/max temperature (drives melt), and solar radiation (energy balance)
- `clip_targets_to_zero: [streamflow]` — prevents the model from predicting negative discharge
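
The effect of clipping can be shown with a few made-up predictions. This only illustrates the idea; it is not neuralhydrology's implementation, and `clip_to_zero` is a hypothetical helper.

```python
def clip_to_zero(values):
    """Replace negative values with zero and leave everything else unchanged."""
    return [max(0.0, v) for v in values]

# Hypothetical model outputs in mm/day; small negatives can appear during low flow.
predictions_mm_per_day = [1.8, 0.2, -0.05, 0.0, -0.3]
print(clip_to_zero(predictions_mm_per_day))
```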

### Optimizer + training
- `optimizer: Adam` with a stepped learning rate schedule: 1e-3 → 5e-4 → 1e-4
- `seq_length: 365` — one full year of antecedent conditions fed to the LSTM at each step,
  critical for capturing the full snow accumulation season before spring melt
- `predict_last_n: 1` — only the final timestep of each sequence is used for loss computation
- `clip_gradient_norm: 1.0` — prevents exploding gradients during early training epochs
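
The stepped schedule can be read as a lookup from epoch to learning rate. The sketch below assumes each dictionary key marks the epoch from which that rate applies; that reading is an assumption made for illustration, and `learning_rate_at` is a hypothetical helper.

```python
def learning_rate_at(epoch, schedule):
    """Return the rate attached to the largest schedule key that is <= epoch.

    Assumes each key is the epoch at which its rate takes effect.
    """
    active_key = max(k for k in schedule if k <= epoch)
    return schedule[active_key]

schedule = {0: 1e-3, 10: 5e-4, 20: 1e-4}
for epoch in (1, 9, 10, 19, 20, 30):
    print(f"epoch {epoch:2d}: lr = {learning_rate_at(epoch, schedule)}")
```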

### Validation + logging
- `validate_every: 1` — compute validation metrics after every epoch
- `validate_n_random_basins: 10` — evaluate on 10 random basins per epoch (fast proxy)
- `save_weights_every: 1` — checkpoint saved each epoch so we can recover the best epoch
- `save_validation_results: True` — writes per-basin metrics CSVs, required for best-epoch selection
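
Once per-epoch validation metrics are on disk, picking the best epoch is a small reduction over them. The sketch below uses made-up NSE values and selects by median NSE; both the numbers and the choice of median as the criterion are assumptions for illustration.

```python
from statistics import median

# Hypothetical per-basin validation NSE values for three epochs (made-up numbers).
validation_nse = {
    1: [0.41, 0.55, 0.38, 0.60],
    2: [0.52, 0.61, 0.47, 0.66],
    3: [0.50, 0.60, 0.40, 0.64],
}

def best_epoch(nse_by_epoch):
    """Return the epoch whose per-basin NSE values have the highest median."""
    return max(nse_by_epoch, key=lambda epoch: median(nse_by_epoch[epoch]))

print("best epoch:", best_epoch(validation_nse))
```

Because only 10 random basins are validated per epoch in this config, a median over so few basins is noisy and is best treated as a rough guide.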

In [9]:
from ruamel.yaml import YAML

yaml_handler = YAML()
yaml_handler.default_flow_style = False

config = {
    'experiment_name': 'caravan_snow_global_scenario1',
    'run_dir': str(PROJECT_DIR / 'results' / 'caravan_snow_global_scenario1'),

    'train_basin_file': str(basin_list_file),
    'validation_basin_file': str(basin_list_file),
    'test_basin_file': str(basin_list_file),

    'train_start_date': '01/01/1990',
    'train_end_date': '31/12/2010',
    'validation_start_date': '01/01/2011',
    'validation_end_date': '31/12/2015',
    'test_start_date': '01/01/2016',
    'test_end_date': '31/12/2020',

    'dataset': 'caravan',
    'data_dir': str(CARAVAN_DATA_DIR),

    'model': 'cudalstm',
    'head': 'regression',
    'output_activation': 'linear',
    'hidden_size': 128,
    'initial_forget_bias': 3,
    'output_dropout': 0.4,

    'dynamic_inputs': [
        'total_precipitation_sum',
        'temperature_2m_max',
        'temperature_2m_min',
        'surface_net_solar_radiation_mean',
    ],
    'target_variables': ['streamflow'],
    'clip_targets_to_zero': ['streamflow'],

    'optimizer': 'Adam',
    'loss': 'MSE',
    'epochs': 30,
    'learning_rate': {0: 1e-3, 10: 5e-4, 20: 1e-4},
    'batch_size': 256,
    'clip_gradient_norm': 1.0,
    'seq_length': 365,
    'predict_last_n': 1,

    'validate_every': 1,
    'validate_n_random_basins': 10,
    'cache_validation_data': True,
    'metrics': ['NSE', 'MSE', 'KGE', 'Alpha-NSE', 'Beta-NSE'],

    'device': 'cuda:0',
    'num_workers': 8,
    'seed': 42,

    'log_interval': 50,
    'log_tensorboard': True,
    'log_n_figures': 5,
    'save_weights_every': 1,
    'save_validation_results': True,
}

config_file = PROJECT_DIR / 'configs' / 'caravan_snow_scenario1.yml'
with open(config_file, 'w') as f:
    yaml_handler.dump(config, f)

print(f"✓ Config written: {config_file}")
print(f"\nKey settings:")
print(f"  Experiment : {config['experiment_name']}")
print(f"  Basins     : {basin_list_file.name}")
print(f"  data_dir   : {config['data_dir']}")
print(f"  Epochs     : {config['epochs']}")
print(f"  seq_length : {config['seq_length']}")
print(f"  hidden_size: {config['hidden_size']}")

✓ Config written: /uufs/chpc.utah.edu/common/home/civil-group1/Meyer/neuralhydrology_project/configs/caravan_snow_scenario1.yml

Key settings:
  Experiment : caravan_snow_global_scenario1
  Basins     : caravan_snow_basins.txt
  data_dir   : /uufs/chpc.utah.edu/common/home/johnsonrc-group1/CARAVAN/CARAVAN_data
  Epochs     : 30
  seq_length : 365
  hidden_size: 128


## Step 4: Write the SLURM Training Script

The SLURM script requests compute resources from CHPC's GRANITE cluster and launches
the neuralhydrology training run. Below is a line-by-line explanation of each directive
and command.

---

### SBATCH resource directives

| Directive | Value | Explanation |
|---|---|---|
| `--account` | `rai` | Charge compute time to the RAI allocation |
| `--partition` | `rai-gpu-grn` | GRANITE GPU nodes available to RAI |
| `--qos` | `rai-gpu-grn` | Quality of service tier matching the partition |
| `--nodes` | `1` | Single node; this run uses one training process on one GPU |
| `--ntasks` | `1` | One training process |
| `--cpus-per-task` | `16` | CPU cores for DataLoader workers (`num_workers: 8` needs headroom) |
| `--mem` | `64G` | System RAM; CARAVAN loads basin data into memory during training |
| `--gres=gpu` | `1` | One GPU; single-GPU training is stable and sufficient for 30 epochs |
| `--time` | `48:00:00` | 48-hour wall time limit; our 30-epoch run took ~24 hours |
| `--mail-type` | `BEGIN,END,FAIL` | Email notifications at job start, completion, or failure |
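
A quick way to double-check the resource request against the training config is to parse the directives back out of the script text. The following sketch assumes directives are written as `--key=value`; `parse_sbatch_directives` is a hypothetical helper and the script text is a shortened example.

```python
import re

def parse_sbatch_directives(script_text):
    """Collect '#SBATCH --key=value' lines into a dict of strings."""
    pattern = re.compile(r"^#SBATCH\s+--([\w-]+)=(\S+)\s*$", re.MULTILINE)
    return dict(pattern.findall(script_text))

example_script = """#!/bin/bash
#SBATCH --nodes=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G
#SBATCH --gres=gpu:1
#SBATCH --time=48:00:00
"""

directives = parse_sbatch_directives(example_script)
num_workers = 8  # value of num_workers in the training config

print(directives)
# The main process also needs a core, so require strictly more cores than workers.
print("enough CPU cores for workers:", int(directives["cpus-per-task"]) > num_workers)
```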

---

### Runtime commands

**Module loading**
- `module purge` — clears any inherited environment modules to ensure a clean state
- `module load miniconda3/25.9.1` — loads the Conda installation available on GRANITE
- `module load cuda/12.1` — loads CUDA toolkit matching our PyTorch build

**Environment activation**
- `source activate neuralhydrology` — activates the Conda environment where
  neuralhydrology, PyTorch, and all dependencies are installed

**Training launch**
- `nh-run train --config-file configs/caravan_snow_scenario1.yml` — the neuralhydrology
  CLI entry point; reads the config, initializes the model, and begins training.
  All outputs (weights, metrics, figures, logs) are written to `run_dir` as specified in the config.
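
If submission from Python is preferred over typing the command in a terminal, a thin wrapper around `sbatch` is one option. This is a sketch only: `submit_job` is a hypothetical helper, it defaults to a dry run that just returns the command string, and the real call requires `sbatch` to be on the `PATH` of a cluster login node.

```python
import shlex
import subprocess

def submit_job(script_path, dry_run=True):
    """Build the sbatch command; only invoke sbatch when dry_run is False."""
    command = ["sbatch", "--parsable", str(script_path)]
    if dry_run:
        # Return the command as a shell-safe string without running anything.
        return shlex.join(command)
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    # With --parsable, sbatch prints the job ID (optionally followed by ';cluster').
    return result.stdout.strip()

print(submit_job("train_caravan_snow.slurm"))
```

Keeping `dry_run=True` as the default avoids accidentally queuing a 48-hour job when re-running the notebook.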

In [10]:
# ── Ensure results directory exists ───────────────────────────────────────────
RESULTS_DIR = PROJECT_DIR / "results"
RESULTS_DIR.mkdir(exist_ok=True)

slurm_script = f"""#!/bin/bash
#SBATCH --account=rai
#SBATCH --partition=rai-gpu-grn
#SBATCH --qos=rai-gpu-grn
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G
#SBATCH --gres=gpu:1
#SBATCH --time=48:00:00
#SBATCH --job-name=caravan_snow_train
#SBATCH --output={PROJECT_DIR}/results/slurm_%j.out
#SBATCH --error={PROJECT_DIR}/results/slurm_%j.err
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=kaitlin.meyer@utah.edu

# Print job info
echo "=========================================="
echo "Job ID: $SLURM_JOB_ID"
echo "Node: $SLURM_NODELIST"
echo "Partition: $SLURM_JOB_PARTITION"
echo "QOS: $SLURM_JOB_QOS"
echo "Account: $SLURM_JOB_ACCOUNT"
echo "Start time: $(date)"
echo "=========================================="

# Load modules
module purge
module load miniconda3/25.9.1
module load cuda/12.1

# Activate conda environment
source activate neuralhydrology

# Print environment info
echo "Python: $(which python)"
echo "CUDA available: $(python -c 'import torch; print(torch.cuda.is_available())')"
echo "GPU: $(nvidia-smi --query-gpu=name,memory.total --format=csv,noheader)"

# Run training
cd {PROJECT_DIR}
echo "Starting training at $(date)"
nh-run train --config-file configs/caravan_snow_scenario1.yml

echo "=========================================="
echo "Job completed at: $(date)"
echo "=========================================="
"""

# ── Write SLURM script ─────────────────────────────────────────────────────────
slurm_file = PROJECT_DIR / "train_caravan_snow.slurm"
with open(slurm_file, 'w') as f:
    f.write(slurm_script)

print(f"✓ SLURM script written: {slurm_file}")
print(f"\nResources requested:")
print(f"  Account   : rai")
print(f"  Partition : rai-gpu-grn")
print(f"  GPU       : 1x (GRANITE)")
print(f"  Memory    : 64 GB")
print(f"  Wall time : 48 hours")
print(f"\nTo submit:")
print(f"  cd {PROJECT_DIR}")
print(f"  sbatch train_caravan_snow.slurm")
print(f"\nTo monitor:")
print(f"  squeue -u $USER")
print(f"  tail -f results/slurm_<JOBID>.out")

✓ SLURM script written: /uufs/chpc.utah.edu/common/home/civil-group1/Meyer/neuralhydrology_project/train_caravan_snow.slurm

Resources requested:
  Account   : rai
  Partition : rai-gpu-grn
  GPU       : 1x (GRANITE)
  Memory    : 64 GB
  Wall time : 48 hours

To submit:
  cd /uufs/chpc.utah.edu/common/home/civil-group1/Meyer/neuralhydrology_project
  sbatch train_caravan_snow.slurm

To monitor:
  squeue -u $USER
  tail -f results/slurm_<JOBID>.out


In [12]:
# ── Pre-submission checklist ───────────────────────────────────────────────────
print("Pre-Submission Checklist:")
print("=" * 60)

checks = {
    'CARAVAN data directory exists'  : CARAVAN_DATA_DIR.exists(),
    'Basin list file exists'         : basin_list_file.exists(),
    'Config file exists'             : config_file.exists(),
    'SLURM script exists'            : slurm_file.exists(),
    'Results directory exists'       : RESULTS_DIR.exists(),
}

all_good = True
for description, status in checks.items():
    symbol = '✓' if status else '✗'
    print(f"  {symbol}  {description}")
    if not status:
        all_good = False

# ── Basin count verification ───────────────────────────────────────────────────
print()
with open(basin_list_file, 'r') as f:
    n_basins = sum(1 for line in f if line.strip())
print(f"  Basin list contains {n_basins:,} basins")

# ── Config spot-check ──────────────────────────────────────────────────────────
print()
print("Config spot-check:")
print(f"  experiment_name : {config['experiment_name']}")
print(f"  data_dir        : {config['data_dir']}")
print(f"  train period    : {config['train_start_date']} → {config['train_end_date']}")
print(f"  epochs          : {config['epochs']}")
print(f"  seq_length      : {config['seq_length']}")
print(f"  dynamic_inputs  : {config['dynamic_inputs']}")
print(f"  target          : {config['target_variables']}")

print()
print("=" * 60)
if all_good:
    print("✓ All checks passed — ready to submit")
    print("=" * 60)
    print(f"\nSubmit with:")
    print(f"  cd {PROJECT_DIR}")
    print(f"  sbatch train_caravan_snow.slurm")
else:
    print("✗ Some checks failed — resolve before submitting")
    print("=" * 60)

Pre-Submission Checklist:
  ✓  CARAVAN data directory exists
  ✓  Basin list file exists
  ✓  Config file exists
  ✓  SLURM script exists
  ✓  Results directory exists

  Basin list contains 3,297 basins

Config spot-check:
  experiment_name : caravan_snow_global_scenario1
  data_dir        : /uufs/chpc.utah.edu/common/home/johnsonrc-group1/CARAVAN/CARAVAN_data
  train period    : 01/01/1990 → 31/12/2010
  epochs          : 30
  seq_length      : 365
  dynamic_inputs  : ['total_precipitation_sum', 'temperature_2m_max', 'temperature_2m_min', 'surface_net_solar_radiation_mean']
  target          : ['streamflow']

✓ All checks passed — ready to submit

Submit with:
  cd /uufs/chpc.utah.edu/common/home/civil-group1/Meyer/neuralhydrology_project
  sbatch train_caravan_snow.slurm
