# Create Basin Lists

This notebook builds the train/test basin lists used for gauged and ungauged experiments on the Caravan and Caravan MultiMet datasets.

Types of basin splits:
1) Gauged-Caravan: All basins in the Caravan dataset are used for training and testing.
2) Gauged-CAMELS: All basins in each CAMELS dataset are used individually for training and testing, independent of the other CAMELS datasets.
3) Ungauged Random Cross-Validation: The full basin list is partitioned into k random, non-overlapping folds. Each fold serves once as the test set while the remaining folds are used for training. The user defines the number of folds.
4) Ungauged CAMELS Cross-Validation: Each CAMELS dataset is withheld individually from training and then used to test a model trained on all other CAMELS datasets.

This notebook creates a subdirectory for each of these types of basin splits, and then creates basin lists within those subdirectories for each train/test split.
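As a rough orientation, the sketch below enumerates the relative file paths that the cells later in this notebook write. The function name `expected_files` and its arguments are hypothetical and exist only for illustration. The directory and file names are copied from the notebook cells, and the development list sizes (1, 3, 25) are the values used in the final cell.

```python
from pathlib import Path


def expected_files(
    datasets: list[str],
    n_splits: int,
    dev_sizes: tuple[int, ...] = (1, 3, 25),
) -> list[Path]:
    """Hypothetical helper: list the basin list files this notebook writes."""
    splits = ('train', 'test')
    # Gauged-Caravan: one train/test pair for all basins.
    files = [Path('gauged-caravan') / f'{s}.txt' for s in splits]
    for dataset in datasets:
        # Gauged-CAMELS and ungauged-CAMELS: one train/test pair per dataset.
        files += [Path('gauged-camels') / f'{dataset}_{s}.txt' for s in splits]
        files += [Path('ungauged-camels') / f'{dataset}_{s}.txt' for s in splits]
    for fold in range(1, n_splits + 1):
        # Ungauged random cross-validation: one train/test pair per fold.
        files += [Path('ungauged-random') / f'fold_{fold}_{s}.txt' for s in splits]
    # Development lists: one file per requested size.
    files += [Path('dev') / f'{n}_per_camels.txt' for n in dev_sizes]
    return files


layout = expected_files(['camels', 'hysets'], n_splits=3)
```

Comparing such a list against the files actually present under `BASIN_LIST_DIRECTORY` is one way to confirm that every cell ran.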

## Notebook Setup

### Install Modules

In [1]:
# !pip install zarr xarray
# !pip install fsspec
# !pip install gcsfs

### Import Modules

In [2]:
import os
from pathlib import Path
import random
from sklearn.model_selection import KFold
import xarray as xr

### Helper Function for Writing Basin Lists

In [3]:
def write_basin_list(basins: list[str], filename: Path):
    with open(filename, 'wt') as f:
        for basin in basins:
            f.write(basin + '\n')
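A matching reader is handy for checking what was written. The sketch below repeats the helper above so that it runs on its own and adds a hypothetical `read_basin_list` function, which is not part of the original notebook. The basin IDs are made up for the round trip.

```python
from pathlib import Path
import tempfile


def write_basin_list(basins: list[str], filename: Path) -> None:
    """Same behavior as the notebook helper: one basin ID per line."""
    with open(filename, 'wt') as f:
        for basin in basins:
            f.write(basin + '\n')


def read_basin_list(filename: Path) -> list[str]:
    """Hypothetical inverse of write_basin_list; blank lines are skipped."""
    with open(filename, 'rt') as f:
        return [line.strip() for line in f if line.strip()]


# Round trip through a temporary directory with made-up basin IDs.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'example.txt'
    example_basins = ['camels_01013500', 'hysets_01AD002']
    write_basin_list(example_basins, path)
    round_trip = read_basin_list(path)
```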

## User-Defined Variables

* `BASIN_LIST_DIRECTORY`: Base directory on the local filesystem where the basin lists are stored.
* `CARAVAN_DATA_DIRECTORY`: Directory on the local filesystem where the Caravan dataset is stored.
* `N_KFOLD_SPLITS`: Integer number of random k-fold cross validation splits.
* `RANDOM_SEED`: Allows the k-fold splits to be reproducible. Choose whatever you want. I used 42, as is tradition.

In [4]:
BASIN_LIST_DIRECTORY = '/home/gsnearing/ecmwf/basin_lists/'
CARAVAN_DATA_DIRECTORY = '/home/gsnearing/caravan-data/Caravan-nc/timeseries/netcdf'
N_KFOLD_SPLITS = 10
RANDOM_SEED = 42

In [5]:
# Turn the user-supplied path strings into pathlib.Path objects.
basin_list_dir = Path(BASIN_LIST_DIRECTORY)
caravan_data_dir = Path(CARAVAN_DATA_DIRECTORY)

## Basin List Directory
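Delete any existing `.txt` basin lists under the basin list directory so that stale lists from earlier runs do not persist. The split subdirectories themselves are created later with `mkdir(parents=True, exist_ok=True)`.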

In [6]:
def delete_txt_files(directory_path: Path):
    """
    Deletes all .txt files from the given directory and all its subdirectories to infinite depth.

    Args:
        directory_path (pathlib.Path): The path to the directory to start deleting from,
                                        provided as a pathlib.Path object.

    Returns:
        list: A list of full paths (as strings) to the .txt files that were successfully deleted.
    """
    deleted_files = []
    if not directory_path.is_dir():
        print(f"Error: Directory '{directory_path}' does not exist or is not a directory.")
        # Return an empty list so the return type matches the docstring.
        return deleted_files

    for root_str, _, filenames in os.walk(directory_path):
        root_path = Path(root_str)
        for filename in filenames:
            if filename.endswith(".txt"):
                full_path = root_path / filename
                try:
                    os.remove(full_path)
                    deleted_files.append(str(full_path))
                except OSError as e:
                    print(f"Error deleting {full_path}: {e}")
    return deleted_files

deleted = delete_txt_files(basin_list_dir)
print(f'Deleted {len(deleted)} existing basin list files.')

## Find Basins
This section of the notebook reads the Caravan dataset and the Caravan MultiMet dataset and finds the intersection of basins in both datasets. This intersection is used as the full basin list for all train/test splits.

### Caravan v1.6

In [7]:
caravan_basins = [str(f.stem) for f in caravan_data_dir.rglob('*.nc')]
print(f'There are {len(caravan_basins)} Caravan basins.')

# Sort the dataset prefixes so that later loops run in the same order on every run.
camels_datasets = sorted({basin.split('_')[0] for basin in caravan_basins})
print(f'There are {len(camels_datasets)} CAMELS datasets.')

There are 24352 Caravan basins.
There are 13 CAMELS datasets.


### MultiMet

In [8]:
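def open_zarr(path):
    # token='anon' requests anonymous access to the public bucket;
    # chunks=None opens the store without dask.
    return xr.open_zarr(
        store=path, chunks=None, storage_options=dict(token='anon')
    )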

In [None]:
products = [
    'CPC',
    'IMERG',
    'CHIRPS',
    'ERA5_LAND',
    'CHIRPS_GEFS',
    'HRES',
    'GRAPHCAST',
]
zarr_path_template = 'gs://caravan-multimet/v1.1/{}/timeseries.zarr/'

product_to_dataset = {
    product: open_zarr(zarr_path_template.format(product))
    for product in products
}

In [None]:
product_basins = {}
for product, ds in product_to_dataset.items():
    product_basins[product] = list(ds.basin.values)
    print(f'There are {len(product_basins[product])} basins for {product}.')

reference_product = 'GRAPHCAST'
multimet_basins = product_basins[reference_product]
print(f'\nThere are {len(multimet_basins)} basins in the reference product ({reference_product}).\n\n')

### Intersection

In [None]:
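# Sort the intersection: set iteration order is not stable across runs,
# and the seeded k-fold and development splits below depend on list order.
intersection_basins = sorted(set(multimet_basins).intersection(set(caravan_basins)))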
print(f'There are {len(intersection_basins)} basins that appear in both sets.')
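The order of the intersection matters for reproducibility. Python sets have no guaranteed iteration order, and string hashing is randomized per interpreter process unless `PYTHONHASHSEED` is set, so a list built directly from a set can come out in a different order on each run. A seeded split over such a list would then differ between runs. The sketch below uses made-up basin IDs to show a sorted intersection, plus a set difference for spotting basins that are missing from one source.

```python
# Made-up basin IDs standing in for the two sources.
caravan_example = ['camels_01', 'hysets_02', 'lamah_03']
multimet_example = ['lamah_03', 'camels_01', 'camelsgb_04']

# Sorting gives the same list on every run, regardless of hash seed.
intersection_example = sorted(set(multimet_example) & set(caravan_example))

# Basins present in one source but not the other.
only_in_caravan = sorted(set(caravan_example) - set(multimet_example))
only_in_multimet = sorted(set(multimet_example) - set(caravan_example))
```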

## Construct Basin Lists

### Gauged-Caravan

In [None]:
gauged_caravan_directory = basin_list_dir / 'gauged-caravan'
gauged_caravan_directory.mkdir(parents=True, exist_ok=True)

In [None]:
# In the gauged setting, train and test basin lists are the same.
write_basin_list(intersection_basins, gauged_caravan_directory / 'train.txt')
write_basin_list(intersection_basins, gauged_caravan_directory / 'test.txt')

### Gauged-CAMELS

In [None]:
gauged_camels_directory = basin_list_dir / 'gauged-camels'
gauged_camels_directory.mkdir(parents=True, exist_ok=True)

In [None]:
for camels in camels_datasets:
    # Match on the prefix plus underscore so that 'camels' does not also match 'camelsgb', etc.
    camels_basins = [b for b in intersection_basins if b.startswith(f'{camels}_')]
    write_basin_list(camels_basins, gauged_camels_directory / f'{camels}_train.txt')
    write_basin_list(camels_basins, gauged_camels_directory / f'{camels}_test.txt')
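Selecting basins by dataset prefix is sensitive to the trailing underscore, because some dataset names are prefixes of others. The sketch below uses made-up IDs in the `<dataset>_<gauge id>` form implied by the `split('_')[0]` call earlier. It compares a loose match with a strict one and shows an exact-match alternative.

```python
example_basins = [
    'camels_01013500',
    'camelsaus_102101A',
    'camelsgb_1001',
    'hysets_01AD002',
]

# Loose match: every ID that merely begins with the letters 'camels'.
loose = [b for b in example_basins if b.startswith('camels')]

# Strict match: only IDs whose dataset prefix is exactly 'camels'.
strict = [b for b in example_basins if b.startswith('camels_')]

# Equivalent strict match by comparing the parsed prefix.
by_prefix = [b for b in example_basins if b.split('_')[0] == 'camels']
```

Here `loose` contains three basins from three different datasets, while `strict` and `by_prefix` each contain only `camels_01013500`.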

### Ungauged Random Cross Validation

In [None]:
ungauged_random_directory = basin_list_dir / 'ungauged-random'
ungauged_random_directory.mkdir(parents=True, exist_ok=True)

In [None]:
kf = KFold(n_splits=N_KFOLD_SPLITS, shuffle=True, random_state=RANDOM_SEED)
for i, (train_index, test_index) in enumerate(kf.split(intersection_basins)):
    train_basins = [intersection_basins[idx] for idx in train_index]
    test_basins = [intersection_basins[idx] for idx in test_index]
    write_basin_list(train_basins, ungauged_random_directory / f'fold_{i+1}_train.txt')
    write_basin_list(test_basins, ungauged_random_directory / f'fold_{i+1}_test.txt')
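A quick consistency check on the written folds can catch mistakes before training. The sketch below is a hypothetical check, not part of the original workflow. It assumes the `fold_{i}_train.txt` / `fold_{i}_test.txt` naming used above and verifies that each fold's train and test lists are disjoint, that test lists do not overlap across folds, and that the test lists together cover every basin. It is demonstrated on made-up files in a temporary directory.

```python
from pathlib import Path
import tempfile


def read_lines(path: Path) -> set[str]:
    """Read a basin list file into a set of non-empty lines."""
    return {line.strip() for line in path.read_text().splitlines() if line.strip()}


def check_folds(directory: Path, n_splits: int) -> bool:
    """Hypothetical check that the fold files form a valid k-fold partition."""
    seen_test: set[str] = set()
    all_basins = None
    for i in range(1, n_splits + 1):
        train = read_lines(directory / f'fold_{i}_train.txt')
        test = read_lines(directory / f'fold_{i}_test.txt')
        if train & test:  # a basin is in both train and test
            return False
        if seen_test & test:  # a basin is tested in more than one fold
            return False
        if all_basins is None:
            all_basins = train | test
        elif all_basins != train | test:  # folds disagree on the basin set
            return False
        seen_test |= test
    return seen_test == all_basins  # every basin is tested exactly once


with tempfile.TemporaryDirectory() as tmp:
    d = Path(tmp)
    # A valid two-fold split of four made-up basins.
    (d / 'fold_1_train.txt').write_text('c\nd\n')
    (d / 'fold_1_test.txt').write_text('a\nb\n')
    (d / 'fold_2_train.txt').write_text('a\nb\n')
    (d / 'fold_2_test.txt').write_text('c\nd\n')
    folds_ok = check_folds(d, 2)

    # Break the second fold so that basin 'a' leaks into its test list.
    (d / 'fold_2_test.txt').write_text('a\nd\n')
    folds_broken = check_folds(d, 2)
```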

### Ungauged CAMELS Cross Validation

In [None]:
ungauged_camels_directory = basin_list_dir / 'ungauged-camels'
ungauged_camels_directory.mkdir(parents=True, exist_ok=True)

In [None]:
for camels in camels_datasets:
    uncamels_basins = [b for b in intersection_basins if not b.startswith(f'{camels}_')]
    camels_basins = [b for b in intersection_basins if b.startswith(f'{camels}_')]
    write_basin_list(uncamels_basins, ungauged_camels_directory / f'{camels}_train.txt')
    write_basin_list(camels_basins, ungauged_camels_directory / f'{camels}_test.txt')

### Development Lists

In [None]:
development_directory = basin_list_dir / 'dev'
development_directory.mkdir(parents=True, exist_ok=True)

In [None]:
random.seed(RANDOM_SEED)
for dev_basins_per_camels_dataset in [1, 3, 25]:
    dev_caravan_basin_list_file = development_directory / f'{dev_basins_per_camels_dataset}_per_camels.txt'
    basins_per_camels = []
    for camels in camels_datasets:
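        # Match on the prefix plus underscore so that 'camels' does not also match 'camelsgb', etc.
        camels_basins = [b for b in intersection_basins if b.startswith(f'{camels}_')]
        if not camels_basins:
            print(f'No samples for {camels} ... skipping.')
            continue
        # random.sample raises ValueError if asked for more items than exist,
        # so cap the sample size at the number of available basins.
        n_samples = min(dev_basins_per_camels_dataset, len(camels_basins))
        selected_basins = random.sample(camels_basins, n_samples)
        basins_per_camels.extend(selected_basins)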
    write_basin_list(basins_per_camels, dev_caravan_basin_list_file)
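Datasets with fewer basins than the requested development size need care, because `random.sample` raises `ValueError` when the sample size exceeds the population size. The sketch below shows one way to handle this with a hypothetical helper, `sample_up_to`, that caps the sample size. It uses a local `random.Random` instance with made-up basin IDs so that it does not disturb the global random state.

```python
import random


def sample_up_to(population: list[str], k: int, rng: random.Random) -> list[str]:
    """Hypothetical helper: sample k items, or all of them if fewer exist."""
    return rng.sample(population, min(k, len(population)))


rng = random.Random(42)  # local generator; the seed value is arbitrary
small_dataset = ['lamah_1', 'lamah_2']
large_dataset = [f'hysets_{i}' for i in range(100)]

small_sample = sample_up_to(small_dataset, 25, rng)  # returns both basins
large_sample = sample_up_to(large_dataset, 25, rng)  # returns 25 distinct basins
```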