# Tutorial: **HydroDL LSTM**

---

This notebook is a faithful implementation of the original [HydroDL](https://github.com/mhpi/hydroDL) LSTM model developed by [Dapeng Feng et al. (2020)](https://doi.org/10.1029/2019WR026793), and demonstrates both training and forward simulation in δMG. A pre-trained model is provided for those who only wish to run the model forward.

For an explanation of model structure, methodologies, data, and performance metrics, please refer to Feng's publications [below](#publications). If you find this code useful in your own work, please include the aforementioned citation.

**Note**: If you are new to the δMG framework, we suggest first looking at our [δHBV 1.0 tutorial](./../hydrology/example_dhbv.ipynb).

<br>

### High vs. Low Flow Experts

The HydroDL LSTM comes in two flavors of data processing: one intended to maximize low-flow performance, and one to maximize high-flow performance. By default, the [LSTM config](./../conf/config_lstm.yaml) is set to reproduce the high-flow expert. The differences, along with changes that must be made to the config, are as follows:

1. **Low-Flow Expert**: Precipitation model input and runoff target data are normalized with a log-Gaussian transform, $$v_{norm} = \frac{1}{\sigma} \left(\log\left(\sqrt{var + 0.1}\right) - \mu_v\right),$$ and an RMSE loss function is used in training; in the config,
    - `train -> loss_function -> name: RmseLoss`
    - `model -> flow_regime: low`

2. **High-Flow Expert (Default)**: All model input and target data get a Gaussian normalization, and an NSE loss function is used in training; in the config,
    - `train -> loss_function -> name: NseBatchLoss`
    - `model -> flow_regime: high`
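
The two normalizations above can be sketched in NumPy as follows. This is a minimal illustration, not the δMG implementation: the function names are hypothetical, the base-10 logarithm is an assumption (the formula does not state the base), and computing the mean and standard deviation from the transformed values is also an assumption.

```python
import numpy as np


def gaussian_norm(v):
    """Standardize v to zero mean and unit variance, ignoring NaNs."""
    mu, sigma = np.nanmean(v), np.nanstd(v)
    return (v - mu) / sigma, (mu, sigma)


def log_gaussian_norm(v, eps=0.1):
    """Apply log(sqrt(v + eps)), then standardize the transformed values."""
    t = np.log10(np.sqrt(v + eps))  # Assumption: base-10 logarithm.
    mu, sigma = np.nanmean(t), np.nanstd(t)
    return (t - mu) / sigma, (mu, sigma)


def log_gaussian_denorm(v_norm, stats, eps=0.1):
    """Invert log_gaussian_norm to recover values in the original units."""
    mu, sigma = stats
    return (10.0 ** (v_norm * sigma + mu)) ** 2 - eps


precip = np.array([0.0, 0.2, 1.5, 12.0, 48.0])  # Made-up daily values (mm/day).
z_high, _ = gaussian_norm(precip)
z_low, stats = log_gaussian_norm(precip)
print(z_high.round(2))
print(z_low.round(2))
```

The log transform compresses large values and spreads out small ones, which is why it suits the low-flow expert: differences between small flows carry more weight after normalization.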


<br>

### Before Running:
- **Environment**: See [setup.md](./../../docs/setup.md) for ENV setup. δMG must be installed to run this notebook.

- **Model**: Download pretrained LSTM model weights from [AWS](https://mhpi-spatial.s3.us-east-2.amazonaws.com/mhpi-release/models/1-lstm_trained.zip). Then update the model config:

    - In [config_lstm.yaml](./../conf/config_lstm.yaml), update `model_dir` with your path to the parent directory containing both trained model weights `cudnnlstmmodel_ep300.pt` **and** normalization file `normalization_statistics.json`.
    - **Note**: Make sure this path ends with a trailing forward slash, e.g., `./your/path/to/model/`.

- **Data**: Download the CAMELS data extraction from [AWS](https://mhpi-spatial.s3.us-east-2.amazonaws.com/mhpi-release/data/1-camels.zip). Then, update the data configs:

    - In [camels_531.yaml](./../conf/observations/camels_531.yaml) and [camels_671.yaml](./../conf/observations/camels_671.yaml), update...
        1. `data_path` with path to `camels_daymetv2`,
        2. `gage_info` with path to `gage_id.npy`,
        3. `subset_path` with path to `531sub_id.txt` (camels_531 only).

    - The full 671-basin or 531-basin CAMELS datasets can be selected by setting `observations: camels_671` or `camels_531` in the model config, respectively.

- **Hardware**: The HydroDL LSTM requires CUDA support only available with Nvidia GPUs. For those without access, T4 GPUs can be used when running this notebook with δMG on [Google Colab](https://colab.research.google.com/).
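
Before running the cells below, it can help to confirm that the model directory is set up as described. The sketch below uses only the Python standard library; `check_model_dir` is a hypothetical helper written for this tutorial, not part of δMG, and the file names are the ones listed in the setup steps above.

```python
from pathlib import Path

# File names taken from the setup steps above.
REQUIRED_FILES = ('cudnnlstmmodel_ep300.pt', 'normalization_statistics.json')


def check_model_dir(model_dir: str) -> list[str]:
    """Return a list of problems found with a `model_dir` setting."""
    problems = []
    if not model_dir.endswith('/'):
        problems.append("model_dir should end with a trailing '/'")
    for name in REQUIRED_FILES:
        if not (Path(model_dir) / name).is_file():
            problems.append(f"missing file: {name}")
    return problems


# Placeholder path; replace with your own `model_dir`.
for problem in check_model_dir('./your/path/to/model/'):
    print(problem)
```

An empty list means the directory passes these checks. GPU availability can be checked separately with PyTorch's `torch.cuda.is_available()`.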

<br>

### Publications:

*Dapeng Feng, Kathryn Lawson, Chaopeng Shen. "Mitigating prediction error of deep learning streamflow models in large data-sparse regions with ensemble modeling and soft data." Geophysical Research Letters (2021). https://doi.org/10.1029/2021GL092999.*

*Dapeng Feng, Kuai Fang, Chaopeng Shen. "Enhancing Streamflow Forecast and Extracting Insights Using Long-Short Term Memory Networks With Data Integration at Continental Scales." Water Resources Research (2020). https://doi.org/10.1029/2019WR026793.*

<br>

### Issues:
For questions, concerns, bugs, etc., please reach out by posting an [issue](https://github.com/mhpi/generic_deltamodel/issues).

---


<br>

## 1. Forward LSTM

After completing [these](#before-running) steps, forward the LSTM with the code block below.

**Note**
- The settings defined in the config [config_lstm.yaml](./../conf/config_lstm.yaml) are set to replicate benchmark performance on 531 CAMELS basins.
- While published results are an average of 6 models using different random seeds, we only use one model and seed here for demonstration.
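
For readers who train several seeds, one way to combine them is an ensemble mean of the simulated runoff. The sketch below assumes the averaging is applied to predictions (the note above does not say whether predictions or metrics are averaged) and uses made-up random arrays in place of real model output.

```python
import numpy as np

# Made-up stand-in for predictions from 6 seeds, shaped [seed, day, basin].
rng = np.random.default_rng(0)
runoff_by_seed = rng.gamma(shape=2.0, scale=1.5, size=(6, 10, 3))

# Averaging over the seed axis gives a single [day, basin] array.
runoff_ensemble = runoff_by_seed.mean(axis=0)
print(runoff_ensemble.shape)
```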

### 1.1 Demonstration

In [None]:
import sys

sys.path.append('../../')

from dmg import ModelHandler
from dmg.core.utils import import_data_loader, print_config, set_randomseed
from example import load_config

# ------------------------------------------#
# Define model settings here.
CONFIG_PATH = '../example/conf/config_lstm.yaml'
# ------------------------------------------#


# 1. Load configuration dictionary of model parameters and options.
config = load_config(CONFIG_PATH)
config['mode'] = 'sim'
print_config(config)

# Set random seed for reproducibility.
set_randomseed(config['seed'])

# 2. Initialize the LSTM.
model = ModelHandler(config, verbose=True)

# 3. Load and initialize a dataset dictionary of normalized NN model inputs.
data_loader_cls = import_data_loader(config['data_loader'])
data_loader = data_loader_cls(config, test_split=True, overwrite=False)

# 4. Forward the model to get the predictions.
output = model(
    data_loader.eval_dataset,
    eval=True,
)

# Denormalize the runoff predictions.
runoff = data_loader.from_norm(
    output['CudnnLstmModel']['runoff'].cpu().detach().numpy(),
    vars='runoff',
)

print("-------------\n")
print(
    f"Runoff predictions (mm/day) for {runoff.shape[0]} days and "
    f"{runoff.shape[1]} basins ~ \nShowing the first 5 days for "
    f"first basin: \n {runoff[:5, :1]}"
)

### 1.2 Visualizing Model Predictions

After running model inference we can, e.g., view the runoff hydrograph for one of the basins to see we are getting expected outputs.

In [None]:
import numpy as np

from dmg.core.data import txt_to_array
from dmg.core.post import plot_hydrograph
from dmg.core.utils import Dates

# ------------------------------------------#
# Choose a basin by USGS gage ID to plot.
GAGE_ID = 1022500
TARGET = 'runoff'

# Resample predictions to a 3-day interval for plotting. Other options include 'D', 'W', 'M', 'Y'.
RESAMPLE = '3D'

# Set the paths to the gage ID lists...
GAGE_ID_PATH = config['observations']['gage_info']  # ./gage_id.npy
GAGE_ID_531_PATH = config['observations']['subset_path']  # ./531sub_id.txt
# ------------------------------------------#


# 1. Get the denormalized runoff predictions and daily timesteps of the prediction window.
pred = data_loader.from_norm(
    output['CudnnLstmModel'][TARGET].cpu().detach().numpy(),
    vars=TARGET,
)
timesteps = Dates(config['sim'], config['model']['rho']).batch_daily_time_range

# Remove the warm-up period so the timesteps match the model output.
timesteps = timesteps[config['model']['warm_up'] :]


# 2. Load the gage ID lists and get the basin index.
gage_ids = np.load(GAGE_ID_PATH, allow_pickle=True)
gage_ids_531 = txt_to_array(GAGE_ID_531_PATH)

print(f"First 20 available gage IDs: \n {gage_ids[:20]} \n")
print(f"First 20 available gage IDs (531 subset): \n {gage_ids_531[:20]} \n")

if config['observations']['name'] == 'camels_671':
    if GAGE_ID in gage_ids:
        basin_idx = list(gage_ids).index(GAGE_ID)
    else:
        raise ValueError(
            f"Basin with gage ID {GAGE_ID} not found in the CAMELS 671 dataset."
        )

elif config['observations']['name'] == 'camels_531':
    if GAGE_ID in gage_ids_531:
        basin_idx = list(gage_ids_531).index(GAGE_ID)
    else:
        raise ValueError(
            f"Basin with gage ID {GAGE_ID} not found in the CAMELS 531 dataset."
        )
else:
    raise ValueError(
        f"Observation data supported: 'camels_671' or 'camels_531'. Got: {config['observations']['name']}"
    )


# 3. Get the data for the chosen basin and plot.
runoff_pred_basin = pred[:, basin_idx].squeeze()

plot_hydrograph(
    timesteps,
    runoff_pred_basin,
    resample=RESAMPLE,
    title=f"Hydrograph for Gage {GAGE_ID}",
    ylabel='Runoff (mm/day)',
)
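
To make the effect of `RESAMPLE = '3D'` concrete, the sketch below averages a daily series over consecutive 3-day windows. This is an illustration only: whether `plot_hydrograph` averages or sums each window, and how it treats an incomplete final window, is not stated here, so the mean and the dropped tail are assumptions. `block_mean` is a hypothetical helper.

```python
import numpy as np


def block_mean(values, window=3):
    """Average consecutive, non-overlapping windows; drop an incomplete tail."""
    values = np.asarray(values, dtype=float)
    n_full = (len(values) // window) * window
    return values[:n_full].reshape(-1, window).mean(axis=1)


daily = np.array([1.0, 2.0, 3.0, 4.0, 6.0, 8.0, 10.0])  # Made-up runoff (mm/day).
print(block_mean(daily, window=3))  # → [2. 6.]
```

Coarser windows smooth out individual storm peaks, which makes multi-year hydrographs easier to read at the cost of hiding daily extremes.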

<br>

## 2. Train the HydroDL LSTM

After completing [these](#before-running) steps, train an LSTM with the code block below.

**Note**
- The settings defined in the config [config_lstm.yaml](./../conf/config_lstm.yaml) are set to replicate benchmark performance.
- For model training, set `mode: train` in the config, or modify after config dict has been created (see below).
- `./output/` directory will be generated to store experiment and model files. This location can be adjusted by changing the `output_dir` key in your config. 
    - If you have set `model_dir` in your config, model save files will be stored there.
- Default settings with 300 epochs, batch size 100, and training window from 1 October 1999 to 30 September 2008 should use about 3.3 GB of VRAM. Expect training times of about 25 minutes on an NVIDIA A100.

In [None]:
import sys

sys.path.append('../../')

from dmg import ModelHandler
from dmg.core.utils import (
    import_data_loader,
    import_trainer,
    print_config,
    set_randomseed,
)
from example import load_config

# ------------------------------------------#
# Define model settings here.
CONFIG_PATH = '../example/conf/config_lstm.yaml'
# ------------------------------------------#


# 1. Load configuration dictionary of model parameters and options.
config = load_config(CONFIG_PATH)
config['mode'] = 'train'
print_config(config)

# Set random seed for reproducibility.
set_randomseed(config['seed'])

# 2. Initialize the LSTM with a model handler.
model = ModelHandler(config, verbose=True)

# 3. Load and initialize a dataset dictionary of NN model inputs.
data_loader_cls = import_data_loader(config['data_loader'])
data_loader = data_loader_cls(config, test_split=True, overwrite=False)


# 4. Initialize trainer to handle model training.
trainer_cls = import_trainer(config['trainer'])
trainer = trainer_cls(
    config,
    model,
    train_dataset=data_loader.train_dataset,
)

# 5. Start model training.
trainer.train()
print(f"Training complete. Model saved to \n{config['model_dir']}")

## 3. Evaluate Model Performance

After completing the training in [Section 2](#2-train-the-hydrodl-lstm), or with the trained model provided, test the LSTM below on evaluation data.

**Note**
- For model evaluation, set `mode: test` in the config, or modify after config dict has been created (see below).
- When evaluating provided models, confirm that `test.test_epoch` in the config corresponds to the number of training epochs completed for the model you want to test (e.g., 300).
- Default settings with batch size 15 and testing window from 1 October 1989 to 30 September 1999 should use about 2.3 GB of VRAM. Expect evaluation times of about 5 seconds on an NVIDIA A100.

### 3.1 Runoff Simulation

In [None]:
import sys

sys.path.append('../../')

from dmg import ModelHandler
from dmg.core.utils import (
    import_data_loader,
    import_trainer,
    print_config,
    set_randomseed,
)
from example import load_config

# ------------------------------------------#
# Define model settings here.
CONFIG_PATH = '../example/conf/config_lstm.yaml'
# ------------------------------------------#


# 1. Load configuration dictionary of model parameters and options.
config = load_config(CONFIG_PATH)
config['mode'] = 'test'
print_config(config)

set_randomseed(config['seed'])

# 2. Initialize the LSTM with a model handler.
model = ModelHandler(config, verbose=True)

# 3. Load and initialize a dataset dictionary of NN model inputs.
data_loader_cls = import_data_loader(config['data_loader'])
data_loader = data_loader_cls(config, test_split=True, overwrite=False)

# 4. Initialize trainer to handle model evaluation.
trainer_cls = import_trainer(config['trainer'])
trainer = trainer_cls(
    config,
    model,
    eval_dataset=data_loader.eval_dataset,
    verbose=True,
)

# 5. Start testing the model.
print('Evaluating model...')
trainer.evaluate()
print(f"Metrics and predictions saved to \n{config['sim_dir']}")

### 3.2 Visualize Trained Model Performance

Once the model has been evaluated, a new directory `sim/` will be created in your *output_dir* (default `./output/`). This path will be populated with...

1. Predicted runoff (`runoff.npy`),

2. Runoff observation data for comparison against model predictions (`runoff_obs.npy`).

Your output directory will also be populated with files containing individual basin and basin-aggregated metrics...
1. `metrics.json`, containing evaluation metrics across the test time range for each gage in the dataset,

2. `metrics_agg.json`, containing evaluation metrics aggregated across all sites (mean, median, standard deviation).

We can use these outputs to visualize the LSTM's performance with a 
1. Cumulative distribution function (CDF) plot, 

2. CONUS map of gage locations and metric (e.g., NSE) performance.
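
As a rough picture of how per-basin and aggregated metrics relate, the sketch below computes NSE and RMSE per basin from made-up arrays and then aggregates them. The helper functions and the dictionary layout are assumptions for illustration and may differ from what δMG writes to `metrics.json` and `metrics_agg.json`.

```python
import numpy as np


def nse(sim, obs):
    """Nash-Sutcliffe efficiency for one basin, skipping missing observations."""
    mask = ~np.isnan(obs)
    sim, obs = sim[mask], obs[mask]
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)


def rmse(sim, obs):
    """Root-mean-square error for one basin, skipping missing observations."""
    mask = ~np.isnan(obs)
    return float(np.sqrt(np.mean((sim[mask] - obs[mask]) ** 2)))


# Made-up arrays shaped [day, basin].
obs = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, np.nan], [4.0, 8.0]])
sim = np.array([[1.0, 2.5], [2.0, 3.5], [3.0, 6.0], [4.0, 7.0]])

n_basins = obs.shape[1]
per_basin = {
    'nse': [float(nse(sim[:, i], obs[:, i])) for i in range(n_basins)],
    'rmse': [rmse(sim[:, i], obs[:, i]) for i in range(n_basins)],
}
aggregated = {
    name: {
        'mean': float(np.mean(values)),
        'median': float(np.median(values)),
        'std': float(np.std(values)),
    }
    for name, values in per_basin.items()
}
print(per_basin)
print(aggregated)
```

An NSE of 1 means a perfect match, while 0 means the model does no better than predicting the observed mean.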

<br>

But first, let's check the (basin-)aggregated metrics for NSE, KGE, bias, RMSE, and, for both high/low flow regimes, RMSE and absolute percent bias...

In [None]:
import os

from dmg.core.data import load_json
from dmg.core.post import print_metrics


print(f"Evaluation output files: {config['output_dir']} \n")

# 1. Load the basin-aggregated evaluation results.
metrics_path = os.path.join(config['output_dir'], 'metrics_agg.json')
metrics = load_json(metrics_path)
print(f"Available metrics: {metrics.keys()} \n")

# 2. Print the evaluation results.
metric_names = [
    # Choose metrics to show.
    'nse',
    'kge',
    'bias',
    'rmse',
    'rmse_low',
    'rmse_high',
    'flv_abs',
    'fhv_abs',
]
print_metrics(metrics, metric_names, mode='median', precision=3)

### 3.3 CDF Plot

The cumulative distribution function (CDF) plot shows, for each metric value on the x-axis, the fraction of basins (y-axis) that scored at or below that value on the evaluation data. For a metric like NSE, where higher is better, a curve shifted further to the right indicates better performance across basins.

An example is given below for NSE. You can change this to your preferred metric (see the output from the previous cell), but note that some metrics may require changing *xbounds* in `plot_cdf()`.

In [None]:
from dmg.core.post import plot_cdf

# ------------------------------------------#
# Choose the metric to plot. (See available metrics printed above, or in the metrics_agg.json file).
METRIC = 'nse'
# ------------------------------------------#


# 1. Load the evaluation metrics.
metrics_path = os.path.join(config['output_dir'], 'metrics.json')
metrics = load_json(metrics_path)

# 2. Plot the CDF for the chosen metric.
plot_cdf(
    metrics=[metrics],
    metric_names=[METRIC],
    model_labels=['LSTM'],
    title=f"CDF of {METRIC.upper()} for LSTM",
    xlabel=METRIC.upper(),
    figsize=(8, 6),
    xbounds=(0, 1),
    ybounds=(0, 1),
    show_arrow=True,
)
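
For intuition about what the plot shows, an empirical CDF can be computed by hand. The sketch below is independent of `plot_cdf()` and uses made-up NSE values; `empirical_cdf` is a hypothetical helper.

```python
import numpy as np


def empirical_cdf(values):
    """Return sorted finite values and the fraction of samples <= each value."""
    x = np.sort(np.asarray(values, dtype=float))
    x = x[~np.isnan(x)]
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y


nse_values = [0.82, 0.45, 0.91, 0.67, 0.74]  # Made-up per-basin NSE values.
x, y = empirical_cdf(nse_values)
print(x)
print(y)

# Fraction of basins with NSE at or below 0.7:
print(np.mean(np.asarray(nse_values) <= 0.7))  # → 0.4
```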

### 3.4 Spatial Plot

This plot shows the location of each basin in the evaluation data, color-coded by performance on a metric. Here we give a plot for NSE, but as before, this can be changed to your preference. (See above; for metrics not valued between 0 and 1, you will need to set `dynamic_colorbar=True` in `geoplot_single_metric` to ensure proper color coding.)

Note, you will need to add paths to the CAMELS shapefile, gage IDs, and 531-gage subset which can be found in the [CAMELS download](#before-running).

In [None]:
import geopandas as gpd
import numpy as np
import pandas as pd

from dmg.core.data import txt_to_array
from dmg.core.post import geoplot_single_metric

# ------------------------------------------#
# Choose the metric to plot. (See available metrics printed above, or in the metrics_agg.json file).
METRIC = 'nse'

# Set the paths to the gage id lists and shapefiles...
GAGE_ID_PATH = config['observations']['gage_info']  # ./gage_id.npy
GAGE_ID_531_PATH = config['observations']['subset_path']  # ./531sub_id.txt
SHAPEFILE_PATH = './your/path/to/camels/loc/camels671.shp'
# ------------------------------------------#


# 1. Load gage ids + basin shapefile with geocoordinates (lat, long) for every gage.
gage_ids = np.load(GAGE_ID_PATH, allow_pickle=True)
gage_ids_531 = txt_to_array(GAGE_ID_531_PATH)
coords = gpd.read_file(SHAPEFILE_PATH)

# 2. Format geocoords for 531- and 671-basin CAMELS sets.
coords_531 = coords[coords['gage_id'].isin(list(gage_ids_531))].copy()

coords['gage_id'] = pd.Categorical(
    coords['gage_id'], categories=list(gage_ids), ordered=True
)
coords_531['gage_id'] = pd.Categorical(
    coords_531['gage_id'], categories=list(gage_ids_531), ordered=True
)

coords = coords.sort_values('gage_id')  # Sort to match order of metrics.
coords_531 = coords_531.sort_values('gage_id')

# 3. Load the evaluation metrics.
metrics_path = os.path.join(config['output_dir'], 'metrics.json')
metrics = load_json(metrics_path)

# 4. Add the evaluation metrics to the basin shapefile.
if config['observations']['name'] == 'camels_671':
    coords[METRIC] = metrics[METRIC]
    full_data = coords
elif config['observations']['name'] == 'camels_531':
    coords_531[METRIC] = metrics[METRIC]
    full_data = coords_531
else:
    raise ValueError(
        f"Observation data supported: 'camels_671' or 'camels_531'. Got: {config['observations']['name']}"
    )

# 5. Plot the evaluation results spatially.
geoplot_single_metric(
    full_data,
    METRIC,
    rf"Spatial Map of {METRIC.upper()} for LSTM on CAMELS "
    f"{config['observations']['name'].split('_')[-1]}",
    dynamic_colorbar=False,
)
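
The sorting step in the cell above matters because metric values are attached to shapefile rows by position. The plain-Python sketch below illustrates the same idea with made-up records: rows are reordered to follow the gage ID list before the metric values are assigned. The coordinates and NSE values are invented for illustration.

```python
# Made-up records standing in for shapefile rows (arbitrary file order).
rows = [
    {'gage_id': 1030500, 'lat': 45.5},
    {'gage_id': 1013500, 'lat': 47.2},
    {'gage_id': 1022500, 'lat': 44.6},
]
gage_ids = [1013500, 1022500, 1030500]  # Order used by the metrics.
nse_values = [0.71, 0.80, 0.64]  # Made-up metric values, same order.

# Sort rows by each gage's position in `gage_ids`, then attach metrics by position.
position = {gid: i for i, gid in enumerate(gage_ids)}
rows_sorted = sorted(rows, key=lambda r: position[r['gage_id']])
for row, value in zip(rows_sorted, nse_values):
    row['nse'] = value

print([(r['gage_id'], r['nse']) for r in rows_sorted])
```

Without the sort, the first metric value would be attached to whichever gage happens to come first in the shapefile, and the map would show values at the wrong locations.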