# HyTEST hydrologic model benchmark assessment: standard metric analyses

**Timothy Hodson and Rich Signell**

## About
This notebook demonstrates a computational workflow for benchmarking daily streamflow simulated from the National Water Model Retrospective version 2.1, and is meant to provide an adaptable template for benchmarking other Earth-system models and datasets. 

The HyTEST benchmark workflow consists of three components:
1. a set of model predictions and observations, or evaluation data, to compare.
1. the spatiotemporal domain over which to compute benchmark results.
1. a set of statistical metrics with which to generate benchmark results. In this notebook, we focus on the standard statistical metric suite version 1.0, sometimes referred to as the traditional statistical metrics.

The workflow loads the model predictions and observations, subsets the data to the specified domain of the benchmark, and finally calculates metrics for the given data over that given domain. Any benchmark result is fully reproducible, given a workflow notebook and the correct versions of each of these three components.

In practice, the datasets may be too large to fit in memory or to transfer, so this notebook will demonstrate several open-source Python libraries for 'moving the computations to the data.' Some of these tools are relatively new, but they are quickly becoming standards within the Earth-science community.

The notebook is organized into a series of helper functions that handle tasks like loading data, configuring compute resources, and computing metrics over a chunk of data. Once these are defined, the analysis can be run in a few lines of code. The output generated from this notebook can serve as the starting point for a separate workflow notebook dedicated to visualization.

## 0. Setup
### 0.0. Load libraries
Before beginning, ensure that the following Python libraries are installed, then load them.

In [None]:
from dask_jobqueue import SLURMCluster
from dask.distributed import Client, LocalCluster
import dask.bag as db

import xarray as xr
import numpy as np
import pandas as pd
import intake
import dask
import os

### 0.1. Configure cluster
The notebook shows example configurations for computing resources supported by USGS (Denali, Tallgrass, and the cloud), plus a configuration for running on a local machine.

First, select the computing resource on which to run your analysis. Options include:
1) denali
2) tallgrass
3) local
4) esip-qhub-gateway-v0.4

In [None]:
resource = 'tallgrass' #denali, tallgrass, local, esip-qhub-gateway-v0.4

How to configure the cluster will vary among these resources, so we've created a helper function to take care of that.

In [None]:
def configure_cluster(resource):
    ''' Helper function to configure cluster
    '''
    if resource == 'denali':
        cluster = LocalCluster(threads_per_worker=1)
        client = Client(cluster)
    
    elif resource == 'tallgrass':
        project = os.environ['SLURM_JOB_ACCOUNT']
        
        cluster = SLURMCluster(processes=1,cores=1, 
            memory='10GB', interface='ib0',
            project=project, walltime='01:00:00',      
            job_extra=['--hint=multithread'])  # job_extra takes a list of SLURM option strings
        cluster.scale(10)
        client = Client(cluster)
        
    elif resource == 'local':
        import warnings
        warnings.warn("Running locally can result in costly data transfers!\n")
        n_cores = os.cpu_count() # set to match your machine
        cluster = LocalCluster(threads_per_worker=n_cores)
        client = Client(cluster)
        
    elif resource in ['esip-qhub-gateway-v0.4']:   
        import sys
        sys.path.append(os.path.join(os.environ['HOME'],'shared','users','lib'))
        import ebdpy as ebd
        ebd.set_credentials(profile='esip-qhub')

        aws_profile = 'esip-qhub'
        aws_region = 'us-west-2'
        endpoint = f's3.{aws_region}.amazonaws.com'
        ebd.set_credentials(profile=aws_profile, region=aws_region, endpoint=endpoint)
        worker_max = 30
        client,cluster = ebd.start_dask_cluster(profile=aws_profile, worker_max=worker_max, 
                                              region=aws_region, use_existing_cluster=True,
                                              adaptive_scaling=False, wait_for_cluster=False, 
                                              worker_profile='Medium Worker', propagate_env=True)
        
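    else:
        # fail early instead of hitting an undefined `client` below
        raise ValueError(f"Unknown resource: {resource!r}")

    return client, cluster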

## 1. Define performance benchmark
A performance benchmark consists of three components: (1) a set of predictions and observations, (2) the domain over which to benchmark, and (3) a set of statistical metrics with which to produce benchmark results. The basic workflow is to load the predictions and observations, subset them to the domain of the benchmark, then calculate metrics on the data over that domain.

### 1.0 Load data
Let's begin by introducing [Intake](https://github.com/intake/intake), which is a set of tools for loading and sharing data in data science projects. Data from this project are stored within an Intake catalog. We can inspect that catalog with the following lines.

In [None]:
url = 'https://raw.githubusercontent.com/USGS-python/hytest-catalogs/main/hytest_intake_catalog.yml'
cat = intake.open_catalog(url)
print(list(cat))

Above, you should see a list of catalog entries for model output, observational data, and location descriptions.

- Entries ending in "-onprem" point to data located on HPC resources, entries ending in "-cloud" point to data housed in the cloud, and entries ending in "-esip" point to data housed in qhub.
- Entries with 'conus404' in the name pertain to the CONUS404 dataset.
- Entries with 'nwm21' pertain to the National Water Model Retrospective version 2.1, and may include a variable name such as 'streamflow' to designate the simulated variable.
- Entries with 'nwis' pertain to observed streamflow from the USGS National Water Information System (NWIS): parameter code '00060', statistic code (stat_cd) '00003' (mean).
- Entries with 'streamflow' in the name contain the streamflow variable.

Using the Intake catalog, we define another helper function that loads our data from the appropriate location depending on where the computation will be run.

In [None]:
observations_ds = cat['nwis-streamflow-usgs-gages-onprem'].to_dask()
observations_ds

In [None]:
def load_streamflow_data(resource):
    ''' Helper function to load observations and model predictions from Intake.
    
    Some initial preprocessing is also done here, like converting the datasets to the same type.
    '''
    if resource in ['tallgrass','denali']:
        location = 'onprem'
        
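    elif resource in ['esip-qhub-gateway-v0.4']:
        location = 'cloud'

    else:
        # 'local' and any other value have no catalog location defined here
        raise ValueError(f"No data location is defined for resource {resource!r}")
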
    url = 'https://raw.githubusercontent.com/USGS-python/hytest-catalogs/main/hytest_intake_catalog.yml'
    cat = intake.open_catalog(url)

    observations_ds = cat[f'nwis-streamflow-usgs-gages-{location}'].to_dask()
    model_ds = cat[f'nwm21-streamflow-usgs-gages-{location}'].to_dask()
    
    observations = observations_ds['streamflow']
    model = model_ds['streamflow'].astype('float32')

    observations.name = 'observed'
    model.name = 'predicted'
    
    return observations, model

Let's demo that helper, and show how to select data for a single streamgage.

In [None]:
obs, pred = load_streamflow_data(resource)

In [None]:
#obs

In [None]:
#pred

In [None]:
%%time
# time it takes to read a single gage
gage_id = 'USGS-01030350'
obs.sel(gage_id=gage_id).load(scheduler='threads').to_series().tail()

### 1.1 Load benchmark locations
Each benchmark is defined over a specific domain (typically bounded in space and time, which are related in some respects). Benchmark spatiotemporal domains are published to ScienceBase within the [HyTEST directory](https://www.sciencebase.gov/catalog/item/61dd751ed34ed7929401a4bd), or they can be defined within the notebook if a user chooses. For this example, we use the Cobalt gages, available for download on ScienceBase ([Foks et al., 2022](https://doi.org/10.5066/P972P42Z)).

In [None]:
import fsspec
#fs = fsspec.filesystem('https', anon=True)
url = 'https://raw.githubusercontent.com/USGS-python/hytest-evaluation-workflows/main/misc/streamflow_gages_v1_n5390.csv'
benchmark_ds = pd.read_csv(url, dtype={'site_no':str, 'huc_cd':str, 'reachcode':str, 'comid':str, 'gagesII_class':str, 'aggecoregion': str}).set_index('site_no').to_xarray()

In [None]:
# Format the site_no
benchmark_ds['site_no'] = [f'USGS-{site}' for site in benchmark_ds['site_no'].values]
benchmark_gages = benchmark_ds['site_no'].values.tolist()
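
USGS site numbers often begin with a zero, which is why the identifier columns above are read as strings. The short sketch below illustrates the difference on a two-row CSV built in memory. The second site number and the `example_value` column are placeholders made up for the illustration; only the `site_no` column name and the `USGS-` prefix come from this notebook.

```python
import io
import pandas as pd

# Tiny in-memory CSV; the values are placeholders for illustration only
csv_text = "site_no,example_value\n01030350,1.0\n09380000,2.0\n"

# Default parsing infers integers and silently drops the leading zeros
as_default = pd.read_csv(io.StringIO(csv_text))

# Reading site_no as a string keeps the identifier intact
as_text = pd.read_csv(io.StringIO(csv_text), dtype={'site_no': str})

print(as_default['site_no'].tolist())
print(as_text['site_no'].tolist())

# Add the prefix used by the gage_id coordinate in the streamflow datasets
gage_ids = [f'USGS-{site}' for site in as_text['site_no']]
print(gage_ids)
```

Without the `dtype` argument, the prefixed identifiers would no longer match the `gage_id` values in the streamflow datasets.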

### 1.2. Load statistical metrics
This demo computes a benchmark for the National Water Model (NWM) Retrospective version 2.1 using a suite of traditional metrics (referred to as the "standard statistical suite version 1.0"). In practice, these may be loaded from external libraries. Ideally, a future version of this notebook would be configured to load a specific version of such a library, which would uniquely identify the benchmark to run. Additionally, specific metrics packages could be loaded with `httpimport`.

Below is the definition of the version 1.0 standard statistical suite (10 metrics total; Towler et al., in draft) that is calculated in this notebook, with references.

| Metric | Reference or definition |
| :----------- | :----------- |
| Nash-Sutcliffe efficiency (NSE) | Nash, J. E., & Sutcliffe, J. V. (1970). River flow forecasting through conceptual models part I—A discussion of principles. Journal of hydrology, 10(3), 282-290. |
| Kling-Gupta efficiency (KGE) | Gupta, H. V., Kling, H., Yilmaz, K. K., & Martinez, G. F. (2009). Decomposition of the mean squared error and NSE performance criteria: Implications for improving hydrological modelling. Journal of hydrology, 377(1-2), 80-91. |
| logNSE | Oudin, L., Andréassian, V., Mathevet, T., Perrin, C., & Michel, C. (2006). Dynamic averaging of rainfall‐runoff model simulations from complementary model parameterizations. Water Resources Research, 42(7). |
| percent bias | a measure of the mean tendency of simulated values to be greater or less than associated observed values, units of percent |
| ratio of standard deviation |standard deviation of simulated values divided by the standard deviation of observed values|
| Pearson Correlation | K. Pearson (1896, 1900, 1920) |
|Spearman Correlation | Charles Spearman (1904, 1910) |
|percent bias in midsegment slope of the flow-duration curve (FDC) between Q20-Q70 | Yilmaz, K. K., Gupta, H. V., & Wagener, T. (2008). A process‐based diagnostic approach to model evaluation: Application to the NWS distributed hydrologic model. Water Resources Research, 44(9). |
|percent bias in FDC low-segment volume (Q0-Q30)| Yilmaz, K. K., Gupta, H. V., & Wagener, T. (2008). A process‐based diagnostic approach to model evaluation: Application to the NWS distributed hydrologic model. Water Resources Research, 44(9). |
|percent bias in FDC high-segment volume (Q98-Q100) | Yilmaz, K. K., Gupta, H. V., & Wagener, T. (2008). A process‐based diagnostic approach to model evaluation: Application to the NWS distributed hydrologic model. Water Resources Research, 44(9). | 
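
For reference, three of the metrics in the table can be written out as follows, using the same definitions as the functions in this notebook. The notation is introduced here for illustration only: $o_t$ and $m_t$ are the observed and modeled flow at time step $t$, $n$ is the number of time steps, $\mu$ and $\sigma$ are the mean and (population) standard deviation of each series, and $r$ is Pearson's correlation coefficient.

$$
\mathrm{NSE} = 1 - \frac{\frac{1}{n}\sum_{t}(o_t - m_t)^2}{\sigma_o^2}
$$

$$
\mathrm{PBIAS} = 100 \, \frac{\sum_{t}(m_t - o_t)}{\sum_{t} o_t}
$$

$$
\mathrm{KGE} = 1 - \sqrt{(r - 1)^2 + (\alpha - 1)^2 + (\beta - 1)^2},
\qquad \alpha = \frac{\sigma_m}{\sigma_o},
\qquad \beta = \frac{\mu_m}{\mu_o}
$$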

In [None]:
'''
Global variance of observed data: dscore scorecard development
'''
def variance(obs):
    """
    Calculate the global variance of the observations
    
    Args:
        obs: numpy array of observed values
    Returns:
        variance
    """
    return np.mean((obs - np.mean(obs)) ** 2)

In [None]:
'''
A selection of traditional/standard statistical suite of metrics
'''

import numpy as np

def mse(obs, mod):
    """
    Calculate the mean squared error (MSE)
    
    Args:
        obs: numpy array of observed values
        mod: numpy array of modeled values
    Returns:
        mean squared error
    """
    return np.mean((obs - mod) ** 2)


def nse(obs, mod):
    """
    Calculate the Nash-Sutcliffe Efficiency (NSE)
    (https://www.sciencedirect.com/science/article/pii/0022169470902556?via%3Dihub)
    
    Args:
        obs: numpy array of observed values
        mod: numpy array of modeled values
    Returns:
        Nash-Sutcliffe Efficiency
    """
    return 1 - (mse(obs, mod) / np.var(obs))


def pbias(obs, mod):
    """
    Calculate the percent bias
    
    Args:
        obs: numpy array of observed values
        mod: numpy array of modeled values
    Returns:
        Percent bias
    """
    return 100 * ((np.sum(mod - obs)) / (np.sum(obs)))


def pbias_percentile(obs, model, percentile, fun):
    """
    Calculate the percent bias for a percentile bin
    
    Args:
        obs: numpy array of observed values
        model: numpy array of modeled values
        percentile: float
        fun: comparison function (e.g., np.greater)
    Returns:
        Percent bias for bin
    """
    threshold = np.percentile(obs, q=percentile)
    i = fun(obs, threshold)
    
    return pbias(obs[i], model[i])
    

def pearson_r(obs, mod):
    """
    Calculate Pearson's r
    
    Args:
        obs: numpy array of observed values
        mod: numpy array of modeled values
    Returns:
        Pearson's r
    """
    #return np.cov(mod, obs) / np.sqrt( np.var(mod) * np.var(obs))
    return np.corrcoef(mod, obs)[0,1]


def spearman_r(obs, mod):
    """
    Calculate Spearman's r
    
    Args:
        obs: numpy array of observed values
        mod: numpy array of modeled values
    Returns:
        Spearman's r
    """
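    # Spearman's r is Pearson's r computed on ranks. np.argsort returns the
    # sorting order rather than ranks, so rank with pandas (ties are averaged).
    obs_rank = pd.Series(np.asarray(obs)).rank()
    mod_rank = pd.Series(np.asarray(mod)).rank()
    return pearson_r(obs_rank, mod_rank)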


def kge(obs, mod):
    """
    Calculate the Kling-Gupta Efficiency (KGE)
    (https://www.sciencedirect.com/science/article/pii/S0022169409004843)
    
    Args:
        obs: numpy array of observed values
        mod: numpy array of modeled values
    Returns:
        Kling-Gupta Efficiency
    """
    r = pearson_r(obs, mod)
    alpha = sd_ratio(obs, mod)
    beta = np.sum(mod) / np.sum(obs)
    return 1 - np.sqrt((r-1)**2 + (alpha-1)**2 + (beta-1)**2)


def sd_ratio(obs, mod):
    """
    Calculate the standard deviation ratio of the model predictions and observations
    
    Args:
        obs: numpy array of observed values
        mod: numpy array of modeled values
    Returns:
        Standard deviation ratio   
    """
    return np.std(mod) / np.std(obs)
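
A quick way to build confidence in metric functions like these is to run them on synthetic series with known answers. The sketch below restates compact versions of `mse`, `nse`, `pbias`, and `kge` so that it runs on its own; the three test series (`perfect`, `mean_only`, `inflated`) are made up for the illustration.

```python
import numpy as np

def mse(obs, mod):
    return np.mean((obs - mod) ** 2)

def nse(obs, mod):
    return 1 - mse(obs, mod) / np.var(obs)

def pbias(obs, mod):
    return 100 * np.sum(mod - obs) / np.sum(obs)

def kge(obs, mod):
    r = np.corrcoef(mod, obs)[0, 1]
    alpha = np.std(mod) / np.std(obs)
    beta = np.sum(mod) / np.sum(obs)
    return 1 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)

# Synthetic "observations" and three synthetic "models"
obs = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
perfect = obs.copy()                       # reproduces the observations exactly
mean_only = np.full_like(obs, obs.mean())  # always predicts the observed mean
inflated = 1.1 * obs                       # overpredicts every value by 10 percent

print('perfect:  ', nse(obs, perfect), kge(obs, perfect), pbias(obs, perfect))
print('mean only:', nse(obs, mean_only))
print('inflated: ', pbias(obs, inflated), kge(obs, inflated))
```

A perfect model scores an NSE and KGE of 1 with zero bias, and a model that always predicts the observed mean scores an NSE of 0. The inflated model has a percent bias of 10, and because its correlation is 1 while both `alpha` and `beta` equal 1.1, its KGE is 1 - sqrt(0.02), or about 0.86.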

### 1.3 Define benchmark function
For each streamgage, we will compute a series of performance metrics, and the results from each streamgage will be combined into a single dataframe, with one row per gage and one column per metric. To parallelize this task, we create the helper function `compute_benchmark()`, which computes the benchmark for a particular streamgage. In parallel, each worker in the cluster is assigned a gage, loads the data for that gage, and computes the benchmark; the results are then gathered. In this example, the `compute_benchmark()` function converts the data to a `pandas.Series`, resamples the model predictions to daily means so that they share a time step with the observations, then computes a series of metrics from the aligned data. Each metric is stored as an entry in another `pandas.Series` named `scores`, which is returned by `compute_benchmark()` upon completion.
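
The alignment step can be illustrated on synthetic data. The sketch below assumes hourly model output and daily observations; all values and dates are made up, and the `offset='5h'` setting simply mirrors the resampling call used in this notebook.

```python
import numpy as np
import pandas as pd

# Hourly "model" series starting at 05:00: 24 values of 1.0, then 24 values of 3.0
hours = pd.date_range('2020-01-01 05:00', periods=48, freq=pd.Timedelta(hours=1))
predicted = pd.Series(np.repeat([1.0, 3.0], 24), index=hours, name='predicted')

# Daily "observed" series with one missing value
days = pd.date_range('2020-01-01', periods=3, freq='D')
observed = pd.Series([1.5, np.nan, 2.0], index=days, name='observed')

# Daily means over 24-hour windows that start at 05:00
daily_pred = predicted.resample('1D', offset='5h').mean()

# Reduce both indices to calendar days so the two series line up
observed.index = observed.index.to_period('D')
daily_pred.index = daily_pred.index.to_period('D')

# Keep only the days that have both an observation and a prediction
df = pd.merge(observed, daily_pred, left_index=True, right_index=True).dropna(how='any')
print(df)
```

Only the first day survives: the second day has no observation, and the third day has no prediction.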

In [None]:
def compute_benchmark(gage_id, observations, predictions):
    obs1 = observations.sel(gage_id=gage_id).load(scheduler='single-threaded').to_series()
    mod1 = predictions.sel(gage_id=gage_id).load(scheduler='single-threaded').to_series().resample('1D', offset='5h').mean() # Resampling could be done in preanalysis
    
    # make sure the indices match
    obs1.index = obs1.index.to_period('D')
    mod1.index = mod1.index.to_period('D')

    # merge obs and predictions and drop nans.
    df = pd.merge(obs1, mod1, left_index=True, right_index=True).dropna(how='any')
    obs1 = df['observed']
    mod1 = df['predicted']
    
    # compute log flow for use in log NSE
    threshold = 0.01
    log_obs = np.log(obs1.where(obs1 > threshold, threshold))
    log_model = np.log(mod1.where(mod1 > threshold, threshold))
    
    scores = pd.Series(dtype='float')
    scores['nse'] = nse(obs1, mod1)
    scores['log_nse'] = nse(log_obs, log_model)
    scores['kge'] = kge(obs1, mod1)
    
    scores['pbias'] = pbias(obs1, mod1)
    scores['pearson_r'] = pearson_r(obs1, mod1)
    scores['spearman_r'] = spearman_r(obs1, mod1)
    scores['sd_ratio'] = sd_ratio(obs1, mod1)
    
    # compute high flow and low flow bias (Yilmaz et al., 2008)
    #   Yilmaz, K. K., Gupta, H. V., & Wagener, T. (2008). 
    #   A process‐based diagnostic approach to model evaluation: 
    #   Application to the NWS distributed hydrologic model. 
    #   Water Resources Research, 44(9).
    high_percentile = 98
    low_percentile = 30
    
    scores['pbias_q' + str(high_percentile)] = pbias_percentile(obs1, mod1, high_percentile, np.greater)
    scores['pbias_q' + str(low_percentile)] = pbias_percentile(obs1, mod1, low_percentile, np.less_equal)
    scores.name = gage_id
    
    # TODO: compute the percent bias in the FDC midsegment slope
    #   (Yilmaz et al., 2008; full citation above); not yet implemented
    
    return scores

Run `compute_benchmark()` and verify the output:

In [None]:
obs, mod = load_streamflow_data(resource)

In [None]:
%%time
# run for a single site using 1 core
gage_id = 'USGS-01030350'
compute_benchmark(gage_id, obs, mod)
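
The metric table in section 1.2 lists the percent bias in the slope of the FDC midsegment (Q20-Q70), which `compute_benchmark()` mentions only in a comment. One way it might be computed is sketched below, based on our reading of Yilmaz et al. (2008): the slope is the difference in log flow between the 0.2 and 0.7 exceedance probabilities, and the bias is the relative difference between the modeled and observed slopes. The function name `pbias_fdc_midslope`, its default arguments, and the reuse of the 0.01 flow floor from the log NSE calculation are assumptions of this sketch, not part of the benchmark definition.

```python
import numpy as np

def pbias_fdc_midslope(obs, mod, lower=0.2, upper=0.7, threshold=0.01):
    """Percent bias in the slope of the flow-duration curve midsegment (sketch)."""
    # Floor the flows so that the logarithm is defined for zero flow
    obs = np.maximum(np.asarray(obs, dtype=float), threshold)
    mod = np.maximum(np.asarray(mod, dtype=float), threshold)

    # The flow exceeded with probability p is the (1 - p) quantile of the flows
    quantiles = [1 - lower, 1 - upper]
    obs_high, obs_low = np.quantile(obs, quantiles)
    mod_high, mod_low = np.quantile(mod, quantiles)

    # Midsegment slope in log space for each series
    obs_slope = np.log(obs_high) - np.log(obs_low)
    mod_slope = np.log(mod_high) - np.log(mod_low)

    return 100 * (mod_slope - obs_slope) / obs_slope

# Synthetic flows that are evenly spaced in log space
obs_demo = np.exp(np.linspace(0, 5, 101))
print(pbias_fdc_midslope(obs_demo, obs_demo))
print(pbias_fdc_midslope(obs_demo, obs_demo ** 2))
```

With identical series the bias is zero. Squaring the flows doubles the log-space slope, so the second call returns a bias of approximately 100 percent.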

## 2. Compute benchmark results

We will define one final function, that wraps `compute_benchmark()` in a `try` statement. That way, if an error occurs at a particular streamgage, the other streamgages will be unaffected. 

WARNING: While developing your code, we recommend against sequestering errors inside a `try`, because error messages are extremely useful when debugging code.

In [None]:
def try_compute_benchmark(gage_id):
    """Wrapper function
    """
    try:
        return compute_benchmark(gage_id, obs, mod)
    except Exception:  # avoid a bare except, which would also trap KeyboardInterrupt
        return None
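
If you would rather keep a record of what went wrong than discard it, one option is to return the error message alongside the result. The sketch below shows the pattern with a toy function; `safe_apply`, `toy_metric`, and the `'USGS-bad'` identifier are hypothetical names used only for this illustration.

```python
def safe_apply(func, key):
    """Return (key, result, error); the error is None when the call succeeds."""
    try:
        return key, func(key), None
    except Exception as err:
        return key, None, f'{type(err).__name__}: {err}'

def toy_metric(gage_id):
    # Stand-in for a per-gage computation that fails for one gage
    if gage_id == 'USGS-bad':
        raise KeyError(gage_id)
    return len(gage_id)

outcomes = [safe_apply(toy_metric, g) for g in ['USGS-01030350', 'USGS-bad']]

# Separate the successful results from the recorded failures
results_ok = [res for _, res, err in outcomes if err is None]
failures = {key: err for key, _, err in outcomes if err is not None}

print(results_ok)
print(failures)
```

The failed gage no longer stops the run, but its error type and message remain available for debugging afterwards.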

### 2.0 Setup cluster
Using our helper functions, we can set up our analysis in two lines.

In [None]:
client, cluster = configure_cluster(resource)
obs, mod = load_streamflow_data(resource)

In [None]:
#cluster.scale(20)

In [None]:
client

### 2.1 Distribute with Dask bag
Now to parallelize, we create a Dask bag from the list `benchmark_gages`, and pass `try_compute_benchmark` to each element in the bag.

In [None]:
b = db.from_sequence(benchmark_gages, npartitions=40)
b1 = b.map(try_compute_benchmark)

In [None]:
%%time
from dask.distributed import performance_report
with performance_report(filename="dask-report-whole.html"):
    results = b1.compute(retries=10)

Finally, concatenate the results from each gage into a single dataframe.

In [None]:
results = [i for i in results if i is not None] # Drop entries where compute_benchmark failed

df_results = pd.concat(results, axis=1)
df_results = df_results.T
df_results.index.name = 'site_no'
#df_results.index = df_results.index.astype('<U15')
ds_results = df_results.to_xarray()
ds_results
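
The reshaping in the cell above relies on each result being a named `pandas.Series`. The small sketch below shows the same steps on two made-up score series; the gage names and score values are placeholders, not benchmark results.

```python
import pandas as pd

# Two made-up per-gage results plus one failure, as the wrapper would return them
s1 = pd.Series({'nse': 0.5, 'kge': 0.6}, name='USGS-A')
s2 = pd.Series({'nse': 0.7, 'kge': 0.4}, name='USGS-B')
demo_results = [s1, None, s2]

# Drop the failure, then place the gages side by side as columns
demo_results = [r for r in demo_results if r is not None]
df_demo = pd.concat(demo_results, axis=1)

# Transpose so that each gage is a row and each metric is a column
df_demo = df_demo.T
df_demo.index.name = 'site_no'
print(df_demo)
```

Each Series name becomes a row label after the transpose, which is why `compute_benchmark()` sets `scores.name = gage_id`.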

### 2.2 Save results to disk
Save the results to disk as a CSV file, which can then be uploaded to USGS ScienceBase.
This file will save to your HOME directory if you are working in the HPC environment.

In [None]:
ds_results.to_dataframe().to_csv('nwm_v2.1_streamflow_benchmark_test.csv')

Then save the results as a NetCDF file, which we will later use for visualization. Let's add latitude and longitude from the benchmark data before writing the results to NetCDF.

In [None]:
ds_results.merge(benchmark_ds, join='inner').to_netcdf('nwm_v2.1_streamflow_benchmark_test.nc')

The end.

In [None]:
client.close(); cluster.close()