# National Water Model Benchmarking Analysis Workflow

:::{note}

_This notebook is adapted from originals by Timothy Hodson and Rich Signell. See that upstream work at:_
* https://github.com/thodson-usgs/dscore
* https://github.com/USGS-python/hytest-evaluation-workflows/

:::


## Essential Benchmark Components
This benchmark notebook presents a workflow that follows a canonical model for Essential Benchmark Components:
1) A set of predictions and matching observations; 
2) The domain (e.g. space or time) over which to benchmark;
3) A set of statistical metrics with which to benchmark. 

Let's get started.

## Step 0: Load libraries and configure Python computing environment

In [1]:
# Python libraries we will need...
import pandas as pd
import logging


Will this analysis need parallelism? Likely yes.

The following cell will configure a cluster environment suited to the server hosting this notebook. 

You may let our `configure_cluster()` helper try to guess the cluster config for you, or you can 
explicitly name a config matching where you are running this notebook. 

In [5]:
import sys
sys.path.append(r'/shared/users/lib')
from HyTest.helpers import configure_cluster
(client, cluster) = configure_cluster('cloud')
client

Region: us-west-2
No Cluster running.
Starting new cluster.
{}
Setting Cluster Environment Variable AWS_DEFAULT_REGION us-west-2
Setting Fixed Scaling workers=30
Reconnect client to clear cache
client.dashboard_link (for new browser tab/window or dashboard searchbar in Jupyterhub):
https://jupyter.qhub.esipfed.org/gateway/clusters/dev.05078e244814404eb476a957d96e41d3/status
Propagating environment variables to workers
Using environment: users/pangeo


Connection method: Cluster object
Cluster type: dask_gateway.GatewayCluster
Dashboard: https://jupyter.qhub.esipfed.org/gateway/clusters/dev.05078e244814404eb476a957d96e41d3/status


## Step 1: Load Data

Essential Benchmark Components: 
1) A set of predictions and matching observations,  <span style="color:red; font-size:large"><<--You are here</span>
2) The domain over which to benchmark 
3) A set of statistical metrics with which to benchmark. 

Finding and loading data is easier for this particular workflow, in that most of it has been pre-processed and stored in a cloud-friendly format. That data store is indexed by an **intake catalog**. Learn more about `intake` [here](../L2/xx_Intake.ipynb).

In [6]:
import intake
cat = intake.open_catalog(r'https://raw.githubusercontent.com/hytest-org/hytest/main/dataset_catalog/hytest_intake_catalog.yml')
print("Available datasets: \n", "\n".join(cat.keys()))

Available datasets: 
 conus404-hourly-onprem
conus404-hourly-cloud
conus404-daily-onprem
conus404-daily-cloud
nwis-streamflow-usgs-gages-onprem
nwis-streamflow-usgs-gages-cloud
nwm21-streamflow-usgs-gages-onprem
nwm21-streamflow-usgs-gages-cloud
nwm21-streamflow-cloud
nwm21-scores
lcmap-cloud
conus404-hourly-cloud-dev


The list above shows the processed datasets available for benchmarking. If a dataset you want is not in that list, either

**DO THIS** (TODO: define this)

or load the data manually using [other means](/dev/null).
If you load data from a source other than this list, you can jump to [Step 2: Restrict to a Domain](#step-2-restrict-to-a-domain).

Note that most datasets in the catalog above come in two copies: some are `-onprem` and some are `-cloud`.
The data are the same, but the storage location and access protocol differ, so use the copy that
matches your computing environment.
* `onprem` : Direct access via the `caldera` filesystem from _denali_ or _tallgrass_
* `cloud` : Network access via S3 bucket, suitable for consumption on cloud-hosted Jupyter servers (e.g. ESIP QHub). You could also access it from any network-attached computer, but the amount of data will likely saturate your connection.

So... are you on-prem? 

In [7]:
import platform
onprem = (platform.node() in ['denali', 'tallgrass'])
if onprem:
    print("Yes : -onprem")
    obs_data_src='nwis-streamflow-usgs-gages-onprem'
    mod_data_src='nwm21-streamflow-usgs-gages-onprem'
else:
    print("Not onprem; use '-cloud' data source")
    obs_data_src='nwis-streamflow-usgs-gages-cloud'
    mod_data_src='nwm21-streamflow-usgs-gages-cloud'
print("Observed : ", obs_data_src)
print("Modeled  : ", mod_data_src)

Not onprem; use '-cloud' data source
Observed :  nwis-streamflow-usgs-gages-cloud
Modeled  :  nwm21-streamflow-usgs-gages-cloud


In [8]:
variable_of_interest = 'streamflow'
try:
    obs = cat[obs_data_src].to_dask()
    mod = cat[mod_data_src].to_dask()
except KeyError:
    print("Something wrong with dataset names.")
    raise

try:
    obs_data = obs[variable_of_interest]
    mod_data = mod[variable_of_interest].astype('float32')
except KeyError:
    print(f"{variable_of_interest} was not found in these data.")
    raise  # without these arrays, the lines below cannot run

obs_data.name = 'observed'
mod_data.name = 'predicted'

## Step 2: Restrict to a Domain

Essential Benchmark Components: 
1) A set of predictions and matching observations,  
2) The domain over which to benchmark <span style="color:red; font-size:large"><<--You are here</span>
3) A set of statistical metrics with which to benchmark. 

Each benchmark domain is defined over specific bounds (typically space and/or time). 
Benchmark domain definitions are published to ScienceBase, or they can be defined within the notebook. 

This notebook will use a benchmark domain definition loaded from ESIP's network file 
system (S3). It is essentially a list of the stream gages in which we are interested, along with some 
metadata about each gage (watershed, reach code, etc). We will use this as a spatial selector
to restrict the data to only those gages found within this benchmarking domain.
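
As a minimal sketch of what "spatial selector" means here, the snippet below keeps only the entries of a per-gage dataset whose IDs appear in a domain list. The dictionary, the IDs, and the `restrict_to_domain` helper are hypothetical stand-ins in plain Python; the notebook itself selects from the real arrays by `gage_id`.

```python
# Hypothetical stand-ins: per-gage values keyed by gage ID, and a domain list.
data_by_gage = {
    "USGS-01011000": [10.0, 12.5, 11.0],
    "USGS-01013500": [3.2, 3.0, 2.9],
    "USGS-99999999": [0.1, 0.2, 0.1],   # present in the data, not in the domain
}
domain_ids = ["USGS-01011000", "USGS-01013500"]


def restrict_to_domain(data, ids):
    """Return only the entries of `data` whose key is listed in `ids`."""
    wanted = set(ids)
    return {gage_id: values for gage_id, values in data.items() if gage_id in wanted}


selected = restrict_to_domain(data_by_gage, domain_ids)
print(f"{len(selected)} of {len(data_by_gage)} gages fall inside the domain")
```

Everything outside the domain list is ignored, which is why only the gage identifiers of the domain definition are needed for this step.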

Because we are using a standardized list, we need to fetch it from its upstream source, which is
an S3 bucket.

:::{sidebar}

Read more about accessing S3 [here](/dev/null)

:::

In [9]:
import fsspec
fs = fsspec.filesystem('s3', anon=True)
try:
    domain_data = pd.read_csv(
        fs.open('s3://esip-qhub-public/usgs/hytest/streamflow_benchmark_sites_v09.csv'), 
        dtype={'site_no':str, 'huc_cd':str, 'reachcode':str, 'comid':str },
        index_col='site_no'
        )
except Exception:
    print("Could not open the benchmark data ... AWS problem?")
    raise
# Re-format the gage_id/site_no string value.  ex:   "1000000"  ==> "USGS-1000000"
domain_data.rename(index=lambda x: f'USGS-{x}', inplace=True)
print(f"{len(domain_data.index)} gages in this benchmark")

5520 gages in this benchmark


We now have a domain dataset of stream gages (unique `site_no` values) that identifies the locations making up the spatial domain of this benchmark. Although we have good metadata for these gages (lat/lon location, HUC8 code, etc), we will only use the list of `site_no` values to select locations out of the raw data.
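
The cell above reads `site_no` and the other code columns as strings. A short sketch of why that matters: many gage numbers begin with a zero, and a numeric round trip drops it, so the ID would no longer match. The two-row sample and its `huc_cd` values are made up for illustration, and the standard `csv` module stands in for `pandas`.

```python
import csv
import io

# Hypothetical two-row extract in the shape of the benchmark site list.
raw = "site_no,huc_cd\n01030350,01020003\n14369500,17100309\n"

rows = list(csv.DictReader(io.StringIO(raw)))   # csv keeps every field as text

as_text = rows[0]["site_no"]      # the leading zero is preserved
as_number = str(int(as_text))     # a numeric round trip loses the leading zero

gage_id = f"USGS-{as_text}"       # same prefix format used by the notebook
print(as_text, as_number, gage_id)
```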

## Step 3: Compute Metrics

Essential Benchmark Components: 
1) A set of predictions and matching observations,  
2) The domain over which to benchmark 
3) A set of statistical metrics with which to benchmark. <span style="color:red; font-size:large"><<--You are here</span>

The code to calculate the various metrics has been standardized [here](./NWM_StandardSuite_v1.ipynb). You can 
use these or write your own.  To import and use these standard definitions, run this cell:

In [10]:
%run ./NWM_StandardSuite_v1.ipynb
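
For orientation, here is a plain-Python sketch of three of these metrics in their common textbook forms. It is an assumption that the standard suite matches these forms; details such as the sign convention of percent bias vary between sources, so treat the linked notebook as authoritative. The lower-case function names are hypothetical and chosen so they do not shadow the suite's `NSE`, `KGE`, and `pbias`.

```python
import math


def _mean(xs):
    return sum(xs) / len(xs)


def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 is perfect, 0 matches predicting the observed mean."""
    m = _mean(obs)
    num = sum((o - s) ** 2 for o, s in zip(obs, sim))
    den = sum((o - m) ** 2 for o in obs)
    return 1.0 - num / den


def pbias(obs, sim):
    """Percent bias, written here so that over-prediction is positive."""
    return 100.0 * sum(s - o for o, s in zip(obs, sim)) / sum(obs)


def kge(obs, sim):
    """Kling-Gupta efficiency from correlation, variability ratio, and mean ratio."""
    mo, ms = _mean(obs), _mean(sim)
    var_o = sum((o - mo) ** 2 for o in obs)
    var_s = sum((s - ms) ** 2 for s in sim)
    cov = sum((o - mo) * (s - ms) for o, s in zip(obs, sim))
    r = cov / math.sqrt(var_o * var_s)      # Pearson correlation
    alpha = math.sqrt(var_s / var_o)        # ratio of standard deviations
    beta = ms / mo                          # ratio of means
    return 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)


obs = [1.0, 2.0, 3.0, 4.0]
print(nse(obs, obs), nse(obs, [2.5] * 4), pbias(obs, [2.0, 4.0, 6.0, 8.0]), kge(obs, obs))
```

A perfect simulation scores 1 on both efficiencies, while a simulation that always returns the observed mean scores an NSE of 0.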

Whether you use these functions or your own, we need to put all metric computation into a single all-encompassing 
benchmarking function: one call which can be assigned to each gage in our domain list. This sort of arrangement 
is well-suited to parallelism with `dask`. 

If this is done well, the process will benefit enormously from task parallelism -- each gage can be given its own 
CPU to run on.  After all are done, the various results will be collected and assembled into a composite dataset. 

To achieve this, we need a single 'atomic' function that can execute independently. It will take the gage identifier 
as input and return the metrics for that gage as a `pandas` Series.
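
The pattern can be sketched without `dask`. Below, a standard-library thread pool stands in for the cluster, and a trivial mean stands in for the metrics. The `flows` data, the `score_one` function, and the gage IDs are hypothetical. The point is only the shape: one self-contained function per ID, with failures contained inside the function.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-gage inputs; "USGS-C" has no data and "USGS-D" is unknown.
flows = {"USGS-A": [1.0, 2.0, 3.0], "USGS-B": [4.0, 6.0], "USGS-C": []}


def score_one(gage_id):
    """Self-contained unit of work: takes an ID, returns a result or None."""
    try:
        values = flows[gage_id]
        return {"gage_id": gage_id, "mean_flow": sum(values) / len(values)}
    except Exception:
        # One bad gage must not stop the remaining tasks.
        return None


gage_ids = ["USGS-A", "USGS-B", "USGS-C", "USGS-D"]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(score_one, gage_ids))   # order matches gage_ids

print(results)
```

Because each call depends only on its own ID, the calls can run in any order and on any worker, and the caller only has to collect the return values afterwards.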

In [19]:
import logging
logging.basicConfig(level=logging.DEBUG, force=True)

In [26]:
## Wrapper function -- this func will be called once per gage_id, each call on its own dask worker
def compute_benchmark(gage_id):
    try:
        ## obs_data and mod_data should be globals...
        obs = obs_data.sel(gage_id=gage_id).load(scheduler='single-threaded').to_series()
        mod = mod_data.sel(gage_id=gage_id).load(scheduler='single-threaded').to_series().resample('1D', offset='5h').mean() 
        
        # make sure the indices match
        obs.index = obs.index.to_period('D')
        mod.index = mod.index.to_period('D')

        # merge obs and predictions; drop NaNs.
        gage_df = pd.merge(obs, mod, left_index=True, right_index=True).dropna(how='any')
        
        scores = pd.Series(
            data={
                'NSE': NSE(gage_df.observed, gage_df.predicted),
                'KGE': KGE(gage_df.observed, gage_df.predicted),
                'logNSE': logNSE(gage_df.observed, gage_df.predicted),
                'pbias': pbias(gage_df.observed, gage_df.predicted),
                'rSD': rSD(gage_df.observed, gage_df.predicted),
                'pearson': pearson_r(gage_df.observed, gage_df.predicted),
                'spearman': spearman_r(gage_df.observed, gage_df.predicted), 
                'pBiasFMS': pBiasFMS(gage_df.observed, gage_df.predicted),
                'pBiasFLV': pBiasFLV(gage_df.observed, gage_df.predicted),
                'pBiasFHV': pBiasFHV(gage_df.observed, gage_df.predicted)
            },
            name=gage_id,
            dtype='float64'
        )
        return scores
    except Exception as e:
        # Catching Exception is deliberately broad: a failure on one benchmark
        # (for a single stream gage) must not halt the entire run.
        logging.info("Benchmark failed for %s: %s", gage_id, e)
        return None

Let's test to be sure this `compute_benchmark()` function will return data for a single gage:

In [25]:
compute_benchmark('USGS-01030350')

NSE          0.610186
KGE          0.581806
logNSE       0.437533
pbias      -12.679162
rSD          0.655655
pearson      0.799410
spearman     0.859122
pBiasFMS   -34.154380
pBiasFLV    90.474838
pBiasFHV   -43.865916
Name: USGS-01030350, dtype: float64
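
Before the metrics are computed, the function averages the predictions to daily values and pairs them with the observations by day. The following plain-Python sketch shows that logic on made-up numbers. It assumes each bin runs from 05:00 to 05:00 and is labeled with the day on which it starts, which is how the `resample('1D', offset='5h')` and `to_period('D')` calls in `compute_benchmark()` are understood to behave. The `daily_means` helper and all values are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

OFFSET = timedelta(hours=5)


def daily_means(hourly):
    """Average (timestamp, value) pairs into days that run 05:00 to 05:00."""
    bins = defaultdict(list)
    for ts, value in hourly:
        bins[(ts - OFFSET).date()].append(value)   # label = day the bin starts
    return {day: sum(vals) / len(vals) for day, vals in bins.items()}


# Hypothetical hourly predictions filling exactly two daily bins.
start = datetime(2020, 1, 1, 5)
hourly = [(start + timedelta(hours=h), float(h)) for h in range(48)]
predicted = daily_means(hourly)

# Hypothetical daily observations; the second day is missing (None).
day1 = start.date()
day2 = day1 + timedelta(days=1)
day3 = day1 + timedelta(days=2)
observed = {day1: 10.0, day2: None, day3: 12.0}

# Keep only days present in both series with a non-missing observation.
paired = {d: (observed[d], predicted[d])
          for d in sorted(predicted) if observed.get(d) is not None}
print(paired)
```

Only the days that survive this pairing contribute to the scores, so gages with short or gappy records are scored on fewer days.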

We now need to set up a way to farm out this function, once per gage ID. `dask` will do this
using a dask '_bag_'.  

:::{sidebar}

Read more about task parallelism with Dask and how we are using dask bags [here](/dev/null)

:::

In [34]:
import dask.bag as db
bag = db.from_sequence(domain_data.index.tolist()).map(compute_benchmark)
results = bag.compute() 

With that big task done, we don't need `dask` parallelism any more. Let's shut down the cluster:

In [35]:
client.close(); del client
cluster.close(); del cluster



## Assemble the results
The `results` list now contains a collection of return values (one per call to `compute_benchmark()`). We can massage that into a table/dataframe for easier processing:

In [36]:
results = [i for i in results if i is not None] # Drop entries where compute_benchmark failed
results_df = pd.concat(results, axis=1).T
results_df.index.name = 'site_no'
#ds_results = df_results.to_xarray()
results_df

| site_no | NSE | KGE | logNSE | pbias | rSD | pearson | spearman | pBiasFMS | pBiasFLV | pBiasFHV |
|---|---|---|---|---|---|---|---|---|---|---|
| USGS-01011000 | 0.689088 | 0.662586 | 0.597192 | -19.883495 | 0.774751 | 0.846457 | 0.818826 | 14.807269 | 61.287969 | -36.838683 |
| USGS-01013500 | 0.620608 | 0.480716 | 0.752852 | -22.970420 | 0.552287 | 0.871763 | 0.873238 | -14.653922 | 60.804260 | -51.751721 |
| USGS-01015800 | 0.695682 | 0.661854 | 0.764120 | -13.975070 | 0.732493 | 0.847514 | 0.862823 | -5.586971 | 43.496275 | -40.215325 |
| USGS-01017000 | 0.676277 | 0.662182 | 0.728323 | -13.596089 | 0.739596 | 0.833192 | 0.836719 | -2.559546 | 50.629818 | -40.769273 |
| USGS-01017550 | -0.007024 | 0.488803 | 0.187784 | 25.384879 | 1.159282 | 0.585860 | 0.778948 | 20.736102 | 15.696818 | -33.564579 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| USGS-14369500 | 0.576253 | 0.614541 | 0.538312 | 21.251125 | 1.297323 | 0.877457 | 0.771774 | 43.800607 | 112.890399 | 13.504782 |
| USGS-14372300 | 0.747763 | 0.854819 | 0.462288 | 5.255767 | 0.960921 | 0.870431 | 0.889055 | 48.189721 | -26.875940 | -18.711446 |
| USGS-14375100 | -1.256368 | -0.270989 | -0.519085 | 96.341803 | 1.701510 | 0.558273 | 0.330279 | 6.331031 | 735.452175 | 8.474573 |
| USGS-14377100 | 0.751601 | 0.691753 | 0.905648 | 14.723495 | 1.262068 | 0.931750 | 0.955250 | -4.176343 | 22.406697 | 19.312102 |
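
Because failed gages are dropped when the results are assembled, it can be worth checking which IDs went missing. The sketch below shows one way to do that with plain Python. The records, scores, and IDs are hypothetical, and dictionaries stand in for the `pandas` objects used in the notebook.

```python
# Hypothetical per-gage outputs; None marks a gage whose benchmark failed.
raw_results = [
    {"site_no": "USGS-A", "NSE": 0.61, "KGE": 0.58},
    None,
    {"site_no": "USGS-C", "NSE": -0.01, "KGE": 0.49},
]
requested = ["USGS-A", "USGS-B", "USGS-C"]

# Drop the failures, then index the remaining score records by site.
kept = [r for r in raw_results if r is not None]
table = {r["site_no"]: {k: v for k, v in r.items() if k != "site_no"} for r in kept}

# Any requested ID without a row in the table did not produce scores.
failed = [gage_id for gage_id in requested if gage_id not in table]
print(f"{len(table)} scored, {len(failed)} failed: {failed}")
```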


This dataframe/table can be saved to disk as a CSV. It will be used for visualizations in [other notebooks](NWM_Benchmark_Visualization.ipynb).

In [None]:
results_df.to_csv('NWM_v2.1_streamflow_benchmark.csv') ##<--- change this to a personalized filename
#ds_results.merge(benchmark_ds, join='inner').to_netcdf('nwm_v2.1_streamflow_benchmark.nc')