# National Water Model Benchmarking Workflow (traditional metrics)

**NOTE**: 
_This notebook was adapted from originals by Timothy Hodson and Rich Signell. See that upstream work at:_
* https://github.com/thodson-usgs/dscore
* https://github.com/USGS-python/hytest-evaluation-workflows/


---
See the [overview](./NWM_traditional.ipynb) in this folder for a summary of the benchmarking notebooks. 


<details>
  <summary>Guide to pre-requisites and learning outcomes...&lt;click to expand&gt;</summary>
  
  <table>
    <tr>
      <td>Pre-Requisites
      <td>To get the most out of this notebook, you should already have an understanding of these topics: 
        <ul>
        <li>pre-req one
        <li>pre-req two
        </ul>
    <tr>
      <td>Expected Results
      <td>At the end of this notebook, you should be able to: 
        <ul>
        <li>outcome one
        <li>outcome two
        </ul>
  </table>
</details>

## Essential Benchmark Components
This benchmark notebook will present a workflow which follows a canonical model for Essential Benchmark Components: 
1) A set of predictions and matching observations; 
2) The domain (e.g. space or time) over which to benchmark;
3) A set of statistical metrics with which to benchmark. 

Let's get started.

## Step 0: Load libraries and configure Python computing environment

In [None]:
# Get access to helper library
%run ../../setup.ipynb

# Python libraries we will need...
import pandas as pd
import logging
from HyTEST.benchmarks.NWMStandardSuite import NWMStandardSuite



Will this analysis need parallelism? Likely yes.

The following cell will configure a cluster environment suited to the server hosting this notebook. 

You may let our `configure_cluster()` helper try to guess the cluster config for you, or you can 
explicitly name a config matching where you are running this notebook. 

In [None]:
from HyTEST.helpers import configure_cluster
(client, cluster) = configure_cluster('local')
client

## Step 1: Load Data

Essential Benchmark Components: 
1) A set of predictions and matching observations,  <span style="color:red; font-size:large"><<--You are here</span>
2) The domain over which to benchmark 
3) A set of statistical metrics with which to benchmark. 

Finding and loading data is made easier for this particular workflow, in that most of it has been pre-processed and stored in a cloud-friendly format. That data store is indexed by an **intake catalog**. Learn more about `intake` [here](../L2/xx_Intake.ipynb).

In [None]:
import intake
try:
    # __hytest_intake_catalog_URL__ should have been defined within 'setup'... or you can override here:
    url = __hytest_intake_catalog_URL__ 
    # to over-ride, edit and un-comment the following:
    # url = r"http://path/to/your/custom/catalog.yml"
except NameError:
    # A default fall-back catalog
    url = r'https://raw.githubusercontent.com/nhm-usgs/data-pipeline-helpers/main/hytest/hytest_intake_catalog.yml'

data_catalog = intake.open_catalog(url); del url
print("Available datasets: \n", "\n".join(data_catalog.keys()))

The above list represents the processed datasets available for benchmarking.  If a dataset
you want is not in that list, you can contact Rich Signell to get it added, or load the 
data manually using [other means](../../L2/LoadingData.ipynb). If you load data from a 
source other than this list, you can jump to [Step 2: Restrict to a Domain](#step-2-restrict-to-a-domain).

Note that the datasets of interest in the catalog listing above come in pairs: some are `-onprem` 
and some are `-cloud`. The data are the same, but the storage location and access protocol differ. 
Use the copy that matches your computing environment.  
* `onprem` : Direct access via the `caldera` filesystem from _denali_ or _tallgrass_
* `cloud` : Network access via S3 bucket, suitable for consumption on cloud-hosted Jupyter servers (e.g. ESIP QHub). You could also access it from any network-attached computer, but the amount of data will likely saturate your connection.

So... are you on-prem? 

In [None]:
from HyTEST.helpers import onprem
if onprem():
    print("Yes : -onprem")  #<-- likely you are running on tallgrass or denali
    obs_data_src='nwis-streamflow-usgs-gages-onprem'
    mod_data_src='nwm21-streamflow-usgs-gages-onprem'
else:
    print("Not onprem; use '-cloud' data source")
    obs_data_src='nwis-streamflow-usgs-gages-cloud'
    mod_data_src='nwm21-streamflow-usgs-gages-cloud'

Now that we have that sorted... let's get data into memory:

In [None]:
variable_of_interest = 'streamflow'
try:
    obs = data_catalog[obs_data_src].to_dask()
    mod = data_catalog[mod_data_src].to_dask()
except KeyError:
    print("Something wrong with dataset names.")
    raise

try:
    obs_data = obs[variable_of_interest]
    mod_data = mod[variable_of_interest].astype('float32')
except KeyError:
    print(f"{variable_of_interest} was not found in these data.")
    raise  # without this, the next lines would fail with a confusing NameError

obs_data.name = 'observed'
mod_data.name = 'predicted'    

## Step 2: Restrict to a Domain

Essential Benchmark Components: 
1) A set of predictions and matching observations,  
2) The domain over which to benchmark <span style="color:red; font-size:large"><<--You are here</span>
3) A set of statistical metrics with which to benchmark. 

Each benchmark domain is defined over specific bounds (typically space and/or time). 
Benchmark domain definitions are published to Science Base, or they can be defined within the notebook. 

This notebook will use a benchmark domain definition loaded from ESIP's network file 
system (S3). It is essentially a list of stream gages in which we are interested, along with some 
metadata about each gage (watershed, reach code, etc). We will use this as a spatial selector
to restrict the data to only those gages found within this benchmarking domain.

Because we are using a standardized list, we need to fetch it from its upstream source, which is
an S3 bucket:

In [None]:
import fsspec
fs = fsspec.filesystem('s3', anon=True)
url = 's3://esip-qhub-public/usgs/hytest/streamflow_benchmark_sites_v09.csv'
try:
    domain_data = pd.read_csv(
        fs.open(url), 
        dtype={'site_no':str, 'huc_cd':str, 'reachcode':str, 'comid':str },
        index_col='site_no'
        )
except Exception:
    print(f"Could not open the file at {url}... AWS problem?")
    raise
# Re-format the gage_id/site_no string value.  ex:   "1000000"  ==> "USGS-1000000"
domain_data.rename(index=lambda x: f'USGS-{x}', inplace=True)
print(f"{len(domain_data.index)} gages in this benchmark")

So now we have a domain dataset representing the stream gages (unique `site_no` values) identifying the locations making up the spatial domain of this benchmark. While we have good meta-data for these gages (lat/lon location, HUC8 code, etc), we really will only use the list of `site_no` values to pull values out of the raw data.
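
The idea of using the `site_no` index as a spatial selector can be tried on toy data. The following is a minimal sketch, not the notebook's own selection code: the `toy_` names, the site numbers, and the list of "available" gage IDs are all made up for illustration.

```python
import pandas as pd

# Hypothetical two-gage domain table, indexed by site_no (kept as strings).
toy_domain = pd.DataFrame(
    {"huc_cd": ["01010002", "01010003"]},
    index=pd.Index(["01011000", "01013500"], name="site_no"),
)

# Same re-formatting as above: "01011000" ==> "USGS-01011000"
toy_domain = toy_domain.rename(index=lambda x: f"USGS-{x}")

# Hypothetical gage IDs that happen to be present in the streamflow data.
toy_available = ["USGS-01011000", "USGS-09999999"]

# Only gages that are both in the domain and in the data can be benchmarked.
toy_selected = toy_domain.index.intersection(toy_available)
print(list(toy_selected))
```

Only the index values are used for selection here; the other metadata columns ride along for later use.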

## Step 3: Compute Metrics

Essential Benchmark Components: 
1) A set of predictions and matching observations,  
2) The domain over which to benchmark 
3) A set of statistical metrics with which to benchmark. <span style="color:red; font-size:large"><<--You are here</span>

The code to calculate the various metrics has been standardized and moved into a 'helper' library. You can use these or write your
own. Either way, we need to put all metric computation into a single all-encompassing benchmarking function, one call of which can be assigned to each gage in our domain list. This sort of arrangement is well-suited to parallelism with `dask`. If this is done well, the process will benefit enormously from task parallelism -- each gage can be given its own CPU to run on. After all are done, the various results will be collected and assembled into a composite dataset. 

To achieve this, we need a single 'atomic' function that can execute independently. It will take the gage identifier as input and return a list of metrics.

In [None]:
## Wrapper function -- this func will be called once per gage_id, each call on its own dask worker
def compute_benchmark(gage_id):
    try:
        ## obs_data and mod_data should be globals...
        obs = obs_data.sel(gage_id=gage_id).load(scheduler='single-threaded').to_series()
        mod = mod_data.sel(gage_id=gage_id).load(scheduler='single-threaded').to_series().resample('1D', offset='5h').mean() 
        
        # make sure the indices match
        obs.index = obs.index.to_period('D')
        mod.index = mod.index.to_period('D')

        # merge obs and predictions and drop NaNs.
        gage_df = pd.merge(obs, mod, left_index=True, right_index=True).dropna(how='any')
        
        nwm = NWMStandardSuite.from_df(gage_df, 'observed', 'predicted')
        #     ^^^^^^^^^^^^^^^^ This is the 'helper' to calc all of the metrics we want.
        #     The column names passed here match the series names ('observed', 'predicted') set in Step 1.
        scores = nwm.suite()
        scores.name = gage_id
        return scores
    except Exception: #<-- this is an extremely broad way to catch exceptions.  We only do it this way to ensure 
                      #    that a failure on one benchmark (for a single stream gage) will not halt the entire run. 
        logging.info("Benchmark failed for %s", gage_id)
        return None
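
The alignment steps inside `compute_benchmark()` (resample to daily means in bins that start at 05:00, convert both indices to daily periods, merge, drop NaNs) can be checked on toy data. This is a minimal sketch under stated assumptions: the `toy_` names and values are made up, and `toy_nse()` is a hand-written Nash-Sutcliffe efficiency used only as a stand-in metric. It is not the `NWMStandardSuite` implementation, which is not shown in this notebook.

```python
import numpy as np
import pandas as pd

def toy_nse(obs, pred):
    """Nash-Sutcliffe efficiency, written out by hand as a stand-in metric."""
    return 1.0 - ((obs - pred) ** 2).sum() / ((obs - obs.mean()) ** 2).sum()

# Hypothetical daily observations; the second day is missing.
toy_obs = pd.Series(
    [1.0, np.nan, 5.0],
    index=pd.date_range("2020-01-01", periods=3, freq="D"),
    name="observed",
)

# Hypothetical hourly predictions: three 24-hour blocks, each starting at 05:00.
toy_mod = pd.Series(
    np.repeat([2.0, 3.0, 5.0], 24),
    index=pd.date_range("2020-01-01 05:00", periods=72, freq="h"),
    name="predicted",
)

# Daily means over bins that begin at 05:00 rather than midnight.
toy_mod_daily = toy_mod.resample("1D", offset="5h").mean()

# Give both series the same kind of index (one period per day).
toy_obs.index = toy_obs.index.to_period("D")
toy_mod_daily.index = toy_mod_daily.index.to_period("D")

# Keep only days where both an observation and a prediction exist.
toy_df = pd.merge(toy_obs, toy_mod_daily, left_index=True, right_index=True).dropna(how="any")

toy_score = toy_nse(toy_df["observed"], toy_df["predicted"])
print(toy_df)
print(toy_score)
```

Note that the merged columns take their names from the two series (`observed` and `predicted`), so any code that looks up columns by name has to use those same names.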

Let's test to be sure this `compute_benchmark()` function will return data for a single gage:

In [None]:
compute_benchmark('USGS-01030350')

We now need to set up a way to farm out this function, once per gage ID. `dask` will do this.  

For each gage ID in a list (the list we got in Step 2, above), `dask` will call this function with that gage 
ID as a parameter. We'll use a dask `bag` (a type of parallelizable collection), and use it to 
`map()` a function to the bag contents. 
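
The pattern of mapping one function over a list of gage IDs, with failures returned as `None` and filtered out afterwards, can be sketched without a cluster. In this sketch the standard library's `ThreadPoolExecutor` stands in for `dask`, and `toy_benchmark()` is a hypothetical placeholder for `compute_benchmark()`; the gage IDs and "scores" are made up.

```python
import logging
from concurrent.futures import ThreadPoolExecutor

def toy_benchmark(gage_id):
    """Hypothetical stand-in for compute_benchmark(): a result, or None on failure."""
    try:
        if gage_id.endswith("BAD"):
            raise ValueError("no data for this gage")
        return (gage_id, len(gage_id))  # placeholder "score"
    except Exception:
        # One failed gage should not halt the whole run.
        logging.info("Benchmark failed for %s", gage_id)
        return None

toy_gages = ["USGS-0001", "USGS-BAD", "USGS-0003"]

# One call per gage ID; pool.map() returns results in input order.
with ThreadPoolExecutor(max_workers=2) as pool:
    toy_results = list(pool.map(toy_benchmark, toy_gages))

# Drop the entries where the benchmark failed.
toy_ok = [r for r in toy_results if r is not None]
print(toy_ok)
```

Because each call depends only on its own gage ID, the calls can run in any order and on any worker, which is what makes the task easy to parallelize.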

In [None]:
import dask.bag as db

# 'site_no' is the index of domain_data (see Step 2), not a column, so take the gage IDs from the index.
# from_sequence() creates the bag with that list as its contents;
# map() connects compute_benchmark to each item, one call per gage ID.
bag = db.from_sequence(domain_data.index.tolist()).map(compute_benchmark)

results = bag.compute() # dask will not actually compute results until you ask. 

With that big task done, we don't need `dask` parallelism any more. Let's shut down the cluster:

In [None]:
client.close(); del client
cluster.close(); del cluster

## Assemble the results
The `results` list now holds the collected return values (one per call to `compute_benchmark()`). We can massage that into a table/dataframe for easier processing: 

In [None]:
results = [i for i in results if i is not None] # Drop entries where compute_benchmark failed

results_df = pd.concat(results, axis=1).T
results_df.index.name = 'site_no'
#ds_results = df_results.to_xarray()
results_df
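
To see what the `concat(...).T` step does, here is a minimal sketch on made-up inputs. The metric names and values are placeholders, not actual benchmark output; each named Series becomes one row, keyed by its `name`.

```python
import pandas as pd

# Hypothetical per-gage results; None marks a gage whose benchmark failed.
toy_results = [
    pd.Series({"NSE": 0.8, "KGE": 0.7}, name="USGS-0001"),
    None,
    pd.Series({"NSE": 0.5, "KGE": 0.4}, name="USGS-0002"),
]

toy_results = [r for r in toy_results if r is not None]

# axis=1 puts one gage per column; .T flips that to one gage per row.
toy_results_df = pd.concat(toy_results, axis=1).T
toy_results_df.index.name = "site_no"
print(toy_results_df)
```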

This dataframe can be saved to disk -- it will be used for visualizations in other notebooks.

In [None]:
results_df.to_csv('NWM_v2.1_streamflow_benchmark.csv')
#ds_results.merge(benchmark_ds, join='inner').to_netcdf('nwm_v2.1_streamflow_benchmark.nc')

In [None]:
print(fingerprint())