# Analysis - Streamflow Drought
## Streamflow Drought Benchmarking Tutorial

<img src='../../../doc/assets/Eval_Analysis.svg' width=600>


## Essential Benchmark Components
This benchmark notebook will present a workflow which follows a canonical model for Essential Benchmark Components: 
1) A set of predictions and matching observations; 
2) The domain (e.g. space or time) over which to benchmark;
3) A set of statistical metrics with which to benchmark. 

#### Let's get started.

## Step 0: Load required libraries

In [None]:
import pandas as pd
import numpy as np
import logging
import os
import intake
from sklearn.metrics import cohen_kappa_score

## Step 1: Load Streamflow Data

Finding and loading data is made easier for this particular workflow (the _streamflow_ variable), in that most of it has been 
pre-processed and stored in a cloud-friendly format.  That process is described in the [data preparation](01_Data_Prep.ipynb)
notebook. We will proceed here using the already-prepared data for _streamflow_, which is included in the HyTEST **intake catalog**. You can learn more about the python library `intake` [here](../../../dataset_catalog/README.md).


In [None]:
# preview what is in the HyTEST Catalog (if you have a dataset you'd like to add, please reach out on the HyTEST Github Page).
hytest_cat = intake.open_catalog(r'https://raw.githubusercontent.com/hytest-org/hytest/main/dataset_catalog/hytest_intake_catalog.yml')
list(hytest_cat)

The above list represents the processed datasets available for benchmarking and exploring in general.  
If a dataset you want is not in that list, you can load data directly from another source.

If you load data from a source other than this list, you can jump to **Step 2: Restrict to a Domain**

Note that the datasets in the catalog above are duplicated: some are `-onprem`, some are `-cloud`, and some are `-osn`. The data are the same, but the storage location and access protocol differ. You will definitely want to use the correct copy of the data for your computing environment. The idea is to "bring the compute to the data" to reduce cost and increase efficiency.
* `onprem` : Direct access via the USGS internal `caldera` filesystem from _denali_ or _tallgrass_, and now on _hovenweep_.
* `osn` : Network access via OSN pod, which uses the S3 API, suitable for consumption on any jupyter server (onprem, cloud, or local)
* `cloud` : Network access via S3 bucket, suitable for consumption on cloud-hosted jupyter servers. You could also access using any network-attached computer, but the amount of data will likely saturate your connection.  Use in the cloud is okay (e.g. ESIP Nebari or HyTEST Nebari).

Before we select a dataset from the main catalog, we will explore the subcatalogs for pre-processed streamflow data:

In [None]:
# Accessing HyTEST subcatalogs to find pre-processed datasets for benchmarking and benchmarking results
bench_cat = hytest_cat['benchmarks-catalog']
print('Benchmarks Catalog within HyTEST Catalog:')
list(bench_cat)

In [None]:
# Accessing HyTEST subcatalogs to find pre-processed datasets for streamflow benchmarking and benchmarking results
stream_cat = bench_cat['streamflow-benchmarks-catalog']
print('Streamflow Catalog within Benchmarks Catalog under the HyTEST Catalog Umbrella:')
list(stream_cat)

In [None]:
# Let's explore what this dataset is! Looks like modeled output from NHM-PRMS version 1.0 forced with Daymet (byMuskObs calibration)
stream_cat['nhmprms-conus-v1_0-daymetv3-byhrumuskobs-osn']

We see under the description that this is daily streamflow data from the particular modeling application we suspected. This is a subset of what is available in the model archive (see [Hay and LaFontaine, 2020](https://doi.org/10.5066/P9PGZE0S)) based on the description. 

##### For now we will work with National Water Model v2.1 data because it exists on OSN and in zarr format (same format as our pre-saved observational data).

In [None]:
# National Water Model version 2.1 on OSN
hytest_cat['nwm21-streamflow-usgs-gages-osn']

The observational and modeled datasets of interest exist as `-onprem` and `-osn` copies, as we can see from the dataset labels in `hytest_cat` above. If you are working on-prem, set the `onprem` variable to `True` to access data close to the compute.

In [None]:
onprem = False # change to True if on the supercomputer(s)

if onprem:
    print("Yes : -onprem")
    obs_data_src='nwis-streamflow-usgs-gages-onprem'
    mod_data_src='nwm21-streamflow-usgs-gages-onprem'
else:
    print("Not onprem; use '-osn' data source")
    obs_data_src='nwis-streamflow-usgs-gages-osn'
    mod_data_src='nwm21-streamflow-usgs-gages-osn'
print("Observed : ", obs_data_src)
print("Modeled  : ", mod_data_src)

Next we identify the variable of interest (`variable_of_interest = 'streamflow'`) from our datasets. Depending on the data you use, you may need to change this to match the variable of interest to you.

In [None]:
variable_of_interest = 'streamflow'

try:
    obs = hytest_cat[obs_data_src].to_dask()
    mod = hytest_cat[mod_data_src].to_dask()
    
except KeyError:
    print("Something wrong with dataset names.")
    raise

try:
    obs_data = obs[variable_of_interest]
    mod_data = mod[variable_of_interest].astype('float32')
except KeyError:
    print(f"{variable_of_interest} was not found in these data.")
    raise  # stop here; obs_data and mod_data would be undefined below

obs_data.name = 'observed'
mod_data.name = 'modeled'

In [None]:
# preview observational data, check units on the variable streamflow, check timestep.
obs
#obs_data

## Uncomment below to preview modeled data
#mod

## Step 2: Restrict to a domain (spatial and temporal)

Each benchmark domain is defined over specific bounds (typically space and/or time). 
Benchmark domain definitions can be defined within the notebook, or sourced from 
elsewhere. For example, the _Cobalt_ gages, which are available for download on ScienceBase ([Foks et al., 2022](https://doi.org/10.5066/P972P42Z)), are used for the general streamflow benchmarking tutorials also available within the HyTEST repository. 
 
For this example, we use the set of gages from a data release, [Simeone et al. (2024)](https://doi.org/10.5066/P9P4DHZE), available for download on ScienceBase 
[here](https://doi.org/10.3390/w16202996). These gages are a subset of the _Cobalt_ gages [(Foks et al., 2022)](https://doi.org/10.5066/P972P42Z), but have been selected 
for their longer period of record, which is necessary for streamflow drought evaluation. We will use this as a spatial selector to restrict the observed and modeled data to only the gages found within this benchmarking domain.

Reference:
- Simeone, C.E., Staub, L.E., Kolb, K.R., and Foks, S.S., 2024, Results of benchmarking National Water Model v2.1 simulations of streamflow drought duration, severity, deficit, and occurrence in the conterminous United States: U.S. Geological Survey data release, https://doi.org/10.5066/P9P4DHZE

In [None]:
## This is coming from ScienceBase, so if there is an error - it could be USGS ScienceBase is offline for repair.
sites = pd.read_csv(
    'https://www.sciencebase.gov/catalog/file/get/6410bbc9d34e22162d3e163e?f=__disk__01%2F51%2F49%2F015149051bb29d52b78cbff550385696bc5bbf31',
    dtype={'site':str, 'class':str, 'region':str}, 
    index_col='site')

# Re-format the gage_id/site_no string value.  ex:   "1000000"  ==> "USGS-1000000"
sites.rename(index=lambda x: f'USGS-{x}', inplace=True)
print(f"{len(sites.index)} gages in this benchmark")
sites.head(4)

So now we have a domain dataset representing the streamgages (unique `site` values) that identify the locations making up the spatial domain of this benchmark. While we have good metadata for these gages (lat/lon location, aridity, etc.), we will only use the list of `site` values to select locations out of the raw data. This selection happens in our function `compute_benchmark` below.
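As a minimal sketch of what this spatial selection amounts to, the domain list can be intersected with the gage identifiers that the data actually contain. The identifiers below are illustrative placeholders rather than values read from the catalog, and the variable names are hypothetical.

```python
import pandas as pd

# Illustrative placeholders: gage IDs in the benchmark domain and in the data.
domain_sites = pd.Index(["USGS-01011000", "USGS-01030350", "USGS-99999999"])
available_gages = pd.Index(["USGS-01011000", "USGS-01013500", "USGS-01030350"])

# Gages that are in the domain and also present in the data.
selected = domain_sites.intersection(available_gages)

# Domain gages with no matching data; these would fail in a per-gage benchmark.
missing = domain_sites.difference(available_gages)

print("selected:", sorted(selected))
print("missing :", sorted(missing))
```

Checking for missing gages up front is optional, but it makes per-gage failures later in the workflow easier to interpret.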

### Restrict temporal range
The temporal range can be restricted using `slice` (example shown below). For this tutorial, we restrict the temporal range in our main function `compute_benchmark` below.

```
start_date = "1984-04-01"
end_date = "2016-03-31"
obs_data = obs_data.sel(time = slice(start_date, end_date))
mod_data = mod_data.sel(time = slice(start_date, end_date))
```

## Step 3: Define Metrics

The code to calculate the various metrics has been standardized 
in `hytest/evaluation/Metrics_Streamflow_Drought_v1.ipynb`. 

You can use these functions or write your own.  To import and use these standard definitions, we run the following snippet in the next code cell:

```
%run ../../Metrics_Streamflow_Drought_v1.ipynb
```

In [None]:
# source the ipynb in this one to access functions
%run ../../Metrics_Streamflow_Drought_v1.ipynb

Whether you use these functions or your own, we need to put all metric computation into a special all-encompassing 
benchmarking function--a single call which can be assigned to each gage in our domain list. This sort of arrangement 
is well-suited to parallelism, especially with `dask`. 

If this is done well, the process will benefit enormously from task parallelism -- each gage can be given its own 
CPU to run on.  After all are done, the various results will be collected and assembled into a composite dataset. To achieve this, we need a single 'atomic' function that can execute independently. It will take the gage identifier as input and return the list of metrics specified in the function below.
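The pattern described above can be sketched with the standard library alone. This is an illustration of task parallelism in general, not the notebook's `dask` setup: `toy_benchmark` and `run_all` are hypothetical names, and a thread pool stands in for a cluster of workers.

```python
from concurrent.futures import ThreadPoolExecutor


def toy_benchmark(gage_id):
    """Hypothetical stand-in for a per-gage benchmark function."""
    try:
        if not gage_id.startswith("USGS-"):
            raise ValueError(f"unexpected gage id: {gage_id}")
        # Real work (loading data, computing metrics) would go here.
        return {"gage_id": gage_id, "n_chars": len(gage_id)}
    except Exception:
        # A failure for one gage returns None instead of stopping the run.
        return None


def run_all(gage_ids, max_workers=4):
    """Map the per-gage function over all gages and keep successful results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(toy_benchmark, gage_ids))
    return [r for r in results if r is not None]


demo_results = run_all(["USGS-01011000", "bad-id", "USGS-01030350"])
print(demo_results)
```

Because each call depends only on its own `gage_id`, the calls can run in any order and on any worker, which is what makes the function "atomic" in the sense used here.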

## Step 4: Experiment For One Gage

We will try out some of the functions that are sourced above. We will identify a gage in our dataset, a beginning and end date, and units (very important).

In [None]:
# Define gage id
gage_id = "USGS-01011000"

# Define start and end dates
start_date = "1984-04-01"
end_date = "2016-03-31"

# Load each timeseries to a pandas series, trim time, resample if necessary
obs = (obs_data.sel(gage_id=gage_id, time=slice(start_date, end_date)).load(scheduler='single-threaded').to_series())
mod = (mod_data.sel(gage_id=gage_id, time=slice(start_date, end_date)).
       load(scheduler='single-threaded').to_series().
       resample('1D', offset='5h').mean()) # Note time shift here is specific to National Water Model v2.1 data.

# Make sure the indices match
obs.index = obs.index.to_period('D')
mod.index = mod.index.to_period('D')

# Merge obs and predictions; drop NaNs.
gage_df = pd.merge(obs, mod, left_index=True, right_index=True).dropna(how='any')
gage_df.head()

Many metrics are sensitive to, or easily influenced by, higher streamflows or non-normal distributions. In drought studies, it is common to use evaluation metrics on standardized drought indices or streamflow percentiles instead of directly on streamflow values [(Van Loon, 2015)](https://doi.org/10.1002/wat2.1085). Here we transform the streamflow data into percentiles according to a 'fixed' and a 'variable' threshold system for identifying drought periods. We use the 5th, 10th, 20th, and 30th percentiles.
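To make the fixed and variable distinction concrete, here is a minimal sketch using the Weibull plotting position, r / (n + 1). It assumes that "fixed" means ranking each day against the whole record and "variable" means ranking each day only against the same day of year. The function names and the synthetic flow series are hypothetical, and the standardized `calculate_percentiles` function may differ in its details.

```python
import numpy as np
import pandas as pd


def weibull_percentile(values):
    """Weibull plotting position, 100 * r / (n + 1), with r the ascending rank."""
    ranks = values.rank(method="average")
    return 100.0 * ranks / (values.count() + 1)


def toy_percentiles(flow):
    """Return fixed and variable percentiles for a daily flow series."""
    out = pd.DataFrame({"flow": flow})
    # Fixed: rank each day against the entire period of record.
    out["pct_fixed"] = weibull_percentile(flow)
    # Variable: rank each day only against the same day of year in all years.
    by_day = flow.groupby(flow.index.dayofyear)
    out["pct_variable"] = by_day.transform(weibull_percentile)
    return out


# Synthetic four-year daily series, for illustration only.
dates = pd.date_range("2001-01-01", "2004-12-31", freq="D")
rng = np.random.default_rng(0)
flow = pd.Series(rng.gamma(2.0, 5.0, len(dates)), index=dates, name="observed")

pct = toy_percentiles(flow)
print(pct.head())
```

With only four years of synthetic data the variable percentiles take very few distinct values, which illustrates why a long period of record matters for this method.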

In [None]:
# Percentiles
df_pct = calculate_percentiles(gage_df, units = "cms")
df_pct.head()
len(df_pct) # length of the timeseries

This next function expands drought events into a `True`/`False` timeseries of drought/non-drought.
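A minimal sketch of the flagging idea, assuming that a day counts as drought when its percentile is strictly below the threshold (the sourced function may use a different comparison). `flag_drought` is a hypothetical name.

```python
import pandas as pd


def flag_drought(percentiles, thresholds=(5, 10, 20, 30)):
    """Return one boolean column per threshold: True where the percentile is below it."""
    return pd.DataFrame({f"below_{t}": percentiles < t for t in thresholds})


# Illustrative percentile values for five days.
pct_values = pd.Series([2.0, 8.0, 15.0, 25.0, 60.0])
flags = flag_drought(pct_values)
print(flags)
```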

In [None]:
# Boolean threshold - flagging whether the timeseries is below the threshold
df_bool = calculate_site_boolean_threshold_only(df_pct)
df_bool.head()
# len(df_bool)  # This is the timeseries * 4 pct_types (fixed and variable, for obs and mod) * 4 threshold percentiles (5, 10, 20, 30)

Next we calculate Cohen's kappa [(Cohen, 1960)](https://doi.org/10.1177/001316446002000104) to understand how well our models are picking up periods of drought. Cohen's kappa measures the agreement between categorical variables while accounting for agreement expected by chance. [Landis and Koch (1977)](https://doi.org/10.2307/2529310) provide a guide for interpreting Cohen's kappa values. 
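For two binary series, kappa is (p_o - p_e) / (1 - p_e), where p_o is the observed fraction of agreement and p_e is the agreement expected by chance. The hand-written sketch below shows that arithmetic for drought/non-drought flags. It is an illustration only, `cohens_kappa_binary` is a hypothetical name, and the sourced `site_cohens_kappa` function may be implemented differently (this notebook imports `cohen_kappa_score` from scikit-learn).

```python
def cohens_kappa_binary(obs_flags, mod_flags):
    """Cohen's kappa for two equal-length sequences of True/False flags."""
    n = len(obs_flags)
    if n == 0 or n != len(mod_flags):
        raise ValueError("inputs must be non-empty and of equal length")

    # Observed agreement: fraction of days where both series give the same flag.
    p_o = sum(o == m for o, m in zip(obs_flags, mod_flags)) / n

    # Chance agreement from the marginal drought frequencies of each series.
    p_obs = sum(obs_flags) / n
    p_mod = sum(mod_flags) / n
    p_e = p_obs * p_mod + (1 - p_obs) * (1 - p_mod)

    if p_e == 1.0:
        return float("nan")  # kappa is undefined when chance agreement is total
    return (p_o - p_e) / (1 - p_e)


# Ten days: two observed drought days, two modeled drought days, one overlap.
obs_flags = [True, True, False, False, False, False, False, False, False, False]
mod_flags = [True, False, True, False, False, False, False, False, False, False]
kappa = cohens_kappa_binary(obs_flags, mod_flags)
print(round(kappa, 3))
```

Here the raw agreement is 80%, but kappa is 0.375 because most of that agreement comes from the many non-drought days. On the Landis and Koch scale, 0.375 falls in the "fair" band (0.21 to 0.40).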

In [None]:
# Cohen's Kappa
df_accuracy = site_cohens_kappa(df_bool, gage_id)
df_accuracy

# 'site' = fixed
# Julian Day ('jd') = variable

Next we calculate a drought-focused Spearman's correlation, which starts out like the standard Spearman’s r: the observed and modeled streamflow are each ranked by magnitude across the entire study period. After calculating ranks, however, we subset the data to periods with observed droughts at the various threshold levels of interest. For example, we examine the lowest 20% of flow values for the 20th percentile drought threshold.
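The rank-then-subset order can be sketched as follows. This is a simplified illustration under stated assumptions: the subset is chosen from the fixed (whole-record) Weibull percentile of the observed flow, with an inclusive comparison, and `drought_spearman` is a hypothetical name rather than the sourced `spearman_r` function.

```python
import pandas as pd


def drought_spearman(df, threshold):
    """Correlate whole-record ranks, using only days with low observed flow."""
    # Step 1: rank observed and modeled flow across the entire record.
    ranks = df[["observed", "modeled"]].rank()
    # Step 2: percentile of the observed flow (Weibull plotting position).
    obs_pct = 100.0 * ranks["observed"] / (len(df) + 1)
    # Step 3: keep only days at or below the drought threshold (assumed inclusive).
    subset = ranks[obs_pct <= threshold]
    # Step 4: Pearson correlation of the retained ranks.
    return subset["observed"].corr(subset["modeled"])


# Synthetic example: the model preserves the ordering of the observations.
toy = pd.DataFrame({"observed": range(1, 100)})
toy["modeled"] = toy["observed"] * 2.0
r_same_order = drought_spearman(toy, threshold=20)

# A model that reverses the ordering gives the opposite result.
toy_reversed = toy.assign(modeled=-toy["observed"])
r_reversed = drought_spearman(toy_reversed, threshold=20)
print(r_same_order, r_reversed)
```

Ranking before subsetting means a model is scored on where its low flows sit within the full record, not only on their ordering within the drought days.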

In [None]:
# Calculate Spearmans R
df_spearmans = spearman_r(df_pct, gage_id, thresholds = [5,10,20,30])
df_spearmans

Next, the bias, percent bias, and ratio of standard deviations (rSD) are calculated between the observed flows below the drought threshold and the modeled flows below the drought threshold (using percentiles, not volume). Bias has units of flow, percent bias has units of percent, and rSD is unitless.
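For reference, commonly used definitions of these three quantities are sketched below. These formulas are assumptions for illustration: the sign convention (modeled minus observed) and the sample standard deviation are choices made here, and the sourced `bias_dist` function may define them differently. `toy_bias_metrics` is a hypothetical name.

```python
import numpy as np


def toy_bias_metrics(obs, mod):
    """Bias, percent bias, and ratio of standard deviations (modeled vs observed)."""
    obs = np.asarray(obs, dtype=float)
    mod = np.asarray(mod, dtype=float)
    diff = mod - obs  # assumed sign convention: positive means the model is too high
    return {
        "bias": diff.mean(),
        "pbias": 100.0 * diff.sum() / obs.sum(),
        "rsd": mod.std(ddof=1) / obs.std(ddof=1),
    }


# Illustrative values: the model is 20% too high everywhere.
metrics = toy_bias_metrics(obs=[10.0, 20.0, 30.0], mod=[12.0, 24.0, 36.0])
print(metrics)
```

In this example the bias is 4 (in the units of the inputs), the percent bias is 20, and rSD is 1.2, meaning the modeled values are both too high and more variable than the observed values.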

In [None]:
# Calculate bias 
df_bias_dist = bias_dist(df_pct, gage_id, thresholds = [5,10,20,30])
df_bias_dist

In [None]:
## Calculate site properties
properties, annual_stats = calculate_site_properties(df_pct, gage_id, 
                                                     thresholds = [5, 10, 20 , 30], 
                                                     percent_type_list = ['weibull_jd_obs', 'weibull_jd_mod', 'weibull_site_obs', 'weibull_site_mod'],
                                                     flow_name_list = ['observed', 'modeled', 'observed', 'modeled'], 
                                                     start_cy = 1985, end_cy = 2016)

#annual_stats.to_csv('01011000_annual_stats.csv')
#properties.to_csv('01011000_properties.csv')
properties.head()

In [None]:
print('columns:', annual_stats['measure'].unique())

In [None]:
annual_stats.head()

Finally we calculate the normalized mean absolute error (NMAE) between observed and modeled drought signatures: duration, severity, and intensity. 

For each timeseries, drought characteristics are examined on a Climate Year (Apr 1 - Mar 31) basis for each percentile threshold and fixed/variable method, so each year has a value. We then compute the NMAE between the observed and modeled values of these drought characteristics (column `measure` below).
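The sketch below illustrates both ideas. The climate year labeling follows the dates used in this notebook, where the year beginning 1984-04-01 is climate year 1985. The NMAE normalization (dividing by the mean of the observed values) is an assumption made for illustration, since NMAE can be normalized in more than one way, and the sourced `annual_signatures` function may differ. Both function names are hypothetical.

```python
import pandas as pd


def climate_year(dates):
    """Label dates with the climate year (Apr 1 - Mar 31), named for the year it ends."""
    dates = pd.DatetimeIndex(dates)
    # Dates from April onward belong to the climate year ending next calendar year.
    return dates.year + (dates.month >= 4).astype(int)


def nmae(obs, mod):
    """Mean absolute error divided by the mean of the observations (assumed normalization)."""
    obs = pd.Series(obs, dtype=float)
    mod = pd.Series(mod, dtype=float)
    return (mod - obs).abs().mean() / obs.mean()


cy = climate_year(["1984-04-01", "1985-03-31", "2016-03-31"])
print(list(cy))

# Illustrative annual drought durations (days) for three climate years.
error = nmae(obs=[10, 20, 30], mod=[12, 18, 36])
print(round(error, 4))
```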

In [None]:
## Calculate error on drought signatures
sum_df = annual_signatures(annual_stats, gage_id)
subset_df = sum_df[sum_df['measure'] == 'maximum_drought_intensity']
subset_df

## Step 5: Define a function for multiple gages

In [None]:
## Wrapper function -- this func will be called once per gage_id, each call on its own dask worker
def compute_benchmark(gage_id):
    try:
        # restrict temporal range (if needed) - retain data only between start and end dates
        # note that observational data may be more sparse than modeled data and may not span every day
        # for streamflow drought, we focus on Climate Years (Apr 1 - Mar 31)
        start_date = "1984-04-01"
        end_date = "2016-03-31"

        # Daily observational data
        obs = (obs_data
               .sel(gage_id=gage_id, time=slice(start_date, end_date))
               .load(scheduler='single-threaded')
               .to_series())
        
        mod = (mod_data
               .sel(gage_id=gage_id, time=slice(start_date, end_date))
               .load(scheduler='single-threaded')
               .to_series()
               .resample('1D', offset='5h').
               mean()) # Note time shift here is specific to National Water Model v2.1 data.

        # make sure the indices match
        obs.index = obs.index.to_period('D')
        mod.index = mod.index.to_period('D')

        # merge obs and predictions; drop NaNs.
        gage_df = pd.merge(obs, mod, left_index=True, right_index=True).dropna(how='any')

        # streamflow percentiles are computed with the Weibull plotting position, r/(n + 1),
        # where r is the rank and n is the number of data values
        # see sourced metrics notebook above for further detail
        df_pct = calculate_percentiles(gage_df, units = "cms")
        filename = f'percentiles_{gage_id}.csv'
        file_path = os.path.join(output_folder, filename)
        df_pct.to_csv(file_path, index=False)

        # Boolean threshold
        df_bool_threshold_only = calculate_site_boolean_threshold_only(df_pct)
        filename = f'drought_boolean_{gage_id}.csv'
        file_path = os.path.join(output_folder, filename)
        df_bool_threshold_only.to_csv(file_path, index=False)  # write the boolean table, not df_pct

        # Cohen's Kappa
        df_accuracy = site_cohens_kappa(df_bool_threshold_only, gage_id)
        filename = f'cohens_kappa_{gage_id}.csv'
        file_path = os.path.join(output_folder, filename)
        df_accuracy.to_csv(file_path, index=False)

        # Calculate Spearmans R
        df_spearmans = spearman_r(df_pct, gage_id, thresholds = [5,10,20,30])
        filename = f'spearman_{gage_id}.csv'
        file_path = os.path.join(output_folder, filename)
        df_spearmans.to_csv(file_path, index=False)

        # Calculate bias 
        df_bias_dist = bias_dist(df_pct, gage_id, thresholds = [5,10,20,30])
        filename = f'bias_dist_{gage_id}.csv'
        file_path = os.path.join(output_folder, filename)
        df_bias_dist.to_csv(file_path, index=False)

        # Add or remove decomposition functions below
        return gage_id
        
    except Exception as e:
        # This is an extremely broad way to catch exceptions. We only do it this way to ensure
        # that a failure on one benchmark (for a single stream gage) will not halt the entire run.
        # Log at WARNING level so the message is visible with the default logging configuration.
        logging.warning("Benchmark failed for %s: %s", gage_id, e)
        return None

Let's test to be sure this `compute_benchmark()` function will return data for a single gage

In [None]:
# Create output directory
output_folder = 'output'
os.makedirs(output_folder, exist_ok=True)

In [None]:
compute_benchmark('USGS-01011000')

In [None]:
# Convert our dataframe of sites to a list
site_list = sites.index.tolist()
site_list[1:3]

# Run function for several sites
for site_id in site_list[1:3]:
    print(site_id)
    compute_benchmark(site_id)

### Note: the section below is 'coming soon to a theater near you'
Unfortunately, the dask cluster generation isn't working for us on Nebari at the moment. 

## Execute the Analysis 
We will be doing a lot of work in parallel, using workers within a 'cluster'.  
The details of cluster configuration are handled for us by 'helper' notebooks, below. 
You can override their function by doing your own cluster configuration if you like. 

In [None]:
## uncomment the lines below to read in your AWS credentials if you want to access data from a requester-pays bucket (-cloud)
# os.environ['AWS_PROFILE'] = 'osn-hytest-scratch' # 'default'
# %run ../../../environment_set_up/Help_AWS_Credentials.ipynb

In [None]:
## Start up a distributed cluster of workers
## NOTE: This cluster configuration is VERY specific to the JupyterHub cloud environment on USGS internal HyTEST Nebari or ESIP Nebari
## For other dask cluster configurations, see other 'Start_Dask' notebooks under hytest/environment_set_up
#%run ../../../environment_set_up/Start_Dask_Cluster_Nebari.ipynb

We verified above that the `compute_benchmark` function works on the "hosted" server (where this
notebook is being executed). As a sanity check before we give the cluster of workers a lot 
to do, let's verify that we can have a remote worker process a gage by submitting work
to one in isolation: 

In [None]:
#client.submit(compute_benchmark, 'USGS-01030350').result()

Now that we've got a benchmark function, and can prove that it works in remote workers 
within the cluster, we can dispatch a fleet of workers to process our data in parallel.
We will make use of `dask` to do this using a dask '_bag_'.  
>Read more about task parallelism with Dask and how we are using dask bags [here](../../../essential_reading/Parallel_Dask.ipynb)


In [None]:
# Set up a dask bag with the contents being a list of the streamflow gages used by Simeone et al (2024).
#import dask.bag as db

# For the first 100 gages within the streamflow gage list, try to map the compute benchmark function over them.
# Remove '[0:100]' to run full list (may take awhile depending on cores/workers available)
#bag = db.from_sequence(sites.index.tolist()[0:100]).map(compute_benchmark)
#results = bag.compute() 

# preview first few gages with results
#print("Number of gages:", len(results))
#results[0:2]

With that big task done, we don't need `dask` parallelism any more. Let's shut down the cluster - this is important to do so others can use the compute and so we don't spend additional funds without a purpose.

In [None]:
#client.close(); del client
#cluster.close(); del cluster

## Assemble the results
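The `results` list now contains one return value per call to `compute_benchmark()`: the gage ID on success, or `None` on failure. The metrics themselves are in the per-gage CSV files written to the output folder, and these can be combined into a single table/dataframe for easier processing.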

This dataframe/table can be saved to disk as a CSV and can be used for visualizations!
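One way to build such a table is sketched here, assuming the per-gage files follow the `<prefix>_<gage_id>.csv` naming used in `compute_benchmark`. The helper `assemble_results` is hypothetical and is not part of the HyTEST metrics notebook.

```python
import glob
import os

import pandas as pd


def assemble_results(output_folder, prefix):
    """Concatenate all '<prefix>_<gage_id>.csv' files in a folder into one dataframe."""
    pattern = os.path.join(output_folder, f"{prefix}_*.csv")
    frames = []
    for path in sorted(glob.glob(pattern)):
        # Recover the gage ID from the file name, e.g. 'bias_dist_USGS-01011000.csv'.
        gage_id = os.path.basename(path)[len(prefix) + 1:-len(".csv")]
        frame = pd.read_csv(path)
        frame["gage_id"] = gage_id
        frames.append(frame)
    if not frames:
        return pd.DataFrame()  # nothing written yet for this prefix
    return pd.concat(frames, ignore_index=True)


# Example: gather the bias results written to the 'output' folder.
results_df = assemble_results("output", "bias_dist")
print(len(results_df), "rows assembled")
```

The same call with a different prefix (for example `spearman` or `cohens_kappa`) gathers the other metric files.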

In [None]:
# results_df.to_csv('example_streamflow_drought_benchmarking_results.csv') ##<--- change this to a personalized filename if desired