# Model Evaluation :: Data Preparation

As a part of the generalized evaluation workflow: 

<img src='./Eval_PreProc.svg' width=600>

The pre-processing step is needed in order to align the two datasets for analysis.  The specific 
steps needed to prepare a given dataset may differ, depending on the source and the variable of
interest. 

Some steps might include: 

* Organizing the time-series index such that the time steps for both simulated and observed are congruent
    * This may involve interpolation to estimate a more granular time-step than is found in the source data
    * More often, an aggregating function is used to 'down-sample' the dataset to a coarser time step (e.g. days instead of hours).
* Coordinating aggregation units between simulated and observed
    * Gridded data may be sampled per HUC-12, HUC-6, etc. to match modeled data indexed by these units. 
    * Index formats may be adjusted (e.g. a 'gage_id' may be 'USGS-01104200' in one data set, vs '01104200' in another)
* Re-Chunking the data to make time-series analysis more efficient (see [here](/dev/null) for a primer on re-chunking).
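The first two kinds of adjustment can be sketched in plain Python. This is only an illustration under stated assumptions, not the method used later in this notebook: the helper names `normalize_gage_id` and `daily_mean` are hypothetical, the hourly values are made up, and a real workflow would more likely rely on the resampling tools of the data library in use.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean


def normalize_gage_id(gage_id, prefix="USGS-"):
    """Return the ID with a single 'USGS-' prefix (hypothetical helper)."""
    gage_id = gage_id.strip()
    return gage_id if gage_id.startswith(prefix) else prefix + gage_id


def daily_mean(hourly):
    """Down-sample (timestamp, value) pairs to one mean value per calendar day."""
    by_day = defaultdict(list)
    for stamp, value in hourly:
        by_day[stamp.date()].append(value)
    return {day: mean(values) for day, values in sorted(by_day.items())}


# Made-up sample: 48 hourly values covering two calendar days.
start = datetime(2020, 1, 1)
hourly = [(start + timedelta(hours=h), float(h)) for h in range(48)]
daily = daily_mean(hourly)

# Both spellings of the same gage end up with the same index value.
ids = [normalize_gage_id(g) for g in ("USGS-01104200", "01104200")]
```

Here `daily` holds two entries, one per day, and both elements of `ids` are `'USGS-01104200'`, so the two datasets can be joined on a common index.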

## Streamflow Data Prep

This document shows one approach to preparing the _streamflow_ data for subsequent analysis, which is outlined [here](./02_Analysis_NWM.ipynb).

Streamflow analysis will compare time-series of two aligned datasets: 
* 'observed' data values obtained from [NWIS](https://nwis.waterdata.usgs.gov/nwis) 
* 'modeled' data extracted from the [NWM](https://registry.opendata.aws/nwm-archive/)

These data sources are accessed using different methods.  We will pull data from their respective sources, reshape and optimize the data structures, then write that data to storage to make later analysis easier. 

## Sourcing the 'modeled' Data
Modeled data for this tutorial will be sourced from the S3 bucket `nhgf-development`.

In [1]:
import os
import fsspec
import dask
import xarray as xr

os.environ['AWS_PROFILE'] = 'nhgf-development'  # credentials are required
fs = fsspec.filesystem('s3') # create reference to a S3 filesystem driver.

modeled = xr.open_zarr(
    fs.get_mapper('s3://nhgf-development/nwm/chanobs.zarr'),  #<<< example data source of NWM2.1  CHANOBS_DOMAIN
    consolidated=False, 
    chunks={}, 
    drop_variables='velocity' #<< not using this var; just 'streamflow'
)
modeled

Output (abridged summary of the rendered dataset table):

* 1-D per-gage arrays (float32 or int32): shape (7994,), each held in a single chunk of 31.23 kiB.
* 2-D array (float64): shape (367439, 7994), 21.88 GiB in total, chunked as (367439, 1), giving 7994 chunks of 2.80 MiB each.



**Source Data as Template**

This source data set will establish the indices and boundaries for the data we will eventually pull from the NWIS stream gage network.
The two dimensions of this data are the **Gage ID** and **Time**.  We'll use these dimensions to fetch the 'observed' data later. The 
endpoints and range of these dimensions will establish that future query.    

In [2]:
## Gage IDs
import re
_gages = [gage_id.lstrip() for gage_id in modeled['gage_id'].values.astype('str')]
GAGES = [g for g in _gages if re.search(r'USGS-\d+$', g)]  # raw-string regex: 'USGS-' followed only by digits to the end

# Time bounds
start_time = modeled.time.values.min()
stop_time = modeled.time.values.max()
DATE_RANGE=(start_time, stop_time)

Some important notes for each of those bounds: 
* **GAGES** / Gage IDs :: The list of 7994 gage IDs in the model dataset includes some values which NWIS does not 
  recognize and will not accept. We need to remove them. 
  * Gage IDs of the form `USGS-\d+` (a string starting with 'USGS-' and ending in an arbitrary number of digits)
    are processed by NWIS data requests.
  * There are roughly 350 gage IDs in the modeled dataset with letters embedded in the string of digits after the 'USGS-'. These
    will be rejected when we try to call the NWIS data service to obtain streamflow history for that location.
  * This is the reason behind the regular expression search (`re.search`) to select only gage IDs of the correct format. 
  * After selecting the NWIS-compliant gage IDs, the `GAGES` list contains 7647 gages.
* **DATE_RANGE** / Dates :: This defines the temporal range for the historical data we will fetch from NWIS. 
  * The NWM modeled data includes time values stepped hourly.
  * The historical streamflow data is stepped daily.
  * We will resample later to make sure the indices match. 
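To see what the gage ID filter keeps and drops, here is a small sketch on a handful of IDs. Only `USGS-01104200` comes from this document; the other sample IDs are invented for illustration, and we do not know whether IDs of those exact shapes occur in the modeled dataset. The sketch also compares `re.search` with the stricter `re.fullmatch`, since `re.search` does not require the match to begin at the start of the string.

```python
import re

# 'USGS-01104200' appears earlier in this document; the rest are invented samples.
sample_ids = [
    "USGS-01104200",
    "  USGS-01104200",   # leading whitespace, removed by lstrip() below
    "USGS-0110A200",     # letter embedded in the digits
    "CA-01104200",       # different prefix
    "xUSGS-01104200",    # extra character before the prefix
]

cleaned = [s.lstrip() for s in sample_ids]

# Same pattern as the notebook cell; the raw string keeps \d from being
# treated as a string escape sequence.
searched = [s for s in cleaned if re.search(r"USGS-\d+$", s)]

# Stricter alternative: the whole string must match the pattern.
strict = [s for s in cleaned if re.fullmatch(r"USGS-\d+", s)]
```

`searched` keeps three IDs, including `xUSGS-01104200`, while `strict` keeps only the two well-formed ones. If stray leading characters are a possibility in a given dataset, `re.fullmatch` may be the safer choice.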

## Sourcing the 'observed' data

Now that we have information about the list of gages and the date range covered by the model, we 
can use that to query NWIS for matching data points for this same range of dates and station IDs. 
Because NWIS data is structured a little differently than the modeled streamflow, we'll need to 
re-arrange the data a little after fetching. 

In addition, a call to NWIS for historical data can be time consuming -- and we will do it roughly 
7500 times.  We will eventually set up a mechanism to do these requests in parallel, once we've 
established how the data restructuring should happen. 

The first step in that process is to make a NWIS request for just a couple of gages to see how 
the return data is structured.  We'll use that information to create the plan by which the full 
dataset is to be fetched and reorganized. 

In [4]:
from pygeohydro import NWIS
nwis = NWIS()
## Fetch data for a couple of gages to see how NWIS formats a response
observed = nwis.get_streamflow(GAGES[0:2], DATE_RANGE, to_xarray=True) 
observed

## Re-arrange the response data
We requested two stream gages rather than just one, to ensure that the dataset is multi-dimensional 
(as the final request will be). 

Now that we can see the data provided by `NWIS`, we can begin to shape the plan to match its 
structure to the modeled data. 

**Rename Vars** -- First step is to adjust some variable names in order to match the
modeled data above ('_discharge_' --> '_streamflow_', etc).  

In [5]:
observed = (observed
            .rename_dims({'station_id':'gage_id'})
            .rename({'discharge':'streamflow', 'station_id':'gage_id'})
           )
observed





**Chunking**  -- The next step in organizing this data is to define how it will be 'chunked' when written to storage. 

:::{sidebar}
See [here](/dev/null) for a primer on chunking.
:::

We'll do this by creating an 'empty' dataset using our sample of observed values as a template, with the
chunking plan laid out as we like. Once we have the template sorted out, we can then populate the template
with  data fetched from NWIS for all of the gages of interest. 

In [6]:
source_dataset = observed
template = (xr.zeros_like(source_dataset)  # DataSet just like 'observed'
             .chunk()           
             .isel(gage_id=0, drop=True)      # temporarily remove gage_id as a dimension and coordinate
             .expand_dims(gage_id=len(GAGES), axis=-1) # add it back, reserving space for the full size of GAGES
             .assign_coords({'gage_id': GAGES}) # add coordinate to match dimension
             .chunk({                         # define chunk sizes
                 'time': len(observed.time),  # all time vals in one chunk
                 'gage_id': 1}                # one gage_id per chunk
             )
           )
template

Output (abridged summary of the rendered dataset table):

* 2-D array (float64): shape (15310, 7647), 893.22 MiB in total, chunked as (15310, 1), giving 7647 chunks of 119.61 kiB each.
* 1-D per-gage arrays: shape (7647,), chunked as (1,), giving 7647 chunks each. Numeric arrays (float64 or int64) take 8 B per chunk; the remaining arrays, whose dtype is not shown in the output, take between 24 B and 172 B per chunk.

**Formalize Template** -- We can now write this empty template to storage, which will codify the 
metadata and create a landing zone for the upcoming parallelized calls to the NWIS service. 

In [7]:
import numpy as np

outfile = fs.get_mapper('s3://nhgf-development/workspace/tutorial/nwis_out_gzt5142.zarr')
template.to_zarr(
    outfile,
    compute=False,                                 # << will not write data; just metadata
    encoding =  {                                  # encodings sets data types for the disk store
        'station_nm':  dict( _FillValue='',          dtype='<U43'), 
        'alt_acy_va':  dict( _FillValue=-2147483647, dtype=np.int32),
        'alt_va':      dict( _FillValue=9.96921e+36, dtype=np.float32),
        'dec_lat_va':  dict( _FillValue=None,        dtype=np.float32),
        'dec_long_va': dict( _FillValue=None,        dtype=np.float32),
        'streamflow':  dict( _FillValue=9.96921e+36, dtype=np.float32)
    },
    consolidated=True,                             # Consolidate metadata
    mode='w'
)

Delayed('_finalize_store-0e8225d6-5fb2-4aea-b70d-418b265defde')

## Fetch All NWIS Data
Now that we have established the names, encoding, chunking, etc. of the output data, we can
fetch the streamflow data from NWIS and populate the data store.  This will involve over 7000
calls to the NWIS service (one per gage), so we will do that in parallel. 

To set up for the parallel run, we will establish an encapsulated routine to process a single stream
gage.  We will then farm out that routine to workers -- a worker will process a single gage
from the list. 

In [15]:
# Global: 
n_timesteps = len(observed.time)
time_steps = observed.time.values
# observed.to_zarr(outfile,  region={'time':slice(0, n_timesteps), 'gage_id':slice(0, 2)})

def write_one_gage(n):
    """ 
    Writes one gage's data to the existing zarr file.  Uses the NWIS API call to fetch data.
    
    Arguments: 
    n   : integer
       the index into the GAGES array identifying which gage to fetch and write. 
    """
    site_id = GAGES[n]
    try:
        _obs = nwis.get_streamflow(site_id, DATE_RANGE, to_xarray=True).interp(time=time_steps)
        _obs = _obs.rename_dims({'station_id':'gage_id'}).rename({'station_id':'gage_id','discharge':'streamflow'})
        ## We must force the returned data into the datatype that we stored to disk. 
        _obs['station_nm'] = xr.DataArray(data=_obs['station_nm'].values.astype('<U43'), dims='gage_id')
        _obs['alt_datum_cd'] = xr.DataArray(data=_obs['alt_datum_cd'].values.astype('<U6'), dims='gage_id')
 
        _obs.to_zarr(
            outfile, 
            region={ #<<< Specifying a region lets us 'insert' data to a specific place in the dataset. 
                'time': slice(0, n_timesteps), 
                'gage_id': slice(n,n+1)
                }
            )
        return n
    except Exception:
        # Catching every Exception is very broad and is generally to be avoided.
        # We do it here to protect the parallel run: it allows a single write_one_gage()
        # call to fail quietly (returning None) without affecting the rest of the run.
        return None
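One reason the region write suits a parallel run is that each gage's region, `slice(n, n+1)`, lines up with exactly one chunk along `gage_id`, because the template was chunked one gage per chunk. The sketch below illustrates that alignment with plain arithmetic; `chunks_touched` and `gage_region` are hypothetical helpers for this illustration and are not part of xarray or zarr. It rests on the assumption that parallel writers are safest when no two of them modify the same chunk.

```python
def chunks_touched(region, chunk_size):
    """Return the indices of the chunks that a 1-D slice region overlaps."""
    first = region.start // chunk_size
    last = (region.stop - 1) // chunk_size
    return list(range(first, last + 1))


def gage_region(n):
    """Region for the n-th gage, in the same form write_one_gage() uses."""
    return slice(n, n + 1)


# With one gage per chunk, gages 3 and 4 land in different chunks.
one_per_chunk = [chunks_touched(gage_region(n), chunk_size=1) for n in (3, 4)]

# With a hypothetical 100 gages per chunk, both writes would hit chunk 0.
hundred_per_chunk = [chunks_touched(gage_region(n), chunk_size=100) for n in (3, 4)]
```

With `chunk_size=1` the two writes touch chunks `[3]` and `[4]`; with `chunk_size=100` both touch `[0]`, which is the situation the one-gage-per-chunk plan avoids.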

In [13]:
# test to see if it will write data for one gage... pick a random item from the GAGES list
write_one_gage(3)



3

**Start Cluster** -- Finally, we are ready to run the above routine in a clustered compute environment.
As this routine will write to an S3 bucket, the cluster should be configured with the **SAME CREDENTIALS**
as the bucket access profile. 

In [12]:
# This cluster configuration is VERY specific to the JupyterHub cloud environment on ESIP/QHUB
import sys
sys.path.append('/shared/users/lib')
try:
    import ebdpy as ebd
except ImportError as exc:
    raise ImportError("EBD library not found!!") from exc

aws_profile = 'nhgf-development'
aws_region = 'us-west-2'
endpoint = f's3.{aws_region}.amazonaws.com'
ebd.set_credentials(profile=aws_profile, region=aws_region, endpoint=endpoint)
worker_max = 30
client, cluster = ebd.start_dask_cluster(
    profile=aws_profile,
    worker_max=worker_max,
    region=aws_region,
    use_existing_cluster=True,
    adaptive_scaling=False,
    wait_for_cluster=False,
    worker_profile='Medium Worker',
    propagate_env=True
)

Region: us-west-2
No Cluster running.
Starting new cluster.
{}
Setting Cluster Environment Variable AWS_DEFAULT_REGION us-west-2
Setting Fixed Scaling workers=30
Reconnect client to clear cache
client.dashboard_link (for new browser tab/window or dashboard searchbar in Jupyterhub):
https://jupyter.qhub.esipfed.org/gateway/clusters/dev.b81d21f066a44eed9da52e3110dfe38f/status
Propagating environment variables to workers
Using environment: users/pangeo


**RUN** -- Using the Dask `delayed()` function, we can dispatch many copies of `write_one_gage()` at once. The
Dask scheduler will disperse these to workers in the cluster.  When all are done, the `compute()` will return. 

In [51]:
results = []
for (i, gage_id) in enumerate(GAGES):
    results.append(dask.delayed(write_one_gage)(i))
    
_ = dask.compute(*results, retries=10)

(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
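Because `write_one_gage()` returns its index on success and `None` on failure, the tuple returned by `compute()` can be used to list the gages that were not written. The sketch below shows one way to do that, assuming the result tuple is kept rather than discarded; `failed_gages` is a hypothetical helper and the sample IDs and results are invented.

```python
def failed_gages(gages, results):
    """Return the gage IDs whose task returned None instead of its index."""
    succeeded = {r for r in results if r is not None}
    return [g for i, g in enumerate(gages) if i not in succeeded]


# Invented sample: four gages; the task for index 2 failed and returned None.
sample_gages = ["USGS-0001", "USGS-0002", "USGS-0003", "USGS-0004"]
sample_results = (0, 1, None, 3)
retry_list = failed_gages(sample_gages, sample_results)
```

Here `retry_list` is `['USGS-0003']`. Since the routine catches its own exceptions, a failed gage does not appear to the scheduler as a failed task, so a list like this is a simple way to decide which gages to request again.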

In [14]:
## Done with parallel work... shut down the cluster
client.close()
cluster.close()



## Verify

At this point, the data should be written to the `outfile`. We can open it with a standard `open_dataset`
using the _zarr_ data engine.  Let's do that to see what we got: 

In [None]:
dst = xr.open_dataset(outfile, engine='zarr', chunks={}, backend_kwargs=dict(consolidated=True))
dst