# Expanding the Workflow
## Different Data
Let's say we now want to get our streamflow data from the National Water Model (NWM) Retrospective dataset, perhaps to compare with the USGS data. Let's write a function to do that.

## Quick Sidebar: Zarr
[Zarr](https://zarr.dev/) is a technology that allows for indexed access to large, online datasets. The National Water Model, and NOAA in general, have used Zarr extensively to make it easier to download portions of their datasets without having to download large netCDF files or large sets of netCDF files. You can think of Zarr as working together with sets of netCDF files so that users can download just portions of the netCDF-formatted dataset.

Before we write the function, let's explore how to do this with some Jupyter cells.

In [2]:
from s3fs import S3FileSystem, S3Map
import xarray as xr

# https://registry.opendata.aws/nwm-archive/
bucket = 's3://noaa-nwm-retrospective-3-0-pds/CONUS/zarr'
fs = S3FileSystem(anon=True)
ds = xr.open_dataset(S3Map(f"{bucket}/chrtout.zarr", s3=fs), engine='zarr')
ds


## Quick Sidebar: xarray
[xarray](https://xarray.dev/) is a Python library that standardizes access to a number of underlying array-like data formats including netCDF, Zarr, and GRIB. xarray is useful because it lets you work with all of these file formats through one set of code, instead of through format-specific Python libraries, each of which has its own syntax and methodologies for manipulating the data. It also works seamlessly with a number of other common Python libraries including pandas, NumPy, and matplotlib for visualization.

In [4]:
ds = ds.sel({'feature_id': 166176984})
ds

In [5]:
ds = ds.sel(time=slice('2016-06-01T00:00:00', '2016-11-01T00:00:00'))
ds

In [24]:
ds.get('streamflow').to_pandas()

time
2016-06-01 00:00:00     8.220000
2016-06-01 01:00:00     8.220000
2016-06-01 02:00:00     8.210000
2016-06-01 03:00:00     8.200000
2016-06-01 04:00:00     8.200000
                         ...    
2016-10-31 20:00:00    59.589999
2016-10-31 21:00:00    59.669999
2016-10-31 22:00:00    59.729999
2016-10-31 23:00:00    59.759999
2016-11-01 00:00:00    59.749999
Length: 3673, dtype: float64
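
Notice that the series ends at 2016-11-01 00:00:00, which is why we get 3673 values rather than 3672. Label-based slices in `.sel()` include both endpoints. Here is a minimal sketch of the same selection steps on a small synthetic dataset, so it runs without touching S3. The dataset `toy`, the feature IDs 101 and 102, and the values are all made up for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for the NWM data (made-up values):
# three days of hourly timestamps for two hypothetical feature IDs.
time = pd.date_range("2016-06-01 00:00", "2016-06-03 23:00", freq=pd.Timedelta(hours=1))
feature_id = [101, 102]
data = np.arange(len(time) * len(feature_id), dtype="float64").reshape(len(time), len(feature_id))

toy = xr.Dataset(
    {"streamflow": (("time", "feature_id"), data)},
    coords={"time": time, "feature_id": feature_id},
)

# Same two steps as above: pick one feature, then slice by time labels.
one_reach = toy.sel({"feature_id": 101})
window = one_reach.sel(time=slice("2016-06-01T00:00:00", "2016-06-02T00:00:00"))
series = window.get("streamflow").to_pandas()

print(len(series))       # 24 hourly steps plus the inclusive end point
print(series.index[-1])  # the end label itself is part of the result
```

If you want to exclude the final timestamp, one option is to end the slice one time step earlier.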

In [27]:
ds.get('streamflow').to_pandas().rename('streamflow (m^3 s^-1)')

time
2016-06-01 00:00:00     8.220000
2016-06-01 01:00:00     8.220000
2016-06-01 02:00:00     8.210000
2016-06-01 03:00:00     8.200000
2016-06-01 04:00:00     8.200000
                         ...    
2016-10-31 20:00:00    59.589999
2016-10-31 21:00:00    59.669999
2016-10-31 22:00:00    59.729999
2016-10-31 23:00:00    59.759999
2016-11-01 00:00:00    59.749999
Name: streamflow (m^3 s^-1), Length: 3673, dtype: float64

In [29]:
def acquire_streamflow_nwm_retrospective(site, start, end):
    bucket = 's3://noaa-nwm-retrospective-3-0-pds/CONUS/zarr'
    fs = S3FileSystem(anon=True)
    ds = xr.open_dataset(S3Map(f"{bucket}/chrtout.zarr", s3=fs), engine='zarr')

    ds = ds.sel({'feature_id': site})
    ds = ds.sel(time=slice(f'{start}T00:00:00', f'{end}T00:00:00'))

    return ds.get('streamflow').to_pandas().rename('streamflow (m^3 s^-1)')

acquire_streamflow_nwm_retrospective(166176984, '2016-06-01', '2016-11-01')

time
2016-06-01 00:00:00     8.220000
2016-06-01 01:00:00     8.220000
2016-06-01 02:00:00     8.210000
2016-06-01 03:00:00     8.200000
2016-06-01 04:00:00     8.200000
                         ...    
2016-10-31 20:00:00    59.589999
2016-10-31 21:00:00    59.669999
2016-10-31 22:00:00    59.729999
2016-10-31 23:00:00    59.759999
2016-11-01 00:00:00    59.749999
Name: streamflow (m^3 s^-1), Length: 3673, dtype: float64

## Putting our New Function Into the Workflow

Now that we have a new "Data Acquire / Filter" function that matches the interface we've defined, let's drop it into our workflow...


In [30]:
import dataretrieval.nwis as nwis
from s3fs import S3FileSystem, S3Map
import xarray as xr

def acquire_streamflow_nwis_iv(site, start, end):
    df = nwis.get_record(sites=site, service='iv', start=start, end=end, parameterCD='00060')
    # https://help.waterdata.usgs.gov/parameter_cd?group_cd=PHY
    return df['00060'].rename('streamflow (ft^3/s)')

def acquire_streamflow_nwm_retrospective(site, start, end):
    bucket = 's3://noaa-nwm-retrospective-3-0-pds/CONUS/zarr'
    fs = S3FileSystem(anon=True)
    ds = xr.open_dataset(S3Map(f"{bucket}/chrtout.zarr", s3=fs), engine='zarr')

    ds = ds.sel({'feature_id': site})
    ds = ds.sel(time=slice(f'{start}T00:00:00', f'{end}T00:00:00'))

    return ds.get('streamflow').to_pandas().rename('streamflow (m^3 s^-1)')

def resample_to_daily(df):
    return df.resample('1D').mean()

def visualize_summary_statistics(df):
    print(df.describe())

# Acquire / Filter
#df = acquire_streamflow_nwis_iv(site='04294000', start='2022-06-01', end='2022-11-01')
df = acquire_streamflow_nwm_retrospective(166176984, '2016-06-01', '2016-11-01')

# Manipulate
daily = resample_to_daily(df)

# Visualize
visualize_summary_statistics(daily)

count    154.000000
mean      15.004651
std       17.237395
min        3.477500
25%        6.116250
50%       10.826250
75%       16.147500
max      121.737914
Name: streamflow (m^3 s^-1), dtype: float64
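
One thing to watch if you compare the two sources: the NWIS function returns ft^3/s while the NWM function returns m^3/s, as their series names show. A rough sketch of a conversion step is below. The helper `convert_cfs_to_cms` is hypothetical (it is not part of the workflow above), and the input values are made up.

```python
import pandas as pd

# 1 ft is exactly 0.3048 m, so 1 ft^3 is 0.3048**3 m^3.
CUBIC_METERS_PER_CUBIC_FOOT = 0.3048 ** 3

def convert_cfs_to_cms(series):
    """Hypothetical helper: convert streamflow from ft^3/s to m^3/s."""
    return (series * CUBIC_METERS_PER_CUBIC_FOOT).rename('streamflow (m^3 s^-1)')

# Made-up values standing in for the output of acquire_streamflow_nwis_iv
usgs_like = pd.Series([100.0, 250.0], name='streamflow (ft^3/s)')
converted = convert_cfs_to_cms(usgs_like)
print(converted)
```

Because it takes a series and returns a series, a step like this could sit in the "Manipulate" stage without changing the other functions.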


## Process Notes
I want to take a second to point out how we used Jupyter here to explore the NWM Retrospective data, printing out our dataset as we filtered it along the way to make sure we were doing what we intended. This iteration and exploration are essential pieces of the software development and data science methodologies. You need to make sure along the way that the data you are getting is the data you think you are getting. Then, once you are confident in your methodology, you can put it in a .py file as a function and parameterize it for future use.
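
Once the code moves out of the notebook, you can keep some of that checking in function form. The sketch below is one possible approach, not a required part of the workflow. The helper `summarize_series` is hypothetical, and the sample series uses made-up values, including one gap and one negative flow so the checks have something to find.

```python
import numpy as np
import pandas as pd

def summarize_series(series):
    """Hypothetical helper: report a few basic facts about a time series."""
    return {
        'n_values': len(series),
        'first': series.index.min(),
        'last': series.index.max(),
        'n_missing': int(series.isna().sum()),
        'n_negative': int((series < 0).sum()),
        'is_sorted': bool(series.index.is_monotonic_increasing),
    }

# Made-up hourly values standing in for an acquired streamflow series
index = pd.date_range('2016-06-01 00:00', periods=5, freq=pd.Timedelta(hours=1))
sample = pd.Series([8.22, 8.22, np.nan, 8.20, -1.0], index=index, name='streamflow (m^3 s^-1)')

summary = summarize_series(sample)
for key, value in summary.items():
    print(f'{key}: {value}')
```

Comparing `first` and `last` against the start and end you asked for is a quick way to confirm that the filter did what you intended.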

## Other Potential Data Acquisition Functions
Initially, I was going to go over a number of additional data acquisition modules for datasets like AORC, GFS, CFS, and other 2D datasets and forcing data. But then I decided to focus more on workflow structure and best practices for workflow construction. Plus, as the Python ecosystem continues to develop, more and more of these data acquisition modules are being written for us, like the USGS dataretrieval Python module. A number of additional resources for data acquisition are provided in the Wrap Up.