# Rechunk NWM 1km gridded output

The NWM 1km gridded output is available on AWS S3 as hourly NetCDF files, and we would like to rechunk it so that it can be read efficiently with the Zarr library.

One approach would be to download all the files and work on them locally, as Pangeo Forge often does. We will try a different approach: first kerchunking the NetCDF files to make them more performant to read, then running rechunker on the kerchunked dataset. In both steps, we will use a Dask Gateway cluster, with workers writing directly to S3.


## Chunking and rechunking resources
*  ["Making Earth Science data more accessible"](https://www.slideserve.com/kiaria/making-earth-science-data-more-accessible-experience-with-chunking-and-compression), AMS presentation slides by Russ Rew, Unidata (2013)
*  ["Rechunker: The missing link for chunked array analytics"](https://medium.com/pangeo/rechunker-the-missing-link-for-chunked-array-analytics-5b2359e9dc11), Medium blog post by Ryan Abernathey, Columbia University (2020)
*  ["Rechunker" Python Library Documentation](https://rechunker.readthedocs.io/en/latest/)

In [None]:
import sys
import fsspec
import xarray as xr
import hvplot.xarray  # imported for its side effect: registers the .hvplot accessor on xarray objects
import zarr

In [None]:
print("Python : ", sys.version)
print("fsspec : ", fsspec.__version__)
print("xarray : ", xr.__version__)
print("zarr   : ", zarr.__version__)

In [None]:
fs = fsspec.filesystem('s3', anon=True)

In [None]:
flist = fs.ls('s3://noaa-nwm-retrospective-2-1-pds/')
flist

In [None]:
flist = fs.glob('noaa-nwm-retrospective-2-1-pds/model_output/*')
print(flist[0])
print(flist[-1])

In [None]:
flist = fs.glob('noaa-nwm-retrospective-2-1-pds/model_output/1979/*LDAS*')
flist[0]

In [None]:
flist = fs.glob('noaa-nwm-retrospective-2-1-pds/model_output/2020/*LDAS*')
flist[-1]

Okay, so at this point we've learned that we have 3-hourly output over roughly 40 years.

In [None]:
# %%time
# flist = fs.glob('noaa-nwm-retrospective-2-1-pds/model_output/*/*LDAS*')   # this is slow
# Estimate the file count instead: about 40 years of 3-hourly files
40 * 365 * 24 / 3

In [None]:
flist[0]

So about 117,000 NetCDF files! 
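The estimate above rounds the record to 40 years and ignores leap days. A slightly more careful count can be sketched with the standard library. The date range here is an assumption (one file per 3-hour step for every full calendar year from 1979 through 2020); the actual first and last timestamps in the bucket may differ, so treat the result as a rough check rather than a file listing.

```python
from datetime import datetime, timedelta


def count_steps(start, end, step_hours=3):
    """Number of time steps in the half-open interval [start, end)."""
    return (end - start) // timedelta(hours=step_hours)


# Assumed coverage (not read from the bucket): full years 1979 through 2020
n_files = count_steps(datetime(1979, 1, 1), datetime(2021, 1, 1))
print(n_files)
```

Under that assumption the count lands in the same range as the quick estimate, on the order of 120,000 files.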

Let's check one out. Although it's not very efficient, we can open a NetCDF file on S3 as a file-like object with `fs.open(s3_url_of_netcdf_file)`. Passing `chunks=` to `xr.open_dataset` tells xarray to use Dask, and `chunks={}` means use the native chunking in the NetCDF file.
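To see what "using Dask" means without touching S3, here is a minimal sketch on a tiny synthetic dataset. The array sizes and the `ds_demo` and `ds_chunked` names are made up for illustration and are not the real NWM grid. It uses `.chunk()` on an in-memory dataset, which yields the same kind of Dask-backed variables that `chunks=` produces when opening a file.

```python
import numpy as np
import xarray as xr

# Tiny synthetic stand-in for one NWM variable (hypothetical sizes)
ds_demo = xr.Dataset(
    {"ACCET": (("time", "y", "x"), np.zeros((4, 6, 8), dtype="float32"))}
)

# Without chunking, the variable is backed by a plain NumPy array
print(type(ds_demo["ACCET"].data))

# Chunking converts it to a Dask array: one time step per chunk,
# with the y and x dimensions left whole
ds_chunked = ds_demo.chunk({"time": 1})
print(ds_chunked["ACCET"].chunks)
```

The `.chunks` attribute lists the chunk lengths along each dimension, which is a quick way to check the layout of any Dask-backed variable.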

In [None]:
ds = xr.open_dataset(fs.open(flist[0]), chunks={})

In [None]:
ds

In [None]:
ds.data_vars

In [None]:
ds = ds[['ACCET', 'SNEQV', 'FSNO']]

In [None]:
ds.data_vars

In [None]:
ds['ACCET']

In [None]:
ds.ACCET

The data is chunked with the full spatial domain and 1 time step per chunk, for a chunk size of about 135 MB. This is actually great for visualizing maps at specific time steps or for calculations that involve the entire dataset. So kerchunking this data would be a nice first step.
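The chunk size can be checked with simple arithmetic, and the same arithmetic hints at why this layout is a poor fit for time series extraction. The sketch below assumes a 3840 x 4608 grid decoded to 8-byte floats; both values are assumptions to confirm against `ds.sizes` and `ds.ACCET.dtype`. It reuses the rough count of 116,800 time steps from the estimate earlier.

```python
def chunk_nbytes(shape, itemsize):
    """Uncompressed size in bytes of one chunk with the given shape."""
    n = 1
    for size in shape:
        n *= size
    return n * itemsize


# Assumed grid: 1 time step x 3840 rows x 4608 columns of 8-byte floats
one_step = chunk_nbytes((1, 3840, 4608), 8)
print(one_step / 2**20, "MiB per chunk")

# A time series at a single grid cell touches one chunk per time step,
# so it would read every chunk of the variable.
n_steps = 40 * 365 * 24 // 3  # rough count from the estimate above
total_bytes = n_steps * one_step
print(total_bytes / 2**40, "TiB read (uncompressed) for one point time series")
```

With these assumed dimensions a chunk comes out at 135 MiB, consistent with the size reported above, and a single-point time series would mean reading on the order of terabytes of uncompressed data.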

In [None]:
%%time
da = ds.ACCET.load()

In [None]:
da

In [None]:
da.hvplot(x='x', y='y', rasterize=True, cmap='turbo', data_aspect=1)