# US hit by 1-in-1,000-year flood

James Munroe, jmunroe@2i2c.org

In [2]:
import os
import shutil
import fsspec
import ujson
from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr
import xarray as xr
import dask
import hvplot.xarray

## From the evening news

I was listening to the news tonight and learned that Dallas, TX is currently experiencing significant flooding.  For example, the Washington Post reports:

https://www.washingtonpost.com/nation/2022/08/22/dallas-texas-flash-floods/

> In some isolated areas, the rainfall totals would be considered a 1-in-1,000-year flood — a remarkable reversal given the dramatic drought that Dallas had faced for months. Several rainfall gauges recorded more than 10 inches. A record-breaking 3.01 inches of rain was also recorded in one hour at Dallas-Fort Worth International Airport.

> The downpour marked the latest such flood in the past few weeks across the United States. In one week alone, three 1-in-1,000-year rain events occurred, inundating St. Louis, eastern Kentucky and southeastern Illinois. The term, often considered controversial in part because it’s misunderstood, is used to describe a rainfall event that is expected once every 1,000 years, meaning it has just a 0.1 percent chance of happening in any given year — but such events can occur much more frequently.

> ...

> One rain gauge in Harris County, Tex., tallied more than 14.9 inches of rain within just a 12-hour period, more than 40 percent of the area’s yearly rainfall, according to Jeff Lindner, a meteorologist for the county. Such rates of precipitation are nearly impossible for soils — not to mention impervious paved surfaces — to absorb without runoff that can cause flash flooding.
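
The "0.1 percent chance in any given year" framing is easier to reason about with a little arithmetic. The sketch below assumes each year is independent with a fixed annual probability, which is a simplification; the function name and the 30-year and 1,000-year horizons are illustrative choices, not taken from the article.

```python
def prob_at_least_one(annual_prob: float, years: int) -> float:
    """Probability of at least one event in `years` independent years."""
    return 1.0 - (1.0 - annual_prob) ** years


annual_prob = 1 / 1000  # a "1-in-1,000-year" event at a single location

p_30yr = prob_at_least_one(annual_prob, 30)
p_1000yr = prob_at_least_one(annual_prob, 1000)

print(f"At least one event in 30 years:    {p_30yr:.1%}")
print(f"At least one event in 1,000 years: {p_1000yr:.1%}")
```

Under these assumptions, a single location has roughly a 3 percent chance of seeing such an event in 30 years, and only about a 63 percent chance in a full 1,000 years. Because the probability applies per location, many independent locations across a large country make several such events in one season much less surprising.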

## National Water Model and the AWI-CIROH JupyterHub

Considering that AWI-CIROH now has a 2i2c-managed JupyterHub running on Google Cloud Platform (GCP) and a significant amount of [National Water Model](https://water.noaa.gov/about/nwm) data has already been made available in a bucket, I will explore this dataset by looking at some of the historical data for regions that have recently experienced intense rainfall and flooding.

Hourly data are available from 2018-09-17 through 2022-08-22 (at the time of writing), with more added every day.

A 2020 [blog post](https://medium.com/pangeo/cloud-performant-netcdf4-hdf5-with-zarr-fsspec-and-intake-3d3a3e7cb935) on *Cloud-Performant NetCDF4/HDF5 with Zarr, Fsspec, and Intake* by Rich Signell (USGS), Martin Durant (Anaconda) and Aleksandar Jelenak (HDF Group) demonstrated how to read data from the NWM on Amazon Web Services. Let's see if we can make this work with this data on GCP.

In [104]:
from dask.distributed import Client, LocalCluster

cluster = LocalCluster()
cluster

In [105]:
client = Client(cluster)
cluster

In [31]:
fs = fsspec.filesystem('gcs', anon=True)

In [94]:
best_hour = 'f001'
var = 'land'

Make a list of all hours for August 22, 2022.

In [211]:
flist = []
for day in range(22, 23):
    for i in range(24):
        flist.append(f'gcs://national-water-model/nwm.202208{day:02d}/short_range/nwm.t{i:02d}z.short_range.{var}.{best_hour}.conus.nc')
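
The nested loop above covers a single day. As a hedged sketch, the same URL pattern can be generated for an arbitrary time range with `datetime`, which avoids hard-coding the year and month. The helper name `nwm_short_range_urls` is hypothetical; the URL layout is copied from the cell above.

```python
from datetime import datetime, timedelta


def nwm_short_range_urls(start, end, var="land", forecast_hour="f001"):
    """Build one URL per hourly model cycle in the half-open range [start, end)."""
    urls = []
    t = start
    while t < end:
        urls.append(
            f"gcs://national-water-model/nwm.{t:%Y%m%d}/short_range/"
            f"nwm.t{t:%H}z.short_range.{var}.{forecast_hour}.conus.nc"
        )
        t += timedelta(hours=1)
    return urls


# All 24 hourly cycles for August 22, 2022 (same list as the loop above).
urls = nwm_short_range_urls(datetime(2022, 8, 22), datetime(2022, 8, 23))
print(len(urls))
print(urls[0])
```

Extending the analysis to several days then only requires changing the `end` argument.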

In [212]:
fs2 = fsspec.filesystem('')  # local filesystem, used later to list the generated JSON files

In [213]:
json_dir = 'jsons/'

if not os.path.exists(json_dir):
    os.makedirs(json_dir)

In [214]:
so = dict(mode='rb', anon=True, default_fill_cache=False, default_cache_type='first') # args to fs.open()
# default_fill_cache=False avoids caching data between file chunks, which lowers memory usage.

In [215]:
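def gen_json(u):
    """Write a kerchunk reference JSON for the single NetCDF file at URL `u`."""
    with fs.open(u, **so) as infile:
        # Chunks smaller than inline_threshold bytes are stored directly in the JSON.
        h5chunks = SingleHdf5ToZarr(infile, u, inline_threshold=300)
        # u looks like gcs://<bucket>/<date directory>/<product>/<file name>
        p = u.split('/')
        date = p[3]   # e.g. nwm.20220822
        fname = p[5]  # the NetCDF file name
        outf = f'{json_dir}{date}.{fname}.json'
        with open(outf, 'wb') as f:
            f.write(ujson.dumps(h5chunks.translate()).encode())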
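
The output file name in `gen_json` depends on fixed positions in the split URL. A minimal sketch of the same naming logic as a standalone function makes it easy to check without touching the bucket. The name `reference_path` is hypothetical, and the code assumes the `gcs://<bucket>/<date directory>/<product>/<file>` layout used above.

```python
import posixpath


def reference_path(url: str, json_dir: str = "jsons/") -> str:
    """Return the local JSON path that corresponds to one remote NetCDF URL."""
    # Drop the "gcs://" scheme, leaving "bucket/date_dir/product/file".
    path = url.split("://", 1)[-1]
    date_dir = path.split("/")[1]      # e.g. "nwm.20220822"
    fname = posixpath.basename(path)   # the NetCDF file name
    return posixpath.join(json_dir, f"{date_dir}.{fname}.json")


url = (
    "gcs://national-water-model/nwm.20220822/short_range/"
    "nwm.t00z.short_range.land.f001.conus.nc"
)
out = reference_path(url)
print(out)
```

Including the date directory in the name keeps files from different days from overwriting each other, since the NetCDF file names repeat every day.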

In [216]:
%%time
results = dask.compute(*[dask.delayed(gen_json)(u) for u in flist], retries=10)

CPU times: user 2.37 s, sys: 244 ms, total: 2.61 s
Wall time: 7.85 s


### Combine multiple kerchunk'd datasets into a single logical aggregate dataset

In [217]:
# json_dir already ends with '/', so no extra separator is needed in the pattern.
json_list = fs2.glob(f'{json_dir}*.json')
json_list = sorted(json_list)

In [218]:
len(json_list)

24
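
A count of 24 confirms that nothing is missing for one day, but when a file is missing the count does not say which one. The sketch below, with a hypothetical helper `missing_cycles`, extracts the cycle hour from each file name using the `tHHz` pattern seen above and reports the gaps. The example list is synthetic, with hour 05 deliberately left out.

```python
import re


def missing_cycles(json_paths, expected_hours=range(24)):
    """Return the sorted cycle hours that have no matching reference file."""
    found = set()
    for path in json_paths:
        match = re.search(r"\.t(\d{2})z\.", path)
        if match:
            found.add(int(match.group(1)))
    return sorted(set(expected_hours) - found)


# Synthetic file list for illustration only: hour 05 is omitted on purpose.
example = [
    f"jsons/nwm.20220822.nwm.t{h:02d}z.short_range.land.f001.conus.nc.json"
    for h in range(24)
    if h != 5
]
gaps = missing_cycles(example)
print(gaps)
```

This version assumes a single day of files; for several days the date directory would need to be part of the key as well.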

In [220]:
mzz = MultiZarrToZarr(json_list,
        remote_protocol='gcs',
        remote_options={'anon':True},
        concat_dims=['time'],
        identical_dims = ['x', 'y'],
    )

In [221]:
%%time
mzz.translate('nwm.json')

CPU times: user 145 ms, sys: 14.5 ms, total: 159 ms
Wall time: 167 ms
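
The combined reference file is plain JSON. In the version 1 reference layout, the `refs` mapping holds either inline strings (metadata and small chunks) or `[url, offset, length]` lists that point at byte ranges in the original files. As a rough sketch, a summary of such a document can be computed as follows; `summarize_refs` is a hypothetical helper, and the toy document, including its bucket name and byte counts, is made up for illustration.

```python
def summarize_refs(ref_doc):
    """Count inline entries, remote chunk references, and referenced bytes."""
    refs = ref_doc["refs"]
    inline = sum(1 for v in refs.values() if isinstance(v, str))
    remote = [v for v in refs.values() if isinstance(v, list)]
    # Only [url, offset, length] entries carry an explicit byte length.
    remote_bytes = sum(v[2] for v in remote if len(v) == 3)
    return {"inline": inline, "remote": len(remote), "remote_bytes": remote_bytes}


# Toy reference document; all values are hypothetical.
toy = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "SOILSAT_TOP/.zarray": '{"shape": [2, 4, 4]}',
        "SOILSAT_TOP/0.0.0": ["gcs://example-bucket/a.nc", 1000, 2048],
        "SOILSAT_TOP/1.0.0": ["gcs://example-bucket/b.nc", 1000, 4096],
    },
}
summary = summarize_refs(toy)
print(summary)
```

To apply this to the real output, load `nwm.json` with `json.load` and pass the resulting dictionary to the function.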


In [222]:
backend_args = { "consolidated": False,
                 "storage_options": { "fo": 'nwm.json',
                                "remote_protocol": "gcs", 
                                "remote_options": {'anon':True} }}
ds = xr.open_dataset(
    "reference://", engine="zarr",
    backend_kwargs=backend_args
)


In [223]:
ds

In [227]:
ds.SOILSAT_TOP.hvplot('x', 'y', rasterize=True)

We can focus on a 50 km x 50 km region (approximately Dallas County).

In [225]:
dallas_soilsat = ds.SOILSAT_TOP.sel(x=slice(0e0, 50e3), y = slice(-800e3, -750e3))
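
The selection above implies that `x` and `y` are projected coordinates in meters, so a 50 km box spans 50e3 units. A small sketch of a helper that builds the two slices from a box center and width makes it easier to try other regions; `box_slices` is a hypothetical name, and the center used here is simply the midpoint of the box above.

```python
def box_slices(x_center, y_center, width):
    """Return (x_slice, y_slice) for a square box, in the grid's own units."""
    half = width / 2
    x_slice = slice(x_center - half, x_center + half)
    y_slice = slice(y_center - half, y_center + half)
    return x_slice, y_slice


# Reproduce the 50 km x 50 km box used above (units assumed to be meters).
x_sl, y_sl = box_slices(25e3, -775e3, 50e3)
print(x_sl, y_sl)
```

The result can be passed straight to `.sel(x=x_sl, y=y_sl)` on the dataset.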

In [226]:
dallas_soilsat.mean(dim=['x', 'y']).hvplot()