# NCAR Earth System Data Science WIP Talk
### Lucas Sterzinger -- Atmospheric Science PhD Candidate at UC Davis
* [Twitter](https://twitter.com/lucassterzinger)
* [GitHub](https://github.com/lsterzinger)
* [Website](https://lucassterzinger.com)

# Motivation:
* NetCDF is not cloud optimized
* Other formats, like Zarr, aim to be cloud optimized

# What do I mean when I say "Cloud Optimized"?
![Move to cloud diagram](images/cloud-move.png)

In traditional scientific workflows, data is archived in a repository and downloaded to a separate computer for analysis (left). However, datasets are becoming much too large to fit on personal computers, and transferring full datasets from an archive to a separate machine can use a lot of bandwidth.

In a cloud environment, the data can live in object storage (e.g. AWS S3), and analysis can be done on adjacent compute instances, allowing low-latency, high-bandwidth access to the dataset.

## Why NetCDF doesn't work well in this workflow

NetCDF is probably the most common binary data format for atmospheric/earth sciences, and has a lot of official and community support. However, reading part of a NetCDF file typically requires loading the entire dataset in order to access the header/metadata and retrieve a chunk of data.

![NetCDF File Object](images/single_file_object.png)

## The Zarr Solution
The [Zarr data format](https://zarr.readthedocs.io/en/stable/) alleviates this problem by storing the metadata and chunks in separate files that can be accessed as needed and in parallel.

![Zarr](images/zarr.png)
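
As a rough illustration of the idea (not the Zarr library itself), the toy store below keeps metadata and chunks under separate keys, so reading one value only touches the metadata and a single chunk. The `store` dict, the `read_value` helper, and the key names are hypothetical stand-ins for objects in a bucket.

```python
import json
import struct

# Toy store: a plain dict standing in for object storage (key -> bytes).
# The key names loosely mimic a chunked layout; this is NOT the Zarr library.
store = {}

data = list(range(12))   # a 1-D array of 12 integers
chunk_size = 4           # split into 3 chunks of 4 values each

# Metadata lives under its own small key...
store[".zarray"] = json.dumps(
    {"shape": [len(data)], "chunks": [chunk_size], "dtype": "<i4"}
).encode()

# ...and every chunk is stored as a separate object.
for start in range(0, len(data), chunk_size):
    chunk = data[start:start + chunk_size]
    store[str(start // chunk_size)] = struct.pack(f"<{len(chunk)}i", *chunk)


def read_value(store, index):
    """Read one element by fetching the metadata and a single chunk."""
    meta = json.loads(store[".zarray"])
    size = meta["chunks"][0]
    chunk_id, offset = divmod(index, size)
    raw = store[str(chunk_id)]  # only this chunk is fetched
    values = struct.unpack(f"<{len(raw) // 4}i", raw)
    return values[offset]


value = read_value(store, 9)  # touches ".zarray" and chunk "2" only
```

Because each chunk is its own object, several workers could fetch different chunks at the same time without reading the rest of the array.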

### Import `fsspec-reference-maker` and make sure it's at the latest version (`0.0.3` at the time of writing)

In [None]:
import fsspec_reference_maker
fsspec_reference_maker.__version__

In [None]:
import xarray as xr
import matplotlib.pyplot as plt
from fsspec_reference_maker.hdf import SingleHdf5ToZarr
from fsspec_reference_maker.combine import MultiZarrToZarr
import fsspec

### Set up an S3 filesystem for listing GOES files on S3

In [None]:
fs = fsspec.filesystem('s3', anon=True)

In [None]:
flist = fs.glob("s3://noaa-goes16/ABI-L2-SSTF/2020/210/*/*.nc")

### Prepend `s3://` to the URLs

In [None]:
flist = ['s3://' + f for f in flist]

### Start a dask cluster

In [None]:
from dask.distributed import Client
client = Client()
client

In [None]:
import dask.bag as db
flist_bag = db.from_sequence(flist)
flist_bag

### Define a function to return a reference dictionary for a given S3 file URL

In [None]:
def gen_ref(f):
    so = dict(
        mode="rb", anon=True, default_fill_cache=False, default_cache_type="none"
    )

    with fsspec.open(f, **so) as infile:
        return SingleHdf5ToZarr(infile, f, inline_threshold=300).translate()
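
To make the idea of a "reference" more concrete, here is a minimal sketch using only the standard library. It assumes a simplified reference layout in which small entries are stored inline as strings and each chunk is a `[path, offset, length]` triple pointing into the original file. The `resolve` helper, the `var/0` keys, and the temporary file are hypothetical and only for illustration; the dictionaries produced by `gen_ref` contain more than this.

```python
import os
import tempfile

# A small binary file standing in for one NetCDF/HDF5 file.
payload = b"HEADER" + b"chunk-A!" + b"chunk-B!"
tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".bin")
tmp.write(payload)
tmp.close()

# Hypothetical reference set: inline metadata plus byte ranges for chunks.
refs = {
    ".zgroup": '{"zarr_format": 2}',   # small entry stored inline
    "var/0": [tmp.name, 6, 8],         # [path, offset, length]
    "var/1": [tmp.name, 14, 8],
}


def resolve(refs, key):
    """Return the bytes for a key, reading only the referenced byte range."""
    entry = refs[key]
    if isinstance(entry, str):
        return entry.encode()          # inline data, no file access needed
    path, offset, length = entry
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


chunk_a = resolve(refs, "var/0")
chunk_b = resolve(refs, "var/1")
os.unlink(tmp.name)                    # clean up the temporary file
```

With a remote file, the same lookup becomes a byte-range request against object storage, so the original file never has to be downloaded in full.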

### Map `gen_ref` to each member of `flist_bag` and compute

In [None]:
dicts = flist_bag.map(gen_ref).compute()

### Use `MultiZarrToZarr` to combine the 24 individual references into a single reference

In [None]:
mzz = MultiZarrToZarr(
    dicts,
    remote_protocol='s3',
    remote_options={'anon':True},
    xarray_open_kwargs={
        "decode_cf" : False,
        "mask_and_scale" : False,
        "decode_times" : False,
        "decode_timedelta" : False,
        "use_cftime" : False,
        "decode_coords" : False
    },
    xarray_concat_args={'dim' : 't'}
)
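
Conceptually, combining references along a dimension amounts to renumbering chunk keys so that each file's chunk lands at a different index. The sketch below is a simplified illustration of that idea, not how `MultiZarrToZarr` is implemented. It assumes each input holds exactly one chunk for a hypothetical variable `temp`, and it omits the metadata updates (such as the new array shape) that a real combine step needs. The `combine_refs` helper and the `s3://bucket/...` paths are made up for the example.

```python
def combine_refs(ref_list, var="temp"):
    """Concatenate per-file references along a leading dimension.

    Each input is assumed to hold exactly one chunk for `var`, stored
    under the key f"{var}/0". In the output, the chunk from file i is
    stored under f"{var}/{i}". Shared entries are kept from the first file.
    """
    combined = {}
    for i, refs in enumerate(ref_list):
        for key, value in refs.items():
            if key == f"{var}/0":
                combined[f"{var}/{i}"] = value   # renumber along the new axis
            else:
                combined.setdefault(key, value)  # keep the first copy of shared keys
    return combined


# Two hypothetical single-file reference sets.
file_refs = [
    {".zgroup": '{"zarr_format": 2}', "temp/0": ["s3://bucket/file_a.nc", 100, 50]},
    {".zgroup": '{"zarr_format": 2}', "temp/0": ["s3://bucket/file_b.nc", 120, 50]},
]
combined = combine_refs(file_refs)
```

The combined dictionary still points at the original files; only the keys change, which is why no data needs to be copied.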

References can be saved to a file (`combined.json`) or passed back as a dictionary (`mzz_dict`)

In [None]:
mzz.translate('combined.json')
# mzz_dict = mzz.translate()

***

# Read the referenced files with `fsspec` and `xarray`

In [None]:
fs2 = fsspec.filesystem(
    'reference',
    fo="./combined.json",
    remote_protocol='s3',
    remote_options=dict(anon=True),
    skip_instance_cache=True,
)
ds = xr.open_dataset(fs2.get_mapper(""), engine='zarr')
ds

In [None]:
import metpy

In [None]:
ds1 = ds.metpy.parse_cf()

In [None]:
import hvplot.xarray

In [None]:
ds_latlon = ds1.metpy.assign_latitude_longitude()

In [None]:
ds_latlon

In [None]:
# Bounding box: latitude 0 to 50, longitude -90 to -60
mask_lat = (ds_latlon.latitude > 0) & (ds_latlon.latitude < 50)
mask_lon = (ds_latlon.longitude > -90) & (ds_latlon.longitude < -60)

# Keep values inside the box; everything outside becomes NaN
ds3 = ds.where(mask_lat & mask_lon)

In [None]:
ds3 = ds3.metpy.parse_cf()

In [None]:
ds3.SST.isel(t=0)

In [None]:
ds3.SST.isel(t=0).hvplot.quadmesh()