# `fsspec-reference-maker` tutorial

Created July 2021 by [Lucas Sterzinger](mailto:lsterzinger@ucdavis.edu) ([Twitter](https://twitter.com/lucassterzinger)) as part of the NCAR [Summer Internship in Parallel Computational Science (SIParCS)](https://www2.cisl.ucar.edu/siparcs)

If any part of this tutorial is out of date, please open a pull request with a fix.

### Import ReferenceMaker

In [None]:
from fsspec_reference_maker.hdf import SingleHdf5ToZarr 
from fsspec_reference_maker.combine import MultiZarrToZarr

In [None]:
import xarray as xr
import numpy as np
import matplotlib.pyplot as plt
import s3fs
import datetime as dt
import zipfile
import logging
import fsspec
import ujson
from tqdm import tqdm
from glob import glob
import os

### Dask makes some of the processing faster, but it is not required to run this tutorial.
This code block starts 8 local dask workers.

In [None]:
import dask
from dask.distributed import Client

client = Client(n_workers=8)
client

## Create metadata JSONs

### This function returns a list of S3 files for a given satellite, year, and day

In [None]:
def get_file_list(sat, lyr, idyjl):
    # arguments
    # sat    'goes-east' or 'goes-west'
    # lyr    year
    # idyjl  day of year (1-based: 1 is January 1)

    fs = s3fs.S3FileSystem(anon=True)  # anonymous connection to the public S3 buckets

    # create zero-padded strings for the year and day of year
    syr, sjdy = str(lyr).zfill(4), str(idyjl).zfill(3)

    # choose the bucket for the requested satellite
    if sat == 'goes-east':
        bucket = 'noaa-goes16'
    elif sat == 'goes-west':
        bucket = 'noaa-goes17'
    else:
        raise ValueError(f"unsupported satellite: {sat!r}")

    # use glob to list all the files in the directory
    file_location = fs.glob(f's3://{bucket}/ABI-L2-SSTF/{syr}/{sjdy}/*/*.nc')

    return file_location

### Get data for the 210th day of 2020 (July 28, 2020)

In [None]:
flist = get_file_list("goes-east", 2020, 210)
urls = ["s3://" + f for f in flist]
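To check which calendar date a day-of-year value corresponds to, a small standard-library sketch like the one below can help. The helper names `doy_to_date` and `doy_prefix` are made up for this illustration and are not part of the tutorial's code. The prefix layout simply mirrors the glob pattern used in `get_file_list()`.

```python
import datetime as dt

def doy_to_date(year, day_of_year):
    """Convert a 1-based day of year to a calendar date."""
    # Day 1 is January 1, so the offset from January 1 is day_of_year - 1
    return dt.date(year, 1, 1) + dt.timedelta(days=day_of_year - 1)

def doy_prefix(bucket, product, year, day_of_year):
    """Build an S3 prefix with a 4-digit year and a 3-digit day of year."""
    return f"s3://{bucket}/{product}/{year:04d}/{day_of_year:03d}/"

date_210 = doy_to_date(2020, 210)
prefix_210 = doy_prefix("noaa-goes16", "ABI-L2-SSTF", 2020, 210)
print(date_210, prefix_210)
```

Because 2020 is a leap year, day 210 falls one calendar day earlier than it would in a non-leap year.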

## Create reference JSONs
### This function creates a JSON metadata file for each S3 file and writes it to the local `jsons/` directory

These files point to the S3 locations of the netCDF files and only need to be created once. Generating the JSONs for 24 files took me about 10 minutes. This function could easily be run in parallel for faster performance.

In [None]:
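def gen_json(u):
    # options for opening the remote file: anonymous access, binary read, no caching
    so = dict(
        mode="rb", anon=True, default_fill_cache=False, default_cache_type="none"
    )
    with fsspec.open(u, **so) as inf:
        # scan the HDF5/netCDF4 file and record where each chunk is stored
        h5chunks = SingleHdf5ToZarr(inf, u, inline_threshold=300)
        # write the references to jsons/<original file name>.json
        with open(f"jsons/{u.split('/')[-1]}.json", 'wb') as outf:
            outf.write(ujson.dumps(h5chunks.translate()).encode())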
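To make the idea of a reference file more concrete, here is a minimal sketch of how a byte-range reference can be resolved. It assumes the layout of the fsspec reference format as I understand it: a `refs` dictionary whose values are either inline strings or `[url, offset, length]` lists. The keys, the temporary file, and the `read_ref` helper are all hypothetical and use a local file as a stand-in for a remote netCDF file. This is not the library's implementation.

```python
import json
import os
import tempfile

# Write a small local file that stands in for a remote netCDF file
with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as tmp:
    tmp.write(b"HEADERchunk-0-bytesTRAILER")
    data_path = tmp.name

# A hand-written reference set (hypothetical keys, not real GOES output):
# a string value is inline content, a [path, offset, length] list points
# at a byte range inside another file
refs = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        "SST/0.0": [data_path, 6, 13],
    },
}

def read_ref(refs, key):
    """Return the bytes a reference key points to."""
    value = refs["refs"][key]
    if isinstance(value, str):
        return value.encode()  # inline content
    path, offset, length = value
    with open(path, "rb") as f:
        f.seek(offset)          # jump to the start of the chunk
        return f.read(length)   # read only the chunk's bytes

chunk = read_ref(refs, "SST/0.0")
group_meta = read_ref(refs, ".zgroup")
os.remove(data_path)
print(chunk, group_meta)
```

The point of the design is that only the requested byte range is read, so the original file never has to be downloaded or converted.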


In [None]:
# Create jsons/ folder if it doesn't already exist
import pathlib
pathlib.Path('./jsons/').mkdir(exist_ok=True)

Run the `gen_json()` function defined above with dask. This function can be run in parallel with each worker creating a single file. If you do not want to use dask, replace the uncommented line with the commented block below it.

In [None]:
dask.compute(*[dask.delayed(gen_json)(u) for u in urls]);

# If not using dask, use
# for u in tqdm(urls):
#     gen_json(u)
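If dask is not available but some parallelism is still wanted, the standard library's thread pool is one possible alternative, since the work is mostly waiting on network reads. The sketch below uses a stand-in function, `fake_gen_json`, and made-up URLs so that it runs without network access. In practice `gen_json` and `urls` would take their place.

```python
from concurrent.futures import ThreadPoolExecutor

def json_name(url):
    """Output path that gen_json() would write for a given URL."""
    return f"jsons/{url.split('/')[-1]}.json"

def fake_gen_json(url):
    # Stand-in for gen_json(u): does no I/O, just returns the output path
    return json_name(url)

# Hypothetical URLs for illustration only
example_urls = [f"s3://example-bucket/path/file_{i:02d}.nc" for i in range(4)]

with ThreadPoolExecutor(max_workers=4) as pool:
    # map() returns results in the same order as the inputs
    written = list(pool.map(fake_gen_json, example_urls))

print(written)
```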

***
## Read remote netCDF files with xarray and fsspec

### First, create a list of JSON files

In [None]:
json_list = sorted(glob("./jsons/*.json"))

### Then, loop over the files and use `fsspec.get_mapper()` to create a mapper for each one, collecting them in a list

In [None]:
m_list = []
for j in tqdm(json_list):
    with open(j) as f:
        m_list.append(fsspec.get_mapper("reference://", 
                        fo=ujson.load(f),
                        remote_protocol='s3',
                        remote_options={'anon':True}))

### Now, the mapper list can be passed directly to xarray.open_mfdataset() as long as the engine is specified as "zarr"


In [None]:
%%time
ds = xr.open_mfdataset(m_list, engine='zarr', combine='nested', concat_dim='t', 
                        coords='minimal', data_vars='minimal', compat='override',
                        parallel=True)
ds

## One JSON, multiple data files

The 1-file, 1-JSON method above doesn't scale well for larger datasets.
Instead, we can combine multiple JSONs into a single JSON describing the whole dataset.

In [None]:
mzz = MultiZarrToZarr(
    json_list,
    remote_protocol="s3",
    remote_options={'anon':True},
    xarray_open_kwargs={
        "decode_cf" : False,
        "mask_and_scale" : False,
        "decode_times" : False,
        "decode_timedelta" : False,
        "use_cftime" : False,
        "decode_coords" : False
    },
    xarray_concat_args={
        'data_vars' : 'minimal',
        'coords' : 'minimal',
        'compat' : 'override',
        'join' : 'override', 
        'combine_attrs' : 'override',
        'dim' : 't'
    }
)

This step writes the combined metadata to `combined.json`.

In [None]:
%%time
mzz.translate("combined.json")
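Conceptually, combining references along a new dimension means giving each file's chunks a position along that dimension. The toy sketch below illustrates only that idea with plain dictionaries. It is not how `MultiZarrToZarr` is implemented, it ignores array metadata and coordinates entirely, and the keys, URLs, offsets, and the `combine_refs` helper are all hypothetical.

```python
def combine_refs(per_file_refs):
    """Toy merge: prefix each chunk key with the file's index along a new leading axis."""
    combined = {}
    for i, refs in enumerate(per_file_refs):
        for key, value in refs.items():
            var, chunk_id = key.split("/")
            # "SST/0.0" from file i becomes "SST/i.0.0"
            combined[f"{var}/{i}.{chunk_id}"] = value
    return combined

# Hypothetical per-file references: one 2-D chunk per file
file_a = {"SST/0.0": ["s3://example-bucket/a.nc", 1000, 500]}
file_b = {"SST/0.0": ["s3://example-bucket/b.nc", 1200, 480]}

combined = combine_refs([file_a, file_b])
print(combined)
```

After the merge, a single key space covers both files, which is why one JSON can describe the whole dataset.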

## Now, use the `combined.json` file to open the multifile dataset

In [None]:
%%time
fs = fsspec.filesystem("reference", fo="./combined.json", remote_protocol="s3", 
                        remote_options={"anon":True}, skip_instance_cache=True)
m = fs.get_mapper("")
ds = xr.open_dataset(m, engine='zarr')

ds

### Take a subset of the data (in this case, the Gulf Stream)

### Select a single time with `.isel(t=14)`

In [None]:
%%time
subset = ds.sel(x=slice(-0.01,0.07215601),y=slice(0.12,0.09))  # reduce to the Gulf Stream region

masked = subset.SST.where(subset.DQF==0)

masked.isel(t=14).plot(vmin=14+273.15,vmax=30+273.15,cmap='inferno')

### Plot a mean along the time axis (1-day average)

In [None]:
%%time
subset = ds.sel(x=slice(-0.01,0.07215601),y=slice(0.12,0.09))  # reduce to the Gulf Stream region

masked = subset.SST.where(subset.DQF==0)

masked.mean("t", skipna=True).plot(vmin=14+273.15,vmax=30+273.15,cmap='inferno')
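The masking and averaging above can be followed by hand on a few numbers. This is a pure-Python sketch with made-up values for a single pixel over three time steps. It mimics what `.where(DQF==0)` and `.mean("t", skipna=True)` do, but it is not xarray code.

```python
import math

# Hypothetical toy values: three time steps at one pixel, SST in kelvin
sst = [290.0, 295.0, 300.0]
dqf = [0, 2, 0]  # the tutorial keeps only pixels where DQF == 0

# Like SST.where(DQF == 0): keep the selected values, replace the rest with NaN
masked = [v if q == 0 else math.nan for v, q in zip(sst, dqf)]

# Like .mean("t", skipna=True): average only the non-NaN values
valid = [v for v in masked if not math.isnan(v)]
mean_sst = sum(valid) / len(valid) if valid else math.nan

# The plot limits are given in kelvin: 14 °C and 30 °C
vmin_k = 14 + 273.15
vmax_k = 30 + 273.15

print(masked, mean_sst, vmin_k, vmax_k)
```

With `skipna=True`, a pixel flagged at some time steps still gets a mean from its remaining values, rather than becoming NaN.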

### For details on how to plot GOES data on a lat/lon grid, see [this blog post I wrote](https://lsterzinger.medium.com/add-lat-lon-coordinates-to-goes-16-goes-17-l2-data-and-plot-with-cartopy-27f07879157f)