### Import `kerchunk` and make sure it's at the latest version

In [None]:
import xarray as xr
from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr
import fsspec
from glob import glob
import json
import earthaccess

__Important: Due to NASA cloud data limitations, this notebook MUST be run in AWS us-west-2__

## Authenticate with Earthdata Login

In [None]:
auth = earthaccess.login(persist=True)

### Find granules matching the search parameters and get their S3 URLs

In [None]:
granules = earthaccess.search_data(short_name='M2T1NXSLV', version='5.12.4', temporal=('2024-03-01', '2024-03-31'))

In [None]:
urls = [granule.data_links(access='direct', in_region=True)[0] for granule in granules]
urls

Set up an `s3fs` session with authentication

In [None]:
fs = earthaccess.get_s3fs_session(daac='GES_DISC')

### Opening directly with xarray is slow, even in the same region:

In [None]:
%%time
ds = xr.open_mfdataset(
    [fs.open(url) for url in urls], 
)
ds

## Example of creating a single reference

In [None]:
u = urls[0]
u

Open the file object with `fs.open()`, and process it with Kerchunk. The `SingleHdf5ToZarr` class takes the file object and its URL as required arguments. The `inline_threshold` parameter sets the size in bytes below which a chunk is stored directly in the reference metadata instead of as a referenced byte range.

`.translate()` returns a `dict` of extracted metadata for use with `ReferenceFileSystem`. This can be aggregated with other metadata or written to disk.
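
To make the inlining rule concrete, here is a minimal sketch of the decision. It is not kerchunk's actual code: the helper `make_ref_entry`, the sample URL, the threshold value, and the strict less-than comparison are all illustrative assumptions. The `base64:` prefix follows the reference format's convention for inline binary data.

```python
import base64


def make_ref_entry(data, url, offset, inline_threshold):
    """Return inline data for small chunks, else a [url, offset, length] reference.

    Hypothetical helper for illustration only.
    """
    if len(data) < inline_threshold:
        try:
            # Plain ASCII chunks can be stored as-is
            return data.decode("ascii")
        except UnicodeDecodeError:
            # Binary chunks are base64-encoded and marked with a prefix
            return "base64:" + base64.b64encode(data).decode("ascii")
    # Large chunks stay in the original file and are referenced by byte range
    return [url, offset, len(data)]


url = "s3://example-bucket/example.nc4"  # hypothetical URL
small = make_ref_entry(b"\xff\xfe\x80", url, 1024, inline_threshold=300)
large = make_ref_entry(bytes(4096), url, 2048, inline_threshold=300)
print(small)
print(large)
```

A higher threshold produces a larger reference file but fewer small remote reads when the data is opened later.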

In [None]:
%%time
with fs.open(u) as infile:
    reference = SingleHdf5ToZarr(infile, u).translate()

In [None]:
from IPython.display import JSON

In [None]:
JSON(reference)
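
Outside a notebook, or for a large reference, a quick summary can be easier to read than the interactive widget. The sketch below counts the three kinds of entries found under `refs`: Zarr metadata, inline data, and byte-range references. The helper `summarize_refs` and the contents of `toy_reference` (variable names, URL, offsets) are made up for illustration; in practice the `reference` dictionary from above would be passed in instead.

```python
import json


def summarize_refs(reference):
    """Count metadata, inline-data, and byte-range entries in a reference dict."""
    counts = {"metadata": 0, "inline": 0, "byte_range": 0}
    for key, value in reference["refs"].items():
        name = key.rsplit("/", 1)[-1]
        if name in (".zgroup", ".zattrs", ".zarray"):
            counts["metadata"] += 1      # Zarr metadata stored as JSON text
        elif isinstance(value, list):
            counts["byte_range"] += 1    # [url, offset, length]
        else:
            counts["inline"] += 1        # chunk data stored in the reference itself
    return counts


# Toy reference with made-up keys and values, shaped like a version 1 reference
toy_reference = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        "time/.zarray": json.dumps({"shape": [24], "chunks": [24]}),
        "time/0": "base64:AAAA",
        "T2M/0.0.0": ["s3://example-bucket/example.nc4", 30000, 250000],
    },
}
print(summarize_refs(toy_reference))
```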

## Create references for all files in `urls`

## With a Dask cluster
[Dask](https://dask.org/) is a Python package for parallelizing Python code. This section starts a local client, using whatever processors are available on the current machine. The same code also works with [Dask clusters](https://docs.dask.org/en/stable/deploying.html).

In [None]:
import dask
from dask.distributed import Client

client = Client()
client

## Define a function to return a reference dictionary for a given S3 file URL

This function does the following:
1. Use `fs.open()` to open the file given by `url`
2. Pass the file object `inf` and the URL `url` to `SingleHdf5ToZarr()`, and generate the reference with `.translate()`

The returned object is a dictionary.

In [None]:
def gen_ref(url):
    with fs.open(url) as inf:
        return SingleHdf5ToZarr(inf, url).translate()

### Map `gen_ref` to each URL in `urls` and compute
`dask.delayed` wraps a function call so that it runs lazily. The next two code blocks wrap `gen_ref()` once per URL in `urls`, producing a list of delayed tasks, and then pass those tasks to `dask.compute()`, which runs `gen_ref()` in parallel on as many workers as the Dask client has available.

_Note: if running interactively on Binder, this will take a while since only one worker is available and the references will have to be generated serially. See the option for loading from JSON files below._

In [None]:
dicts_delayed = [dask.delayed(gen_ref)(url) for url in urls]
dicts_delayed

In [None]:
%time dicts = dask.compute(*dicts_delayed)

Now, each URL in `urls` has been used to generate a dictionary of reference data, stored in `dicts`.
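
`dask.compute()` returns its results in the same order as its inputs, so the results line up with `urls`. One thing the plain version does not handle is a single unreadable file aborting the whole batch. The sketch below shows one way to guard against that. It uses the standard library's `ThreadPoolExecutor` as a stand-in for Dask so that it runs without a cluster, and `fake_gen_ref`, `safe_gen_ref`, and the URLs are hypothetical; the same wrapper could be passed to `dask.delayed` instead.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_gen_ref(url):
    """Stand-in for gen_ref(): fails for one URL to demonstrate error handling."""
    if url.endswith("bad.nc4"):
        raise OSError(f"cannot read {url}")
    return {"version": 1, "refs": {"time/0": [url, 0, 100]}}


def safe_gen_ref(url):
    """Return (url, reference, error) so one failure does not abort the batch."""
    try:
        return url, fake_gen_ref(url), None
    except Exception as exc:
        return url, None, exc


# Hypothetical URLs for illustration
toy_urls = [
    "s3://example-bucket/day1.nc4",
    "s3://example-bucket/bad.nc4",
    "s3://example-bucket/day3.nc4",
]

with ThreadPoolExecutor(max_workers=2) as pool:
    # Executor.map() yields results in input order
    results = list(pool.map(safe_gen_ref, toy_urls))

good = {url: ref for url, ref, err in results if err is None}
failed = [url for url, ref, err in results if err is not None]
print(sorted(good))
print(failed)
```

Failed URLs can then be retried on their own rather than recomputing every reference.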

### _Save/load references to/from JSON files (optional)_
The individual dictionaries can be saved as JSON files if desired.

In [None]:
import os
os.makedirs('./jsons', exist_ok=True)

In [None]:
for d in dicts:
    # Name each JSON file after its source granule:
    # read the `Filename` global attribute stored in `.zattrs`
    # (a JSON string) and replace the .nc4 extension with .json
    json_name = json.loads(d['refs']['.zattrs'])['Filename'].replace('.nc4', '.json')

    with open(f'./jsons/{json_name}', 'w') as outf:
        json.dump(d, outf)
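
The save step, together with a matching load step, can be sketched as a small round trip. This version derives each file name from the URL rather than from a file attribute, which is an assumption that the URL ends in the granule's file name. The helpers `json_name_from_url`, `save_refs`, and `load_refs`, and the toy URLs and references, are hypothetical.

```python
import json
import os
import posixpath
import tempfile


def json_name_from_url(url):
    """Turn '.../name.nc4' into 'name.json' using only the last path component."""
    stem, _ext = posixpath.splitext(posixpath.basename(url))
    return stem + ".json"


def save_refs(refs_by_url, out_dir):
    """Write one JSON file per reference dictionary."""
    os.makedirs(out_dir, exist_ok=True)
    for url, ref in refs_by_url.items():
        with open(os.path.join(out_dir, json_name_from_url(url)), "w") as outf:
            json.dump(ref, outf)


def load_refs(out_dir):
    """Load every .json file in out_dir, sorted by file name."""
    loaded = []
    for name in sorted(os.listdir(out_dir)):
        if name.endswith(".json"):
            with open(os.path.join(out_dir, name)) as inf:
                loaded.append(json.load(inf))
    return loaded


# Toy references keyed by made-up URLs
toy = {
    "s3://example-bucket/path/file_20240302.nc4": {"version": 1, "refs": {"time/0": "b"}},
    "s3://example-bucket/path/file_20240301.nc4": {"version": 1, "refs": {"time/0": "a"}},
}
with tempfile.TemporaryDirectory() as tmp:
    save_refs(toy, tmp)
    names = sorted(os.listdir(tmp))
    reloaded = load_refs(tmp)
print(names)
```

Sorting by file name keeps the reloaded references in date order as long as the date is part of the name.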

***
### Use `MultiZarrToZarr` to combine the 31 individual references into a single reference
This example passes a glob pattern that matches the saved `.json` files; `MultiZarrToZarr` also accepts a list of paths or a list of reference dictionaries such as `dicts`.

Reading the referenced data requires S3 access credentials, which earthaccess can provide.

In [None]:
creds = earthaccess.get_s3_credentials(daac='GES_DISC')

In [None]:
r_opts = {
    'anon': False,
    'key': creds['accessKeyId'],
    'secret': creds['secretAccessKey'],
    'token': creds['sessionToken'],
}

mzz = MultiZarrToZarr(
    './jsons/*.json',
    concat_dims='time',
    remote_options=r_opts,
    coo_map={'time' :'cf:time'},
)
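
Conceptually, concatenating along `time` means that chunk keys from later files are renumbered so they follow the chunks from earlier files. The toy sketch below illustrates only that idea. It is not how `MultiZarrToZarr` is implemented: the helper `concat_along_first_axis` is hypothetical, it assumes exactly one chunk along the first axis per input, and it ignores coordinates and metadata, all of which kerchunk handles.

```python
def concat_along_first_axis(ref_dicts, variable):
    """Toy concat: renumber the leading chunk index of `variable` across inputs.

    Assumes each input holds a single chunk along the first axis.
    """
    combined = {}
    for i, ref in enumerate(ref_dicts):
        for key, value in ref["refs"].items():
            var, _, chunk = key.partition("/")
            # Skip other variables and metadata keys such as '.zarray'
            if var != variable or chunk.startswith("."):
                continue
            rest = chunk.split(".")[1:]
            combined[f"{var}/" + ".".join([str(i)] + rest)] = value
    return combined


# Two made-up single-file references, each with one chunk of T2M
day1 = {"version": 1, "refs": {".zgroup": "{}", "T2M/0.0.0": ["s3://example-bucket/day1.nc4", 100, 500]}}
day2 = {"version": 1, "refs": {".zgroup": "{}", "T2M/0.0.0": ["s3://example-bucket/day2.nc4", 100, 500]}}

combined = concat_along_first_axis([day1, day2], "T2M")
print(combined)
```

Each combined key still points at a byte range in its original file, which is why the combined reference stays small even when it spans many granules.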

References can be saved to a file (`combined.json`) or returned as a dictionary (`mzz_dict`).

In [None]:
%time mzz.translate('./combined.json')
# mzz_dict = mzz.translate()