# Generate Kerchunk Files from ESGF Data Hosted Through Globus
As a proof of concept, let's create a kerchunk reference file for CMIP6 data stored at Argonne National Laboratory, one of the data nodes in the Earth System Grid Federation (ESGF).

## Imports

In [1]:
import xarray as xr
import fsspec
from fsspec.implementations.http import HTTPFileSystem
from kerchunk.hdf import SingleHdf5ToZarr
from pathlib import Path
import os
import ujson

## Read the dataset using xarray - without kerchunk

In [2]:
url  = "https://g-52ba3.fd635.8443.data.globus.org//css03_data/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/abrupt-4xCO2/r9i1p1f1/Omon/epsi100/gn/v20180914/epsi100_Omon_IPSL-CM6A-LR_abrupt-4xCO2_r9i1p1f1_gn_185009-185508.nc"

In [3]:
%%time
ds = xr.open_dataset(f"{url}#mode=bytes")
ds

CPU times: user 180 ms, sys: 61.3 ms, total: 241 ms
Wall time: 56 s


## Create a Kerchunk Reference File
Let's create a kerchunk reference file by reading the data over HTTPS, then save the reference file locally.

In [4]:
fs = fsspec.filesystem('https')  # filesystem for the remote ESGF files

fs2 = fsspec.filesystem('')  # local filesystem to save the final JSON files to

so = dict()  # extra keyword arguments for fs.open(); none are needed here
# On filesystems that support it, default_fill_cache=False avoids caching data
# between file chunks, which lowers memory usage.

def gen_json(file_url):
    with fs.open(file_url, **so) as infile:
        # inline_threshold is the size (in bytes) below which binary blocks are
        # included directly in the output; a higher threshold can produce a
        # larger JSON file but faster loading
        h5chunks = SingleHdf5ToZarr(infile, file_url, inline_threshold=300)
        variable = file_url.split('/')[-1].split('.')[0]  # file name without its extension
        host = file_url.split('/')[2]  # host name of the data node
        outf = f'{host}_{variable}.json'  # file name to save the JSON to
        with fs2.open(outf, 'wb') as f:
            f.write(ujson.dumps(h5chunks.translate()).encode())
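
The output file name in `gen_json` is built from two pieces of the URL: the third `/`-separated element (the host name of the data node) and the final path element up to its first dot. As a minimal sketch of that naming rule, the hypothetical helper `reference_filename` below (not part of the original notebook) does the same thing with `urllib.parse` instead of positional string splits, which may be easier to read and does not depend on counting slashes.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def reference_filename(file_url: str) -> str:
    """Return '<host>_<file stem>.json' for a remote file URL.

    Hypothetical helper for illustration; it mirrors the naming
    used in gen_json but is not part of kerchunk or fsspec.
    """
    parsed = urlparse(file_url)
    host = parsed.netloc  # e.g. the Globus HTTPS endpoint host name
    # Keep everything before the first dot of the file name
    stem = PurePosixPath(parsed.path).name.split(".")[0]
    return f"{host}_{stem}.json"


example_url = (
    "https://g-52ba3.fd635.8443.data.globus.org//css03_data/CMIP6/CMIP/IPSL/"
    "IPSL-CM6A-LR/abrupt-4xCO2/r9i1p1f1/Omon/epsi100/gn/v20180914/"
    "epsi100_Omon_IPSL-CM6A-LR_abrupt-4xCO2_r9i1p1f1_gn_185009-185508.nc"
)
print(reference_filename(example_url))
```

Because the host name is part of the file name, references generated for the same file from different data nodes would not overwrite each other.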

### Generate the kerchunk file from the URL

In [5]:
gen_json(url)
!ls *.json

g-52ba3.fd635.8443.data.globus.org_epsi100_Omon_IPSL-CM6A-LR_abrupt-4xCO2_r9i1p1f1_gn_185009-185508.json
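
To get a feel for what the generated JSON contains, it can help to count its entries by kind. The sketch below assumes the version 1 kerchunk reference layout, in which a top-level `refs` mapping holds Zarr metadata keys (such as `.zgroup` or `.zarray`) as strings, small chunks inlined as strings, and larger chunks as `[url, offset, length]` lists. The helper `summarize_refs` and the toy reference dictionary are hypothetical and exist only for illustration; the toy values are not taken from the real file.

```python
def summarize_refs(reference: dict) -> dict:
    """Count metadata, inlined, and remote entries in a kerchunk-style reference dict."""
    refs = reference.get("refs", reference)
    summary = {"metadata": 0, "inline": 0, "remote": 0, "remote_bytes": 0}
    for key, value in refs.items():
        name = key.rsplit("/", 1)[-1]
        if name.startswith(".z"):
            # Zarr metadata such as .zgroup, .zarray, .zattrs
            summary["metadata"] += 1
        elif isinstance(value, str):
            # Chunk small enough to be embedded directly in the JSON
            summary["inline"] += 1
        else:
            # Chunk stored remotely as [url, offset, length]
            summary["remote"] += 1
            if len(value) == 3:
                summary["remote_bytes"] += value[2]
    return summary


# Toy reference set with made-up offsets and lengths (assumption, not real data)
toy_reference = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "time/.zarray": "{}",
        "time/0": "base64:AAAA",
        "epsi100/.zarray": "{}",
        "epsi100/0.0.0": ["https://example.org/data.nc", 4096, 1024],
        "epsi100/1.0.0": ["https://example.org/data.nc", 5120, 2048],
    },
}
print(summarize_refs(toy_reference))
```

To apply this to the file written above, load it with `json.load` (or `ujson.load`) and pass the resulting dictionary to `summarize_refs`. Raising `inline_threshold` would be expected to move entries from the remote count to the inline count.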


## Read the Kerchunk File Back in and Compare with IO time from http
Compare the wall time below with the 56 s it took to open the same file directly over HTTPS.

In [6]:
%%time
ds = xr.open_dataset("reference://",
                     engine="zarr",
                     backend_kwargs={
                         "consolidated": False,
                         "storage_options": {
                             "fo": 'g-52ba3.fd635.8443.data.globus.org_epsi100_Omon_IPSL-CM6A-LR_abrupt-4xCO2_r9i1p1f1_gn_185009-185508.json',
                             "remote_protocol": "https",}
                     }
                    )
ds

CPU times: user 56.1 ms, sys: 12 ms, total: 68.1 ms
Wall time: 2.77 s
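
The two wall times reported in this notebook can be turned into a rough speedup factor. This is a minimal sketch using a hypothetical helper, `speedup`; the numbers come from single runs shown above, so the ratio is indicative only and will vary with network conditions and caching.

```python
def speedup(baseline_seconds: float, new_seconds: float) -> float:
    """Return how many times faster the new timing is than the baseline."""
    if new_seconds <= 0:
        raise ValueError("new_seconds must be positive")
    return baseline_seconds / new_seconds


direct_wall_s = 56.0    # xr.open_dataset on the "#mode=bytes" URL
kerchunk_wall_s = 2.77  # xr.open_dataset through the reference file

ratio = speedup(direct_wall_s, kerchunk_wall_s)
print(f"Opening via the reference file was about {ratio:.1f}x faster")
```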
