# Create Kerchunk Reference Files from ARISE Data on AWS

## Imports

In [3]:
import fsspec
from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr
from distributed import Client, LocalCluster
import dask
import ujson 
import nc_time_axis
import glob
import xarray as xr

## Spin up a Cluster
Let's spin up a Dask Cluster on our local machine! This will help compute our reference files in parallel.

In [4]:
cluster = LocalCluster()
client = Client(cluster)
client

Client: distributed.LocalCluster (connection method: Cluster object)
Dashboard: http://127.0.0.1:8787/status
Workers: 4 (3 threads and 4.00 GiB memory each)
Total threads: 12, Total memory: 16.00 GiB
Status: running, Using processes: True


## Create our Reference Files
You only need to do this once for each file on Amazon S3.

In [19]:
fs = fsspec.filesystem('s3',
                       skip_instance_cache=True)

### Set Up our AWS Credentials - **Do This Before Running this Section**

We need to set which bucket to use. Before running this notebook or working through this analysis, make sure to set up your credentials (email Brian Dobbins from NCAR if you need the credentials) using the [AWS Command Line Interface (CLI)](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html).

Once you install the CLI, go to the command line and run:

```bash
aws configure
```

This will prompt you for the credentials.

### List Files on the Bucket

In [20]:
bucket = 'sl-ncar-test-bucket'

In [None]:
files = fs.glob(f's3://{bucket}/proc/tseries/month_1/*')

We need to add the `s3://` prefix to each of these paths, since they are on AWS.

In [27]:
urls = ["s3://" + f for f in files]

so = dict(mode='rb', anon=True, default_fill_cache=False, default_cache_type='first')
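If some paths might already carry the scheme, prepending `s3://` unconditionally would produce a malformed URL. The sketch below adds the prefix only when it is missing. The helper name `to_s3_url` and the object key in the example are hypothetical and only used for illustration.

```python
def to_s3_url(path: str) -> str:
    """Return `path` with an s3:// scheme, adding it only when it is missing."""
    return path if path.startswith("s3://") else f"s3://{path}"


# Hypothetical object key, used only to show the behaviour
example_files = ["sl-ncar-test-bucket/proc/tseries/month_1/example.nc"]
example_urls = [to_s3_url(f) for f in example_files]
print(example_urls)
```

Applying the helper to a path that already starts with `s3://` leaves it unchanged, so it is safe to run more than once.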

### Set Up a Function to Generate our Reference Files

In [28]:
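def gen_json(u):
    """Write a kerchunk reference JSON for the file at URL `u`."""
    with fs.open(u, **so) as infile:
        # Chunks smaller than 300 bytes are stored inline in the JSON
        h5chunks = SingleHdf5ToZarr(infile, u, inline_threshold=300)
        # Name the output after the source file; the jsons/ directory must already exist
        outf = f"jsons/{u.split('/')[-1]}.json"
        print(outf)
        with open(outf, 'wb') as f:
            f.write(ujson.dumps(h5chunks.translate()).encode())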
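To get a feel for what one of these reference files contains, the sketch below counts the three kinds of entries in a kerchunk version 1 reference: Zarr metadata keys, small chunks stored inline as strings, and remote chunks stored as `[url, offset, length]` lists. The dictionary here is a hand-written toy, not real output from `gen_json`; the variable name `TREFHT`, the bucket name, and the helper `summarize_refs` are assumptions made for illustration.

```python
def summarize_refs(reference: dict) -> dict:
    """Count metadata, inline, and remote entries in a kerchunk reference dict."""
    refs = reference.get("refs", reference)
    summary = {"metadata": 0, "inline": 0, "remote": 0}
    for key, value in refs.items():
        name = key.split("/")[-1]
        if name in (".zgroup", ".zarray", ".zattrs", ".zmetadata"):
            summary["metadata"] += 1  # Zarr metadata stored as JSON text
        elif isinstance(value, list):
            summary["remote"] += 1    # [url, offset, length] pointer into the source file
        else:
            summary["inline"] += 1    # small chunk embedded directly in the JSON
    return summary


# Toy reference with hypothetical names and byte ranges
toy_reference = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "time/.zarray": "{}",
        "time/0": "base64:AAAA",
        "TREFHT/.zarray": "{}",
        "TREFHT/0.0.0": ["s3://hypothetical-bucket/file.nc", 4096, 1024],
    },
}
toy_summary = summarize_refs(toy_reference)
print(toy_summary)
```

On a real file you could load the JSON first and pass the resulting dictionary to the same function.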

Now that we have a function that operates on each file, let's compute the references in parallel using Dask.

In [None]:
%%time
dask.compute(*[dask.delayed(gen_json)(u) for u in urls], retries=10);
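Since the references only need to be generated once per file, it can help to skip any file whose JSON already exists before re-running the computation. This is a minimal sketch, assuming the same `<file name>.json` naming that `gen_json` uses; `pending_urls` is a hypothetical helper, not part of kerchunk or Dask.

```python
import os


def pending_urls(urls, out_dir="jsons"):
    """Return the URLs that do not yet have a reference JSON in `out_dir`."""
    todo = []
    for u in urls:
        # Mirror the naming used by gen_json: <file name>.json
        outf = os.path.join(out_dir, f"{u.split('/')[-1]}.json")
        if not os.path.exists(outf):
            todo.append(u)
    return todo


# Make sure the output directory exists before any worker tries to write to it
os.makedirs("jsons", exist_ok=True)
```

You could then pass `pending_urls(urls)` instead of `urls` when building the list of delayed tasks.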

## Read back in the jsons (**Start here if you already have the reference files**)
Let's start by listing the JSON files we wish to combine into a single dataset. You could combine all the variables, but let's just start with the last 10.

In [5]:
furls = sorted(glob.glob('jsons/*'))[-10:]

Now that we have our reference files, we can combine them into a single Zarr dataset.

In [7]:
mzz = MultiZarrToZarr(
    furls,
    remote_protocol="s3",
    concat_dims=["time"]
)

# Combine the references into a single file we can read into xarray
out = mzz.translate('merged-data.json')
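Before opening the merged file with xarray, a quick check of which variables ended up in it can catch a bad combine early. The sketch below derives variable names from the reference keys, assuming a flat (ungrouped) Zarr layout where keys look like `<variable>/<chunk or metadata>`. The helper `list_variables` and the toy keys are hypothetical.

```python
import json


def list_variables(refs: dict) -> list:
    """Return the sorted variable names found in a flat kerchunk `refs` mapping."""
    # Top-level keys such as ".zgroup" have no "/" and are skipped
    return sorted({key.split("/")[0] for key in refs if "/" in key})


# Toy mapping with hypothetical variable names
toy_refs = {
    ".zgroup": '{"zarr_format": 2}',
    "time/.zarray": "{}",
    "time/0": "base64:AAAA",
    "TREFHT/.zarray": "{}",
    "TREFHT/0.0.0": ["s3://hypothetical-bucket/file.nc", 4096, 1024],
}
toy_variables = list_variables(toy_refs)
print(toy_variables)

# With the real file, something like this should work:
# with open("merged-data.json") as f:
#     print(list_variables(json.load(f)["refs"]))
```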

### Test Loading the dataset into an xarray dataset
Now that we have our kerchunk zarr reference dataset, we can read this into xarray!

In [8]:
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "storage_options": {
            "fo": "merged-data.json",
            "remote_protocol": "s3",
        },
        "consolidated": False
    },
    chunks={}
)