In [4]:
import json
import fsspec
import kerchunk.hdf
import kerchunk.combine
import xarray as xr

We are looking at *daily* NClimGrid data (temperature and precipitation). The days are bundled into monthly NetCDF files, so the length of the time dimension differs from file to file. This makes the files different sizes, which may be related to the problems I'm having.
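As a quick illustration of how the file sizes differ, here is a minimal sketch that lists the number of daily time steps each monthly file should hold. The helper name `days_per_file` is made up for this note, and it assumes each file contains exactly one calendar month of days.

```python
import calendar

def days_per_file(year, months):
    """Map each month number to its number of days in the given year.

    Hypothetical helper: assumes one monthly file holds exactly one
    calendar month of daily records.
    """
    return {month: calendar.monthrange(year, month)[1] for month in months}

# Months July through December of 1970, matching the first URL list below
lengths = days_per_file(1970, range(7, 13))
print(lengths)
```

The time dimension therefore alternates between 30 and 31 steps for these months (and would be 28 or 29 for February).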

In [7]:
def make_full_path(variable, year, month):
    return f'https://nclimgridwesteurope.blob.core.windows.net/nclimgrid/nclimgrid-daily/beta/by-month/{year}/{month:02d}/{variable}-{year}{month:02d}-grd-scaled.nc'

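def del_original_encoding(ds):
    # Drop the chunk encoding inherited from the source NetCDF files
    for var in ds:
        # pop() avoids a KeyError for variables with no 'chunks' encoding
        ds[var].encoding.pop('chunks', None)
    return ds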

This succeeds when the series starts in July. With this start month you can also span multiple years.
Note: Years prior to 1970 use a different file naming scheme, so do not use them here.

In [8]:
urls = [make_full_path('prcp', year, month) for year in range(1970, 1971) for month in range(7, 13)]

# Kerchunk the reference data
translated = []
for url in urls:
    with fsspec.open(url) as fobj:
        h5chunks = kerchunk.hdf.SingleHdf5ToZarr(fobj, url)
        translated.append(h5chunks.translate())

# Concatenate along time
mzz = kerchunk.combine.MultiZarrToZarr(
    translated,
    remote_protocol='https',
    xarray_concat_args={
        "dim": "time"
    }
)

# Convert and export to json
output_file = 'daily-prcp.json'
mzz.translate(output_file)

# Take a look
with open(output_file) as json_file:
    d = json.load(json_file)
rfs = fsspec.filesystem("reference", fo=d)
m = rfs.get_mapper("")
ds = xr.open_dataset(m, engine='zarr', backend_kwargs={'consolidated': False})
ds

Now change the start month to August. No dice. It also fails with a start month of January, the month in which the time series starts.

In [9]:
urls = [make_full_path('prcp', year, month) for year in range(1970, 1971) for month in range(8, 13)]

# Kerchunk the reference data
translated = []
for url in urls:
    with fsspec.open(url) as fobj:
        h5chunks = kerchunk.hdf.SingleHdf5ToZarr(fobj, url)
        translated.append(h5chunks.translate())

# Concatenate along time
mzz = kerchunk.combine.MultiZarrToZarr(
    translated,
    remote_protocol='https',
    xarray_concat_args={
        "dim": "time"
    }
)

# Convert and export to json
output_file = 'daily-prcp.json'
mzz.translate(output_file)

NotImplementedError: Specified zarr chunks encoding['chunks']=(1, 144, 337) for variable named 'prcp' would overlap multiple dask chunks ((1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), (144, 3, 141, 6, 138, 9, 135, 12, 8), (337, 6, 331, 12, 325, 18, 319, 24, 13)). Writing this array in parallel with dask could lead to corrupted data. Consider either rechunking using `chunk()`, deleting or modifying `encoding['chunks']`, or specify `safe_chunks=False`.

As the error message suggests, you can delete the original (NetCDF) chunk encoding that comes along for the ride. See https://github.com/pydata/xarray/issues/5219.

In [10]:
urls = [make_full_path('prcp', year, month) for year in range(1970, 1971) for month in range(8, 13)]

# Kerchunk the reference data
translated = []
for url in urls:
    with fsspec.open(url) as fobj:
        h5chunks = kerchunk.hdf.SingleHdf5ToZarr(fobj, url)
        translated.append(h5chunks.translate())

# Concatenate along time
mzz = kerchunk.combine.MultiZarrToZarr(
    translated,
    remote_protocol='https',
    preprocess=del_original_encoding,
    xarray_concat_args={
        "dim": "time"
    }
)

# Convert and export to json
output_file = 'daily-prcp.json'
mzz.translate(output_file)

ValueError: Zarr requires uniform chunk sizes except for final chunk. Variable named 'prcp' has incompatible dask chunks: ((1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), (144, 3, 141, 6, 138, 9, 135, 12, 8), (337, 6, 331, 12, 325, 18, 319, 24, 13)). Consider rechunking using `chunk()`.

However, doing so results in a different error. Evidently Zarr needs all chunks along a dimension to be the same size except the last one, while Dask has no such requirement. I don't understand how Dask chunks are part of this.
- If you look at the chunk sizes for a successful dataset, they are regular. So the issue lies somewhere in chunk sizing.
- I'm going to attempt to better understand chunk sizes with respect to NetCDF, Zarr, and Dask.
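A minimal sketch of one possible explanation for the irregular chunk tuples in the errors above, offered as a guess rather than a diagnosis. The axis lengths 596 and 1385 are just the sums of the chunk tuples in the error messages. The per-file chunk sizes (144 and 147 on one axis, 337 and 343 on the other) are inferred by me and were not read from the files. The helper names are made up for this note. The idea: if two files use different regular chunk sizes on the same axis, overlaying both grids gives chunk boundaries at every edge of either grid.

```python
def chunk_edges(length, chunk):
    """End offsets of a regular grid of `chunk`-sized pieces over `length`."""
    edges = list(range(chunk, length, chunk))
    edges.append(length)  # the final (possibly shorter) chunk ends at `length`
    return edges

def merged_chunks(length, chunk_sizes):
    """Chunk lengths produced by overlaying several regular grids on one axis."""
    edges = sorted({e for c in chunk_sizes for e in chunk_edges(length, c)})
    return tuple(end - start for start, end in zip([0] + edges, edges))

def is_zarr_compatible(chunks):
    """True if every chunk except the last is the same size and the last is no larger."""
    if len(chunks) <= 1:
        return True
    first = chunks[0]
    return all(c == first for c in chunks[:-1]) and chunks[-1] <= first

# Assumed (inferred, not measured) per-file chunk sizes on each spatial axis
lat_chunks = merged_chunks(596, [144, 147])
lon_chunks = merged_chunks(1385, [337, 343])

print(lat_chunks, is_zarr_compatible(lat_chunks))
print(lon_chunks, is_zarr_compatible(lon_chunks))
print(merged_chunks(596, [144]), is_zarr_compatible(merged_chunks(596, [144])))
```

Under these assumed sizes the overlay gives `(144, 3, 141, 6, 138, 9, 135, 12, 8)` and `(337, 6, 331, 12, 325, 18, 319, 24, 13)`, the same tuples that appear in the error messages, while a single grid of 144 stays regular. That would be consistent with the source files not all sharing one chunk shape, but it still needs to be checked against the actual files.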

I also edited the `to_zarr` call on line 180 of Kerchunk's `combine.py` to include `safe_chunks=False`, but that resulted in a different error.

It's not clear to me why the different chunk sizes matter for Kerchunk. The output is just a JSON file with offsets and lengths into the binary data, no? We are not creating an actual Zarr array.
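One thing that may bear on this: the reference set is not only offsets and lengths. It also carries Zarr metadata, and a Zarr v2 `.zarray` entry records exactly one `chunks` shape per array. A rough sketch for comparing that shape across the per-file reference sets follows. It assumes each item in `translated` is a dict whose keys (possibly nested under `"refs"`) include `"<variable>/.zarray"` holding JSON text or an already-parsed dict. The function names and the sample data are made up for illustration.

```python
import json

def zarray_chunks(reference, variable):
    """Return the chunk shape recorded for `variable` in one reference dict."""
    # Assumption: version-1 references nest keys under "refs"; otherwise flat
    refs = reference.get("refs", reference)
    meta = refs[f"{variable}/.zarray"]
    if isinstance(meta, bytes):
        meta = meta.decode()
    if isinstance(meta, str):
        meta = json.loads(meta)
    return tuple(meta["chunks"])

def chunk_shapes_by_file(references, variable):
    """Map each reference set's position in the list to its chunk shape."""
    return {i: zarray_chunks(ref, variable) for i, ref in enumerate(references)}

# Synthetic stand-ins for two translated files (not real NClimGrid metadata)
example_refs = [
    {"version": 1, "refs": {"prcp/.zarray": json.dumps({"chunks": [1, 144, 337]})}},
    {"version": 1, "refs": {"prcp/.zarray": json.dumps({"chunks": [1, 147, 343]})}},
]

shapes = chunk_shapes_by_file(example_refs, "prcp")
print(shapes)
print("uniform" if len(set(shapes.values())) == 1 else "mixed chunk shapes")
```

Running `chunk_shapes_by_file(translated, 'prcp')` on the real per-file references should show whether the monthly files disagree on chunk shape. If they do, a single combined `.zarray` cannot describe all of them, which might be why chunk sizes matter even though no array data is written.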