
High-res zarr products - build tracking thread #38

Open
cisaacstern opened this issue Aug 18, 2023 · 1 comment
cisaacstern commented Aug 18, 2023

I am currently building zarr products for the high-res data. Opening this thread so we have a public place to track progress on these efforts. By way of background:

  • When complete, the output product will be two zarr data objects on LEAP Google Cloud Storage, one for mli and one for mlo, totaling ~48 TB together. These datasets will be publicly available to everyone on the internet with no egress costs. If accessed from a cloud compute node (e.g., the LEAP JupyterHub), this will allow users of the data to access the full high-res data product directly, without downloading anything.
  • Here is the data ingestion + transformation code I am using to create these zarr stores. This code leverages the pangeo-forge-recipes Python package, which uses Apache Beam as its distributed parallel computation framework; here's something I wrote recently on Beam, for those interested.
  • Once these zarr stores are complete (currently I'm debugging the long-running compute jobs), I'll devote some effort to contributing data-loading code + examples to the GitHub repo that demonstrate how to access them.

My second full-scale attempt at running these jobs has now been running for a little over 2 days:

[screenshot: status of the two running jobs]

The first time I tried this, the jobs crashed after 3 days, and I believe I've fixed the bug that caused that crash. So if this attempt just works, they'll be done by early next week, I'd guess. If these jobs crash, I'll restart them early next week, and then the next shot we'd have is the end of next week (budgeting a couple of days per attempt).

cisaacstern self-assigned this Aug 18, 2023

cisaacstern commented Aug 21, 2023

Monday update: of the two jobs left running over the weekend, the mlo job apparently succeeded, whereas the mli job failed:

[screenshot, 2023-08-21 4:01 PM: job statuses]

Still working on debugging the cause of the mli failure. As for mlo, the output dataset can be opened as shown below. A few caveats:

  • 🙂 Please do not take this to be an official release of the Zarr dataset. This is an early preview; more validation work is required before we consider it canonical.
  • ⏳ Loading the dataset with xarray takes ~4 min (on my local laptop, maybe faster on a data-adjacent compute node, e.g. the LEAP hub). This is admittedly very fast compared to the alternative of downloading all ~13 TB, but not as fast as I'd like. I have some ideas as to why this is and will open/link related issues momentarily.

And a few notes on things that seem to have worked (please correct me if anything here seems inaccurate):

  • 📆 Time is parsed into an indexable coordinate (as opposed to a data variable, as it exists in the original NetCDF files). And 210240 timesteps are present, which is the expected number of time steps, as represented here.
  • 💾 Dataset totals ~13.4 TB (uncompressed), which is a plausible size for the aggregate mlo data: 210240 time steps x 61 MB per file = ~12.8 TB on disk, which gives a compression ratio of just under 1.05. This matches almost exactly the compression ratio calculated for a single file of the mlo source data.
  • 📝 Any attributes listed in this google sheet have been added to the variables.
  • 🔢 As shown in the Details section below, chunksize is (2, 60, 21600) for the time, lev, and ncol dimensions, respectively. This means ~120 MB per chunk for mlo.
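The size figures above can be sanity-checked with some quick arithmetic (a rough sketch; the 61 MB per-file and 13.37 TB numbers are the approximate values quoted in this thread):

```python
# Back-of-envelope check of the mlo dataset-size figures quoted above.
n_timesteps = 210240               # expected number of time steps
mb_per_source_file = 61            # approx. on-disk size of one mlo NetCDF file

on_disk_tb = n_timesteps * mb_per_source_file / 1e6   # MB -> TB
uncompressed_tb = 13.369           # from ds.nbytes / 1e12 in the snippet below

print(round(on_disk_tb, 2))                        # ~12.82 TB on disk
print(round(uncompressed_tb / on_disk_tb, 3))      # compression ratio, just under 1.05
```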

The mlo (prelim/preview only, no guarantees yet! 😄 ) dataset can be loaded as follows:

import xarray as xr
path = "gs://leap-persistent-ro/data-library/climsim-highres-mlo-595733423-5882522942-1/climsim-highres-mlo.zarr"
ds = xr.open_dataset(path, engine="zarr", chunks={})  # requires `gcsfs`, takes ~4 mins on my laptop
ds.nbytes / 1e12  # -> 13.36924905984 TB
len(ds.time)  # -> 210240
ds.state_t.attrs  # -> {'long_name': 'Air temperature', 'units': 'K'}
ds
<xarray.Dataset>
Dimensions:         (time: 210240, ncol: 21600, lev: 60)
Coordinates:
  * time            (time) object 0001-02-01 00:00:00 ... 0009-01-31 23:40:00
Dimensions without coordinates: ncol, lev
Data variables: (12/16)
    cam_out_FLWDS   (time, ncol) float64 dask.array<chunksize=(2, 21600), meta=np.ndarray>
    cam_out_NETSW   (time, ncol) float64 dask.array<chunksize=(2, 21600), meta=np.ndarray>
    cam_out_PRECC   (time, ncol) float64 dask.array<chunksize=(2, 21600), meta=np.ndarray>
    cam_out_PRECSC  (time, ncol) float64 dask.array<chunksize=(2, 21600), meta=np.ndarray>
    cam_out_SOLL    (time, ncol) float64 dask.array<chunksize=(2, 21600), meta=np.ndarray>
    cam_out_SOLLD   (time, ncol) float64 dask.array<chunksize=(2, 21600), meta=np.ndarray>
    ...              ...
    state_q0003     (time, lev, ncol) float64 dask.array<chunksize=(2, 60, 21600), meta=np.ndarray>
    state_t         (time, lev, ncol) float64 dask.array<chunksize=(2, 60, 21600), meta=np.ndarray>
    state_u         (time, lev, ncol) float64 dask.array<chunksize=(2, 60, 21600), meta=np.ndarray>
    state_v         (time, lev, ncol) float64 dask.array<chunksize=(2, 60, 21600), meta=np.ndarray>
    tod             (time) int32 dask.array<chunksize=(2,), meta=np.ndarray>
    ymd             (time) int32 dask.array<chunksize=(2,), meta=np.ndarray>
Attributes:
    calendar:  NO_LEAP
    fv_nphys:  2
    ne:        30
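One practical upshot of the (2, 60, 21600) chunking: reading a short time window only touches the chunks it overlaps, so small subsets stay cheap even against the full ~13 TB store. A minimal sketch of that access pattern, using a tiny synthetic stand-in dataset with the same dimension layout (the real store is at the gs:// path above and requires `gcsfs`):

```python
import numpy as np
import xarray as xr

# Tiny synthetic stand-in with the mlo dimension layout (time, lev, ncol);
# the real data lives at the gs:// zarr path shown above.
ds = xr.Dataset(
    {"state_t": (("time", "lev", "ncol"), np.zeros((8, 60, 100)))}
).chunk({"time": 2})  # mirrors the time chunking of the zarr store

# Selecting a narrow time window is lazy and overlaps only a few chunks,
# so against the real store this reads a small fraction of the data.
subset = ds.state_t.isel(time=slice(0, 2), lev=0)
print(subset.shape)  # (2, 100)
arr = subset.load()  # data is actually read here, not before
```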
