
High-res zarr products - build tracking thread #38

Open
cisaacstern opened this issue Aug 18, 2023 · 1 comment
cisaacstern commented Aug 18, 2023

I am currently building zarr products for the high-res data. Opening this thread so we have a public place to track progress on these efforts. By way of background:

  • When complete, the output product will be two zarr data objects on LEAP Google Cloud Storage, one for mli and one for mlo, totaling ~48 TB together. These datasets will be publicly available to everyone on the internet with no egress costs. If accessed from a cloud compute node (e.g., the LEAP JupyterHub), this will allow users of the data to access the full high-res data product directly, without downloading anything.
  • Here is the data ingestion + transformation code I am using to create these zarr stores. This code leverages the pangeo-forge-recipes Python package, which uses Apache Beam as its distributed parallel computation framework; here's something I wrote recently on Beam, for those interested.
  • Once these zarr stores are complete (currently I'm debugging the long-running compute jobs), I'll devote some effort to contributing data-loading code + examples to the GitHub repo that demonstrate how to access them.

My second full-scale attempt at running these jobs has now been running for a little over 2 days:

[screenshot: status of the two running jobs]

The first time I tried this, the jobs crashed after 3 days, and I believe I've fixed the bug that caused that crash. So if this attempt just works, they'll be done by early next week, I'd guess. If these jobs crash, I'll restart them early next week, and then the next shot we'd have is the end of next week (budgeting a couple of days per attempt).

cisaacstern self-assigned this Aug 18, 2023

cisaacstern commented Aug 21, 2023

Monday update: of the two jobs left running over the weekend, the mlo job apparently succeeded, whereas the mli job failed:

[screenshot, 2023-08-21 4:01 PM: job statuses]

Still working on debugging the cause of the mli failure. As for mlo, the output dataset can be opened as shown below. A few caveats:

  • 🙂 Please do not take this to be an official release of the Zarr dataset. This is an early preview; more validation work is required before we consider it canonical.
  • ⏳ Loading the dataset with xarray takes ~4 min (on my local laptop, maybe faster on a data-adjacent compute node, e.g. the LEAP hub). This is admittedly very fast compared to the alternative of downloading all ~13 TB, but not as fast as I'd like. I have some ideas as to why this is and will open/link related issues momentarily.

And a few notes on things that seem to have worked (please correct me if anything here seems inaccurate):

  • 📆 Time is parsed into an indexable coordinate (as opposed to a data variable, as it exists in the original NetCDF files). And 210240 timesteps are present, which is the expected number of time steps, as represented here.
  • 💾 Dataset totals ~13.4 TB (uncompressed), which is a plausible size for the aggregate mlo data: 210240 time steps x 61 MB per file = ~12.8 TB on disk, which gives a compression ratio of just under 1.05. This matches almost exactly the compression ratio calculated for a single file of the mlo source data.
  • 📝 Any attributes listed in this google sheet have been added to the variables.
  • 🔢 As shown in the Details section below, chunksize is (2, 60, 21600) for the time, lev, and ncol dimensions, respectively. This means ~120 MB per chunk for mlo.
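The size figures above can be sanity-checked with some quick arithmetic (a rough sketch; the 61 MB per-file and 13.37 TB numbers are the approximate values quoted in this thread):

```python
# Back-of-envelope check of the mlo dataset-size figures quoted above.
n_timesteps = 210240               # expected number of time steps
mb_per_source_file = 61            # approx. on-disk size of one mlo NetCDF file

on_disk_tb = n_timesteps * mb_per_source_file / 1e6   # MB -> TB
uncompressed_tb = 13.369           # from ds.nbytes / 1e12 in the snippet below

print(round(on_disk_tb, 2))                        # ~12.82 TB on disk
print(round(uncompressed_tb / on_disk_tb, 3))      # compression ratio, just under 1.05
```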

The mlo (prelim/preview only, no guarantees yet! 😄 ) dataset can be loaded as follows:

import xarray as xr
path = "gs://leap-persistent-ro/data-library/climsim-highres-mlo-595733423-5882522942-1/climsim-highres-mlo.zarr"
ds = xr.open_dataset(path, engine="zarr", chunks={})  # requires `gcsfs`, takes ~4 mins on my laptop
ds.nbytes / 1e12  # -> 13.36924905984 TB
len(ds.time)  # -> 210240
ds.state_t.attrs  # -> {'long_name': 'Air temperature', 'units': 'K'}
ds
<xarray.Dataset>
Dimensions:         (time: 210240, ncol: 21600, lev: 60)
Coordinates:
  * time            (time) object 0001-02-01 00:00:00 ... 0009-01-31 23:40:00
Dimensions without coordinates: ncol, lev
Data variables: (12/16)
    cam_out_FLWDS   (time, ncol) float64 dask.array<chunksize=(2, 21600), meta=np.ndarray>
    cam_out_NETSW   (time, ncol) float64 dask.array<chunksize=(2, 21600), meta=np.ndarray>
    cam_out_PRECC   (time, ncol) float64 dask.array<chunksize=(2, 21600), meta=np.ndarray>
    cam_out_PRECSC  (time, ncol) float64 dask.array<chunksize=(2, 21600), meta=np.ndarray>
    cam_out_SOLL    (time, ncol) float64 dask.array<chunksize=(2, 21600), meta=np.ndarray>
    cam_out_SOLLD   (time, ncol) float64 dask.array<chunksize=(2, 21600), meta=np.ndarray>
    ...              ...
    state_q0003     (time, lev, ncol) float64 dask.array<chunksize=(2, 60, 21600), meta=np.ndarray>
    state_t         (time, lev, ncol) float64 dask.array<chunksize=(2, 60, 21600), meta=np.ndarray>
    state_u         (time, lev, ncol) float64 dask.array<chunksize=(2, 60, 21600), meta=np.ndarray>
    state_v         (time, lev, ncol) float64 dask.array<chunksize=(2, 60, 21600), meta=np.ndarray>
    tod             (time) int32 dask.array<chunksize=(2,), meta=np.ndarray>
    ymd             (time) int32 dask.array<chunksize=(2,), meta=np.ndarray>
Attributes:
    calendar:  NO_LEAP
    fv_nphys:  2
    ne:        30
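One practical upshot of the (2, 60, 21600) chunking: reading a short time window only touches the chunks it overlaps, so small subsets stay cheap even against the full ~13 TB store. A minimal sketch of that access pattern, using a tiny synthetic stand-in dataset with the same dimension layout (the real store is at the gs:// path above and requires `gcsfs`):

```python
import numpy as np
import xarray as xr

# Tiny synthetic stand-in with the mlo dimension layout (time, lev, ncol);
# the real data lives at the gs:// zarr path shown above.
ds = xr.Dataset(
    {"state_t": (("time", "lev", "ncol"), np.zeros((8, 60, 100)))}
).chunk({"time": 2})  # mirrors the time chunking of the zarr store

# Selecting a narrow time window is lazy and overlaps only a few chunks,
# so against the real store this reads a small fraction of the data.
subset = ds.state_t.isel(time=slice(0, 2), lev=0)
print(subset.shape)  # (2, 100)
arr = subset.load()  # data is actually read here, not before
```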
