[REQUEST]: HighResMIP HadGEM vars for ocean sound speed calculation #72

Closed
rsignell opened this issue Nov 26, 2023 · 19 comments
Labels: cant find urls, in progress, request (Requests for new data to be ingested to the cloud)

Comments

@rsignell

rsignell commented Nov 26, 2023

List of requested iids

'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.highres-future.r1i1p1f1.Omon.so.gn.v20200514',
 'CMIP6.HighResMIP.NERC.HadGEM3-GC31-HH.hist-1950.r1i1p1f1.Omon.so.gn.v20200514',
 'CMIP6.HighResMIP.NERC.HadGEM3-GC31-HH.hist-1950.r1i1p1f1.Omon.thetao.gn.v20200514',
 'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.highres-future.r1i1p1f1.Omon.thetao.gn.v20200514'

Description

We are working on climate-change impacts on ocean sound speed, and this dataset is believed to be the best suited for that purpose. It amounts to roughly 3 TB of data files.
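For context, a minimal sketch of the intended sound-speed calculation from so and thetao using the GSW (TEOS-10) Python package; the coordinate names (lev, latitude, longitude) are assumptions about the HadGEM3-GC31-HH Omon grid, not verified against the files:

import gsw
import xarray as xr

def _sound_speed(so, thetao, depth, lat, lon):
    # plain-numpy core: TEOS-10 conversions, then speed of sound [m/s]
    p = gsw.p_from_z(-depth, lat)            # sea pressure [dbar] from depth [m]
    sa = gsw.SA_from_SP(so, p, lon, lat)     # absolute from practical salinity
    ct = gsw.CT_from_pt(sa, thetao)          # conservative from potential temperature
    return gsw.sound_speed(sa, ct, p)

def sound_speed(ds):
    # broadcast by dimension name so the 1D lev and 2D latitude/longitude align with so/thetao
    return xr.apply_ufunc(
        _sound_speed,
        ds['so'], ds['thetao'], ds['lev'], ds['latitude'], ds['longitude'],
        dask='parallelized', output_dtypes=[float],
    )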

@rsignell rsignell added the request label (Requests for new data to be ingested to the cloud) on Nov 26, 2023
@rsignell
Author

ping @gzt5142

@jbusecke
Collaborator

Ok, I merged #73, let's see how that goes... fingers crossed

@rsignell
Author

rsignell commented Nov 29, 2023

When we processed these with Kerchunk, we treated them as two datasets: highres-future and hist-1950. Each had two variables, so and thetao. For each dataset we used Kerchunk to combine each variable along the time dimension and then merge_vars to create a single reference JSON per dataset; a rough sketch is below.
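Roughly, the Kerchunk step looked like this (a sketch, not the exact notebook; the file lists and output path are placeholders):

import json
from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr, merge_vars

def combine_variable(netcdf_urls):
    # one reference set per source file, then concatenate along time
    refs = [SingleHdf5ToZarr(u, inline_threshold=300).translate() for u in netcdf_urls]
    return MultiZarrToZarr(refs, concat_dims=['time']).translate()

# per dataset (e.g. hist-1950): combine each variable along time, then merge the variables
so_refs = combine_variable(so_files)          # so_files / thetao_files are placeholder lists of netCDF urls
thetao_refs = combine_variable(thetao_files)
merged = merge_vars([so_refs, thetao_refs])

with open('hist-1950_combined.json', 'w') as f:
    json.dump(merged, f)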

Should we have split the issue into two different issues, with:

'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.highres-future.r1i1p1f1.Omon.so.gn.v20200514',
 'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.highres-future.r1i1p1f1.Omon.thetao.gn.v20200514'

in one and

 'CMIP6.HighResMIP.NERC.HadGEM3-GC31-HH.hist-1950.r1i1p1f1.Omon.so.gn.v20200514',
 'CMIP6.HighResMIP.NERC.HadGEM3-GC31-HH.hist-1950.r1i1p1f1.Omon.thetao.gn.v20200514',

in the other?

@jbusecke
Collaborator

@rsignell no, that should not be an issue. We process every dataset separately anyway. There is something else going on; I will look into it after the talk 😆

@cisaacstern
Contributor

The "regular hypercube" error appears to be the same issue discussed in pangeo-forge/pangeo-forge-recipes#520. If so, this would be a corner-case bug related to certain chunking scenarios. As documented on the linked issue, the next step there is for @tom-the-hill to open a PR with a minimal failing test as a reproducer.

@rsignell
Author

rsignell commented Jan 5, 2024

@cisaacstern or @jbusecke Any updates here?

@jbusecke
Collaborator

Unfortunately we are still blocked by the above bug in PGF-recipes. The good news is that once that is fixed we might be able to get the data straight into the public bucket! Is there a concrete deadline?

@rsignell
Author

Nope. No deadline. I was just curious about the status. Thanks!

@jbusecke
Collaborator

Still working on this @rsignell. It seems like there was a bug in my legacy notebook and the current versions of your requested datasets only have 2 timesteps (#76 describes the culprit), but I just resubmitted them manually (#98) and they are currently running...

@jbusecke
Collaborator

Ok, I think we have temporarily removed the main roadblock here (gnarly, gnarly stuff really).
So far it seems we were not getting any URLs for your iids from the API, but I'll retry frequently. You can check the progress like this:

def zstore_to_iid(zstore: str):
    # a bit wacky, to account for the different layouts of old/new stores
    return '.'.join(zstore.replace('gs://', '').replace('.zarr', '').replace('.', '/').split('/')[-11:-1])

iids_requested = [
'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.highres-future.r1i1p1f1.Omon.so.gn.v20200514',
 'CMIP6.HighResMIP.NERC.HadGEM3-GC31-HH.hist-1950.r1i1p1f1.Omon.so.gn.v20200514',
 'CMIP6.HighResMIP.NERC.HadGEM3-GC31-HH.hist-1950.r1i1p1f1.Omon.thetao.gn.v20200514',
 'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.highres-future.r1i1p1f1.Omon.thetao.gn.v20200514',
]

import intake
# uncomment/comment lines to swap catalogs
url = "https://storage.googleapis.com/cmip6/cmip6-pgf-ingestion-test/catalog/catalog.json"
col = intake.open_esm_datastore(url)

iids_all = [zstore_to_iid(z) for z in col.df['zstore'].tolist()]
iids_uploaded = [iid for iid in iids_all if iid in iids_requested]
iids_uploaded

Since we got some of the URLs during #73, I am currently assuming this will resolve itself with time, but there might be bugs in either pangeo-forge-esgf or the ESGF API itself that prevent all of the supposedly available URLs from being returned (I suspect #119 is similar).

Let's keep an eye on this for now, and I'll try to investigate more deeply later.
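If it helps with debugging, here is a sketch of how one might cross-check what the ESGF search API currently reports for one of these iids; the index node URL and facet mapping are assumptions, and pangeo-forge-esgf may do this differently internally:

import requests

iid = 'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.highres-future.r1i1p1f1.Omon.so.gn.v20200514'
(mip_era, activity, institution, source, experiment,
 member, table, variable, grid, version) = iid.split('.')

params = {
    'type': 'File',
    'format': 'application/solr+json',
    'limit': 500,
    'mip_era': mip_era,
    'activity_id': activity,
    'institution_id': institution,
    'source_id': source,
    'experiment_id': experiment,
    'variant_label': member,
    'table_id': table,
    'variable_id': variable,
    'grid_label': grid,
}
# the index node is an assumption; any ESGF node exposing the esg-search API should work
r = requests.get('https://esgf-node.llnl.gov/esg-search/search', params=params)
r.raise_for_status()
docs = r.json()['response']['docs']
urls = [u.split('|')[0] for doc in docs for u in doc['url'] if u.endswith('HTTPServer')]
print(len(urls), 'HTTPServer file urls reported')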

@jbusecke
Collaborator

jbusecke commented May 8, 2024

I just checked, and they are all ingested! Please reopen if there are issues on your end.

@jbusecke jbusecke closed this as completed May 8, 2024
@jbusecke jbusecke reopened this May 9, 2024
@jbusecke
Collaborator

jbusecke commented May 9, 2024

Meeep. That is not looking great...

[screenshot]

I hope I can fix these soon (#76 is relevant).

@rsignell
Author

rsignell commented May 9, 2024

Darn. Very much still interested in this @jbusecke. Thanks for continuing to push!

@jbusecke
Collaborator

jbusecke commented May 9, 2024

I'll get there...

@jbusecke
Collaborator

jbusecke commented May 12, 2024

gs://cmip6/cmip6-pgf-ingestion-test/zarr_stores/9044158586_1/CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.highres-future.r1i1p1f1.Omon.so.gn.v20200514.zarr
[screenshot of the dataset]

gs://cmip6/cmip6-pgf-ingestion-test/zarr_stores/9044158586_1/CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.highres-future.r1i1p1f1.Omon.thetao.gn.v20200514.zarr
[screenshot of the dataset]
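For reference, a minimal sketch for opening one of these test stores with xarray and gcsfs, assuming the bucket allows anonymous reads:

import gcsfs
import xarray as xr

fs = gcsfs.GCSFileSystem(token='anon')  # anonymous read access assumed
store = fs.get_mapper(
    'cmip6/cmip6-pgf-ingestion-test/zarr_stores/9044158586_1/'
    'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.highres-future.r1i1p1f1.Omon.so.gn.v20200514.zarr'
)
ds = xr.open_zarr(store, consolidated=True)
print(ds)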

@rsignell
Author

Check this out @andreall!!

@jbusecke
Collaborator

Just as a heads-up, these jobs are absolutely massive, and for now I have to babysit them manually. For whatever reason the other experiment_id seemed unavailable (on the ESGF side) at the time I ran these, so please feel free to ping me in a few days.

Eventually I think we will be able to handle such large datasets better with more efficient downloading upstream.

@rsignell
Author

Yes, the HighResMIP data is massive, and this is only two variables!
I'm happy we have contributed a nice stress test. 😸

@rsignell rsignell closed this as completed Jun 5, 2024