Proposed addition to the GCS bucket, gs://cmip6/ #12

naomi-henderson · 2021-01-22T16:36:53Z

I propose to add any new versions of datasets to our existing Google Cloud zarr bucket by adding a new prefix to our CMIP6 collection - gs://cmip6/CMIP6 - which solves multiple problems.

This follows the naming of the other CMIP collections, gs://cmip6/CMIP5 and gs://cmip6/CMIP3

This allows us to start using the version in the object names, for example:

 gs://cmip6/AerChemMIP/AS-RCEC/TaiESM1/histSST/r1i1p1f1/AERmon/ps/gn, version 20200310

would now be stored in:

   gs://cmip6/CMIP6/AerChemMIP/AS-RCEC/TaiESM1/histSST/r1i1p1f1/AERmon/ps/gn/v20200310
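As a rough illustration of the mapping above, here is a minimal Python sketch. The function name `versioned_store_path` and its parameters are hypothetical and not part of any existing tooling; it only assumes that the new location is the old path with `CMIP6/` inserted after the bucket and `v<version>` appended.

```python
def versioned_store_path(old_url, version, bucket="gs://cmip6", collection="CMIP6"):
    """Map an old-style store URL plus a version date to the proposed layout.

    Hypothetical helper: inserts the collection prefix after the bucket
    and appends the version as a final 'v<version>' path component.
    """
    prefix = bucket.rstrip("/") + "/"
    if not old_url.startswith(prefix):
        raise ValueError(f"{old_url!r} is not inside {bucket}")
    # Path of the store relative to the bucket root, without stray slashes.
    relative = old_url[len(prefix):].strip("/")
    return f"{prefix}{collection}/{relative}/v{version}"


old = "gs://cmip6/AerChemMIP/AS-RCEC/TaiESM1/histSST/r1i1p1f1/AERmon/ps/gn"
print(versioned_store_path(old, "20200310"))
# gs://cmip6/CMIP6/AerChemMIP/AS-RCEC/TaiESM1/histSST/r1i1p1f1/AERmon/ps/gn/v20200310
```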

The csv catalog files will be generated the same way as before, but with the URL pointing to the latest version only, so novice users will not need to find the latest version on their own.
The old versions will not need to be removed, so there will be no glitches or misunderstandings as new versions are added, and the existing datasets will stay where they are.
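To make the "latest version only" idea concrete, here is a small sketch of how a catalog could keep one URL per dataset. It is not the actual catalog-generation code; the `latest_only` function and the row layout are assumptions for illustration, and the second version date in the example is made up.

```python
def latest_only(rows):
    """Keep one URL per dataset: the one with the most recent version.

    Each row is a (dataset_path, version, url) tuple. Versions are assumed
    to be YYYYMMDD strings, so plain string comparison orders them by date.
    """
    latest = {}
    for dataset, version, url in rows:
        if dataset not in latest or version > latest[dataset][0]:
            latest[dataset] = (version, url)
    return {dataset: url for dataset, (version, url) in latest.items()}


ds = "AerChemMIP/AS-RCEC/TaiESM1/histSST/r1i1p1f1/AERmon/ps/gn"
rows = [
    (ds, "20200310", f"gs://cmip6/CMIP6/{ds}/v20200310"),
    # Hypothetical later version, invented purely for this example.
    (ds, "20210101", f"gs://cmip6/CMIP6/{ds}/v20210101"),
]
catalog = latest_only(rows)
print(catalog[ds])
```

Both versions stay in the bucket; only the catalog entry changes when a newer version arrives.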

rabernat · 2021-01-22T16:53:17Z

I propose to add any new versions of datasets to our existing Google Cloud zarr bucket by adding a new prefix to our CMIP6 collection - gs://cmip6/CMIP6

👍 from me!

rabernat · 2021-01-22T16:53:45Z

I would also not be opposed to moving all the older data to the new prefix as well.

naomi-henderson · 2021-01-22T16:58:07Z

Yes, that would be very easy, since we can copy first, change the csv catalog and THEN remove the old copy
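The ordering matters here: if the old copy is removed only after the catalog has been repointed, the catalog never references a missing store. A minimal sketch of that sequence follows, with a plain dict standing in for the bucket and a list standing in for the catalog URLs. Nothing here talks to GCS, and `migrate_store` is a hypothetical name.

```python
def migrate_store(bucket, catalog, old_key, new_key):
    """Move one store so the catalog never points at a missing object."""
    # 1. Copy first: the old and new copies now both exist.
    bucket[new_key] = bucket[old_key]
    # 2. Repoint every catalog entry from the old URL to the new one.
    for i, url in enumerate(catalog):
        if url == old_key:
            catalog[i] = new_key
    # 3. Only now is it safe to remove the old copy.
    del bucket[old_key]


old = "gs://cmip6/AerChemMIP/AS-RCEC/TaiESM1/histSST/r1i1p1f1/AERmon/ps/gn"
new = "gs://cmip6/CMIP6/AerChemMIP/AS-RCEC/TaiESM1/histSST/r1i1p1f1/AERmon/ps/gn/v20200310"
bucket = {old: "zarr store contents"}  # placeholder for the real objects
catalog = [old]

migrate_store(bucket, catalog, old, new)
print(catalog)
print(old in bucket, new in bucket)
```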

naomi-henderson · 2021-01-22T17:12:03Z

@charlesbluca, this would require a revision to your rclone GitHub Actions workflow from GCS to AWS. As you have currently set it up, this new prefix would get added to the 'other' category. That would be fine for now. We can rethink the division into the 20 jobs when it starts bogging down - okay?

charlesbluca · 2021-01-22T18:54:06Z

Sounds good to me! For now I'll continue monitoring the workflow and see how the job handling gs://cmip6/CMIP6 performs. If we need to change it around, it shouldn't be too difficult.

Let me know if you decide on copying the older data to the new prefix, as this would probably allow us to continue using the current workflow (as long as we change the affected directory names).

naomi-henderson · 2021-01-22T18:58:13Z

Thanks @charlesbluca, I will certainly let you know if I start moving lots of data!

naomi-henderson · 2021-01-31T13:40:34Z

I am now in the process of making these changes. Since all of our zarr datasets must be copied and then deleted, it will take a few months to complete. In the meantime, the URLs in the new csv catalog files are a mixture of the old and new naming conventions, which should not cause trouble for anyone who uses the catalog to find the data.

However, if you find your datasets by assuming a URL naming structure like:

gs://cmip6/[activity_id]/[institution_id]/.../[variable_id]/[grid_label]

then these URLs are all being replaced by:

gs://cmip6/CMIP6/[activity_id]/[institution_id]/.../[variable_id]/[grid_label]/[version_id]
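For code that does build or parse URLs directly, one option during the transition is to accept both conventions. The sketch below is only an illustration: `split_store_url` is a hypothetical helper, and it assumes the version component always looks like `v` followed by eight digits, as in v20200310.

```python
import re

# Assumed version format, based on the example v20200310.
VERSION_RE = re.compile(r"^v\d{8}$")


def split_store_url(url, bucket="gs://cmip6"):
    """Split a store URL into (path components, version_id or None).

    Handles both the old layout (no CMIP6 prefix, no version) and the
    new layout (CMIP6 prefix, trailing version component).
    """
    prefix = bucket.rstrip("/") + "/"
    if not url.startswith(prefix):
        raise ValueError(f"{url!r} is not inside {bucket}")
    parts = url[len(prefix):].strip("/").split("/")
    version = None
    if parts and parts[0] == "CMIP6":
        # New layout: drop the collection prefix, then peel off the version.
        parts = parts[1:]
        if parts and VERSION_RE.match(parts[-1]):
            version = parts.pop()
    return parts, version


old = "gs://cmip6/AerChemMIP/AS-RCEC/TaiESM1/histSST/r1i1p1f1/AERmon/ps/gn"
new = "gs://cmip6/CMIP6/AerChemMIP/AS-RCEC/TaiESM1/histSST/r1i1p1f1/AERmon/ps/gn/v20200310"
print(split_store_url(old))
print(split_store_url(new))
```

Both calls return the same list of path components, so downstream code can ignore which convention a given URL uses. Looking the URL up in the catalog remains the safer route.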

I have been doing some preliminary testing, including an intake-esm test. Hopefully @jbusecke will get a chance to test cmip6_preprocessing to make sure it will not cause too much trouble there.

The new catalogs are temporarily called pangeo-cmip6-testing.csv and pangeo-cmip6-testing.json, and the "no Quality Control" catalog will now be called pangeo-cmip6-noQC.csv. I will soon overwrite the existing catalogs by removing the -testing suffix from the names. The original gs://cmip6/cmip6-zarr-consolidated-stores.csv and gs://cmip6/cmip6-zarr-consolidated-stores-noQC.csv catalogs will contain all of the stores in the original naming scheme, but will get smaller and smaller as the stores are copied and deleted. This will phase out these redundant catalog names.

Apologies for all of the confusion, but this will clean up many discrepancies and simplify our 'CMIP' objects in the precious gs://cmip6 bucket to the following:

gs://cmip6/pangeo-cmip3.csv
gs://cmip6/pangeo-cmip3.json
gs://cmip6/pangeo-cmip5.csv
gs://cmip6/pangeo-cmip5.json
gs://cmip6/pangeo-cmip6.csv
gs://cmip6/pangeo-cmip6-noQC.csv
gs://cmip6/pangeo-cmip6.json
gs://cmip6/CMIP3/
gs://cmip6/CMIP5/
gs://cmip6/CMIP6/

naomi-henderson mentioned this issue Jan 29, 2021

The pangeo-cmip6.json url should be updated jbusecke/xMIP#80

Closed
