This repository has been archived by the owner on Nov 21, 2023. It is now read-only.

Proposed addition to the GCS bucket, gs://cmip6/ #12

Open
naomi-henderson opened this issue Jan 22, 2021 · 7 comments


@naomi-henderson
Contributor

I propose to add any new versions of datasets to our existing Google Cloud zarr bucket under a new prefix for our CMIP6 collection, gs://cmip6/CMIP6, which solves multiple problems:

  • This follows the naming of the other CMIP collections, gs://cmip6/CMIP5 and gs://cmip6/CMIP3

  • This allows us to start using the version in the object names, for example:

     gs://cmip6/AerChemMIP/AS-RCEC/TaiESM1/histSST/r1i1p1f1/AERmon/ps/gn, version 20200310

    would now be stored in:

     gs://cmip6/CMIP6/AerChemMIP/AS-RCEC/TaiESM1/histSST/r1i1p1f1/AERmon/ps/gn/v20200310

  • The csv catalog files will be generated the same as before, but with the URL pointing to the latest version only. The novice user will not be required to find the latest version on their own.

  • The old versions will not need to be removed, so there will be no glitches or misunderstandings as the new versions are added, and the existing datasets will stay where they are.

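
As a rough illustration of the renaming described above, here is a minimal Python sketch that maps an old-style store URL plus its version to the proposed layout. The function name `versioned_url` and the constant `BUCKET` are made up for this example and are not part of any existing tooling; the only rule encoded is the one in the example above (insert `CMIP6/` after the bucket and append `v` plus the version).

```python
BUCKET = "gs://cmip6/"  # bucket root, as used throughout this thread


def versioned_url(old_url: str, version: str) -> str:
    """Map an old-style store URL and its version to the proposed layout.

    Hypothetical helper: it only encodes the example given in this issue.
    """
    if not old_url.startswith(BUCKET):
        raise ValueError(f"not a {BUCKET} URL: {old_url}")
    # Everything after the bucket root is the dataset path.
    dataset_path = old_url[len(BUCKET):].strip("/")
    # New layout: bucket / CMIP6 / dataset path / v<version>
    return f"{BUCKET}CMIP6/{dataset_path}/v{version}"


old = "gs://cmip6/AerChemMIP/AS-RCEC/TaiESM1/histSST/r1i1p1f1/AERmon/ps/gn"
print(versioned_url(old, "20200310"))
# gs://cmip6/CMIP6/AerChemMIP/AS-RCEC/TaiESM1/histSST/r1i1p1f1/AERmon/ps/gn/v20200310
```
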
@rabernat
Contributor

> I propose to add any new versions of datasets to our existing Google Cloud zarr bucket by adding a new prefix to our CMIP6 collection - gs://cmip6/CMIP6

👍 from me!

@rabernat
Contributor

I would also not be opposed to moving all the older data to the new prefix as well.

@naomi-henderson
Contributor Author

Yes, that would be very easy, since we can copy first, change the csv catalog, and THEN remove the old copy.
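
The ordering matters because the catalog should never point at a store that no longer exists. A minimal sketch of that sequence, using plain dictionaries as stand-ins for the bucket and the catalog (the helper `move_store`, the `dataset_id` key, and the placeholder contents are all assumptions for illustration; a real migration would use a GCS client instead):

```python
def move_store(objects, catalog, dataset_id, old_url, new_url):
    """Copy, repoint the catalog, then delete, in that order."""
    # 1. Copy first: the old store stays readable the whole time.
    objects[new_url] = objects[old_url]
    # 2. Repoint the catalog entry at the new location.
    catalog[dataset_id] = new_url
    # 3. Only now remove the old copy.
    del objects[old_url]


old = "gs://cmip6/AerChemMIP/AS-RCEC/TaiESM1/histSST/r1i1p1f1/AERmon/ps/gn"
new = "gs://cmip6/CMIP6/AerChemMIP/AS-RCEC/TaiESM1/histSST/r1i1p1f1/AERmon/ps/gn/v20200310"

# In-memory stand-ins; the bytes are a placeholder, not real zarr data.
objects = {old: b"placeholder store contents"}
catalog = {"example-dataset": old}

move_store(objects, catalog, "example-dataset", old, new)
```

At every point in this sequence the URL held in the catalog resolves to an existing store, which is why the move can happen without breaking catalog users.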

@naomi-henderson
Contributor Author

@charlesbluca, this would require a revision to your rclone GitHub Actions workflow that syncs from GCS to AWS. As you have currently set it up, this new prefix would get added to the 'other' category. That would be fine for now. We can rethink the division into the 20 jobs when it starts bogging down - okay?

@charlesbluca
Member

Sounds good to me! For now I'll continue monitoring the workflow and see how the job handling gs://cmip6/CMIP6 performs. If we need to change it around, it shouldn't be too difficult.

Let me know if you decide on copying the older data to the new prefix, as this would probably allow us to continue using the current workflow (as long as we change the affected directory names).

@naomi-henderson
Contributor Author

Thanks @charlesbluca, I will certainly let you know if I start moving lots of data!

@naomi-henderson
Contributor Author

I am now in the process of making these changes. Since all of our zarr datasets must be copied and then deleted, it will take a few months to complete. In the meantime, the URLs in the new csv catalog files are a mixture of the old and new naming conventions, which should not cause trouble for anyone who uses the catalog to find the data.

However, if you find your datasets by assuming a URL naming structure like:

gs://cmip6/[activity_id]/[institution_id]/.../[variable_id]/[grid_label]

then these URLs are all being replaced by:

gs://cmip6/CMIP6/[activity_id]/[institution_id]/.../[variable_id]/[grid_label]/[version_id]
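
For code that has to cope with both conventions during the transition, one possible approach is to test for the new prefix and treat the last path component as the version. This is only a sketch: `split_store_url` and `NEW_PREFIX` are hypothetical names, and the rule it applies is just the pattern shown above.

```python
NEW_PREFIX = "gs://cmip6/CMIP6/"  # new-style stores live under this prefix


def split_store_url(url: str):
    """Return (dataset_url, version_id).

    version_id is None for old-style URLs, which carry no version
    component. Hypothetical helper for illustration only.
    """
    url = url.rstrip("/")
    if url.startswith(NEW_PREFIX):
        # New convention: the final path component is [version_id].
        dataset_url, version_id = url.rsplit("/", 1)
        return dataset_url, version_id
    return url, None


print(split_store_url(
    "gs://cmip6/CMIP6/AerChemMIP/AS-RCEC/TaiESM1/histSST/r1i1p1f1/AERmon/ps/gn/v20200310"
))
print(split_store_url(
    "gs://cmip6/AerChemMIP/AS-RCEC/TaiESM1/histSST/r1i1p1f1/AERmon/ps/gn"
))
```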

I have been doing some preliminary testing, including an intake-esm test. Hopefully @jbusecke will get a chance to test cmip6_preprocessing to make sure it will not cause too much trouble there.

The new catalogs are temporarily called pangeo-cmip6-testing.csv and pangeo-cmip6-testing.json, and the "no Quality Control" catalog will now be called pangeo-cmip6-noQC.csv. I will soon overwrite the existing catalogs by removing the -testing from the names. The original gs://cmip6/cmip6-zarr-consolidated-stores.csv and gs://cmip6/cmip6-zarr-consolidated-stores-noQC.csv catalogs will contain all of the stores that are still in the original naming scheme, but will get smaller and smaller as the stores are copied and deleted. This will phase out these redundant catalog names.
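
Looking stores up in the catalog, rather than constructing URLs, keeps working whichever naming convention a given store currently uses. A small sketch with the standard `csv` module on an in-memory sample: the column names (including `zstore` for the store URL) and the helper `find_stores` are assumptions made for this example, and the single sample row reuses the dataset quoted earlier in this issue.

```python
import csv
import io

# Illustrative one-row catalog; the column names are assumed, not confirmed here.
SAMPLE_CATALOG = """\
activity_id,institution_id,source_id,experiment_id,member_id,table_id,variable_id,grid_label,zstore,version
AerChemMIP,AS-RCEC,TaiESM1,histSST,r1i1p1f1,AERmon,ps,gn,gs://cmip6/CMIP6/AerChemMIP/AS-RCEC/TaiESM1/histSST/r1i1p1f1/AERmon/ps/gn/v20200310,20200310
"""


def find_stores(catalog_text, **facets):
    """Return the store URLs of all rows whose columns match the given facets."""
    rows = csv.DictReader(io.StringIO(catalog_text))
    return [
        row["zstore"]
        for row in rows
        if all(row.get(key) == value for key, value in facets.items())
    ]


print(find_stores(SAMPLE_CATALOG, source_id="TaiESM1", variable_id="ps"))
```

The same filter would apply to the real csv once downloaded, provided its columns match the assumed names.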

Apologies for all of the confusion, but this will clean up many discrepancies and simplify our 'CMIP' objects in the precious gs://cmip6 bucket to the following:

gs://cmip6/pangeo-cmip3.csv
gs://cmip6/pangeo-cmip3.json
gs://cmip6/pangeo-cmip5.csv
gs://cmip6/pangeo-cmip5.json
gs://cmip6/pangeo-cmip6.csv
gs://cmip6/pangeo-cmip6-noQC.csv
gs://cmip6/pangeo-cmip6.json
gs://cmip6/CMIP3/
gs://cmip6/CMIP5/
gs://cmip6/CMIP6/
