Leverage pyessv #1

Open
huard opened this issue Jan 22, 2020 · 4 comments
@huard
huard commented Jan 22, 2020

I'm guessing catalog building and vocabulary validation could leverage https://github.com/ES-DOC/pyessv. See in particular parsers such as https://github.com/ES-DOC/pyessv/blob/master/pyessv/parsers/cmip6_dataset_id.py

I have no experience with it myself, so maybe not a good fit.

@sherimickelson
Contributor

@huard it looks like intake-esm-datastore has similar parsing methods. The main difference is that pyessv taps into the controlled vocabulary to check for valid values.

@andersy005, here's a working example that uses pyessv based on an example notebook they provide:

import pyessv

# Set template.
template = '/glade/collections/cmip/CMIP6/{}/{}/{}/{}/{}/{}/{}/{}/tas_Amon_CESM2_historical_r1i1p1f1_gn_185001-201412.nc'

# Set separator.
separator = '/'

# Set collections.
collections = (
    'wcrp:cmip6:activity-id',
    'wcrp:cmip6:institution-id',
    'wcrp:cmip6:source-id',
    'wcrp:cmip6:experiment-id',
    'wcrp:cmip6:member-id',
    'wcrp:cmip6:table-id',
    'wcrp:cmip6:variable-id',
    'wcrp:cmip6:grid-label'
    )

# Set parsing strictness = 1 (raw-name).
strictness = pyessv.PARSING_STRICTNESS_1

# Create parser.
parser = pyessv.create_template_parser(template, collections, strictness, separator)

# Parsing
print(parser.parse('/glade/collections/cmip/CMIP6/CMIP/NCAR/CESM2/historical/r1i1p1f1/Amon/tas/gn/tas_Amon_CESM2_historical_r1i1p1f1_gn_185001-201412.nc'))

The code returns:

{wcrp:cmip6:activity-id:cmip,
 wcrp:cmip6:experiment-id:historical,
 wcrp:cmip6:grid-label:gn,
 wcrp:cmip6:institution-id:ncar,
 wcrp:cmip6:source-id:cesm2,
 wcrp:cmip6:table-id:amon}

If you change Amon in the path to Gmon, it will let you know that this isn't a valid value. It will also require extra code to get the member_id and version in the directory structure working, but I figure this would give us enough of a base for discussion.
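To make the discussion concrete, here is a minimal stand-alone sketch of the same idea without pyessv: enumerated facets are looked up in a vocabulary, while member_id and version are checked by pattern. Everything in it is an assumption for illustration: the tiny `VOCAB`, the `PATTERNS`, the `FACETS` order, the `parse_path` helper, and the `v20190308` version directory are all hypothetical, and this is not how pyessv is implemented.

```python
import re

# Hypothetical, heavily trimmed vocabulary; the real CMIP6 controlled
# vocabulary is far larger and is what pyessv consults.
VOCAB = {
    "activity_id": {"CMIP", "ScenarioMIP"},
    "institution_id": {"NCAR"},
    "source_id": {"CESM2"},
    "experiment_id": {"historical", "ssp585"},
    "table_id": {"Amon", "Omon", "day"},
    "variable_id": {"tas", "pr"},
    "grid_label": {"gn", "gr"},
}

# Open-ended facets are checked by pattern instead of by lookup.
# Simplified assumption: sub-experiment prefixes on member_id are ignored.
PATTERNS = {
    "member_id": re.compile(r"r\d+i\d+p\d+f\d+"),
    "version": re.compile(r"v\d{8}"),
}

# Assumed order of the facet directories below the CMIP6 root.
FACETS = (
    "activity_id", "institution_id", "source_id", "experiment_id",
    "member_id", "table_id", "variable_id", "grid_label", "version",
)


def parse_path(path, root="/glade/collections/cmip/CMIP6"):
    """Split a file path into facets and reject values that fail validation."""
    if not path.startswith(root + "/"):
        raise ValueError(f"path is not under {root}")
    parts = path[len(root) + 1:].split("/")
    # Expect one directory per facet plus the file name.
    if len(parts) != len(FACETS) + 1:
        raise ValueError(
            f"expected {len(FACETS)} facet directories, got {len(parts) - 1}"
        )
    parsed = {}
    for facet, value in zip(FACETS, parts):
        if facet in PATTERNS:
            valid = PATTERNS[facet].fullmatch(value) is not None
        else:
            valid = value in VOCAB[facet]
        if not valid:
            raise ValueError(f"invalid {facet}: {value!r}")
        parsed[facet] = value
    return parsed


# The version directory here is made up for the example.
good = ("/glade/collections/cmip/CMIP6/CMIP/NCAR/CESM2/historical/r1i1p1f1/"
        "Amon/tas/gn/v20190308/"
        "tas_Amon_CESM2_historical_r1i1p1f1_gn_185001-201412.nc")
print(parse_path(good))

# Swapping the table directory for an unknown value is rejected.
try:
    parse_path(good.replace("/Amon/", "/Gmon/"))
except ValueError as err:
    print(err)
```

The point of the sketch is only that lookup-based and pattern-based checks can sit side by side; the actual vocabulary and any member_id or version rules would have to come from pyessv or the CMIP6 controlled vocabulary itself.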

@andersy005 do you see any extra benefits for using this library to help with the parsing?

@andersy005
Member

> If you change Amon in the path to Gmon, it will let you know that this isn't a valid value. It will also require extra code to get the member_id and version in the directory structure working, but I figure this would give us enough of a base for discussion.

👍 I like the fact that you can enforce the validity of the parsed attributes.

> @andersy005 do you see any extra benefits for using this library to help with the parsing?

Our current parsers are fragile because they do not verify that the parsed attributes are valid. I think taking advantage of pyessv's features would make it easy to guarantee that the built catalogs contain only attributes that comply with the controlled vocabularies.

sherimickelson self-assigned this Apr 27, 2020
@andersy005
Member

I am transferring this issue to a new repo: https://github.com/NCAR/ecg

andersy005 transferred this issue from NCAR/intake-esm-datastore Jun 1, 2020
@sherimickelson
Contributor

@andersy005 Since you've started work on reading this information from the CMIP6 files themselves instead of from the file paths, I'm wondering whether this is still needed.

File paths change, and that is where pyessv helps: it verifies the file path assumptions that were built in. File attributes take more effort to change and are more likely to be correct. Before a file can be added to ESGF, all of these attributes are checked against the same controlled vocabulary that pyessv uses. If a file attribute is inconsistent with the controlled vocabulary, or if one of these attributes is missing, the file cannot be published to ESGF.

I feel confident proceeding with harvesting the attributes from the files themselves and not adding the extra check. What do you think?
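For reference, a rough sketch of what harvesting from the files could look like. A plain dict stands in for a file's global attributes (with xarray these would come from `Dataset.attrs`), so nothing here touches a real file. The `REQUIRED` list, the `harvest` and `path_mismatches` helpers, and the sample values are assumptions made up for illustration, not existing code in this repo.

```python
# Global attributes assumed to be present in a published CMIP6 file.
REQUIRED = (
    "activity_id", "institution_id", "source_id", "experiment_id",
    "variant_label", "table_id", "variable_id", "grid_label",
)


def harvest(attrs, required=REQUIRED):
    """Return the catalog facets from a mapping of global attributes."""
    missing = [name for name in required if name not in attrs]
    if missing:
        raise ValueError("missing global attributes: " + ", ".join(missing))
    return {name: str(attrs[name]) for name in required}


def path_mismatches(path_facets, file_facets):
    """Optional cross-check: facets where the path disagrees with the file."""
    return {
        name: (value, file_facets[name])
        for name, value in path_facets.items()
        if name in file_facets and file_facets[name] != value
    }


# Stand-in for the global attributes of one file (values are illustrative).
file_attrs = {
    "activity_id": "CMIP",
    "institution_id": "NCAR",
    "source_id": "CESM2",
    "experiment_id": "historical",
    "variant_label": "r1i1p1f1",
    "table_id": "Amon",
    "variable_id": "tas",
    "grid_label": "gn",
    "frequency": "mon",  # extra attributes are simply ignored
}

facets = harvest(file_attrs)
print(facets)

# A path that was parsed with a wrong table directory shows up as a mismatch.
print(path_mismatches({"table_id": "Gmon", "source_id": "CESM2"}, facets))
```

If the file attributes are trusted, only `harvest` is needed; `path_mismatches` is there in case a cheap sanity check against the directory layout turns out to be useful.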
