Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add core functionality for parsing attributes from an open file #3

Merged
merged 21 commits into from
Jun 24, 2020
Merged

Add core functionality for parsing attributes from an open file #3

merged 21 commits into from
Jun 24, 2020

Conversation

andersy005
Copy link
Member

This PR adds a Builder class. This class implements most of the functionality needed to parse attributes from an open file/xarray dataset. The missing feature consists of parsing attributes from a filename/filepath template + global attributes i.e. @sherimickelson's work in NCAR/intake-esm-datastore#70.

@sherimickelson, I added sample files to this repo. I am hoping that we can use them to test your work in NCAR/intake-esm-datastore#70 once it's migrated here.

@andersy005
Copy link
Member Author

Here's a demo notebook

ecgtools/core.py Show resolved Hide resolved
ecgtools/parsers.py Outdated Show resolved Hide resolved
@andersy005 andersy005 requested a review from kmpaul June 8, 2020 18:07
@andersy005
Copy link
Member Author

Note: the latest commit includes the necessary code for creating both the CSV and JSON files.

Copy link
Contributor

@sherimickelson sherimickelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good

Copy link

@kmpaul kmpaul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andersy005: I think this is coming together, but I think the public API is still a bit confusing. I am welcome to arguments against this, but it seems to me that the most useful form (from the user's perspective...or mine) of what you are doing would be reflected in the following public API:

  • Builder.__init__(): The result of constructing the Builder should be a valid object that knows (as well as possible) that it will run without error.

  • Builder.build(): The result of the Builder.build method should be the construction of the Builder.df and (possibly) Builder.dict (?) attributes, described below:

  • Builder.df: the pre-state of the CSV file, and

  • Builder.dict: the pre-state of the JSON file. Perhaps this should be named something else?

  • Builder.save(): Write the df and dict attributes to files. Perhaps this should be named something else, too?

Does that cover the user behavior you are aiming for?

ecgtools/core.py Outdated Show resolved Hide resolved
ecgtools/core.py Outdated Show resolved Hide resolved
ecgtools/core.py Outdated Show resolved Hide resolved
ecgtools/core.py Outdated Show resolved Hide resolved
@andersy005
Copy link
Member Author

andersy005 commented Jun 11, 2020

@kmpaul, I like the simplicity of the public API suggested above, but I do have a few questions:

Builder.init(): The result of constructing the Builder should be a valid object that knows (as well as possible) that it will run without error.

Could you expand on "it will run without error"? What errors should we check for? I am asking because I think this can be very tricky. For example, if the user provides a root_path, do we check whether they have the right permissions to crawl the sub-directories during the class instantiation step?

Builder.build(): The result of the Builder.build method should be the construction of the Builder.df and (possibly) Builder.dict (?) attributes, described below:

Does this imply that all necessary arguments are handled in __init__()?

Does that cover the user behavior you are aiming for?

One of the things I was aiming for was a Builder object that behaves as a pipeline i.e. one that makes it easy to replace intermediate steps if need be. For example, when building a large catalog, for instance CMIP6 catalog, you may want to run the pipeline end-to-end for the first time only. For the subsequent runs, you would only collect the list of files available, compare this list to the list of files in the existing catalog, and only parse attributes of files that don't appear in the catalog previously built. My understanding of the public API you suggested above is that all steps are tightly coupled, making it a little tricky to replace intermediate steps when necessary. I could be missing something though.

@kmpaul
Copy link

kmpaul commented Jun 11, 2020

@kmpaul, I like the simplicity of the public API suggested above, but I do have a few questions:

Builder.__init__(): The result of constructing the Builder should be a valid object that knows (as well as possible) that it will run without error.

Could you expand on "it will run without error"? What errors should we check for? I am asking because I think this can be very tricky. For example, if the user provides a root_path, do we check whether they have the right permissions to crawl the sub-directories during the class instantiation step?

That's one idea, or you can just check to make sure all of the necessary inputs to the __init__() constructor are valid. If a root_path is given, then it makes sense to make sure that path is valid and you have access to it, but maybe you don't have to crawl the entire tree. Just make sure the __init__() inputs are valid. If the validity checks are too complex, then you should just throw an exception upon error when the operation runs.

Then you need validation unit tests to match.

Builder.build(): The result of the Builder.build method should be the construction of the Builder.df and (possibly) Builder.dict (?) attributes, described below:

Does this imply that all necessary arguments are handled in __init__()?

Not necessarily. The build() method could take arguments of its own, as long as it makes sense for them to be supplied to build() instead of __init__(), right? From my understanding, the build function is really just the parse_files_attributes method and the current create_catalog method, combined. So, as a first pass, you could make the parse_files_attributes and create_catalog methods private and called in sequence from a build method. Then do a little cleanup.

Does that cover the user behavior you are aiming for?

One of the things I was aiming for was a Builder object that behaves as a pipeline i.e. one that makes it easy to replace intermediate steps if need be. For example, when building a large catalog, for instance CMIP6 catalog, you may want to run the pipeline end-to-end for the first time only. For the subsequent runs, you would only collect the list of files available, compare this list to the list of files in the existing catalog, and only parse attributes of files that don't appear in the catalog previously built. My understanding of the public API you suggested above is that all steps are tightly coupled, making it a little tricky to replace intermediate steps when necessary. I could be missing something though.

Yes. I didn't realize this was a use case, but if that use case is to be assumed (and it should), then you are correct. My pitch is tightly coupled.

So, could you add an update method to the public API that addresses your additional use case? And then modify the private helper functions to allow for the update step?

I am envisioning a 2-option use path:

[init] --> [build] --> [save]
[init] --> [update] --> [save]

Does that work? Is that clearer?

@andersy005
Copy link
Member Author

It makes sense now :) Many thanks for the feedback

@kmpaul
Copy link

kmpaul commented Jun 11, 2020

You're welcome. I hope what I'm suggesting is reasonable. I think that if what you are suggesting is the ability to let users build an entire catalog from scratch and to let users update an existing catalog, then it makes sense to call out those 2 scenarios in the names of the API functions. It will be more clear to the user.

@andersy005 andersy005 requested a review from kmpaul June 12, 2020 15:35
@andersy005
Copy link
Member Author

Gentle ping @kmpaul. When you get a chance, this is ready for another round of review.

Copy link

@kmpaul kmpaul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still feel like there are steps being done here that don't fit the init --> build --> save form. Do you feel like the esmcol_data is not something that other people want to be able to see before saving it to the JSON file?

ecgtools/core.py Outdated
Comment on lines 313 to 351
esmcol_data = OrderedDict()
if esmcat_version is None:
esmcat_version = '0.0.1'

if cat_id is None:
cat_id = ''

if description is None:
description = ''
esmcol_data['esmcat_version'] = esmcat_version
esmcol_data['id'] = cat_id
esmcol_data['description'] = description
esmcol_data['catalog_file'] = catalog_file
if attributes is None:
attributes = []
for column in self.df.columns:
attributes.append({'column_name': column, 'vocabulary': ''})

esmcol_data['attributes'] = attributes
esmcol_data['assets'] = {'column_name': path_column}
if (data_format is not None) and (format_column is not None):
raise ValueError('data_format and format_column are mutually exclusive')

if data_format is not None:
esmcol_data['assets']['format'] = data_format
else:
esmcol_data['assets']['format_column_name'] = format_column
if groupby_attrs is None:
groupby_attrs = [path_column]

if aggregations is None:
aggregations = []
esmcol_data['aggregation_control'] = {
'variable_column_name': variable_column,
'groupby_attrs': groupby_attrs,
'aggregations': aggregations,
}

self.esmcol_data = esmcol_data
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all seems like a build step...?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed it

@andersy005 andersy005 requested a review from kmpaul June 19, 2020 18:48
Copy link

@kmpaul kmpaul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andersy005 I like this. I think this is a good public API. I'm happy seeing this merged into the master branch.

@andersy005
Copy link
Member Author

@andersy005 I like this. I think this is a good public API. I'm happy seeing this merged into the master branch.

Hooray 🎉 @kmpaul & @sherimickelson thank you for your feedback!

@andersy005 andersy005 merged commit e5108a4 into ncar-xdev:master Jun 24, 2020
@andersy005 andersy005 deleted the core-functionality branch June 24, 2020 10:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants