Improvements to existing functionality #45

Merged
merged 19 commits into from
Mar 19, 2019

andersy005
Member

@andersy005 andersy005 commented Mar 15, 2019

  • Update to_xarray() method to allow users to pass keyword arguments to this method
  • Refactor cmip.py in preparation for cmip6 integration as proposed in CMIP6 version of cmip.py #43
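
The first item, passing keyword arguments through to_xarray(), can be pictured with a minimal sketch. This is not the intake-esm implementation: the Source class, its opener argument, and fake_opener are hypothetical stand-ins, and only the keyword names decode_times and chunks are borrowed from xarray.open_dataset.

```python
class Source:
    """Hypothetical stand-in for a catalog source; not the intake-esm class."""

    def __init__(self, paths, opener):
        self.paths = paths
        self._opener = opener  # in real use this could be xarray.open_dataset

    def to_xarray(self, **kwargs):
        # Forward every keyword argument unchanged to the opener.
        return [self._opener(path, **kwargs) for path in self.paths]


def fake_opener(path, **kwargs):
    # Records what it received so the forwarding is visible.
    return {"path": path, "options": kwargs}


source = Source(["a.nc", "b.nc"], fake_opener)
result = source.to_xarray(decode_times=False, chunks={"time": 12})
```

Each entry in result carries the options exactly as the caller passed them, which is the behavior the bullet describes.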

@andersy005 andersy005 added this to In progress in Backlog via automation Mar 15, 2019
@andersy005 andersy005 added this to the sprint-mar04-mar17 milestone Mar 15, 2019
@andersy005 andersy005 added the "usage question" label (User questions which do not appear to be bugs or enhancements.) Mar 15, 2019
@andersy005 andersy005 changed the title Allow users to pass kwargs to to_xarray() method Update to_xarray() method Mar 17, 2019
@andersy005 andersy005 marked this pull request as ready for review March 17, 2019 23:38
@andersy005
Member Author

@matt-long, I've experienced difficulties while trying to merge/concat CMIP data from different institutions. It appears that there are discrepancies among coordinates of different models' outputs. For queries in which there's data from more than one institution, xarray is unable to align the coordinates.

For instance, I saw one variable whose coordinates are time, lat, lon, bnds in one model and time, rlat, rlon, bnds in another model from a different institution. In addition to

https://github.com/NCAR/intake-esm/blob/67fb828a202a7c7fc15953b63436075554e3ae1c/intake_esm/cmip.py#L240-L244

should we add a check that forbids merging datasets from different institutions? Let me know what would be the right approach.
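
A rough sketch of the kind of check being asked about, assuming each model's coordinate names are available as plain lists. The function name shared_and_conflicting and the model labels are hypothetical; a real check would read the names from the opened datasets.

```python
def shared_and_conflicting(coords_by_model):
    """Return (shared, conflicting) coordinate names across models.

    coords_by_model maps a model label to its list of coordinate names
    and must contain at least one entry.
    """
    name_sets = [set(names) for names in coords_by_model.values()]
    shared = set.intersection(*name_sets)
    conflicting = set.union(*name_sets) - shared
    return shared, conflicting


# Coordinate names taken from the example in the comment above.
coords_by_model = {
    "model_a": ["time", "lat", "lon", "bnds"],
    "model_b": ["time", "rlat", "rlon", "bnds"],
}
shared, conflicting = shared_and_conflicting(coords_by_model)
```

A non-empty conflicting set is one possible signal that the datasets should not be merged or concatenated as they are.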

@matt-long
Contributor

@andersy005, I would not expect models from different institutions to be concatenate-able; the different models do indeed have different coordinates. I think we should add a check that prevents this. Another approach would be to return a data structure that includes the datasets from the different models. A dictionary with the institution ID as the key, for instance, could work. This would entail an outer loop, probably best implemented as a method that calls this _open_dataset method.
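
The dictionary approach could look roughly like the sketch below. It uses plain Python records rather than the pandas query results the PR works with, and open_by_institution, open_group, and the sample records are all hypothetical; open_group stands in for _open_dataset.

```python
from collections import defaultdict


def open_by_institution(records, open_group):
    """Group records by institution and open each group separately."""
    groups = defaultdict(list)
    for record in records:
        groups[record["institution"]].append(record)
    # Outer loop: one call to the opener per institution.
    return {inst: open_group(rows) for inst, rows in groups.items()}


def open_group(rows):
    # Stand-in for _open_dataset: just collects the file paths.
    return [row["path"] for row in rows]


# Made-up sample records for illustration only.
records = [
    {"institution": "NCAR", "path": "ncar_tas_1.nc"},
    {"institution": "NCAR", "path": "ncar_tas_2.nc"},
    {"institution": "MPI-M", "path": "mpi_tas_1.nc"},
]
datasets = open_by_institution(records, open_group)
```

Nothing is concatenated across institutions here; the caller receives one entry per institution ID.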

@andersy005
Member Author

@matt-long, this is ready for another look

@andersy005 andersy005 changed the title Update to_xarray() method Improvements to existing functionality Mar 18, 2019
This was referenced Mar 18, 2019
intake_esm/aggregate.py Outdated
intake_esm/aggregate.py
intake_esm/aggregate.py Outdated
intake_esm/cesm.py Outdated
intake_esm/cesm.py
intake_esm/aggregate.py
intake_esm/cmip.py Outdated
_ds_dict = {}
grouped = get_subset(self.collection_name, self.collection_type, query).groupby(
    'institution'
)
Contributor

Does the code above belong in a _validate_concat_open_dataset method?

Member Author

Could you expand on this?

Contributor

I think it's probably fine for now. I was wondering whether you could write a general method for validation. Not sure of the API....

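def _validate_concat(self, check_max_instance=None):
    # check_max_instance maps a field name to the maximum number of
    # unique values of that field allowed in the query results.
    if check_max_instance is not None:
        for fld, max_inst in check_max_instance.items():
            fld_list = self.query_results[fld].unique()
            if len(fld_list) > max_inst:
                raise ValueError(f'message about {fld}')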
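
For reference, a self-contained variant of this idea that runs without pandas. It is only a sketch: validate_concat takes the query results as a list of dicts (an assumption, in place of self.query_results), and the error wording and sample records are made up.

```python
def validate_concat(query_results, check_max_instance=None):
    """Raise ValueError if any field has more unique values than allowed.

    query_results: list of dicts (a stand-in for the query-results table).
    check_max_instance: maps field name -> maximum number of unique values.
    """
    if check_max_instance is None:
        return
    for fld, max_inst in check_max_instance.items():
        unique_values = {row[fld] for row in query_results}
        if len(unique_values) > max_inst:
            raise ValueError(
                f"{fld!r} has {len(unique_values)} unique values; "
                f"at most {max_inst} can be concatenated"
            )


query_results = [
    {"institution": "NCAR", "variable": "tas"},
    {"institution": "MPI-M", "variable": "tas"},
]

# One variable is fine; two institutions exceed the limit of one.
validate_concat(query_results, {"variable": 1})
try:
    validate_concat(query_results, {"institution": 1})
    error_message = None
except ValueError as err:
    error_message = str(err)
```

The limits live in the check_max_instance mapping, so the same function could cover other fields that must be unique before concatenation.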

Member Author

I see! Now that we are handling an additional case of datasets that cannot be concatenated by returning a dictionary of dsets in:

https://github.com/NCAR/intake-esm/blob/eb8fdafa0d7b22770bffde968bf36453d9cd5b9a/intake_esm/cmip.py#L254-L264

can you think of other cases that wouldn't be handled by the above code?

Contributor

I probably need to work thru some use cases to really develop my intuition here.

Member Author

Sounds good! Let me know when you get to play with this.

grouped = get_subset(self.collection_name, self.collection_type, query).groupby(
    'institution'
)
for name, group in grouped:
Contributor

Should we have an _open_dataset_groups method that calls the _open_dataset method? Would this allow us to reuse more code between cesm.py and cmip.py?

Member Author

I see the advantage of generalizing this. What would be the equivalent of _open_dataset_groups() for CESM data? In other words, what is the equivalent of CMIP institutions in CESM?

Contributor

We could think about this as moving toward any arbitrary collection of datasets that should not be concatenated along a dimension. Observations, integrations from different model versions or resolutions, for instance. We might want to have a more general dataset_id or something versus insisting on "institution."
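
One way to picture the more general grouping is to let the caller name the fields that define a group instead of hard-coding "institution". The sketch below is an assumption-laden illustration: group_records, the field names model and resolution, and the sample records are hypothetical, not part of intake-esm.

```python
from collections import defaultdict


def group_records(records, group_fields):
    """Group file paths by the tuple of values found in group_fields."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[field] for field in group_fields)
        groups[key].append(record["path"])
    return dict(groups)


# Made-up records: same model at two resolutions.
records = [
    {"model": "CESM1", "resolution": "1deg", "path": "a.nc"},
    {"model": "CESM1", "resolution": "0.25deg", "path": "b.nc"},
    {"model": "CESM1", "resolution": "1deg", "path": "c.nc"},
]
groups = group_records(records, ("model", "resolution"))
```

Passing ("institution",) would reproduce the CMIP grouping, while other collections could choose whatever fields mark datasets that should stay separate.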

Member Author

@andersy005 Mar 18, 2019

I will take a stab at generalizing it (I am afraid that it may not be a trivial task though)

intake_esm/config.yaml Outdated
intake_esm/cesm.py Outdated
Backlog automation moved this from In progress to Reviewer approved Mar 19, 2019
@andersy005 andersy005 merged commit 19b8d18 into intake:master Mar 19, 2019
Backlog automation moved this from Reviewer approved to Done Mar 19, 2019
@andersy005 andersy005 deleted the update-to_xarray branch March 19, 2019 13:54
Labels
usage question User questions which do not appear to be bugs or enhancements.
Projects
Backlog
  
Done
2 participants