Searching with derived variables is broken #427

Closed
3 tasks done
aulemahal opened this issue Dec 21, 2021 · 1 comment · Fixed by #428
aulemahal commented Dec 21, 2021

Here's a quick checklist in what to include:

  • Include a detailed description of the bug or suggestion

  • Output of intake_esm.show_versions()

  • Minimal, self-contained copy-pastable example that generates the issue if possible. Please be concise with code posted. See guidelines below on how to provide a good bug report:

Description

A few aspects of searching a catalog while using the derived variable registry seem to be broken:

  • When searching the root catalog without specifying variables, the returned catalog has lost the derived variable registry.
  • When searching a catalog where the variable column is an iterable, only the variables explicitly named in the query are added to the "requested_variables" field. If those are derived variables, to_dataset_dict returns empty datasets because it only returns the intersection of the dataset's variables and requested_variables.
  • When searching with a "complex" query that contains a request for a derived variable, the returned catalog contains every row that could be used to construct that derived variable, regardless of whether those rows match the other query terms.
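
The second point can be illustrated without intake-esm at all. The sketch below is a hypothetical stand-in built on plain Python sets, not the library's code: `select_variables`, `expand_requested`, and the `DEPENDENCIES` mapping are illustrative names, and the mapping simply mirrors the sfcWind example in this report.

```python
# Hypothetical illustration; none of these names are intake-esm APIs.

# Derived variable -> variables it is computed from (assumed mapping).
DEPENDENCIES = {"sfcWind": ["U", "V"]}


def select_variables(dataset_vars, requested):
    """Keep only the dataset variables that were requested."""
    return sorted(set(dataset_vars) & set(requested))


def expand_requested(requested, dependencies=DEPENDENCIES):
    """Add the dependencies of any derived variable to the request."""
    expanded = set(requested)
    for name in requested:
        expanded.update(dependencies.get(name, []))
    return sorted(expanded)


on_disk = ["U", "V", "T"]   # variables stored in the asset
requested = ["sfcWind"]     # what the query named explicitly

# Intersecting with the raw request drops everything.
naive = select_variables(on_disk, requested)
# Intersecting with the expanded request keeps the dependencies.
fixed = select_variables(on_disk, expand_requested(requested))

print(naive)  # []
print(fixed)  # ['U', 'V']
```

If the real behaviour resembles this, adding the dependencies of each derived variable to the requested variables would leave U and V available for the derivation function.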

What I Did

import intake_esm

dvr = intake_esm.DerivedVariableRegistry()

@dvr.register(variable='sfcWind', query={'variable': ['U', 'V']})
def windspeed(ds):
    # Wind speed magnitude from the two horizontal components
    ds['sfcWind'] = (ds.U**2 + ds.V**2)**0.5
    return ds

cat = intake_esm.esm_datastore('https://raw.githubusercontent.com/NCAR/cesm-lens-aws/master/intake-catalogs/aws-cesm1-le.json', registry=dvr)

With this setup, I thought the following three calls would return the same catalog, but they do not.

cat.search(frequency='monthly', variable='sfcWind')
# <aws-cesm1-le catalog with 7 dataset(s) from 14 asset(s)>

cat.search(frequency='monthly').search(variable='sfcWind')
# <aws-cesm1-le catalog with 0 dataset(s) from 0 asset(s)>

cat.search(variable='sfcWind').search(frequency='monthly')
# <aws-cesm1-le catalog with 4 dataset(s) from 8 asset(s)>

I thought the first would give me what I want in this example: monthly datasets of the wind speed. Instead, it includes non-monthly datasets too. The last one does include all expected assets, but it has lost its derived variable registry, so to_dataset_dict will not give me the sfcWind variable I'm looking for.
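
To make the expected semantics concrete, here is a minimal sketch on a toy catalog. It is not intake-esm's implementation: `ROWS`, `DERIVED`, `matches`, and `search` are hypothetical names, and the rows are invented for illustration. The idea is that a derived variable in the query is translated into its dependencies, and every other query term still applies to the result.

```python
# Hypothetical toy catalog; not intake-esm's implementation.
ROWS = [
    {"frequency": "monthly", "variable": "U"},
    {"frequency": "monthly", "variable": "V"},
    {"frequency": "daily", "variable": "U"},
    {"frequency": "daily", "variable": "V"},
    {"frequency": "monthly", "variable": "T"},
]

# Derived variable -> query selecting the rows it depends on (assumed).
DERIVED = {"sfcWind": {"variable": ["U", "V"]}}


def matches(row, query):
    """Return True if the row satisfies every term of the query."""
    for key, wanted in query.items():
        allowed = wanted if isinstance(wanted, list) else [wanted]
        if row[key] not in allowed:
            return False
    return True


def search(rows, **query):
    """Filter rows, translating derived variables into their dependencies."""
    query = dict(query)
    wanted = query.get("variable")
    if wanted is not None:
        names = wanted if isinstance(wanted, list) else [wanted]
        resolved = []
        for name in names:
            if name in DERIVED:
                resolved.extend(DERIVED[name]["variable"])
            else:
                resolved.append(name)
        query["variable"] = resolved
    return [row for row in rows if matches(row, query)]


a = search(ROWS, frequency="monthly", variable="sfcWind")
b = search(search(ROWS, frequency="monthly"), variable="sfcWind")
c = search(search(ROWS, variable="sfcWind"), frequency="monthly")
print(a == b == c)  # True
```

In this sketch the three call orders agree because the derived-variable mapping is a module-level constant. A real catalog would also have to carry its registry along to each returned catalog for chained searches to behave this way.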

I haven't found an MWE for the second issue, sorry.

I can push modifications to esm_datastore.search that fix all three problems.

Version information: output of intake_esm.show_versions()

Paste the output of intake_esm.show_versions() here:

INSTALLED VERSIONS

cftime: 1.4.1
dask: 2021.09.0
fastprogress: 0.2.7
fsspec: 2021.09.0
gcsfs: 2021.09.0
intake: 0.6.4
intake_esm: 2021.8.17.post43+dirty
netCDF4: 1.5.6
pandas: 1.3.2
requests: 2.26.0
s3fs: 2021.09.0
xarray: 0.19.0
zarr: 2.9.5

andersy005 (Member) commented

I can push modifications to esm_datastore.search that fix all three problems.

yes, please 🎉...
