Use the output of a container as input for a catalog source #715

wachsylon opened this issue Feb 28, 2023 · 8 comments

Comments

@wachsylon

Hi,

Sorry if the terms I use below are not fully correct according to the official definitions.

Let's assume I have a catalog hi with an entry:

sources:
  freva_cmip5_df:
    args:
      base_url: SOLRURL
      core: latest
      qargs:
        rows: '100'
      query: project:cmip5
    driver: intake_solr.source.SOLRTableSource

hi.freva_cmip5_df.read() returns a DataFrame. Can I use this DataFrame within another catalog entry which uses a DataFrame as input in the args somehow?

In general, can I use the container (I don't know if that is the correct name), i.e. the output of a source, as input for another intake catalog entry?

Thanks!

@martindurant
Member

Yes you can - please see "derived datasets", also known as transforms.

@wachsylon
Author

Thanks for the quick response!

However, it seems that exactly one source is supported as an argument, right? I cannot distinguish between different sources in the parameters.

Also, the visibility in the docs is not correct, some yaml parts were hidden. I had to look into the raw file.

@wachsylon
Author

Here is what I mean:
The intake-esm plugin needs one JSON input and one DataFrame input. I define those in the sources. Then I need to specify the relevant transform_kwargs for main_cat, but I do not know how:

sources:
  freva_cmip5_df:
    args:
      base_url: URL
      core: latest
      qargs:
        rows: '100'
      query: project:cmip5
    driver: intake_solr.source.SOLRTableSource
  generic_intake_json:
    args:
      urlpath: generic.json
    description: This is an ESM collection for CMIP5 data accessible on the DKRZ's
    driver:
    - json
  main_cat:
    driver: intake.source.derived.GenericTransform
    args:
      targets:
        - freva_cmip5_df
        - generic_intake_json
      transform: intake.open_esm_datastore
      transform_kwargs:
        esmcol_obj: freva_cmip5_df
        esmcol_data: generic_intake_json

@martindurant
Member

Also, the visibility in the docs is not correct, some yaml parts were hidden. I had to look into the raw file.

@blakerosenthal , perhaps another artefact of the change in the docs config?

@martindurant
Member

So, do I follow that you would like to take the result of the Solr query or a flat JSON file and use it to generate an ESM catalog as output? Do you expect the user to choose between the two possible inputs at runtime (or via environment variables or something else), or were you hoping to combine the inputs?

@blakerosenthal
Member

Also, the visibility in the docs is not correct, some yaml parts were hidden. I had to look into the raw file.

@blakerosenthal , perhaps another artefact of the change in the docs config?

@wachsylon Could you post a link to the docs where you're seeing the hidden yaml?

@wachsylon
Author

So, do I follow that you would like to take the result of the Solr query or a flat JSON file and use it to generate an ESM catalog as output?

In my use case, my main catalog would use two inputs that are both in the same catalog. I need to specify them as kwargs for the open function, but there is no way to do this. It seems that the targets are passed as the first argument; if more than one is specified, they arrive as a list rather than as separate arguments.
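
To make the mismatch concrete, here is a minimal sketch in plain Python. It assumes, as described above, that the transform receives all inputs as one positional list. `open_datastore` is a hypothetical stand-in for a function that wants two keyword arguments, and `unpack_as_kwargs` is an illustrative helper; neither is part of intake or intake-esm.

```python
from typing import Any, Callable, Sequence


def unpack_as_kwargs(
    func: Callable[..., Any], names: Sequence[str]
) -> Callable[..., Any]:
    """Wrap `func` so it accepts one list and forwards its items as keyword arguments."""

    def wrapper(inputs: Sequence[Any], **extra: Any) -> Any:
        if len(inputs) != len(names):
            raise ValueError(f"expected {len(names)} inputs, got {len(inputs)}")
        # Pair each positional input with the keyword name it should be passed as.
        return func(**dict(zip(names, inputs)), **extra)

    return wrapper


# Hypothetical stand-in for a function that needs two distinct keyword inputs.
def open_datastore(esmcol_obj: Any, esmcol_data: Any) -> dict:
    return {"obj": esmcol_obj, "data": esmcol_data}


adapted = unpack_as_kwargs(open_datastore, ["esmcol_obj", "esmcol_data"])
result = adapted(["first input", "second input"])
```

An adapter like this only helps if the transform really is handed every target; it does not change how a derived source selects its input.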

@martindurant
Member

Correct, the current design of the various types of derived dataset expects one source as input.

However, the source instances do get a reference to the catalog object from which they were made, so you could write a derived class that fetches more than one input. Would you like to give it a go? A PR for the new class in intake.source.derived would be appreciated!
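
As a rough illustration of that idea, the sketch below shows the shape such a class might take. It is deliberately written without intake: `StubSource` and the plain dict stand in for real catalog entries and the catalog, and `MultiInputTransform`, its `targets` mapping, and `combine` are all hypothetical names, not existing intake API. A real implementation would presumably subclass something in intake.source.derived and use the catalog reference the source already holds.

```python
from typing import Any, Callable, Dict, Mapping, Optional


class StubSource:
    """Stand-in for a catalog entry; only `read()` is modelled."""

    def __init__(self, data: Any) -> None:
        self._data = data

    def read(self) -> Any:
        return self._data


class MultiInputTransform:
    """Hypothetical derived source that reads several named catalog entries."""

    def __init__(
        self,
        cat: Mapping[str, StubSource],
        targets: Dict[str, str],  # keyword name -> catalog entry name
        transform: Callable[..., Any],
        transform_kwargs: Optional[Dict[str, Any]] = None,
    ) -> None:
        self.cat = cat
        self.targets = targets
        self.transform = transform
        self.transform_kwargs = transform_kwargs or {}

    def read(self) -> Any:
        # Read every target from the same catalog and pass each one
        # to the transform under its own keyword name.
        inputs = {kw: self.cat[name].read() for kw, name in self.targets.items()}
        return self.transform(**inputs, **self.transform_kwargs)


# Usage with stand-in data; entry names mirror the YAML earlier in the thread.
cat = {
    "freva_cmip5_df": StubSource([{"project": "cmip5"}]),
    "generic_intake_json": StubSource({"id": "generic"}),
}


def combine(table: list, spec: dict, note: str = "") -> dict:
    return {"rows": len(table), "spec_id": spec["id"], "note": note}


main_cat = MultiInputTransform(
    cat,
    targets={"table": "freva_cmip5_df", "spec": "generic_intake_json"},
    transform=combine,
    transform_kwargs={"note": "demo"},
)
combined = main_cat.read()
```

Mapping keyword names to entry names, rather than passing a bare list of targets, is what lets the transform tell its inputs apart.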
