Dynamic catalogs #242

Closed
jacobtomlinson opened this issue Jan 28, 2019 · 7 comments · Fixed by #323

@jacobtomlinson
Contributor

I'm currently doing some work with a data API which provides multiple CSV datasets. I ended up writing a quick notebook which gets the file manifest from the API, converts the format into the intake catalog style and outputs a yaml file. We can then use the intake catalog normally.

This API is adding new datasets all the time and I would like to keep my catalog up to date. I could set up a cron job somewhere to periodically regenerate the catalog from the manifest. However, I was thinking that perhaps I could go a step further and write an intake plugin which adds the sources on import.

The API I'm working with is a pretty standard one, so I could envisage writing some kind of manifest which points to the specific implementation of the API. The plugin would then fetch the manifest, generate the catalog, and add all the datasets to the built-in catalog.

Could anyone provide me with some guidance on how I could implement this?
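
For readers following along, the notebook step described above (manifest in, catalog YAML out) might look roughly like the sketch below. The manifest shape (a list of records with `name`, `url` and `description` keys) and the helper names `manifest_to_catalog` and `to_yaml` are assumptions made for illustration; the real API's manifest will differ. The output follows the usual `sources:` / `driver:` / `args:` layout of an intake catalog file, and the tiny YAML writer only handles nested dicts of strings.

```python
import json
from typing import Dict, List


def manifest_to_catalog(manifest: List[Dict[str, str]]) -> Dict[str, dict]:
    """Turn hypothetical {name, url, description} records into a catalog dict."""
    sources = {}
    for item in manifest:
        sources[item["name"]] = {
            "description": item.get("description", ""),
            "driver": "csv",  # every dataset in this API is assumed to be CSV
            "args": {"urlpath": item["url"]},
        }
    return {"sources": sources}


def to_yaml(node: dict, indent: int = 0) -> str:
    """Render nested dicts of strings as block-style YAML.

    Deliberately minimal: no lists, and keys are assumed to be plain names.
    String values are double-quoted via json.dumps so special characters
    are escaped.
    """
    pad = " " * indent
    lines = []
    for key, value in node.items():
        if isinstance(value, dict):
            lines.append(f"{pad}{key}:")
            lines.append(to_yaml(value, indent + 2))
        else:
            lines.append(f"{pad}{key}: {json.dumps(value)}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up manifest standing in for the API response.
    manifest = [
        {"name": "rainfall", "url": "https://example.com/rainfall.csv",
         "description": "Hourly rainfall"},
    ]
    print(to_yaml(manifest_to_catalog(manifest)))
```

A cron job would simply rerun this and overwrite the YAML file; a real implementation would more likely use a YAML library than a hand-rolled writer.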

@martindurant
Member

This is very similar to the example workflow in intake/intake-thredds#2 , and elsewhere, where a query is made when the cat is instantiated, and then catalogue entries are generated from that. Dynamic catalogues are very important to Intake!
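
The "query when the cat is instantiated" pattern can be sketched in plain Python as below. This is a stand-in to show the shape of the idea, not Intake's actual Catalog API: the class name `DynamicCatalog`, the `fetch_manifest` callable and the record layout are all hypothetical.

```python
from typing import Callable, Dict, Iterator, List

Record = Dict[str, str]


class DynamicCatalog:
    """Plain-Python stand-in for a catalog whose entries come from a query."""

    def __init__(self, fetch_manifest: Callable[[], List[Record]]):
        # fetch_manifest is a hypothetical callable that asks the remote
        # service what datasets exist right now.
        self._fetch_manifest = fetch_manifest
        self._entries: Dict[str, Record] = {}
        self.reload()  # the query happens at instantiation time

    def reload(self) -> None:
        """Re-run the query and rebuild the entries from its result."""
        self._entries = {item["name"]: item for item in self._fetch_manifest()}

    def __iter__(self) -> Iterator[str]:
        return iter(self._entries)

    def __getitem__(self, name: str) -> Record:
        return self._entries[name]

    def __len__(self) -> int:
        return len(self._entries)


if __name__ == "__main__":
    # A list standing in for the remote service.
    remote = [{"name": "a", "url": "https://example.com/a.csv"}]
    cat = DynamicCatalog(lambda: list(remote))
    print(list(cat))
    remote.append({"name": "b", "url": "https://example.com/b.csv"})
    cat.reload()
    print(list(cat))
```

Because the entries are rebuilt from the query result, new datasets show up without anyone regenerating a YAML file.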

@jacobtomlinson
Contributor Author

Ah awesome! I'll take a look at that example.

@martindurant
Member

@jacobtomlinson , how are things going with catalogs and your various plugins? Are you still developing, do you need any pointers?

@jacobtomlinson
Contributor Author

Things are progressing, although we are focusing on the data sets themselves and zarr at the moment.

@ian-r-rose
Contributor

I am also working on a similar idea to what @jacobtomlinson describes over at intake-dcat, and have run into some similar issues. I would like to be able to read in a remote catalog, convert it to something intake-friendly, and then re-export that intake catalog for use. Some specific ideas that I think might be useful to have on the base Catalog class:

  • I was a bit surprised to not be able to find a function on the catalog to export itself to a .yaml file. Individual catalog entries can do that, so I can accomplish this by iterating over the entries, calling yaml() on them, and assembling those into a new catalog, but I think it would be nice to have a convenience method to do this automatically. I haven't really thought through how this would work for nested or remote catalogs, so perhaps there are some complicating considerations that would need to be addressed.
  • I like the search() functionality for producing new catalogs. In a similar vein, I think users will want to produce new catalogs in a more targeted fashion than search(). I think it would be useful to implement some more general-purpose functional-flavored operations to the base Catalog. Something like
      Catalog.filter(Callable[[Entry], boolean]) -> Catalog
      Catalog.find(Callable[[Entry], boolean]) -> Entry
      Catalog.map(Callable[[Entry], Entry]) -> Catalog

The user could then provide their own functions/lambdas for operating on catalogs and producing new ones.
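
To make the proposal concrete, here is a minimal sketch of the three operations written as free functions over any mapping of names to entries. These are not existing Catalog methods; the function names are hypothetical, and plain dicts stand in for entries.

```python
from typing import Callable, Dict, Mapping, Optional, TypeVar

E = TypeVar("E")


def filter_entries(cat: Mapping[str, E], pred: Callable[[E], bool]) -> Dict[str, E]:
    """Keep only the entries for which pred(entry) is true."""
    return {name: entry for name, entry in cat.items() if pred(entry)}


def find_entry(cat: Mapping[str, E], pred: Callable[[E], bool]) -> Optional[E]:
    """Return the first entry for which pred(entry) is true, or None."""
    return next((entry for entry in cat.values() if pred(entry)), None)


def map_entries(cat: Mapping[str, E], func: Callable[[E], E]) -> Dict[str, E]:
    """Apply func to every entry, keeping the original names."""
    return {name: func(entry) for name, entry in cat.items()}


if __name__ == "__main__":
    # Plain dicts stand in for catalog entries here.
    entries = {
        "rain": {"driver": "csv", "urlpath": "https://example.com/rain.csv"},
        "temp": {"driver": "zarr", "urlpath": "https://example.com/temp.zarr"},
    }
    print(filter_entries(entries, lambda e: e["driver"] == "csv"))
    print(find_entry(entries, lambda e: e["driver"] == "zarr"))
    print(map_entries(entries, lambda e: {**e, "checked": True}))
```

Each function returns a plain dict, so the result could be handed to whatever constructor builds a catalog from a dict-like of entries, if one is available.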

Is there any interest in these ideas? Do they already exist in the interfaces and I have just missed them?

@martindurant
Member

These are all good ideas, and along the lines of what I was intending to work on in the immediate future. The iterator methods are interesting, but they basically amount to viewing the cat as an iterable (which it is) and being able to construct a new cat as Catalog(dict-like-of-entries), which you currently cannot do but should be able to.

You can fully serialise a catalog with .export(), but that is probably not what you are after; I will write a .save() method for output to a YAML file - or it sounds like you already have code to do this. Some care should be taken that specs that came from YAML end up very similar to the original.

However, you might not want to explicitly save the catalog to YAML, maybe you want the truly dynamic version; or use the persist() mechanism to have periodically updating snapshots of a catalog service which changes only occasionally.
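
The "periodically updating snapshot" idea can be illustrated with a small time-to-live cache. This is a generic sketch, not how Intake's persist() is implemented; `SnapshotCache`, `fetch` and `ttl_seconds` are hypothetical names.

```python
import time
from typing import Callable, Dict, Optional


class SnapshotCache:
    """Keep a snapshot of a slowly changing listing and refresh it after a TTL."""

    def __init__(
        self,
        fetch: Callable[[], Dict[str, dict]],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._fetch = fetch      # hypothetical call to the catalog service
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so the refresh logic can be tested
        self._snapshot: Optional[Dict[str, dict]] = None
        self._taken_at = 0.0

    def get(self) -> Dict[str, dict]:
        """Return the current snapshot, refetching if it is missing or stale."""
        now = self._clock()
        if self._snapshot is None or now - self._taken_at >= self._ttl:
            self._snapshot = self._fetch()
            self._taken_at = now
        return self._snapshot


if __name__ == "__main__":
    cache = SnapshotCache(lambda: {"rain": {"driver": "csv"}}, ttl_seconds=3600)
    print(cache.get())  # first call fetches
    print(cache.get())  # within the hour, served from the snapshot
```

The trade-off is the usual one: a longer TTL means fewer calls to the service, but new datasets take longer to appear.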

@ian-r-rose
Contributor

Yes, I suppose that constructing a new catalog with something like a dictionary comprehension might be more idiomatic. If we were able to do that, then my suggestions above should be doable with builtin functions.

The code for export() looks close to what I want, but not exactly (though my thinking about best-practices here is a bit of a moving target). For save() I was thinking something like what I have here:
https://github.com/ian-r-rose/intake-dcat/blob/2ae47cefb4362316c418ec151110b661a626d4b6/intake_dcat/catalog.py#L54-L65
