Dynamic catalogs #242

jacobtomlinson · 2019-01-28T15:53:25Z

I'm currently doing some work with a data API which provides multiple CSV datasets. I ended up writing a quick notebook which gets the file manifest from the API, converts it into the intake catalog format and outputs a YAML file. We can then use the intake catalog as normal.
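For illustration, the conversion step could look roughly like the sketch below. The manifest shape (a list of records with `name`, `url` and an optional `description`) and the `example.com` URLs are assumptions made up for this example, not the real API; the output follows the `sources` / `driver` / `args` layout of an intake YAML catalog.

```python
import json


def manifest_to_catalog(manifest):
    """Convert manifest records into an intake-style catalog mapping.

    Each record is assumed (hypothetically) to have a "name", a "url"
    and optionally a "description".
    """
    sources = {}
    for item in manifest:
        sources[item["name"]] = {
            "driver": "csv",
            "description": item.get("description", ""),
            "args": {"urlpath": item["url"]},
        }
    return {"sources": sources}


# Made-up manifest standing in for the API response.
manifest = [
    {
        "name": "rainfall",
        "url": "https://example.com/data/rainfall.csv",
        "description": "Daily rainfall",
    },
    {"name": "temperature", "url": "https://example.com/data/temperature.csv"},
]

catalog = manifest_to_catalog(manifest)
# Printed as JSON here to stay within the standard library.
print(json.dumps(catalog, indent=2))
```

With PyYAML installed, `yaml.safe_dump(catalog)` would produce the same structure as a conventional YAML catalog file.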

This API is adding new datasets all the time and I would like to keep my catalog up to date. I could set up a cron job somewhere to periodically regenerate the catalog from the manifest. However, I was thinking perhaps I could go a step further and write an intake plugin which adds the sources on import.

The API I'm working with is a pretty standard one, so I could envisage writing some kind of manifest which points to the specific implementation of the API. The plugin would then get the manifest, generate the catalog and add all the datasets to the built-in catalog.

Could anyone provide me with some guidance on how I could implement this?

martindurant · 2019-01-28T16:03:18Z

This is very similar to the example workflow in intake/intake-thredds#2, and elsewhere, where a query is made when the cat is instantiated, and then catalogue entries are generated from that. Dynamic catalogues are very important to Intake!
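A rough sketch of that pattern, using a plain stand-in class rather than Intake's actual Catalog API: the query runs when the object is created, and the entries are rebuilt from whatever it returns. The `DynamicCatalog` name, the `fetch_manifest` callable and the record shape are all hypothetical.

```python
class DynamicCatalog:
    """Stand-in for a catalog whose entries come from a query at instantiation.

    This is NOT intake's Catalog class, only an illustration of the idea.
    """

    def __init__(self, fetch_manifest):
        # fetch_manifest: hypothetical callable returning a list of
        # {"name": ..., "url": ...} records from the remote service.
        self._fetch_manifest = fetch_manifest
        self._entries = {}
        self.reload()

    def reload(self):
        """Re-run the query and rebuild every entry from its result."""
        self._entries = {
            item["name"]: {"driver": "csv", "args": {"urlpath": item["url"]}}
            for item in self._fetch_manifest()
        }

    def __iter__(self):
        return iter(self._entries)

    def __getitem__(self, name):
        return self._entries[name]

    def __len__(self):
        return len(self._entries)


# A list standing in for a remote API that gains datasets over time.
remote = [{"name": "a", "url": "https://example.com/a.csv"}]
cat = DynamicCatalog(lambda: list(remote))
names_before = list(cat)

remote.append({"name": "b", "url": "https://example.com/b.csv"})
cat.reload()  # the new dataset appears without editing any YAML file
names_after = list(cat)
print(names_before, names_after)
```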

jacobtomlinson · 2019-01-28T16:20:27Z

Ah awesome! I'll take a look at that example.

martindurant · 2019-02-20T22:05:11Z

@jacobtomlinson, how are things going with catalogs and your various plugins? Are you still developing, and do you need any pointers?

jacobtomlinson · 2019-02-22T09:52:36Z

Things are progressing, although we are focusing on the data sets themselves and zarr at the moment.

ian-r-rose · 2019-04-11T20:36:34Z

I am also working on a similar idea to what @jacobtomlinson describes over at intake-dcat, and have run into some similar issues. I would like to be able to read in a remote catalog, convert it to something intake-friendly, and then re-export that intake catalog for use. Some specific ideas that I think might be useful to have on the base Catalog class:

I was a bit surprised to not be able to find a function on the catalog to export itself to a .yaml file. Individual catalog entries can do that, so I can accomplish this by iterating over the entries, calling yaml() on them, and assembling those into a new catalog, but I think it would be nice to have a convenience method to do this automatically. I haven't really thought through how this would work for nested or remote catalogs, so perhaps there are some complicating considerations that would need to be addressed.

I like the search() functionality for producing new catalogs. In a similar vein, I think users will want to produce new catalogs in a more targeted fashion than search(). I think it would be useful to add some more general-purpose, functional-flavored operations to the base Catalog. Something like:

Catalog.filter(Callable[[Entry], boolean]) -> Catalog
Catalog.find(Callable[[Entry], boolean]) -> Entry
Catalog.map(Callable[[Entry], Entry]) -> Catalog

The user could then provide their own functions/lambdas for operating on catalogs and producing new ones.
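To make the idea concrete, here is a minimal sketch of those three operations written as free functions over a plain dict of entries. This is not an existing intake interface: entries are modelled as ordinary dicts, and the function names and sample data are made up for the example.

```python
from typing import Callable, Dict, Optional

# Entries are modelled as plain dicts keyed by name; a real catalog entry
# would be a richer object.
Entry = dict
Entries = Dict[str, Entry]


def filter_entries(entries: Entries, predicate: Callable[[Entry], bool]) -> Entries:
    """Keep only the entries for which predicate(entry) is true."""
    return {name: e for name, e in entries.items() if predicate(e)}


def find_entry(entries: Entries, predicate: Callable[[Entry], bool]) -> Optional[Entry]:
    """Return the first matching entry, or None if nothing matches."""
    return next((e for e in entries.values() if predicate(e)), None)


def map_entries(entries: Entries, func: Callable[[Entry], Entry]) -> Entries:
    """Apply func to every entry, keeping the same names."""
    return {name: func(e) for name, e in entries.items()}


entries = {
    "rain": {"driver": "csv", "args": {"urlpath": "https://example.com/rain.csv"}},
    "grid": {"driver": "zarr", "args": {"urlpath": "https://example.com/grid.zarr"}},
}

csv_only = filter_entries(entries, lambda e: e["driver"] == "csv")
first_zarr = find_entry(entries, lambda e: e["driver"] == "zarr")
tagged = map_entries(entries, lambda e: {**e, "metadata": {"origin": "demo"}})
print(sorted(csv_only), first_zarr["driver"], tagged["rain"]["metadata"])
```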

Is there any interest in these ideas? Do they already exist in the interfaces and I have just missed them?

martindurant · 2019-04-11T20:44:10Z

These are all good ideas, and along the lines of what I was intending to work on in the immediate future. The iterator methods are interesting, but they basically amount to viewing the cat as an iterable (which it is) and being able to construct a new cat as Catalog(dict-like-of-entries), which you cannot do at the moment, but should be able to.

You can fully serialise a catalog with .export(), but that is probably not what you are after; I will write a .save() method for output to a YAML file - or it sounds like you already have code to do this. Some care should be taken that specs that came from YAML end up very similar to the original.

However, you might not want to explicitly save the catalog to YAML; maybe you want the truly dynamic version, or to use the persist() mechanism to keep periodically updating snapshots of a catalog service which changes only occasionally.
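As a loose illustration of the "periodically updating snapshot" idea (this is not how persist() is implemented, just a generic time-to-live cache): the stored copy is reused until it is older than a chosen interval, and only then is the service queried again. The `Snapshot` class, its parameters and the fake clock are all invented for this sketch.

```python
import time


class Snapshot:
    """Cache the result of fetch() and refresh it once it is older than ttl_seconds."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch          # hypothetical callable that queries the service
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the example can fake time
        self._value = None
        self._fetched_at = None

    def get(self):
        now = self._clock()
        expired = self._fetched_at is None or now - self._fetched_at >= self._ttl
        if expired:
            self._value = self._fetch()
            self._fetched_at = now
        return self._value


calls = []


def fetch():
    # Stand-in for a remote query; records how often it was invoked.
    calls.append(1)
    return {"version": len(calls)}


now = [0.0]  # fake clock, advanced by hand below
snap = Snapshot(fetch, ttl_seconds=60, clock=lambda: now[0])

first = snap.get()    # first access triggers a fetch
now[0] = 30.0
second = snap.get()   # still fresh, served from the snapshot
now[0] = 61.0
third = snap.get()    # older than 60 s, fetched again
print(first, second, third, len(calls))
```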

ian-r-rose · 2019-04-11T20:54:38Z

Yes, I suppose that constructing a new catalog with something like a dictionary comprehension might be more idiomatic. If we were able to do that, then my suggestions above should be doable with built-in functions.

The code for export() looks close to what I want, but not exactly (though my thinking about best practices here is a bit of a moving target). For save() I was thinking of something like what I have here:
https://github.com/ian-r-rose/intake-dcat/blob/2ae47cefb4362316c418ec151110b661a626d4b6/intake_dcat/catalog.py#L54-L65

martindurant mentioned this issue Apr 12, 2019

Add cat methods #323

Merged

martindurant self-assigned this Apr 12, 2019

martindurant added the in progress label Apr 12, 2019

martindurant closed this as completed in #323 Apr 17, 2019

martindurant removed the in progress label Apr 17, 2019
