Intake Integration #51
Comments
I am chatting with @danielballan about this issue. We have come up with a plan!

Intake discovers catalogues in the system by looking at certain paths. So what we can do is launch an Intake server as a Jupyter server proxy that serves up Intake's HTTP API. We can connect to this in a JupyterLab plugin and register a top-level Intake dataset. The user should be able to see the catalogues within this dataset and expand them recursively. For datasources, users should be able to insert a snippet into their notebook that loads the datasource with Intake.

On the client side this requires implementing an Intake client API in JavaScript, which will use MessagePack. It will also require writing a JupyterLab extension that registers data converters for these Intake URLs that hit the API.

We can then extend that, if we like, to actually request the contents of the data sources and display them in some way on the client. For example, we could display a NumPy array in a data grid. This will require writing custom logic for each Intake driver to know how to request a chunk and parse the resulting data.

We can also display metadata provided by Intake about data sources, like their shape and dtype. We should register this with the metadata service so that the user can see metadata in the right-hand pane as they navigate their catalogue. Intake also allows datasources to provide arbitrary metadata. If the driver returns this metadata in JSON-LD, we can also display that in the metadata explorer.

We also discussed letting users discover catalogues by finding their …
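The proxy step could be wired up with jupyter-server-proxy. A minimal sketch, not a confirmed configuration: the catalog path is hypothetical, and the exact `intake-server` flags should be checked against the Intake docs.

```python
# jupyter_server_config.py -- hypothetical sketch of launching the Intake
# server behind jupyter-server-proxy. "catalog.yml" and the --port flag
# are assumptions for illustration.
c.ServerProxy.servers = {
    "intake": {
        # jupyter-server-proxy substitutes {port} with a free local port
        "command": ["intake-server", "catalog.yml", "--port", "{port}"],
    }
}
```

With something like this in place, the JupyterLab plugin could reach Intake's HTTP API at `/intake/` on the Jupyter server's own origin, avoiding CORS issues.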
Thanks for starting this discussion, I am actively thinking about it!
@ian-r-rose has a dcat dataset Intake driver that exposes metadata, so we should also try that pipeline of getting metadata from a driver, into the data explorer, and then into the metadata explorer: https://twitter.com/IanRRose/status/1182660959413784576
OK, I think I have got over my initial reservations: the frontend is much better off talking to a REST service than to a Python kernel, so we may as well use the Intake server. Serving the "builtin" items is something we want to allow anyway, rather than always exposing a given cat. It may be useful (but not necessary) to expose connections to other servers too, in which case instead of …

The server likes to talk msgpack rather than JSON; I hope that everything translates to the JS side. I suppose, if the metadata can be displayed in something like YAML blocks (i.e., as they would be in the catalog), that's enough.

So what needs to happen to make progress here?
I agree. I think this would be good to allow after initial work exposing the default server.
There is a msgpack client for JavaScript, so this should be fine.
Is there a standard for the metadata a driver exposes? Or is it up to them to expose whatever they want? If it is JSON-LD we could expose it in the metadata service. Otherwise, we could expose it however we like with whatever UI makes sense for it.
There are standard things that every entry has (name, description, driver, arguments), but the general metadata is totally arbitrary.
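To illustrate that split, a hypothetical catalog entry (all names and values here are made up for illustration): the first four fields are the standard ones every entry has, while everything under `metadata` is free-form and up to the catalog author or driver.

```yaml
sources:
  entry1_part:                  # standard: name
    description: One CSV partition of the dataset   # standard
    driver: csv                                      # standard
    args:                                            # standard: driver arguments
      urlpath: ./data/entry1_part.csv
    metadata:                   # arbitrary -- whatever the author wants
      plots:
        kind: line
      dcat:
        publisher: Example Org
```

If a driver chose to put JSON-LD under `metadata`, the metadata explorer could pick it up; otherwise the UI would just have to render it generically (e.g. as a YAML block, as suggested above).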
I have no preference about where this lives. The jupyterlab org, another related org, or Intake itself are all fine.
@saulshanabrook - this dropped off the table at some point. Are you still interested? |
Intake is a "lightweight package for finding, investigating, loading and disseminating data." It would be nice to figure out how the JupyterLab data registry could integrate with this package.
Catalogs
Having JupyterLab be aware of Intake's "data catalogs" is probably a good place to start. They "provide an abstraction that allows you to externally define, and optionally share, descriptions of datasets, called catalog entries."
Local
For example, if you have a catalog on disk in a `catalog.yaml` file, we might want to be able to see the datasets it defines in the data registry. This is similar to how, currently, if you have an `.ipynb` file you can view the datasets in its cell outputs. To do this, we would have to be able to parse its YAML format in JavaScript and map the different entries to URLs. For example, a `catalog.yml` file might map to a number of nested URLs:
And the ones that point to CSV files would also point to some nested URLs; for example, `dataset.yml#/sources/entry1_part` would point to: …

This basically requires re-implementing the logic of all the drivers, so that they can work client side.
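The entry-to-URL mapping itself is simple once the YAML is parsed; a minimal sketch in Python (the `#/sources/<name>` fragment scheme follows the examples in this section, and the YAML-parsing step is assumed to happen elsewhere):

```python
# Hypothetical sketch: derive data-registry fragment URLs from a parsed
# Intake catalog. Assumes the catalog YAML is already a dict; the
# "#/sources/<name>" fragment scheme follows the thread's examples.
def catalog_entry_urls(catalog_path, catalog):
    """Return one fragment URL per source entry in the catalog."""
    return [f"{catalog_path}#/sources/{name}"
            for name in catalog.get("sources", {})]

parsed = {
    "sources": {
        "entry1_part": {"driver": "csv"},
        "entry2": {"driver": "csv"},
    }
}
for url in catalog_entry_urls("dataset.yml", parsed):
    print(url)
```

The hard part is not this mapping but, as noted above, re-implementing each driver's loading logic on the client so the URLs actually resolve to data.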
Remote
We could also support loading a remote Intake data catalog. If you loaded a URL like `intake://catalog1:5000` in the data registry, you would want to be able to see the datasets available. Here, the proxy mode might be useful. If we implement a client API for this server protocol, then we can let it handle all the data parsing and just expose the results it returns to the user. We would have to look more in depth at its specification.
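One small piece of such a client is translating the `intake://` registry URL into the HTTP endpoint to query. A sketch under assumptions: the `/v1/info` path is my reading of the Intake server's catalog-info endpoint and should be verified against the protocol spec, and a real client would decode the msgpack response body.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical sketch: map an intake:// data-registry URL to the HTTP
# endpoint a client would hit for catalog info. "/v1/info" is an
# assumption about the Intake server protocol.
def intake_info_endpoint(registry_url):
    parts = urlsplit(registry_url)
    if parts.scheme != "intake":
        raise ValueError(f"not an intake URL: {registry_url}")
    return urlunsplit(("http", parts.netloc, "/v1/info", "", ""))

print(intake_info_endpoint("intake://catalog1:5000"))
```

The equivalent logic would live in the JavaScript client, alongside a msgpack decoder for the responses.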