This repository has been archived by the owner on Aug 8, 2023. It is now read-only.

Intake Integration #51

Closed
saulshanabrook opened this issue Jul 27, 2019 · 8 comments
Labels: help wanted · @jupyterlab/dataregistry-extension · type:Enhancement
Comments

@saulshanabrook
Member

Intake is a "lightweight package for finding, investigating, loading and disseminating data." It would be nice to figure out how the JupyterLab data registry could integrate with this package.

Catalogs

Having JupyterLab be aware of Intake's "Data catalogs" is probably a good place to start. They "provide an abstraction that allows you to externally define, and optionally share, descriptions of datasets, called catalog entries."

Local

For example, if you have a catalog on disk in a catalog.yml file, we might want to be able to see the datasets it defines in the data registry. This is similar to how, currently, if you have a .ipynb file, you can view the datasets in its cell outputs. To do this, we would have to be able to parse its YAML format in JavaScript and map the different entries to URLs.

For example, this catalog.yml file:

metadata:
  version: 1
sources:
  example:
    description: test
    driver: random
    args: {}

  entry1_full:
    description: entry1 full
    metadata:
      foo: 'bar'
      bar: [1, 2, 3]
    driver: csv
    args: # passed to the open() method
      urlpath: '{{ CATALOG_DIR }}/entry1_*.csv'

  entry1_part:
    description: entry1 part
    parameters: # User parameters
      part:
        description: section of the data
        type: str
        default: "stable"
        allowed: ["latest", "stable"]
    driver: csv
    args:
      urlpath: '{{ CATALOG_DIR }}/entry1_{{ part }}.csv'

Might map to a number of nested URLs:

./catalog.yml#/sources/example
./catalog.yml#/sources/entry1_full
./catalog.yml#/sources/entry1_part

And the entries that point to CSV files would themselves expose further nested URLs; for example, catalog.yml#/sources/entry1_part would point to:

./entry1_latest.csv
./entry1_stable.csv

This basically requires re-implementing the logic of all the drivers so that they can work client side.
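The mapping described above could be sketched roughly as follows. This is a hypothetical illustration written in Python for brevity (the actual implementation would be client-side JavaScript), and it assumes the catalog YAML has already been parsed into a plain dict; the function names are invented for this sketch.

```python
def catalog_urls(catalog_path, catalog):
    """Map each source in a parsed Intake catalog to a nested fragment URL."""
    return [f"{catalog_path}#/sources/{name}"
            for name in catalog.get("sources", {})]

def expand_csv_urls(source, catalog_dir="."):
    """Expand a csv source's urlpath template into concrete file URLs,
    substituting CATALOG_DIR and enumerating any allowed user parameters."""
    urlpath = source.get("args", {}).get("urlpath", "")
    urls = [urlpath.replace("{{ CATALOG_DIR }}", catalog_dir)]
    for pname, spec in source.get("parameters", {}).items():
        values = spec.get("allowed", [spec.get("default")])
        urls = [u.replace("{{ " + pname + " }}", str(v))
                for u in urls for v in values]
    return urls
```

For the entry1_part source above, expand_csv_urls would enumerate the allowed values of the part parameter and yield ./entry1_latest.csv and ./entry1_stable.csv.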

Remote

We could also support loading a remote Intake data catalog. If you loaded a URL like intake://catalog1:5000 in the data registry you would want to be able to see the datasets available. Here, the proxy mode might be useful:

Proxied access: In this mode, the catalog server uses its local drivers to open the data source and stream the data over the network to the client. The client does not need any special drivers to read the data, and can read data from files and data servers that it cannot access, as long as the catalog server has the required access.

If we implement a client API for this server protocol, then we can let it handle all the data parsing and simply expose the results it returns to the user. We would have to look more in depth at its specification.
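Once a client can list a remote catalog, mapping the listing into registry URLs is straightforward. A hypothetical sketch, again in Python for brevity: the response shape (a "sources" list of entries each carrying a "name") is an assumption about what the server's info endpoint returns, not a verified spec, and the function name is invented.

```python
def remote_entry_urls(base_url, info):
    """Map each source reported by a remote Intake catalog server to a
    nested URL under the intake:// URL loaded in the data registry.

    base_url: the URL the user loaded, e.g. "intake://catalog1:5000"
    info:     the decoded catalog listing from the server (assumed shape)
    """
    return [f"{base_url}#/{entry['name']}"
            for entry in info.get("sources", [])]
```

So loading intake://catalog1:5000 against a server reporting a source named "abcd" would surface intake://catalog1:5000#/abcd in the registry.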

@saulshanabrook saulshanabrook added type:Enhancement New feature or request help wanted Extra attention is needed labels Jul 27, 2019
@saulshanabrook saulshanabrook added this to the Future milestone Jul 27, 2019
@saulshanabrook
Member Author

I am chatting with @danielballan about this issue. We have come up with a plan!

Intake discovers catalogues in the system by looking at certain paths for .yml files. There is an open issue (intake/intake#404) to also discover catalogues in Python packages via an entry point intake.catalogues.

So what we can do is launch an Intake server as a Jupyter server proxy that serves up Intake's HTTP API. We can connect to this in a JupyterLab plugin and register a top level Intake dataset. The user should be able to see the catalogues within this dataset and expand them recursively. For datasources, users should be able to insert a snippet into their notebook that loads this datasource with intake, like import intake \n intake.cat.abcd.
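The snippet-insertion step could be as simple as templating the entry's dotted path into the default-catalog access pattern mentioned above. A minimal sketch; the helper name and the assumption that entries are addressed by a dotted path are hypothetical:

```python
def intake_snippet(entry_path):
    """Build the notebook snippet that loads a datasource from the user's
    default Intake catalog, e.g. intake_snippet("abcd") for intake.cat.abcd.
    """
    return f"import intake\nintake.cat.{entry_path}"
```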

On the client side this requires implementing an Intake client API in JavaScript, which will use msgpack. It will also require writing a JupyterLab extension that registers data converters for these Intake URLs that hit the API.

We can then extend that, if we like, to actually request the contents of the data sources and display them in some way on the client. For example, we can display a numpy array in a datagrid. This will require writing custom logic for each intake driver to know how to request a chunk and parse the resulting data.

We can also display metadata provided by Intake about data sources, like their shape and dtype. We should register this with the metadata service so that the user can see metadata in the right-hand pane as they navigate their catalogue. Intake also allows datasources to provide arbitrary metadata. If the driver returns this metadata in JSON-LD, we can also display that in the metadata explorer.
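Packaging the driver-reported fields for the metadata explorer might look like the following. This is an illustrative sketch only: the function name is invented, and the JSON-LD-style vocabulary here is not a real, agreed schema.

```python
def metadata_document(name, shape=None, dtype=None, extra=None):
    """Wrap driver-reported metadata (shape, dtype, arbitrary extras) in a
    JSON-LD-style dict for display. Vocabulary is illustrative only."""
    doc = {"@context": "http://schema.org/", "@type": "Dataset", "name": name}
    if shape is not None:
        doc["shape"] = list(shape)
    if dtype is not None:
        doc["dtype"] = str(dtype)
    doc.update(extra or {})   # arbitrary driver-provided metadata
    return doc
```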


We also discussed letting users discover catalogues by finding their intake.yml files in the file system and expanding them, as well as showing the catalogues provided by different Python packages. That way, when users are exploring the data registry they see the source the catalogue came from, instead of seeing all catalogues flattened at the top level. Authors could also write datasets.yml files that collate these separate catalogues for a single repo. We decided against this approach for now, since Intake already has a discovery mechanism for merging all the catalogues available to users.

cc @martindurant @gwbischof

@martindurant

Thanks for starting this discussion, I am actively thinking about it!

@saulshanabrook
Member Author

@ian-r-rose has a dcat dataset intake driver that exposes metadata, so we should also try that pipeline of getting metadata from a driver, into the data explorer, and then into the metadata explorer: https://twitter.com/IanRRose/status/1182660959413784576

@martindurant

OK, I think I have got over my initial reservations: the frontend is much better off talking with a REST service than with a Python kernel, so we may as well indeed use the Intake server. Serving the "builtin" items is something we want to allow anyway, rather than always exposing a given cat. It may be useful (but not necessary) to expose connections to other servers too, in which case instead of intake.cat.abc, you would need cat = intake.open_catalog("..."); cat.abc.

The server likes to talk msgpack rather than JSON; I hope that everything translates to the JS side. I suppose, if the metadata can be displayed in something like YAML blocks (i.e., as they would be in the catalog), that's enough.

So what needs to happen to make progress here?

@saulshanabrook
Member Author

It may be useful (but not necessary) to expose connections to other servers too, in which case instead of intake.cat.abc, you would need cat = intake.open_catalog("..."); cat.abc.

I agree. I think this would be good to allow after initial work exposing the default server.

The server likes to talk msgpack, rather than JSON, I hope that everything translates to the JS side.

There is a msgpack client for javascript so this should be fine.

I suppose, if the metadata can be displayed in something like YAML blocks (i.e., as they would be in the catalog), that's enough.

Is there a standard for the metadata a driver exposes? Or is it up to them to expose whatever they want? If it is JSON-LD we could expose it in the metadata service. Otherwise, we could expose it however we like with whatever UI makes sense for it.

So what needs to happen to make progress here?

  • Create new repo for this work
  • Create Python package that exposes intake API through jupyter server proxy
  • Create JS package that exposes intake API in JS
  • Create JS package that exposes a JupyterLab extension that connects to the Intake API via the JS API package and adds it to the data registry.
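For the server-proxy piece above, a rough configuration sketch (e.g. in jupyter_server_config.py) might look like this. The intake-server command and its --port flag come from the Intake package; treat the rest of this fragment as an untested assumption about how jupyter-server-proxy would be wired up, not a finished setup.

```python
# jupyter_server_config.py — sketch of proxying the Intake server
c.ServerProxy.servers = {
    "intake": {
        # jupyter-server-proxy substitutes {port} with a free port
        "command": ["intake-server", "--port", "{port}"],
        "absolute_url": False,
    }
}
```

The proxied Intake HTTP API would then be reachable under the Jupyter server at /intake/, which the JupyterLab extension could query from the browser.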

@martindurant

Is there a standard for the metadata a driver exposes?

There are standard things that every entry has (name, description, driver, arguments), but the general metadata is totally arbitrary.

@martindurant

Create new repo for this work

I have no preference where this lives. On jupyterlab or other related org or in Intake, all are fine.

@martindurant

@saulshanabrook - this dropped off the table at some point. Are you still interested?
