Intake Integration #51

Open
saulshanabrook opened this issue Jul 27, 2019 · 0 comments


Intake is a "lightweight package for finding, investigating, loading and disseminating data." It would be nice to figure out how the JupyterLab data registry could integrate with this package.

Catalogs

Making JupyterLab aware of Intake's "Data catalogs" is probably a good place to start. They "provide an abstraction that allows you to externally define, and optionally share, descriptions of datasets, called catalog entries."

Local

For example, if you have a catalog on disk in a catalog.yml file, we might want to see the datasets it defines in the data registry. This is similar to how, if you currently have a .ipynb file, you can view the datasets in its cell outputs. To do this, we would have to parse its YAML format in JavaScript and map the different entries to URLs.

For example, this catalog.yml file:

metadata:
  version: 1
sources:
  example:
    description: test
    driver: random
    args: {}

  entry1_full:
    description: entry1 full
    metadata:
      foo: 'bar'
      bar: [1, 2, 3]
    driver: csv
    args: # passed to the open() method
      urlpath: '{{ CATALOG_DIR }}/entry1_*.csv'

  entry1_part:
    description: entry1 part
    parameters: # User parameters
      part:
        description: section of the data
        type: str
        default: "stable"
        allowed: ["latest", "stable"]
    driver: csv
    args:
      urlpath: '{{ CATALOG_DIR }}/entry1_{{ part }}.csv'

might map to a number of nested URLs:

./catalog.yml#/sources/example
./catalog.yml#/sources/entry1_full
./catalog.yml#/sources/entry1_part

The entries that point to CSV files would in turn point to nested URLs. For example, catalog.yml#/sources/entry1_part would point to:

./entry1_latest.csv
./entry1_stable.csv

This basically requires reimplementing the logic of all the drivers so that they can work client-side.

Remote

We could also support loading a remote Intake data catalog. If you loaded a URL like intake://catalog1:5000 in the data registry, you would want to see the datasets available there. Intake's proxy mode might be useful here:

Proxied access: In this mode, the catalog server uses its local drivers to open the data source and stream the data over the network to the client. The client does not need any special drivers to read the data, and can read data from files and data servers that it cannot access, as long as the catalog server has the required access.

If we implement a client API for this server protocol, we can let the server handle all the data parsing and just expose the results it returns to the user. We would have to look at its specification in more depth.
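
A first step for such a client would be turning an intake:// URL into HTTP endpoints. The sketch below assumes, based on my reading of Intake's Python client and not on a confirmed specification, that `intake://host:port` is served over plain HTTP and that the server exposes `v1/info` and `v1/source` paths. The function and interface names are hypothetical, and the paths should be checked against the server protocol before relying on them.

```typescript
// Hypothetical holder for the two endpoints a client would talk to.
interface IntakeEndpoints {
  info: string;   // assumed: lists the catalog's entries
  source: string; // assumed: opens and reads a data source
}

// Map "intake://host:port" to HTTP URLs. The "v1/info" and "v1/source"
// paths are assumptions to verify against the Intake server specification.
function intakeEndpoints(url: string): IntakeEndpoints {
  const prefix = "intake://";
  if (url.indexOf(prefix) !== 0) {
    throw new Error("Not an intake:// URL: " + url);
  }
  let base = "http://" + url.slice(prefix.length);
  if (base.charAt(base.length - 1) !== "/") {
    base += "/";
  }
  return { info: base + "v1/info", source: base + "v1/source" };
}

const endpoints = intakeEndpoints("intake://catalog1:5000");
```

The responses would still need decoding in the browser, which is part of what the specification review would have to cover.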
