This repository has been archived by the owner on Aug 8, 2023. It is now read-only.

Quilt Data Integration #7

Closed
saulshanabrook opened this issue Mar 21, 2019 · 2 comments


@saulshanabrook
Member

I wanted to write down some thoughts on possible ways Quilt Data could integrate with the data registry, to let JupyterLab users explore Quilt Data more easily.

The basic entity is a package, like this: https://quiltdata.com/package/uciml/iris. It is in the form <user>/<name>. As a URL, it could look like:

quilt://quiltdata.com/uciml/iris. So if a user could add a dataset with that, what would they wanna do with it?

  • See associated metadata provided on quilt website
  • Insert snippet in Python notebook to read this data (assuming the kernel has quilt installed)
  • View the files in it? It might be useful to expose each of these as a dataset. I wonder if we could access them client side, to throw them in a grid viewer or something.
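
For the snippet idea, here is a rough sketch of how a quilt://<host>/<user>/<name> URL could be turned into notebook code. `quilt_snippet` is a hypothetical helper, not part of Quilt or the data registry, and the `from quilt.data.<user> import <name>` line is my assumption about the legacy Quilt import style; only `quilt.install` appears in this thread.

```python
from urllib.parse import urlparse


def quilt_snippet(url: str) -> str:
    """Build a notebook snippet for a quilt://<host>/<user>/<name> URL.

    Hypothetical helper: the generated import line assumes the legacy
    `quilt.data.<user>` module layout.
    """
    parsed = urlparse(url)
    if parsed.scheme != "quilt":
        raise ValueError(f"not a quilt URL: {url!r}")
    # Drop empty segments produced by leading or trailing slashes.
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) != 2:
        raise ValueError(f"expected <user>/<name>, got {parsed.path!r}")
    user, name = parts
    return (
        "import quilt\n"
        f'quilt.install("{user}/{name}")\n'
        f"from quilt.data.{user} import {name}\n"
    )


print(quilt_snippet("quilt://quiltdata.com/uciml/iris"))
```

The registry would only need to hand the resulting string to the notebook; nothing here requires quilt to be installed on the client side.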
@akarve

akarve commented Mar 21, 2019

It is possible to fetch only the metadata of a package (quilt.install("uciml/iris", meta_only=True)). This would allow metadata inspection, as well as package traversal (similar to __dir__). One of the Quilt contributors can likely help on design or implementation.

@saulshanabrook
Member Author

Here are some notes from a meeting I just had with @akarve and @ResidentMario from Quilt Data.

Quilt Data's goal is to have declarative specifications of data dependencies.

They created some UX for JupyterLab a while back: https://github.com/quiltdata/jupyterlab. It lets you browse Quilt datasets and insert snippets for them into the notebook.

A package is backed by a manifest. On disk, the manifest is represented as a JSONL file, i.e. a line-based file. The package can be filtered dynamically, line by line, with Presto. A manifest could have 1 million entries.
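
To make the line-based idea concrete, here is a minimal sketch of streaming a JSONL manifest and filtering it one entry at a time. It uses plain Python in place of Presto, and the field names (`logical_key`, `physical_key`, `size`) and the example bucket are assumptions for illustration, not the actual Quilt manifest schema.

```python
import io
import json
from typing import Callable, Iterator, TextIO


def iter_entries(manifest: TextIO) -> Iterator[dict]:
    """Yield one parsed entry per non-empty line, without loading the whole file."""
    for line in manifest:
        line = line.strip()
        if line:
            yield json.loads(line)


def filter_entries(
    manifest: TextIO, predicate: Callable[[dict], bool]
) -> Iterator[dict]:
    """Lazily keep only the entries for which predicate(entry) is true."""
    return (entry for entry in iter_entries(manifest) if predicate(entry))


# Hypothetical three-line manifest; the keys are illustrative only.
example = io.StringIO(
    '{"logical_key": "data/a.csv", "physical_key": "s3://example-bucket/1", "size": 10}\n'
    '{"logical_key": "data/b.json", "physical_key": "s3://example-bucket/2", "size": 20}\n'
    '{"logical_key": "docs/readme.md", "physical_key": "s3://example-bucket/3", "size": 5}\n'
)

csv_entries = list(
    filter_entries(example, lambda e: e["logical_key"].endswith(".csv"))
)
print([e["logical_key"] for e in csv_entries])
```

Because each line is an independent JSON document, a reader never has to hold the full manifest in memory, which matters at the 1 million entry scale.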

The options for getting Quilt into JupyterLab:

  • Separate plugin to explore quilt datasets
  • Or do it all in data browser

They prefer the second option, because it makes it easier to expose new datasets.

Package entries are separated into a logical key (what the user browses) and a physical key (a direct URI to the bytes).

Notes from experience at Kaggle:

The number one way to boost the popularity of a dataset is to have a Jupyter notebook related to it. Other things that help: commenting, publications, external references, and visualization resources about the dataset, like a Vega dashboard. It would be great to be able to capture links to people using the dataset.

They are building a data browser that is awesome! It has Vega visualization support built in, all backed by S3.

https://alpha.quiltdata.com/b/quilt-example/tree/
https://github.com/quiltdata/t4

The data registry in JupyterLab is like a lite version of this, with, I think, similar interactions.

Main takeaway:
Dataset URLs are structured, so they are best viewed in a tree viewer. This allows discoverability and maps cleanly onto the URL structure.

Datasets can be both Quilt packages and the files within them, like this: quilt://quiltdata.com/saulshanabrook/mypackage/ and quilt://quiltdata.com/saulshanabrook/mypackage/my-file.csv
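
As a sketch of how structured URLs map onto a tree, the following groups dataset URLs into nested dicts keyed by host and then by each path segment. `build_tree` is a hypothetical helper for illustration; the data registry may represent nesting quite differently.

```python
from typing import Dict, Iterable
from urllib.parse import urlparse


def build_tree(urls: Iterable[str]) -> Dict[str, dict]:
    """Nest dataset URLs by host, then by each path segment."""
    tree: Dict[str, dict] = {}
    for url in urls:
        parsed = urlparse(url)
        node = tree.setdefault(parsed.netloc, {})
        for part in parsed.path.split("/"):
            if part:  # skip empty segments from leading/trailing slashes
                node = node.setdefault(part, {})
    return tree


tree = build_tree(
    [
        "quilt://quiltdata.com/saulshanabrook/mypackage/",
        "quilt://quiltdata.com/saulshanabrook/mypackage/my-file.csv",
    ]
)
print(tree)
```

Here the package URL and the file URL share a prefix, so the file ends up as a child of the package node, which is the shape a tree viewer would want.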

We need a way for a dataset provider to provide lazy/searchable/paged access to sub-datasets. An example is a Quilt package that contains 1 million files. We don't want to show all of those datasets in the explorer, so we need to talk about UX for this. It could have pagination or a search feature.
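
One possible shape for this, sketched with Python generators: the provider exposes sub-datasets as a lazy iterable, and the explorer pulls one page at a time, optionally through a search filter. The function names and the synthetic file names below are made up for illustration.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def search(names: Iterable[str], query: str) -> Iterator[str]:
    """Lazily keep only the names that contain the query string."""
    return (name for name in names if query in name)


def page(items: Iterable[str], page_number: int, page_size: int) -> List[str]:
    """Materialize a single page; items past the page are never consumed."""
    start = page_number * page_size
    return list(islice(items, start, start + page_size))


def file_names(count: int) -> Iterator[str]:
    """Stand-in for a package with `count` files (names are synthetic)."""
    return (f"file-{i}.csv" for i in range(count))


first_page = page(file_names(1_000_000), page_number=0, page_size=3)
matches = page(search(file_names(1_000_000), "99999"), page_number=0, page_size=2)
print(first_page)
print(matches)
```

The explorer only ever holds one page in memory. A real provider would probably push the search down to the backend instead of scanning, but the interface could look similar.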


3 participants