Quilt Data Integration #7

saulshanabrook · 2019-03-21T19:54:02Z

I wanted to write down some thoughts on possible ways Quilt Data could integrate with the data registry, to let JupyterLab users explore Quilt Data more easily.

The basic entity is a package, like this: https://quiltdata.com/package/uciml/iris. It is in the form <user>/<name>. As a URL, it could look like:

quilt://quiltdata.com/uciml/iris. So if a user could add a dataset with that, what would they wanna do with it?

See associated metadata provided on the Quilt website
Insert snippet in Python notebook to read this data (assuming the kernel has quilt installed)
View the files in it? It might be useful to expose each of these as a dataset. I wonder whether we could access them client side and throw them into a grid viewer or something.
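
As a rough sketch of the "insert snippet" idea above, the helper below turns a <user>/<name> handle into the text of a notebook cell. The function name quilt_snippet is hypothetical, and the generated cell assumes the Quilt 2 style client (quilt.install plus quilt.data imports) is what the kernel has installed; a different client version would need different text.

```python
import re

# Hypothetical helper, not part of Quilt or the data registry.
# Handles are "<user>/<name>", where both parts must be valid Python
# identifiers so they can appear in an import statement.
_HANDLE = re.compile(r"(?P<user>[A-Za-z_]\w*)/(?P<name>[A-Za-z_]\w*)")


def quilt_snippet(handle: str) -> str:
    """Return notebook-cell text that installs and imports a package."""
    match = _HANDLE.fullmatch(handle)
    if match is None:
        raise ValueError(f"expected '<user>/<name>', got {handle!r}")
    user, name = match["user"], match["name"]
    # Assumption: the kernel uses the Quilt 2 style API.
    return "\n".join([
        "import quilt",
        f'quilt.install("{user}/{name}")',
        f"from quilt.data.{user} import {name}",
    ])


print(quilt_snippet("uciml/iris"))
```

The extension would only need to paste the returned string into a new cell; nothing here touches the network or requires quilt on the server side.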


akarve · 2019-03-21T21:23:02Z

It is possible to fetch only the metadata of a package (quilt.install("uciml/iris", meta_only=True)). This would allow metadata inspection, as well as package traversal (similar to __dir__). One of the Quilt contributors can likely help on design or implementation.

saulshanabrook · 2019-03-21T22:08:22Z

Here are some notes from a meeting I just had with @akarve and @ResidentMario from Quilt Data.

Quilt Data's goal is to have declarative specifications of data dependencies.

A while back they created some UX for JupyterLab that lets you browse Quilt datasets and insert snippets for them into the notebook: https://github.com/quiltdata/jupyterlab

A package is backed by a manifest. The manifest is represented on disk as a JSONL file, so it is line based. The package can be filtered dynamically, line by line, using Presto. A manifest could have 1 million entries.
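
To make the line-based point concrete, here is a minimal sketch of streaming a JSONL manifest and filtering it entry by entry. It uses a plain Python predicate rather than Presto, and the manifest content and field names (logical_key, size) are made up for illustration, not taken from a real Quilt manifest.

```python
import io
import json
from typing import Callable, Iterator, TextIO


def iter_manifest(lines: TextIO, keep: Callable[[dict], bool]) -> Iterator[dict]:
    """Yield manifest entries one line at a time, keeping only matches."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        entry = json.loads(line)
        if keep(entry):
            yield entry


# Made-up manifest; the field names are illustrative assumptions.
manifest = io.StringIO(
    '{"logical_key": "data/iris.csv", "size": 4551}\n'
    '{"logical_key": "README.md", "size": 120}\n'
    '{"logical_key": "data/bezdek.csv", "size": 4600}\n'
)

csv_entries = list(
    iter_manifest(manifest, lambda e: e["logical_key"].endswith(".csv"))
)
print([e["logical_key"] for e in csv_entries])
```

Because each line is an independent JSON document, the reader never has to hold the whole manifest in memory, which is what makes a manifest with a million entries workable.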

The options for getting Quilt into JupyterLab:

Separate plugin to explore quilt datasets
Or do it all in data browser

They prefer the second option, because it makes it easier to expose new datasets.

Package entries are separated into a logical key (what the user browses) and a physical key (a direct URI to the bytes).

Notes from experience at Kaggle:

The number one way to boost the popularity of a dataset is to have a Jupyter notebook related to it. Also commenting, publications, and external references. Visualization resources about the dataset help too, like a Vega dashboard about it. It would be great to be able to capture links to people using the dataset.

They are building a data browser that is awesome! It has Vega visualization support built in, all backed by S3:

https://alpha.quiltdata.com/b/quilt-example/tree/
https://github.com/quiltdata/t4

The data registry in JupyterLab is like a lite version of this, with, I think, similar interactions.

Main takeaway:
Dataset URLs are structured. They are best viewed in a tree viewer. This allows discoverability and maps cleanly onto the URL structure.
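
A small sketch of why structured URLs map onto a tree: splitting each path on "/" and nesting the parts gives exactly the hierarchy a tree viewer would render. The build_tree name and the sample keys are hypothetical.

```python
from typing import Dict, Iterable


def build_tree(keys: Iterable[str]) -> Dict[str, dict]:
    """Nest slash-separated keys into dicts; leaves map to empty dicts."""
    root: Dict[str, dict] = {}
    for key in keys:
        node = root
        for part in key.strip("/").split("/"):
            # Reuse the child if it exists, otherwise create it.
            node = node.setdefault(part, {})
    return root


# Sample keys, made up for illustration.
tree = build_tree([
    "uciml/iris/data/iris.csv",
    "uciml/iris/README.md",
    "uciml/wine/wine.csv",
])
print(sorted(tree["uciml"]))
```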

Datasets can be both Quilt packages and the files within them, like this: quilt://quiltdata.com/saulshanabrook/mypackage/ and quilt://quiltdata.com/saulshanabrook/mypackage/my-file.csv
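
One possible way to tell those two cases apart is to parse the URL and check whether anything follows the package name. This is only a sketch: the quilt:// scheme is the proposal from this thread, and QuiltRef and parse_quilt_url are hypothetical names.

```python
from typing import NamedTuple, Optional
from urllib.parse import urlsplit


class QuiltRef(NamedTuple):
    host: str
    user: str
    package: str
    path: Optional[str]  # None when the URL names the package itself


def parse_quilt_url(url: str) -> QuiltRef:
    """Split quilt://<host>/<user>/<name>[/<file path>] into its parts."""
    parts = urlsplit(url)
    if parts.scheme != "quilt":
        raise ValueError(f"not a quilt:// URL: {url!r}")
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) < 2:
        raise ValueError(f"expected /<user>/<name> in {url!r}")
    user, package, *rest = segments
    # Anything after the package name is a path to a file inside it.
    return QuiltRef(parts.netloc, user, package, "/".join(rest) or None)


print(parse_quilt_url("quilt://quiltdata.com/saulshanabrook/mypackage/"))
print(parse_quilt_url("quilt://quiltdata.com/saulshanabrook/mypackage/my-file.csv"))
```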

We need a way for a dataset provider to offer lazy/searchable/paged access to sub datasets. For example, take a Quilt package that contains 1 million files. We don't wanna show all of those datasets in the explorer, so we need to talk about the UX for this. It could have pagination, or a search feature.
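
A minimal sketch of what paged, searchable access could look like on the provider side, assuming the provider can hand back its sub datasets as a lazy iterable. The page function and its parameters are hypothetical, and a real provider would likely push the search down to the backend instead of scanning.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def page(keys: Iterable[str], query: str = "", offset: int = 0,
         limit: int = 50) -> List[str]:
    """Return one page of keys containing `query`, consuming lazily."""
    matches: Iterator[str] = (k for k in keys if query in k)
    # islice stops pulling from the source once the page is full.
    return list(islice(matches, offset, offset + limit))


def many_files() -> Iterator[str]:
    """Stand-in for a package with 1 million files; nothing is stored."""
    return (f"part-{i:07d}.csv" for i in range(1_000_000))


first = page(many_files(), limit=3)
hits = page(many_files(), query="999", limit=2)
print(first)
print(hits)
```

The explorer would then only ever ask for one page at a time, so the cost of listing stays proportional to what is shown rather than to the size of the package.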

saulshanabrook mentioned this issue Apr 15, 2019

Nested Datasets / Data providers #11

Closed

saulshanabrook added this to the Future milestone Jun 12, 2019

saulshanabrook added the @jupyterlab/dataregistry-extension label Jul 27, 2019

tgeorgeux mentioned this issue Oct 17, 2019

Heuristic Evaluation #104

Closed

fcollonval closed this as completed Aug 8, 2023
