Proposal: JupyterLab Data Registry #5548
This is a proposal for the creation of a data bus for JupyterLab. I have talked to a few folks about these ideas, but wanted to put them down in concrete form to open up the discussion and work.
Different JupyterLab extensions are being created to work with datasets of different types in different ways:
Right now, these different extensions have to create a lot of custom glue to move data between each other. This is painful for developers and artificially limits the ways that data can be used. Once a dataset has been used in JupyterLab in any way, all other extensions that can work with that dataset should immediately become aware of it.
Additionally, datasets don't exist in a vacuum. They are surrounded by rich metadata: associated notebooks, code snippets, links to publications and to people who have used the dataset, etc. Furthermore, users collaborating with others will need collaboration facilities, including the ability to create new metadata through comments and annotations on the dataset. Right now, there is no place to store this metadata or associate it with a dataset in JupyterLab.
Finally, there is a wide range of datasets that cannot be fully downloaded to JupyterLab. This includes large files (CSV, HDF5, etc.), SQL databases, streaming datasets, video, and other API endpoints for working with data (such as S3).
We propose the creation of a Data Bus that enables datasets to become a first class entity in JupyterLab. This Data Bus will have the following characteristics:
@ian-r-rose has started some explorations here:
We also have funding from the Schmidt Foundation @ Cal Poly and are getting that set up to fund other core JupyterLab folks to help out.
I should have mentioned arrow. Some of the ideas for the Data Bus have come from discussions with various folks (including @wesm ) about Arrow. It probably makes sense for us to begin to detail the different data types we would want to support. However, I think we may want to distinguish between input formats and the actual data type formats. For example, a number of different input data formats may provide tabular data, but we may only want to have a single tabular data MIME type in the Data Bus.
We may also want to have streaming, in-memory, and dynamically loading variants of some of these (certainly for the tabular and text-file-based ones).
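As a purely hypothetical sketch of the input-format vs. canonical-type distinction (none of these MIME identifiers or APIs exist; they are invented for illustration), a provider wrapping a CSV file might register it under a single tabular MIME type that every tabular consumer subscribes to:

```typescript
// Hypothetical identifiers only; nothing here is an existing JupyterLab API.
const TABULAR_MIME = 'application/vnd.jupyter.dataset.tabular+json';

interface DatasetRecord {
  mimeType: string;              // canonical type exposed on the bus
  sourceFormat: string;          // original input format (csv, parquet, ...)
  resolve(): Promise<unknown>;   // lazily produce the normalized data
}

// A CSV provider registers the *canonical* tabular type, so consumers can
// discover the dataset without caring about the original input format.
function registerCsvFile(path: string, bus: { publish(d: DatasetRecord): void }): void {
  bus.publish({
    mimeType: TABULAR_MIME,
    sourceFormat: 'text/csv',
    resolve: async () => path   // placeholder: parsing/normalization would go here
  });
}
```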
For structured data interchange (anything tabular or JSON-like) I strongly encourage you to consider using the Arrow columnar binary protocol as one of your main mediums. Having worked on this problem space for several years now, getting all the details on this right is devilishly difficult and creating libraries that faithfully implement the same protocol is very time consuming.
I would honestly put Arrow in a different category than the other things you listed. It is much more a protocol for interchange than a storage format, and so distinct from Parquet or other binary formats used for storage.
You might want to take a look at the Flight RPC framework we are developing, which uses gRPC under the hood.
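As a rough sketch of what the consuming side could look like in a JupyterLab frontend (this assumes the present-day apache-arrow JavaScript package and its tableFromIPC helper, which is not part of the proposal itself):

```typescript
import { tableFromIPC } from 'apache-arrow';

// Fetch an Arrow IPC stream (e.g. produced by a kernel or a Flight endpoint)
// and materialize it in the browser without hand-rolling a parser.
async function loadArrowTable(url: string) {
  const response = await fetch(url);
  const bytes = new Uint8Array(await response.arrayBuffer());
  const table = tableFromIPC(bytes);   // decodes the columnar IPC format
  console.log(table.numRows, table.schema.fields.map(f => f.name));
  return table;
}
```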
Let us know if we can help!
In the JupyterLab meeting today, a few things came up:
@wesm - thanks, and yes, fully agree with what you are saying here.
+100 to schema.org; hadn't seen datacommons! The heavyweight player in open-source data catalogs is CKAN (AGPL): https://github.com/ckan/ckan (it runs data.gov, etc.). Another very large-scale data management system is iRODS: https://github.com/irods/irods. As with the ongoing annotation and language server efforts, we gain very little in implementing our own new formats, and potentially a great deal in supporting open (or de facto, but openly implemented) standards.
From my perspective, as long as producers and consumers have the option to embed a streaming binary protocol (i.e. some people could embed Arrow's message protocol) and do minimal writes on the server side, and zero-copy receives on the client side, then that sounds great.
So the data payload could be treated opaquely in the Data Bus, and handled by code (e.g. IO handlers in various kernels/widgets) that does not necessarily know how to deserialize or access the data. Our goal with Arrow's Flight RPC system is to enable gRPC clients or servers that don't necessarily know about Arrow columnar data, only Protocol Buffers, to still be able to handle the opaque components of the data stream (the "FlightData" message: https://github.com/apache/arrow/blob/master/format/Flight.proto#L275).
It would also be useful to have a serialization protocol-independent schema representation.
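A minimal sketch of what such a protocol-independent schema description might look like (the field and type names are invented for illustration):

```typescript
// Describes what the columns mean, independently of how the bytes are encoded.
interface FieldSchema {
  name: string;
  type: 'int' | 'float' | 'string' | 'bool' | 'timestamp' | 'binary';
  nullable?: boolean;
}

interface DatasetDescriptor {
  schema: FieldSchema[];    // logical structure of the dataset
  encoding: string;         // e.g. 'arrow-ipc', 'avro', 'csv' (opaque to the bus)
  payloadMimeType: string;  // how consumers locate the actual bytes
}
```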
It might make sense to systematically analyze types of data/data transports/data stores that are outlined above and start outlining use cases, so we can get a sense of what the end goals could look like here.
From that we can begin to understand what an "adapter" looks like here and how we can start building a mental model of the structure of the problem.
If we take the Voyager plugin as an example, it takes in either a URL or some inline data (see the Vega-Lite data docs). So now let's say we have a CSV file on disk and we want to use Voyager with that data. We could do this by getting the contents of the CSV file, parsing it in JupyterLab, and sending Voyager the inline data. Or we could send Voyager the URL itself and let it parse the CSV file.
It seems that both could have advantages. If you parse it first in JupyterLab and then send in the data, you could re-use that JSON structure if another extension wanted it, without re-parsing. However, if you send in the URL directly, you let Voyager handle the parsing, which could be more efficient depending on its implementation, or do some type inference better for the use case.
If we expand this picture to look at taking some data from a notebook and visualizing it in Voyager, we have even more possible routes. We could save the data to a JSON or CSV file and pass in those URLs. Or we could export it to Arrow on the server, load it on the client, parse it to JSON, and then feed that to Voyager in memory.
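For concreteness, the two input forms correspond to Vega-Lite data specs roughly like the following (the file path and values are made up; see the Vega-Lite data docs for the full set of options):

```typescript
// Route 1: hand Voyager/Vega-Lite a URL and let it fetch and parse the CSV.
const byUrl = {
  data: { url: '/files/iris.csv', format: { type: 'csv' } }
};

// Route 2: parse the CSV in JupyterLab first and pass inline values, which
// other extensions could then reuse without re-parsing.
const inline = {
  data: {
    values: [
      { sepal_length: 5.1, species: 'setosa' },
      { sepal_length: 6.2, species: 'virginica' }
    ]
  }
};
```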
The goal of this approach is to start with use cases, then find what technology supports each use case efficiently, then figure out how to design a system that is flexible enough to support chaining the required technology together with the right UX.
It seems to me that JDB ("Jupyter Data Bus") should be agnostic to the form of data serialization used. What you need is:
This problem of a dataset spanning multiple message frames should be part of the JDB protocol. In Arrow, for example, obtaining the complete schema, including dictionaries (for dictionary-encoded fields), may involve receiving multiple messages. In Avro, the schema (JSON) could be sent first, then sequences of records as follow-up payloads.
Dealing with un-schema'd data in production applications is painful enough / dangerous enough that Jupyter component developers will probably want to use data transports with strong schemas.
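A rough, hypothetical sketch of a serialization-agnostic, multi-frame stream that would accommodate both the Arrow and Avro cases above (all names are invented):

```typescript
// The bus only sees framing; schema and payload bodies stay opaque bytes.
type BusFrame =
  | { kind: 'begin'; datasetId: string; mimeType: string }
  | { kind: 'schema'; datasetId: string; body: Uint8Array }    // may repeat (e.g. Arrow dictionaries)
  | { kind: 'payload'; datasetId: string; body: Uint8Array }   // record batches, Avro blocks, ...
  | { kind: 'end'; datasetId: string };

// Consumers buffer frames until the dataset is complete (or process them as a stream).
function isComplete(frames: BusFrame[]): boolean {
  return frames.some(f => f.kind === 'end');
}
```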
Way out of my depth here, and not sure if this is out of scope, but I just came across tributary, a package supporting Python data streams, offering reactive, asynchronous, functional, and lazily evaluated data streams. Perhaps these could complement static data MIME types with streaming data feeds? Not sure if they're the sort of thing that could offer data access onto, and egress from, a streaming data bus type? WebRTC would be another obvious streaming type.
The same developer also seems to have had a hand in this streaming chart package — https://github.com/jpmorganchase/perspective — which might provide a possible streaming data bus consumer use case?
Adding a link to the above mentioned project:
Also, @BoPeng's multiple-kernel notebook:
It should be considered vital not to take a dependency on a particular language and its implementation, which might make Apache Avro important: https://avro.apache.org
gRPC ( https://grpc.io ) might also be worth considering as a building block.
Many thanks to @10Dev for including me in the discussion. The proposed DataBus is at the JupyterLab level and is mostly designed for extensions that consume dataframe-like data, but I suppose language kernels could make use of the DataBus later if an API is provided, and it could be expanded to support more data types. In that case any kernel could use some magics to read from and write to the bus and exchange data with the frontend and other kernels. This is brilliant!
Anyway, before DataBus becomes available, I would like to write a bit about how SoS does a similar thing to exchange data between multiple kernels in the same notebook. Basically, SoS is a super kernel that allows the use of multiple kernels in one notebook, and allows the exchange of variables among them. Using a
SoS creates an independent homonymous variable in
Under the hood, SoS defines language modules for each language (e.g. sos-r, sos-python) that "understand" the data types of the language and assist the transfer of variables directly or by way of the SoS (python3) kernel. More specifically, when
is executed from a kernel, SoS would run a piece of code (hidden from users) to save
This design is non-centric and incremental in the sense that
I can imagine that SoS can make use of DataBus to expand the data exchange capacity to frontends, and assist the data exchange among kernels, so I will be happy to assist/participate in the development of DataBus. Actually, we ourselves have tried to conceptualize a similar project for data exchange between languages (sos-dataexchange) outside of Jupyter, which could benefit from the DataBus project.
I presented the data exchange feature of SoS in my JupyterCon talk in August. You can check out the YouTube video (starting from the 7-minute mark) if you are interested.
Allow me to propose another idea we had during the brainstorm of the sos-dataexchange project.
How about implementing DataBus as a separate project?
Here is how it might work:
Without #2815, JupyterLab becomes another dead end in the inherently polyglot field of data science and AI. Perhaps only masochists enjoy polyglot, but it is here to stay for a very long time and needs to be properly addressed as soon as it can for JupyterLab to have relevance and longevity. The longer that #2815 is pushed into future milestones, the more dependencies that build up and make the eventual implementation into a formidable undertaking that could exceed resources available.
The DataBus is something that can seriously "grease the wheels" to help #2815 or become an impediment if not designed right.
Also, semantics: what we mean by "data" and "data bus" is going to drag up very different associations across different experiences and domains. A DataBus can be something like D-Bus ( https://en.wikipedia.org/wiki/D-Bus ), which was supposed to be a lightweight system like this proposal but turned into a legacy monster; or it can be a type of pipelining system like Nextflow ( https://www.nextflow.io ); or just organized in-memory RAM, as in several OSS projects: Apache Arrow ( https://arrow.apache.org ), Apache Ignite ( https://ignite.apache.org ), Apache CarbonData ( https://carbondata.apache.org ), Apache Gora ( https://gora.apache.org ), Hazelcast ( https://hazelcast.org ), Infinispan ( http://infinispan.org ).
The @BoPeng sos-dataexchange appears to lean more toward a pipeline system than an inter-kernel data send/receive system.
FWIW, I think there is a critical need for a universal data system that has some aspects of a pipeline but is more flexible like a big data software implementation of a Crossbar switch ( https://en.wikipedia.org/wiki/Crossbar_switch ) or perhaps think of it as a local embeddable in-memory Data Grid ( https://en.wikipedia.org/wiki/Data_grid )
Pipelines are too limited and linear even if you add DAGs to them.
Which is why there is a giant discontinuity from notebooks to production pipelines, with a lot of hand-crafting and often a complete change of architecture to get any reasonable performance.
I might have a failure of imagination but I can't see how any of this can get shoe-horned into JupyterLab...
For Apache Arrow, people might find this article interesting:
I believe that the concept of the DataBus was conceived without consideration of workflow systems. However, the SoS suite of tools, namely the SoS polyglot notebook and the SoS workflow engine, was designed to narrow the gap between notebooks and production pipelines, and the lack of a data exchange model for the SoS workflow system directly motivated the discussions around our sos-dataexchange project and my proposal in this thread, although we have not been able to write a single line of code for that project.
I disagree with @10Dev that implementing the DataBus as a separate project would lead to serious performance issues, but I agree that expanding the DataBus into a more comprehensive project can be unwise given the intrinsic complexity of the whole polyglot business. On the other hand, if the DataBus is to be designed to allow kernel-level access, there will inevitably be some language-specific bindings to a DataBus protocol, and sos-dataexchange can be shamelessly implemented as a standalone version of the JupyterLab DataBus protocol and its language bindings.
FWIW, my words were "might introduce serious performance issues" i.e. keep an eye on that when designing it...
But it does make me wonder. Perhaps the base platform for Notebook type systems should be Native C++ and GPU as a performant base to host everything else...
In a way that is what Apache Arrow has done with separate polyglot implementations of their platform.
If one creates a matrix of that polyglot versus Arrow and gRPC and Protobuf, there are interesting gaps...
During a developer meeting today, it was clarified that this project will focus on the JLab frontend and visualization of data, not on data processing or data exchange among kernels. The data exchange project from the SoS camp will be a separate project (which might be renamed to DataBus :-).
https://github.com/jpmorganchase/perspective was mentioned during the discussion.
I was looking at a couple of extensions today that I could see subscribing to / consuming / producing data objects:
- ipypivot ( https://github.com/PierreMarion23/ipypivot ), a wrapper for a pivot-table widget that allows direct manipulation and reshaping of a pandas dataframe;
- dual canvas ( https://github.com/AaronWatters/jp_doodle/blob/master/notebooks/workshop/0%20-%20Outline.ipynb ), which allows manipulations on one canvas element to be saved as a snapshot onto a paired canvas element and made available as an image from that snapshot.
What struck me about the ipypivot widget in particular was that it could be used to carry out transformations to the contents of a dataframe within the widget and then return an appropriately reshaped dataframe to the notebook kernel namespace. What concerns me about that is the loss of reproducibility: if I directly manipulate a data object, how do I replay that? (I think there is a workaround with ipypivot: set up the pivot table to perform the transformation you want, then play the data transformation through that.)
What would be nicer would be if the ipypivot widget were to export a set of pandas statements that implement the transformation applied by the pivot table. A user could then visually and directly engage with a dataframe in a pivot-table scratchpad and export the corresponding code, capable of effecting the same transformation, back into the notebook (i.e. the widget would act as a code generator rather than an object transformer).
But how would that work in a databus sense? I can imagine how a pivot table could subscribe to a dataframe object, and then return an updated dataframe object, transformed through direct manipulation, back onto the databus. But could/should it also (or instead?) be able to pass back a set of programme statements that effect the same change?
I.e. rather than subscribe to df and return df', could it subscribe to df and return a set of commands implementing f(df) such that f(df) = df'?
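One way to picture that choice in data-bus terms, with invented types rather than any actual JupyterLab API:

```typescript
// What a pivot-table consumer could hand back to the bus: either the
// transformed dataset itself, or a code artifact a kernel can replay.
type PivotResult =
  | { kind: 'dataset'; mimeType: string; resolve(): Promise<unknown> }  // return df'
  | { kind: 'code'; language: string; source: string };                 // return f, where f(df) = df'
```

A reproducibility-minded widget would publish the code variant and let the notebook replay it against the original dataframe; a purely interactive consumer might prefer the dataset variant.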
There is strong interest from various people in having code-snippets attached to data sets in the data bus. The way we have been thinking about it is to put things like that into the metadata.…
If you step back and listen to what you are saying, "metadata" and "subscribe to data", you are talking about an Event System of some sort. That might actually be a more appropriate starting point, where whatever is in your mind when you think of "DataBus" instead becomes a transport negotiation in an Event Manager.
But Apache Arrow appears to be the opposite of an abstraction layer, since each language/runtime (#2815) accesses the identical data representation without deserialization/conversion/copy etc., and then there is data in GPU space to consider as well.
"The Plasma store can assist with developing applications involving multiple processes that need to share data, which may reside in CPU or GPU memory. Computational processes live separately from the Plasma store, a third party daemon. The processes are able to access data managed by Plasma through zero-copy shared memory access, and so by employing the Arrow columnar format to encode structural information, can describe complex datasets and make them available with minimal serialization overhead. We wish to provide strong support for managing datasets used by multiple processes living on the CPU or GPU." https://ursalabs.org/tech/
In theory, "out of the box" Arrow would support a common data representation for #2815 kernels in:
Other projects can implement Arrow:
Looking at a Pub/Sub model seems unwieldy the instant you take the Polyglot into account. Arrow is a new accomplishment in Xplat efficiency I think.
FWIW, some Pub/Sub thingies to assist design imagination:
I have started to work on an initial implementation of the data bus. In that process I am starting to run into some challenging design questions around metadata. I will try to summarize those here:
First, many dataset providers in JupyterLab will not have a persistent handle on the datasets. An example is a dataframe that comes from a notebook.
Second, in these situations it is also very likely that the dataset will not come with any metadata. In other words, a lot of the primitive dataset formats we are interested in do not have any built-in metadata capabilities.
Third, if a MIME type is tied to a (dataset, metadata) pair, it becomes very difficult to have a given dataset format with different metadata schemas attached to it by different providers. For example, I may be registering CSV files with the data bus using a very simple metadata schema, while someone else may be registering CSV files with a much more complex metadata schema. In the current design those would be treated as two different MIME types.
I see two ways out of this dilemma. One, we could attempt to design a universal metadata schema that would apply to all MIME types. With some of the work that other organizations have done on dataset schemas this might be possible; at the same time, the promise of a "universal metadata schema" may well end in failure. Two, we could have different MIME type identifiers for the data and the metadata and then have a provider register a pair of those. Then consumers could work with the dataset even if they don't understand the metadata.
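A small sketch of option two, with made-up MIME identifiers and a made-up registration call, just to show the pairing:

```typescript
// The data and its metadata are registered under separate MIME types,
// linked by a shared dataset id.
interface Registration {
  datasetId: string;
  dataMimeType: string;        // e.g. 'text/csv'
  metadataMimeType?: string;   // e.g. 'application/vnd.myorg.csv-metadata+json'
}

function registerWithMetadata(bus: { add(r: Registration): void }, id: string): void {
  bus.add({
    datasetId: id,
    dataMimeType: 'text/csv',
    metadataMimeType: 'application/vnd.myorg.csv-metadata+json'
  });
  // A consumer that only understands 'text/csv' simply ignores the metadata
  // entry; a richer consumer can look the metadata up via the shared datasetId.
}
```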
Some things that are surfacing in ongoing discussions:
@psychemedia I'm hoping to add reflection support to https://github.com/timkpaine/perspective-python so that you can configure your pivots and stuff in the front end and get the corresponding Python code. We're also going to enable editing so you can modify stuff and reflect it back on the underlying dataframe/arrow/list/dict, similar to ipysheet (but with the added benefits of pivoting and streaming).
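As a toy illustration of that reflection idea (all names are invented; perspective's actual API may differ), a front-end pivot configuration could be mapped to an equivalent pandas statement:

```typescript
// Turn a front-end pivot configuration into a pandas snippet the user can
// paste back into (or have inserted into) the notebook.
interface PivotConfig {
  index: string[];
  columns: string[];
  values: string;
  aggfunc: string;
}

function toPandas(cfg: PivotConfig, frame = 'df'): string {
  const quote = (xs: string[]) => '[' + xs.map(x => `'${x}'`).join(', ') + ']';
  return `${frame}.pivot_table(index=${quote(cfg.index)}, ` +
         `columns=${quote(cfg.columns)}, values='${cfg.values}', aggfunc='${cfg.aggfunc}')`;
}

// toPandas({ index: ['region'], columns: ['year'], values: 'sales', aggfunc: 'sum' })
// => "df.pivot_table(index=['region'], columns=['year'], values='sales', aggfunc='sum')"
```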
@timkpaine Ah, that's interesting (I'd posted a similar issue some time ago on another pivottable widget here). It seems that patterns for communicating back from widgets into code are starting to surface, which is hugely useful I think. Supporting code generation from Python to HTML made it easier to create HTML pages (e.g. using things like ipyleaflet), and being able to use browser interactions to send code back to the notebook makes for a whole new class of UIs / interactions. See also things like this for getting data out of an Altair widget and this for getting data out of