Packaging - the custom msg, entry points, and cached static assets solution. #116

Open
jdfreder opened this Issue May 26, 2015 · 73 comments


@jdfreder
Member

Since we haven't agreed on this yet, I'm opening an issue instead of writing an IPEP. If this is agreed on, I'll write an IPEP.

The week before last, @ellisonbg and I brainstormed about Jupyter packaging. As I remember it, the best solution we came up with requires a combination of Python packaging entry-points, a new message type, and static asset caching (in the web server). This is my understanding of how this solution would work (my notes from our meeting are at home, and my apartment is being fumigated, so I don't have access to them).

Jupyter level, kernel extensions

[Screenshot, 2015-05-26: diagram of the proposed messages]
A new message, the first (in blue), would be added, allowing the server to ask the kernel whether the static assets it knows about, associated with that kernel, are correct. The message would be a dict of static asset paths and content hashes.

The same message in the opposite direction is the kernel's response. It would be some type of data structure, maybe a binary message, containing static asset paths and their contents, and a list of the static assets that can be deleted from the cache.
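To make the blue request/response pair concrete, here is a minimal sketch of the comparison the kernel would perform. It assumes SHA-256 content hashes and plain dicts; the function names, the reply keys (`assets`, `delete`), and the asset names are all hypothetical, since the proposal does not fix a wire format.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Hash asset contents; SHA-256 is an arbitrary choice for this sketch."""
    return hashlib.sha256(data).hexdigest()


def build_sync_reply(server_known: dict, kernel_assets: dict) -> dict:
    """Compare the server's {name: hash} dict with the kernel's {name: bytes}.

    Returns the assets the server is missing or holds stale copies of,
    plus the names it can drop from its cache.
    """
    send = {
        name: data
        for name, data in kernel_assets.items()
        if server_known.get(name) != content_hash(data)
    }
    delete = sorted(name for name in server_known if name not in kernel_assets)
    return {"assets": send, "delete": delete}


# Hypothetical asset names, for illustration only.
kernel_assets = {"widgets.js": b"define([], ...);", "style.css": b"body {}"}
server_known = {
    "widgets.js": content_hash(b"define([], ...);"),  # up to date
    "style.css": "stale-hash",                        # changed in the kernel
    "old.js": "abc123",                               # no longer provided
}
reply = build_sync_reply(server_known, kernel_assets)
```

In this example the reply carries only `style.css` and asks the server to delete `old.js`; the up-to-date `widgets.js` is not resent.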

A second new message (in red) would be added, allowing the kernel to invoke a require.js call in the front-end. This is preferred over standard display(JS) calls because the notebook contents remain unaffected.

IPython level, kernel extensions

Python entry points will be used as a registry. Two entry points will be defined:

  1. an entry point for code to run when the kernel is started.
  2. an entry point for a method that returns static assets (paths).

Jupyter level, server extensions and notebook extensions

Python entry points will be used as a registry. Three entry points will be defined:

  1. an entry point for code to run when the server is started.
  2. an entry point for a method that returns static assets (paths).
  3. an entry point for paths to be require'd when the notebook page loads.
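As a rough sketch of how entry points could act as the registry, the code below discovers providers with the standard library's `importlib.metadata` and merges the asset paths they return. The group name `jupyter.static_assets` is an assumption (the proposal does not name the groups), and the idea that each entry point resolves to a zero-argument callable returning paths is one possible reading of "a method that returns static assets (paths)".

```python
from importlib import metadata

# Hypothetical group name; the proposal does not name the entry point groups.
STATIC_ASSETS_GROUP = "jupyter.static_assets"


def discover(group: str) -> list:
    """Return the entry points registered under `group` by installed packages."""
    eps = metadata.entry_points()
    if hasattr(eps, "select"):  # Python 3.10 and later
        return list(eps.select(group=group))
    return list(eps.get(group, []))  # Python 3.8 and 3.9 return a dict


def collect_static_assets(entry_points) -> dict:
    """Call each registered provider and merge the asset paths it returns."""
    assets = {}
    for ep in entry_points:
        provider = ep.load()  # import the callable the package registered
        assets[ep.name] = list(provider())
    return assets
```

A package would opt in from its own packaging metadata, for example by declaring an entry point such as `mypkg = mypkg.assets:static_paths` under that group (both names made up here), and the kernel or server would call `collect_static_assets(discover(STATIC_ASSETS_GROUP))` at startup.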

EDIT
To help the discussion, issues and specific cases are listed here: https://jupyter.hackpad.com/Packaging-PbIgxnC71or

@rgbkrk
Member
rgbkrk commented May 26, 2015

A new message, the first, (in blue) would be added, allowing the server [frontend] to ask the kernel if the static assets it knows about, associated with that kernel, is correct. The message would be a dict of static asset path and contents hashes.

The same message in the opposite direction is the kernel's response. It would be some type of data structure, maybe a binary message, containing static asset paths and their contents, and a list of the static assets that can be deleted from the cache.

I certainly like this approach. It makes sure that assets are based on the kernel runtime rather than associated with the overall notebook server (or other frontend).

Can path be remote or local, depending on the author's implementation?

@Carreau
Member
Carreau commented May 26, 2015

It breaks the assumption that the kernel does not know it is in a notebook/js environment, and makes it complicated to map kernel-path to server-path to frontend-path.

The Python packaging registry is not language agnostic. It forces each kernel to reimplement a static web server, with our server acting only as a proxy.

I can see a problem with identical paths in many kernels. Once one is cached, it shadows other kernels' resources. Or you install a new version and restart your kernel, and you get the cached versions.

Kernel authors will never bother to implement delete messages.

@rgbkrk
Member
rgbkrk commented May 26, 2015

It breaks the assumption that the kernel does not know it is in a notebook/js environment, and makes it complicated to map kernel-path to server-path to frontend-path.

If you include the require bits, I'd say that's true. However, treating this as a resource query and response relative to the kernel does not make it coupled to the notebook. We'd want this for any other HTML based frontends, including Hydrogen.

I'm not in agreement about this using a Python packaging registry, as I think resources should be installed per kernel.

Kernel authors will never bother to implement delete messages.

Don't you think that would affect their users negatively enough that eventually they would?

@minrk
Member
minrk commented May 26, 2015

The kernel knowing about static assets and telling the server seems problematic. I think if the kernel is being asked about the assets, it should be responsible for serving them, as well.

@Carreau
Member
Carreau commented May 26, 2015

Don't you think that would affect their users negatively enough that eventually they would?

No, they won't. They are developers; it works for them if they restart the server, which they do every 10 minutes.

We'd want this for any other HTML based frontends, including Hydrogen.

Nothing tells you that resources will be the same for Hydrogen and the notebook.

@minrk
Member
minrk commented May 26, 2015

I can see a problem with identical-path in many kernels.

I don't expect this to be a problem. Any resource fetched from a kernel should necessarily be served from a kernel-specific path. So when kernel K is asked for resource R, the server maps it to /K/R, not /R, so kernels are not capable of collision with each other.

I do think if we are going as far as making the Kernels responsible for static resources via messages, the most logical way to do that is to proxy requests to the Kernels themselves, and expect Kernels to run an HTTP server to serve the files. HTTP already has all the features we are describing here, I think.

@minrk
Member
minrk commented May 26, 2015

A second new message (in red), would be added that would allow the kernel to invoke a require.js call in the front-end. This would eliminate the need for a notebook extensions list, and its need to be configured.

This statement isn't true. nbextensions aren't limited to kernel-specific behavior. toc, slideshow, nbgrader, etc. would all not be addressed by the proposal, and continue to require nbextensions as it is.

@jdfreder
Member

Hey guys, glad we are talking about this. Here are my responses.

@rgbkrk

Can path be remote or local, depending on the author's implementation?

Sorry! I really should have clarified, "path" here means "unique name". It can be whatever string the package author wants!

@Carreau

It forces each kernel to reimplement a static web server, with our server acting only as a proxy.

The only piece of the above that the kernel authors need to implement is the single message, in blue.

It's up to kernel authors to choose a mechanism equivalent to Python's entry points, or something that can be used as an alternative.

It breaks the assumption that the kernel does not know it is in a notebook/js environment,

No.
Webserver says "hey these are assets I know about"
Kernel says "these are assets you are missing, and while you're at it delete these others"
Kernel says "load this asset" (which doesn't have to be JS)
Webserver says to client "load this asset" (which doesn't have to be JS)

I can see a problem with identical paths in many kernels. Once one is cached, it shadows other kernels' resources. Or you install a new version and restart your kernel, and you get the cached versions.

You missed the part where I mentioned caches are associated to specific kernels, by id.

Kernel authors will never bother to implement delete messages.

That means their kernels aren't up to spec.

@minrk

it should be responsible for serving them, as well.

But then if the kernel hangs, or is thinking, the assets are unavailable.

@jdfreder
Member

This statement isn't true. nbextensions aren't limited to kernel-specific behavior. toc, slideshow, nbgrader, etc. would all not be addressed by the proposal, and continue to require nbextensions as it is.

Thanks for catching that! I'll edit my post.

@rgbkrk
Member
rgbkrk commented May 26, 2015

I think if the kernel is being asked about the assets, it should be responsible for serving them, as well.

That's fair. Wait... How many ports are we talking then? That doesn't seem tractable unless those are proxied to the main notebook server.

@minrk
Member
minrk commented May 26, 2015

But then if the kernel hangs, or is thinking, the assets are unavailable.

That's true, but how else are you going to get the resources from the kernel to the notebook server? It sounds like you have to either:

  1. assume shared filesystem, and make it impossible for kernels to be isolated or remote
  2. reimplement http over zmq, and fetch from the kernel anyway
@jdfreder
Member

Nothing tells you that resources will be the same for Hydrogen and the notebook.

I don't think we'd need to differentiate. The same way the rich display system works, if a front-end can't load an asset, it won't.

@Carreau
Member
Carreau commented May 26, 2015

I mean the JS could be different in the notebook than in Hydrogen, or Rodeo, or Thebe. Do you introduce a mimetype per frontend?

@rgbkrk
Member
rgbkrk commented May 26, 2015

@minrk

assume shared filesystem, and make it impossible for kernels to be isolated or remote

I'm certainly going to reject that one. Doesn't work right for thebe or any other remote context.

reimplement http over zmq, and fetch from the kernel anyway

At first I thought you were joking, then I assumed someone implemented that. Like this? https://github.com/fanout/zurl

My thinking was that resources can be local paths or fully qualified URLs.

@minrk
Member
minrk commented May 26, 2015

How many ports are we talking then?

One. The notebook server would proxy requests like /kernel/:kernel_name/static/... to kernel_name.

There's also a question of whether these should be per kernel name or per kernel id. If it's per id, it's going to mean roughly 0 cache hits as every kernel instance would get its own URL.
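A small sketch of both points: parsing the proxied route, and why keying the cache by kernel id gives no hits across instances. The route regex follows the URL shape written above, which is not settled; the helper names and the sample kernel ids are hypothetical.

```python
import re

# Route shape taken from the comment above; the exact URL layout is not settled.
STATIC_ROUTE = re.compile(r"^/kernel/(?P<kernel_name>[^/]+)/static/(?P<resource>.+)$")


def parse_static_request(path: str):
    """Split a proxied request path into (kernel_name, resource), or None."""
    match = STATIC_ROUTE.match(path)
    if match is None:
        return None
    return match.group("kernel_name"), match.group("resource")


def count_cache_hits(requests, key_by: str) -> int:
    """Count cache hits when caching by kernel 'name' or by kernel 'id'."""
    seen, hits = set(), 0
    for kernel_name, kernel_id, resource in requests:
        owner = kernel_name if key_by == "name" else kernel_id
        key = (owner, resource)
        if key in seen:
            hits += 1
        seen.add(key)
    return hits


# Three separate instances of the same kernel spec request the same file.
requests = [("python3", f"id-{i}", "widgets.js") for i in range(3)]
```

With these three requests, `count_cache_hits(requests, "name")` finds two hits, while `count_cache_hits(requests, "id")` finds none, because every instance gets its own URL.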

@jdfreder
Member

That's true, but how else are you going to get the resources from the kernel to the notebook server? It sounds like you have to either:

I may not understand, but this is what the cache is for. The webserver would ask the kernel about the assets once the kernel is started, and wouldn't need to later.

@minrk
Member
minrk commented May 26, 2015

My thinking was that resources can be local paths or fully qualified URLs.

That is forcing knowledge of the notebook server onto the kernels. Do we really want to do that? I assumed not.

@jdfreder
Member

My thinking was that resources can be local paths or fully qualified URLs.

Yes

@minrk
Member
minrk commented May 26, 2015

I may not understand, but this is what the cache is for.

Cache only helps mitigate future requests, it still needs to get them from the kernel in the first place.

The webserver would ask the kernel about the assets once the kernel is started, and wouldn't need to later.

So all resources are known ahead of time, and no new resources are requested during the lifetime of the kernel?

@takluyver
Member

Webserver says to client "load this asset" (which doesn't have to be JS)

This feels like the wrong way round to do things. The webserver shouldn't be telling the client what to load, the client should be asking the server for the things it determines it needs. Like the way widget display messages can include a require path for a module to load the view from. There are established mechanisms for caching to avoid loading the same thing twice.

@jdfreder
Member

So all resources are known ahead of time, and no new resources are requested during the lifetime of the kernel?

Yes, that was our thinking. It's totally possible we overlooked a use case where that was incorrect.

Also, you could re-request assets on kernel restart (not just first start).

@minrk
Member
minrk commented May 26, 2015

I'm struggling to see what problems this solves. If we are assuming the kernel knows everything about the server's filesystem in order to tell the server where everything is, then what's the advantage of the kernel managing resources at all, if it can only manage them in a way that the server can understand and access?

@jdfreder
Member

Like the way widget display messages can include a require path for a module to load the view from. There are established mechanisms for caching to avoid loading the same thing twice.

The widget display message does exactly that, "hey load this"

@minrk
Member
minrk commented May 26, 2015

Does this mechanism provide any benefit over a /kernels/:kernel_name/static directory?

@takluyver
Member

The widget display message does exactly that, "hey load this"

Possibly I misunderstood. It sounds like in your proposal, the server is just telling the frontend to load something, as a separate message from anything that might actually use it. The widget display messages say 'create this class, loading it from X if you need to'. Crucially, loading the resource is tightly tied to using it, which makes it easy to avoid the race conditions where something would try to use the resource just before it was loaded.

@jdfreder
Member

Does this mechanism provide any benefit over a /kernels/:kernel_name/static directory?

If the client, webserver, and kernel exist on three different machines, it does.

Also, the /kernels/:kernel_name/static directory still has the problem of installation being a two step process (yes this is a problem). This is where the kernel being in control of the asset locating offers a large benefit. Package writers can use methods native to their language for packaging static assets, for IPython & Python this is entry points.

@jdfreder
Member

Crucially, loading the resource is tightly tied to using it, which makes it easy to avoid the race conditions where something would try to use the resource just before it was loaded.

That's a good point about the backend not being aware of when the resource is loaded. Unfortunately, this problem already exists in our current architecture. A solution would be to make the red message request/response, so in the kernel the API could be implemented using an asynchronous design pattern.

@minrk
Member
minrk commented May 26, 2015

If the client, webserver, and kernel exist on three different machines, it does.

How? I don't see a mechanism for getting the files from the kernel to the webserver, only communicating paths, which require the filesystem to be the same.

the /kernels/:kernel_name/static directory still has the problem of installation being a two step process (yes this is a problem).

It also doesn't solve that problem, it just punts it to the kernel. How does the package communicate this information to the kernel, such that the kernel knows at startup, before any imports, what resources are available?

@minrk
Member
minrk commented May 26, 2015

If we use setuptools entrypoints for this, and communicate files from the kernel to the server at startup and only at startup, this means potentially 100s of MB of file transfer on every kernel startup to the web server. e.g. if a kernel plugin makes MathJax available, there's no mechanism to make the pieces available on request, which proxying http would do, instead it requires all possible resources to be moved at once to the server on every kernel start.

@takluyver
Member

A solution would be to make the red message request/response, so in the kernel the API could be implemented using an asynchronous design pattern.

The bit about request/response makes sense to me, but I'm not sure what you mean about using async patterns in the kernel. I was thinking about race conditions in the frontend: if 'load this resource' and 'do something that needs that resource' are two separate messages, the 'do something' message can arrive before loading has finished, and then things get tricky. If the frontend requests (with caching) the resources as it needs them, you avoid this problem.
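The frontend-driven alternative can be sketched with `asyncio` standing in for the browser: the handler for a display message awaits the resource it needs, and concurrent handlers share one in-flight fetch. The `Frontend` class, `slow_fetch`, and the resource name are hypothetical; a real frontend would do this in JavaScript with require.js and the browser's HTTP cache.

```python
import asyncio


class Frontend:
    """Toy frontend that fetches a resource the first time something needs it."""

    def __init__(self, fetch):
        self._fetch = fetch      # coroutine function: name -> contents
        self._loading = {}       # name -> Task, shared by concurrent users
        self.fetch_count = 0

    async def require(self, name):
        if name not in self._loading:
            self.fetch_count += 1
            self._loading[name] = asyncio.ensure_future(self._fetch(name))
        return await self._loading[name]

    async def handle_display(self, name):
        # Loading is tied to use: the message cannot run ahead of its resource.
        module = await self.require(name)
        return f"rendered with {module}"


async def slow_fetch(name):
    await asyncio.sleep(0.01)  # stand-in for an HTTP request
    return f"<{name}>"


async def demo():
    frontend = Frontend(slow_fetch)
    results = await asyncio.gather(
        frontend.handle_display("widgets.js"),
        frontend.handle_display("widgets.js"),
    )
    return results, frontend.fetch_count
```

Running `asyncio.run(demo())` renders both messages after a single fetch, with no separate 'load this resource' message to race against.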

@minrk
Member
minrk commented May 26, 2015

Even if the caching works well, you will have to hash every resource at startup to validate the cache, rather than at request time. To get a sense of what order of magnitude this might have, try:

time find notebook/static/components -type f -exec md5 "{}" > /dev/null \;

in the notebook repo

@jdfreder
Member

I don't see a mechanism for getting the files from the kernel to the webserver, only communicating paths, which require the filesystem to be the same.

This is why I apologized in my first response: "path" really should be "name". What's being communicated in the message to the kernel are "names" and hashes of the corresponding contents. The other direction carries "names" and actual file contents. How the file contents make their way over the line is up for discussion, but I was thinking binary messages of some sort.

It also doesn't solve that problem, it just punts it to the kernel. How does the package communicate this information to the kernel, such that the kernel knows at startup, before any imports, what resources are available?

Punting the problem to the kernel is the whole point. Python has a mechanism for this, entry points, which means it's solved for IPython and Jupyter which is all I'm concerned about. The generic messages allow other kernel authors to solve the problem how they want. i.e. IJulia will have to implement their own registry, but as long as they implement the messages, they can do it however they want.

If we use setuptools entrypoints for this, and communicate files from the kernel to the server at startup and only at startup, this means potentially 100s of MB of file transfer on every kernel startup to the web server. e.g. if a kernel plugin makes MathJax available, there's no mechanism to make the pieces available on request, which proxying http would do, instead it requires all possible resources to be moved at once to the server on every kernel start.

The caches stored in the web server would be persisted to the disk. On request, if a resource doesn't exist because blue message #2 hasn't been received yet, the request will be deferred until that message has been received. Once the message is received, if the content still doesn't exist, 404, otherwise respond with the contents.
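A minimal sketch of that deferral, using a `threading.Event` to hold requests until the kernel's reply has arrived. It keeps the cache in memory rather than on disk, and the class name, method names, and the 504 status on timeout are assumptions of this sketch, not part of the proposal.

```python
import threading


class DeferredAssetCache:
    """Server-side cache whose reads wait until the kernel's reply has arrived."""

    def __init__(self):
        self._assets = {}
        self._populated = threading.Event()

    def receive_kernel_reply(self, assets: dict) -> None:
        """Store the assets from the kernel's reply and release waiting requests."""
        self._assets.update(assets)
        self._populated.set()

    def handle_request(self, name: str, timeout: float = 5.0):
        """Return (status, body); block until the reply arrives or `timeout` passes."""
        if not self._populated.wait(timeout):
            return 504, b""  # assumption: report a gateway timeout
        if name not in self._assets:
            return 404, b""  # reply received, but the asset is not in it
        return 200, self._assets[name]
```

A request that arrives before the reply simply blocks in `handle_request`; once `receive_kernel_reply` runs, it resolves to either the contents or a 404.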

@jdfreder
Member

if 'load this resource' and 'do something that needs that resource' are two separate messages, the 'do something' message can arrive before loading has finished, and then things get tricky.

The 'do something' message, like the 'load this resource' message, comes from the kernel. Hence, if the message is request/response, the 'load this resource' function in the kernel would return a deferred (or something similar, whatever is best for the language) which, once resolved, would send the 'do something' message.

@jdfreder
Member

you will have to hash every resource at startup to validate the cache,

Yes, that could be a problem for the kernel. hmmm. I hope I don't sound ridiculous saying this, but you could cache the hashes in the kernel by the file name and timestamp...?

@minrk
Member
minrk commented May 26, 2015

@jdfreder I'm not sure what the initial kernel->server publish accomplishes. Why not load on first request for a given resource from the server, and cache that? It wouldn't have the unbounded cost at startup. It would be possible for the kernel to be slow on the first request of a particular resource if the kernel is busy, but I'm not sure that's worse than being slow on every startup.

@rgbkrk
Member
rgbkrk commented May 26, 2015
~/code/jupyter/notebook$ time find notebook/static/components -type f -exec md5 "{}" > /dev/null \;

real    0m51.753s
user    0m19.012s
sys     0m27.365s
@Carreau
Member
Carreau commented May 26, 2015

Kyle's machine is faster than mine:

$ time find notebook/static/components -type f -exec md5 "{}" > /dev/null \;

real    1m22.300s
user    0m27.233s
sys     0m51.760s
@minrk
Member
minrk commented May 26, 2015

I hope I don't sound ridiculous saying this, but you could cache the hashes in the kernel by the file name and timestamp.

Not ridiculous, we probably should do that if we require publishing all resources at kernel start time. But now we're caching our cache, so we can cache while we cache :)
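For reference, caching the hashes by file name and timestamp could look roughly like the sketch below. It uses SHA-256 and also checks file size alongside the modification time; both choices, and the `HashCache` name, are assumptions of this example.

```python
import hashlib
import os


class HashCache:
    """Remember each file's hash until its modification time or size changes."""

    def __init__(self):
        self._entries = {}   # path -> ((mtime_ns, size), hex digest)
        self.computed = 0    # how many times a file was actually hashed

    def hash_file(self, path: str) -> str:
        stat = os.stat(path)
        stamp = (stat.st_mtime_ns, stat.st_size)
        cached = self._entries.get(path)
        if cached is not None and cached[0] == stamp:
            return cached[1]  # unchanged since last time: skip reading the file
        with open(path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        self.computed += 1
        self._entries[path] = (stamp, digest)
        return digest
```

This replaces a full read of every file with one `stat` call per file on later startups, though the first startup still pays the full hashing cost.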

@Carreau
Member
Carreau commented May 26, 2015

I also feel that this thread is a "let's abstract things in a way that will allow us to get an abstraction to abstract what we need to be abstracted to solve it."

@takluyver
Member

The 'do something' message, like the 'load this resource' message, comes from the kernel. Hence, if the message is request/response, the 'load this resource' function in the kernel would return a deferred (or something similar, whatever is best for the language) which, once resolved, would send the 'do something' message.

But to implement that, you need the frontend to send a receipt right back to the kernel to acknowledge that the resource has been received. It's not enough for the kernel to know that it sent the resource, it has to wait until the frontend has received it. And then it needs to think about what to do if it doesn't get that receipt within a timeout, and so on.

It really seems like it would be much simpler to have the frontend request resources when it's trying to do something that requires them. That's already the way HTTP+HTML works anyway, so it should be easier to implement.
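For comparison, the receipt-and-timeout handshake discussed in this exchange might look like the following on the kernel side. This is only a sketch using `asyncio`; the `LoadRequester` class, the `load_resource` message type, and the timeout values are hypothetical.

```python
import asyncio


class LoadRequester:
    """Kernel-side helper: send 'load this resource', then wait for a receipt."""

    def __init__(self, send):
        self._send = send     # callable that delivers a message to the frontend
        self._pending = {}    # resource name -> Future resolved by the receipt

    async def load(self, name: str, timeout: float = 1.0) -> bool:
        """Return True once the frontend acknowledges, False on timeout."""
        receipt = asyncio.get_running_loop().create_future()
        self._pending[name] = receipt
        self._send({"msg_type": "load_resource", "name": name})
        try:
            await asyncio.wait_for(receipt, timeout)
            return True
        except asyncio.TimeoutError:
            return False
        finally:
            self._pending.pop(name, None)

    def on_receipt(self, name: str) -> None:
        """Called when the frontend reports that `name` has finished loading."""
        receipt = self._pending.get(name)
        if receipt is not None and not receipt.done():
            receipt.set_result(None)


async def demo():
    sent = []
    requester = LoadRequester(sent.append)
    # The simulated frontend acknowledges shortly after the request goes out.
    asyncio.get_running_loop().call_later(0.01, requester.on_receipt, "widgets.js")
    first = await requester.load("widgets.js")
    second = await requester.load("never.js", timeout=0.01)  # no receipt arrives
    return first, second, sent
```

Even this small version needs pending-request bookkeeping and a decision about what a timeout means, which is the extra machinery the frontend-requests-resources approach avoids.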

@minrk
Member
minrk commented May 26, 2015

We should probably loop in kernel authors, but I would guess "you can serve static resources with http" is simpler than our own http-lite.

@rgbkrk
Member
rgbkrk commented May 26, 2015

The scheme I'd want is to be able to request from the kernels, specific resources.

If I have a notebook server, I want it on /some/kernel/path, for the sake of a standalone server, JupyterHub, etc.

If I have a local app (yes, I'm talking Electron/Hydrogen/Atom), then I probably want to get the local path.

If the asset is external, loaded from a CDN, I should be able to get it in either case.

@minrk
Member
minrk commented May 26, 2015

Plus, if resources are actually kernel-spec-specific, rather than kernel-instance-specific (like kernels/:name/static, but handled by the kernel), the resources could be served by a dedicated kernel that's not actually running code for a notebook, with only one resource-serving kernel per spec.

That might be a terrible idea, though.

@jdfreder
Member

It would be possible for the kernel to be slow on the first request of a particular resource if the kernel is busy, but I'm not sure that's worse than being slow on every startup.

Yeah, this was my concern. Also, if the kernel is responsible for serving the files while the kernel is running user code which tells the front-end to load a file, you may get a deadlock (depending on how the 'load this resource' function is implemented).

I also feel that this thread is a "let's abstract things in a way that will allow us to get an abstraction to abstract what we need to be abstracted to solve it."

Needs to start somewhere 😉

you need the frontend to send a receipt right back to the kernel to acknowledge that the resource has been received

Yup.

And then it needs to think about what to do if it doesn't get that receipt within a timeout, and so on.

That's a small detail, which I'd hope the asynchronous pattern of choice could handle well.

@rgbkrk
Member
rgbkrk commented May 26, 2015

Plus, if resources are actually kernel-spec-specific, rather than kernel-instance-specific (like kernels/:name/static, but handled by the kernel), the resources could be served by a dedicated kernel that's not actually running code for a notebook, with only one resource-serving kernel per spec.

That's the kind I was thinking. Packages would install resources into that kernelspec's namespace (CSS, JS, etc.)

@Carreau
Member
Carreau commented May 26, 2015

That's a small detail, which I'd hope the asynchronous pattern of choice could handle well.

Python 3.5 only; you need to use async and await.

@takluyver
Member

I would guess "you can serve static resources with http" is simpler than our own http-lite.

Depends on how lite. For the R kernel, I don't relish the idea of trying to integrate an HTTP server event loop with the loop listening on the ZMQ sockets (which currently just uses zmq poll).

And then it needs to think about what to do if it doesn't get that receipt within a timeout, and so on.

That's a small detail, which I'd hope the asynchronous pattern of choice could handle well.

I feel like we're talking at cross purposes here. I know it could all be solved with enough async cleverness. My point is that if you don't have this message asking the frontend to load a specific resource, you avoid the whole problem entirely, and the kernel code doesn't need to do clever async stuff.

By analogy, when you request a web page, it contains references to images, JS etc. that the page requires. It doesn't try to tell the frontend to load those resources before loading the page*: the frontend parses the page and requests the resources it needs.

(* OK, so HTTP 2 will in fact do some of this as an optimisation. But I imagine that all the abstractions for servers and frontends will continue to make it look like it's the frontend requesting resources, not the backend pushing them)

@minrk
Member
minrk commented May 26, 2015

Also, if the kernel is responsible for serving the files while the kernel is running user code which tells the front-end to load a file, you may get a deadlock (depending on how the 'load this resource' function is implemented).

If I were writing this for IPython, I would run an HTTP server in a thread, so a deadlock should not be an issue, and even performance problems would be minimized unless the blocking execution is a single long-running GIL-holding call, which is rare. Even if we make this a zmq channel, I would make it a dedicated zmq channel, so it can be handled concurrently with execution without blocking, and I would still run it in a background thread in IPython, and not part of the shell-channel dispatch loop.
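A minimal sketch of the HTTP-server-in-a-thread idea, using only the standard library. The function and class names are hypothetical, and this says nothing about how the notebook server would discover the port or proxy to it.

```python
import functools
import threading
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


class QuietHandler(SimpleHTTPRequestHandler):
    def log_message(self, format, *args):
        pass  # keep request logs out of the kernel's output


def start_static_server(directory: str):
    """Serve `directory` over HTTP from a daemon thread; return (server, port)."""
    handler = functools.partial(QuietHandler, directory=directory)
    server = ThreadingHTTPServer(("127.0.0.1", 0), handler)  # port 0: pick a free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server, server.server_address[1]
```

Because the server runs in its own thread, requests can be answered while the main thread executes user code, subject to the GIL caveat mentioned above. Call `server.shutdown()` and `server.server_close()` when the kernel exits.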

@jdfreder
Member

@takluyver I understand where you are coming from, but I still don't see it as much of a problem, and there is still the more important (in my opinion) problem of the two-step install. Unless you're suggesting that my third bullet under the last section is enough: "an entry point for paths to be require'd when the notebook page loads."

@minrk
Member
minrk commented May 26, 2015

@rgbkrk how much better would it be for sidecar, etc. if we returned file://path/to/resource.js vs http://localhost:port/kernel/resource.js? I wouldn't expect it to be much.

@jdfreder
Member

@rgbkrk said over Gitter "we were a bit far in trying to discuss this
Maybe it would be worth stating the problem, the consumers of the API, etc.
User experience for those, UX for the notebook".

I think this is a good idea. https://jupyter.hackpad.com/Packaging-PbIgxnC71or

@jdfreder jdfreder referenced this issue in jdfreder/jupyter-pip May 27, 2015
Open

General issue, permissions aren't handled well. #13

@rgbkrk
Member
rgbkrk commented May 27, 2015

@minrk Loading it in either case makes no difference. We can load resources from either of file:/// or http:// just fine.

I'm going to admit that I was aiming for purity of pulling local resources when using a local app, to limit how many web services are being run. It's a bit silly of me, considering that each running kernel has 5 open ports for ZMQ sockets.

Either URL is fine.

@ellisonbg
Contributor

Hi all! I am excited that this discussion is happening. I just got back
from traveling for the week and am now sick. I will try to catch up and
provide comments...thanks for starting this @jdfreder


@jdfreder
Member

@minrk I'm starting to lean towards your idea of not having a cache, and here's an example of why (I added it to the 'specifics' in the hackpad).

py astronomy extension - an extension that contains 100s of GB of image static assets. A widget is displayed in the front-end that allows users to navigate through the images, one at a time. The webserver shouldn't duplicate all of the static assets, and unless the user views every single image, not all of the images should be sent over any network connection.

@minrk
Member
minrk commented May 27, 2015

@jdfreder you can still have a cache - it could last for the lifetime of the server or kernel or some other metric. The main piece I find problematic is attempting to populate that cache all in one go, whether it's used or not, rather than at request time, like one does with HTTP.
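Populating the cache at request time could be as simple as the sketch below. `LazyAssetCache`, the `fetch_from_kernel` callable, and the image names are all made up for illustration; eviction and cache lifetime are left out.

```python
class LazyAssetCache:
    """Fetch an asset from the kernel on first request, then reuse the copy."""

    def __init__(self, fetch_from_kernel):
        self._fetch = fetch_from_kernel   # callable: name -> bytes
        self._cache = {}
        self.bytes_transferred = 0

    def get(self, name: str) -> bytes:
        if name not in self._cache:
            data = self._fetch(name)
            self.bytes_transferred += len(data)
            self._cache[name] = data
        return self._cache[name]


# Stand-in for a large image collection held by the kernel (names are made up).
images = {f"image_{i}.png": bytes(1000) for i in range(100)}
cache = LazyAssetCache(images.__getitem__)
cache.get("image_3.png")
cache.get("image_3.png")   # served from the cache, no second transfer
cache.get("image_7.png")
```

Only the two images that were actually viewed cross the wire here (2000 bytes out of 100000), which is the behaviour the astronomy example calls for.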

@jdfreder
Member

How about the following revision:

[Screenshot, 2015-05-27: revised diagram of the proposed messages]

  • Browser handles the cache
  • Browser requests things to be loaded instead of kernel pushing them
@takluyver
Member

That seems more reasonable at a glance.

@minrk
Member
minrk commented May 28, 2015

I think that makes more sense. Is there any difference between the red and blue requests other than time? It seems like they are identical in content and structure, other than the time when they are sent.

@jdfreder
Member

@minrk they could be the same, but there would have to be a standardized name for the "asset" that is requested by the messages in red. Something like "notebook_kernel_assets", which the kernel would recognize and answer with a list of assets to be loaded on notebook page load. IPython would populate this using entry points, for example.

@minrk minrk added this to the no action milestone Sep 11, 2015
@ellisonbg ellisonbg modified the milestone: 5.0, no action Oct 6, 2015
@rgbkrk
Member
rgbkrk commented Oct 12, 2015

Can we turn this into an enhancement proposal?

/cc @parente

@jdfreder jdfreder referenced this issue in matplotlib/matplotlib Jan 21, 2016
Merged

IPython Widget #5754

@parente
Member
parente commented Mar 31, 2016

If discussion of this is starting up again, jupyter/notebook#839 is probably relevant. The proposal here solves part of the problem if all the static asset dependencies are known at notebook start and cell execution is blocked until all those assets load. If not, the kernel might emit JS that requires another dependency that is not yet loaded.

Any web assets not known at notebook load time are still a problem.

@rgbkrk rgbkrk added a commit to rgbkrk/roadmap that referenced this issue Apr 1, 2016
@rgbkrk rgbkrk State intent to solve problem of requiring assets
Libraries, and by extension kernels, have static assets that are required in frontends such as the notebook, thebe, hydrogen, and dashboards.

This issue continues to come up over and over without a total solution. At the very least I'd like for us to put it on the roadmap and acknowledge that we will solve it.

Initial (old) proposal/discussion: jupyter/notebook#116
Related: jupyter/notebook#839

As it applies to Thebe, they currently have to bundle

* all notebook js (not tied to kernel)
* ipywidgets js (tied to kernel)
* thebe itself

Each thebe release has to strictly specify the version of notebook, ipywidgets, in addition to thebe itself, and is incompatible with other versions.

Discussion from the Spring Jupyter Dev Meeting 2016: https://jupyter.hackpad.com/Spring-2016-Dev-Meeting-h0y1TIAWxz1#:h=Kernel-Static-Resources
7cc3ff5
@janschulz
Contributor

The R kernel discussed this (or a subset of this ) problem recently: IRkernel/IRdisplay#14 Basically: how should JS/css libs for visualisations be handled

The knitr/rmarkdown system in R has this problem solved: the "knit_asis" object (kind of like the display_data message) gets styling in an extra attribute knit_meta which lists dependencies (each time such an object is produced), and rmarkdown (in this case in a similar role to the notebook frontend) manages what is output in the final document. Knitr/rmarkdown has it slightly easier, as it only makes one pass over the document and you can't reevaluate the notebook, so the solution here needs to handle reevaluating cells and removing styling when no cell references it (and this needs to be in the on-disk format as well...). Also, the "producer" of the knit_asis object has information about the final output format (html/latex), so knit_meta only contains stuff for that format.

As such visualisations need to be available in html content which is derived from ipynb files (nbconvert, nbviewer, github,...), I don't think any "request from kernel" steps can be part of a solution for this problem, as the ipynb file needs to have all such dependencies available. A solution should also handle that different output formats (html, latex,...) need different dependencies (in the message format and stored in the ipynb).

But this would also make the implementation for this easier:

  • The messaging format would need an update to handle assets, maybe by using metadata.<mimetype>.dependency.xxx = [] (with xxx = js|html|latex|... -> whatever the mimetype can handle)
  • The frontend would need to implement a deduplicator to include this code only once per document and handle removals if reevaluation of cells is possible:
    • the ipynb/model would include each dependency only once (new "dependencies" section in the json -> <mimetype>.<hash>.(type, content))
    • each "cell" contains only a reference to the (hash of the) dependency
    • when new messages come in, all dependencies are moved to the dependency store and removed from the message, which is then "normally" handled. "moved to the dependency store" means, that each dependency is hashed and either newly included in the store and in the document (in a special section, not the output area!) or simply dropped.
    • periodically (or on removal/reevaluation), the dependency store is cleaned of all dependencies that are no longer used, and such dependencies are removed from the document. Or this only happens on save and the user would need to reload to clean up such libraries.

The downside is that each time such a message is sent, the whole dependency chain is sent as well :-( But this is happening right now anyway, and such dependencies are included in the ipynb file each time. This will probably encourage the use of external dependencies/URLs instead of files. Pandoc AFAIK can then include url content inline :-)
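
The deduplication steps listed above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class name `DependencyStore`, its methods, and the choice of SHA-256 over "type:content" are hypothetical, and a dependency is modelled as a simple (type, content) pair.

```python
import hashlib


class DependencyStore:
    """Hypothetical frontend-side store keeping each dependency once."""

    def __init__(self):
        self.content = {}  # hash -> (type, content), stored once per document
        self.refs = {}     # cell id -> set of dependency hashes it references

    def absorb(self, cell_id, dependencies):
        """Move a message's dependencies into the store and return their hashes.

        Calling this again for the same cell replaces its references,
        which models reevaluating that cell.
        """
        hashes = []
        for dep_type, content in dependencies:
            digest = hashlib.sha256(
                f"{dep_type}:{content}".encode("utf-8")
            ).hexdigest()
            # Newly included if unseen, otherwise simply dropped.
            self.content.setdefault(digest, (dep_type, content))
            hashes.append(digest)
        self.refs[cell_id] = set(hashes)
        return hashes

    def drop_cell(self, cell_id):
        """Forget a removed cell's references."""
        self.refs.pop(cell_id, None)

    def clean(self):
        """Remove dependencies that no cell references any more."""
        used = set().union(*self.refs.values())
        for digest in list(self.content):
            if digest not in used:
                del self.content[digest]


store = DependencyStore()
first = store.absorb("cell-1", [("js", "d3 source")])
second = store.absorb("cell-2", [("js", "d3 source"), ("css", "plot styles")])
```

Here both cells share one stored copy of the "js" dependency. Calling `store.drop_cell("cell-2")` followed by `store.clean()` would leave only that shared copy in the store.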

@janschulz janschulz referenced this issue in IRkernel/IRdisplay Apr 2, 2016
Open

Adding metadata/styling from repr output #14

@rgbkrk
Member
rgbkrk commented Apr 2, 2016

There are three (or more!) parts to write a specification for. The backend/filesystem layout, how a frontend requests them (is it direct, is it a url path in the notebook server /kernelspec/ir/static/..., is it per running kernel), as well as how it ends up in the notebook document (per cell, metadata across the notebook).

The frontend would need to implement a deduplicator to only include this code only once per document and handle removals if reevaluation of cells is possible

Definitely. It seems like the approach you outlined for knitr is great for us all to think about in terms of the notebook @janschulz. I prefer a URL based approach to asset requiring up until I'm offline (which is fairly frequent). A big draw to our current format is that it works without having to be connected to the wider internet all the time.

Since I also care about Hydrogen, Thebe, and other frontends beyond the notebook, my primary interest is in getting the backend specification done across kernels. One approach is for kernel spec directories to contain static assets:

├── kernels
│   ├── ir
│   │   ├── kernel.json
│   │   ├── logo-64x64.png
│   │   └── static

What belongs in static I'm unsure of. Let's say we operated with npm packages underneath:

├── kernels
│   ├── ir
│   │   ├── kernel.json
│   │   ├── logo-64x64.png
│   │   └── static
│   │       ├── node_modules
│   │       │   └── d3
│   │       └── package.json

While this would work well for node based frontends (hydrogen, nteract, sidecar), it would not work well on the main notebook or any other remote environment (thebe, dashboards, etc.) without also specifying how we do bundling (webpack, browserify, etc.).
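
As a rough sketch of how a server could inspect such a directory, the function below walks a kernel spec's static folder and builds the kind of path-to-content-hash dict that the original proposal at the top of this issue describes. The function name `asset_manifest` and the use of SHA-256 are assumptions for illustration, not an agreed layout or API.

```python
import hashlib
from pathlib import Path


def asset_manifest(kernel_dir):
    """Map each file under <kernel_dir>/static to the SHA-256 of its contents.

    Keys are paths relative to the static directory, using forward slashes.
    A kernel spec without a static directory yields an empty manifest.
    """
    static_dir = Path(kernel_dir) / "static"
    manifest = {}
    if not static_dir.is_dir():
        return manifest
    for path in sorted(static_dir.rglob("*")):
        if path.is_file():
            rel = path.relative_to(static_dir).as_posix()
            manifest[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return manifest
```

Comparing two such manifests (the server's cached copy and the kernel's current one) would show which assets changed or can be deleted from the cache. It does not address the bundling question raised above.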

@janschulz
Contributor

What is actually the problem here?

  • "Too big ipynb files" or "too much in RAM" because every plot includes jquery/... again -> solved by properly labeling dependencies in the over-the-wire messages and deduplicating them in the frontend
  • "too much sent over the wire" -> not solved by deduplicating in the frontend and must get a solution in the kernel or in a caching webserver if it acts as a proxy between kernel and frontend

If the latter is a problem (and one assumes that the kernelserver/webserver and the kernel are on the same host and the frontend communicates with the kernel via the webserver), then the above (labeled dependencies in the message) plus a proxy which does caching and replaces dependencies with hashes which are then loaded from the webserver could work:

-> kernel sends message with css/html labeled as dependency
-> proxy/"webserver" replaces dependency with hashes
-> frontend finds hashes -> requests content from webserver
-> webserver sends content for hash or error message
-> frontend includes hash and dependency in json and sends the complete stuff to be saved (or the frontend sends only hashes and the webserver replaces them)
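
The flow above could look roughly like this on the proxy side. It is a sketch under assumptions: the message is modelled as a dict with a hypothetical "dependencies" key holding text content, and `DependencyCache`, `strip`, and `fetch` are made-up names.

```python
import hashlib


class DependencyCache:
    """Hypothetical caching proxy between the kernel and the frontend."""

    def __init__(self):
        self._by_hash = {}  # hash -> dependency content

    def strip(self, message):
        """Return a copy of the message with dependencies replaced by hashes."""
        hashes = []
        for content in message.get("dependencies", []):
            digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
            self._by_hash[digest] = content  # cache for later requests
            hashes.append(digest)
        stripped = dict(message)
        stripped["dependencies"] = hashes
        return stripped

    def fetch(self, digest):
        """Return the content for a hash, or None when the hash is unknown."""
        return self._by_hash.get(digest)


cache = DependencyCache()
kernel_message = {"data": "<div id='plot'></div>", "dependencies": ["/* css */"]}
wire_message = cache.strip(kernel_message)

# The frontend finds hashes and requests the content from the webserver.
loaded = [cache.fetch(h) for h in wire_message["dependencies"]]
```

Returning None for an unknown hash stands in for the "error message" step; a real server would more likely answer with an HTTP error.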

Going through https://jupyter.hackpad.com/Packaging-crate-PbIgxnC71or#:h=Specific-cases, the above can be used to solve the ipywidget case (widgets would add their js/css dependencies on cell execution and these would be included when the notebook is reloaded) but not the other three (e.g. it has nothing to do with packaging extensions for the frontend or backend).

@ellisonbg
Contributor

To summarize the 3 areas we need to solve that @rgbkrk listed:

  1. The backend/filesystem layout of static assets.
  2. How a frontend requests them (is it direct, is it a url path in the notebook server /kernelspec/ir/static/..., is it per running kernel).
  3. How it ends up in the notebook document (per cell, metadata across the notebook).

On 1) my initial thought is that because deployment scenarios and frontend architectures will be so different, we shouldn't specify the filesystem layout. If we get into that, I can imagine things getting really difficult to reason about, given all of the different choices: inside/outside Docker, using conda or not, electron or server, where the kernel is running. By this, I mean that a given frontend should be able to use the information from parts 2/3 and translate that into whatever filesystem layout is needed. The other issue is how to deal with a node_modules that is effectively spread out all over the place between the main server and kernels. Do you end up with multiple deployment bundles? How do you deduplicate packages across them?

On 2) is it not sufficient to specify all the things using npm package names and versions? If not, what is missing? I am concerned about making decisions at this level that assume particular bundling tools or path conventions.

On 3) I do think it is pretty important that it is easy to track down all of the static assets for a single notebook, so those assets can be bundled in different contexts such as nbconvert/static, etc. That would seem to point to notebook level metadata, but it probably also has to be in the cells that use those assets? Maybe both? Not sure.
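
One way to picture the "maybe both" option is to let cells declare their assets and derive the notebook-level list from them. The sketch below assumes a hypothetical "assets" key in cell metadata holding npm package names and versions; nothing like it exists in the notebook format today.

```python
def collect_assets(notebook):
    """Gather cell-level asset declarations into one sorted, deduplicated list."""
    seen = set()
    for cell in notebook.get("cells", []):
        for dep in cell.get("metadata", {}).get("assets", []):
            seen.add((dep["name"], dep["version"]))
    return [{"name": name, "version": version} for name, version in sorted(seen)]


notebook = {
    "cells": [
        {"metadata": {"assets": [{"name": "d3", "version": "3.5.16"}]}},
        {"metadata": {}},
        {"metadata": {"assets": [
            {"name": "d3", "version": "3.5.16"},
            {"name": "jupyter-js-widgets", "version": "0.0.1"},
        ]}},
    ]
}
notebook_level_assets = collect_assets(notebook)
```

The package versions here are placeholders. A tool such as nbconvert could then read the derived list without scanning every cell.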

@parente
Member
parente commented Apr 4, 2016

To summarize the 3 areas we need to solve that @rgbkrk listed:

I'd say there's a 4th: how to deal with the asynchronicity of loading frontend assets with respect to kernel code execution.

For example, how do you ensure the future version of jupyter-js-widgets is done loading on the page before some @interact decorator or the equivalent JS tries to instantiate a view when a user does a Run All?

EDIT: s/emails/tries/ ... thanks a lot autocorrect!

@rgbkrk
Member
rgbkrk commented Apr 4, 2016

we don't specify the filesystem layout

We need to specify (or even just explore) the filesystem layout so that:

  • the server in (2) that is publishing assets actually knows how to load/serve them
  • package authors and kernel authors need a place to install them

If the answer continues to be nbextensions, we still run into the problem across multiple kernels.

At least for kernel gateway and the notebook, whether they exist in Docker or not, it's the same local directory structure for that kernel.

@rgbkrk
Member
rgbkrk commented Apr 4, 2016

Ok, now I recall why there was hesitance to having a filesystem layout (as outlined in this issue at the top 😉). We would make the actual kernel serve the assets.

@janschulz
Contributor

package authors and kernel authors need a place to install them

What "packages" are you talking about here (so I know if this affects the R kernel): R/Python packages which the user uses in the code cells of the notebook and which implement functions which need to send js/css dependencies? Or things like an nbextension which wants to install something so that the notebook has a new function? My above comments were only for the first case (packages which are executed in the code cells), and if this is only about the second case, then I should open a new issue here :-)

@lbustelo
lbustelo commented Apr 5, 2016

I want to echo the issue that @parente is adding to this PR and maybe the hardest to fix; the loading of dependency libraries and the right timing to render/execute cell output.

As we've been working in declarativewidgets on a way to change how the user initializes the extension on a particular notebook, we've been struggling a lot with the 'chicken or the egg' problem. We've tried many things, and along the way, it was surprising to find out that on a page refresh, cell output is rendered before extensions are fully loaded. I guess the limitation is understandable after you think about the implications, but at least for me, the expectation was that extensions would be loaded first.

Anyway, I think that to fully understand this issue, we need to think about different scenarios on the client side and the timing of execution as it relates to kernel code and client side extension/library. Here are some of the ones that I can think of.

  1. User creates a new notebook and executes a cell that requires some client side code. (when is that cell really done)
  2. User visits an existing notebook that is cleared of output but performs a Run all (inter-cell dependencies)
  3. User saves a notebook with output and refreshes the browser. (rendering of cell output in relation to dependencies being loaded)
  4. User restarts the kernel and re-runs cells (client side code is already loaded, should it be reinit)

For all the above we need to answer:

  • when can cells be executed?
  • when can cell output be rendered?
  • when is the cell output done so that the next cell can execute?
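
To make the "when can cells be executed?" question concrete, here is a small asyncio sketch in which a Run All waits until every required frontend asset has reported that it is loaded. `AssetGate`, `run_all`, and `demo` are hypothetical names, and the notebook frontend itself is JavaScript, so this only models the ordering, not an actual implementation.

```python
import asyncio


class AssetGate:
    """Hypothetical gate that opens once all required assets are loaded."""

    def __init__(self, required):
        self._events = {name: asyncio.Event() for name in required}

    def mark_loaded(self, name):
        """Record that one asset has finished loading."""
        self._events[name].set()

    async def wait_all(self):
        """Block until every required asset has been marked as loaded."""
        await asyncio.gather(*(event.wait() for event in self._events.values()))


async def run_all(cells, gate, log):
    """Execute cells in order, but only after the gate has opened."""
    await gate.wait_all()
    for cell in cells:
        log.append(f"executed {cell}")


async def demo():
    log = []
    gate = AssetGate(["jupyter-js-widgets"])
    runner = asyncio.create_task(run_all(["cell-1", "cell-2"], gate, log))
    await asyncio.sleep(0)  # give the runner a chance to start and block
    log.append("assets loaded")
    gate.mark_loaded("jupyter-js-widgets")
    await runner
    return log


result = asyncio.run(demo())
```

This covers only assets known before execution starts. Dependencies that a cell discovers at run time would need a per-cell gate, which is the harder case described above.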
@rgbkrk
Member
rgbkrk commented Apr 5, 2016

What "packages" are you talking about here (so I know if this affects the R kernel): R/Python packages which the user uses in the code cells of the notebook and which implement functions which need to send js/css dependencies?

I'm talking about any kernel; this definitely affects the R kernel. If for some reason the R kernel wants to use the frontend bits of ipywidgets yet depends on an older (or newer) version than the one the Python side installed into nbextensions, it would have problems. I'd like a way for frontend dependencies to be isolated to the environment they're running with (and to provide a means for fetching them).

@rgbkrk
Member
rgbkrk commented Apr 5, 2016

I want to echo the issue that @parente is adding to this PR and maybe the hardest to fix; the loading of dependency libraries and the right timing to render/execute cell output.

That is likely the hardest to fix as it very much dictates how a frontend gets built.

@rgbkrk
Member
rgbkrk commented Apr 5, 2016

All the scenarios you outlined @lbustelo I've run into in the current notebook in some way, or have assumed I would run into an issue. While developing a custom widget, this forced me to clear all output, restart the kernel, and hard refresh the page.

@minrk minrk modified the milestone: 5.0 Jan 13, 2017