Routing ThreadPoolExecutor results to originating parent_header/cell #9969

rgbkrk · 2016-09-29T22:47:12Z

Code run in a background thread does not necessarily match the output area with the originating parent_header.

In the case of the jupyter notebook, this means the last cell to be run ends up with stdout or display_data from that prior cell. Is there any way for a user to scope outputs in background threads to target an original cell?

My example above is trivial, but I'll also give context on asychronous background tasks that a user is likely to do:

Database queries (in my case - Hive, Presto, and SparkSQL)
External processes
HTTP Requests

I know @glyph cares about this too. 😉

The text was updated successfully, but these errors were encountered:

rgbkrk · 2016-09-29T22:48:17Z

Ever so slightly related to ipython/ipykernel#109

glyph · 2016-09-30T00:34:52Z

Background threads are nice, but background callables are what I really care about :)

glyph · 2016-09-30T00:35:08Z

Thanks @rgbkrk !

takluyver · 2016-09-30T13:04:49Z

I think to do this we would need some hook which was run when a thread was created, inside the newly created thread. We could use that to take a copy of the parent message ID at the time the thread was created, so that output was routed back to there. However, I don't know of any such hook.

With the proposal for a kernel nanny to capture stdout/stderr at a lower level, I think such low-level output would always have this issue (showing up under the last cell executed), because that output won't be sent with a parent ID. That's probably not terribly important, though.

glyph · 2016-09-30T18:51:48Z

@takluyver There's also the issue of what to do with things created by Tornado timed calls, then.

rgbkrk · 2016-10-05T20:30:51Z

I just realized I failed to include my code for this. Here's a reproduction:

import time
def print_wait(data, timeout):
    time.sleep(timeout)
    print(data)

from concurrent import futures
tp = futures.ThreadPoolExecutor(max_workers=10)

rgbkrk · 2016-10-05T20:31:22Z

I hear you on background callables - we should err on the side of message passing across boundaries.

minrk · 2016-10-10T12:27:04Z

It's a bit complicated to track the 'right' cell in general, e.g. is it:

the cell that created the thread?
the cell that 'caused' the execution?

The first is pretty easy to implement, but rarely desirable, because it would route all output to the cell that instantiated the ThreadPool. I don't know how to do 2. without baking awareness of the parent state into a ThreadPool subclass, and at that point I'm not sure now useful it is.

If we assume that we control the ThreadPoolExecutor (i.e. it's our subclass), we should be able to do it with:

make parent storage semi-thread-local (lookup from main thread by default to preserve output of non-parent-aware threads)
on submit, fix header in calling scope for the call
in wrapped function in the thread, set the thread-local header from the submitter
at end of wrapped function, unset thread-local header to restore main-thread default behavior

The last tricky bit is the status: idle message. Often frontends want to know that a given cell has finished producing output. For single-thread execution, status: idle is a clear boundary. For async output, this assumption is broken, so the frontend is left to decide what to do: it can never discard callback information, as it can never know that output for a given cell is truly complete (async output may come at an arbitrarily later time). There are a few options:

delay status: idle until threaded execution is 'done' (impossible in general)
accept that status: idle doesn't indicate end-of-output, and guarantee that Futures/Promises representing output of an execution can never resolve.

rgbkrk · 2016-10-10T15:02:16Z

The last tricky bit is the status: idle message. Often frontends want to know that a given cell has finished producing output. For single-thread execution, status: idle is a clear boundary. For async output, this assumption is broken, so the frontend is left to decide what to do: it can never discard callback information, as it can never know that output for a given cell is truly complete (async output may come at an arbitrarily later time).

As a selfish frontend developer, I'd prefer to close off all my subscriptions/listenders on an idle. Yet, as a user of asynchronous background processes, I'd like a way to route accordingly. This isn't even just for the linear notebook - it affects dashboards, general output areas, etc.

glyph · 2016-10-10T21:42:35Z

Before we figure out how to make print et. al. attribute the correct cell automatically, let's maybe just have a way to do it at all, even if it requires some involvement from the code being executed.

At the time some code is executed, it is being executed in a namespace, and some values are injected there. Dealing with print is tricky, because you implicitly need to share namespaces between one cell and the next in order to get the name-binding semantics users expect, and print always has to be called print (and sys.stdout is a global). But what if we had a function, hypothetically output_to_this_cell, which is re-bound each time a cell evalutes, so a user could do this in a cell:

# cell A
def from_this_cell(arg, *, show=output_to_this_cell):
    show(arg)

Then later

# cell B
from_this_cell("output below cell A")

Is there already a building block that can be used like this for threaded/callbacked notebook code?

rgbkrk · 2016-10-11T04:04:36Z

While it doesn't exist yet, there's a bit of a proposal to provide an update mechanism like this:

from IPython.display import display, HTML, Updatable
d = Updatable(HTML('<b>whoa</b>'))
display(d)
# ...
d.replace(HTML('<i>awesome</i>'))

minrk · 2016-10-11T09:37:53Z

@glyph there's an Output widget that uses the comms system to send output to a different location, but these are 'live widget' things, not plain outputs.

All of the routing information is really in the parent_header, so if you can temporarily set the parent, you are almost there.

The main thing I think we need to do is to simplify (dramatically) the parent-setting machinery, so that it's easier to swap it out. Right now, set_parent does a cascade of setting descendant parents, rather than store it in a nice single place. That will allow us to more easily make the changes needed for this to be a thread-specific lookup.

The second thing we need to do is to change in the frontend how output handlers are cleared in the frontend. Right now, we are clearing callbacks once status-idle has been received, and only handling async output if it appears to come from the most recently executed cell (this is what IPython's stateful storage of parent will do with background threads). We would need to move the callback-clearing to the cell level, rather than the message level, to better indicate that there are no event listeners for a given output area.

This notebook demonstrates all the hacks necessary to get it to work right now:

save the parent for the target cell somewhere
context manager for setting parent temporarily
disable status: idle for target cell (workaround frontend clearing callbacks for 'apparently finished' cells)

And even that won't work for concurrent outputs, as the state is still process-wide, so setting the parent in one thread sets it for all threads as long as that thread is in its context.

So we need to:

move output-callback-clearing to the cell level, so that it clears on next execute, rather than trying to guess when output is done
consolidate parent-setting in the kernel
make it possible for kernel parent lookup/store to use thread-local storage for concurrent parents

minrk · 2016-10-11T10:04:57Z

the frontend-side is done (in one frontend) here: jupyter/notebook#1826

rgbkrk · 2016-10-19T18:08:25Z

I made no changes to nteract for your notebook Min. I'm assuming the cell is thought of as still running with this, so the cell should show that it's still "running"?

minrk · 2016-10-21T08:27:17Z

@rgbkrk yup, the disable-status bit is telling frontends that "this cell will never be done" to ensure that they keep their handlers active for outputs from that cell. With jupyter/notebook#1826 this extra bit is unnecessary, and the cell can send its idle message as normal without clearing the output handlers.

So if we want to minimally allow retargeting to cells, the main kernel feature is a nicer API for recording and reassigning the parent. The harder part (in IPython, at least) is that the parent must be thread-local to avoid problems with concurrent output, but also fallback on the main thread storage if no thread-specific parent has been set.

rgbkrk mentioned this issue Oct 3, 2016

Provide an update_display message jupyter/jupyter_client#209

Closed

minrk mentioned this issue Oct 11, 2016

don’t clear callbacks for cell output until next execute jupyter/notebook#1826

Merged

rgbkrk mentioned this issue Oct 19, 2016

Improving Jupyter for Spark jupyter/jupyter#212

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Routing ThreadPoolExecutor results to originating parent_header/cell #9969

Routing ThreadPoolExecutor results to originating parent_header/cell #9969

rgbkrk commented Sep 29, 2016

rgbkrk commented Sep 29, 2016

glyph commented Sep 30, 2016

glyph commented Sep 30, 2016

takluyver commented Sep 30, 2016

glyph commented Sep 30, 2016

rgbkrk commented Oct 5, 2016 •

edited

Loading

rgbkrk commented Oct 5, 2016

minrk commented Oct 10, 2016

rgbkrk commented Oct 10, 2016

glyph commented Oct 10, 2016

rgbkrk commented Oct 11, 2016

minrk commented Oct 11, 2016

minrk commented Oct 11, 2016

rgbkrk commented Oct 19, 2016 •

edited

Loading

minrk commented Oct 21, 2016

Routing ThreadPoolExecutor results to originating parent_header/cell #9969

Routing ThreadPoolExecutor results to originating parent_header/cell #9969

Comments

rgbkrk commented Sep 29, 2016

rgbkrk commented Sep 29, 2016

glyph commented Sep 30, 2016

glyph commented Sep 30, 2016

takluyver commented Sep 30, 2016

glyph commented Sep 30, 2016

rgbkrk commented Oct 5, 2016 • edited Loading

rgbkrk commented Oct 5, 2016

minrk commented Oct 10, 2016

rgbkrk commented Oct 10, 2016

glyph commented Oct 10, 2016

rgbkrk commented Oct 11, 2016

minrk commented Oct 11, 2016

minrk commented Oct 11, 2016

rgbkrk commented Oct 19, 2016 • edited Loading

minrk commented Oct 21, 2016

rgbkrk commented Oct 5, 2016 •

edited

Loading

rgbkrk commented Oct 19, 2016 •

edited

Loading