Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Routing ThreadPoolExecutor results to originating parent_header/cell #9969

Open
rgbkrk opened this issue Sep 29, 2016 · 15 comments
Open

Routing ThreadPoolExecutor results to originating parent_header/cell #9969

rgbkrk opened this issue Sep 29, 2016 · 15 comments

Comments

@rgbkrk
Copy link
Member

rgbkrk commented Sep 29, 2016

Code run in a background thread does not necessarily match the output area with the originating parent_header.

screen shot 2016-09-29 at 3 44 57 pm

In the case of the jupyter notebook, this means the last cell to be run ends up with stdout or display_data from that prior cell. Is there any way for a user to scope outputs in background threads to target an original cell?

My example above is trivial, but I'll also give context on asychronous background tasks that a user is likely to do:

  • Database queries (in my case - Hive, Presto, and SparkSQL)
  • External processes
  • HTTP Requests

I know @glyph cares about this too. 😉

@rgbkrk
Copy link
Member Author

rgbkrk commented Sep 29, 2016

Ever so slightly related to ipython/ipykernel#109

@glyph
Copy link

glyph commented Sep 30, 2016

Background threads are nice, but background callables are what I really care about :)

@glyph
Copy link

glyph commented Sep 30, 2016

Thanks @rgbkrk !

@takluyver
Copy link
Member

I think to do this we would need some hook which was run when a thread was created, inside the newly created thread. We could use that to take a copy of the parent message ID at the time the thread was created, so that output was routed back to there. However, I don't know of any such hook.

With the proposal for a kernel nanny to capture stdout/stderr at a lower level, I think such low-level output would always have this issue (showing up under the last cell executed), because that output won't be sent with a parent ID. That's probably not terribly important, though.

@glyph
Copy link

glyph commented Sep 30, 2016

@takluyver There's also the issue of what to do with things created by Tornado timed calls, then.

@rgbkrk
Copy link
Member Author

rgbkrk commented Oct 5, 2016

I just realized I failed to include my code for this. Here's a reproduction:

import time
def print_wait(data, timeout):
    time.sleep(timeout)
    print(data)

from concurrent import futures
tp = futures.ThreadPoolExecutor(max_workers=10)

@rgbkrk
Copy link
Member Author

rgbkrk commented Oct 5, 2016

I hear you on background callables - we should err on the side of message passing across boundaries.

@minrk
Copy link
Member

minrk commented Oct 10, 2016

It's a bit complicated to track the 'right' cell in general, e.g. is it:

  1. the cell that created the thread?
  2. the cell that 'caused' the execution?

The first is pretty easy to implement, but rarely desirable, because it would route all output to the cell that instantiated the ThreadPool. I don't know how to do 2. without baking awareness of the parent state into a ThreadPool subclass, and at that point I'm not sure now useful it is.

If we assume that we control the ThreadPoolExecutor (i.e. it's our subclass), we should be able to do it with:

  1. make parent storage semi-thread-local (lookup from main thread by default to preserve output of non-parent-aware threads)
  2. on submit, fix header in calling scope for the call
  3. in wrapped function in the thread, set the thread-local header from the submitter
  4. at end of wrapped function, unset thread-local header to restore main-thread default behavior

The last tricky bit is the status: idle message. Often frontends want to know that a given cell has finished producing output. For single-thread execution, status: idle is a clear boundary. For async output, this assumption is broken, so the frontend is left to decide what to do: it can never discard callback information, as it can never know that output for a given cell is truly complete (async output may come at an arbitrarily later time). There are a few options:

  1. delay status: idle until threaded execution is 'done' (impossible in general)
  2. accept that status: idle doesn't indicate end-of-output, and guarantee that Futures/Promises representing output of an execution can never resolve.

@rgbkrk
Copy link
Member Author

rgbkrk commented Oct 10, 2016

The last tricky bit is the status: idle message. Often frontends want to know that a given cell has finished producing output. For single-thread execution, status: idle is a clear boundary. For async output, this assumption is broken, so the frontend is left to decide what to do: it can never discard callback information, as it can never know that output for a given cell is truly complete (async output may come at an arbitrarily later time).

As a selfish frontend developer, I'd prefer to close off all my subscriptions/listenders on an idle. Yet, as a user of asynchronous background processes, I'd like a way to route accordingly. This isn't even just for the linear notebook - it affects dashboards, general output areas, etc.

@glyph
Copy link

glyph commented Oct 10, 2016

Before we figure out how to make print et. al. attribute the correct cell automatically, let's maybe just have a way to do it at all, even if it requires some involvement from the code being executed.

At the time some code is executed, it is being executed in a namespace, and some values are injected there. Dealing with print is tricky, because you implicitly need to share namespaces between one cell and the next in order to get the name-binding semantics users expect, and print always has to be called print (and sys.stdout is a global). But what if we had a function, hypothetically output_to_this_cell, which is re-bound each time a cell evalutes, so a user could do this in a cell:

# cell A
def from_this_cell(arg, *, show=output_to_this_cell):
    show(arg)

Then later

# cell B
from_this_cell("output below cell A")

Is there already a building block that can be used like this for threaded/callbacked notebook code?

@rgbkrk
Copy link
Member Author

rgbkrk commented Oct 11, 2016

While it doesn't exist yet, there's a bit of a proposal to provide an update mechanism like this:

from IPython.display import display, HTML, Updatable
d = Updatable(HTML('<b>whoa</b>'))
display(d)
# ...
d.replace(HTML('<i>awesome</i>'))

@minrk
Copy link
Member

minrk commented Oct 11, 2016

@glyph there's an Output widget that uses the comms system to send output to a different location, but these are 'live widget' things, not plain outputs.

All of the routing information is really in the parent_header, so if you can temporarily set the parent, you are almost there.

The main thing I think we need to do is to simplify (dramatically) the parent-setting machinery, so that it's easier to swap it out. Right now, set_parent does a cascade of setting descendant parents, rather than store it in a nice single place. That will allow us to more easily make the changes needed for this to be a thread-specific lookup.

The second thing we need to do is to change in the frontend how output handlers are cleared in the frontend. Right now, we are clearing callbacks once status-idle has been received, and only handling async output if it appears to come from the most recently executed cell (this is what IPython's stateful storage of parent will do with background threads). We would need to move the callback-clearing to the cell level, rather than the message level, to better indicate that there are no event listeners for a given output area.

This notebook demonstrates all the hacks necessary to get it to work right now:

  1. save the parent for the target cell somewhere
  2. context manager for setting parent temporarily
  3. disable status: idle for target cell (workaround frontend clearing callbacks for 'apparently finished' cells)

And even that won't work for concurrent outputs, as the state is still process-wide, so setting the parent in one thread sets it for all threads as long as that thread is in its context.

So we need to:

  1. move output-callback-clearing to the cell level, so that it clears on next execute, rather than trying to guess when output is done
  2. consolidate parent-setting in the kernel
  3. make it possible for kernel parent lookup/store to use thread-local storage for concurrent parents

@minrk
Copy link
Member

minrk commented Oct 11, 2016

the frontend-side is done (in one frontend) here: jupyter/notebook#1826

@rgbkrk
Copy link
Member Author

rgbkrk commented Oct 19, 2016

screen shot 2016-10-19 at 11 07 35 am

I made no changes to nteract for your notebook Min. I'm assuming the cell is thought of as still running with this, so the cell should show that it's still "running"?

@minrk
Copy link
Member

minrk commented Oct 21, 2016

@rgbkrk yup, the disable-status bit is telling frontends that "this cell will never be done" to ensure that they keep their handlers active for outputs from that cell. With jupyter/notebook#1826 this extra bit is unnecessary, and the cell can send its idle message as normal without clearing the output handlers.

So if we want to minimally allow retargeting to cells, the main kernel feature is a nicer API for recording and reassigning the parent. The harder part (in IPython, at least) is that the parent must be thread-local to avoid problems with concurrent output, but also fallback on the main thread storage if no thread-specific parent has been set.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants