heartbeat failure on long gil-holding operation #1260

Closed
minrk opened this Issue Jan 13, 2012 · 6 comments

Owner

minrk commented Jan 13, 2012

When exploring the recent heart failure discussion on-list, I discovered that I can reliably cause a heartbeat failure by grabbing the gil and sleeping for longer than the heartbeat period:

```python
def csleep(t):
    """gil-holding sleep with cython.inline"""
    from cython import inline
    import sys
    import time
    code = '\n'.join([
        'from posix cimport unistd',
        'unistd.sleep(t)',
    ])
    while True:
        inline(code, quiet=True, t=t)
        print time.time()
        sys.stdout.flush()  # this is important

csleep(5)
```

This reliably triggers the heart-failure dialog in both the notebook and the qtconsole. Strangely, the stdout.flush() (which results in IOStream.flush(), and ultimately ZMQStream.flush()) appears to be necessary to trigger the failure.

It turns out the culprit is our use of pyzmq's non-copying sends.

For non-copying sends to work, pyzmq links zmq's free-function callback (their equivalent of dealloc at refcount=0) to Python's decref. This means that when libzmq is done with a message, it must grab the GIL. By default a context has one io_thread (the canonical measure is 1 io_thread per GBps of throughput), which means that if libzmq is waiting on the GIL, all other libzmq activity in that context is blocked, including the heartbeat.
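
For reference, this is roughly what a non-copying send with tracking looks like in pyzmq (an illustrative sketch, not the kernel's actual send path; the sockets and endpoint here are hypothetical):

```python
import zmq

ctx = zmq.Context()                     # one io_thread by default
push = ctx.socket(zmq.PUSH)
pull = ctx.socket(zmq.PULL)
push.bind("tcp://127.0.0.1:5555")       # hypothetical endpoint
pull.connect("tcp://127.0.0.1:5555")

buf = b"x" * (1024 * 1024)              # a large payload worth not copying

# copy=False hands the buffer to libzmq without copying it. When the
# io_thread is finished with the message, libzmq fires pyzmq's free
# callback, which must take the GIL to decref `buf` -- this is the step
# a long GIL-holding operation can stall, and with it the whole io_thread.
tracker = push.send(buf, copy=False, track=True)

pull.recv()                             # deliver the message
tracker.wait()                          # returns once libzmq no longer needs `buf`
```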

Potential answers include:

  1. stop using non-copying sends in the kernel - perfectly GIL-less, but with obvious and probably unacceptable cost of copying everything.
  2. give the heartbeat thread its own context - most obvious and certain solution to this problem, but may require some bookkeeping at shutdown to make sure things actually halt.
  3. use more than one io_thread in kernel contexts - doesn't actually guarantee a solution to the problem unless only one non-copying message is outstanding at a time.

I think 2. is the most reasonable answer to all this, and I've tested it to confirm that it does indeed work for this case (a minimal sketch follows below). We may also want to use Context(io_threads=2) anyway, to allow for potential issues with GIL-grabbing in the main app, though I've never seen any such issues in single-threaded code.
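
A minimal sketch of option 2, assuming a hypothetical Heartbeat thread class and endpoint (not the actual IPython code): the heartbeat gets its own Context, so its io_thread never has to wait behind the GIL-bound free callbacks of the app's Context:

```python
import zmq
from threading import Thread

class Heartbeat(Thread):
    """Echo pings on a private Context, isolated from the kernel's other sockets."""

    def __init__(self, addr="tcp://127.0.0.1:5556"):   # hypothetical endpoint
        Thread.__init__(self)
        self.daemon = True
        self.addr = addr
        # a private Context means a private io_thread: non-copying sends
        # elsewhere in the process cannot stall this socket
        self.context = zmq.Context()

    def run(self):
        sock = self.context.socket(zmq.REP)
        sock.bind(self.addr)
        while True:
            msg = sock.recv()   # ping from the frontend's heartbeat monitor
            sock.send(msg)      # echo it straight back
```

The shutdown bookkeeping mentioned above would amount to terminating this private Context explicitly once the thread is told to stop.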

Potential optimizations:

  • I can possibly adjust pyzmq's free_fn to be GIL-less, and dump the decref actions into a queue running in a thread, thus getting Python fully out of libzmq's way. I don't actually know how I would do this, but it's presumably possible.
  • Limit use of non-copying sends to large messages. Non-copying sends are often slower for small messages, so we could check whether a message exceeds a size threshold to decide whether it should be copied (a sketch follows below). It's not always clear where that line should be drawn.
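
For illustration, the threshold logic could be as simple as the following sketch (smart_send and the 1kB cutoff are placeholders, not existing names or benchmarked values):

```python
COPY_THRESHOLD = 1024  # bytes; placeholder, to be determined by benchmarking

def smart_send(socket, buf):
    """Copy small messages; hand large ones to libzmq without copying."""
    if len(buf) < COPY_THRESHOLD:
        # a memcpy is cheap here and avoids the refcount/tracking machinery
        return socket.send(buf, copy=True)
    # large message: skip the copy, accept the GIL-bound free callback
    return socket.send(buf, copy=False)
```
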
Owner

fperez commented Jan 13, 2012

Thanks for the careful analysis, Min. Some thoughts:

  • Your solution #2 sounds like the way to go for now. I'm not sure I understand precisely what allowing for 2 io_threads in that Context object would buy us, could you clarify for me (keeping in mind my ignorance of libzmq minutiae)?
  • Your second optimization is interesting, but we should certainly benchmark things out. It sounds like a tradeoff of latency vs throughput from where I stand, is that correct? But in any case, I'd rather err towards the non-copying side: using non-copying sends for small messages will at most add a small delay to things that are quick to begin with, whereas the opposite (using copying sends for potentially large messages) can lead to catastrophic loss of performance or even running out of memory.
  • I'll leave the decision on the first optimization up to you, since it's a pyzmq improvement. But I hope if you do it, that doesn't mean we bump our minimum pyzmq requirement for IPython. I think for a while we should try to keep the dependency chain easy for users, as we gain adoption of the notebook and the parallel zmq-based code...
Owner

minrk commented Jan 13, 2012

  • Your solution #2 sounds like the way to go for now. I'm not sure I understand precisely what allowing for 2 io_threads in that Context object would buy us, could you clarify for me (keeping in mind my ignorance of libzmq minutiae)?

libzmq can be multithreaded itself. The io_threads are simple worker threads for processing zmq events. Since zmq events are typically so lightweight, the load has to be pretty enormous for more than one to make any sense. By default, this is one, so if there is an action that takes a bit (e.g. asking for the GIL), all other zmq activity is blocked. This is an issue unique to pyzmq noncopying sends (or other tools which take over memory management from libzmq). If you have more io_threads, then activities can continue even if one is being slow. This doesn't solve the problem in general, because it's possible that a bubble would appear in each pipe, but it does diminish the impact of a single bubble.
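
For concreteness, the io_thread count is just a constructor argument on the Context (a trivial sketch):

```python
import zmq

# Two worker threads for zmq I/O: if one stalls waiting for the GIL,
# the other can keep servicing sockets. This narrows the window for a
# stall, but doesn't eliminate it.
ctx = zmq.Context(io_threads=2)
```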

  • Your second optimization is interesting, but we should certainly benchmark things out. It sounds like a tradeoff of latency vs throughput from where I stand, is that correct?

Almost - instantiating tracking objects and involving the GIL in the non-copying case make it less clear-cut which is faster for small messages, so it isn't a simple latency-vs-throughput tension. The only truly certain comparison is memory footprint.

But in any case, I'd rather err towards the non-copying side, since I imagine that using non-copying for small messages will at most add small amounts of delay to things that are quick to begin with, whereas the opposite (using copying sends for potentially large messages) can lead to catastrophic loss of performance or even running out of memory.

Indeed, I would certainly put a cutoff, so that it's only small messages that are copied. It's possible the threshold for this in IPython is as high as 1kB or more, I really don't know. In libzmq itself, which is much more lightweight, they have the notion of a VSM (very small message), where non-copying requests are simply ignored. This threshold is something like 32B. But pyzmq/Python perform quite a few more actions per message, so a simple memcopy may be faster than invocation of the refcount/tracking machinery. If the cost is small, then the extra logic may not be worth it, in any case.

  • I'll leave the decision on the first optimization up to you, since it's a pyzmq improvement. But I hope if you do it, that doesn't mean we bump our minimum pyzmq requirement for IPython. I think for a while we should try to keep the dependency chain easy for users, as we gain adoption of the notebook and the parallel zmq-based code...

Since it would be an optimization, it wouldn't change any APIs. It would simply let the GIL-grabbing work happen in a thread that does not block libzmq. The GIL would still be involved in exactly the same way, but the waiting would just happen in another thread.
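
To make the shape of that concrete, here is a purely illustrative Python sketch of the thread structure; the real change would live in pyzmq's C/Cython layer, and the hand-off from the free callback would itself have to avoid taking the GIL, which plain Python cannot demonstrate:

```python
import Queue  # `queue` on Python 3
from threading import Thread

# buffers whose zmq free-callback has fired, awaiting decref
_to_release = Queue.Queue()

def _gc_worker():
    """Release references in a dedicated thread, so any waiting on the GIL
    happens here instead of inside libzmq's io_thread."""
    while True:
        buf = _to_release.get()
        if buf is None:     # sentinel for shutdown
            break
        del buf             # the decref fires in this thread

_gc_thread = Thread(target=_gc_worker)
_gc_thread.daemon = True
_gc_thread.start()

def free_fn(buf):
    """Stand-in for the free callback: enqueue instead of decref'ing in place.
    (In real pyzmq this notification would need to bypass the GIL entirely.)"""
    _to_release.put(buf)
```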

Right now, it's not likely to happen, because I don't have the time, energy, or interest to work on something like that.

Owner

ellisonbg commented Jan 13, 2012

I agree with @fperez and @minrk that option 2 is the best. As far as the optimization for small messages, I would probably not worry about it for now.

Owner

ellisonbg commented Jan 13, 2012

Oh, and fantastic job tracking this down and summarizing it here. It is also good to see cython.inline in this context.

minrk added a commit to minrk/ipython that referenced this issue Jan 13, 2012

minrk closed this in 1c4daed Jan 14, 2012

bluescarni commented

Thanks for looking into the issue; I'm hopeful this fixes the problems I'm having with the notebook becoming unresponsive during long computations in a C++ library wrapped in Python.

The library uses Boost.Python for the wrapping, and I'm pretty sure this results in the GIL being grabbed every time the C++/Python boundary is crossed. For now I work around the problem by setting a high time_to_dead value.

I'll try to update my IPython installation to confirm that the problem is solved. Thanks again!

Owner

fperez commented Jan 17, 2012

@bluescarni: even though this issue is now closed, please do drop us a line with feedback on how it goes for you in practice after updating. It's always useful to have that feedback in the long run, in case we ever see similar problems in the future.

minrk added a commit to minrk/ipython that referenced this issue Jan 18, 2012

Merge pull request #1262 from minrk/hbgil
Heartbeat no longer shares the app's Context

Fixes in both the single and parallel kernels, preventing the heartbeat thread from sharing the zmq Context with the rest of the process.  Non-copying sends require grabbing the GIL from the zmq io-thread in order to free memory, which could let Python get in the way of the heartbeat.

Test script and notebook added to examples directory.

closes #1260

minrk added a commit to minrk/ipython that referenced this issue Jan 28, 2012
