Kernel Has Died error in Notebook #1198

Closed
anderwm opened this Issue Dec 22, 2011 · 14 comments

@anderwm

I get this message occasionally at seemingly random times during a notebook session. Although the kernel restarts fine, without a more interesting error message there is not a lot I can do. I am sure this was added to recover from errors without much effort from the end user, but is there somewhere I can look to see who is the serial killer that keeps murdering my beloved kernels?

@minrk
IPython member

The odds are that your kernel has not died, and the bug is a false positive in the heartbeat code.

Can you post:

pyzmq version (zmq.__version__)
libzmq version (zmq.zmq_version())
Python version
OS
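
The four items requested above can be collected in one go. This is only a convenience sketch: the function name `environment_report` is made up for illustration, and it assumes nothing beyond the two pyzmq calls already named in the list.

```python
import platform
import sys


def environment_report():
    """Gather the versions asked for above (hypothetical helper name)."""
    report = {
        "python": sys.version.split()[0],
        "os": platform.platform(),
    }
    try:
        import zmq
    except ImportError:
        # pyzmq is not importable from this interpreter
        report["pyzmq"] = "not installed"
        report["libzmq"] = "not installed"
    else:
        report["pyzmq"] = zmq.__version__
        report["libzmq"] = zmq.zmq_version()
    return report


if __name__ == "__main__":
    info = environment_report()
    for key in sorted(info):
        print("%s: %s" % (key, info[key]))
```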

Can you describe what you are typically doing when this happens? Are you using pylab mode?

@anderwm
zmq.__version__
Out[6]: '2.1.11'

zmq.zmq_version()
Out[7]: '2.1.11'

Python version 2.7 on Windows XP 32 bit

The most repeatable way it happens is when first loading a notebook from the dashboard: about 5 seconds after I see the cells load, I get the dead kernel message. If I close the dialog without choosing anything, I am not able to execute code. If I restart the kernel it works fine, and if I close the notebook and open it back up it does not happen again. Yes, I am using pylab mode in general, but this happens without it enabled as well.

@minrk
IPython member

Okay, thanks.

Try increasing the heartbeat period by setting MappingKernelManager.time_to_dead=10, either in config or at the command-line.
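
As a sketch of the "in config" option, the setting could go in the notebook config file of your profile. The file name below and the use of `get_config()` are assumptions about a typical IPython setup of this era, not something stated in the thread; the fragment is meant to be loaded by IPython rather than run on its own.

```python
# ipython_notebook_config.py in your profile directory (assumed file name)
c = get_config()  # injected by IPython when it loads a config file

# Wait up to 10 seconds for a heartbeat reply before declaring the kernel dead
c.MappingKernelManager.time_to_dead = 10.0

# Assumed command-line equivalent:
#   ipython notebook --MappingKernelManager.time_to_dead=10
```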

@anderwm

On first test that seems to work. I'll play around with it some more tomorrow. This computer is a bit old/slow; is the kernel manager tripping because the kernel is taking a while to start?

@minrk
IPython member

Without injecting some debugging statements it's a bit hard to tell, but the way the heartbeat works in the notebook is this:

The kernel manager sends a ping at a fixed interval (here, every time_to_dead seconds), and if it hasn't received a reply by the time it would send the next ping, it considers the kernel dead. The kernel remains responsive, even during GIL-holding blocking code, because its responder is actually an extremely tiny pure-C 0MQ thread.

Looking at the code, I can see two problems:

  1. (this is most likely the cause of your issue) The heartbeat mechanism starts right away when the kernel subprocess is initiated. So if the kernel isn't up and running and responsive within time_to_dead, then it will fail to make the first heartbeat, and essentially be treated as DOA.

  2. (unlikely issue, but still a bug) If the server is slow, it might queue up the heart-failed action while a heartbeat reply is still waiting in the queue (hb_stream is not flushed, as it is in the parallel code's more elaborate heartbeat mechanism). But that would require some seriously heavy load on the server.

I was able to replicate (as far as I can tell) your issue by cutting the heartbeat time down to 300 milliseconds, so my notebook keeps seeing dead kernels. But if I delay the first heartbeat for the original 3 seconds, I once again have a perfectly responsive kernel. So the solution here is to allow a user-configurable delay on the first beat, which should probably default to around 5s; that should cover any normal environment.
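
The first problem and the proposed delay can be illustrated with a toy timeline. This is a simplified model, not the notebook's actual heartbeat code: `simulate_heartbeat`, `kernel_ready_at`, and `first_beat` are hypothetical names, and the model assumes a ping sent before the kernel is up gets answered the moment the kernel becomes ready.

```python
def simulate_heartbeat(kernel_ready_at, time_to_dead=3.0, first_beat=0.0, n_beats=5):
    """Toy model of the ping/reply cycle described above.

    Returns the time (in seconds after launch) at which the kernel would be
    declared dead, or None if it answers all n_beats pings in time.
    """
    ping_time = first_beat
    for _ in range(n_beats):
        # The reply must arrive before the next ping would be sent.
        deadline = ping_time + time_to_dead
        # Assumption: a live kernel replies instantly; a kernel that is still
        # starting replies as soon as it is ready.
        reply_time = max(ping_time, kernel_ready_at)
        if reply_time >= deadline:
            return deadline
        ping_time = deadline
    return None


if __name__ == "__main__":
    # Slow machine: the kernel needs 4s to start, pings begin immediately.
    print(simulate_heartbeat(kernel_ready_at=4.0, time_to_dead=3.0))
    # Same kernel, but the first ping is delayed by 5s.
    print(simulate_heartbeat(kernel_ready_at=4.0, time_to_dead=3.0, first_beat=5.0))
```

In this model a kernel that takes 4 seconds to start is reported dead at the 3-second mark, while either delaying the first ping to 5 seconds or raising time_to_dead to 10 lets it survive, which matches the behaviour reported in the thread.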

@minrk
IPython member

I added what should be a fix to my existing notebook PR #1187, if you want to try that out.

@ellisonbg
IPython member
@minrk
IPython member

@ellisonbg - it's most likely that it's taking more than one heartbeat cycle for the kernel to start. See above comments and link for discussion and a proposed fix.

@anderwm

This is probably not appropriate here, but I am new to Git. Is the best way to try your code:

1) make your fork a remote
2) pull the branch that you committed to (nbShutdown I believe)
3) run ipython from inside my local folder (from the git clone)

Or is there a more effective mechanism? If so, point me to a link to read.

@minrk
IPython member

Yes, that's exactly right. For more specific steps:

git remote add minrk git://github.com/minrk/ipython.git
git fetch minrk
git checkout -b nbshutdown minrk/nbshutdown
python ipython.py notebook --pylab inline --notebook-dir=/path/to/your/notebooks

@anderwm

And just to make sure the comparison is apt,

If I run

ipython notebook

from any other directory, it will run the original code, e.g. from site-packages?

Or, I should say, it will get the code from site-packages/ipython/whatever.

@minrk
IPython member

Yes, unless you have done a 'dev' install, either with python setupegg.py develop or using symlinks so that site-packages points to the git tree. I recommend doing one of these for any project for which you actually plan to track the development version.
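
One way to check which copy is actually being picked up is to ask Python where it imported the package from. This is a small sketch; `module_location` is a hypothetical helper name. Run it from the directory you launch the notebook from, since the current or script directory usually comes first on the import path.

```python
import importlib
import os


def module_location(name):
    """Import the named module and return the directory it was loaded from."""
    module = importlib.import_module(name)
    return os.path.dirname(os.path.abspath(module.__file__))


if __name__ == "__main__":
    try:
        # A path inside the git clone means the development code is in use;
        # a path under site-packages means the installed release is in use.
        print(module_location("IPython"))
    except ImportError:
        print("IPython is not importable from this directory")
```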

@anderwm

So far I have not been able to recreate the error with the new code, so that's a good sign. I will continue to test it as I have time.

@minrk
IPython member

Closed by PR #1187.

@minrk minrk closed this Jan 6, 2012