Kernel Has Died error in Notebook #1198

Closed
anderwm opened this issue Dec 22, 2011 · 14 comments

@anderwm

anderwm commented Dec 22, 2011

I get this message occasionally at seemingly random times during a notebook session. Although the kernel restarts fine, without a more interesting error message there is not a lot I can do. I am sure this was added to recover from errors without much effort from the end user, but is there somewhere I can look to see who is the serial killer that keeps murdering my beloved kernels?

@minrk
Member

minrk commented Dec 23, 2011

The odds are that your kernel has not died, and the bug is a false positive in the heartbeat code.

Can you post:

pyzmq version (zmq.__version__)
libzmq version (zmq.zmq_version())
Python version
OS

Can you describe what you are typically doing when this happens? Are you using pylab mode?
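For anyone else hitting this, the details requested above can be gathered in one go. This is a minimal sketch, assuming pyzmq is importable as `zmq`; the helper name `environment_report` is made up for illustration and is not part of IPython or pyzmq.

```python
import platform


def environment_report():
    """Collect the version details asked for in this thread (hypothetical helper)."""
    report = {
        "python": platform.python_version(),
        "os": platform.platform(),
    }
    try:
        import zmq
    except ImportError:
        # pyzmq is missing, so neither version can be reported.
        report["pyzmq"] = report["libzmq"] = "not installed"
    else:
        report["pyzmq"] = zmq.__version__      # version of the Python bindings
        report["libzmq"] = zmq.zmq_version()   # version of the underlying C library
    return report


for key, value in sorted(environment_report().items()):
    print("%s: %s" % (key, value))
```

Pasting the printed lines into the issue covers all four items at once.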

@anderwm
Author

anderwm commented Dec 23, 2011

zmq.__version__
Out[6]: '2.1.11'

zmq.zmq_version()
Out[7]: '2.1.11'

Python version 2.7 on Windows XP 32 bit

The most repeatable way it happens is when first loading a notebook from the dashboard: about 5 seconds after I see the cells load, I get the dead kernel message. If I close the dialog (choosing nothing), I am not able to execute code. If I restart the kernel it works fine, and if I close the notebook and open it back up it does not happen again. Yes, I am using pylab mode in general, but this happens without it enabled as well.

@minrk
Member

minrk commented Dec 23, 2011

Okay, thanks.

Try increasing the heartbeat period by setting MappingKernelManager.time_to_dead=10, either in config or at the command-line.
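As a sketch of the config route: the fragment below assumes the notebook reads `ipython_notebook_config.py` from the active profile directory. That file name and location are assumptions about your setup, so check where your profile lives before relying on it.

```python
# ipython_notebook_config.py (assumed file name, inside your IPython profile directory)
c = get_config()  # injected by IPython when it loads a config file

# Heartbeat period in seconds, raised to the value suggested above.
c.MappingKernelManager.time_to_dead = 10
```

The command-line form would presumably be `ipython notebook --MappingKernelManager.time_to_dead=10`, following IPython's usual `--Class.trait=value` pattern.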

@anderwm
Author

anderwm commented Dec 23, 2011

On first test that seems to work; I'll play around with it some more tomorrow. This computer is a bit old/slow. Is the kernel manager tripping because the kernel is taking a while to start?

@minrk
Member

minrk commented Dec 23, 2011

Without injecting some debugging statements it's a bit hard to tell, but the way the heartbeat works in the notebook is this:

The kernel manager sends a ping every so often (in this case every time_to_dead seconds), and if it doesn't receive a reply by the time it would send out the next ping, it believes the kernel is dead. The kernel remains responsive, even during GIL-holding blocking code, because its responder is actually an extremely tiny pure-C 0MQ thread.

Looking at the code, I can see two problems:

  1. (this is most likely the cause of your issue) The heartbeat mechanism starts right away when the kernel subprocess is initiated. So if the kernel isn't up and running and responsive within time_to_dead, then it will fail to make the first heartbeat, and essentially be treated as DOA.
  2. (unlikely issue, but still a bug) If the server is slow, it might queue up the heart-failed action while a heartbeat reply is still waiting unread in the queue (hb_stream is not flushed, as it is in the parallel code's more elaborate heartbeat mechanism). But that would require some seriously heavy load on the server.
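The second problem can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyHeartMonitor` and `check_with_queued_reply` are hypothetical names, the deque stands in for unread messages on hb_stream, and none of this is IPython's actual code.

```python
from collections import deque


class ToyHeartMonitor:
    """Hypothetical stand-in for the server-side heartbeat check."""

    def __init__(self):
        self.unread = deque()   # replies that arrived but were not processed yet
        self.beating = False    # set once a reply has been processed

    def handle_reply(self, msg):
        self.beating = True

    def flush(self):
        # Process everything already waiting before judging the kernel.
        while self.unread:
            self.handle_reply(self.unread.popleft())

    def kernel_alive(self, flush_first):
        if flush_first:
            self.flush()
        alive, self.beating = self.beating, False
        return alive


def check_with_queued_reply(flush_first):
    monitor = ToyHeartMonitor()
    monitor.unread.append(b"ping")  # reply arrived, but the busy server has not read it
    return monitor.kernel_alive(flush_first)


print(check_with_queued_reply(flush_first=False))  # False: a false "dead kernel" report
print(check_with_queued_reply(flush_first=True))   # True: the queued reply is counted
```

In real code the flushing step would likely map onto pyzmq's `ZMQStream.flush()`, but treat that mapping as an assumption rather than a description of the fix.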

I was able to replicate (as far as I can tell) your issue by cutting the heartbeat time down to 300 milliseconds, after which my notebook keeps seeing dead kernels. But if I delay the first heartbeat for the original 3 seconds, I once again have a perfectly responsive kernel. So the solution is to allow a user-configurable delay on the first beat, which should probably default to around 5 s, enough to cover any normal environment.
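The effect of a first-beat delay can be sketched with a small timing model. Assumptions: `declared_dead` and its `first_beat` parameter are illustrative names, the startup times of 4 s and 1 s are invented, and a ping sent before the kernel is up is taken to be answered the moment startup finishes.

```python
def declared_dead(kernel_startup, time_to_dead, first_beat=0.0):
    """Toy model: is the kernel reported dead on its very first heartbeat?

    All arguments are in seconds. The first ping goes out at `first_beat` and
    must be answered within `time_to_dead`. A ping sent before the kernel is up
    is assumed to be answered as soon as the kernel finishes starting.
    """
    ping_sent = first_beat
    reply_at = max(ping_sent, kernel_startup)
    return reply_at - ping_sent > time_to_dead


# Slow machine: kernel needs 4 s to start, heartbeat period is 3 s.
print(declared_dead(kernel_startup=4.0, time_to_dead=3.0))                  # True
# Workaround from earlier in the thread: time_to_dead=10.
print(declared_dead(kernel_startup=4.0, time_to_dead=10.0))                 # False
# Replication: 300 ms period, then the same period with a 3 s first-beat delay.
print(declared_dead(kernel_startup=1.0, time_to_dead=0.3))                  # True
print(declared_dead(kernel_startup=1.0, time_to_dead=0.3, first_beat=3.0))  # False
```

The delay only has to outlast kernel startup, which is why it can stay independent of the heartbeat period itself.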

@minrk
Member

minrk commented Dec 23, 2011

I added what should be a fix to my existing notebook PR #1187, if you want to try that out.

@ellisonbg
Member

I too see this occasionally but have no idea what is causing it.
What version of pyzmq are you using?

Cheers,

Brian

On Thu, Dec 22, 2011 at 12:29 PM, anderwm wrote:

I get this message occasionally at seemingly random times during a notebook session.  Although the kernel restarts fine, without a more interesting error message there is not a lot I can do.  I am sure this was added to recover from errors without much effort from the end user, but is there somewhere I can look to see who is the serial killer that keeps murdering my beloved kernels?



Brian E. Granger
Cal Poly State University, San Luis Obispo
bgranger@calpoly.edu and ellisonbg@gmail.com

@minrk
Member

minrk commented Dec 23, 2011

@ellisonbg - it's most likely that it's taking more than one heartbeat cycle for the kernel to start. See above comments and link for discussion and a proposed fix.

@anderwm
Author

anderwm commented Dec 23, 2011

This is probably not appropriate here, but I am new to Git. Is the best way to try your code:

1) make your fork a remote
2) pull the branch that you committed to (nbShutdown, I believe)
3) run ipython from inside my local folder (from the git clone)

Or is there a more effective mechanism? A link to read would also be welcome.

@minrk
Member

minrk commented Dec 23, 2011

Yes, that's exactly right. For more specific steps:

git remote add minrk git://github.com/minrk/ipython.git
git fetch minrk
git checkout -b nbshutdown minrk/nbshutdown
python ipython.py notebook --pylab inline --notebook-dir=/path/to/your/notebooks

@anderwm
Author

anderwm commented Dec 23, 2011

And just to make sure the comparison is apt,

If I run

ipython notebook

from any other directory, will it run the original code, i.e. the copy in site-packages?

Or I should say, will it get the code from site-packages/ipython/whatever?

@minrk
Member

minrk commented Dec 23, 2011

Yes, unless you have done a 'dev' install, either with python setupegg.py develop or using symlinks so that site-packages points to the git tree. I recommend doing one of these for any project for which you actually plan to track the development version.
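To confirm which copy actually gets imported, you can ask Python where it finds the package. This is a minimal sketch assuming a modern Python 3 with `importlib.util.find_spec`; `module_location` is a made-up helper name. On the Python 2.7 used in this thread, importing IPython and printing `IPython.__file__` gives the same information.

```python
import importlib.util


def module_location(name):
    """Return the file a top-level module would be loaded from, or None if not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# Run this from the directory you normally start the notebook in.
print(module_location("IPython"))
```

A path inside the git clone means the development code is in use; a path under site-packages means the installed release is.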

@anderwm
Author

anderwm commented Dec 23, 2011

So far I have not been able to recreate the error with the new code, so that's a good sign. I will continue to test it as I have time.

@minrk
Member

minrk commented Jan 6, 2012

closed by PR #1187
