# Dead kernel loop #1232

Closed
opened this Issue Jan 5, 2012 · 29 comments


### anderwm commented Jan 5, 2012

 I have been using the notebook a bit on my home PC with some success, so I am trying to get it running at the office: 64-bit Windows 7 with 64-bit Python installed. Everything works as expected, up to the dashboard page (which opens fine in Chrome). When I choose a notebook, or choose new notebook, the notebook opens and I get the dead kernel message. When I restart I get another dead kernel, and so forth. All I get in the kernel screen is the following:

    The IPython Notebook is running at: http://127.0.0.1:8888
    Use Control-C to stop this server and shut down all kernels.
    Using MathJax from CDN
    Kernel started: a902af46-e03b-4fb5-bbd6-7e5a3eb3c81a
    To connect another client to this kernel, use: --existing kernel-a902af46-e03b-4fb5-bbd6-7e5a3eb3c81a.json
    Connecting to: tcp://127.0.0.1:63930
    Connecting to: tcp://127.0.0.1:63931
    Connecting to: tcp://127.0.0.1:63933
    Kernel started: 34e69ec9-b6f3-40d9-8424-3b824f8aa441
    Connecting to: tcp://127.0.0.1:64012
    Connecting to: tcp://127.0.0.1:64013
    Connecting to: tcp://127.0.0.1:64015
    To connect another client to this kernel, use: --existing kernel-34e69ec9-b6f3-40d9-8424-3b824f8aa441.json
    Kernel started: 98bdfcd3-0a8d-4aea-959c-49219aae41b6
    Connecting to: tcp://127.0.0.1:64049
    Connecting to: tcp://127.0.0.1:64050
    Connecting to: tcp://127.0.0.1:64052

 For somebody like me who usually diagnoses their own stupidity from the error trace, the kernel model makes it difficult.
Owner

### fperez commented Jan 6, 2012

 Mhh, I've seen this on Windows as well, but it seemed like the problem was that startup is slower on Windows than on other systems, and the dead kernel page triggers too fast. I got it to work by closing that notebook altogether (with the 'kill kernel on exit' box checked), and then opening a new one fresh. It seems that once everything had been loaded from disk, the startup was fast enough. However, it's possible that our logic for dead kernels is too aggressive; @minrk, do your recent changes in #1187 help on this front?
Owner

### minrk commented Jan 6, 2012

 This is almost certainly the same issue as addressed by #1187, so I would recommend testing against that. It's possible the changes there aren't enough for some situations, but the timers are tunable via the MappingKernelManager.first_beat and MappingKernelManager.time_to_dead traits for slower environments.

 @anderwm The reason there's no traceback is that there's probably no actual error. The "kernel died" message is triggered by the kernel's failure to respond to the heartbeat. The cause is most likely that the kernel took too long to start up, and the heartbeat timeout starts counting the instant the kernel process is requested. This is precisely the issue that PR #1187 is meant to address.
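
 As a rough illustration of the timing described above, here is a toy model of a heartbeat check. It is only a sketch and not the notebook's actual heartbeat code: the function name `kernel_declared_dead` and the numbers used are assumptions made for the example. It treats `first_beat` as a grace period before the first ping and `time_to_dead` as the wait for a reply.

```python
def kernel_declared_dead(startup_seconds, time_to_dead, first_beat=0.0):
    """Toy model: return True if the monitor would report a dead kernel.

    startup_seconds: how long the kernel needs before it can answer a heartbeat.
    time_to_dead:    how long the monitor waits for a reply to a ping.
    first_beat:      grace period before the monitor sends its first ping.
    """
    # The first ping goes out after the grace period; the reply deadline
    # is one time_to_dead interval later.
    deadline = first_beat + time_to_dead
    # A kernel that is still starting when the deadline passes never replied.
    return startup_seconds > deadline


# Illustrative numbers only: a kernel that needs 6 s to start, 3 s timeout.
print(kernel_declared_dead(6.0, time_to_dead=3.0))                  # True
print(kernel_declared_dead(6.0, time_to_dead=3.0, first_beat=5.0))  # False
```

 In this model a slow but healthy kernel is reported dead only because the clock starts too early, which is why no traceback shows up anywhere.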
Owner

### fperez commented Jan 6, 2012

 Should we add to #1187 a closes #1232?
Owner

### minrk commented Jan 6, 2012

 Yes, I think so, unless we want to wait for confirmation that it does indeed address the issue, though I am fairly confident that it does.
Owner

### fperez commented Jan 6, 2012

 It also looks to me like it. Let's do it, we can always reopen if it proves to persist. I'll merge now with that.

Owner

### minrk commented Jan 6, 2012

 Great, thanks!

### anderwm commented Jan 6, 2012

 I don't think this is the same problem, guys. I saw where you merged #1187, so I pulled it down and am now trying it on my work PC. I still get the same burst of dead kernels on startup. After playing with it some more, I have found that if I restart it 10-15 times it will work for a few moments, but even before I try to run a command it will die again and I will have to restart it 10-15 times. #1187 did fix the problem on my laptop, which is very slow and old compared to this system. This system is a quad core, 64-bit, with 4 GB RAM...at least better than the junk I normally buy. I also tried upping the time_to_dead config option to no avail.

Owner

### fperez commented Jan 6, 2012

 Ok, I've reopened it then so we can try to hunt it down...

### anderwm commented Jan 10, 2012

 Also, the time_to_dead config option doesn't seem to do anything in this case. I can change it to 10, but the dead kernel is still immediate upon restarting. I checked it in the debugger and it is indeed using the 10 seconds. It will execute code in between kernel deaths (which are somewhat random, ranging from 0.1 to 20 seconds) if I am quick enough.
Contributor

### dwf commented Jan 14, 2012

 I just encountered this in 0.12 on Linux 32-bit, EPD 7.2. It seems like it's fixed in master, but now the JavaScript seems a little wonky on Chrome (not Firefox, though; odd). I'll file another bug when I know more.
Contributor

### dwf commented Jan 14, 2012

 Oh, also -- it seemed I was only seeing it with --pylab or --pylab=inline. The notebook without --pylab worked fine.

### anderwm commented Jan 16, 2012

 I get it in Chrome and Firefox, with or without --pylab. Only on this 64-bit Python machine, though.
Owner

### fperez commented Jan 16, 2012

 @minrk, do you think this could be somehow related to the messaging glitches you've been investigating in #1266? I haven't seen this problem even once, so I'm kind of stumped as to what could be causing it... Still, I made it high priority, it would be good to get to the bottom of this before 0.13.
Contributor

### dwf commented Jan 16, 2012

 Just to clarify, I'm not seeing any of these issues in trunk. Only in the version that (unfortunately) shipped with EPD 7.2-1. I also got that behaviour on a 64-bit Mac install of EPD. :(
Owner

### fperez commented Jan 16, 2012

 Thanks, @dwf, for the clarification; @anderwm, are you seeing the problem with the current master branch from git?

### anderwm commented Jan 16, 2012

 @fperez Yes, I pulled it from the master again today.
Owner

### fperez commented Jan 16, 2012

 OK, thanks for the confirmation. Nasty, since we're not seeing it... I'm hoping @minrk will have one of his epiphanies on this one, since it's in the zmq heart of the beast...

### anderwm commented Jan 16, 2012

 It is a strange one for sure. I tried messing with the heartbeat period (in heartmonitor) and the time-to-dead settings. Although I am admittedly naive about the inner workings, I can't get the messages to even slow down much past 1 sec. If there is anything I can do to help, let me know.

 Edit: On further review, HeartMonitor.period accurately changes how long the first death takes; however, after that death there comes a burst of messages. Once you restart it 4-7 times you get another HeartMonitor.period seconds to execute code. If you make it really big (~100 seconds), the dead kernels come immediately.
Owner

### minrk commented Jan 16, 2012

 Wait, HeartMonitor.period? That's in IPython.parallel, and is not used by the notebook. It also makes no sense to me that having a large time_to_dead could cause earlier failures. The very first opportunity to call the heart-failure callback should be at MappingKernelManager.first_beat + MappingKernelManager.time_to_dead seconds. Can you try this config?

    c.MappingKernelManager.first_beat = 10.
    c.MappingKernelManager.time_to_dead = 5.

 I also pushed to a debughb branch with some extra logging messages that should hopefully help track down the timing of what's going on.
Owner

### minrk commented Jan 17, 2012

 @anderwm also, can you confirm your pyzmq and tornado versions?

### anderwm commented Jan 17, 2012

 I told you I was naive. I noticed that the HeartMonitor was in parallel, but I assumed the notebook had similar functionality so used the same file. Obviously I was incorrect. I cannot remember the versions off the top of my head, but they are very likely the most recent ones. I will check for sure when I return to the office (14 hrs from now). Then I will try your configs and pull down your branch. Thanks.
Owner

### minrk commented Jan 17, 2012

 No worries, it was just concerning to me that you were seeing apparent results from code that is not in use. It is indeed confusing that there is so much duplicate code between IPython.parallel and IPython.zmq, but they were developed together, and IPython.zmq couldn't keep up with the more demanding needs in IPython.parallel. We hope to consolidate these things soon, so there will be less duplication.

### anderwm commented Jan 17, 2012

    zmq.__version__
    '2.1.10'
    zmq.zmq_version()
    '2.1.10'

 With configs set as specified above, there is still an immediate death after clicking new notebook. Then it seems to wait 5 seconds before dying again. When it dies, I have to click restart several times before I get the 5 seconds again. c.MappingKernelManager.first_beat wasn't in the commented-out default config file, so I added it. I also tried setting them from the command line and the result was the same.

 I will try updating zmq; I probably used pre-compiled binaries because this is a 64-bit Windows box. I think the latest version should be 2.1.11.

 @minrk Where does the log wind up (application.log)? Specifically, the one created normally and added to by your debughb branch.

### anderwm commented Jan 17, 2012

 It was something about the zmq version. After updating to 2.1.11 everything works. Hard to say if it is specific to Windows, 64-bit Python, or what.
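
 For anyone wanting to check their own install against the versions mentioned here, a minimal sketch follows. The helper names `version_tuple` and `installed_versions` and the `KNOWN_GOOD_ZMQ` threshold are hypothetical, chosen for this example; the threshold is just the release that worked in this thread, not a documented requirement.

```python
def version_tuple(version):
    """Turn a dotted release string such as '2.1.10' into (2, 1, 10)."""
    return tuple(int(part) for part in version.split("."))


def installed_versions():
    """Return pyzmq, libzmq and tornado versions, or None where not importable."""
    found = {"pyzmq": None, "libzmq": None, "tornado": None}
    try:
        import zmq
        found["pyzmq"] = zmq.__version__      # version of the Python bindings
        found["libzmq"] = zmq.zmq_version()   # version of the underlying library
    except ImportError:
        pass
    try:
        import tornado
        found["tornado"] = tornado.version
    except ImportError:
        pass
    return found


# Assumption: the release that worked in this thread, not an official minimum.
KNOWN_GOOD_ZMQ = "2.1.11"

versions = installed_versions()
print(versions)
libzmq = versions["libzmq"]
if libzmq is not None and version_tuple(libzmq) < version_tuple(KNOWN_GOOD_ZMQ):
    print("libzmq %s is older than %s" % (libzmq, KNOWN_GOOD_ZMQ))
```

 The versions are compared as integer tuples rather than strings, because a plain string comparison would rank '2.1.9' above '2.1.10'.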
Owner

### fperez commented Jan 17, 2012

 Mmh, interesting. If it's gone, let's then close this puppy. It's easy enough to reopen if you see the problem again. Thanks for the patient reporting!

### anderwm commented Jan 17, 2012

 No problem, although you might want to specify a minimum zmq version in the documentation (sorry if I missed it). Unless nobody else can replicate this and we just chalk it up to a bad zmq install.

### kyzyl commented May 9, 2012

 Perhaps this belongs in a new/different issue, but I'm getting this behavior again. I'm using EPD 7.2-2, 64-bit, on the latest stable Ubuntu (x64). I tried updating zmq as per above, but the issue persists. One point where my situation differs is that upon starting a notebook it takes about 14 s for the kernel to die. Setting a long time_to_dead seems to remove the notification of a dying kernel, but then no commands will execute ("[*]" forever). I get the same results when using the IPython that ships with EPD 7.2, the latest version from easy_install -U, and the latest trunk :-/

 Side note: The same versions of everything on Win7x64 produce no problems. Everything works. Thoughts?

### osiloke commented Jul 18, 2013

 Hi, I would like to know why this was closed when the last question was not answered. I'm also getting this issue on Mac OS X; I'm running IPython==0.13.2 with zmq==3.2.2. I installed zmq and Python using Homebrew. I also noticed that the IPython notebook startup process takes more than a minute, and the debug messages really don't help. It seems to be timing out before it can run the notebook code. I've tried the suggestions above but none of them are working for me.
Owner

### Carreau commented Jul 18, 2013

 > Hi, I would like to know why this was closed when the last question was not answered.

 The question was asked after the issue was closed, hence the issue #1719.

### mattvonrocketstein pushed a commit to mattvonrocketstein/ipython that referenced this issue Nov 3, 2014

 fperez

 Merge pull request #1187 from minrk/nbshutdown

 Notebook cleanups and fixes: connection file cleanup, first heartbeat, startup flush.

 Kernels would not linger, but the KernelManagers are not garbage-collected on shutdown. This means that connection files for kernels still running at notebook shutdown would not be removed. Now, kernels are explicitly killed at server shutdown, allowing the KernelManagers to cleanup files.

 Small changes along the way:

 * disables the unnecessary (and actively detrimental) SIGINT handler inherited from the original copy/paste from the qt app.
 * put webapp initialization in init_webapp out of initialize, to preserve convention of there being no unique code in initialize().
 * don't warn about http on all interfaces if running in 100% read-only mode, because no login or execution is possible.

 Closes #1232.

 6c0c625