Dead kernel loop #1232

anderwm · 2012-01-05T18:12:30Z

I have been using the notebook a bit on my home pc with some success, so I am trying to get it running at the office.

64-bit windows 7 with 64 bit python installed
Everything works as expected, up to the dashboard page (which opens fine in Chrome)
When I choose a notebook, or choose new notebook, the notebook opens and I get the dead kernel message. When I restart I get another dead kernel, and so forth...all I get in the kernel screen is the following

 The IPython Notebook is running at: http://127.0.0.1:8888
 Use Control-C to stop this server and shut down all kernels.
 Using MathJax from CDN
 Kernel started: a902af46-e03b-4fb5-bbd6-7e5a3eb3c81a
 To connect another client to this kernel, use:
 --existing kernel-a902af46-e03b-4fb5-bbd6-7e5a3eb3c81a.json
 Connecting to: tcp://127.0.0.1:63930
 Connecting to: tcp://127.0.0.1:63931
 Connecting to: tcp://127.0.0.1:63933
 Kernel started: 34e69ec9-b6f3-40d9-8424-3b824f8aa441
 Connecting to: tcp://127.0.0.1:64012
 Connecting to: tcp://127.0.0.1:64013
 Connecting to: tcp://127.0.0.1:64015
 To connect another client to this kernel, use:
 --existing kernel-34e69ec9-b6f3-40d9-8424-3b824f8aa441.json
 Kernel started: 98bdfcd3-0a8d-4aea-959c-49219aae41b6
 Connecting to: tcp://127.0.0.1:64049
 Connecting to: tcp://127.0.0.1:64050
 Connecting to: tcp://127.0.0.1:64052

For somebody like me who usually diagnoses their own stupidity from the error trace, the kernel model makes it difficult.


fperez · 2012-01-06T00:45:44Z

Mhh, I've seen this on windows as well, but it seemed like the problem was that startup is slower on windows than on other systems, and the dead kernel page triggers too fast. I got it to work by closing that notebook altogether (with the 'kill kernel on exit' box checked), and then opening a new one fresh. It seems that once everything had been loaded from disk, the startup was fast enough.

However, it's possible that our logic for dead kernels is too aggressive; @minrk, do your recent changes in #1187 help on this front?

minrk · 2012-01-06T05:37:15Z

This is almost certainly the same issue as addressed by #1187, so I would recommend testing against that. It's possible the changes there aren't enough for some situations, but the timers are tunable via the MultiKernelManager.first_beat and MultiKernelManager.time_to_dead traits for slower environments.

@anderwm The reason there's no traceback is there's probably no actual error. The "kernel died" message is triggered by the kernel's failure to respond to the heartbeat. The cause is most likely that the kernel took too long to startup, and the heartbeat timeout starts counting the instant the kernel process is requested. This is precisely the issue that PR #1187 is meant to address.
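A minimal sketch of the timing problem described above, as a toy model rather than IPython's actual implementation. The `HeartbeatWatcher` class and its methods are hypothetical; only the names `first_beat` and `time_to_dead` are borrowed from the traits mentioned in this thread, and the numeric values are illustrative. It assumes the watcher gives no verdict until `first_beat + time_to_dead` seconds after the kernel was requested.

```python
class HeartbeatWatcher:
    """Toy heartbeat check with a startup grace period (hypothetical class)."""

    def __init__(self, first_beat, time_to_dead):
        self.first_beat = first_beat      # grace period before checking starts
        self.time_to_dead = time_to_dead  # longest silence tolerated
        self.started_at = None
        self.last_beat = None

    def start(self, now):
        """Record the moment the kernel process was requested."""
        self.started_at = now
        self.last_beat = None

    def beat(self, now):
        """Record a heartbeat reply from the kernel."""
        self.last_beat = now

    def is_dead(self, now):
        """Return True once the kernel has been silent for too long."""
        earliest_verdict = self.started_at + self.first_beat + self.time_to_dead
        if now < earliest_verdict:
            return False  # too early to judge a kernel that is still starting
        if self.last_beat is None:
            return True   # never answered at all
        return now - self.last_beat > self.time_to_dead


# No grace period: a kernel that has not answered by t=3 is declared dead.
eager = HeartbeatWatcher(first_beat=0.0, time_to_dead=3.0)
eager.start(now=0.0)
print(eager.is_dead(now=3.0))    # True

# With a grace period, the same slow start is tolerated.
patient = HeartbeatWatcher(first_beat=5.0, time_to_dead=3.0)
patient.start(now=0.0)
print(patient.is_dead(now=3.0))  # False
patient.beat(now=7.0)
print(patient.is_dead(now=8.0))  # False
```

In this model, raising `first_beat` only helps a kernel that is slow to start; it does nothing for a kernel that stops answering later on.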

fperez · 2012-01-06T05:55:02Z

Should we add to #1187 a closes #1232?

minrk · 2012-01-06T05:58:17Z

Yes, I think so, unless we want to wait for confirmation that it does indeed address the issue, though I am fairly confident that it does.

fperez · 2012-01-06T05:59:56Z

It also looks to me like it. Let's do it, we can always reopen if it proves to persist. I'll merge now with that.

minrk · 2012-01-06T06:02:02Z

Great, thanks!

anderwm · 2012-01-06T14:29:46Z

I don't think this is the same problem, guys. I saw that you merged #1187, so I pulled it down and am now trying it on my work PC. I still get the same burst of dead kernels on startup. After playing with it some more, I have found that if I restart it 10-15 times it will work for a few moments, but even before I try to run a command it dies again and I have to restart it another 10-15 times. #1187 did fix the problem on my laptop, which is very slow and old compared to this system. It's a quad core, 64 bit, 4gb RAM...at least better than the junk I normally buy. I also tried upping the time_to_dead config option, to no avail.

fperez · 2012-01-06T17:10:37Z

Ok, I've reopened it then so we can try to hunt it down...

anderwm · 2012-01-10T23:10:27Z

Also, the time_to_dead config option doesn't seem to do anything in this case: I can change it to 10, but the dead kernel is immediate upon restarting it. I checked it in the debugger and it is indeed using the 10 seconds.

It will execute code in between kernel deaths (which are somewhat random ranging from .1-20 seconds) if I am quick enough.

dwf · 2012-01-14T07:43:26Z

I just encountered this in 0.12 on Linux 32-bit, EPD 7.2. It seems like it's fixed in master, but now the JavaScript seems a little wonky on Chrome (not Firefox, though; odd). I'll file another bug when I know more.

dwf · 2012-01-14T07:46:58Z

Oh, also -- it seemed I was only seeing it with --pylab or --pylab=inline. The notebook without --pylab worked fine.

anderwm · 2012-01-16T14:56:21Z

I get it in Chrome and Firefox, with or without --pylab. Only on this 64-bit Python machine, though.

fperez · 2012-01-16T20:18:25Z

@minrk, do you think this could be somehow related to the messaging glitches you've been investigating in #1266? I haven't seen this problem even once, so I'm kind of stumped as to what could be causing it... Still, I made it high priority, it would be good to get to the bottom of this before 0.13.

dwf · 2012-01-16T20:34:59Z

Just to clarify, I'm not seeing any of these issues in trunk. Only in the version that (unfortunately) shipped with EPD 7.2-1. I also got that behaviour on a 64-bit Mac install of EPD. :(

fperez · 2012-01-16T20:42:23Z

Thanks, @dwf, for the clarification; @anderwm, are you seeing the problem with the current master branch from git?

anderwm · 2012-01-16T22:38:16Z

@fperez Yes, I pulled it from the master again today.

fperez · 2012-01-16T22:44:44Z

OK, thanks for the confirmation. Nasty, since we're not seeing it... I'm hoping @minrk will have one of his epiphanies on this one, since it's in the zmq heart of the beast...

anderwm · 2012-01-16T22:54:58Z

It is a strange one for sure. I tried messing with the heartbeat period (in heartmonitor) and the time_to_dead settings. Although I am admittedly naive about the inner workings, I can't get the messages to even slow down much past 1 sec. If there is anything I can do to help, let me know.

Edit: On further review, HeartMonitor.period does accurately change how long the first death takes; however, after that death there comes a burst of messages. Once you restart it 4-7 times, you get another HeartMonitor.period seconds to execute code. If you make it really big (~100 seconds), the dead kernels come immediately.

minrk · 2012-01-16T23:59:39Z

Wait, HeartMonitor.period? That's in IPython.parallel, and is not used by the notebook. It also makes no sense to me that having a large time_to_dead could cause earlier failures. The very first opportunity to call the heart-failure callback should be at MappingKernelManager.first_beat + MappingKernelManager.time_to_dead seconds.

Can you try this config?

c.MappingKernelManager.first_beat = 10.
c.MappingKernelManager.time_to_dead = 5.

I also pushed to a debughb branch with some extra logging messages that should hopefully help track down the timing of what's going on.

minrk · 2012-01-17T00:03:13Z

@anderwm also, can you confirm your pyzmq and tornado versions?

anderwm · 2012-01-17T00:19:06Z

I told you I was naive. I noticed that the HeartMonitor was in parallel, but I assumed the notebook had similar functionality so used the same file. Obviously I was incorrect. I cannot remember off the top of my head the versions, but they are very likely the most recent ones. I will check for sure when I return to the office(14 hrs from now). Then I will try your configs and pull down your branch. Thanks

minrk · 2012-01-17T00:52:03Z

No worries, it was just concerning to me that you were seeing apparent results from code that is not in use. It is indeed confusing that there is so much duplicate code between IPython.parallel and IPython.zmq, but they were developed together, and IPython.zmq couldn't keep up with the more demanding needs in IPython.parallel. We hope to consolidate these things soon, so there will be less duplication.

anderwm · 2012-01-17T15:36:52Z

zmq.__version__
'2.1.10'
zmq.zmq_version()
'2.1.10'

With configs set as specified above, there is still an immediate death after clicking new notebook. Then it seems to wait 5 seconds before dying again. When it dies, I have to click restart several times before I get the 5 seconds again.
c.MappingKernelManager.first_beat wasn't in the commented out default config file, so I added it. I also tried setting them from the command line and the result was the same. I will try updating zmq, I probably used pre-compiled binaries because this is a 64 bit windows box. I think the latest version should be 2.1.11.

@minrk Where does the log wind up (application.log)? Specifically, the one created normally and added to by your debughb branch.

anderwm · 2012-01-17T15:45:36Z

It was something about the zmq version. After updating to 2.1.11 everything works. Hard to say if it is specific to Windows, 64-bit Python, or what.
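For anyone checking their own install, here is a minimal sketch of a version check. The helper names (`parse_version`, `is_at_least`, `zmq_report`) are hypothetical, and the 2.1.11 threshold is only the version that happened to work on this machine, not a documented minimum. It uses `zmq.__version__` for the pyzmq bindings and `zmq.zmq_version()` for the underlying libzmq.

```python
def parse_version(text):
    """Turn a string like '2.1.11' into a tuple like (2, 1, 11)."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first piece with no leading number
        parts.append(int(digits))
    return tuple(parts)


def is_at_least(found, minimum):
    """Compare two dotted version strings numerically, not as text."""
    return parse_version(found) >= parse_version(minimum)


def zmq_report(minimum="2.1.11"):
    """Describe the installed pyzmq/libzmq, or say that pyzmq is missing."""
    try:
        import zmq
    except ImportError:
        return "pyzmq is not installed"
    libzmq = zmq.zmq_version()
    status = "ok" if is_at_least(libzmq, minimum) else "older than " + minimum
    return "pyzmq %s, libzmq %s (%s)" % (zmq.__version__, libzmq, status)


print(zmq_report())
```

The comparison is done on integer tuples because a plain string comparison would sort '2.1.9' after '2.1.11'.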

fperez · 2012-01-17T18:50:46Z

Mmh, interesting. If it's gone, let's then close this puppy. It's easy enough to reopen if you see the problem again. Thanks for the patient reporting!

anderwm · 2012-01-17T19:15:12Z

No problem, although you might want to specify a version number of zmq in the documentation (sorry if I missed it). Unless nobody else can replicate this and we just chalk it up to a bad zmq install.

kyzyl · 2012-05-09T09:07:49Z

Perhaps this belongs in a new/different issue, but I'm getting this behavior again. I'm using epd 7.2-2, 64bit on the latest stable Ubuntu (x64). I tried updating zmq as per above but the issue persists.

One point where my situation differs is that, upon starting a notebook, it takes about 14s for the kernel to die. Setting a long time_to_dead seems to remove the notification of a dying kernel, but then no commands will execute ("[*]" forever).

I get the same results when using the ipython that ships with epd7.2, the latest version from easy_install -U, and the latest trunk :-/

Side note: The same versions of everything on Win7x64 produce no problems. Everything works.

Thoughts?

osiloke · 2013-07-18T07:11:53Z

Hi, I would like to know why this was closed when the last question was not answered. I'm also getting this issue on Mac OSX; I'm running Ipython==0.13.2 with zmq==3.2.2. I installed zmq and python using homebrew. I also noticed that the ipython notebook startup process takes more than a minute and the debug messages really don't help. It seems to be timing out before it can run the notebook code. I've tried the suggestions above but none of them are working for me.

Carreau · 2013-07-18T09:41:28Z

> Hi, i would like to know why this was closed when the last question was not answered.

The question was asked after the issue was closed, hence issue #1719.

Notebook cleanups and fixes: connection file cleanup, first heartbeat, startup flush.

Kernels would not linger, but the KernelManagers are not garbage-collected on shutdown. This means that connection files for kernels still running at notebook shutdown would not be removed. Now, kernels are explicitly killed at server shutdown, allowing the KernelManagers to cleanup files.

Small changes along the way:

* disables the unnecessary (and actively detrimental) SIGINT handler inherited from the original copy/paste from the qt app.
* put webapp initialization in `init_webapp` out of `initialize`, to preserve convention of there being no unique code in `initialize()`.
* don't warn about http on all interfaces if running in 100% read-only mode, because no login or execution is possible.

Closes ipython#1232.

fperez closed this as completed in e73fe99 Jan 6, 2012

fperez reopened this Jan 6, 2012

fperez closed this as completed Jan 17, 2012

kyzyl mentioned this issue May 10, 2012

Dead kernel loop, redux #1719

Closed

yogeshc mentioned this issue Jan 21, 2013

Kernel restarting after message "Kernel XXXX failed to respond to heartbeat" #2824

Closed
