Dead kernel loop #1232

Closed
anderwm opened this Issue Jan 5, 2012 · 29 comments

7 participants

@anderwm
anderwm commented Jan 5, 2012

I have been using the notebook a bit on my home PC with some success, so I am trying to get it running at the office.

64-bit Windows 7 with 64-bit Python installed
Everything works as expected, up to the dashboard page (which opens fine in Chrome)
When I choose a notebook, or choose new notebook, the notebook opens and I get the dead kernel message. When I restart I get another dead kernel, and so forth. All I get in the kernel screen is the following:

 The IPython Notebook is running at: http://127.0.0.1:8888
 Use Control-C to stop this server and shut down all kernels.
 Using MathJax from CDN
 Kernel started: a902af46-e03b-4fb5-bbd6-7e5a3eb3c81a
 To connect another client to this kernel, use:
 --existing kernel-a902af46-e03b-4fb5-bbd6-7e5a3eb3c81a.json
 Connecting to: tcp://127.0.0.1:63930
 Connecting to: tcp://127.0.0.1:63931
 Connecting to: tcp://127.0.0.1:63933
 Kernel started: 34e69ec9-b6f3-40d9-8424-3b824f8aa441
 Connecting to: tcp://127.0.0.1:64012
 Connecting to: tcp://127.0.0.1:64013
 Connecting to: tcp://127.0.0.1:64015
 To connect another client to this kernel, use:
 --existing kernel-34e69ec9-b6f3-40d9-8424-3b824f8aa441.json
 Kernel started: 98bdfcd3-0a8d-4aea-959c-49219aae41b6
 Connecting to: tcp://127.0.0.1:64049
 Connecting to: tcp://127.0.0.1:64050
 Connecting to: tcp://127.0.0.1:64052

For somebody like me who usually diagnoses their own stupidity from the error trace, the kernel model makes it difficult.

@fperez
Member
fperez commented Jan 6, 2012

Mhh, I've seen this on windows as well, but it seemed like the problem was that startup is slower on windows than on other systems, and the dead kernel page triggers too fast. I got it to work by closing that notebook altogether (with the 'kill kernel on exit' box checked), and then opening a new one fresh. It seems that once everything had been loaded from disk, the startup was fast enough.

However, it's possible that our logic for dead kernels is too aggressive; @minrk, do your recent changes in #1187 help on this front?

@minrk
Member
minrk commented Jan 6, 2012

This is almost certainly the same issue as addressed by #1187, so I would recommend testing against that. It's possible the changes there aren't enough for some situations, but the timers are tunable via the MappingKernelManager.first_beat and MappingKernelManager.time_to_dead traits for slower environments.

@anderwm The reason there's no traceback is that there's probably no actual error. The "kernel died" message is triggered by the kernel's failure to respond to the heartbeat. The cause is most likely that the kernel took too long to start up, and the heartbeat timeout starts counting the instant the kernel process is requested. This is precisely the issue that PR #1187 is meant to address.
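
To illustrate the timing problem described above, here is a toy model of a heartbeat timeout. It is only a sketch and not the notebook's actual code: the `HeartbeatWatch` class, its method names, and the numbers (1 s timeout, 3 s startup) are all hypothetical.

```python
class HeartbeatWatch:
    """Toy heartbeat timeout (hypothetical; not IPython's implementation)."""

    def __init__(self, requested_at, time_to_dead):
        # The clock starts when the kernel process is *requested*,
        # not when the kernel is actually ready to answer.
        self.last_beat = requested_at
        self.time_to_dead = time_to_dead

    def beat(self, now):
        """Record a heartbeat reply received at time `now`."""
        self.last_beat = now

    def is_dead(self, now):
        """True if no reply has arrived within time_to_dead seconds."""
        return now - self.last_beat > self.time_to_dead


watch = HeartbeatWatch(requested_at=0.0, time_to_dead=1.0)

# Assume the kernel needs 3 s to start, so nothing has replied by t = 2 s.
slow_start_flagged = watch.is_dead(now=2.0)

# Once the kernel is up and replying, the same check passes.
watch.beat(now=3.0)
alive_after_beat = not watch.is_dead(now=3.5)
```

In this model a perfectly healthy kernel is reported dead simply because it was still starting, which would explain the missing traceback.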

@fperez
Member
fperez commented Jan 6, 2012

Should we add to #1187 a closes #1232?

@minrk
Member
minrk commented Jan 6, 2012

Yes, I think so, unless we want to wait for confirmation that it does indeed address the issue, though I am fairly confident that it does.

@fperez
Member
fperez commented Jan 6, 2012

It also looks to me like it. Let's do it, we can always reopen if it proves to persist. I'll merge now with that.

@fperez fperez closed this in e73fe99 Jan 6, 2012
@minrk
Member
minrk commented Jan 6, 2012

Great, thanks!

@anderwm
anderwm commented Jan 6, 2012

I don't think this is the same problem, guys. I saw that you merged #1187, so I pulled it down and am now trying it on my work PC. I still get the same burst of dead kernels on startup. After playing with it some more I have found that if I restart it 10-15 times it will work for a few moments, but even before I try to run a command it dies again and I have to restart it another 10-15 times. #1187 did fix the problem on my laptop, which is very slow and old compared to this system. This one is a quad core, 64-bit, 4 GB RAM... at least better than the junk I normally buy. I also tried upping the time_to_dead config option, to no avail.

@fperez fperez reopened this Jan 6, 2012
@fperez
Member
fperez commented Jan 6, 2012

Ok, I've reopened it then so we can try to hunt it down...

@anderwm
anderwm commented Jan 10, 2012

Also, the time_to_dead config option doesn't seem to do anything in this case: I can change it to 10, but the dead kernel is immediate upon restarting. I checked in the debugger and it is indeed using the 10 seconds.

It will execute code in between kernel deaths (which come at somewhat random intervals, ranging from 0.1 to 20 seconds) if I am quick enough.

@dwf
Contributor
dwf commented Jan 14, 2012

I just encountered this in 0.12 on Linux 32-bit, EPD 7.2. It seems like it's fixed in master, but now the JavaScript seems a little wonky on Chrome (not Firefox, though; odd). I'll file another bug when I know more.

@dwf
Contributor
dwf commented Jan 14, 2012

Oh, also -- it seemed I was only seeing it with --pylab or --pylab=inline. The notebook without --pylab worked fine.

@anderwm
anderwm commented Jan 16, 2012

I get it in Chrome and Firefox, with or without --pylab. Only on this 64-bit Python machine, though.

@fperez
Member
fperez commented Jan 16, 2012

@minrk, do you think this could be somehow related to the messaging glitches you've been investigating in #1266? I haven't seen this problem even once, so I'm kind of stumped as to what could be causing it... Still, I made it high priority, it would be good to get to the bottom of this before 0.13.

@dwf
Contributor
dwf commented Jan 16, 2012

Just to clarify, I'm not seeing any of these issues in trunk. Only in the version that (unfortunately) shipped with EPD 7.2-1. I also got that behaviour on a 64-bit Mac install of EPD. :(

@fperez
Member
fperez commented Jan 16, 2012

Thanks, @dwf, for the clarification; @anderwm, are you seeing the problem with the current master branch from git?

@anderwm
anderwm commented Jan 16, 2012

@fperez Yes, I pulled it from the master again today.

@fperez
Member
fperez commented Jan 16, 2012

OK, thanks for the confirmation. Nasty, since we're not seeing it... I'm hoping @minrk will have one of his epiphanies on this one, since it's in the zmq heart of the beast...

@anderwm
anderwm commented Jan 16, 2012

It is a strange one for sure. I tried messing with the heartbeat period (in heartmonitor) and the time_to_dead settings. Although I am admittedly naive about the inner workings, I can't get the messages to slow down much past 1 second. If there is anything I can do to help, let me know. Edit: On further review, HeartMonitor.Period accurately changes how long the first death takes; however, after that death there comes a burst of messages. Once you restart it 4-7 times you get another HeartMonitor.Period seconds to execute code. If you make it really big (~100 seconds), the dead kernels come immediately.

@minrk
Member
minrk commented Jan 16, 2012

Wait, HeartMonitor.period? That's in IPython.parallel, and is not used by the notebook. It also makes no sense to me that having a large time_to_dead could cause earlier failures. The very first opportunity to call the heart-failure callback should be at MappingKernelManager.first_beat + MappingKernelManager.time_to_dead seconds.

Can you try this config?

c.MappingKernelManager.first_beat = 10.
c.MappingKernelManager.time_to_dead = 5.

I also pushed to a debughb branch with some extra logging messages that should hopefully help track down the timing of what's going on.
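
As a rough sketch of the timing described above (not the notebook's real implementation), the check below ignores silence during the `first_beat` grace period and only then starts counting `time_to_dead`. The function name `kernel_flagged_dead` and its arguments are hypothetical; only the two trait names and the values 10 and 5 come from the suggested config.

```python
def kernel_flagged_dead(now, started_at, last_beat, first_beat, time_to_dead):
    """Hypothetical model of the heart-failure check.

    All times are in seconds. `last_beat` is the time of the most recent
    heartbeat reply, or `started_at` if the kernel has never replied.
    """
    if now - started_at < first_beat:
        return False  # still inside the startup grace period
    # Silence only counts from the end of the grace period onwards.
    reference = max(last_beat, started_at + first_beat)
    return now - reference >= time_to_dead


FIRST_BEAT = 10.0    # c.MappingKernelManager.first_beat
TIME_TO_DEAD = 5.0   # c.MappingKernelManager.time_to_dead

# A kernel that never replies cannot be flagged before 10 + 5 = 15 s.
flagged_at_14 = kernel_flagged_dead(14.0, 0.0, 0.0, FIRST_BEAT, TIME_TO_DEAD)
flagged_at_15 = kernel_flagged_dead(15.0, 0.0, 0.0, FIRST_BEAT, TIME_TO_DEAD)
```

Under this model, a death reported immediately after clicking "new notebook" cannot be explained by the timers alone, which points at something other than slow startup.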

@minrk
Member
minrk commented Jan 17, 2012

@anderwm also, can you confirm your pyzmq and tornado versions?

@anderwm
anderwm commented Jan 17, 2012

I told you I was naive. I noticed that the HeartMonitor was in parallel, but I assumed the notebook had similar functionality, so I used the same file. Obviously I was incorrect. I cannot remember the versions off the top of my head, but they are very likely the most recent ones. I will check for sure when I return to the office (14 hrs from now). Then I will try your configs and pull down your branch. Thanks

@minrk
Member
minrk commented Jan 17, 2012

No worries, it was just concerning to me that you were seeing apparent results from code that is not in use. It is indeed confusing that there is so much duplicate code between IPython.parallel and IPython.zmq, but they were developed together, and IPython.zmq couldn't keep up with the more demanding needs in IPython.parallel. We hope to consolidate these things soon, so there will be less duplication.

@anderwm
anderwm commented Jan 17, 2012

zmq.__version__
'2.1.10'
zmq.zmq_version()
'2.1.10'

With configs set as specified above, there is still an immediate death after clicking new notebook. Then it seems to wait 5 seconds before dying again. When it dies, I have to click restart several times before I get the 5 seconds again.
c.MappingKernelManager.first_beat wasn't in the commented-out default config file, so I added it. I also tried setting them from the command line and the result was the same. I will try updating zmq; I probably used pre-compiled binaries because this is a 64-bit Windows box. I think the latest version should be 2.1.11.

@minrk Where does the log wind up (application.log)? Specifically, the one created normally and added to by your debughb branch.

@anderwm
anderwm commented Jan 17, 2012

It was something about the zmq version. After updating to 2.1.11 everything works. Hard to say if it is specific to Windows, 64-bit Python, or what.

@fperez
Member
fperez commented Jan 17, 2012

Mmh, interesting. If it's gone, let's then close this puppy. It's easy enough to reopen if you see the problem again. Thanks for the patient reporting!

@fperez fperez closed this Jan 17, 2012
@anderwm
anderwm commented Jan 17, 2012

No problem, although you might want to specify a version number of zmq in the documentation (sorry if I missed it). Unless nobody else can replicate this and we just chalk it up to a bad zmq install.
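
A version check along these lines could flag an old zmq early. This is only a sketch: the helper names are hypothetical, and the 2.1.11 minimum is taken from this one report, not from any documented requirement.

```python
def version_tuple(version):
    """Turn '2.1.10' into (2, 1, 10) so versions compare numerically.

    Comparing the raw strings would wrongly rank '2.1.10' below '2.1.9'.
    """
    return tuple(int(part) for part in version.split("."))


def zmq_is_new_enough(installed, minimum="2.1.11"):
    """True if the installed libzmq version is at least `minimum`."""
    return version_tuple(installed) >= version_tuple(minimum)


# With pyzmq installed, the installed version string would come from
# zmq.zmq_version(); plain strings are used here so the sketch runs anywhere.
old_ok = zmq_is_new_enough("2.1.10")
new_ok = zmq_is_new_enough("2.1.11")
```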

@kyzyl
kyzyl commented May 9, 2012

Perhaps this belongs in a new/different issue, but I'm getting this behavior again. I'm using EPD 7.2-2, 64-bit, on the latest stable Ubuntu (x64). I tried updating zmq as per above but the issue persists.

One point where my situation differs is that upon starting a notebook it takes about 14 s for the kernel to die. Setting a long time_to_dead seems to remove the notification of a dying kernel, but then no commands will execute ("[*]" forever).

I get the same results when using the ipython that ships with epd7.2, the latest version from easy_install -U, and the latest trunk :-/

Side note: The same versions of everything on Win7x64 produce no problems. Everything works.

Thoughts?

@osiloke
osiloke commented Jul 18, 2013

Hi, I would like to know why this was closed when the last question was not answered. I'm also getting this issue on Mac OS X; I'm running IPython==0.13.2 with zmq==3.2.2. I installed zmq and Python using Homebrew. I also noticed that the IPython notebook startup process takes more than a minute, and the debug messages really don't help. It seems to be timing out before it can run the notebook code. I've tried the suggestions above but none of them are working for me.

@Carreau
Member
Carreau commented Jul 18, 2013

> Hi, I would like to know why this was closed when the last question was not answered.

The question was asked after the issue was closed, hence issue #1719.
