Dead kernel loop #1232
Mhh, I've seen this on Windows as well, but it seemed like the problem was that startup is slower on Windows than on other systems, and the dead kernel page triggers too fast. I got it to work by closing that notebook altogether (with the 'kill kernel on exit' box checked), and then opening a new one fresh. It seems that once everything had been loaded from disk, the startup was fast enough. However, it's possible that our logic for dead kernels is too aggressive; @minrk, do your recent changes in #1187 help on this front?
This is almost certainly the same issue as addressed by #1187, so I would recommend testing against that. It's possible the changes there aren't enough for some situations, but the timers are tunable via the

@anderwm The reason there's no traceback is that there's probably no actual error. The "kernel died" message is triggered by the kernel's failure to respond to the heartbeat. The cause is most likely that the kernel took too long to start up, and the heartbeat timeout starts counting the instant the kernel process is requested. This is precisely the issue that PR #1187 is meant to address.
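The timing argument above can be illustrated with a toy model. This is only a sketch, not the notebook's actual heartbeat code: `declared_dead`, `startup_delay`, and `grace_period` are hypothetical names, and `time_to_dead` is borrowed from the config option mentioned in this thread. The model assumes a single ping that the kernel answers as soon as it is up, and a monitor that gives up if no reply arrives within `time_to_dead` seconds of sending it.

```python
def declared_dead(startup_delay, time_to_dead, grace_period=0.0):
    """Toy model: would a heartbeat monitor report a healthy kernel as dead?

    startup_delay -- seconds the kernel needs before it can answer pings
    time_to_dead  -- seconds the monitor waits for a reply to a ping
    grace_period  -- hypothetical delay before the first ping is sent
    All times are measured from the moment the kernel process is requested.
    """
    first_ping = grace_period
    deadline = first_ping + time_to_dead
    # The kernel cannot reply before it has started, nor before the ping is sent.
    reply_time = max(first_ping, startup_delay)
    return reply_time > deadline


if __name__ == "__main__":
    # A kernel that needs 4 s to start, checked with a 3 s timeout.
    for grace in (0.0, 5.0):
        print(grace, declared_dead(startup_delay=4.0, time_to_dead=3.0,
                                   grace_period=grace))
```

With no grace period the slow-starting kernel is reported dead even though nothing is wrong with it; delaying the first ping avoids the false alarm. The model says nothing about deaths that occur immediately regardless of the timeout, which is the behaviour reported further down this thread.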
Should we add to #1187 a |
Yes, I think so, unless we want to wait for confirmation that it does indeed address the issue, though I am fairly confident that it does. |
It also looks to me like it. Let's do it, we can always reopen if it proves to persist. I'll merge now with that. |
Great, thanks! |
I don't think this is the same problem, guys. I saw where you merged #1187, so I pulled it down and am now trying it on my work PC. I still get the same burst of dead kernels on startup. After playing with it some more, I have found that if I restart it 10-15 times it will work for a few moments, but even before I try to run a command it will die again and I will have to restart it another 10-15 times. #1187 did fix the problem on my laptop, which is very slow and old compared to this system. It's a quad core, 64-bit, 4 GB RAM...at least better than the junk I normally buy. I also tried upping the time_to_dead config option, to no avail.
Ok, I've reopened it then so we can try to hunt it down... |
Also, the time_to_dead config option doesn't seem to do anything in this case. I can change it to 10, but the dead kernel message still appears immediately after restarting. I checked it in the debugger and it is indeed using the 10 seconds. It will execute code in between kernel deaths (which are somewhat random, ranging from 0.1 to 20 seconds apart) if I am quick enough.
I just encountered this in 0.12 on Linux 32-bit, EPD 7.2. It seems like it's fixed in master, but now the JavaScript seems a little wonky on Chrome (not Firefox, though; odd). I'll file another bug when I know more. |
Oh, also -- it seemed I was only seeing it with |
I get it in Chrome and Firefox, with or without --pylab. Only on this 64-bit Python machine, though.
Just to clarify, I'm not seeing any of these issues in trunk. Only in the |
@fperez Yes, I pulled it from the master again today. |
OK, thanks for the confirmation. Nasty, since we're not seeing it... I'm hoping @minrk will have one of his epiphanies on this one, since it's in the zmq heart of the beast... |
It is a strange one for sure. I tried messing with the heartbeat period (in heartmonitor) and the time_to_dead settings. Although I am admittedly naive about the inner workings, I can't get the messages to even slow down much past 1 sec. If there is anything I can do to help you, let me know. Edit--On further review, HeartMonitor.Period accurately controls how long the first death takes; however, after that death there comes a burst of messages. Once you restart it 4-7 times you get another HeartMonitor.Period seconds to execute code. If you make it really big (~100 seconds), the dead kernels come immediately.
Wait, HeartMonitor.period? That's in IPython.parallel, and is not used by the notebook. It also makes no sense to me that having a large time_to_dead could cause earlier failures. The very first opportunity to call the heart-failure callback should be at Can you try this config?
I also pushed to a debughb branch with some extra logging messages that should hopefully help track down the timing of what's going on. |
@anderwm also, can you confirm your pyzmq and tornado versions? |
I told you I was naive. I noticed that the HeartMonitor was in parallel, but I assumed the notebook had similar functionality, so I used the same file. Obviously I was incorrect. I cannot remember the versions off the top of my head, but they are very likely the most recent ones. I will check for sure when I return to the office (14 hrs from now). Then I will try your configs and pull down your branch. Thanks
No worries, it was just concerning to me that you were seeing apparent results from code that is not in use. It is indeed confusing that there is so much duplicate code between IPython.parallel and IPython.zmq, but they were developed together, and IPython.zmq couldn't keep up with the more demanding needs in IPython.parallel. We hope to consolidate these things soon, so there will be less duplication. |
With configs set as specified above, there is still an immediate death after clicking new notebook. Then it seems to wait 5 seconds before dying again. When it dies, I have to click restart several times before I get the 5 seconds again. @minrk Where does the log wind up (application.log)? Specifically, the one created normally and added to by your debughb branch. |
It was something about the zmq version. After updating to 2.1.11 everything works. Hard to say if it is specific to Windows, 64-bit Python, or what.
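For anyone wanting to confirm which versions they are running before comparing against the 2.1.11 mentioned here, a small check along the following lines may help. It is a sketch only: `parse_version` and `report_versions` are hypothetical helper names, and 2.1.11 is simply the version reported to work in this thread, not a documented minimum. It relies on `zmq.zmq_version()`, `zmq.pyzmq_version()`, and `tornado.version`, and reports `None` for a package that is not installed.

```python
import importlib


def parse_version(text):
    """Turn a string such as '2.1.11' into a comparable tuple (2, 1, 11).

    Parsing stops at the first piece that does not start with a digit;
    trailing non-digit suffixes within a piece (e.g. '11dev') are dropped.
    """
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def report_versions():
    """Collect libzmq, pyzmq and tornado versions; None if not importable."""
    found = {"libzmq": None, "pyzmq": None, "tornado": None}
    try:
        zmq = importlib.import_module("zmq")
        found["libzmq"] = zmq.zmq_version()    # version of the C library
        found["pyzmq"] = zmq.pyzmq_version()   # version of the Python bindings
    except ImportError:
        pass
    try:
        tornado = importlib.import_module("tornado")
        found["tornado"] = tornado.version
    except ImportError:
        pass
    return found


if __name__ == "__main__":
    versions = report_versions()
    print(versions)
    libzmq = versions["libzmq"]
    if libzmq is not None and parse_version(libzmq) < (2, 1, 11):
        print("libzmq is older than the 2.1.11 reported to work in this thread")
```

Comparing tuples rather than strings matters here, because as strings "2.1.9" would sort after "2.1.11".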
Mmh, interesting. If it's gone, let's then close this puppy. It's easy enough to reopen if you see the problem again. Thanks for the patient reporting! |
No problem, although you might want to specify a version number of zmq in the documentation (sorry if I missed it). Unless nobody else can replicate this and we just chalk it up to a bad zmq install.
Perhaps this belongs in a new/different issue, but I'm getting this behavior again. I'm using EPD 7.2-2, 64-bit, on the latest stable Ubuntu (x64). I tried updating zmq as per above, but the issue persists. One point where my situation differs is that upon starting a notebook it takes about 14 s for the kernel to die. Setting a long time_to_dead seems to remove the notification of a dying kernel, but then no commands will execute ("[*]" forever). I get the same results when using the ipython that ships with EPD 7.2, the latest version from easy_install -U, and the latest trunk :-/ Side note: the same versions of everything on Win7 x64 produce no problems. Everything works. Thoughts?
Hi, I would like to know why this was closed when the last question was not answered. I'm also getting this issue on Mac OS X; I'm running ipython==0.13.2 with zmq==3.2.2. I installed zmq and python using homebrew. I also noticed that the ipython notebook startup process takes more than a minute, and the debug messages really don't help. It seems to be timing out before it can run the notebook code. I've tried the suggestions above, but none of them are working for me.
The question was asked after the issue was closed, hence issue #1719.
Notebook cleanups and fixes: connection file cleanup, first heartbeat, startup flush.

Kernels would not linger, but the KernelManagers are not garbage-collected on shutdown. This means that connection files for kernels still running at notebook shutdown would not be removed. Now, kernels are explicitly killed at server shutdown, allowing the KernelManagers to clean up files.

Small changes along the way:

* disables the unnecessary (and actively detrimental) SIGINT handler inherited from the original copy/paste from the qt app.
* put webapp initialization in `init_webapp` out of `initialize`, to preserve convention of there being no unique code in `initialize()`.
* don't warn about http on all interfaces if running in 100% read-only mode, because no login or execution is possible.

Closes ipython#1232.
I have been using the notebook a bit on my home pc with some success, so I am trying to get it running at the office.
64-bit Windows 7 with 64-bit Python installed
Everything works as expected, up to the dashboard page (which opens fine in Chrome)
When I choose a notebook, or choose new notebook, the notebook opens and I get the dead kernel message. When I restart I get another dead kernel, and so forth...all I get in the kernel screen is the following
For somebody like me who usually diagnoses their own stupidity from the error trace, the kernel model makes it difficult.