Blocker: Qt console broken after "all magics" menu became dynamic #1057

Closed
fperez opened this Issue Nov 27, 2011 · 17 comments

Projects

None yet

4 participants

@fperez
Member
fperez commented Nov 27, 2011

It turns out that the merge of #956 introduced a serious, major problem with the Qt console. Unfortunately it's intermittent, which is why we didn't notice it during review, but we do need to fix it.

Symptoms: sometimes, perhaps one out of every four or five invocations, the qt console will not display the first prompt. It shows the banner but the input prompt never appears. The only solution is to kill it and try again, hoping for better. From a git bisection I did, it appears the first bad commit was e9402d7. The way to test this is simply to reset to that commit, make sure you remove all .pyc files just in case and then try opening multiple times the qt console. Eventually, you'll see an instance where it comes out with no input prompt.

@Carreau, I hope you'll be able to figure out a solution for this one... Some detective work by @takluyver pinpointed the issue somewhat, here's our IRC log:

<takluyver> with the Qt one, I think it's to do with the order of requests
<fperez__> mh
<takluyver> there are now 3 requests getting made during startup
<takluyver> one for the prompt number, one for the history, and the one that commit added for the list of magics, which is triggered by the first reply on the shell channel.
<fperez__> yes
<fperez__> is the callback triggering the problem?
<takluyver> Not quite sure
<takluyver> when I tried putting print statements in, the problem stopped happening
<takluyver> I also can't get it to happen with --plain or --pure
<takluyver> damn thing's quantum, just observing it affects it ;)
<fperez__> it's some kind of race condition, so the smallest change to the code will make it go away.
<fperez__> yes, they're called heisenbugs
<takluyver> aha
<fperez__> the very worst kind.

In any case, this is a major blocker. If a full solution isn't found, we'll have to revert the dynamic magic menu out altogether, but I hope we can find a fix.

@takluyver
Member

Using the commit mentioned above, I'm finding the frequency is a bit lower than I seemed to be getting yesterday. It also seems like it happens a bit more often when the memory is full (i.e. after loading a load of big applications), although this could be a red herring. Yesterday when I tried, I couldn't get it to happen when I had both cores running makework.

@fperez fperez added a commit that referenced this issue Nov 28, 2011
@fperez fperez Temporary fix to work around #1057.
Basically it reverts the effect of #956 and goes back to a static list
for the 'all magics' menu.  I tried to mark very clearly the new code
so we can disable it once a proper fix for #1057 is committed, but we
can't have a broken qt console in master.
65546bf
@fperez
Member
fperez commented Nov 28, 2011

I've just committed to master a temporary fix that effectively disables #956 without actually deleting @Carreau's code. I simply copied back the old static list and commented out the call to the new dynamic function. @Carreau, if you can find a proper fix, start off this and simply delete the code I added in 65546bf, I made it be a minimal amount that only fixed the current problem and nothing else.

With this workaround, if push comes to shove we could release 0.12 as-is and push this issue back to 0.13. I'd prefer to have a proper fix, but now at least we don't have a completely broken qt console in master.

For that reason I'm downgrading the priority to just 'high', as it's not really a blocker anymore. But I'd much prefer to have a real fix than this band-aid I just put in. @Carreau, let us know if we can help in some way.

@Carreau
Member
Carreau commented Nov 28, 2011

Hum, I'm totally unable to reproduce this bug (on my debian at least, i'll try on mac). even after 25 launch in a row.
I'm ok to try 100, maybe the 25 last will all crash, we never know with randomness.

I've already connect the action to kernel firstResponse to avoid this bug that otherwise appear 100% of the time. So maybe it's a deeper issue.
I'm not familiar with the kerner startup internals. Do you think it can get confuse if it get the request for the list of magic too early ? Does the bug still appear if you change get_ipython().lsmagic() (mainwindow.py L619) to a more simple expression like ["item1","item2"] ?

Do you think just setting a timer after which updating the menu might be enough or do you want something more specific.
like updating after history request...?

I can also propose to leave the "static" menu at startup but put a 'update menu' at the top of it (ie, just don't trigger the auto update as startup).

@fperez
Member
fperez commented Nov 28, 2011

On Sun, Nov 27, 2011 at 11:41 PM, Bussonnier Matthias
reply@reply.github.com
wrote:

Hum, I'm totally unable to reproduce this bug (on my debian at least, i'll try on mac). even after 25 launch in a row.
I'm ok to try 100, maybe the 25 last will all crash, we never know with randomness.

Note that I did add this afternoon a fix to master that effectively
undid some of your changes. So right now, master doesn't have the
problem. See this commit for details:

65546bf

You would need to re-activate that code to see the problem. It was
happening frequently enough for me that I really had to turn it off,
as this was on master that people do update from.

I've already connect the action to kernel firstResponse to avoid this bug that otherwise appear 100% of the time. So maybe it's a deeper issue.
I'm not familiar with the kerner startup internals. Do you think it can get confuse if it get the request for the list of magic too early ?  Does the bug still appear if you change get_ipython().lsmagic() (mainwindow.py L619) to a more simple expression like ["item1","item2"] ?

Do you think just setting a timer after which updating the menu might be enough or do you want something more specific.
like updating after history request...?

No, timers are never the solution to race conditions. The issue is
one of logic, and adding a timer simply makes the problem harder to
trigger and thus harder to fix, but it doesn't really fix it.

I can also propose to leave the "static" menu at startup but put a 'update menu' at the top of it (ie, just don't trigger the auto update as startup).

That's probably a good idea.

Thanks for the quick response!

@Carreau
Member
Carreau commented Nov 28, 2011

@fperez
I did test without updating to master, and I tempered with the code to try to reproduce it, so i would have noticed your fix.

Actually, If a dirrectly trigger update_magic_menu() even before kernel's first response, and changed the request kernel from get_ipython().lsmagic() to a static list, then the bug is gone. So my guess is that there might be a deeper issue in a 'too early' request for the list of magics.

I'll try to get it not work on Ubuntu at work....

And BTW, this afternoon is durring the night for me :-)

@fperez
Member
fperez commented Nov 28, 2011

On Sun, Nov 27, 2011 at 11:58 PM, Bussonnier Matthias
reply@reply.github.com
wrote:

@fperez
I did test without updating to master, and I tempered with the code to try to reproduce it, so i would have noticed your fix.

Ah, ok.

Actually, If a dirrectly trigger update_magic_menu() even before kernel's first response, and changed the request kernel from get_ipython().lsmagic() to a static list, then the bug is gone. So my guess is that there might be a deeper issue in a 'too early' request for the list of magics.

Yes, that's probably it. @minrk, have you seen this problem, or do
you have perhaps an inkling of what could be going wrong here
regarding the sequence of initial requests to the kernel?

I'll try to get it not work on Ubuntu at work....

Today when I fixed it, it was to the point where I simply couldn't
start the console at all. Every attempt to open it gave me a stuck
console.

And BTW, this afternoon is durring the night for me :-)

Yes, I think you're in Europe, right? But why do you point this out,
did I say something about the afternoon before?

@Carreau
Member
Carreau commented Nov 28, 2011

@fperez

Note that I did add this afternoon a fix to master that effectively[...]

So yeah, you kind of mentioned it... :-)

@fperez
Member
fperez commented Nov 28, 2011

On Mon, Nov 28, 2011 at 12:06 AM, Bussonnier Matthias
reply@reply.github.com
wrote:

So yeah, you kind of mentioned it... :-)

Ah, yes. I just said that because I figured the changes had happened
while you were asleep, and thought you might have pulled them in
already, that's all...

@Carreau
Member
Carreau commented Nov 28, 2011

I know, don't worry, I was just messing with you. Now it's my turn to be awake while most of you will be asleep if I'm right.

@fperez
Member
fperez commented Nov 28, 2011

On Mon, Nov 28, 2011 at 12:14 AM, Bussonnier Matthias
reply@reply.github.com
wrote:

I know, don't worry, I was just messing with you. Now it's my turn to be awake while most of you will be asleep if I'm right.

yes, I'm clocking out now for some sleep... Bye!

@Carreau
Member
Carreau commented Nov 28, 2011

I think I got it... (the why, not the how to fix it)
in frontendwidget.py I wrote

    self._request_info['execute'] = self._ExecutionRequest(msg_id, 'silent_exec_callback')

Instead of "appending" another execute request. Then the _handle_execute_callback doesn't see the previous one as assigned to it ...But _ExecutionRequest is a named tuple and the current logic seem to not allow(not be written for ?) multiple execution request waiting. (I'm not quite familiar with the whole logic and named tuple, so I may be wrong)

I might not have have time to deeply look at it during my day/your night. I'll try to make a better "Quick-fix" and try to modifie the current code to handle several multiple execution request waiting. There might be something I haven't understand, and I 'll appreciate some advice in the right way to handle that.

@Carreau
Member
Carreau commented Nov 28, 2011

@fperez, @takluyver
could you try this branch [ qtconsole-racecondition ] and tell me if it fixes the race condition for you ?
If so I'll clean it up tomorow (for me), and send a PR.
It basically slightly complicate the _request_info['execute'] handeling by making it a list which stores several waiting execution request.
I know there are still print statement in a few places, I'll remove and rebase later, same for changings the lists to mabe more adapted container.

@minrk
Member
minrk commented Nov 28, 2011

@carreau - you are correct, the qtconsole does not allow multiple outstanding execution requests, and does not protect itself by preventing them. This is fine if there are no background operations, but now that we have some, it obviously isn't.

I think your approach has the right idea, but I would use a dict by msg_id instead of a list. msg_ids are unique, so there is no worry about duplication.

@takluyver
Member

I checked out the commit after you reverted @fperez's temporary fix, but before your fix, to check that I could still reproduce the bug. Unfortunately, I can't reproduce it after more than 40 tries, so I won't be able to tell if your fix makes a difference. Hopefully @fperez can check it.

The problem and the approach makes sense. I agree with Min that a dictionary is preferable.

@fperez
Member
fperez commented Nov 29, 2011

On Mon, Nov 28, 2011 at 3:26 PM, Thomas
reply@reply.github.com
wrote:

Hopefully @fperez can check it.

I'll try to check tonight with my home machine, which is where I could
reliably reproduce the problem. It's an extremely fast desktop, and
that may be why it happens so easily on that system. Slower boxes may
have enough builtin delays to mask it; I can't see it on my laptop,
for example.

Thanks everyone for working on this one, we'll get to the bottom of it.

@Carreau
Member
Carreau commented Nov 29, 2011

Won't have much time to ti re-write it today, and i'll do it with more suitable container.
If we start storing message id in dictionary, does it still make sens to have several _handle_xxx_reply ?
couldn't we just have a _handle_reply and switch code branch depending on what is poped from the dict ?

@Carreau Carreau added a commit to Carreau/ipython that referenced this issue Nov 29, 2011
@Carreau Carreau Revert "Temporary fix to work around #1057."
    This reverts commit 65546bf, done to
    temporaly fixed a race condition introduced by #956, next commits should
    fixe this race condition
4573b16
@Carreau
Member
Carreau commented Nov 29, 2011

Ok, fixed, with dictionnary, (rebased, force pushed ) several times because I'm a moron, and also to put space around equal sign...

There is some redundent info as now, _ExecutionRequest namedTuple not only contains msg_id and kind , but are stored in a dictionnary with msg_id as key...
Just get a question on this else:
https://github.com/ipython/ipython/blob/master/IPython/frontend/qt/console/ipython_widget.py#L173
why is it aligned with if info.kind and not if info , like here :
https://github.com/ipython/ipython/blob/master/IPython/frontend/qt/console/frontend_widget.py#L403

@Carreau Carreau added a commit to Carreau/ipython that referenced this issue Nov 29, 2011
@Carreau Carreau qtconsole: fix race-cond, handle multiple exec
	fix race condition in qtconsole due to a race condition beween the population
	of the magic menu and the first prompt request. Resolved by upgrading the
	logic of the console to handle several executions request in parallel.

	Some namedTuple might still carry redundent information but the fix is the
	first priority

	fixes #1057 , introduced by #956
8553345
@fperez fperez closed this in a5fd0a3 Nov 30, 2011
@mattvonrocketstein mattvonrocketstein pushed a commit to mattvonrocketstein/ipython that referenced this issue Nov 3, 2014
@fperez fperez Temporary fix to work around #1057.
Basically it reverts the effect of #956 and goes back to a static list
for the 'all magics' menu.  I tried to mark very clearly the new code
so we can disable it once a proper fix for #1057 is committed, but we
can't have a broken qt console in master.
c725d78
@mattvonrocketstein mattvonrocketstein pushed a commit to mattvonrocketstein/ipython that referenced this issue Nov 3, 2014
@Carreau Carreau Revert "Temporary fix to work around #1057."
    This reverts commit 65546bf, done to
    temporaly fixed a race condition introduced by #956, next commits should
    fixe this race condition
bdbf575
@mattvonrocketstein mattvonrocketstein pushed a commit to mattvonrocketstein/ipython that referenced this issue Nov 3, 2014
@Carreau Carreau qtconsole: fix race-cond, handle multiple exec
	fix race condition in qtconsole due to a race condition beween the population
	of the magic menu and the first prompt request. Resolved by upgrading the
	logic of the console to handle several executions request in parallel.

	Some namedTuple might still carry redundent information but the fix is the
	first priority

	fixes #1057 , introduced by #956
f447bb0
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment