Handle the edge cases where an eventqueue method panics #12903
Conversation
…router running with a thread that will never update. Kill the router instead and let it get restarted by the kubelet.
|
[test] |
|
Evaluated for origin test up to c7eca1f |
LGTM. I think this is the least bad option given the timing.
|
What will happen in the integration tests? Will we continue flaking?
|
|
This sounds like we're going to ship a router that can crash randomly at
any minute to customers?
|
|
Have we observed this in the wild on haproxy or just on f5? |
|
Clayton: Unfortunately, you are right. But have you seen crashes from the F5 anywhere other than the one that QA found? Based on Ram's investigation of the router code, nothing has changed in how we call the event queue code. So we have always shipped a dodgy router. We plan to fix this in 3.6 and, if it's not too scary, we can backport. |
|
@smarterclayton Not in the wild. Only once on the F5 from QA and they could not reproduce. Ram can reliably tickle the error in the queue with code designed to trigger it... but that's it. |
|
So let's use the reproducer and bisect to when this was introduced. Start
from v1.4.0.
I'd like to know why event queue is broken before we work around it.
|
|
continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin_future/64/) (Base Commit: 898266a) |
|
@smarterclayton I just ran the contrived test case I had: https://gist.github.com/ramr/38423bad348846b743fcd8afba7533dd against checkouts from tags v1.{2,3,4}.0 and it failed on all those versions. So this issue exists in the previous releases. I will try running the stress test I had with those different router versions - maybe a couple.
But that said, QE has not reproduced this issue and saw it just once with the F5 router (and not with the haproxy router). In the event this does happen, isn't killing the router off better than having a router which will never receive any other events? Note that at this point the goroutine handling the events is dead in the water, so the router won't get any more events when this does occur.
And as regards the integration test failures, that issue was actually related to the way the test code was sending the events. That said, yes, if it does occur here, it will kill the container and the test will fail. But that is already going to happen - because with the current code, when we hit this issue with the event queue, the goroutine handling the events dies, the router gets no more updates/deletes/adds, and the tests fail in any case. |
|
Ok, then I'm ok with this edge case. Just wanted to be 100% certain
|
|
Hmm, a little bit of weirdness. It happens on a stress test I ran on v1.3.0 as well, but this was in the logs (the leading numbers are line numbers). So it seems to recover from it on 1.3.0. Let me see what happens on the current release branch. Maybe we don't need this PR at all. |
|
Actually, on further testing with 1.5.0-alpha.2 as well as HEAD of master, the same recovery as in 1.3.0 occurs. So I don't think we need this PR. @knobunc FYI - this is not needed because we recover from that failure. Not sure what's restarting that thread - probably something in the k8s cache/reflector code. |
|
Reopening because this is apparently not gracefully handled (for discussion) |
|
And is happening in haproxy |
|
Just as an update on the status of this. @ramr is no longer working on the team, so we are trying to reconstruct where this was since it has occurred in production systems and on haproxy. We are still looking at this approach and hope to have something ready soon. |
|
hold on - if this is an actual problem with the event queue, what's the status on verifying that? |
|
[merge] |
|
Evaluated for origin merge up to c7eca1f |
|
continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_origin/170/) (Base Commit: d08d28a) (Image: devenv-rhel7_6093) |
Temporary band-aid fix for bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1419771
Since it was a bit too late to switch the event queue code over to a work queue, this is a defensive fix if we want to put it in.
When the event queue panics (see the test cases below), don't leave the router process running, because that router instance will never get any events. Instead, kill the router and let it get restarted.
@knobunc @smarterclayton PTAL Thx
Couldn't reproduce the bug but I could simulate it via this test: https://gist.github.com/ramr/38423bad348846b743fcd8afba7533dd
And this test case/script also causes the same issue to manifest itself. You may have to run it a few times but normally within 2-3 attempts I can simulate it.
Test case: https://gist.github.com/ramr/58dbdc3c5982db7b3c3154eb4bca60c8
$ ./reproduce-eq-panic.sh [<route-json-yaml-file>]