
Handle the edge cases where an eventqueue method panics #12903

Merged (1 commit, Mar 21, 2017)

Conversation

@ramr (Contributor) commented Feb 10, 2017

Temporary band-aid fix for bug: https://bugzilla.redhat.com/show_bug.cgi?id=1419771

Since it is a bit too late to switch the event queue code over to a work queue, this is a defensive fix if we want to put it in.

When the event queue panics (see the test cases below), don't leave the router process running, because that router instance will never receive any more events. Instead, kill the router and let it get restarted.
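The defensive approach described above can be sketched like this. This is an illustrative outline, not the actual patch; `runEventLoop` and the injectable exit hook are hypothetical names introduced only for the sketch:

```go
package main

import (
	"fmt"
	"os"
)

// runEventLoop runs the (hypothetical) event-queue consumer and, if the
// queue code panics, terminates the process instead of leaving a router
// around that will never see another event. The exit hook is injectable
// purely so the behavior can be exercised without really exiting.
func runEventLoop(process func(), exit func(code int)) {
	defer func() {
		if r := recover(); r != nil {
			fmt.Fprintf(os.Stderr, "event queue panicked: %v; exiting so the router gets restarted\n", r)
			exit(1)
		}
	}()
	process()
}

func main() {
	// Simulate the eventqueue panic from the bug report; in the real
	// router the exit hook would simply be os.Exit.
	runEventLoop(func() {
		panic("Invalid state transition: DELETED -> ADDED")
	}, func(code int) {
		fmt.Println("would exit with code", code)
	})
}
```

With `os.Exit` wired in, the container dies on the first event-queue panic and the kubelet's restart policy brings up a fresh router, which is the trade-off this PR proposes.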

@knobunc @smarterclayton PTAL Thx

Couldn't reproduce the bug but I could simulate it via this test: https://gist.github.com/ramr/38423bad348846b743fcd8afba7533dd

And this test case/script also causes the same issue to manifest. You may have to run it a few times, but normally I can reproduce it within 2-3 attempts.
Test case: https://gist.github.com/ramr/58dbdc3c5982db7b3c3154eb4bca60c8
$ ./reproduce-eq-panic.sh [<route-json-yaml-file>]

Commit: …router running with a thread that will never update. Kill the router instead and let it get restarted by the kubelet.
@ramr (Contributor, Author) commented Feb 10, 2017

[test]

@openshift-bot (Contributor)

Evaluated for origin test up to c7eca1f

@knobunc (Contributor) left a comment


LGTM. I think this is the least bad option given the timing.

@smarterclayton (Contributor) commented Feb 10, 2017 via email

@smarterclayton (Contributor)
Have we observed this in the wild on haproxy or just on f5?

@knobunc (Contributor) commented Feb 10, 2017

Clayton: Unfortunately, you are right. But have you seen crashes from the F5 anywhere other than the one that QA found? Based on Ram's investigation of the router code, nothing has changed in how we call the event queue code, so we have always shipped a dodgy router. We plan to fix this in 3.6, and if it's not too scary we can backport.

@knobunc (Contributor) commented Feb 10, 2017

@smarterclayton Not in the wild. Only once on the F5 from QA and they could not reproduce. Ram can reliably tickle the error in the queue with code designed to trigger it... but that's it.

@smarterclayton (Contributor) commented Feb 10, 2017 via email

@openshift-bot (Contributor)

continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin_future/64/) (Base Commit: 898266a)

@ramr (Contributor, Author) commented Feb 10, 2017

@smarterclayton I just ran the contrived test case I had (https://gist.github.com/ramr/38423bad348846b743fcd8afba7533dd) against checkouts from tags v1.{2,3,4}.0, and it failed on all those versions. So this issue exists in the previous releases.

I will try running the stress test I had with those different router versions - maybe a couple.

That said, QE has not reproduced this issue and saw it just once with the F5 router (and not with the haproxy router). In the event this does happen, isn't killing the router off better than keeping a router that will never receive any other events? Note that at that point the goroutine handling the events is dead in the water, so the router won't get any more events when this occurs.

As regards the integration test failures, that issue was actually related to the way the test code was sending the events. That said, if this does occur here, it will kill the container and the test will fail. But that is already going to happen: with the current code, when we hit this issue the goroutine handling the events dies, the router stops getting updates/deletes/adds, and the tests fail in any case.

@smarterclayton (Contributor) commented Feb 10, 2017 via email

@ramr (Contributor, Author) commented Feb 10, 2017

hmm, a little bit of weirdness. It happens on a stress test I ran against v1.3.0 as well, but this was in the logs:

 594 I0210 03:44:00.919609       1 status.go:252] admit: admitting route by updating status: wildcard-edge-redirect-route (true): wild1.edge.header.test
 595 E0210 03:44:00.983044       1 runtime.go:52] Recovered from panic: "Invalid state transition: DELETED -> ADDED" (Invalid state transition: DELETED -> ADDED)
 596 /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/runtime/runtime.go:58
 597 /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/runtime/runtime.go:51
 598 /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/runtime/runtime.go:41
 599 /usr/local/go/src/runtime/asm_amd64.s:472
 600 /usr/local/go/src/runtime/panic.go:443
 601 /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/pkg/client/cache/eventqueue.go:128

...

 610 I0210 03:44:01.020982       1 status.go:291] skipping route: wildcard-edge-redirect-route
 611 I0210 03:44:01.021222       1 controller.go:97] Processing Route: wildcard-edge-redirect-route -> header-test-insecure

The leading numbers are log line numbers. So it seems to recover from it on 1.3.0; let me see what happens on the current release branch. Maybe we don't need this PR at all.

@ramr (Contributor, Author) commented Feb 10, 2017

Actually, on further testing with 1.5.0-alpha.2 as well as with HEAD on master, the same recovery as on 1.3.0 occurs. So I don't think we need this PR.

I0210 04:53:12.871252       1 router.go:665] Adding route default/wildcard-edge-redirect-route
E0210 04:53:12.951804       1 runtime.go:64] Observed a panic: "Invalid state transition: DELETED -> ADDED" (Invalid state transition: DELETED -> ADDED)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/runtime/runtime.go:70
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/runtime/runtime.go:63
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/runtime/runtime.go:49
/usr/local/go/src/runtime/asm_amd64.s:479
/usr/local/go/src/runtime/panic.go:458
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/pkg/client/cache/eventqueue.go:132
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/pkg/client/cache/eventqueue.go:196
...
 - HAProxy port 1936 health check ok : 0 retry attempt(s).
I0210 04:53:13.003126       1 controller.go:164] Processing Route: wildcard-edge-redirect-route -> header-test-insecure
I0210 04:53:13.003198       1 controller.go:165]            Alias: wild1.edge.header.test

@knobunc FYI - this is not needed because we recover from that failure. Not sure what's restarting that thread - probably something in k8s cache/reflector code.

@ramr ramr closed this Feb 10, 2017
@smarterclayton (Contributor)

Reopening because this is apparently not gracefully handled (for discussion)

@smarterclayton (Contributor)

And it is happening in haproxy

@knobunc (Contributor) commented Mar 17, 2017

Just as an update on the status of this: @ramr is no longer working on the team, so we are trying to reconstruct where this stood, since it has occurred in production systems and on haproxy. We are still looking at this approach and hope to have something ready soon.

@smarterclayton (Contributor)

hold on - if this is an actual problem with the event queue, what's the status on verifying that?

@knobunc (Contributor) commented Mar 21, 2017

[merge]

@openshift-bot (Contributor)

Evaluated for origin merge up to c7eca1f

@openshift-bot (Contributor) commented Mar 21, 2017

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_origin/170/) (Base Commit: d08d28a) (Image: devenv-rhel7_6093)
