Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Flake] TestPriorityAndFairnessWithPanicRecoveryAndTimeoutFilter flaking #111154

Closed
liggitt opened this issue Jul 14, 2022 · 14 comments · Fixed by #111162
Closed

[Flake] TestPriorityAndFairnessWithPanicRecoveryAndTimeoutFilter flaking #111154

liggitt opened this issue Jul 14, 2022 · 14 comments · Fixed by #111162
Assignees
Labels
kind/flake Categorizes issue or PR as related to a flaky test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@liggitt
Copy link
Member

liggitt commented Jul 14, 2022

Which jobs are flaking?

pull-kubernetes-unit

Which tests are flaking?

TestPriorityAndFairnessWithPanicRecoveryAndTimeoutFilter

Since when has it been flaking?

7/14

Testgrid link

https://testgrid.k8s.io/presubmits-kubernetes-blocking#pull-kubernetes-unit&include-filter-by-regex=TestPriorityAndFairnessWithPanicRecoveryAndTimeoutFilter&width=20

Reason for failure (if possible)

https://storage.googleapis.com/k8s-triage/index.html?pr=1&test=TestPriorityAndFairnessWithPanicRecoveryAndTimeoutFilter

Anything else we need to know?

No response

Relevant SIG(s)

/sig api-machinery

@liggitt liggitt added the kind/flake Categorizes issue or PR as related to a flaky test. label Jul 14, 2022
@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 14, 2022
@liggitt liggitt changed the title TestPriorityAndFairnessWithPanicRecoveryAndTimeoutFilter [Flake] TestPriorityAndFairnessWithPanicRecoveryAndTimeoutFilter flaking Jul 14, 2022
@liggitt
Copy link
Member Author

liggitt commented Jul 14, 2022

/priority important-soon

@k8s-ci-robot k8s-ci-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Jul 14, 2022
@leilajal
Copy link
Contributor

/cc @lavalamp
/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 14, 2022
@aojea
Copy link
Member

aojea commented Jul 14, 2022

@MikeSpreitzer and @wojtek-t can this be related to #110104 ?

@MikeSpreitzer
Copy link
Member

@aojea: I see no direct relationship, but hey, anything's possible. We learned that with the misery in integration test failures when we merged not-really-guilty other stuff.

@MikeSpreitzer
Copy link
Member

The expanded logging turned up some new evidence. See https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/111162/pull-kubernetes-unit/1547751720010911744 , where the complaint is now

    priority-and-fairness_test.go:1065: Expected request: "/request/second/enqueued-as-designed" to get a response, but got error: &url.Error{Op:"Get", URL:"https://127.0.0.1:35489/request/second/enqueued-as-designed", Err:http.http2StreamError{StreamID:0x3, Code:0x2, Cause:(*errors.errorString)(0xc0000b68e0)}}="Get \"https://127.0.0.1:35489/request/second/enqueued-as-designed\": stream error: stream ID 3; INTERNAL_ERROR; received from peer"

@MikeSpreitzer
Copy link
Member

The problem here is not APF, it is something in the timeout filter or the test server.

@liggitt
Copy link
Member Author

liggitt commented Jul 15, 2022

The problem here is not APF, it is something in the timeout filter or the test server.

is something in APF triggering a timeout?

@liggitt
Copy link
Member Author

liggitt commented Jul 15, 2022

The flake definitely came in with #110104

on the prior commit on master:

$ git checkout 6a784ef
HEAD is now at 6a784ef8f84 CHANGELOG: Update directory for v1.24.3 release

$ go test -race -c k8s.io/apiserver/pkg/server/filters

$ stress ./filters.test -test.run TestPriorityAndFairnessWithPanicRecoveryAndTimeoutFilter
0 runs so far, 0 failures
12 runs so far, 0 failures
24 runs so far, 0 failures
36 runs so far, 0 failures
48 runs so far, 0 failures
49 runs so far, 0 failures
60 runs so far, 0 failures
72 runs so far, 0 failures
84 runs so far, 0 failures
96 runs so far, 0 failures
108 runs so far, 0 failures
120 runs so far, 0 failures
129 runs so far, 0 failures
136 runs so far, 0 failures
^C

on the merge commit of #110104:

$ git checkout 22d018cf76d
HEAD is now at 22d018cf76d Merge pull request #110104 from MikeSpreitzer/add-timing-ratio-histograms

$ go test -race -c k8s.io/apiserver/pkg/server/filters

$ stress ./filters.test -test.run TestPriorityAndFairnessWithPanicRecoveryAndTimeoutFilter
0 runs so far, 0 failures
12 runs so far, 0 failures

/var/folders/7f/9xt_73f12xlby0w362rgk0s400kjgb/T/go-stress-20220714T234535-278993198
I0714 23:45:43.071718   77695 apf_controller.go:317] Starting API Priority and Fairness config controller
I0714 23:45:43.072525   77695 apf_controller.go:322] Running API Priority and Fairness config worker
E0714 23:45:43.215552   77695 runtime.go:77] Observed a panic: request handler panic'd as designed - "/request/panic-as-designed"
goroutine 159 [running]:
k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP.func1.1()
	/Users/liggitt/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:109 +0x106
panic({0x30ed940, 0xc000351c70})
	/Users/liggitt/.gvm/gos/go1.18.1/src/runtime/panic.go:844 +0x258
k8s.io/apiserver/pkg/server/filters.TestPriorityAndFairnessWithPanicRecoveryAndTimeoutFilter.func1.1({0x3736ab0, 0xc0000ac4e0}, 0xc0007f2a00)
	/Users/liggitt/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/priority-and-fairness_test.go:698 +0x190
net/http.HandlerFunc.ServeHTTP(0xc00063db60, {0x3736ab0, 0xc0000ac4e0}, 0xc0007c7648?)
	/Users/liggitt/.gvm/gos/go1.18.1/src/net/http/server.go:2084 +0x4e
k8s.io/apiserver/pkg/server/filters.WithPriorityAndFairness.func1.9()
	/Users/liggitt/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/priority-and-fairness.go:288 +0x14f
k8s.io/apiserver/pkg/util/flowcontrol.(*configController).Handle.func2()
	/Users/liggitt/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/util/flowcontrol/apf_filter.go:194 +0x2db
k8s.io/apiserver/pkg/util/flowcontrol/fairqueuing/queueset.(*request).Finish.func1(0xc0002961e0, 0xc0007c78be, 0xc00041f960)
	/Users/liggitt/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/util/flowcontrol/fairqueuing/queueset/queueset.go:357 +0x91
k8s.io/apiserver/pkg/util/flowcontrol/fairqueuing/queueset.(*request).Finish(0x3395a19?, 0x0?)
	/Users/liggitt/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/util/flowcontrol/fairqueuing/queueset/queueset.go:358 +0x4e
k8s.io/apiserver/pkg/util/flowcontrol.(*configController).Handle(0xc000538f00, {0x3736f20?, 0xc000686ed0}, {0xc00022ad10?, {0x3737540
…
24 runs so far, 1 failures
36 runs so far, 1 failures

/var/folders/7f/9xt_73f12xlby0w362rgk0s400kjgb/T/go-stress-20220714T234535-662425781
I0714 23:45:54.415935   78233 apf_controller.go:317] Starting API Priority and Fairness config controller
I0714 23:45:54.416575   78233 apf_controller.go:322] Running API Priority and Fairness config worker
E0714 23:45:54.542185   78233 runtime.go:77] Observed a panic: request handler panic'd as designed - "/request/panic-as-designed"
goroutine 136 [running]:
k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP.func1.1()
	/Users/liggitt/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:109 +0x106
panic({0x30ed940, 0xc000147180})
	/Users/liggitt/.gvm/gos/go1.18.1/src/runtime/panic.go:844 +0x258
k8s.io/apiserver/pkg/server/filters.TestPriorityAndFairnessWithPanicRecoveryAndTimeoutFilter.func1.1({0x3736ab0, 0xc0001e1b00}, 0xc000351100)
	/Users/liggitt/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/priority-and-fairness_test.go:698 +0x190
net/http.HandlerFunc.ServeHTTP(0xc0000bfc50, {0x3736ab0, 0xc0001e1b00}, 0xc000807648?)
	/Users/liggitt/.gvm/gos/go1.18.1/src/net/http/server.go:2084 +0x4e
k8s.io/apiserver/pkg/server/filters.WithPriorityAndFairness.func1.9()
	/Users/liggitt/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/priority-and-fairness.go:288 +0x14f
k8s.io/apiserver/pkg/util/flowcontrol.(*configController).Handle.func2()
	/Users/liggitt/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/util/flowcontrol/apf_filter.go:194 +0x2db
k8s.io/apiserver/pkg/util/flowcontrol/fairqueuing/queueset.(*request).Finish.func1(0xc00044c0f0, 0xc0008078be, 0xc0002becb0)
	/Users/liggitt/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/util/flowcontrol/fairqueuing/queueset/queueset.go:357 +0x91
k8s.io/apiserver/pkg/util/flowcontrol/fairqueuing/queueset.(*request).Finish(0x3395a19?, 0x0?)
	/Users/liggitt/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/util/flowcontrol/fairqueuing/queueset/queueset.go:358 +0x4e
k8s.io/apiserver/pkg/util/flowcontrol.(*configController).Handle(0xc0004a2280, {0x3736f20?, 0xc0008ac750}, {0xc0001cd130?, {0x3737540
…

/var/folders/7f/9xt_73f12xlby0w362rgk0s400kjgb/T/go-stress-20220714T234535-102924944
I0714 23:45:54.406811   78226 apf_controller.go:317] Starting API Priority and Fairness config controller
I0714 23:45:54.407566   78226 apf_controller.go:322] Running API Priority and Fairness config worker
E0714 23:45:54.532571   78226 runtime.go:77] Observed a panic: request handler panic'd as designed - "/request/panic-as-designed"
goroutine 77 [running]:
k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP.func1.1()
	/Users/liggitt/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:109 +0x106
panic({0x30ed940, 0xc0005119e0})
	/Users/liggitt/.gvm/gos/go1.18.1/src/runtime/panic.go:844 +0x258
k8s.io/apiserver/pkg/server/filters.TestPriorityAndFairnessWithPanicRecoveryAndTimeoutFilter.func1.1({0x3736ab0, 0xc000437be0}, 0xc000147200)
	/Users/liggitt/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/priority-and-fairness_test.go:698 +0x190
net/http.HandlerFunc.ServeHTTP(0xc0004f02d0, {0x3736ab0, 0xc000437be0}, 0xc0005f1648?)
	/Users/liggitt/.gvm/gos/go1.18.1/src/net/http/server.go:2084 +0x4e
k8s.io/apiserver/pkg/server/filters.WithPriorityAndFairness.func1.9()
	/Users/liggitt/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/priority-and-fairness.go:288 +0x14f
k8s.io/apiserver/pkg/util/flowcontrol.(*configController).Handle.func2()
	/Users/liggitt/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/util/flowcontrol/apf_filter.go:194 +0x2db
k8s.io/apiserver/pkg/util/flowcontrol/fairqueuing/queueset.(*request).Finish.func1(0xc0002ba0f0, 0xc0005f18be, 0xc0002c5a40)
	/Users/liggitt/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/util/flowcontrol/fairqueuing/queueset/queueset.go:357 +0x91
k8s.io/apiserver/pkg/util/flowcontrol/fairqueuing/queueset.(*request).Finish(0x3395a19?, 0x0?)
	/Users/liggitt/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/util/flowcontrol/fairqueuing/queueset/queueset.go:358 +0x4e
k8s.io/apiserver/pkg/util/flowcontrol.(*configController).Handle(0xc0002b6280, {0x3736f20?, 0xc0004f0cf0}, {0xc00014e420?, {0x3737540?
…

/var/folders/7f/9xt_73f12xlby0w362rgk0s400kjgb/T/go-stress-20220714T234535-248880815
I0714 23:45:54.403886   78230 apf_controller.go:317] Starting API Priority and Fairness config controller
I0714 23:45:54.404609   78230 apf_controller.go:322] Running API Priority and Fairness config worker
E0714 23:45:54.529550   78230 runtime.go:77] Observed a panic: request handler panic'd as designed - "/request/panic-as-designed"
goroutine 137 [running]:
k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP.func1.1()
	/Users/liggitt/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:109 +0x106
panic({0x30ed940, 0xc0002252c0})
	/Users/liggitt/.gvm/gos/go1.18.1/src/runtime/panic.go:844 +0x258
k8s.io/apiserver/pkg/server/filters.TestPriorityAndFairnessWithPanicRecoveryAndTimeoutFilter.func1.1({0x3736ab0, 0xc0000acac0}, 0xc0001d4e00)
	/Users/liggitt/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/priority-and-fairness_test.go:698 +0x190
net/http.HandlerFunc.ServeHTTP(0xc00075a2d0, {0x3736ab0, 0xc0000acac0}, 0xc000769648?)
	/Users/liggitt/.gvm/gos/go1.18.1/src/net/http/server.go:2084 +0x4e
k8s.io/apiserver/pkg/server/filters.WithPriorityAndFairness.func1.9()
	/Users/liggitt/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/priority-and-fairness.go:288 +0x14f
k8s.io/apiserver/pkg/util/flowcontrol.(*configController).Handle.func2()
	/Users/liggitt/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/util/flowcontrol/apf_filter.go:194 +0x2db
k8s.io/apiserver/pkg/util/flowcontrol/fairqueuing/queueset.(*request).Finish.func1(0xc0007260f0, 0xc0007698be, 0xc0001ed7a0)
	/Users/liggitt/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/util/flowcontrol/fairqueuing/queueset/queueset.go:357 +0x91
k8s.io/apiserver/pkg/util/flowcontrol/fairqueuing/queueset.(*request).Finish(0x3395a19?, 0x0?)
	/Users/liggitt/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/util/flowcontrol/fairqueuing/queueset/queueset.go:358 +0x4e
k8s.io/apiserver/pkg/util/flowcontrol.(*configController).Handle(0xc0001ee140, {0x3736f20?, 0xc00075aff0}, {0xc0004d42c0?, {0x3737540
…
48 runs so far, 4 failures

@liggitt
Copy link
Member Author

liggitt commented Jul 15, 2022

/assign @MikeSpreitzer

@MikeSpreitzer
Copy link
Member

is something in APF triggering a timeout?

Note that the flaking test is deliberately suffering server-side timeouts. The symptom in flakes is that the test is checking that the client gets kube's designed response to a timeout while handling, a particular form HTTP response message, but sometimes the client gets something else (an http2 internal error?).

#110104 is a change in metrics, and should have nothing to do with how timeouts while handling and detected or handled. I will see what I can see...

@MikeSpreitzer
Copy link
Member

@liggitt , regarding your comment above (#111154 (comment)): note that the output shown (the panic "as designed") is normal; if there are failures involved, the failing part was elided.

@MikeSpreitzer
Copy link
Member

Also, the failures seem to be coming from one particular subtest (priority_level_concurrency_is_set_to_1,_queue_length_is_1,_first_request_should_time_out_and_second_(enqueued)_request_should_time_out_as_well) and that one does not involve designed panics.

@MikeSpreitzer
Copy link
Member

On closer examination of the timeout filter, it appears that this "INTERNAL ERROR" is something that can happen normally in the course of handling a timeout. So I will adjust the test to accept this.

@MikeSpreitzer
Copy link
Member

I have updated #111162 to include the fix for this flake.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/flake Categorizes issue or PR as related to a flaky test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants