Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

APF: fix data race in unit tests #120990

Merged
merged 1 commit into from Oct 13, 2023

Conversation

tkashem
Copy link
Contributor

@tkashem tkashem commented Oct 4, 2023

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

In support of #114189.

If we wire per handler read/write timeout, then this data race manifests.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Oct 4, 2023
@k8s-ci-robot k8s-ci-robot added area/apiserver sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Oct 4, 2023
@tkashem
Copy link
Contributor Author

tkashem commented Oct 4, 2023

/test pull-kubernetes-e2e-kind

@tkashem
Copy link
Contributor Author

tkashem commented Oct 4, 2023

/test pull-kubernetes-linter-hints

@tkashem
Copy link
Contributor Author

tkashem commented Oct 5, 2023

/test pull-kubernetes-e2e-kind

@tkashem
Copy link
Contributor Author

tkashem commented Oct 5, 2023

Please take a look, this PR is a pre-requisite for #114189
/cc @wojtek-t @sttts

@tkashem
Copy link
Contributor Author

tkashem commented Oct 5, 2023

This is the data race

WARNING: DATA RACE
Read at 0x00c0006fb68f by goroutine 5397:
  k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.TestPriorityAndFairnessWithPanicRecoveryAndTimeoutFilter.func2()
      /home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/priority-and-fairness_test.go:796 +0x424
  testing.tRunner()
      /home/prow/go/src/k8s.io/kubernetes/_output/local/.gimme/versions/go1.21.1.linux.amd64/src/testing/testing.go:1595 +0x238
  testing.(*T).Run.func1()
      /home/prow/go/src/k8s.io/kubernetes/_output/local/.gimme/versions/go1.21.1.linux.amd64/src/testing/testing.go:1648 +0x44

Previous write at 0x00c0006fb68f by goroutine 5566:
  k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.TestPriorityAndFairnessWithPanicRecoveryAndTimeoutFilter.func2.1()
      /home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/priority-and-fairness_test.go:765 +0x9c
  net/http.HandlerFunc.ServeHTTP()
      /home/prow/go/src/k8s.io/kubernetes/_output/local/.gimme/versions/go1.21.1.linux.amd64/src/net/http/server.go:2136 +0x47
  k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.(*priorityAndFairnessHandler).Handle.func9()
      /home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/priority-and-fairness.go:286 +0x141
  k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/util/flowcontrol.(*configController).Handle.func2()
      /home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/util/flowcontrol/apf_filter.go:197 +0x398
  k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/util/flowcontrol/fairqueuing/queueset.(*request).Finish.func1()
      /home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/util/flowcontrol/fairqueuing/queueset/queueset.go:401 +0x87
  k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/util/flowcontrol/fairqueuing/queueset.(*request).Finish()
      /home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/util/flowcontrol/fairqueuing/queueset/queueset.go:402 +0x4a
  k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/util/flowcontrol.(*configController).Handle()
      /home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/util/flowcontrol/apf_filter.go:184 +0xcae
  k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.(*priorityAndFairnessHandler).Handle()
      /home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/priority-and-fairness.go:289 +0x1441
  k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.(*priorityAndFairnessHandler).Handle-fm()
      <autogenerated>:1 +0x51
  net/http.HandlerFunc.ServeHTTP()
      /home/prow/go/src/k8s.io/kubernetes/_output/local/.gimme/versions/go1.21.1.linux.amd64/src/net/http/server.go:2136 +0x47
  k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.newHandlerChain.func1()
      /home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/priority-and-fairness_test.go:1233 +0x349
  net/http.HandlerFunc.ServeHTTP()
      /home/prow/go/src/k8s.io/kubernetes/_output/local/.gimme/versions/go1.21.1.linux.amd64/src/net/http/server.go:2136 +0x47
  k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP.func1()
      /home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:115 +0xcf

Goroutine 5397 (running) created at:
  testing.(*T).Run()
      /home/prow/go/src/k8s.io/kubernetes/_output/local/.gimme/versions/go1.21.1.linux.amd64/src/testing/testing.go:1648 +0x82a
  k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.TestPriorityAndFairnessWithPanicRecoveryAndTimeoutFilter()
      /home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/priority-and-fairness_test.go:746 +0x86
  testing.tRunner()
      /home/prow/go/src/k8s.io/kubernetes/_output/local/.gimme/versions/go1.21.1.linux.amd64/src/testing/testing.go:1595 +0x238
  testing.(*T).Run.func1()
      /home/prow/go/src/k8s.io/kubernetes/_output/local/.gimme/versions/go1.21.1.linux.amd64/src/testing/testing.go:1648 +0x44

Goroutine 5566 (running) created at:
  k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP()
      /home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:101 +0x31c
  k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.newHandlerChain.WithRequestDeadline.withRequestDeadline.func7()
      /home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters/request_deadline.go:142 +0x36a
  net/http.HandlerFunc.ServeHTTP()
      /home/prow/go/src/k8s.io/kubernetes/_output/local/.gimme/versions/go1.21.1.linux.amd64/src/net/http/server.go:2136 +0x47
  k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.newHandlerChain.WithRequestInfo.func4()
      /home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters/requestinfo.go:39 +0x167
  net/http.HandlerFunc.ServeHTTP()
      /home/prow/go/src/k8s.io/kubernetes/_output/local/.gimme/versions/go1.21.1.linux.amd64/src/net/http/server.go:2136 +0x47
  k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.newHandlerChain.WithPanicRecovery.withPanicRecovery.func8()
      /home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/wrap.go:74 +0x130
  net/http.HandlerFunc.ServeHTTP()
      /home/prow/go/src/k8s.io/kubernetes/_output/local/.gimme/versions/go1.21.1.linux.amd64/src/net/http/server.go:2136 +0x47
  k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.newHandlerChain.WithAuditInit.withAuditInit.func9()
      /home/prow/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters/audit_init.go:63 +0x147
  net/http.HandlerFunc.ServeHTTP()
      /home/prow/go/src/k8s.io/kubernetes/_output/local/.gimme/versions/go1.21.1.linux.amd64/src/net/http/server.go:2136 +0x47
  net/http.serverHandler.ServeHTTP()
      /home/prow/go/src/k8s.io/kubernetes/_output/local/.gimme/versions/go1.21.1.linux.amd64/src/net/http/server.go:2938 +0x2a1
  net/http.initALPNRequest.ServeHTTP()
      /home/prow/go/src/k8s.io/kubernetes/_output/local/.gimme/versions/go1.21.1.linux.amd64/src/net/http/server.go:3546 +0x35e
  net/http.(*initALPNRequest).ServeHTTP()
      <autogenerated>:1 +0x7b
  net/http.Handler.ServeHTTP-fm()
      <autogenerated>:1 +0x67
  net/http.(*http2serverConn).runHandler()
      /home/prow/go/src/k8s.io/kubernetes/_output/local/.gimme/versions/go1.21.1.linux.amd64/src/net/http/h2_bundle.go:6131 +0xe1
  net/http.(*http2serverConn).processHeaders.func2()
      /home/prow/go/src/k8s.io/kubernetes/_output/local/.gimme/versions/go1.21.1.linux.amd64/src/net/http/h2_bundle.go:5845 +0x5d
==================

from https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/114189/pull-kubernetes-unit/1705271966702768128

@@ -710,19 +711,19 @@ func TestPriorityAndFairnessWithPanicRecoveryAndTimeoutFilter(t *testing.T) {
// - first request is expected to panic as designed
// - second request is expected to success
_, err := requestGetter(firstRequestPathPanic)
if !executed {
if invokedGot := atomic.LoadInt32(&invoked); invokedGot != 1 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm - if we had a race before [so something was writing it at the same time when something else was reading], won't this now effectively become flaky [sometimes it will be 1, sometimes something else]?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we expect the server to write first, and the test to read later (using the client and server timeout). client waits long enough for the server to respond, server timeout = 5s, client timeout = 10s

I have replaced the use of atomic with channel to establish a happens-before relationship, but i guess, in theory the flake still exists - for example, the server is so overloaded that it fails to execute the request handler in time.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess my point was slightly different - if we have long-enough time between the timeouts, then why the rate here was happening in the original version?

Either there is enough time in between (and then the original version should work), or there isn't enough time and then with the new code you could be racy too, no?

Copy link
Contributor Author

@tkashem tkashem Oct 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here is my observation so far:

  • the data race does not manifest with the race detector on, i have tried with stress, no luck so far
$ go test -race k8s.io/apiserver/pkg/server/filters -c
$ stress -p=32 ./filters.test -test.run=TestPriorityAndFairnessWithPanicRecoveryAndTimeoutFilter
5s: 0 runs so far, 0 failures
10s: 32 runs so far, 0 failures
...
15m30s: 4482 runs so far, 0 failures
15m35s: 4505 runs so far, 0 failures

I wrote a much simpler, but equivalent test with the race present, executed variable being written to by the server filter goroutine, and read by the test goroutine, and if I run this test the data race is detected, but for some reason the same race is not detected in the TestPriorityAndFairnessWithPanicRecoveryAndTimeoutFilter test.

func TestDataRaceWithServerHandler(t *testing.T) {
	handlerDoneCh, clientDoneCh := make(chan struct{}), make(chan struct{})
	var executed bool
	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		defer close(handlerDoneCh)
		executed = true

		<-clientDoneCh
	})

	server := httptest.NewUnstartedServer(handler)
	defer server.Close()
	server.EnableHTTP2 = true
	server.StartTLS()

	client := server.Client()
	client.Timeout = 5 * time.Second

	func() {
		defer close(clientDoneCh)
		client.Get(server.URL + "/foo")
	}()

	if !executed {
		t.Errorf("expected handler to be executed")
	}
}

due to #114189, net/http will spin up a new goroutine for each request on the server side, which directly returns a stream reset error to the caller. So it brings in two new changes:

  • a new goroutne in play (but it does not directly read from or write to the memory)
  • it helps the client Get call to complete sooner (before, the response would come to the caller via the timeout filter)

in summary, the writer (test goroutine) writes to the memory location sooner with #114189, but I am not sure why this reduced gap in read-write would manifest the data race.

This PR does two things:

  • move both read and write in the same goroutine (move t.Errorf inside the server filter, as it is safe to do so now)
  • use channel instead of atomic, this minimizes the chance of a flake.

@wojtek-t wojtek-t self-assigned this Oct 5, 2023
@cici37
Copy link
Contributor

cici37 commented Oct 5, 2023

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 5, 2023
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Oct 9, 2023
@tkashem
Copy link
Contributor Author

tkashem commented Oct 9, 2023

ran a stress test:

$ stress -p=32 ./filters.test -test.run=TestPriorityAndFairnessWithPanicRecoveryAndTimeoutFilter
5s: 0 runs so far, 0 failures
10s: 32 runs so far, 0 failures
15s: 48 runs so far, 0 failures
...
10m25s: 3010 runs so far, 0 failures
10m30s: 3028 runs so far, 0 failures
10m35s: 3056 runs so far, 0 failures
10m40s: 3079 runs so far, 0 failures

if err != nil {
t.Fatalf("Expected request: %q to get a response, but got error: %#v", secondRequestPathShouldWork, err)
}
if response.StatusCode != http.StatusOK {
t.Errorf("Expected HTTP status code: %d for request: %q, but got: %#v", http.StatusOK, secondRequestPathShouldWork, response)
}

for _, err := range headerMatcher.errors() {
t.Errorf("Expected APF headers to match, but got: %v", err)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of splitting the read and write into two separate go routines, do the verification inside the handler - t.Errorf is safe to be invoked in separate goroutines

@wojtek-t
Copy link
Member

OK - thanks for explanations - that makes sense to me.

/lgtm
/approve

Thanks!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 13, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 5ea50fcd1d880e79669d69eab83eb374615432e8

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tkashem, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 13, 2023
@k8s-ci-robot k8s-ci-robot merged commit 86ba008 into kubernetes:master Oct 13, 2023
13 of 14 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.29 milestone Oct 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/apiserver cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants