APF: fix data race in unit tests #120990
Conversation
/test pull-kubernetes-e2e-kind
/test pull-kubernetes-linter-hints
/test pull-kubernetes-e2e-kind
This is the data race:
@@ -710,19 +711,19 @@ func TestPriorityAndFairnessWithPanicRecoveryAndTimeoutFilter(t *testing.T) {
 	// - first request is expected to panic as designed
 	// - second request is expected to success
 	_, err := requestGetter(firstRequestPathPanic)
-	if !executed {
+	if invokedGot := atomic.LoadInt32(&invoked); invokedGot != 1 {
Hmm - if we had a race before [so something was writing it at the same time when something else was reading], won't this now effectively become flaky [sometimes it will be 1, sometimes something else]?
we expect the server to write first and the test to read later (using the client and server timeouts): the client waits long enough for the server to respond, with server timeout = 5s and client timeout = 10s.
I have replaced the use of atomic with a channel to establish a happens-before relationship, but I guess in theory the flake still exists - for example, if the server is so overloaded that it fails to execute the request handler in time.
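For illustration, here is a minimal sketch of the channel-based pattern (a hypothetical standalone test, not the PR's actual code): close(doneCh) in the handler happens-before the test's receive from doneCh, so the plain write to executed is guaranteed visible to the read that follows, whereas an atomic counter by itself orders nothing else.

package filters_test

import (
	"net/http"
	"net/http/httptest"
	"testing"
)

// sketch only: the channel receive orders the test's read after the
// handler's write
func TestChannelHappensBefore(t *testing.T) {
	doneCh := make(chan struct{})
	var executed bool
	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		executed = true // plain write in the handler goroutine
		close(doneCh)   // publishes the write
	})
	server := httptest.NewServer(handler)
	defer server.Close()

	if _, err := server.Client().Get(server.URL); err != nil {
		t.Fatalf("unexpected error: %v", err)
	}
	<-doneCh // happens-after close(doneCh), so the write is visible here
	if !executed {
		t.Errorf("expected handler to be executed")
	}
}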
I guess my point was slightly different - if we have long enough time between the timeouts, then why was the race happening in the original version?
Either there is enough time in between (and then the original version should work), or there isn't enough time, and then the new code could be racy too, no?
here is my observation so far:
- the data race does not manifest with the race detector on; I have tried with stress, no luck so far:
$ go test -race k8s.io/apiserver/pkg/server/filters -c
$ stress -p=32 ./filters.test -test.run=TestPriorityAndFairnessWithPanicRecoveryAndTimeoutFilter
5s: 0 runs so far, 0 failures
10s: 32 runs so far, 0 failures
...
15m30s: 4482 runs so far, 0 failures
15m35s: 4505 runs so far, 0 failures
- with "set per handler read/write deadline" #114189, when the read/write deadline is enabled, this data race manifests.
I wrote a much simpler but equivalent test with the race present: the executed variable is written to by the server handler goroutine and read by the test goroutine. If I run this test, the data race is detected, but for some reason the same race is not detected in the TestPriorityAndFairnessWithPanicRecoveryAndTimeoutFilter test.
func TestDataRaceWithServerHandler(t *testing.T) {
	handlerDoneCh, clientDoneCh := make(chan struct{}), make(chan struct{})
	var executed bool
	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		defer close(handlerDoneCh)
		executed = true // write from the server handler goroutine
		<-clientDoneCh  // blocks until the client call below returns
	})
	server := httptest.NewUnstartedServer(handler)
	defer server.Close()
	server.EnableHTTP2 = true
	server.StartTLS()

	client := server.Client()
	client.Timeout = 5 * time.Second
	func() {
		defer close(clientDoneCh)
		client.Get(server.URL + "/foo")
	}()

	// unsynchronized read from the test goroutine: nothing orders the
	// handler's write before this read, so the two accesses race
	if !executed {
		t.Errorf("expected handler to be executed")
	}
}
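Running just this reduced test under the race detector flags the unsynchronized access (assuming the test is added to the same filters package as above):

$ go test -race -run TestDataRaceWithServerHandler k8s.io/apiserver/pkg/server/filters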
due to #114189, net/http will spin up a new goroutine for each request on the server side, which directly returns a stream reset error to the caller. So it brings in two new changes:
- a new goroutine in play (but it does not directly read from or write to the memory)
- it helps the client Get call to complete sooner (before, the response would come to the caller via the timeout filter)
in summary, the reader (the test goroutine) reads the memory location sooner with #114189, but I am not sure why this reduced gap between the write and the read would make the data race manifest.
This PR does two things:
- move both the read and the write into the same goroutine (move t.Errorf inside the server filter, as it is safe to do so now)
- use a channel instead of an atomic; this minimizes the chance of a flake (a sketch of the resulting shape follows)
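Roughly, the combined result looks like this sketch (hypothetical names, not the PR's exact diff): the verification happens inside the handler goroutine via t.Errorf, which is safe to call from other goroutines, and a channel tells the test goroutine that the handler ran at all.

package filters_test

import (
	"net/http"
	"net/http/httptest"
	"testing"
	"time"
)

func TestVerifyInsideHandler(t *testing.T) {
	invokedCh := make(chan struct{})
	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		defer close(invokedCh)
		// read and write now happen in the same goroutine
		if r.URL.Path != "/foo" {
			t.Errorf("unexpected path: %q", r.URL.Path)
		}
	})
	server := httptest.NewServer(handler)
	defer server.Close()

	if _, err := server.Client().Get(server.URL + "/foo"); err != nil {
		t.Fatalf("unexpected error: %v", err)
	}
	select {
	case <-invokedCh: // handler finished; any t.Errorf is already recorded
	case <-time.After(5 * time.Second):
		t.Fatal("timed out waiting for the handler to be invoked")
	}
}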
/triage accepted
force-pushed from 3ee7757 to 52c58d9
ran a stress test.
	if err != nil {
		t.Fatalf("Expected request: %q to get a response, but got error: %#v", secondRequestPathShouldWork, err)
	}
	if response.StatusCode != http.StatusOK {
		t.Errorf("Expected HTTP status code: %d for request: %q, but got: %#v", http.StatusOK, secondRequestPathShouldWork, response)
	}

	for _, err := range headerMatcher.errors() {
		t.Errorf("Expected APF headers to match, but got: %v", err)
instead of splitting the read and write into two separate goroutines, do the verification inside the handler - t.Errorf is safe to be invoked from separate goroutines
OK - thanks for the explanations - that makes sense to me. /lgtm Thanks!
LGTM label has been added. Git tree hash: 5ea50fcd1d880e79669d69eab83eb374615432e8
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: tkashem, wojtek-t
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
In support of #114189.
If we wire up per-handler read/write timeouts, then this data race manifests.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: