net/http: TestInterruptWithPanic_h2 flakes on ppc64le builder #17243
Comments
@aclements Can you run your flake finder tool on this?
Looks like it's been flaking on more than just PPC:
Flake analysis isn't very helpful with this one, I'm afraid:
I can't reproduce this. I deflaked this test in 238247e, but that commit message is misleading: it was only deflaking additions from https://golang.org/cl/33099 and https://golang.org/cl/33103. The original flaky code looks unchanged. In particular, the test handler does this:

```go
cst := newClientServerTest(t, h2, HandlerFunc(func(w ResponseWriter, r *Request) {
	io.WriteString(w, msg)
	w.(Flusher).Flush()
	panic(panicValue)
}), func(ts *httptest.Server) {
```

That means the Handler's http2 [...]

@tombergan, did you fix this by accident when you redid the http2 write scheduler? Or, can you imagine why we'd schedule a [...]
Here's one hypothesis. sc.resetStream constructs a FrameWriteRequest that doesn't have a stream field set. Both the old and new write schedulers use the stream field to determine which queue gets the FrameWriteRequest. This means RST_STREAM frames will be added to the root queue (for control frames), which is processed before the stream queues. We could have this sequence of events:

1. The handler writes and flushes the response, which queues HEADERS and DATA on the stream's queue.
2. The handler panics, and sc.resetStream queues RST_STREAM on the root queue, which is written first.
3. The write scheduler drains the stream's queue, writing HEADERS and DATA.
The test will pass if step 3 happens before step 2, but will fail in the above order. Both the old and new write schedulers have this bug; more precisely, the bug is in sc.resetStream. This bug can explain the flake. However, I can't explain why the test has stopped flaking. Perhaps something changed in the buffering code that made the above order much less likely?
To be clear, I've never been able to reproduce it. And I don't think it has happened many times.
(In a meeting now, but your theory looks valid at first glance.) Do you want to prepare a fix? Or I can send you something?
A bit busy today, but I can get to this tomorrow if you don't get there first.
CL https://golang.org/cl/33238 mentions this issue.
CL https://golang.org/cl/33241 mentions this issue.
Updates x/net/http2 to x/net git rev 00ed5e9 for:

    http2: schedule RSTStream writes onto its stream's queue
    https://golang.org/cl/33238

Fixes #17243

Change-Id: I79cc5d15bf69ead28d549d4f798c12f4ee2a2201
Reviewed-on: https://go-review.googlesource.com/33241
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Apparently golang/net@00ed5e9 wasn't enough. I still see a flake:
I'm going to deflake this a different way, without fixing the http2 issue for now.
CL https://golang.org/cl/33332 mentions this issue.
Updates #17243

Change-Id: Iaa737874e75fdac73452f1fc13a5749e8df78ebe
Reviewed-on: https://go-review.googlesource.com/33332
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
TODO(bradfitz): file a bug about the general problem and theory, then close this one.
I'll repurpose this bug for the x/net/http2 behavior. It seems that even with our fix in https://golang.org/cl/33238 (and bundling it into std), this flaky test persists. Perhaps the problem is now in the Transport instead of the Server. Investigate for Go 1.9.
That is, if the Transport sees a HEADERS frame on the wire and then "immediately" sees a RST_STREAM, is it possible for the Transport to report an error instead of its RoundTrip returning a non-nil *Response and nil error?
@tombergan, I think between the Transport and Server changes, we might've fixed this one?
If it hasn't been flaking, I say close it.
Fixes golang/go#17243

Change-Id: I76f972f908757b103e2ab8d9b1701312308d66e5
Reviewed-on: https://go-review.googlesource.com/33238
Reviewed-by: Tom Bergan <tombergan@google.com>
The failure looks like this:
It's happened three times in the last three screens of build.golang.org:
https://build.golang.org/log/969151e16737eda8c8720b1c0c3cb6f0ddbe5025
https://build.golang.org/log/ae7a7294c3d83cc5584aabeec8bb4fe55b9e4832
https://build.golang.org/log/512a378cc11774b203aa0962c9818326eda0cd76
but I have the vague feeling it's been going on for a while.