net/http: frequent failures in TestNoSniffExpectRequestBody_h2 since Oct. 28 #42256

Open · bcmills opened this issue Oct 28, 2020 · 13 comments

@bcmills (Member) commented Oct 28, 2020

--- FAIL: TestNoSniffExpectRequestBody_h2 (10.00s)
    clientserver_test.go:1316: unexpected Read call
2020/10/28 09:19:06 http: TLS handshake error from 127.0.0.1:33072: read tcp 127.0.0.1:33050->127.0.0.1:33072: use of closed network connection
FAIL
FAIL	net/http	24.279s

2020-10-28T14:25:56-b85c2dd/aix-ppc64
2020-10-28T13:46:11-a0a4439/solaris-amd64-oraclerel
2020-10-28T13:25:44-72dec90/aix-ppc64
2020-10-28T04:20:39-02335cf/aix-ppc64
2020-10-28T01:22:47-2414e1f/aix-ppc64
2020-10-28T01:18:38-7be9158/solaris-amd64-oraclerel

Marking as release-blocker because this appears to be a regression in 1.16.

The failures so far are on somewhat esoteric builders, but those are also some of our slowest builders — this could be a timing bug that just happens to reproduce more easily on them.

@bcmills changed the title "net/http: frequent failures in TestNoSnippExpectRequestBody_h2 since Oct. 28" to "net/http: frequent failures in TestNoSniffExpectRequestBody_h2 since Oct. 28" on Oct 28, 2020

@danp (Contributor) commented Nov 2, 2020

I spent some time digging into this, partly out of curiosity but also to be sure it wasn't related to the fix for #42156, since the timing was suspect. It doesn't seem to be related: http.Server.Shutdown is not involved.

It would be nice to run something like go test -count 100 -run TestNoSniffExpectRequestBody_h2 on one of these builders. It's my understanding that gomote access is not given out much. If that is the case, would it be appropriate to debug by opening a DO NOT REVIEW/SUBMIT CL and having TryBot target one or more of these builders using TRY?

@neild (Contributor) commented Nov 2, 2020

I'm pretty certain this has been fixed, but I haven't managed to bisect down to the exact point of the fix yet. It reproduces on these builders at the previously-failing commits with -count 100, but not at head.

@danp (Contributor) commented Nov 2, 2020

The timer-related changes in 8fdc79e and b4b0144 (before the first failure, 7be9158) and in 89a6540 (after the last observed failure, 676ad56) seemed interesting, but I wasn't able to induce any failures locally in the questionable commit range to dig deeper.

@neild (Contributor) commented Nov 2, 2020

Bisection results point at this being fixed by e02ab89:

https://go.googlesource.com/go/+/e02ab89eb8994fa6f2dfa2924cdadb097633fcc1 (passes)
https://go.googlesource.com/go/+/a313eec3869c609c0da2402a4cac2d32113d599c (fails)

The test is flaky. The failures take ~10s to run, which points at the ExpectContinueTimeout in the test expiring for some reason. I'm not sure at the moment why the test is failing or why that CL seems to fix it.

cc @ianlancetaylor

@ianlancetaylor (Contributor) commented Nov 3, 2020

https://golang.org/cl/266304 shouldn't affect any correct code. It's a minor optimization which will cause the runtime poller to be woken up less often to run timers. It will only affect programs that set an I/O deadline or a time.Timer, and then change the deadline or the Timer to run earlier than the time set when it was created. Programs that do that will see slightly different, hopefully more efficient, scheduling behavior.

It's really puzzling that it fails before CL 266304 and passes after it. If anything I would expect CL 266304 to wake up fewer goroutines. So why would waking up more goroutines cause a timeout to expire?

I do not know what is happening here.
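
For reference, here is a minimal sketch of the pattern being described (not code from the test or the CL; the durations and variable names are made up): a timer armed with a distant deadline and later moved earlier with Reset.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// A timer armed with a deliberately distant deadline...
	t := time.NewTimer(24 * time.Hour)

	// ...and later moved to fire much earlier. CL 266304 only changes how
	// often the runtime poller wakes up to handle cases like this; it
	// should not change the observable behavior of correct programs.
	if t.Stop() { // the timer hasn't fired yet, so Stop reports true
		t.Reset(10 * time.Millisecond)
	}

	<-t.C
	fmt.Println("fired at the earlier deadline")
}
```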

@danp (Contributor) commented Nov 3, 2020

> It will only affect programs that set an I/O deadline or a time.Timer, and then change the deadline or the Timer to run earlier than the time set when it was created.

It's worth noting that this is what the h2 bits do in the Expect: 100-continue case, using AfterFunc with a "huge duration" here:

go/src/net/http/h2_bundle.go

Lines 9105 to 9108 in da7aa86

const hugeDuration = 365 * 24 * time.Hour
s.timer = time.AfterFunc(hugeDuration, func() {
	s.fnonce.Do(s.fn)
})

And then using Reset to have the timer fire earlier:

s.timer.Reset(s.delay)

@ianlancetaylor (Contributor) commented Nov 3, 2020

Well, at least that explains why the CL has some effect, though I still can't explain why it has the effect that it does.

@danp (Contributor) commented Nov 3, 2020

Actually, something else I wondered about when digging into this is the use of Stop and Reset together. Expanding the code around the Reset:

go/src/net/http/h2_bundle.go

Lines 9140 to 9142 in da7aa86

if s.timer.Stop() {
	s.timer.Reset(s.delay)
}

Since this timer was created with AfterFunc, is it questionable to use Stop and Reset in this way?

@ianlancetaylor (Contributor) commented Nov 3, 2020

That usage is fine. The Stop method will return true if the function has not yet run. After calling Stop, it's fine to call Reset to set a new time.
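
A self-contained sketch of that usage, mirroring the snippet above (the hugeDuration constant and the once-wrapped callback come from the h2 code; the delays and names here are illustrative):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var once sync.Once
	fn := func() { fmt.Println("timer callback ran") }

	// Armed with a huge delay so it effectively never fires on its own.
	const hugeDuration = 365 * 24 * time.Hour
	timer := time.AfterFunc(hugeDuration, func() { once.Do(fn) })

	// Stop reports true because the callback has not run yet, so Reset
	// safely rearms the timer with the real, much shorter delay.
	if timer.Stop() {
		timer.Reset(50 * time.Millisecond)
	}

	time.Sleep(200 * time.Millisecond) // give the callback time to run
}
```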

@dmitshur (Member) commented Nov 11, 2020

This test stops passing completely when the change in CL 253259 is bundled into net/http.
I filed #42498 about that; it may or may not be related to this issue.

@fraenkel (Contributor) commented Nov 11, 2020

I can explain how the error occurs, but not why.
The test has the client POST some data via Expect: 100-continue, with a 10s delay before sending the body. The server sends back a 401 (Unauthorized) once it receives the headers. The client should receive the 401 and cancel the timer for the body write.

The body is being written after 10s even though the timer should have been cancelled.
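
A rough sketch of that scenario over HTTP/1.1, under stated assumptions (the failing test exercises the HTTP/2 path in clientserver_test.go; the handler, timeout value, and request body below are only illustrative): the client sends Expect: 100-continue with a long ExpectContinueTimeout, the server replies 401 without reading the body, and the body write should be abandoned once the final response arrives.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

func main() {
	// Handler replies 401 immediately, without reading the request body.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusUnauthorized)
	}))
	defer srv.Close()

	// With Expect: 100-continue, the transport waits up to
	// ExpectContinueTimeout for a 100 Continue before sending the body.
	// The 401 should arrive first and cancel that wait.
	client := &http.Client{
		Transport: &http.Transport{ExpectContinueTimeout: 10 * time.Second},
	}

	req, err := http.NewRequest("POST", srv.URL, strings.NewReader("request body"))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Expect", "100-continue")

	res, err := client.Do(req)
	if err != nil {
		fmt.Println("request error:", err)
		return
	}
	defer res.Body.Close()
	io.Copy(io.Discard, res.Body)
	fmt.Println("status:", res.StatusCode) // expect 401, well before 10s
}
```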

@fraenkel (Contributor) commented Nov 21, 2020
