Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

crypto/tls: add TestClientHandshakeContextCancellation back after fixing flake #45106

Open
katiehockman opened this issue Mar 18, 2021 · 13 comments
Open

Comments

@katiehockman
Copy link
Member

@katiehockman katiehockman commented Mar 18, 2021

Related to #45084 and #32406

We had to remove the test in 2f3db22 since it was flaky, so we should fix it and add it back.

@johanbrandhorst I've gone ahead and assigned this to you since you wrote the original code and removed the test, but let us know if you don't have time to fix it and we can assign it to someone on our team.

/cc @FiloSottile @rolandshoemaker

@johanbrandhorst
Copy link
Member

@johanbrandhorst johanbrandhorst commented Mar 18, 2021

Thanks for the follow-up Katie, I should be able to find the time for this, but I'll let you know if not.

@johanbrandhorst
Copy link
Member

@johanbrandhorst johanbrandhorst commented Mar 19, 2021

I'm having some problems reproducing this failure on my machine, any thoughts on the environment that triggers this?

$ go test -run '^TestClientHandshakeContextCancellation$' ./crypto/tls
ok      crypto/tls      0.019s
$ go test -run '^TestClientHandshakeContextCancellation$' -count=100 ./crypto/tls
ok      crypto/tls      0.039s
$ go test -run '^TestClientHandshakeContextCancellation$' -count=1000 ./crypto/tls
ok      crypto/tls      0.215s
$ go test -run '^TestClientHandshakeContextCancellation$' -count=100000 ./crypto/tls
ok      crypto/tls      19.952s
$ go test -run '^TestClientHandshakeContextCancellation$' -race -count=1 ./crypto/tls
ok      crypto/tls      0.071s
$ go test -race -count=1 ./crypto/tls/...
ok      crypto/tls      5.215s
$ go test -race -count=10 ./crypto/tls/...
ok      crypto/tls      31.854s
$ go test -count=10 ./crypto/tls/...
ok      crypto/tls      14.045s
@katiehockman
Copy link
Member Author

@katiehockman katiehockman commented Mar 19, 2021

Here's one of the builder failures, which describes the environment some: https://build.golang.org/log/6e8d66419f4cf08633e84e3c03777ccbf3f41e54

Based on the logs @bcmills provided in the issue, it wasn't affecting darwin architectures, so if you're running those tests from a Mac, that might be why you aren't seeing it. Maybe try a Linux machine, if you have one.

@johanbrandhorst
Copy link
Member

@johanbrandhorst johanbrandhorst commented Mar 19, 2021

Thanks for the link, I'll have a look. I'm running it on linux/amd64. Maybe I'll just have to run the whole testsuite. I assumed it might be contention on the localPipe but even running the whole crypto/tls suite didn't seem to trigger it. I'll play around with it a bit more, but if someone else finds a good way to reproduce the error, or better yet, have a good idea about what the problem could be, let me know!

@gopherbot
Copy link

@gopherbot gopherbot commented Mar 28, 2021

Change https://golang.org/cl/305250 mentions this issue: crypto/tls: reintroduce simplified cancellation test

@millerresearch
Copy link

@millerresearch millerresearch commented Mar 30, 2021

As mentioned on #45299, I can reproduce this on a Raspberry Pi (plan9-arm) pretty reliably. It goes away if a 50 millisecond delay is inserted between the conn.readClientHello and cancel. I'm looking for the race now.

@millerresearch
Copy link

@millerresearch millerresearch commented Mar 30, 2021

Another clue: on linux-amd64 I can induce a race by inserting a delay into the interrupter goroutine in the (*Conn).handshakeContext function like so:

  go func() {
    <-time.After(50*time.Millisecond)
    select {
    case <-handshakeCtx.Done():

On a heavily loaded system, this sort of delay could occur naturally.

@johanbrandhorst
Copy link
Member

@johanbrandhorst johanbrandhorst commented Mar 30, 2021

The implication then is that the test may fail if the interrupter goroutine hasn't had a chance to start before the connection failed? That would explain the flake, but how do we fix it? The test is failing because it's not seeing the context.Canceled error, so we could accept either context.Canceled or io.EOF as errors, but how does that compromise the value of the test?

@millerresearch
Copy link

@millerresearch millerresearch commented Mar 30, 2021

Is io.EOF actually an acceptable result when the server cancels the handshake? If not, then it's not the test that needs correcting, but the underlying implementation's vulnerability to races.

@johanbrandhorst
Copy link
Member

@johanbrandhorst johanbrandhorst commented Mar 30, 2021

As the author of the tests, I think the thing I wanted to assert was really that the HandshakeContext didn't block while waiting for the server/client to respond. Really any error should be fine, and the test will simply block if the cancellation isn't working as expected. This should be a way forward for me to fix the tests, if you think that sounds good @millerresearch?

@millerresearch
Copy link

@millerresearch millerresearch commented Mar 30, 2021

I'm not experienced with contexts and cancellation, so I'm not qualified to say whether the test is testing what it should. If deadlock is a possibility it's better to have a test which explicitly checks for that, rather than just leaving the test to block. But that's not what is happening here.
My question was about whether seeing EOF instead of the expected Canceled condition was the sign of a flaw in the implementation, which might cause trouble in real usage by an application. If users of HandshakeContext don't need to distinguish EOF from Canceled, then the test shouldn't need to either.

@johanbrandhorst
Copy link
Member

@johanbrandhorst johanbrandhorst commented Apr 1, 2021

Thanks for your help @millerresearch, I've updated https://go-review.googlesource.com/c/go/+/305250 with what I hope is a fix for both this issue and #45299, I would appreciate if you could try running the tests in your environment too to confirm it fixes your issue.

@millerresearch
Copy link

@millerresearch millerresearch commented Apr 1, 2021

Yes, with your fix I'm no longer able to induce a test failure. Well done!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
4 participants