Issues with client termination of H2 CONNECT streams #3652

Open
howardjohn opened this issue Apr 29, 2024 · 3 comments
Labels
A-http2 Area: HTTP/2 specific. C-bug Category: bug. Something is wrong. This is bad! E-medium Effort: medium. Some knowledge of how hyper internal works would be useful.

Comments

@howardjohn
Contributor

Version
Hyper 1.3

Platform

6.8.7-arch1-1 #1 SMP PREEMPT_DYNAMIC Wed, 17 Apr 2024 15:20:28 +0000 x86_64 GNU/Linux

Description

We are seeing a few issues around using H2 CONNECT. Our application is using Tokio, Rustls, and Hyper to communicate between a client and server both using the same stack. We have multiple streams on one TCP connection sometimes (though the issue occurs regardless of multiplexing).

Our first issue that popped up was leaking connections. This was debugged, and an (attempted) fix was opened in #3647; more details there. This was likely not detected before because:

  • The CONNECT tests in the repo all close from the server side
  • When using hyper's pool this isn't seen since the pool will hold onto the connection after all streams are closed (until the idle timeout). We only started seeing this when we stopped using hyper's pool and started closing the connection immediately after the last stream closed.

With that fix, everything was working fine. However, a later refactor in our application broke things again: we started dropping the SendRequest before we were done with I/O operations on Upgraded (previously, we dropped the SendRequest afterwards). This causes the stream/connection to be closed unexpectedly. It can be reproduced easily by moving the drop(client) in #3647 to before the write/read operations. This is easy to work around, but it's not documented, and @seanmonstar suggested on Discord that it was not expected.
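For reference, a minimal sketch of the drop-order workaround, assuming hyper 1.x's client::conn::http2 plus hyper-util's TokioIo/TokioExecutor; the proxy address, tunnel target, and payload are placeholders rather than our actual code:

```rust
use http_body_util::Empty;
use hyper::body::Bytes;
use hyper::{Method, Request};
use hyper_util::rt::{TokioExecutor, TokioIo};
use tokio::io::{AsyncReadExt, AsyncWriteExt};
use tokio::net::TcpStream;

async fn connect_tunnel(proxy_addr: &str) -> Result<(), Box<dyn std::error::Error>> {
    // Open the TCP connection and perform the HTTP/2 handshake.
    let tcp = TcpStream::connect(proxy_addr).await?;
    let (mut send_request, conn) =
        hyper::client::conn::http2::handshake(TokioExecutor::new(), TokioIo::new(tcp)).await?;

    // Drive the connection on its own task.
    tokio::spawn(async move {
        let _ = conn.await;
    });

    // Open a CONNECT stream and wait for the tunnel to be established.
    let req = Request::builder()
        .method(Method::CONNECT)
        .uri("example.com:443") // placeholder tunnel target
        .body(Empty::<Bytes>::new())?;
    let res = send_request.send_request(req).await?;
    let mut tunnel = TokioIo::new(hyper::upgrade::on(res).await?);

    // Workaround: keep `send_request` alive while doing I/O on the tunnel.
    // Dropping it here (before the write/read) is what closed the
    // stream/connection unexpectedly for us.
    tunnel.write_all(b"hello").await?;
    let mut buf = [0u8; 5];
    tunnel.read_exact(&mut buf).await?;

    // Only release the handles once the tunnel I/O is finished.
    drop(tunnel);
    drop(send_request);
    Ok(())
}
```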

@howardjohn howardjohn added the C-bug Category: bug. Something is wrong. This is bad! label Apr 29, 2024
@seanmonstar
Member

That is surprising. The Connection is designed to stay alive until all shared references to it are gone. Those are the SendRequest, and any ResponseFuture, SendStream, and RecvStream. Perhaps something about the way the types are stored inside the Upgraded makes it drop the reference and no longer be counted for liveness?

cc @nox, who worked on this previously.

@seanmonstar seanmonstar added E-medium Effort: medium. Some knowledge of how hyper internal works would be useful. A-http2 Area: HTTP/2 specific. labels Apr 30, 2024
@howardjohn
Contributor Author

Ah, I think I got things a little mixed up.

In #3647, I sent a test and a fix.

  • The test was supposed to reproduce "connection is not closed when SendRequest and Upgraded are dropped". It was written incorrectly; I never dropped Upgraded which kept the connection alive as expected.
  • My fix closed the connection when all SendRequests were dropped, even if there were SendStream, RecvStream, etc. left (via Upgraded). So it is the cause of the issue described.

That being said, I do still see the issue in our application despite dropping everything, I am just unable to reproduce it in a unit test.

howardjohn added a commit to howardjohn/h2 that referenced this issue May 1, 2024
See hyperium/hyper#3652.

What I have found is the final reference to a stream being dropped
after the `maybe_close_connection_if_no_streams` but before the
`inner.poll()` completes can lead to the connection dangling forever
without any forward progress. No streams/references are alive, but the
connection is not complete and never wakes up again. This seems like a
classic TOCTOU race condition.

In this fix, I check again at the end of poll and if this state is
detected, wake up the task again.

With the test in hyperium/hyper#3655, on my machine, it fails about 5% of the time:
```
1876 runs so far, 100 failures (94.94% pass rate). 95.197349ms avg, 1.097347435s max, 5.398457ms min
```

With that PR, this test is 100% reliable
```
64010 runs so far, 0 failures (100.00% pass rate). 44.484057ms avg, 121.454709ms max, 1.872657ms min
```

Note: we also have reproduced this using `h2` directly outside of `hyper`, which is what gives me
confidence this issue lies in `h2` and not `hyper`.
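
For illustration, a minimal sketch of the re-check-and-wake pattern described above, with hypothetical types and method names standing in for h2's internals (this is not the actual h2 source):

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll};

// Hypothetical stand-in for h2's connection state; the real types differ.
struct Connection {
    // Count of live stream/handle references, decremented from other tasks on drop.
    handles: Arc<AtomicUsize>,
    closing: bool,
}

impl Connection {
    fn maybe_close_connection_if_no_streams(&mut self) {
        // If every handle is already gone, start shutting down.
        if self.handles.load(Ordering::SeqCst) == 0 {
            self.closing = true;
        }
    }

    fn poll_inner(&mut self, _cx: &mut Context<'_>) -> Poll<()> {
        // The real code drives frames here; for the sketch, finish once closing.
        if self.closing {
            Poll::Ready(())
        } else {
            Poll::Pending
        }
    }
}

impl Future for Connection {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        self.maybe_close_connection_if_no_streams();
        let result = self.poll_inner(cx);
        if result.is_pending() && self.handles.load(Ordering::SeqCst) == 0 && !self.closing {
            // Race window: the last handle can be dropped on another task after the
            // check above but before poll_inner finishes, leaving the connection
            // Pending forever. Re-check here and self-wake so the close path runs
            // on the next poll.
            cx.waker().wake_by_ref();
        }
        result
    }
}
```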
@howardjohn
Contributor Author

Ok I think I figured it out and put up two PRs:

seanmonstar pushed a commit to hyperium/h2 that referenced this issue May 2, 2024