
Issue 361 - Improve http2 connection error handling #363

Merged

Conversation

huntharo (Contributor) commented Jan 5, 2024

  • Partial fix for "Frequent OOM kills - Isolated to http2 - 50 GB of RAM usage on MacBook, causing system OOM" #361
  • ONLY implemented for the -z 10s (work_until) case (a rough sketch of the reconnect loop appears after this list)
  • TODO:
    • The futures are not aborted when the timer is hit, which will cause long-running requests to delay the program exit - this is only due to a borrow/move problem that I cannot figure out
    • Implement for the non-work_until cases
    • Add a timeout to the TCP socket setup - this appears to be where some of the delay on shutdown is happening if the server closes after startup
    • Consider adding a delay to the reconnect loop so that it will not try to connect more than 1 time per second per concurrent connection - without this the connect loop will spin at ~23k connect attempts/second for -c 20, for example
  • Test cases:
    • Start with the server not running at all (never connects)
      • Currently this will exit on time
      • IMPROVED: Previously this would attempt to connect once for each -c, fail, and immediately exit
      • IMPROVED: Currently this will repeatedly try to connect until the specified timeout expires, then it will exit
    • Start with the server running and leave it running
      • This works fine as before
    • Start with the server running, exit the server, then restart the server before the test completes
      • This initially makes requests
      • IMPROVED: Previously this would OOM even if the server restarted
      • IMPROVED: Currently this will reconnect and continue making requests if the server restarts
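The reconnect-until-timeout behavior described above could look roughly like the following. This is a minimal sketch assuming a tokio runtime, not the PR's actual implementation; `run_connection` is a hypothetical placeholder for "connect and drive HTTP/2 requests until the connection fails".

```rust
use tokio::time::{sleep_until, Instant};

// Hypothetical worker: keeps one connection busy until `deadline` (the -z duration),
// reconnecting whenever the connection dies with an IO/hyper error instead of
// exiting early or spinning forever.
async fn work_until(deadline: Instant) {
    loop {
        tokio::select! {
            // The -z duration elapsed: stop this worker.
            _ = sleep_until(deadline) => break,
            // The connection ended; on error, log and loop around to reconnect.
            result = run_connection() => {
                if let Err(err) = result {
                    eprintln!("connection error, reconnecting: {err}");
                }
            }
        }
    }
}

// Placeholder: connect, send requests, and return when the connection fails.
async fn run_connection() -> Result<(), Box<dyn std::error::Error>> {
    Ok(())
}
```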

After - Improved Behavior

oha-pr-363-after.mp4

Before - Existing Behavior

oha-pr-363-before.mp4

hatoo (Owner) commented Jan 6, 2024

Great catch! 👍

I understand that a TCP connection should be restarted when one of the parallel connections gets an IO/hyper error.

I updated work_until to be simpler, I believe.
I want to focus only on work_until in this PR and fix the others later.

Add a timeout to the TCP socket setup - this appears to be where some of the delay on shutdown is happening if the server closes after startup

I can't see any effect from a connection timeout in my environment.
Perhaps it depends on the server implementation. Could you provide the server code and Node version?

Consider adding a delay to the reconnect loop so that it will not try to connect more than 1 time per second per concurrent connection - Without this the connect loop will spin at ~23k connect attempts/second for -c 20, for example

Is it a big problem for you?
I think this solution isn't ideal. I think the issue should be solved using hdrhistogram (not collecting all data naively).

huntharo (Contributor, Author) commented Jan 8, 2024

Great catch! 👍

I understand that a TCP connection should be restarted when one of the parallel connections gets an IO/hyper error.

👍

I updated work_until to be simpler, I believe. I want to focus only on work_until in this PR and fix the others later.

Sounds good to me. I saw the alternative approach with work_until... if it passes all the tests, it's good with me, as it's quite a bit simpler.

Add a timeout to the TCP socket setup - this appears to be where some of the delay on shutdown is happening if the server closes after startup

I can't see any effect from a connection timeout in my environment. Perhaps it depends on the server implementation. Could you provide the server code and Node version?

Yeah, I'll need to check into it and make sure it's not something funny on my end. What I was seeing (I don't think this was captured in a video, as it was an intermediate issue that I fixed) was that I would get tens of thousands of reconnect attempts, but then it would freeze and hang, then start going again after some of the connection attempts timed out. I'm not sure why some of the connection attempts hang like that, but it may be because of running out of ephemeral ports. Once you try to make ~65k connections from localhost to localhost (that is key, since there is only one target IP) on a single port, and they all fail, the ephemeral ports are exhausted; subsequent connects may then wait (depending on the OS, most likely) to see if they can get a port, which they cannot until one of the TIME_WAIT sockets expires.
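For reference, the TCP setup timeout mentioned in the TODO could be as simple as wrapping the connect in `tokio::time::timeout`. A minimal sketch assuming tokio; the helper name and the caller-supplied limit are illustrative, not code from this PR:

```rust
use std::{io, time::Duration};
use tokio::{net::TcpStream, time::timeout};

// Bound the TCP connect itself so a hung connect attempt (e.g. while waiting
// for an ephemeral port to free up) cannot stall shutdown indefinitely.
async fn connect_with_timeout(addr: &str, limit: Duration) -> io::Result<TcpStream> {
    timeout(limit, TcpStream::connect(addr))
        .await
        .map_err(|_| io::Error::new(io::ErrorKind::TimedOut, "TCP connect timed out"))?
}
```

Something like `connect_with_timeout("127.0.0.1:3000", Duration::from_secs(5)).await` (address and limit purely illustrative) would then fail fast instead of hanging on a stalled connect.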

Consider adding a delay to the reconnect loop so that it will not try to connect more than 1 time per second per concurrent connection - Without this the connect loop will spin at ~23k connect attempts/second for -c 20, for example

Is it a big problem for you? I think this solution isn't ideal.

A delay in the reconnect loop is a good idea because of the TIME_WAIT issue above. You won't notice it as often when connecting to a remote host with multiple IPs (like a cloud load balancer), but you will notice it when connecting to a single IP or localhost.
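A sketch of the kind of per-task reconnect throttle being discussed, assuming tokio; the one-attempt-per-second floor mirrors the rate suggested in the TODO, and `connect_with_backoff` is a hypothetical helper, not code from this PR:

```rust
use std::time::Duration;
use tokio::{net::TcpStream, time::{sleep, Instant}};

// Retry connecting, but make no more than roughly one attempt per second per
// concurrent task, so failed connects to a single IP (e.g. localhost) do not
// spin through the ephemeral port range.
async fn connect_with_backoff(addr: &str) -> TcpStream {
    loop {
        let started = Instant::now();
        match TcpStream::connect(addr).await {
            Ok(stream) => return stream,
            Err(err) => {
                eprintln!("connect to {addr} failed: {err}");
                let elapsed = started.elapsed();
                if elapsed < Duration::from_secs(1) {
                    sleep(Duration::from_secs(1) - elapsed).await;
                }
            }
        }
    }
}
```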

I think the issue should be solved using hdrhistogram (not collecting all data naively).

Yes, collecting and retaining all of the data is a distinct issue that should be handled in another ticket, which I will create. It is a big deal, I think, because my tests start at 11k RPS and drift down to 9k RPS. If I exit and restart oha, it immediately goes back to 11k RPS, so the issue is not the target getting slower but oha getting slower. I think it's getting slower because of the memory accumulation from the stats collection, but it could be something else.

Speed slowing down (screenshot)

Speed recovers after restart (screenshot)
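For the follow-up ticket, the hdrhistogram approach hatoo mentions keeps memory constant no matter how many samples are recorded. A minimal sketch using the `hdrhistogram` crate; the bounds and sample values are illustrative only:

```rust
use hdrhistogram::Histogram;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Latencies in microseconds, trackable range 1 us .. 60 s, 3 significant digits.
    // The histogram's size is fixed; it does not grow with the number of samples.
    let mut latencies = Histogram::<u64>::new_with_bounds(1, 60_000_000, 3)?;

    for sample_us in [850_u64, 910, 1_200, 40_000] {
        latencies.record(sample_us)?;
    }

    println!("count = {}", latencies.len());
    println!("p50   = {} us", latencies.value_at_quantile(0.50));
    println!("p99   = {} us", latencies.value_at_quantile(0.99));
    Ok(())
}
```

Recording into a fixed-size histogram instead of retaining every latency sample would remove the unbounded growth suspected above.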

hatoo force-pushed the issue-361/improve-http2-connection-error-handling branch from 3719022 to 2d55312 on January 10, 2024 at 11:37
hatoo merged commit 44cdfe3 into hatoo:master on Jan 10, 2024
38 checks passed
hatoo (Owner) commented Jan 10, 2024

Thanks.

Let me merge this PR and continue on #369, #370, #371
