
Issue 361 - Improve http2 connection error handling #363

Merged

Conversation

huntharo (Contributor) commented Jan 5, 2024

  • Partial fix for "Frequent OOM kills - Isolated to http2 - 50 GB of RAM usage on MacBook, causing system OOM" #361
  • ONLY implemented for the -z 10s (work_until) case (a rough sketch of the reconnect loop appears after this list)
  • TODO:
    • The futures are not aborted when the timer is hit, which will cause long-running requests to delay the program exit - this is only due to a borrow/move problem that I cannot figure out
    • Implement for the non-work_until cases
    • Add a timeout to the TCP socket setup - this appears to be where some of the delay on shutdown is happening if the server closes after startup
    • Consider adding a delay to the reconnect loop so that it will not try to connect more than 1 time per second per concurrent connection - without this the connect loop will spin at ~23k connect attempts/second for -c 20, for example
  • Test cases:
    • Start with the server not running at all (never connects)
      • Currently this will exit on time
      • IMPROVED: Previously this would attempt to connect once for each -c, fail, and immediately exit
      • IMPROVED: Currently this will repeatedly try to connect until the specified timeout expires, then it will exit
    • Start with the server running and leave it running
      • This works fine as before
    • Start with the server running, exit the server, then restart the server before the test completes
      • This initially makes requests
      • IMPROVED: Previously this would OOM even if the server restarted
      • IMPROVED: Currently this will reconnect and continue making requests if the server restarts
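The reconnect-until-timeout behavior described above could look roughly like the following. This is a minimal sketch assuming a tokio runtime, not the PR's actual implementation; `run_connection` is a hypothetical placeholder for "connect and drive HTTP/2 requests until the connection fails".

```rust
use tokio::time::{sleep_until, Instant};

// Hypothetical worker: keeps one connection busy until `deadline` (the -z duration),
// reconnecting whenever the connection dies with an IO/hyper error instead of
// exiting early or spinning forever.
async fn work_until(deadline: Instant) {
    loop {
        tokio::select! {
            // The -z duration elapsed: stop this worker.
            _ = sleep_until(deadline) => break,
            // The connection ended; on error, log and loop around to reconnect.
            result = run_connection() => {
                if let Err(err) = result {
                    eprintln!("connection error, reconnecting: {err}");
                }
            }
        }
    }
}

// Placeholder: connect, send requests, and return when the connection fails.
async fn run_connection() -> Result<(), Box<dyn std::error::Error>> {
    Ok(())
}
```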

After - Improved Behavior

oha-pr-363-after.mp4

Before - Existing Behavior

oha-pr-363-before.mp4

hatoo (Owner) commented Jan 6, 2024

Great catch! 👍

I understand that a TCP connection should be restarted when one of the parallel connections gets an IO/hyper error.

I updated work_until to be simpler, I believe.
I want to focus only on work_until in this PR and fix the others later.

Add a timeout to the TCP socket setup - this appears to be where some of the delay on shutdown is happening if the server closes after startup

I can't see any effect from a connection timeout in my environment.
Perhaps it depends on the server implementation. Could you provide the server code and Node version?

Consider adding a delay to the reconnect loop so that it will not try to connect more than 1 time per second per concurrent connection - Without this the connect loop will spin at ~23k connect attempts/second for -c 20, for example

Is it a big problem for you?
I think this solution isn't ideal. I think the issue should be solved using hdrhistogram (not collecting all data naively).

huntharo (Contributor, Author) commented Jan 8, 2024

Great catch! 👍

I understand that a TCP connection should be restarted when one of the parallel connections gets an IO/hyper error.

👍

I updated work_until to be simpler, I believe. I want to focus only on work_until in this PR and fix the others later.

Sounds good to me. I saw the alternative approach with work_until... if it passes all the tests, it's good with me, as it's quite a bit simpler.

Add a timeout to the TCP socket setup - this appears to be where some of the delay on shutdown is happening if the server closes after startup

I can't see any effect from a connection timeout in my environment. Perhaps it depends on the server implementation. Could you provide the server code and Node version?

Yeah, I'll need to check into it and make sure it's not something funny on my end. What I was seeing (I don't think this was captured in a video, as it was an intermediate issue that I fixed) was that I would get tens of thousands of reconnect attempts, but then it would freeze and hang, then start going again after some of the connection attempts timed out. I'm not sure why some of the connection attempts hang like that, but it may be because of running out of ephemeral ports. Once you try to make ~65k connections from localhost to localhost (that is key, since there is only one target IP) on a single port, and they all fail, the ephemeral ports are exhausted; subsequent connects may then wait (depending on the OS, most likely) to see if they can get a port, which they cannot until one of the TIME_WAIT sockets expires.
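For reference, the TCP setup timeout mentioned in the TODO could be as simple as wrapping the connect in `tokio::time::timeout`. A minimal sketch assuming tokio; the helper name and the caller-supplied limit are illustrative, not code from this PR:

```rust
use std::{io, time::Duration};
use tokio::{net::TcpStream, time::timeout};

// Bound the TCP connect itself so a hung connect attempt (e.g. while waiting
// for an ephemeral port to free up) cannot stall shutdown indefinitely.
async fn connect_with_timeout(addr: &str, limit: Duration) -> io::Result<TcpStream> {
    timeout(limit, TcpStream::connect(addr))
        .await
        .map_err(|_| io::Error::new(io::ErrorKind::TimedOut, "TCP connect timed out"))?
}
```

Something like `connect_with_timeout("127.0.0.1:3000", Duration::from_secs(5)).await` (address and limit purely illustrative) would then fail fast instead of hanging on a stalled connect.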

Consider adding a delay to the reconnect loop so that it will not try to connect more than 1 time per second per concurrent connection - Without this the connect loop will spin at ~23k connect attempts/second for -c 20, for example

Is it a big problem for you? I think this solution isn't ideal.

A delay in the reconnect loop is a good idea because of the TIME_WAIT issue above. You won't notice it as often when connecting to a remote host with multiple IPs (like a cloud load balancer), but you will notice it when connecting to a single IP or localhost.
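A sketch of the kind of per-task reconnect throttle being discussed, assuming tokio; the one-attempt-per-second floor mirrors the rate suggested in the TODO, and `connect_with_backoff` is a hypothetical helper, not code from this PR:

```rust
use std::time::Duration;
use tokio::{net::TcpStream, time::{sleep, Instant}};

// Retry connecting, but make no more than roughly one attempt per second per
// concurrent task, so failed connects to a single IP (e.g. localhost) do not
// spin through the ephemeral port range.
async fn connect_with_backoff(addr: &str) -> TcpStream {
    loop {
        let started = Instant::now();
        match TcpStream::connect(addr).await {
            Ok(stream) => return stream,
            Err(err) => {
                eprintln!("connect to {addr} failed: {err}");
                let elapsed = started.elapsed();
                if elapsed < Duration::from_secs(1) {
                    sleep(Duration::from_secs(1) - elapsed).await;
                }
            }
        }
    }
}
```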

I think the issue should be solved using hdrhistogram (not collecting all data naively).

Yes, collecting and retaining all of the data is a distinct issue that should be handled in another ticket, which I will create. It is a big deal, I think, because my tests start at 11k RPS and drift down to 9k RPS. If I exit and restart oha, it immediately goes back to 11k RPS, so the issue is not the target getting slower but oha getting slower. I think it's getting slower because of the memory accumulation from the stats collection, but it could be something else.

Speed slowing down (screenshot)

Speed recovers after restart (screenshot)
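For the follow-up ticket, the hdrhistogram approach hatoo mentions keeps memory constant no matter how many samples are recorded. A minimal sketch using the `hdrhistogram` crate; the bounds and sample values are illustrative only:

```rust
use hdrhistogram::Histogram;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Latencies in microseconds, trackable range 1 us .. 60 s, 3 significant digits.
    // The histogram's size is fixed; it does not grow with the number of samples.
    let mut latencies = Histogram::<u64>::new_with_bounds(1, 60_000_000, 3)?;

    for sample_us in [850_u64, 910, 1_200, 40_000] {
        latencies.record(sample_us)?;
    }

    println!("count = {}", latencies.len());
    println!("p50   = {} us", latencies.value_at_quantile(0.50));
    println!("p99   = {} us", latencies.value_at_quantile(0.99));
    Ok(())
}
```

Recording into a fixed-size histogram instead of retaining every latency sample would remove the unbounded growth suspected above.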

hatoo force-pushed the issue-361/improve-http2-connection-error-handling branch from 3719022 to 2d55312 on January 10, 2024 at 11:37
hatoo merged commit 44cdfe3 into hatoo:master on Jan 10, 2024
38 checks passed
hatoo (Owner) commented Jan 10, 2024

Thanks.

Let me merge this PR and continue on #369, #370, #371
