Conversation

Below is hey data, similar to data shown in #3487. All connections are run for 4 seconds, either 0.2 * 20 or 0.01 * 400. Each data group shows latency info for five different hey connection settings, each loading Puma more than the previous. The first row is always the number of threads available in the running Puma. The first two data sets use a 200 ms response. The third, minimal data set uses a 10 ms response and shows significant improvements with fast response times.

────── keep-alive comparison, 200 ms response ──────
────── non-keep-alive comparison, 200 ms response ──────
────── keep-alive comparison, 10 ms response ──────
With this branch I'm still seeing the long tail.
Interesting. On my system (i9, ten core) I have: Then, jumped the connections up:
Re why you're seeing different results than I am, I'm not sure. But given the number of threads/processes that are being spun up, maybe a core problem? I've got a 10 'hyper-threaded' core Intel i9 proc (older), so JFYI, locally I ran hey with very large

EDIT: Below is the data for the master branch, which shows the 'long tail' problem.
There are two words missing from my example (hint: it starts with a "b" and ends with "exec"). I booted my puma server without them. I also validated that the long tail behavior in worker mode is now gone as well. I'm going to start digging into your commits to try to understand what exactly changed. Thanks for working on this.
I hate it when that happens. Or: been there, done that. The current thread code calls the app, generates the response, then checks if the socket has another waiting request (with a wait period). 'Real world' keep-alive connections may have some delay between Puma transmitting the response and Puma's receipt of another request from the connection. The PR code now sends the connection back to the reactor, which means a keep-alive connection cannot monopolize the thread, which is the cause of the long-tail responses. An exception to this happens with pipelined/back-to-back requests, which are processed after the response is sent.
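The thread-monopoly point above can be sketched with a toy model (all names invented; Puma's real reactor is an IO selector, not a simple queue): after each response, a keep-alive connection is handed back to a shared queue instead of being looped on in the serving thread, so other waiting connections get their turn.

```ruby
# Toy model only: a Queue stands in for the reactor, strings for connections.
reactor = Queue.new
reactor << "keepalive-conn"
reactor << "other-conn"

served = []
2.times do
  conn = reactor.pop
  served << conn                            # "process" one request
  # Instead of looping on the keep-alive socket in this thread, the PR
  # hands it back to the reactor so other connections get a fair turn:
  reactor << conn if conn == "keepalive-conn"
end
```

With the old inline loop, `served` would have been the keep-alive connection twice before "other-conn" ever got a thread; here both connections are served in arrival order.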
I can also confirm the long tail is gone! How I started Puma:
    return false
    else
    begin
    if fast_check && @to_io.wait_readable(FAST_TRACK_KA_TIMEOUT)
With this code removed, should we also remove the FAST_TRACK_KA_TIMEOUT assignment?
Short: I think we should remove it now.
Long:
Previously I operated under a "don't change anything you don't have to" model of OSS stability. That works for a while, but eventually even unused code can cause its own stability problems in weird ways.
I can no longer find a reference to FAST_TRACK_KA_TIMEOUT in the codebase. Let's delete it as part of this PR while we're looking at it.
schneems
left a comment
I did a fairly comprehensive review. I took a ton of notes; if you want to feel like you're trapped in my head while doing a PR review, you can get the raw feed here: https://gist.github.com/schneems/9a47e6695e90e636d3d79dbdbc9d33bd.
Overall: I want to move forward with this PR. I think it's correct. I now understand how it works. My one concern is about the removal of calling wait_until_not_full, which I've described elsewhere as "thundering herd protection." Even with that concern, I think it's the right thing to do; I said more in a comment about it (and you can read my gist if you want).
Requested: I've made some cleanup suggestions and requests. I also requested some docs. This is some pretty deep-in-the-weeds logic we're dealing with. I want to set up future us (or other maintainers) for success while we're here and looking at it.
Amazing work here, thanks a ton 💜
    raise HttpParserError,
      "HEADER is longer than allowed, aborting client early."
    end
The method signature here is def reset(fast_check=true) but fast_check is no longer being used. We should remove it from the method to avoid confusion.
    # As an optimization, try to read the next request from the
    # socket for a short time before returning to the reactor.
    fast_check = @status == :run
    Thread.pass
It's not clear why this was added. My best guess is that you're hoping the reactor is scheduled to run so it can pull in a new TCP connection if there's one waiting, but it would be good to confirm.
Effectively: if a future maintainer sees this, the comment should describe the conditions/assumptions that led to adding it and its desired effects. I'm thinking of a situation where someone would want to know if it's safe to remove or not. An alternative situation would be if someone moves it to optimize something else (like performance): how would they measure/know if it's better to have it above/below the next line, or if it doesn't matter at all as long as it's before <something>.
> hoping the reactor is scheduled to run
Yes, that was the reason. Also, it might be a tough thing to verify either way. Maybe leaving it as is with a better comment is the best solution?
> Maybe leave as is with a better comment is the best solution?
I think a comment is fine. I'm just thinking of long-term maintenance, so if someone wants to modify the code there's a hint as to what's safe or desirable. Here's my stab at some wording:
In this code path the keepalive request may be put back into the reactor. Ideally we want to serve requests in the order they came in, even when those requests are re-using a connection. This Thread.pass is an attempt to hint to the OS that the reactor can read in a new connection from the TCP socket, which might have been sent after the currently processing request but before the next request in its connection.
LGTM. I wanted to test if the statement made any difference, but with MRI, JRuby & TruffleRuby, it might involve a lot of testing.
This weekend I looked at stats when using workers to see how well requests were distributed amongst the workers. As an example, with four workers, perfect distribution would be 25% for each worker. I think the worst I saw was a +/-15% diff, so lowest was 21.25%, highest was 28.75%. That's from memory, many runs were tighter than that.
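For reference on why the effect is hard to measure: Thread.pass is only a hint to the scheduler. It never blocks and makes no guarantee about which thread runs next, so correctness still has to come from a real synchronization point. A minimal illustration:

```ruby
# Thread.pass never blocks; the join below is what actually guarantees
# ordering. The pass merely invites the scheduler to run another thread.
results = []
t = Thread.new { results << :other }
Thread.pass        # hint only: the other thread may or may not run here
t.join             # the real guarantee that :other is recorded first
results << :main
```

Removing the `Thread.pass` line changes nothing about this snippet's result, which is exactly why its benefit in Puma can only show up (if at all) in scheduling-sensitive benchmarks.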
    @clustered = (options[:workers] || 0) > 1
This is only used in commented-out code.
    # uncommenting this may cause 'long tail' response times when all
    # workers are busy
    # pool.wait_until_not_full if @clustered
Remove comment
The wait_until_not_full code should also be deleted, since this is the only place that called it.
This logic protected against a "thundering herd" when one worker boots before the others, to make sure it doesn't steal ALL the requests. I'm hoping that the sleep logic added in #2079 (and later iterated on) might serve a similar purpose. However, I don't think we have any off-the-shelf benchmarks/reproductions to validate that assumption.
Either way, it's clear that that logic is confusing and not a good way to achieve our desired outcome. I support removing it and iterating on another solution if needed.
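A hedged sketch of the anti-thundering-herd idea behind wait_until_not_full (the TinyPool class and all names here are invented for illustration; Puma's real implementation lives in its ThreadPool): the accept loop blocks while every thread is busy, so a freshly booted worker can't greedily accept more work than it can run.

```ruby
# Illustration only, not Puma's ThreadPool.
class TinyPool
  def initialize(max_threads)
    @max  = max_threads
    @busy = 0
    @lock = Mutex.new
    @cond = ConditionVariable.new
  end

  # Block the accept loop while every thread is occupied, so a freshly
  # booted worker cannot ACCEPT more work than it has capacity to run.
  def wait_until_not_full
    @lock.synchronize { @cond.wait(@lock) while @busy >= @max }
  end

  def checkout
    @lock.synchronize { @busy += 1 }
  end

  def checkin
    @lock.synchronize { @busy -= 1; @cond.signal }
  end
end

pool = TinyPool.new(1)
pool.checkout                                  # the only thread is now busy
acceptor = Thread.new { pool.wait_until_not_full; :accepted }
sleep 0.05                                     # let the acceptor block on the full pool
pool.checkin                                   # a thread frees up
result = acceptor.value                        # acceptor proceeds once capacity exists
```

The trade-off discussed in this PR is that blocking here also delays pulling work off the socket, which can contribute to long-tail latency under heavy load.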
    end
    ...
    # Send keep-alive connections back to the reactor
    unless next_request_ready
Optional style comment: This would read better to me as if !pipeline_request_ready. I checked and rubocop wouldn't mind that syntax.
Or alternatively get rid of the negation:
    if pipeline_request_ready
      next
    else
      # ...
    end
I like that better actually as it makes it more obvious we're in a loop that will execute again if that condition is true.
    @selector_items = 0
    @selector_items_max = 0
Remove debugging code
    @selector_items += 1
    if @selector_items > @selector_items_max
      @selector_items_max = @selector_items
      # STDOUT.syswrite "\n#{Process.pid} @selector_items_max #{@selector_items_max}\n"
    end
Remove debugging code
    def wakeup!(client)
      if @block.call client
        @selector.deregister client.to_io
        @selector_items -= 1
    # uncommenting this may cause 'long tail' response times when all
    # workers are busy
    # pool.wait_until_not_full if @clustered
    sleep 0.001 while pool.out_of_band_running
Please add a doc explaining what this is doing.
    client.reset(fast_check)
    # This indicates that the socket has pipelined (multiple)
    # requests on it, so process them
    next_request_ready = if client.has_buffer
It's not clear to me whether the term "pipeline" here is a Puma-internal concept or whether it refers to HTTP pipelining. If the latter: technically, Puma has already prepared and sent a response to the last request, so the next request being ready doesn't mean the requests were pipelined; it just means the client already sent another request on that connection.
Apologies if this is just me misunderstanding the term.
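For clarity on the HTTP-pipelining reading: pipelined requests arrive back-to-back in the socket buffer, so after parsing one request the next is already sitting in the leftover bytes; that residue is what a check along the lines of client.has_buffer would detect. A minimal sketch (hypothetical payload, headers trimmed for brevity):

```ruby
# Two pipelined HTTP/1.1 requests as they might sit in a socket read buffer.
buffered = "GET /a HTTP/1.1\r\nHost: example\r\n\r\n" \
           "GET /b HTTP/1.1\r\nHost: example\r\n\r\n"

# After consuming the first request up to its blank line, leftover bytes
# remain in the buffer: the second request is already there, with no
# further socket read needed.
first, rest = buffered.split("\r\n\r\n", 2)
```

Whether those bytes were sent truly back-to-back (pipelining) or merely before the server finished responding (an eager keep-alive client) looks identical from the buffer's point of view, which may be the source of the terminology confusion.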
    else
      pool.wait_until_not_full
      # uncommenting this may cause 'long tail' response times when all
      # workers are busy
I think you may mean to say "when all worker threads are busy".
Sorry. I'm surprised I wrote that, as often the threads in the ThreadPool are called workers, which drives me nuts.
Maybe just "when all threads are busy"?
I don't think we should leave in commented out code with a description for what it would do if it's uncommented.
I think the wording "when all threads are busy" is valid, but my preference would be either:
- Delete the code altogether (older versions are still there in git)
- If you feel strongly it should stay in, then make it user configurable. For example, I suggested enable_keepalives :legacy as a possible codepath.
I'm fine removing it for now and only adding a config path if there are problems and we need to buy time for a longer term investigation and fix.
    requests < @max_fast_inline ||
      @thread_pool.busy_threads < @max_threads ||
      !client.listener.to_io.wait_readable(0)
    client.requests_served < @max_fast_inline ||
The PR description states:
Note that all keep-alive connections are closed when the number of requests hits the max_fast_inline setting.
But this logic indicates otherwise. You'll also need the busy_threads < max_threads to force the connection to close.
I see what you're saying. Previously we had this logic in two places. One of them (here) said "take the keepalive path if any of the following are true" and the other one said "close the connection if requests served is greater than @max_fast_inline."
Now we have only one location that will continue to serve keepalive requests beyond @max_fast_inline as long as not all worker threads are spawned. I think that's okay, actually. The @max_fast_inline config was a stop-gap to ensure that one keepalive connection didn't hog too many system resources. If there are still threads we could spawn, there's still capacity in the system. I think instead of adding back a second round of checks, we could add this caveat in the documentation.

In general my recommendation (the one I put in the Heroku docs) is to set minimum == maximum, and for that case the current docs are already true.
The current docs are wrong anyway after this change:

    # The number of requests to attempt inline before sending a client back to
    # the reactor to be subject to normal ordering.
    #
    # The default is 10.
    #
    # @example
    #   max_fast_inline 20

So they need to be updated either way. I suggest this:
    # A "fast inline" request is a request that is re-using an existing
    # connection via HTTP/1.1 "keepalive". Setting this value will close a connection
    # after the specified number of requests are made when there are no additional
    # Puma threads to spawn. If you want to ensure that a connection is closed after
    # the given number of requests, set minimum thread count equal to maximum thread
    # count.
    #
    # The previous design of Puma would serve these requests from the same thread that
    # handled the first request. In that previous model, setting this value would prevent
    # one request from monopolizing too many server resources.
    #
    # The current Puma design will serve buffered sequential requests from the same
    # thread, but any new requests that come in on the same connection will be
    # sent back to the reactor to read the request body and later be picked up
    # by a thread. With this model, resources are distributed more evenly by
    # default.
    #
    # The default is 10.
    #
    # @example
    #   max_fast_inline 1000

It's a bit verbose, but I prefer my docs to say more rather than less. FWIW nginx sets this value to 1000: https://nginx.org/en/docs/http/ngx_http_upstream_module.html#keepalive_requests.
Eventually: In the long term (not now) we could consider renaming this config and increasing the limit. I'm unsure if the warning nginx gives about memory is intrinsic to the domain or is an implementation detail of how they're managing memory. But if we're able to show that we're not leaking/retaining memory between subsequent repeated requests, then I think raising the limit would be good.
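The single remaining check being discussed can be reduced to a two-term sketch (a hypothetical helper, not Puma's actual method; the real condition also consults wait_readable on the listener):

```ruby
# Simplified predicate: keep serving a keep-alive connection inline if we
# are under the request cap, OR if there is still spare thread capacity.
def keep_inline?(requests_served, max_fast_inline, busy_threads, max_threads)
  requests_served < max_fast_inline || busy_threads < max_threads
end

keep_inline?(12, 10, 4, 5)  # over the cap, but a thread could still spawn: true
keep_inline?(12, 10, 5, 5)  # over the cap and fully saturated: false
```

This is the behavior the comment above describes: max_fast_inline only forces a close once thread capacity is also exhausted, which is why setting minimum threads equal to maximum makes the cap strict.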
Thank you for the thorough review. Really. It's a pretty serious change to Puma's operation, and I thought I could either work on it for another week (or more) or open it as a draft and get more eyes testing & reviewing it. I'm still busy with family things (funny how the Excel/Word/coder family member gets assigned things by default). Sorry about the debug code; I wasn't sure if some of the things added might be needed for request distribution/balancing, so I left it in. In the past the reactor has been used mostly for slow requests. Also, thanks for the gist, that will help. I think I can get a block of time to work on this tomorrow or Sunday.
My suggestions to remove the debug code took basically no time; no need for apology. Again, thanks for all the work. I think I might have gotten somewhere approaching this solution eventually, but you got here much faster and caught some things that wouldn't have been obvious to me. I also knew it was still in draft when I reviewed. I'm motivated to get a fix for this behavior. My ideal world would be that we could roll the whole thing out. Depending on your confidence level in the change, one option would be to ship code that has both codepaths and allow the end user to configure it if they need. We just introduced

Another path forward could be cutting a beta/pre-release, but IMHO not a lot of people try them unless specifically asked.
100% understood. For some reason when people in my family think of "problems with a printer" they also think of me. My undergrad is in mechanical engineering and I've met many people in my life who when they found that out exclaimed "oh, my cousin is a mechanic too!" (narrator: They are not the same thing). Take care with the family 💜
Sounds good. If you need to drop this completely or for a period of time, let me know how I/we can best support you.
    client.set_timeout(@first_data_timeout)
    client.set_timeout @first_data_timeout
    if @reactor.add client
Please please don't mix style changes into a PR like this. It's an important change and it's confusing if this is just style (I believe it is) or functional.
Removed style change and rebased
    # As an optimization, try to read the next request from the
    # socket for a short time before returning to the reactor.
    fast_check = @status == :run
    Thread.pass

I'm confused why this is here, even reading the above comments. What is the condition where you expect the reactor to be running simultaneously?

I've been meaning to check whether removing it has any effect. I think I added it very early, and it may not affect anything...
Public comment describing some private conversations: Handling the case where we have

I solicited feedback on why people use that setting today: https://ruby.social/@Schneems/113289823703358711. It's not scientific, but it is quiet. I think the only reason someone might want to use that setting is to bypass the overhead of the reactor if they KNOW that a request body is already fully buffered and will be non-keepalive'd (if a proxy in front of them handles that case and downgrades HTTP/1.1 to HTTP/1.0). Even then, it's a relatively small perf gain for a fairly large cost in code complexity (handling running a server where we have no reactor to buffer new requests).

My suggestion is: that we deprecate

I can make a PR that adds a deprecation warning to

The only unknown (in my mind) is if
- Short: Removing this feature makes the internals easier to reason about, and I do not believe there's a good reason to disable the reactor that we want to support. Research also points to disabling of the reactor being an implementation detail in an initial effort to configure the ACCEPT behavior; over time it morphed into effectively meaning "disable the reactor".
- Long:

## Background

The `queue_requests` config was originally introduced in #640 in 2015. It is marked as closing issue #612 (opened in 2014, ten years ago). The use case in the issue for this option to disable the reactor was this: a worker with 1 thread needs to send a request to itself to finish. The example given is downloading a PDF from the same server. If the reactor is accepting requests as fast as it can, the same worker that sent the request might accept the request; however, since it cannot complete the first request (waiting on the second), the second request will never execute, i.e. it will deadlock.

The problem described does not come from having a reactor that buffers request bodies, but rather from the application ACCEPT-ing a new request from the socket before it had capacity to process it. Disabling the reactor was effectively an implementation detail.

The issue was opened nearly 10 years ago, and the justification given for wanting to run with one max thread is that "not all code is thread safe". In the decade since, the Ruby community has come a long way. Sidekiq's threaded job queue has become the industry standard. It's been an expectation that all Ruby code should be threadsafe for quite some time now. An application can avoid this deadlock situation by either not requiring this functionality (making requests to itself) or running with more threads. It's also worth noting that the possibility of deadlock is not zero when moving to 2 threads, as both those threads might be executing a "give me a PDF" request; however, the chances are decreased.

I do not believe this to be a primary use case we need to support, and further, we've since mitigated the issue in other ways. In the same PR a `wait_until_not_full` method was introduced, but it was only called if queue_requests was disabled. This method still lives on today and is called unconditionally before the socket is read: https://github.com/puma/puma/blob/a59afabe782ec8869bb709ca991b32c0b2cc95fd/lib/puma/server.rb#L363

What does `wait_until_not_full` do? It is what I call an "anti-thundering-herd" measure, where a worker will first check if it has capacity to handle a new request (it sees if all threads are busy or not), and only when it has capacity does it ACCEPT a new request from the socket. The situation for a thundering herd looks like this: Puma boots with multiple workers; one worker boots before the others and accepts more requests than it can process. The other workers boot, have nothing to work on, and are idle. However, it didn't always work as intended. This is one fix: 9690d8f. I was present in the 2017 roundtable mentioned in the commit. Since then, this logic has been relatively stable.

Since that was introduced, we also introduced `wait_for_less_busy_worker`, which adds a tiny sleep to the worker to handle a related yet different scenario caused by OS thread scheduling. In this scenario, the OS might favor executing a busy worker rather than one with capacity to take new requests. More details are in the PR where it was added: #2079. The solution was to add a really tiny sleep to the busy workers to hint to the OS to let the less busy ones be scheduled. This became a default.

## Queue requests confusion

The `queue_requests` config that was originally introduced to gate ACCEPT behavior has eventually come to mean "disable the reactor" in the current codebase, as defaults have changed and morphed over the years.

While the original docs introduced in that PR hint that enabling it deactivates keepalive functionality:

> # Note that setting this to false disables HTTP keepalive and
> # slow clients will occupy a handler thread while the request
> # is being sent. A reverse proxy, such as nginx, can handle
> # slow clients and queue requests before they reach puma.

It wasn't exactly clear what that meant, i.e. when a keepalive request is made to puma with queue requests disabled, what should be the expected behavior? The released version of puma checks for several conditions before determining whether or not to send a "close" header to a keepalive request: https://github.com/puma/puma/blob/d2b3b309bf76f252819cf6c89d72249895daf629/lib/puma/request.rb#L167-L176. The ability to deactivate this keepalive behavior is recent and is staged on main; it was introduced in #3496. Notably, even HEAD does not deactivate this behavior based on the state of `queue_requests`. The current behavior is that while that logic returns `true` and the request is not placed back into the reactor (not possible when queue_requests is false), it will loop and continue to accept keepalive requests over the same connection, because this code https://github.com/puma/puma/blob/9712d29bc156a7fa05a6f4ec6deda5ac20df50be/lib/puma/server.rb#L468-L470 is inside of a while loop.

## Keepalive connection

This looping behavior for keepalive connections is not limited to when the reactor is disabled. There's a bug that currently means that keepalive requests effectively monopolize a thread for up to 10 sequential requests (by default), which is described in #3487. All signs point to needing to put the keepalive request into the reactor as part of the mitigation. However, as pointed out in this comment #3506 (comment), when request queuing is disabled we are left with a difficult path forward. We can either choose to continue monopolizing the thread and preserve the behavior that exists today, or we can close the connection, which aligns with the docs saying that this setting will not work with keepalive requests. Rather than make that determination for the user, this change suggests we should remove that codepath as an invalid configuration option.

## Usage

As part of informal research into usage of the feature, I asked on mastodon and no one appears to be using it:

- Link: https://ruby.social/@Schneems/113289823703358711
- Screenshot: https://www.dropbox.com/s/9t1ur0rpqz5zkku/Screenshot%202024-10-14%20at%2012.43.27%E2%80%AFPM.png?dl=0

I asked in slack groups, and the only example someone could give was this issue comment where (funny enough) they're using it to try to diagnose the very same keepalive buggy behavior: #2311 (comment).

From my personal experience debugging Ruby applications at Heroku, I have seen this feature used in the wild, so I know it's used. It's just that when it is used, people are not aware that they're losing slow-client protection and disabling the reactor. Usually they're after some other side-effect that disabling it caused. In short, I don't believe it's heavily used, and for the cases where it is used, I don't believe those are valid use cases. In the comments I added a note requesting people inform us of their use cases if they feel they have a valid one, so we can investigate it and possibly add a different configuration option.
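The `wait_for_less_busy_worker` idea mentioned above can be sketched as follows (constant and helper names are invented here; the real sleep value and plumbing live in Puma's cluster code):

```ruby
# Invented illustration of the "tiny sleep" hint: before ACCEPT-ing, a
# worker whose threads are busy naps briefly, so the OS scheduler is more
# likely to let an idle worker win the race for the next connection.
BUSY_NAP = 0.005  # hypothetical value, not Puma's actual constant

def nap_if_busy(busy_threads)
  sleep(BUSY_NAP) if busy_threads > 0
  busy_threads > 0  # report whether we napped
end

napped  = nap_if_busy(3)  # busy worker yields briefly before ACCEPT
skipped = nap_if_busy(0)  # idle worker goes straight to ACCEPT
```

The nap is deliberately tiny: it only has to tilt the scheduling race, not throttle throughput.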
I'm working on a revision to this; the revision performs better. Should be done within a few days.

Closing in favor of #3678
Description
PR - purpose and issues
The significant change in this PR affects 'keep-alive' connections/clients.
Currently, these connections may monopolize a thread, causing a small percentage of other clients to not be processed, which results in the 'long-tail' effect on response times (see #3487). This PR pushes the requests back to the reactor instead of processing them in Puma::Server's process_client loop. This change results in response times that are much more even when Puma is heavily loaded. Note that all keep-alive connections are closed when the number of requests hits the max_fast_inline setting.

I started working on this by first trying to remove the long-tail effect when Puma is heavily loaded, then I checked the code with the test suite. Three main issues existed.

First, back-to-back/pipelined requests were not being handled properly. I decided that those requests should stay in Puma::Server's process_client loop, similar to how successive keep-alive requests are currently processed.

Secondly, 'out-of-band' hooks were not being handled properly.

Thirdly, some logic existed that delayed processing so as to divide request load among workers if Puma was run clustered. This delay slows down Puma when under a heavy load.
Concerns
Tools like hey and wrk generate a very fast request stream with almost no delay between the receipt of a response and sending the next request, so results from these tools may not reflect real-world performance.
I'm not sure if an invalid request that is not the first request on a keep-alive or pipelined connection is always properly handled.
Check for how well clients are distributed between workers when running clustered.
Possible need for more tests.
I've left this as a draft in the hope of review and testing by others.
Data from using hey will be in another message.
Closes #3487
Closes #3443
Closes #2311
Your checklist for this pull request
- Add [ci skip] to the title of the PR.
- Add "Closes #issue" to the PR description or my commit messages.