Skip to content

Add opt to enable keep-alives#3496

Merged
schneems merged 10 commits into
puma:masterfrom
slizco:enhancement-add-opt-to-disable-keepalives
Sep 26, 2024
Merged

Add opt to enable keep-alives#3496
schneems merged 10 commits into
puma:masterfrom
slizco:enhancement-add-opt-to-disable-keepalives

Conversation

@slizco

@slizco slizco commented Sep 19, 2024

Copy link
Copy Markdown
Contributor

Description

Puma does not currently support a configuration option for enabling/disabling keep-alive connections to the server. Because a thread can only handle a single connection at a time, the Puma server under default config will end up serving requests out of order, preferring to process requests off it's already in flight connection. See #3487 for more details.

To get around this issue, Puma will close connections every max_fast_inline requests. When max_fast_inline=1, the server should close the connection after each request, but this only happens most of the time. It does not happen all the time, due to the following conditional:

puma/lib/puma/request.rb

Lines 163 to 169 in 48b89b5

# Close the connection after a reasonable number of inline requests
# if the server is at capacity and the listener has a new connection ready.
# This allows Puma to service connections fairly when the number
# of concurrent connections exceeds the size of the threadpool.
force_keep_alive = requests < @max_fast_inline ||
@thread_pool.busy_threads < @max_threads ||
!client.listener.to_io.wait_readable(0)

Thus it's still possible to end up with long tail latencies because a small percentage of requests will sit in the queue for awhile.

Sometimes, the number of busy threads is less than the max and sometimes, there are no new connections to accept on the socket. However, these decisions are made at a point in time and the state of the server is constantly changing. They are subject to race conditions since other threads are concurrently accessing these variables and taking actions that modify their values.

The absolute safest way to ensure a connection is handed back to the reactor class once the request has been handled is to outright disable keep-alives. This PR proposes exposing an enable_keep_alives option which defaults to true, as to least disrupt the current behavior. Disabling keep-alives may be done by setting enable_keep_alives: false in the Puma config.

Your checklist for this pull request

  • I have reviewed the guidelines for contributing to this repository.
  • I have added (or updated) appropriate tests if this PR fixes a bug or adds a feature.
  • My pull request is 100 lines added/removed or less so that it can be easily reviewed.
  • If this PR doesn't need tests (docs change), I added [ci skip] to the title of the PR.
  • If this closes any issues, I have added "Closes #issue" to the PR description or my commit messages.
  • I have updated the documentation accordingly.
  • All new and existing tests passed, including Rubocop.

@schneems schneems left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR. I like the feature and think we should have some way to disable this behavior consistently. I have two requested and one optional change.

Config set to enable instead of disable.

Short:

  • Request: Please invert the logic to enable_keep_alives
  • Request: Please add the default value to the docs above the method as well

Long:

I'm thinking of how users will read the value when they're looking at their config/puma.rb without docs at hand and what they think it will do. The current logic for calling this code:

disable_keep_alives

That code would set @disable_keep_alives=falsewhich seems backwards. I think that's a small issue of flipping the default argument, however...

I'm also looking at other config examples like

def quiet(which=true)
that are boolean. Many use the argument of which and are framed as enabling some behavior. Such as drain_on_shutdown or clean_thread_locals. I'm also thinking of how this might be viewed from a user and I like representing the state positively instead of negatively:

enable_keep_alives
enable_keep_alives false

To me, the first line would enable keep alives (already the default) while the second would disable it. Comparing to

disable_keep_alives
disable_keep_alives false

The first line I would expect to disable the keepalives, but the next line presents a double negative that's a bit hard for my brain to parse. It also has the side benefit of being able to use it directly in the conditional without &&= !<variable>.

Code style

Short: Optionally update the code style to the if/else example below

Long:

I played with the style a bit and I found the &&= not entirely intuitive. It's used in other places in the file, but is not common. I tried this, but didn't love it:

 force_keep_alive = @enable_keep_alives && (
   requests < @max_fast_inline || 
   @thread_pool.busy_threads < @max_threads || 
   !client.listener.to_io.wait_readable(0)
 )

That code was tight, but hard to parse. I think this is more verbose, but clearer

force_keep_alive = if @enable_keep_alives
  requests < @max_fast_inline ||
    @thread_pool.busy_threads < @max_threads ||
    !client.listener.to_io.wait_readable(0)
else
  false
end

Another option is:

 force_keep_alive = @enable_keep_alives
 force_keep_alive &&= 
   requests < @max_fast_inline || 
   @thread_pool.busy_threads < @max_threads || 
   !client.listener.to_io.wait_readable(0)

My main worry is that someone might miss the trailing &&= in the existing PR when reading. It's a bit of a bike-shed though. I don't love any of them, but the if/else is my least bad pick. Especially if we start adding in comments, it could increase the distance between the effects. I also want to update the comments that describe the code, but that should be in a new PR and not cloud up this one.

@slizco slizco changed the title Add opt to disable keep-alives Add opt to enable keep-alives Sep 23, 2024
@slizco slizco force-pushed the enhancement-add-opt-to-disable-keepalives branch from 8d2b68d to efe2c05 Compare September 23, 2024 21:02
@slizco

slizco commented Sep 23, 2024

Copy link
Copy Markdown
Contributor Author

@schneems I've updated this according to your suggestions. Let know your thoughts!

@schneems schneems left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks great. I made some suggestions to comments/docs I would like to apply. I'm approving now and will apply those changes and merge once the suite passes.

Thanks a ton for working on this!

Comment thread test/test_puma_server.rb Outdated
Comment thread test/test_puma_server.rb Outdated
Comment thread lib/puma/dsl.rb Outdated
@dentarg

dentarg commented Sep 25, 2024

Copy link
Copy Markdown
Member

Some lint fixes needed: https://github.com/puma/puma/actions/runs/11038443413/job/30661661652?pr=3496

@dentarg dentarg removed the waiting-for-review Waiting on review from anyone label Sep 25, 2024
Comment thread lib/puma/dsl.rb Outdated
@schneems

Copy link
Copy Markdown
Member

@dentarg thanks. I fixed up the trailing whitespace, now I'm seeing failures in CI

Prepare all required actions
Getting action download info
Download action repository 'actions/checkout@v4' (SHA:692973e3d937129bcbf40652eb9f2f61becf3332)
Download action repository 'ruby/setup-ruby@v1' (SHA:f3[21](https://github.com/puma/puma/actions/runs/11039500154/job/30665340740#step:1:25)cf5a4d1533575411f8752cf25b86478b0442)
Warning: Failed to download action 'https://internal-api.service.iad.github.net/repos/ruby/setup-ruby/tarball/f321cf5a4d1533575411f8752cf25b86478b0442'. Error: Name or service not known (internal-api.service.iad.github.net:443)
Warning: Back off 12.968 seconds before retry.
Warning: Failed to download action 'https://internal-api.service.iad.github.net/repos/ruby/setup-ruby/tarball/f321cf5a4d1533575411f8752cf25b86478b0442'. Error: Name or service not known (internal-api.service.iad.github.net:443)
Warning: Back off 18.971 seconds before retry.
Error: Name or service not known (internal-api.service.iad.github.net:443)

Not seeing anything on GitHub's status site

Comment thread lib/puma/dsl.rb
@schneems schneems merged commit 37c9de5 into puma:master Sep 26, 2024
@schneems

Copy link
Copy Markdown
Member

Third time was the charm. CI passed, merging.

@slizco

slizco commented Sep 26, 2024

Copy link
Copy Markdown
Contributor Author

Thanks @schneems and @dentarg!

schneems added a commit that referenced this pull request Oct 15, 2024
- Short: Removing this feature makes the internals easier to reason about and I do not believe there's a good reason to disable the reactor that we want to support. Research also points to disabling of the reactor to be an implementation detail in an initial effort to configure the ACCEPT behavior, and over time it morphed into effectively meaning "disable the reactor"

- Long: 

## Background

The `queue_requests` config was originally introduced in #640 in 2015. It is marked as closing this issue #612 (opened in 2014, ten years ago). 

The use case in the issue for this option to disable the reactor was this: 

A worker with 1 thread needs to send a request to itself to finish. The example given is downloading a PDF from the same server. If the reactor is accepting requests as fast as it can, the same worker that sent the request might accept the request, however, since it cannot complete the first request (waiting on the second), the second request will never execute. i.e. it will deadlock.

The problem described does not come from having a reactor that buffers request bodies, but rather from the application ACCEPT-ing a new request from the socket before it had capacity to process it. Disabling the reactor was effectively an implementation detail.

The issue was opened nearly 10 years ago and the justification given for wanting to run with one max thread is that "not all code is thread safe". In the decade since, the Ruby community has come a long way. Sidekiq's threaded job queue has become the industry standard. It's become an expectation that all Ruby code should be threadsafe for quite some time now. An application can avoid this deadlock situation by either not requiring this functionality (making requests to itself) or running with more threads. It's also worth noting that the possibility of deadlock is not zero when moving to 2 threads, as both those threads might be executing a "give me a PDF" request, however the chances are decreased. I do not believe this to be a primary use case we need to support, and further, we've since mitigated the issue in other ways.

In the same PR a `wait_until_not_full` method was introduced. But it was only called if queue_requests was disabled. This method still lives on today, but is called unconditionally before the socket is read. https://github.com/puma/puma/blob/a59afabe782ec8869bb709ca991b32c0b2cc95fd/lib/puma/server.rb#L363

What does `wait_until_not_full` do? It is what I call "anti-thundering-herd" measures where a worker will first check if it has capacity to handle a new request (it sees if all threads are busy or not) and only when it has capacity does it ACCEPT a new request from the socket. The situation for a thundering herd looks like this: Puma boots with multiple workers, one worker boots before the others and accepts more requests than it can process. The other workers boot and have nothing to work on and are idle. However it didn't always work as intended. This is one fix 9690d8f. I was present in the 2017 roundtable mentioned in the commit. Since then, this logic has been relatively stable.

Since that was introduced, we also introduced `wait_for_less_busy_worker` which adds a tiny sleep to the worker to handle a related yet different scenario caused by OS thread scheduling. In this scenario, the OS might favor executing a busy worker rather than one with capacity to take new requests. More details are in the PR where it was added #2079. The solutio was to add a really tiny sleep to the busy workers to hint to the OS to let the less busy ones be scheduled. This became a default 

## Queue requests confusion

The `queue_requests` config that was originally introduced to gate ACCEPT behavior, has eventually come to mean "disable the reactor" in the current codebase as defaults have changed and morphed over the years.

While the original docs introduced in that PR hint that enabling it deactivates keepalive functionality:

>      # Note that setting this to false disables HTTP keepalive and
>      # slow clients will occupy a handler thread while the request
>      # is being sent. A reverse proxy, such as nginx, can handle
>      # slow clients and queue requests before they reach puma.

It wasn't exactly clear what that meant. i.e. when a keepalive request is made to puma with queue requests disabled, what should be the expected behavior?

The released version of puma checks for several conditions before determining whether or not to send a "close" header to a keepalive request https://github.com/puma/puma/blob/d2b3b309bf76f252819cf6c89d72249895daf629/lib/puma/request.rb#L167-L176. The ability to deactivate this keepalive behavior is recent and is staged on main, it was introduced in #3496. Notably even HEAD does not deactivate this behavior based on the state of `queue_requests`.

The current behavior is that while that logic returns `true` and the request is not placed back into the reactor (not possible when queue_requests is false) it will loop and continue to accept keepalive requests over the same connection because this code https://github.com/puma/puma/blob/9712d29bc156a7fa05a6f4ec6deda5ac20df50be/lib/puma/server.rb#L468-L470 is inside of a while loop.

## Keepalive connection

This looping behavior for keepalive connections is not limited to when the reactor is disabled. There's a bug that currently means that keepalive requests effectively monopolize a thread for up to 10 sequential requests (by default) which is described here #3487.

All signs point to needing to put the keepalive request into the reactor as a part of the mitigation. However, as pointed out in this comment #3506 (comment) when request queuing is disabled we are left with a difficult path forward. We can either choose to continue monopolizing the thread and preserve the behavior that exists today. Or we can close the connection which aligns with the docs saying that this setting will not work with keepalive requests. Rather than make that determination for the user, this change suggests we should remove that codepath as an invalid configuration option.

## Usage

As part of informal research into useage of the feature I asked on mastodon and no one appears to be using it:

- Link: https://ruby.social/@Schneems/113289823703358711
- Screenshot: https://www.dropbox.com/s/9t1ur0rpqz5zkku/Screenshot%202024-10-14%20at%2012.43.27%E2%80%AFPM.png?dl=0

I asked in slack groups and the only example someone could give was this issue comment where (funny enough) they're using it to try to diagnose the very same keepalive buggy behavior #2311 (comment).

From my personal experience debugging Ruby applications at Heroku, I have seen this feature used in the wild, so I know it's used. It's just that when it is used, people are not aware that they're losing slow-client protection and disabling the reactor. Usually they're after some other side-effect that disabling it caused.

In short, I don't believe it's heavilly used, and for the cases where it is used, I don't believe those are valid use cases. In the comments I added a note requesting people inform us of their use cases if they feel they have a valid one so we can investigate it and possibly add a different configuration option.
@schneems schneems mentioned this pull request Oct 15, 2024
7 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants