
Fix long tail response problem#3506

Closed
MSP-Greg wants to merge 6 commits into puma:master from MSP-Greg:00-long-tail

Conversation


@MSP-Greg MSP-Greg commented Sep 29, 2024

Description

PR - purpose and issues

The significant change in this PR affects 'keep-alive' connections/clients.

Currently, these connections may monopolize a thread, causing a small percentage of other clients to not be processed, which results in the 'long-tail' effect on response times (see #3487). This PR pushes such requests back to the reactor instead of processing them in Puma::Server's process_client loop. This change results in much more even response times when Puma is heavily loaded.

Note that all keep-alive connections are closed when the number of requests hits the max_fast_inline setting.
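For reference, a sketch of the config DSL setting involved — `max_fast_inline` is Puma's knob for this; the value below is illustrative, so check your Puma version for its default:

```ruby
# puma.rb — illustrative value; consult your Puma version for the default
max_fast_inline 10  # keep-alive connections are closed after this many requests
```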

I started working on this by first trying to remove the long-tail effect when Puma is heavily loaded, then I checked the code with the test suite. Three main issues existed.

First, back-to-back/pipelined requests were not being handled properly. I decided that those requests should stay in Puma::Server's process_client loop, similar to how successive keep-alive requests are currently processed.

Second, 'out-of-band' hooks were not being handled properly.

Third, some logic delayed processing to divide request load among workers when Puma ran clustered. This delay slows Puma down under heavy load.
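As a toy model (not Puma's actual code) of the dispatch decision described above — an already-buffered pipelined request is handled inline, an idle keep-alive connection goes back to the reactor instead of tying up the thread, and keep-alive connections close once `max_fast_inline` is reached (names and the default of 10 are assumptions here):

```ruby
# Toy sketch of the per-request decision this PR makes after a response is sent.
def next_step(buffered_request:, keep_alive:, requests_served:, max_fast_inline: 10)
  return :close  if keep_alive && requests_served >= max_fast_inline
  return :inline if buffered_request        # pipelined data is already read
  keep_alive ? :reactor : :close            # never block waiting on the socket
end
```

For example, `next_step(buffered_request: false, keep_alive: true, requests_served: 3)` yields `:reactor` — the key behavioral change versus master, where the thread would wait on the socket instead.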

Concerns

  • Tools like hey and wrk generate a very fast request stream, with almost no delay between receipt of a response and sending the next request. Hence, results from these tools may not reflect real-world performance.

  • I'm not sure if an invalid request that is not the first request on a keep-alive or pipelined connection is always properly handled.

  • Check how well clients are distributed between workers when running clustered.

  • Possible need for more tests.

I've left this as a draft in the hope that others will review and test it.

Data from using hey will be in another message.

Closes #3487
Closes #3443
Closes #2311

Your checklist for this pull request

  • I have reviewed the guidelines for contributing to this repository.
  • I have added (or updated) appropriate tests if this PR fixes a bug or adds a feature.
  • My pull request is 100 lines added/removed or less so that it can be easily reviewed.
  • If this PR doesn't need tests (docs change), I added [ci skip] to the title of the PR.
  • If this closes any issues, I have added "Closes #issue" to the PR description or my commit messages.
  • I have updated the documentation accordingly.
  • All new and existing tests passed, including Rubocop.


MSP-Greg commented Sep 29, 2024

Below is hey data, similar to the data shown in #3487. All connections run for 4 seconds: either 20 requests with a 0.2 s response or 400 requests with a 0.01 s response.

Each data group shows latency info for five different hey connection settings, each loading Puma more than the previous. The first row of each group always uses a connection count equal to the number of threads available in the running Puma.

The first two data sets use a 200 ms response. The third, minimal data set uses a 10 ms response and shows significant improvements with fast response times.
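To make the table parameters concrete, here is a reconstruction of the shape of each run (hey flags assumed from its CLI; port and path taken from the later comments): each connection does roughly 4 seconds of work, so the total `-n` scales with the connection count.

```shell
# 20 requests per connection at 0.2 s each ≈ 4 s per connection
conns=10
reqs=$((conns * 20))   # total request count passed to -n
echo "hey -c $conns -n $reqs http://127.0.0.1:40001/sleep0.2"
```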

────────────────────────────────────────────────── keep-alive comparison 200 ms response
Branch: master           Puma:     -t5:5
                         ─────────────────────────── Latency ───────────────────────────
Mult/Conn     req/sec      10%     25%     50%     75%     90%     95%     99%    100%
1.00   5      23.8580    0.2027  0.2079  0.2095  0.2100  0.2107  0.2120  0.2188  0.2235
1.33   7      21.3811    0.2025  0.2076  0.2095  0.2103  0.2154  0.4175  2.3044  2.3044
1.67   8      21.2259    0.2065  0.2093  0.2099  0.2112  0.2140  0.8549  2.3122  2.5250
2.00  10      23.2740    0.2072  0.2092  0.2096  0.2102  0.2122  2.3153  2.3259  2.5282
3.00  15      22.0826    0.2025  0.2089  0.2096  0.2114  2.3043  4.3776  5.0322  5.0322

Branch: 00-long-tail     Puma:     -t5:5
                         ─────────────────────────── Latency ───────────────────────────
Mult/Conn     req/sec      10%     25%     50%     75%     90%     95%     99%    100%
1.00   5      24.1246    0.2015  0.2024  0.2086  0.2100  0.2114  0.2126  0.2191  0.2241
1.33   7      23.7857    0.2090  0.2134  0.2208  0.3937  0.4025  0.4061  0.4226  0.4226
1.67   8      23.6094    0.2168  0.2275  0.3917  0.4025  0.4087  0.4106  0.4238  0.4252
2.00  10      23.8717    0.4091  0.4153  0.4178  0.4186  0.4199  0.4209  0.4292  0.4317
3.00  15      23.9169    0.6050  0.6135  0.6214  0.6277  0.6302  0.6320  0.6386  0.6386


Branch: master           Puma: -w2 -t5:5
                         ─────────────────────────── Latency ───────────────────────────
Mult/Conn     req/sec      10%     25%     50%     75%     90%     95%     99%    100%
1.00  10      47.5677    0.2017  0.2028  0.2096  0.2104  0.2120  0.2129  0.2184  0.2202
1.33  13      32.4865    0.2017  0.2047  0.2090  0.2099  0.2140  0.4150  2.2956  2.2956
1.67  17      41.4058    0.2018  0.2066  0.2093  0.2100  0.2123  0.8438  2.4776  2.4776
2.00  20      45.2614    0.2017  0.2039  0.2092  0.2100  0.2122  2.2528  2.3099  2.5261
3.00  30      42.7187    0.2024  0.2085  0.2096  0.2112  1.6485  3.0916  3.1624  3.1624

Branch: 00-long-tail     Puma: -w2 -t5:5
                         ─────────────────────────── Latency ───────────────────────────
Mult/Conn     req/sec      10%     25%     50%     75%     90%     95%     99%    100%
1.00  10      42.1236    0.2036  0.2088  0.2103  0.2122  0.2216  0.3487  0.3984  0.4045
1.33  13      44.7444    0.2055  0.2102  0.2159  0.3842  0.4055  0.4101  0.4221  0.4221
1.67  17      40.5307    0.2087  0.2170  0.4090  0.4189  0.4384  0.5613  0.6042  0.6042
2.00  20      43.5276    0.2857  0.3876  0.4173  0.4255  0.5110  0.5568  0.6008  0.6123
3.00  30      44.1125    0.4261  0.5856  0.6204  0.6377  0.6649  0.7637  0.8287  0.8287


Branch: master           Puma: -w4 -t5:5
                         ─────────────────────────── Latency ───────────────────────────
Mult/Conn     req/sec      10%     25%     50%     75%     90%     95%     99%    100%
1.00  20      95.4649    0.2020  0.2053  0.2088  0.2098  0.2120  0.2134  0.2179  0.2181
1.33  27      60.4560    0.2015  0.2050  0.2090  0.2100  0.2122  0.2139  1.8682  1.8682
1.67  33      72.5127    0.2018  0.2035  0.2093  0.2105  0.2251  1.6525  1.6769  1.6769
2.00  40      87.0676    0.2009  0.2022  0.2086  0.2104  1.4284  1.4444  1.4625  1.4671
3.00  60      77.8056    0.2012  0.2047  0.2096  0.2123  1.8633  1.8719  1.8955  1.8955

Branch: 00-long-tail     Puma: -w4 -t5:5
                         ─────────────────────────── Latency ───────────────────────────
Mult/Conn     req/sec      10%     25%     50%     75%     90%     95%     99%    100%
1.00  20      94.2740    0.2017  0.2041  0.2087  0.2099  0.2118  0.2128  0.2191  0.2288
1.33  27      65.4170    0.2023  0.2070  0.2123  0.4048  0.4169  0.4180  0.4247  0.4247
1.67  33      80.0755    0.2089  0.2170  0.3986  0.4094  0.4139  0.4177  0.4234  0.4234
2.00  40      74.3279    0.2150  0.3892  0.4078  0.4176  0.6102  0.6141  0.6188  0.6237
3.00  60      79.0030    0.2155  0.4113  0.6133  0.6390  0.8070  0.8116  0.8333  0.8333
────────────────────────────────────────────────── non-keep-alive comparison 200 ms response
Branch: master           Puma:     -t5:5
-disable-keepalive       ─────────────────────────── Latency ───────────────────────────
Mult/Conn     req/sec      10%     25%     50%     75%     90%     95%     99%    100%
1.00   5      23.3236    0.2077  0.2096  0.2145  0.2172  0.2190  0.2220  0.2281  0.2309
1.33   7      23.1483    0.2163  0.2220  0.2404  0.3985  0.4112  0.4173  0.4280  0.4280
1.67   8      23.7300    0.2173  0.2261  0.3994  0.4153  0.4194  0.4215  0.4257  0.4304
2.00  10      23.4794    0.4170  0.4199  0.4243  0.4279  0.4331  0.4358  0.4412  0.4424
3.00  15      23.5403    0.6194  0.6264  0.6314  0.6390  0.6449  0.6468  0.6559  0.6559

Branch: 00-long-tail     Puma:     -t5:5
-disable-keepalive       ─────────────────────────── Latency ───────────────────────────
Mult/Conn     req/sec      10%     25%     50%     75%     90%     95%     99%    100%
1.00   5      23.3970    0.2072  0.2089  0.2126  0.2160  0.2178  0.2218  0.2265  0.2265
1.33   7      23.6786    0.2119  0.2164  0.2246  0.3957  0.4061  0.4097  0.4133  0.4133
1.67   8      23.5164    0.2162  0.2230  0.3847  0.4060  0.4118  0.4144  0.4273  0.4290
2.00  10      23.9209    0.4036  0.4075  0.4115  0.4166  0.4195  0.4202  0.4223  0.4227
3.00  15      23.8473    0.6129  0.6206  0.6261  0.6291  0.6306  0.6319  0.6333  0.6333


Branch: master           Puma: -w2 -t5:5
-disable-keepalive       ─────────────────────────── Latency ───────────────────────────
Mult/Conn     req/sec      10%     25%     50%     75%     90%     95%     99%    100%
1.00  10      47.0680    0.2065  0.2087  0.2116  0.2150  0.2172  0.2181  0.2239  0.2239
1.33  13      45.9784    0.2139  0.2178  0.2248  0.3646  0.4044  0.4076  0.4251  0.4251
1.67  17      46.8638    0.2201  0.2265  0.4091  0.4158  0.4191  0.4208  0.4252  0.4252
2.00  20      46.9818    0.4140  0.4177  0.4217  0.4254  0.4300  0.4329  0.4389  0.4389
3.00  30      46.1165    0.4446  0.6270  0.6326  0.6381  0.6471  0.6510  0.6587  0.6587

Branch: 00-long-tail     Puma: -w2 -t5:5
-disable-keepalive       ─────────────────────────── Latency ───────────────────────────
Mult/Conn     req/sec      10%     25%     50%     75%     90%     95%     99%    100%
1.00  10      43.4246    0.2052  0.2075  0.2105  0.2135  0.2167  0.3965  0.4113  0.4122
1.33  13      47.3862    0.2061  0.2087  0.2146  0.3749  0.3986  0.4027  0.4189  0.4189
1.67  17      47.3194    0.2132  0.2208  0.4022  0.4105  0.4152  0.4208  0.4291  0.4291
2.00  20      46.1104    0.2263  0.4032  0.4143  0.4218  0.4366  0.6029  0.6127  0.6274
3.00  30      45.7597    0.4172  0.4458  0.6156  0.6317  0.8064  0.8186  0.8357  0.8357


Branch: master           Puma: -w4 -t5:5
-disable-keepalive       ─────────────────────────── Latency ───────────────────────────
Mult/Conn     req/sec      10%     25%     50%     75%     90%     95%     99%    100%
1.00  20      86.5973    0.2055  0.2066  0.2103  0.2128  0.2151  0.2163  0.4150  0.4223
1.33  27      89.2958    0.2120  0.2140  0.2187  0.4062  0.4160  0.4192  0.4234  0.4234
1.67  33      93.2801    0.2177  0.2230  0.4094  0.4155  0.4199  0.4216  0.4246  0.4246
2.00  40      93.9408    0.4058  0.4160  0.4195  0.4232  0.4259  0.4278  0.4311  0.4320
3.00  60      92.7423    0.2156  0.6150  0.6220  0.6273  0.6335  0.6417  0.8234  0.8234

Branch: 00-long-tail     Puma: -w4 -t5:5
-disable-keepalive       ─────────────────────────── Latency ───────────────────────────
Mult/Conn     req/sec      10%     25%     50%     75%     90%     95%     99%    100%
1.00  20      79.9562    0.2050  0.2076  0.2094  0.2122  0.2157  0.4068  0.4154  0.4157
1.33  27      82.7885    0.2069  0.2099  0.2140  0.3991  0.4073  0.4116  0.4238  0.4238
1.67  33      87.2283    0.2093  0.2147  0.4008  0.4080  0.4122  0.4147  0.4211  0.4211
2.00  40      88.0952    0.2142  0.4026  0.4107  0.4160  0.6030  0.6083  0.6162  0.6211
3.00  60      87.6443    0.2143  0.4195  0.6121  0.6181  0.6252  0.8049  0.8175  0.8175
────────────────────────────────────────────────── keep-alive comparison 10 ms response
Branch: master           Puma: -w4 -t5:5
                         ─────────────────────────── Latency ───────────────────────────
Mult/Conn     req/sec      10%     25%     50%     75%     90%     95%     99%    100%
1.00  20    1667.8435    0.0103  0.0105  0.0107  0.0109  0.0111  0.0120  0.0213  0.1219
1.33  27    1736.9362    0.0103  0.0104  0.0105  0.0108  0.0116  0.0507  0.0795  0.1286
1.67  33    1769.9278    0.0103  0.0104  0.0105  0.0108  0.0128  0.0924  0.1049  0.1485
2.00  40    1759.1426    0.0103  0.0104  0.0105  0.0108  0.0114  0.1269  0.1321  0.1577
3.00  60    1778.4014    0.0103  0.0104  0.0105  0.0107  0.0142  0.2376  0.2451  0.3552

Branch: 00-long-tail     Puma: -w4 -t5:5
                         ─────────────────────────── Latency ───────────────────────────
Mult/Conn     req/sec      10%     25%     50%     75%     90%     95%     99%    100%
1.00  20    1734.6554    0.0104  0.0105  0.0107  0.0111  0.0135  0.0167  0.0196  0.0263
1.33  27    1879.7792    0.0106  0.0110  0.0125  0.0172  0.0197  0.0203  0.0212  0.0336
1.67  33    1820.3980    0.0109  0.0119  0.0166  0.0206  0.0235  0.0281  0.0311  0.0375
2.00  40    1849.7299    0.0111  0.0145  0.0203  0.0236  0.0312  0.0344  0.0416  0.0533
3.00  60    1885.2031    0.0133  0.0204  0.0305  0.0413  0.0501  0.0533  0.0644  0.0743


schneems commented Oct 2, 2024

With this branch I'm still seeing the long tail

$ cat Gemfile | grep puma
gem "puma", github: "MSP-Greg/puma", branch: "00-long-tail"
$ puma -t3 -p 3000
$ hey -c 4 -n 200 http://localhost:3000/

Summary:
  Total:	14.3295 secs
  Slowest:	2.2595 secs
  Fastest:	0.2006 secs
  Average:	0.2671 secs
  Requests/sec:	13.9572

  Total data:	2600 bytes
  Size/request:	13 bytes

Response time histogram:
  0.201 [1]	|
  0.407 [185]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.612 [7]	|■■
  0.818 [1]	|
  1.024 [0]	|
  1.230 [0]	|
  1.436 [0]	|
  1.642 [0]	|
  1.848 [1]	|
  2.054 [4]	|■
  2.260 [1]	|


MSP-Greg commented Oct 2, 2024

@schneems

Interesting. On my system (i9 ten core) I have:

Puma starting in single mode...
* Puma version: 6.4.3 (ruby 3.4.0-p-1) ("The Eagle of Durango")
*  Min threads: 3
*  Max threads: 3
*  Environment: development
*          PID: 461
* Listening on http://127.0.0.1:40001
Use Ctrl-C to stop
~$ hey -c 4 -n 200 http://127.0.0.1:40001/sleep0.2

Summary:
  Total:        13.4876 secs
  Slowest:      0.4042 secs
  Fastest:      0.2007 secs
  Average:      0.2677 secs
  Requests/sec: 14.8284

  Total data:   1800 bytes
  Size/request: 9 bytes

Response time histogram:
  0.201 [1]     |
  0.221 [133]   |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.241 [0]     |
  0.262 [0]     |
  0.282 [0]     |
  0.302 [0]     |
  0.323 [0]     |
  0.343 [0]     |
  0.364 [0]     |
  0.384 [0]     |
  0.404 [66]    |■■■■■■■■■■■■■■■■■■■■


Latency distribution:
  10% in 0.2014 secs
  25% in 0.2024 secs
  50% in 0.2042 secs
  75% in 0.3968 secs
  90% in 0.4012 secs
  95% in 0.4019 secs
  99% in 0.4036 secs

Then, jumped the connections up:

~$ hey -c 6 -n 300 http://127.0.0.1:40001/sleep0.2

Summary:
  Total:        20.1753 secs
  Slowest:      0.4131 secs
  Fastest:      0.2040 secs
  Average:      0.4013 secs
  Requests/sec: 14.8696

  Total data:   2700 bytes
  Size/request: 9 bytes

Response time histogram:
  0.204 [1]     |
  0.225 [2]     |
  0.246 [0]     |
  0.267 [0]     |
  0.288 [0]     |
  0.309 [0]     |
  0.329 [0]     |
  0.350 [0]     |
  0.371 [0]     |
  0.392 [0]     |
  0.413 [297]   |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■


Latency distribution:
  10% in 0.4007 secs
  25% in 0.4018 secs
  50% in 0.4035 secs
  75% in 0.4045 secs
  90% in 0.4053 secs
  95% in 0.4066 secs
  99% in 0.4085 secs


MSP-Greg commented Oct 2, 2024

@schneems

Re why you're seeing different results than I am, I'm not sure. But, given the number of threads/processes being spun up, maybe a core-count difference? I've got an older 10-core hyper-threaded Intel i9, so Etc.nprocessors reports 20.

JFYI, locally I ran hey with very large -n numbers, as I wanted to make sure Puma hit 'steady-state' for the tests. I don't ever recall seeing numbers like you've shown...

EDIT: Below is the data for master branch, which shows the 'long tail' problem.

~$ hey -c 4 -n 200 http://127.0.0.1:40001/sleep0.2

Summary:
  Total:        14.2555 secs
  Slowest:      2.2078 secs
  Fastest:      0.2003 secs
  Average:      0.2560 secs
  Requests/sec: 14.0297

  Total data:   1800 bytes
  Size/request: 9 bytes

Response time histogram:
  0.200 [1]     |
  0.401 [187]   |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.602 [6]     |■
  0.803 [0]     |
  1.003 [1]     |
  1.204 [0]     |
  1.405 [0]     |
  1.606 [0]     |
  1.806 [0]     |
  2.007 [4]     |■
  2.208 [1]     |


Latency distribution:
  10% in 0.2005 secs
  25% in 0.2006 secs
  50% in 0.2008 secs
  75% in 0.2010 secs
  90% in 0.2013 secs
  95% in 0.4014 secs
  99% in 2.0057 secs

Status code distribution:
  [200] 200 responses
~$ hey -c 6 -n 300 http://127.0.0.1:40001/sleep0.2

Summary:
  Total:        20.7210 secs
  Slowest:      3.8233 secs
  Fastest:      0.2003 secs
  Average:      0.3849 secs
  Requests/sec: 14.4780

  Total data:   2700 bytes
  Size/request: 9 bytes

Response time histogram:
  0.200 [1]     |
  0.563 [273]   |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.925 [0]     |
  1.287 [0]     |
  1.649 [0]     |
  2.012 [0]     |
  2.374 [20]    |■■■
  2.736 [5]     |■
  3.099 [0]     |
  3.461 [0]     |
  3.823 [1]     |


Latency distribution:
  10% in 0.2004 secs
  25% in 0.2005 secs
  50% in 0.2007 secs
  75% in 0.2009 secs
  90% in 0.2015 secs
  95% in 2.2177 secs
  99% in 2.4247 secs

Status code distribution:
  [200] 300 responses


schneems commented Oct 3, 2024

Re why you're seeing different results than I am, I'm not sure.

There are two words missing from my example:

$ cat Gemfile | grep puma
gem "puma", github: "MSP-Greg/puma", branch: "00-long-tail"
$ puma -t3 -p 3000

(Hint, it starts with a "b" and ends with "exec")

I booted my puma server without bundle exec, so it wasn't using your code 🤦. Fixed it, and I can confirm the long-tail behavior no longer exists. The longest request is 400 ms.

Details
$ cat Gemfile
source "https://rubygems.org"

gem "puma", github: "MSP-Greg/puma", branch: "00-long-tail"
gem "rack"
$ be puma -t3 -p3000
Puma starting in single mode...
* Puma version: 6.4.3 (ruby 3.3.1-p55) ("The Eagle of Durango")
*  Min threads: 3
*  Max threads: 3
*  Environment: development
*          PID: 45875
* Listening on http://0.0.0.0:3000
Use Ctrl-C to stop
$ hey -c 4 -n 200 http://localhost:3000/

Summary:
  Total:	13.6890 secs
  Slowest:	0.4162 secs
  Fastest:	0.2011 secs
  Average:	0.2698 secs
  Requests/sec:	14.6102

  Total data:	2600 bytes
  Size/request:	13 bytes

Response time histogram:
  0.201 [1]	|
  0.223 [135]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.244 [0]	|
  0.266 [0]	|
  0.287 [0]	|
  0.309 [0]	|
  0.330 [0]	|
  0.352 [0]	|
  0.373 [0]	|
  0.395 [1]	|
  0.416 [63]	|■■■■■■■■■■■■■■■■■■■


Latency distribution:
  10% in 0.2036 secs
  25% in 0.2046 secs
  50% in 0.2065 secs
  75% in 0.4026 secs
  90% in 0.4079 secs
  95% in 0.4095 secs
  99% in 0.4123 secs

Details (average, fastest, slowest):
  DNS+dialup:	0.0001 secs, 0.2011 secs, 0.4162 secs
  DNS-lookup:	0.0001 secs, 0.0000 secs, 0.0024 secs
  req write:	0.0000 secs, 0.0000 secs, 0.0003 secs
  resp wait:	0.2694 secs, 0.2009 secs, 0.4123 secs
  resp read:	0.0001 secs, 0.0000 secs, 0.0006 secs

Status code distribution:
  [200]	200 responses

I also validated that the long tail behavior in worker mode is now gone as well. I'm going to start digging into your commits to try to understand what exactly changed. Thanks for working on this.


MSP-Greg commented Oct 4, 2024

I hate it when that happens. Or, been there, done that.

The current thread code calls the app, generates the response, then checks if the socket has another waiting request (with a wait period). 'Real world' keep-alive connections may have some delay between Puma transmitting the response and Puma's receipt of another request from the connection.

The PR code now sends the connection back to the reactor, which means a keep-alive connection cannot monopolize the thread, which is the cause of the long-tail responses.

An exception to this happens with pipelined/back-to-back requests, which are processed after the response is sent.
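The "wait period" in the current thread code is essentially a blocking readiness check, which is why a quiet keep-alive client can pin a thread. A minimal demonstration of that primitive's behavior (a pipe stands in for a keep-alive socket with no pending request; this is not Puma's code):

```ruby
require "io/wait"

# With no bytes available, IO#wait_readable parks the calling thread for the
# full timeout and then returns nil — time the thread cannot spend on other
# clients.
r, _w = IO.pipe
t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
ready = r.wait_readable(0.2)   # no data: blocks ~200 ms, then returns nil
elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - t0
```

Handing the connection back to the reactor replaces this per-thread blocking wait with the reactor's single multiplexed wait.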


dentarg commented Oct 4, 2024

I can also confirm the long tail is gone!

How I started Puma: echo 'app { sleep(0.2); [200, {}, ["OK"]] }' | ruby -Ilib bin/puma -w 0 -t 5:5 --config /dev/stdin

hey -n 200 -c 6 http://127.0.0.1:9292 against this PR

$ hey -n 200 -c 6 http://127.0.0.1:9292

Summary:
  Total:  8.1721 secs
  Slowest:  0.4131 secs
  Fastest:  0.2009 secs
  Average:  0.2432 secs
  Requests/sec: 24.2289

  Total data: 396 bytes
  Size/request: 2 bytes

Response time histogram:
  0.201 [1] |
  0.222 [156] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.243 [3] |■
  0.265 [0] |
  0.286 [0] |
  0.307 [0] |
  0.328 [0] |
  0.349 [0] |
  0.371 [0] |
  0.392 [32]  |■■■■■■■■
  0.413 [6] |■■


Latency distribution:
  10% in 0.2029 secs
  25% in 0.2050 secs
  50% in 0.2084 secs
  75% in 0.2180 secs
  90% in 0.3890 secs
  95% in 0.3911 secs
  99% in 0.4131 secs

Details (average, fastest, slowest):
  DNS+dialup: 0.0001 secs, 0.2009 secs, 0.4131 secs
  DNS-lookup: 0.0000 secs, 0.0000 secs, 0.0000 secs
  req write:  0.0000 secs, 0.0000 secs, 0.0002 secs
  resp wait:  0.2430 secs, 0.2008 secs, 0.4114 secs
  resp read:  0.0001 secs, 0.0000 secs, 0.0003 secs

Status code distribution:
  [200] 198 responses

hey -n 200 -c 6 http://127.0.0.1:9292 against master (5a8e485)

$ hey -n 200 -c 6 http://127.0.0.1:9292

Summary:
  Total:  11.0885 secs
  Slowest:  2.2613 secs
  Fastest:  0.2004 secs
  Average:  0.2403 secs
  Requests/sec: 17.8563

  Total data: 396 bytes
  Size/request: 2 bytes

Response time histogram:
  0.200 [1] |
  0.407 [191] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.613 [2] |
  0.819 [1] |
  1.025 [0] |
  1.231 [0] |
  1.437 [0] |
  1.643 [0] |
  1.849 [1] |
  2.055 [1] |
  2.261 [1] |


Latency distribution:
  10% in 0.2016 secs
  25% in 0.2033 secs
  50% in 0.2046 secs
  75% in 0.2056 secs
  90% in 0.2149 secs
  95% in 0.3852 secs
  99% in 2.2613 secs

Details (average, fastest, slowest):
  DNS+dialup: 0.0001 secs, 0.2004 secs, 2.2613 secs
  DNS-lookup: 0.0000 secs, 0.0000 secs, 0.0000 secs
  req write:  0.0000 secs, 0.0000 secs, 0.0002 secs
  resp wait:  0.2400 secs, 0.2003 secs, 2.2583 secs
  resp read:  0.0001 secs, 0.0000 secs, 0.0006 secs

Status code distribution:
  [200] 198 responses

I also tested wrk (with this branch)

wrk --connections 6 --duration 10 --threads 6 --latency http://localhost:9292 against master (5a8e485)

$ ~/src/wrk/wrk --connections 6 --duration 10 --threads 6 --latency http://localhost:9292
Running 10s test @ http://localhost:9292
  6 threads and 6 connections
closing connection after 10 requests
closing connection after 10 requests
closing connection after 10 requests
closing connection after 11 requests
closing connection after 12 requests
closing connection after 10 requests
closing connection after 20 requests
closing connection after 22 requests
closing connection after 23 requests
closing connection after 23 requests
closing connection after 22 requests
closing connection after 20 requests
closing connection after 30 requests
closing connection after 32 requests
closing connection after 33 requests
closing connection after 33 requests
closing connection after 32 requests
closing connection after 30 requests
closing connection after 40 requests
closing connection after 42 requests
closing connection after 41 requests
closing connection after 45 requests
closing connection after 39 requests
closing connection after 36 requests
closing connection after 42 requests
connection 0: 45 requests completed
connection 0: 41 requests completed
connection 0: 39 requests completed
connection 0: 36 requests completed
connection 0: 42 requests completed
closing connection after 43 requests
connection 0: 43 requests completed
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   276.92ms  240.50ms   1.84s    93.51%
    Req/Sec     3.84      0.81     5.00     81.30%
  Latency Distribution
     50%  204.45ms
     75%  205.37ms
     90%  414.74ms
     99%    1.44s
  246 requests in 10.28s, 9.98KB read
  Socket errors: connect 0, read 0, write 0, timeout 1
Requests/sec:     23.92
Transfer/sec:      0.97KB

wrk --connections 6 --duration 10 --threads 6 --latency http://localhost:9292 against this PR

$ ~/src/wrk/wrk --connections 6 --duration 10 --threads 6 --latency http://localhost:9292
Running 10s test @ http://localhost:9292
  6 threads and 6 connections
closing connection after 10 requests
closing connection after 10 requests
closing connection after 10 requests
closing connection after 10 requests
closing connection after 10 requests
closing connection after 10 requests
closing connection after 20 requests
closing connection after 20 requests
closing connection after 20 requests
closing connection after 20 requests
closing connection after 20 requests
closing connection after 20 requests
closing connection after 30 requests
closing connection after 30 requests
closing connection after 30 requests
closing connection after 30 requests
closing connection after 30 requests
closing connection after 30 requests
closing connection after 40 requests
closing connection after 40 requests
closing connection after 40 requests
closing connection after 40 requests
closing connection after 40 requests
closing connection after 43 requests
closing connection after 40 requests
connection 0: 43 requests completed
connection 0: 40 requests completed
closing connection after 42 requests
closing connection after 41 requests
connection 0: 41 requests completed
closing connection after 42 requests
connection 0: 42 requests completed
closing connection after 41 requests
connection 0: 42 requests completed
connection 0: 41 requests completed
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   244.17ms   69.96ms 407.31ms   80.32%
    Req/Sec     3.65      0.86     5.00     75.10%
  Latency Distribution
     50%  210.84ms
     75%  216.94ms
     90%  384.22ms
     99%  392.50ms
  249 requests in 10.29s, 10.17KB read
Requests/sec:     24.19
Transfer/sec:      0.99KB


I also ran tests like we did in #3443 without any sleep in the rack app and with workers for Puma.

How I started Puma: echo 'app { [200, {}, ["OK"]] }' | ruby -Ilib bin/puma -w 2 -t 1:1 --config /dev/stdin

hey -n 5000 -c 100 http://127.0.0.1:9292 against master

$ hey -n 5000 -c 100 http://127.0.0.1:9292

Summary:
  Total:  3.7466 secs
  Slowest:  2.2994 secs
  Fastest:  0.0000 secs
  Average:  0.0305 secs
  Requests/sec: 1334.5371

  Total data: 10000 bytes
  Size/request: 2 bytes

Response time histogram:
  0.000 [1] |
  0.230 [4843]  |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.460 [26]  |
  0.690 [61]  |■
  0.920 [29]  |
  1.150 [6] |
  1.380 [3] |
  1.610 [6] |
  1.839 [2] |
  2.069 [14]  |
  2.299 [9] |


Latency distribution:
  10% in 0.0000 secs
  25% in 0.0000 secs
  50% in 0.0001 secs
  75% in 0.0001 secs
  90% in 0.0032 secs
  95% in 0.0340 secs
  99% in 0.8865 secs

Details (average, fastest, slowest):
  DNS+dialup: 0.0000 secs, 0.0000 secs, 2.2994 secs
  DNS-lookup: 0.0000 secs, 0.0000 secs, 0.0000 secs
  req write:  0.0000 secs, 0.0000 secs, 0.0012 secs
  resp wait:  0.0304 secs, 0.0000 secs, 2.2992 secs
  resp read:  0.0000 secs, 0.0000 secs, 0.0011 secs

Status code distribution:
  [200] 5000 responses

hey -n 5000 -c 100 http://127.0.0.1:9292 against this PR

$ hey -n 5000 -c 100 http://127.0.0.1:9292

Summary:
  Total:  0.1377 secs
  Slowest:  0.0308 secs
  Fastest:  0.0000 secs
  Average:  0.0023 secs
  Requests/sec: 36320.8768

  Total data: 10000 bytes
  Size/request: 2 bytes

Response time histogram:
  0.000 [1] |
  0.003 [4480]  |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.006 [27]  |
  0.009 [10]  |
  0.012 [30]  |
  0.015 [23]  |
  0.018 [39]  |
  0.022 [70]  |■
  0.025 [188] |■■
  0.028 [129] |■
  0.031 [3] |


Latency distribution:
  10% in 0.0001 secs
  25% in 0.0001 secs
  50% in 0.0002 secs
  75% in 0.0003 secs
  90% in 0.0050 secs
  95% in 0.0236 secs
  99% in 0.0253 secs

Details (average, fastest, slowest):
  DNS+dialup: 0.0000 secs, 0.0000 secs, 0.0308 secs
  DNS-lookup: 0.0000 secs, 0.0000 secs, 0.0000 secs
  req write:  0.0000 secs, 0.0000 secs, 0.0007 secs
  resp wait:  0.0022 secs, 0.0000 secs, 0.0280 secs
  resp read:  0.0000 secs, 0.0000 secs, 0.0015 secs

Status code distribution:
  [200] 5000 responses

hey -n 5000 -c 100 -disable-keepalive http://127.0.0.1:9292 against this PR

$ hey -n 5000 -c 100 -disable-keepalive http://127.0.0.1:9292

Summary:
  Total:  0.1950 secs
  Slowest:  0.0086 secs
  Fastest:  0.0023 secs
  Average:  0.0038 secs
  Requests/sec: 25639.9464

  Total data: 10000 bytes
  Size/request: 2 bytes

Response time histogram:
  0.002 [1] |
  0.003 [16]  |
  0.004 [2541]  |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.004 [1816]  |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.005 [167] |■■■
  0.005 [68]  |■
  0.006 [37]  |■
  0.007 [58]  |■
  0.007 [106] |■■
  0.008 [130] |■■
  0.009 [60]  |■


Latency distribution:
  10% in 0.0031 secs
  25% in 0.0033 secs
  50% in 0.0035 secs
  75% in 0.0037 secs
  90% in 0.0046 secs
  95% in 0.0070 secs
  99% in 0.0080 secs

Details (average, fastest, slowest):
  DNS+dialup: 0.0004 secs, 0.0023 secs, 0.0086 secs
  DNS-lookup: 0.0000 secs, 0.0000 secs, 0.0000 secs
  req write:  0.0000 secs, 0.0000 secs, 0.0013 secs
  resp wait:  0.0034 secs, 0.0014 secs, 0.0052 secs
  resp read:  0.0000 secs, 0.0000 secs, 0.0021 secs

Status code distribution:
  [200] 5000 responses

return false
else
begin
if fast_check && @to_io.wait_readable(FAST_TRACK_KA_TIMEOUT)

With this code removed, should we also remove the FAST_TRACK_KA_TIMEOUT assignment?


Short: I think we should remove it now.
Long:
Previously I operated under a "don't change anything you don't have to" model of OSS stability. That works for a while, but eventually even unused code can cause its own stability problems in weird ways.

I can no longer find a reference to FAST_TRACK_KA_TIMEOUT in the codebase. Let's delete it as part of this PR while we're looking at it.
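A quick way to make that check reproducible before deleting the assignment (paths are illustrative; run from the puma repo root):

```shell
# Confirm nothing still references the constant before removing it.
grep -rn "FAST_TRACK_KA_TIMEOUT" lib/ test/ && echo "still referenced" || echo "safe to delete"
```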

@schneems schneems left a comment


I did a fairly comprehensive review. I took a ton of notes; if you want to feel like you're trapped in my head while doing a PR review, you can get the raw feed here: https://gist.github.com/schneems/9a47e6695e90e636d3d79dbdbc9d33bd.

Overall: I want to move forward with this PR. I think it's correct, and I now understand how it works. My one concern is the removal of the call to wait_until_not_full, which I've described elsewhere as "thundering herd protection." Even with that concern, I think it's the right thing to do; I said more in a comment about it (and you can read my gist if you want).

Requested: I've made some cleanup suggestions and requests. I also requested some docs. This is some pretty deep-in-the-weeds logic we're dealing with. I want to set up future us (or other maintainers) for success while we're here and looking at it.

Amazing work here thanks a ton 💜


raise HttpParserError,
"HEADER is longer than allowed, aborting client early."
end

Contributor

The method signature here is def reset(fast_check=true) but fast_check is no longer being used. We should remove it from the method to avoid confusion.

# As an optimization, try to read the next request from the
# socket for a short time before returning to the reactor.
fast_check = @status == :run
Thread.pass
Contributor

It's not clear why this was added. My best guess is that you're hoping the reactor is scheduled to run so it can pull in a new TCP connection if there's one waiting, but it would be good to confirm.

Effectively: If a future maintainer sees this, the comment should describe the conditions/assumptions that led to adding it and its desired effects. I'm thinking of a situation where someone would want to know if it's safe to remove or not. An alternative situation would be if someone moves it to optimize something else (like performance), how would they measure/know if it's better to have it above/below the next line or if it doesn't matter at all as long as it's before <something>.

Member Author

hoping the reactor is scheduled to run

Yes, that was the reason. Also, it might be a tough thing to verify either way. Maybe leaving it as-is with a better comment is the best solution?

Contributor

Maybe leave as is with a better comment is the best solution?

I think a comment is fine. I'm just thinking of long term maintenance. So if someone wants to modify the code there's a hint as to what's safe or desirable. Here's my stab at some wording:

In this code path the keepalive request may be put back into the reactor. Ideally we want to serve requests in the order they came in, even when those requests are re-using a connection. This Thread.pass is an attempt to hint to the OS that the reactor can read in a new connection from the TCP socket which might have been sent after the currently processing request, but before the next request in its connection.

Member Author

LGTM. I wanted to test if the statement made any difference, but with MRI, JRuby & TruffleRuby, it might involve a lot of testing.

This weekend I looked at stats when using workers to see how well requests were distributed amongst the workers. As an example, with four workers, perfect distribution would be 25% for each worker. I think the worst I saw was a +/-15% diff, so lowest was 21.25%, highest was 28.75%. That's from memory, many runs were tighter than that.

Member

I'm confused why this is here even reading the above comments. What is the condition where you expect the reactor to be running simultaneously?

Comment on lines +75 to +76
@clustered = (options[:workers] || 0) > 1

Contributor

Suggested change
@clustered = (options[:workers] || 0) > 1

This is only used in commented out code.

Comment on lines +365 to +367
# uncommenting this may cause 'long tail' response times when all
# workers are busy
# pool.wait_until_not_full if @clustered
Contributor

Remove comment

Suggested change
# uncommenting this may cause 'long tail' response times when all
# workers are busy
# pool.wait_until_not_full if @clustered

The wait_until_not_full code should also be deleted, since this is the only place that called it.

This logic protected against "thundering herd" when one worker boots before the others to make sure it doesn't steal ALL the requests. I'm hoping that the sleep logic that was added in #2079 (and later iterated) might serve a similar purpose. However I don't think we have any off-the-shelf benchmarks/reproductions to validate that assumption.

Either way, it's clear that that logic is confusing and not a good way to achieve our desired outcome. I support removing it and iterating on another solution if needed.
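For future readers, here is a minimal model of the behavior `wait_until_not_full` provided: a worker blocks before ACCEPTing a new connection until its thread pool has spare capacity, so a worker that boots first can't drain the whole socket backlog. `MiniPool` and its method names are hypothetical stand-ins, not Puma's actual `ThreadPool`:

```ruby
# Sketch of a "thundering herd" guard: wait_until_not_full blocks while
# every thread in the pool is busy, and wakes when one frees up.
class MiniPool
  def initialize(max_threads)
    @max      = max_threads
    @busy     = 0
    @mutex    = Mutex.new
    @not_full = ConditionVariable.new
  end

  # Block the caller until at least one thread is idle.
  def wait_until_not_full
    @mutex.synchronize do
      @not_full.wait(@mutex) while @busy >= @max
    end
  end

  def checkout  # a thread picks up a request
    @mutex.synchronize { @busy += 1 }
  end

  def checkin   # a thread finishes a request
    @mutex.synchronize do
      @busy -= 1
      @not_full.signal
    end
  end
end

pool = MiniPool.new(1)
pool.checkout                     # the only thread is now busy
t = Thread.new do
  pool.wait_until_not_full        # blocks: pool is full
  puts "accepted"
end
sleep 0.1
pool.checkin                      # request finishes; the acceptor wakes up
t.join
```

The trade-off being discussed in this thread is that blocking the accept loop this way evens out load across workers at boot, but under sustained heavy load it delays reading the socket at all, contributing to the long-tail effect.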

end

# Send keep-alive connections back to the reactor
unless next_request_ready
Contributor

Optional style comment: This would read better to me as if !pipeline_request_ready. I checked and rubocop wouldn't mind that syntax.

Or alternatively get rid of the negation:

if pipeline_request_ready
  next
else
  # ...

I like that better actually as it makes it more obvious we're in a loop that will execute again if that condition is true.

Comment on lines +32 to +34

@selector_items = 0
@selector_items_max = 0
Contributor

Remove debugging code

Suggested change
@selector_items = 0
@selector_items_max = 0

Comment on lines +114 to +118
@selector_items += 1
if @selector_items > @selector_items_max
@selector_items_max = @selector_items
# STDOUT.syswrite "\n#{Process.pid} @selector_items_max #{@selector_items_max}\n"
end
Contributor

Remove debugging code

Suggested change
@selector_items += 1
if @selector_items > @selector_items_max
@selector_items_max = @selector_items
# STDOUT.syswrite "\n#{Process.pid} @selector_items_max #{@selector_items_max}\n"
end

def wakeup!(client)
if @block.call client
@selector.deregister client.to_io
@selector_items -= 1
Contributor

Suggested change
@selector_items -= 1

# uncommenting this may cause 'long tail' response times when all
# workers are busy
# pool.wait_until_not_full if @clustered
sleep 0.001 while pool.out_of_band_running
Contributor

Please add a doc explaining what this is doing.

client.reset(fast_check)
# This indicates that the socket has pipelined (multiple)
# requests on it, so process them
next_request_ready = if client.has_buffer
Contributor

It's not clear to me if the term "pipeline" here is a Puma-internal concept or if it's referring to HTTP pipelining. If the latter, technically Puma has already prepared and sent a response to the last request. Thus, the next request being ready doesn't mean the requests have been pipelined. It just means the client already sent another request on that connection.

Apologies if this is just me misunderstanding the term.
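For reference, strict HTTP/1.1 pipelining means the client writes multiple requests before reading any response; a fast keep-alive client that sends its next request immediately after reading a response produces the same server-side symptom: more than one request already sitting in the read buffer. A toy illustration (the split regex is a simplification that only works for bodiless requests):

```ruby
# Two HTTP/1.1 requests written back-to-back, before any response is read.
# This is what "pipelined" data looks like in a server's read buffer.
buffer = "GET /a HTTP/1.1\r\nHost: example.test\r\n\r\n" \
         "GET /b HTTP/1.1\r\nHost: example.test\r\n\r\n"

# Each bodiless request ends at a blank line; split after each one.
requests = buffer.split(/(?<=\r\n\r\n)/)

puts requests.length                              # => 2
puts requests.map { |r| r[/\A\w+ \S+/] }.inspect  # => ["GET /a", "GET /b"]
```

Either way, when `client.has_buffer` is true the thread already holds a complete next request, which is why the PR keeps that case in the inline loop instead of returning to the reactor.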

else
pool.wait_until_not_full
# uncommenting this may cause 'long tail' response times when all
# workers are busy
Contributor

I think you may mean to say "when all worker threads are busy".

Member Author

Sorry. I'm surprised I wrote that, as often the threads in the ThreadPool are called workers, which drives me nuts.

Maybe just "when all threads are busy"?

Contributor

I don't think we should leave in commented out code with a description for what it would do if it's uncommented.

I think the wording "when all threads are busy" is valid, but my preference would be either:

  • Delete the code altogether (older versions are still there in git)
  • If you feel strongly it should stay in, then make it user configurable. For example I suggested enable_keepalives :legacy as a possible codepath.

I'm fine removing it for now and only adding a config path if there are problems and we need to buy time for a longer term investigation and fix.

requests < @max_fast_inline ||
@thread_pool.busy_threads < @max_threads ||
!client.listener.to_io.wait_readable(0)
client.requests_served < @max_fast_inline ||
Contributor

The PR description states:

Note that all keep-alive connections are closed when the number of requests hits the max_fast_inline setting.

But this logic indicates otherwise. You'd also need the busy_threads < max_threads check to force the connection to close.

Contributor

I see what you're saying. Previously we had this logic in two places. One of them (here) said "take the keepalive path if any of the following are true" and the other one said "close the connection if requests served is greater than @max_fast_inline."

Now we have only one location that will continue to serve keepalive requests beyond @max_fast_inline as long as not all worker threads are spawned. I think that's okay actually. The @max_fast_inline config was a stop-gap to ensure that one keepalive connection didn't hog too many system resources. If there's still threads we could spawn, there's still capacity in the system. I think instead of adding back a second round of checks, we could add this caveat in the documentation.

In general my recommendation (the one I put on the Heroku docs) suggests setting minimum == maximum, and for that case the current docs are already true.

The current docs are wrong anyway after this change:

    # The number of requests to attempt inline before sending a client back to
    # the reactor to be subject to normal ordering.
    #
    # The default is 10.
    #
    # @example
    #   max_fast_inline 20
    #

So they need to be updated either way. I suggest this:

    # A "fast inline" request is a request that is re-using an existing
    # connection via HTTP 1.1 "keepalive". Setting this value will close a connection
    # after the specified number of requests are made when there are no additional
    # Puma threads to spawn. If you want to ensure that a connection is closed after
    # the given number of requests, set minimum thread count equal to maximum thread
    # count.
    #
    # Previous design of Puma would serve these request from the same thread that handled the
    # first request. In that previous model, setting this value would prevent
    # one request from monopolizing too many server resources.
    #
    # Current puma design will serve buffered sequential requests from the same
    # thread but any new requests that come in on the same connection will be
    # sent back to the reactor to read the request body and later be picked up
    # by a thread. With this model resources are distributed more evenly by
    # default.
    #
    # The default is 10.
    #
    # @example
    #   max_fast_inline 1000
    #

It's a bit verbose, but I prefer my docs to say more rather than less. FWIW nginx sets this value to 1000 https://nginx.org/en/docs/http/ngx_http_upstream_module.html#keepalive_requests.

Eventually: In the long term (not now) we could consider renaming this config and increasing the limit. I'm unsure if the warning nginx gives about memory is intrinsic to the domain or is an implementation detail of how they're managing memory. But if we're able to show that we're not leaking/retaining memory between subsequent repeated requests, then I think raising the limit would be good.
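To make the discussion above concrete, the inline-vs-reactor decision can be modeled as a pure predicate. This is a sketch only; the method name and keyword arguments are hypothetical stand-ins for the server code quoted in this thread:

```ruby
# Keep serving the keep-alive connection inline if any of these hold:
#   * it has served fewer than max_fast_inline requests, or
#   * the pool could still spawn another thread, or
#   * no brand-new connection is waiting on the listener socket.
def fast_inline?(requests_served:, max_fast_inline:,
                 busy_threads:, max_threads:, listener_readable:)
  requests_served < max_fast_inline ||
    busy_threads < max_threads ||
    !listener_readable
end

# Past the limit, but spare thread capacity exists: still served inline.
puts fast_inline?(requests_served: 10, max_fast_inline: 10,
                  busy_threads: 3, max_threads: 5, listener_readable: true)
# => true

# Past the limit, pool saturated, new client waiting: hand the thread over.
puts fast_inline?(requests_served: 10, max_fast_inline: 10,
                  busy_threads: 5, max_threads: 5, listener_readable: true)
# => false
```

The first example is exactly the caveat raised above: with spare thread capacity, a connection keeps being served inline even after `max_fast_inline` requests, so the docs need to state that the limit is only hard when minimum threads == maximum threads.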

@MSP-Greg
Member Author

MSP-Greg commented Oct 4, 2024

@schneems

Thank you for the thorough review. Really.

It's a pretty serious change to Puma's operation, and I thought I could work on it for another week (or more), or instead open it as a draft and get more eyes testing & reviewing it.

I'm still busy with family things (funny how the Excel/Word/coder family member gets assigned things by default).

Sorry about the debug code, I wasn't sure if some of the things added might be needed for request 'distribution/balancing', so I left it in. In the past the reactor has been used mostly for slow requests.

Also, thanks for the gist, that will help.

I think I can get a block of time to work on this tomorrow or Sunday.

@schneems
Contributor

schneems commented Oct 5, 2024

My suggestions to remove the debug code took basically no time, no need for apology. Again, thanks for all the work. I think I might have gotten somewhere approaching this solution eventually, but you got here much faster and caught some things that wouldn't have been obvious to me. I also knew it was still in draft when I reviewed. I'm motivated to get a fix for this behavior.

My ideal world would be that we could roll the whole thing out. Depending on your confidence level in the change, one option would be to ship code that has both codepaths and allow the end user to configure it if they need. We just introduced enable_keepalives config on HEAD, we could make it accept true/false/:legacy where :legacy is the current behavior on master. I don't think it's strictly necessary to support a fallback, but having it might increase our tolerance to change (if people experiencing problems can easily mitigate it themselves).

Another path forward could be cutting a beta/pre-release, but IMHO not a lot of people try them unless specifically asked.

I'm still busy with family things (funny how the Excel/Word/coder family member gets assigned things by default).

100% understood. For some reason when people in my family think of "problems with a printer" they also think of me. My undergrad is in mechanical engineering and I've met many people in my life who when they found that out exclaimed "oh, my cousin is a mechanic too!" (narrator: They are not the same thing). Take care with the family 💜

I think I can get a block of time to work on this tomorrow or Sunday.

Sounds good. If you need to drop this completely or for a period of time let me know how I/we can best support you.


client.set_timeout(@first_data_timeout)
client.set_timeout @first_data_timeout
if @reactor.add client
Member

Please please don't mix style changes into a PR like this. It's an important change and it's confusing if this is just style (I believe it is) or functional.

Member Author

Removed style change and rebased

# As an optimization, try to read the next request from the
# socket for a short time before returning to the reactor.
fast_check = @status == :run
Thread.pass
Member

I'm confused why this is here even reading the above comments. What is the condition where you expect the reactor to be running simultaneously?

@MSP-Greg
Member Author

MSP-Greg commented Oct 9, 2024

@evanphx

I'm confused why this is here even reading the above comments. What is the condition where you expect the reactor to be running simultaniously?

I've been meaning to check whether removing it has any effect. I think I added it very early, and it may not affect anything...

@schneems
Contributor

Public comment describing some private conversations: Handling the case where we have queue_requests false and are receiving a keepalive connection (as mentioned in #3506 (comment)) is actually quite difficult. Evan dug up the PR that added the queue_requests option, b784cb7, and it spells out that if you use queue_requests false you're opting out of keepalive and slow-client protections:

      # Note that setting this to false disables HTTP keepalive and
      # slow clients will occupy a handler thread while the request
      # is being sent. A reverse proxy, such as nginx, can handle
      # slow clients and queue requests before they reach puma.

I solicited feedback on why people use that setting today https://ruby.social/@Schneems/113289823703358711. It's not scientific, but it is quiet. I think the only reason someone might want to use that setting is to bypass the overhead of the reactor if they KNOW that a request body is already fully buffered and will be non-keepalive-d. (if a proxy in front of them handles that case and downgrades HTTP/1.1 to HTTP/1.0). Even then it's a relatively small perf gain for a fairly large cost in code complexity (handling running a server where we have no reactor to buffer new requests).

My suggestion is: That we deprecate queue_requests in the current version of Puma and remove it in the next major release. If someone has a better reason to need/want it, then I'm open to understanding why and possibly adding more fine-grained configuration for them to get what they want. That eliminates this problem and allows us to move forward.

I can make a PR that adds a deprecation warning to queue_requests.

The only unknown (in my mind) is if wait_until_not_full behavior needs to go or not. It's modified here, in an alternative branch master...emp/no_fast on my benchmark and it seems to eliminate the long-tail.

schneems added a commit that referenced this pull request Oct 15, 2024
- Short: Removing this feature makes the internals easier to reason about, and I do not believe there's a good reason to disable the reactor that we want to support. Research also suggests that disabling the reactor was an implementation detail of an initial effort to configure the ACCEPT behavior, and over time the setting morphed into effectively meaning "disable the reactor".

- Long: 

## Background

The `queue_requests` config was originally introduced in #640 in 2015. It is marked as closing this issue #612 (opened in 2014, ten years ago). 

The use case in the issue for this option to disable the reactor was this: 

A worker with 1 thread needs to send a request to itself to finish. The example given is downloading a PDF from the same server. If the reactor is accepting requests as fast as it can, the same worker that sent the request might accept the request. However, since it cannot complete the first request (it is waiting on the second), the second request will never execute; i.e. it will deadlock.

The problem described does not come from having a reactor that buffers request bodies, but rather from the application ACCEPT-ing a new request from the socket before it had capacity to process it. Disabling the reactor was effectively an implementation detail.

The issue was opened nearly 10 years ago and the justification given for wanting to run with one max thread is that "not all code is thread safe". In the decade since, the Ruby community has come a long way. Sidekiq's threaded job queue has become the industry standard. It's become an expectation that all Ruby code should be threadsafe for quite some time now. An application can avoid this deadlock situation by either not requiring this functionality (making requests to itself) or running with more threads. It's also worth noting that the possibility of deadlock is not zero when moving to 2 threads, as both those threads might be executing a "give me a PDF" request, however the chances are decreased. I do not believe this to be a primary use case we need to support, and further, we've since mitigated the issue in other ways.
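The deadlock described above can be reproduced in miniature with a one-thread job queue. This sketch is illustrative only and uses no Puma code; the names are hypothetical:

```ruby
require "timeout"

jobs     = Queue.new
inner    = Queue.new
finished = Queue.new

# One worker thread, like running Puma with `threads 1,1`.
worker = Thread.new { loop { jobs.pop.call } }

# "Request A" makes a second request to the same server and waits on it.
jobs << -> do
  jobs << -> { inner << :done }  # "request B" is enqueued...
  inner.pop                      # ...but A blocks here; the only thread
  finished << :ok                # is busy running A, so B never starts.
end

result = begin
  Timeout.timeout(0.5) { finished.pop }
rescue Timeout::Error
  :deadlocked
end
puts result  # => deadlocked
worker.kill
```

With two or more threads the second job can usually be picked up, which is why the deadlock becomes less likely (though not impossible) as thread count grows, as noted below.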

In the same PR a `wait_until_not_full` method was introduced. But it was only called if queue_requests was disabled. This method still lives on today, but is called unconditionally before the socket is read. https://github.com/puma/puma/blob/a59afabe782ec8869bb709ca991b32c0b2cc95fd/lib/puma/server.rb#L363

What does `wait_until_not_full` do? It is what I call "anti-thundering-herd" measures where a worker will first check if it has capacity to handle a new request (it sees if all threads are busy or not) and only when it has capacity does it ACCEPT a new request from the socket. The situation for a thundering herd looks like this: Puma boots with multiple workers, one worker boots before the others and accepts more requests than it can process. The other workers boot and have nothing to work on and are idle. However it didn't always work as intended. This is one fix 9690d8f. I was present in the 2017 roundtable mentioned in the commit. Since then, this logic has been relatively stable.

Since that was introduced, we also introduced `wait_for_less_busy_worker`, which adds a tiny sleep to the worker to handle a related yet different scenario caused by OS thread scheduling. In this scenario, the OS might favor executing a busy worker rather than one with capacity to take new requests. More details are in the PR where it was added #2079. The solution was to add a really tiny sleep to the busy workers to hint to the OS to let the less busy ones be scheduled. This became a default.

## Queue requests confusion

The `queue_requests` config, originally introduced to gate ACCEPT behavior, has eventually come to mean "disable the reactor" in the current codebase as defaults have changed and morphed over the years.

While the original docs introduced in that PR hint that enabling it deactivates keepalive functionality:

>      # Note that setting this to false disables HTTP keepalive and
>      # slow clients will occupy a handler thread while the request
>      # is being sent. A reverse proxy, such as nginx, can handle
>      # slow clients and queue requests before they reach puma.

It wasn't exactly clear what that meant. i.e. when a keepalive request is made to puma with queue requests disabled, what should be the expected behavior?

The released version of puma checks for several conditions before determining whether or not to send a "close" header to a keepalive request https://github.com/puma/puma/blob/d2b3b309bf76f252819cf6c89d72249895daf629/lib/puma/request.rb#L167-L176. The ability to deactivate this keepalive behavior is recent and is staged on main, it was introduced in #3496. Notably even HEAD does not deactivate this behavior based on the state of `queue_requests`.

The current behavior is that while that logic returns `true` and the request is not placed back into the reactor (not possible when queue_requests is false) it will loop and continue to accept keepalive requests over the same connection because this code https://github.com/puma/puma/blob/9712d29bc156a7fa05a6f4ec6deda5ac20df50be/lib/puma/server.rb#L468-L470 is inside of a while loop.

## Keepalive connection

This looping behavior for keepalive connections is not limited to when the reactor is disabled. There's a bug that currently means that keepalive requests effectively monopolize a thread for up to 10 sequential requests (by default) which is described here #3487.

All signs point to needing to put the keepalive request into the reactor as a part of the mitigation. However, as pointed out in this comment #3506 (comment) when request queuing is disabled we are left with a difficult path forward. We can either choose to continue monopolizing the thread and preserve the behavior that exists today. Or we can close the connection which aligns with the docs saying that this setting will not work with keepalive requests. Rather than make that determination for the user, this change suggests we should remove that codepath as an invalid configuration option.

## Usage

As part of informal research into usage of the feature I asked on Mastodon and no one appears to be using it:

- Link: https://ruby.social/@Schneems/113289823703358711
- Screenshot: https://www.dropbox.com/s/9t1ur0rpqz5zkku/Screenshot%202024-10-14%20at%2012.43.27%E2%80%AFPM.png?dl=0

I asked in slack groups and the only example someone could give was this issue comment where (funny enough) they're using it to try to diagnose the very same keepalive buggy behavior #2311 (comment).

From my personal experience debugging Ruby applications at Heroku, I have seen this feature used in the wild, so I know it's used. It's just that when it is used, people are not aware that they're losing slow-client protection and disabling the reactor. Usually they're after some other side-effect that disabling it caused.

In short, I don't believe it's heavily used, and for the cases where it is used, I don't believe those are valid use cases. In the comments I added a note requesting people inform us of their use cases if they feel they have a valid one so we can investigate it and possibly add a different configuration option.
@schneems schneems mentioned this pull request Oct 15, 2024
@MSP-Greg
Member Author

I'm working on a revision to this, the revision performs better. Should be done within a few days.

@MSP-Greg
Member Author

Closing in favor of #3678
