
Fix SSL Context - minissl.c, minissl.rb, extconf.rb #2519

Merged
merged 1 commit into from Jan 26, 2021

Conversation

Member

@MSP-Greg MSP-Greg commented Dec 31, 2020

Description

While working on an update to the test suite, I noticed that client connections were slow in the SSL tests.

Currently, a new context is created for each accepted client connection, and creating a context also reads two files (the cert and the key).

Refactored to use one context created in MiniSSL::Server.new.
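The shape of the change can be sketched in plain Ruby (class names here are illustrative, not Puma's actual API): before, every accepted client built a fresh context and paid two file reads; after, the context is built once in the server's constructor and shared.

```ruby
# Illustrative sketch only -- SketchContext/SketchServer are NOT Puma's
# API. It models the change: the context (and its two file reads) now
# happens once per server, not once per accepted client.
class SketchContext
  @file_reads = 0
  class << self
    attr_accessor :file_reads
  end

  def initialize
    # Stands in for reading the cert and key files from disk.
    self.class.file_reads += 2
  end
end

class SketchServer
  def initialize
    @sslctx = SketchContext.new # built once, in Server.new
  end

  def accept
    @sslctx # every accepted client reuses the shared context
  end
end

server = SketchServer.new
3.times { server.accept }
puts SketchContext.file_reads # => 2, regardless of how many clients connect
```

With the old per-accept pattern, the file-read count would grow with every client; here it stays constant per server.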

The misc folder includes code that reports the times needed to open the socket and write a simple request. It logs the time of each client's GET request (each with roughly a 100 kB body), then sorts the times and outputs percentile times. See socket_times.md in the misc folder for more info.

Times for this PR (see the 'test ssl connection times' step):

https://github.com/MSP-Greg/puma/actions/runs/453868127

Times for current master:

https://github.com/MSP-Greg/puma/actions/runs/453894649

The CI for the PR has an additional 'test tcp connection times' step, which helps tell the two apart.

Times for this PR are approx two to three times faster than master on both Ubuntu & macOS.

But, there are a few odd things:

  1. Checking it locally with Windows WSL2/Ubuntu, I noticed a larger speed increase, but that may be specific to that implementation. Oddly, the tcp times were almost the same as Actions CI.

  2. The longest macOS times are really high with the PR compared to master. Not sure what could be causing the issue.

I'd appreciate it if anyone running *nix or macOS natively could check the results.

To run locally, compile Puma, then (from the Puma folder) run misc/run_socket_times.sh. Copy the misc folder into master, and the tests will also run there. You may need to run the following:

chmod +x ./misc/run_ssl_socket_times.sh

For more info on running misc/run_socket_times.sh, see socket_times.md

Note that the last three commits may be removed for the PR...

Thanks, and Happy New Year, Greg

Your checklist for this pull request

  • I have reviewed the guidelines for contributing to this repository.
  • I have added an entry to History.md if this PR fixes a bug or adds a feature. If it doesn't need an entry to HISTORY.md, I have added [changelog skip] or [ci skip] to the pull request title.
  • I have added appropriate tests if this PR fixes a bug or adds a feature.
  • My pull request is 100 lines added/removed or less so that it can be easily reviewed.
  • If this PR doesn't need tests (docs change), I added [ci skip] to the title of the PR.
  • If this closes any issues, I have added "Closes #issue" to the PR description or my commit messages.
  • I have updated the documentation accordingly.
  • All new and existing tests passed, including Rubocop.

@MSP-Greg MSP-Greg changed the title Fix SSL Context - minissl.c, minissl.rb, extconf.rb Fix SSL Context - minissl.c, minissl.rb, extconf.rb [changelog skip] Dec 31, 2020
@nateberkopec nateberkopec added maintenance waiting-for-review Waiting on review from anyone labels Jan 4, 2021
@@ -328,15 +331,15 @@ def to_io
def accept
@ctx.check
io = @socket.accept
engine = Engine.server @ctx
engine = Engine.server (IS_JRUBY ? @ctx : @sslctx)
Member

This engine/sslctx check is pretty hot-path, so I think we should instead have an engine method that only has to run this check/ternary once. Memoize or just use metaprogramming, your choice.

Member Author

Fixed, moved the JRuby checks to initialize...
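A rough sketch of that resolution (illustrative names, not the actual Puma code): the platform check runs once in initialize, so accept, the hot path, does only a plain instance-variable read.

```ruby
# Illustrative only -- models moving the IS_JRUBY ternary out of the
# per-connection accept path and into initialize.
IS_JRUBY = (RUBY_ENGINE == 'jruby')

class SketchSslServer
  def initialize(jruby_ctx, mri_ctx)
    # Decided once at server creation, not on every accept.
    @engine_ctx = IS_JRUBY ? jruby_ctx : mri_ctx
  end

  def accept
    @engine_ctx # hot path: plain ivar read, no per-connection ternary
  end
end

server = SketchSslServer.new(:jruby_context, :mri_context)
puts server.accept # :mri_context on MRI, :jruby_context on JRuby
```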

@@ -6,6 +6,6 @@ long_header_hash = {}
long_header_hash["X-My-Header-#{i}"] = SecureRandom.hex(25)
end

response = SecureRandom.hex(100_000) # A 100kb document
response = SecureRandom.hex(50_000) # A 100kb document
Member

What was the reason for the change here? You should at least update the comment.

Member Author

@MSP-Greg MSP-Greg Jan 13, 2021

ruby -rsecurerandom -e "puts SecureRandom.hex(100_000).length"

outputs 200_000. SecureRandom.hex(b_len) creates a byte sequence of b_len bytes and then converts it to a hex string; each byte becomes 2 characters, so hex(50_000) yields the intended 100 kB body.

I'll split it out to a separate PR.
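The length doubling is easy to confirm directly:

```ruby
require 'securerandom'

# SecureRandom.hex(n) generates n random bytes, then hex-encodes them;
# each byte becomes two characters, doubling the string length.
puts SecureRandom.hex(100_000).length # 200_000 -- twice the intended size
puts SecureRandom.hex(50_000).length  # 100_000 -- an actual 100 kB body
```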

@nateberkopec
Member

I understand this is still marked as draft, but unless we have a really compelling reason I think that:

  1. All benchmarks should be in the benchmarks folder, preferably using wrk.
  2. All tests should be in the test folder.

@nateberkopec nateberkopec added waiting-for-changes Waiting on changes from the requestor and removed waiting-for-review Waiting on review from anyone labels Jan 13, 2021
@MSP-Greg
Member Author

Nate,

I ran the code locally with Windows and WSL2/Ubuntu, and there often was a speed increase.

But I can't test macOS, and part of the benefit may also be memory use: the current code creates a new context for every SSL client, and if many connections are persistent via keep-alive, that means many more context objects.

I was hoping a few people might test it in other environments...

@nateberkopec
Member

I can't get it to run on master. I copy the misc folder from this branch, then run it:

[18252] - Worker 0 (PID: 18255) booted, phase: 0
[18252] - Worker 1 (PID: 18256) booted, phase: 0
Traceback (most recent call last):
	3: from ./misc/socket_times.rb:90:in `<main>'
	2: from ./misc/socket_times.rb:76:in `run'
	1: from ./misc/socket_times.rb:76:in `each'
./misc/socket_times.rb:76:in `block in run': undefined method `+' for nil:NilClass (NoMethodError)

@MSP-Greg
Member Author

I'm looking at it, I may rebase and push...

@MSP-Greg
Member Author

Nate,

Sorry, too many branches. Rebased & updated.

To use in master, along with copying the misc folder, one needs to copy test/rackup/realistic_response.ru.

@nateberkopec
Member

this branch:

[81202] Puma starting in cluster mode...
[81202] * Puma version: 5.1.1 (ruby 3.0.0-p-1) ("At Your Service")
[81202] *  Min threads: 5
[81202] *  Max threads: 5
[81202] *  Environment: development
[81202] *   Master PID: 81202
[81202] *      Workers: 2
[81202] *     Restarts: (✔) hot (✔) phased
[81202] * Listening on ssl://127.0.0.1:40010?cert=examples/puma/cert_puma.pem&key=examples/puma/puma_keypair.pem&verify_mode=none
[81202] Use Ctrl-C to stop
[81202] * Starting control server on http://127.0.0.1:40001
[81202] - Worker 1 (PID: 81206) booted, phase: 0
[81202] - Worker 0 (PID: 81205) booted, phase: 0

Total 1000 connections (20 loops of 50 clients) - total request time
          5%     10%     20%     40%     50%     60%     80%     90%     95% 
  mS    3.14    3.73    4.46    5.71    6.41    7.10    9.33   11.25   12.86

[81202] - Gracefully shutting down workers...
Command restart sent success
[81202] * Restarting...
[81202] Puma starting in cluster mode...
[81202] * Puma version: 5.1.1 (ruby 3.0.0-p-1) ("At Your Service")
[81202] *  Min threads: 5
[81202] *  Max threads: 5
[81202] *  Environment: development
[81202] *   Master PID: 81202
[81202] *      Workers: 2
[81202] *     Restarts: (✔) hot (✔) phased
[81202] * Inherited ssl://127.0.0.1:40010?cert=examples/puma/cert_puma.pem&key=examples/puma/puma_keypair.pem&verify_mode=none
[81202] Use Ctrl-C to stop
[81202] * Starting control server on http://127.0.0.1:40001
[81202] - Worker 1 (PID: 81222) booted, phase: 0
[81202] - Worker 0 (PID: 81221) booted, phase: 0

Total 1000 connections (20 loops of 50 clients) - total request time
          5%     10%     20%     40%     50%     60%     80%     90%     95% 
  mS    3.46    4.10    5.10    6.95    7.88    9.22   11.63   13.36   15.45

[81202] - Gracefully shutting down workers...
Command stop sent success
[81202] === puma shutdown: 2021-01-15 10:16:13 -0700 ===
[81202] - Goodbye!

master:

❯ ./misc/run_socket_times.sh
[81628] Puma starting in cluster mode...
[81628] * Puma version: 5.1.1 (ruby 3.0.0-p-1) ("At Your Service")
[81628] *  Min threads: 5
[81628] *  Max threads: 5
[81628] *  Environment: development
[81628] *   Master PID: 81628
[81628] *      Workers: 2
[81628] *     Restarts: (✔) hot (✔) phased
[81628] * Listening on ssl://127.0.0.1:40010?cert=examples/puma/cert_puma.pem&key=examples/puma/puma_keypair.pem&verify_mode=none
[81628] Use Ctrl-C to stop
[81628] * Starting control server on http://127.0.0.1:40001
[81628] - Worker 0 (PID: 81631) booted, phase: 0
[81628] - Worker 1 (PID: 81632) booted, phase: 0

Total 1000 connections (20 loops of 50 clients) - total request time
          5%     10%     20%     40%     50%     60%     80%     90%     95% 
  mS    5.28    7.11    9.71   11.90   13.00   14.18   16.97   19.47   21.84

Command restart sent success
[81628] - Gracefully shutting down workers...
[81628] * Restarting...
[81628] Puma starting in cluster mode...
[81628] * Puma version: 5.1.1 (ruby 3.0.0-p-1) ("At Your Service")
[81628] *  Min threads: 5
[81628] *  Max threads: 5
[81628] *  Environment: development
[81628] *   Master PID: 81628
[81628] *      Workers: 2
[81628] *     Restarts: (✔) hot (✔) phased
[81628] * Inherited ssl://127.0.0.1:40010?cert=examples/puma/cert_puma.pem&key=examples/puma/puma_keypair.pem&verify_mode=none
[81628] Use Ctrl-C to stop
[81628] * Starting control server on http://127.0.0.1:40001
[81628] - Worker 1 (PID: 81646) booted, phase: 0
[81628] - Worker 0 (PID: 81645) booted, phase: 0

Total 1000 connections (20 loops of 50 clients) - total request time
          5%     10%     20%     40%     50%     60%     80%     90%     95% 
  mS    5.09    7.53    9.64   12.71   14.00   15.23   18.62   21.11   22.77

[81628] - Gracefully shutting down workers...
Command stop sent success
[81628] === puma shutdown: 2021-01-15 10:17:44 -0700 ===
[81628] - Goodbye!

Looks like a pretty significant improvement.

misc/sockets.rb Outdated

DARWIN = !!RUBY_PLATFORM[/darwin/]

HOST = TestPuma.const_defined?(:SvrBase) ? SvrBase::HOST : '127.0.0.1'
Member

What's SvrBase?

Member Author

I'm working on a more DSL-based framework for the tests. SvrBase is a parent class for SvrPOpen & SvrInProc, which are the two classes for starting servers. socket.rb and ci_test.ru are also part of the framework.

socket.rb is the client socket class, and ci_test.ru is a rackup that works with socket.rb. It allows one to set the size of the response body and a delay in the app via headers.

I'm trying to make the framework work both for CI and for generic testing/benchmarking one might want to do locally. It's also set up to allow one to throw a lot of clients at Puma and easily change their parameters (bind type, body size, etc.).

I'm trying to get it stable (without retry) and also need to add sending multiple requests with a single client...
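The header-driven rackup idea can be sketched as follows (the header names here are hypothetical, not necessarily those ci_test.ru actually uses): one Rack app serves many test scenarios because the client picks the response body size and an in-app delay per request.

```ruby
# Hypothetical sketch of a header-driven test rackup: the client
# controls the response body size and an artificial in-app delay
# via request headers.
app = lambda do |env|
  delay  = env['HTTP_DLY'].to_f           # seconds to sleep in the app
  kbytes = (env['HTTP_LEN'] || '1').to_i  # response body size in kB
  sleep delay if delay > 0
  body = 'a' * (kbytes * 1024)
  [200, { 'content-type' => 'text/plain' }, [body]]
end

status, _headers, body = app.call('HTTP_LEN' => '100', 'HTTP_DLY' => '0')
puts status              # 200
puts body.first.bytesize # 102400 bytes for a requested 100 kB body
```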

@MSP-Greg
Member Author

Nate,

Thank you for running it. I'm not sure which version you ran; the most recent doesn't need anything other than the files in the misc folder. Data for three runs I did is below, using WSL2/Ubuntu. I varied the number of client threads, as I'm not sure whether Windows Ubuntu handles files the same as native Ubuntu. Regardless, all three showed faster times with the PR...

─── Total 1000 connections (20 loops of 50 clients) - total request time ───
SSL Ctx   5%     10%     20%     40%     50%     60%     80%     90%     95%
  mS    2.44    2.89    3.67    5.24    5.99    6.65    8.24    9.39   10.10
  mS    2.27    2.75    3.80    5.29    6.02    6.78    8.58    9.69   10.45

Master    5%     10%     20%     40%     50%     60%     80%     90%     95%
  mS   22.79   27.29   31.78   36.10   37.69   39.39   43.25   46.47   48.97
  mS   24.59   27.96   31.49   35.46   37.68   39.39   43.60   46.82   49.49


─── Total 250 connections (5 loops of 50 clients) - total request time ─────
SSL Ctx   5%     10%     20%     40%     50%     60%     80%     90%     95%
  mS    2.00    2.03    2.12    2.25    2.31    2.40    2.87    3.34    3.71
  mS    1.98    2.01    2.08    2.18    2.25    2.32    2.67    3.23    3.74

Master    5%     10%     20%     40%     50%     60%     80%     90%     95%
  mS    5.55    5.67    5.84    6.23    6.52    9.51   11.04   11.88   15.57
  mS    5.36    5.49    5.70    6.14    6.88    9.58   10.89   14.54   15.32


─── Total 50 connections (1 loops of 50 clients) - total request time ──────
SSL Ctx   5%     10%     20%     40%     50%     60%     80%     90%     95%
  mS    2.12    2.16    2.18    2.21    2.23    2.27    2.41    2.71    5.05
  mS    2.10    2.14    2.17    2.19    2.21    2.26    2.41    3.40    5.83

Master    5%     10%     20%     40%     50%     60%     80%     90%     95%
  mS    6.95    7.07    7.78    8.21    8.34    8.44    8.84    9.25    9.40
  mS    7.15    7.51    7.73    8.20    8.34    8.47    8.72    9.07    9.45

@nateberkopec
Member

No idea why, but benchmarks/wrk/ssl_realistic_response.sh is running 10x slower for me on this branch vs master.

@nateberkopec
Member

It looks like it's mostly some issue with the tail of the distribution:

connection 0: 270 requests completed
connection 1: 272 requests completed
connection 0: 271 requests completed
connection 1: 269 requests completed
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   173.45ms  338.32ms   1.32s    85.16%
    Req/Sec    72.39     47.01   150.00     62.81%
  Latency Distribution
     50%   17.33ms
     75%   22.92ms
     90%  785.02ms
     99%    1.30s 
  1082 requests in 32.10s, 208.16MB read
  Socket errors: connect 0, read 1078, write 0, timeout 32
Requests/sec:     33.71
Transfer/sec:      6.48MB

@nateberkopec
Member

Here's master, note no socket errors:

Running 30s test @ https://localhost:9292
  2 threads and 4 connections
connection 0: 1938 requests completed
connection 1: 1946 requests completed
connection 0: 1942 requests completed
connection 1: 1940 requests completed
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    15.46ms    1.03ms  23.82ms   66.73%
    Req/Sec   129.48     11.34   160.00     81.56%
  Latency Distribution
     50%   15.63ms
     75%   16.10ms
     90%   16.55ms
     99%   17.97ms
  7766 requests in 30.10s, 1.46GB read
Requests/sec:    257.98
Transfer/sec:     49.63MB

@MSP-Greg
Member Author

Not good, and odd. I just tried to install wrk, and make failed during 'install'. As soon as I figure that out...

@MSP-Greg
Member Author

I got wrk running (small Makefile change that was submitted as a PR to wrk on Dec 10, 2018). Odd things going on.

If I make the response smaller, this PR runs with no read errors.

With Puma at '-w2 -t10:10', wrk using '4 threads and 4 connections', performance was the same. With wrk using '4 threads and 8 connections', this PR was about 10% faster.

The throughput of both dropped considerably with '4 threads and 8 connections'. I'm wondering about that: I would think that 'Requests/sec' would reach a max but wouldn't drop much as one hit the server with more requests. In this case, it is really dropping.

Still trying to figure out the read errors...

@nateberkopec
Member

With Puma at '-w2 -t10:10', wrk using '4 threads and 4 connections',

The benchmark is designed specifically the way it is so that you're putting some pressure on the concurrency of the Puma process. With 4 connections coming from wrk and just 4 threads in Puma ready to receive them, you're going to see some contention for the worker pool (which will be 100% utilized) that you won't see with 10 Puma threads.

@MSP-Greg
Member Author

Re wrk, I started using clients/connections of multiples of 100, like 100, 200, or 300. Some applications of Puma might only have four concurrent clients (maybe a REST server?), but I think that's probably rare.

Re ssl, most of what this PR affects is the time/resources needed to initialize a new ssl client (the handshake), so using wrk won't really show the benefit.
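The distinction can be illustrated with Ruby's own openssl library (a rough sketch of the cost profile, not Puma's code): context construction has a real per-call cost, while a persistent connection, like the ones wrk holds open, pays setup costs only once.

```ruby
require 'openssl'
require 'benchmark'

# Rough illustration (not Puma's code): building an SSL context per
# client has a measurable per-call cost. wrk keeps its connections
# open, so a throughput test mostly exercises post-handshake
# reads/writes and hides that per-connection setup cost.
per_client = Benchmark.realtime do
  200.times { OpenSSL::SSL::SSLContext.new } # one context per client
end

ctx = OpenSSL::SSL::SSLContext.new
shared = Benchmark.realtime do
  200.times { ctx } # reusing one prepared context
end

puts format('per-client contexts: %.5fs, shared context: %.5fs',
            per_client, shared)
```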

Member

@evanphx evanphx left a comment

Just a couple little things

@@ -28,6 +28,8 @@ typedef struct {
int bytes;
} ms_cert_buf;

static VALUE eError;
Member

Do this a different way. static VALUEs are bad form.

Member Author

@MSP-Greg MSP-Greg Jan 25, 2021

I think that was existing code; I didn't look over all of it. I'll remove. Thanks.

Removed.

@@ -409,7 +437,7 @@ VALUE engine_extract(VALUE self) {
ms_conn* conn;
int bytes;
size_t pending;
char buf[512];
char buf[4096];
Member

Why did this change size?

Member Author

I started with just the context change and was testing with small response bodies. Nate noticed an issue that appeared with longer response bodies, so I then started debugging MiniSSL::Socket#write and the various read/write operations in it.

I didn't note it in the code, but the OpenSSL 1.1.1 man pages show 4096 is the default, and I verified that length was the max length returned by MiniSSL::Engine#extract.

I didn't look at the corresponding operation on the read side; I thought I'd leave it as is.

JFYI, I'm working on a test update, and it has an ssl test with 200 kB response bodies...

Member

Maybe just throw a comment above this line why it's this size. I should have done that with the 512 actually.

Member Author

Done. Maybe I got a little AR; the comments are below, easy to find next time...

  // https://www.openssl.org/docs/manmaster/man3/BIO_f_buffer.html
  // crypto/bio/bf_buff.c DEFAULT_BUFFER_SIZE

Contexts are currently created for every accepted ssl client.  That isn't needed, but more importantly, Contexts read two files.

For a busy server, this results in slow clients.

Update to create the Context in MiniSSL::Server.new, and pass it to MiniSSL::Engine.server.

Only affects MRI builds.
@nateberkopec nateberkopec merged commit 37a3bc6 into puma:master Jan 26, 2021
@MSP-Greg MSP-Greg deleted the ssl-context-2 branch January 27, 2021 20:38
JuanitoFatas pushed a commit to JuanitoFatas/puma that referenced this pull request Sep 9, 2022