New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix SSL Context - minissl.c, minissl.rb, extconf.rb #2519
Conversation
4aa4d05
to
66b9d8c
Compare
lib/puma/minissl.rb
Outdated
@@ -328,15 +331,15 @@ def to_io | |||
def accept | |||
@ctx.check | |||
io = @socket.accept | |||
engine = Engine.server @ctx | |||
engine = Engine.server (IS_JRUBY ? @ctx : @sslctx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this engine/sslctx is pretty hot-path, so I think we should instead just have an engine
method which only has to run this check/ternary once. memoize or just use metaprogramming, your choice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed, moved the JRuby checks to initialize...
test/rackup/realistic_response.ru
Outdated
@@ -6,6 +6,6 @@ long_header_hash = {} | |||
long_header_hash["X-My-Header-#{i}"] = SecureRandom.hex(25) | |||
end | |||
|
|||
response = SecureRandom.hex(100_000) # A 100kb document | |||
response = SecureRandom.hex(50_000) # A 100kb document |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What was the reason for the change here? should update the comment at least
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ruby -rsecurerandom -e "puts SecureRandom.hex(100_000).length"
outputs 200_000, or SecureRandom.hex(b_len)
creates a byte sequence of b_len
bytes, and then converts it to a hex string. Each byte turns into 2 characters, so...
I'll split it out to a separate PR.
I understand this is still marked as draft, but unless we have a really compelling reason I think that:
|
Nate, I ran the code locally with Windows and WSL2/Ubuntu, and there often was a speed increase. But, I can't test macOS, and part the benefit may also be memory use, as the current code creates a new context for every ssl client, and if many are 'persistent' by 'keep-alive', that means there are many more context objects. I was hoping a few people might test it in other environments... |
I can't get it to run on master. I copy the misc folder from this branch, then run it:
|
I'm looking at it, I may rebase and push... |
66b9d8c
to
e326323
Compare
Nate, Sorry, too many branches. Rebased & updated. To use in master, along with copying the misc folder, one needs to copy |
e326323
to
e9e914b
Compare
this branch:
master:
looks like a pretty significant improvement |
misc/sockets.rb
Outdated
|
||
DARWIN = !!RUBY_PLATFORM[/darwin/] | ||
|
||
HOST = TestPuma.const_defined?(:SvrBase) ? SvrBase::HOST : '127.0.0.1' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's SvrBase
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm working on a more DSL based framework for the tests. SvrBase
is a parent class for SvrPOpen
& SvrInProc
, which are the two classes for starting servers. socket.rb
and ci_test.ru
are also part of the framework.
socket.rb
is the client socket class, and ci_test.ru
is a rackup that works with socket.rb
. It allows one to set the size of the response body and a delay in the app via headers.
I'm trying to make the framework work both for CI and generic testing/benching one might want to do locally. It's also setup to allow one to throw a lot of clients at Puma, and easily change their parameters (bind type, body size, etc).
I'm trying to get it stable (without retry) and also need to add sending multiple requests with a single client...
Nate, Thank you for running it. Not sure which version you ran, the most recent doesn't need anything other than the files in the misc folder. Data for three runs I did is below, using WSL2/Ubuntu. I varied the number of client threads, as I'm not sure if Windows Ubuntu handles files the same as native Ubuntu. Regardless, all three showed faster times with the PR...
|
No idea why, but |
It looks like it's mostly some issue with the tail of the distribution:
|
Here's master, note no socket errors:
|
Not good, and odd. I just tried to install wrk, and make failed during 'install'. As soon as I figure that out... |
I got wrk running (small Makefile change that was submitted as a PR to wrk on Dec 10, 2018). Odd things going on. If I make the response smaller, this PR runs with no read errors. With Puma at '-w2 -t10:10', wrk using '4 threads and 4 connections', performance was the same. With wrk using '4 threads and 8 connections', this PR was about 10% faster. The throughput of both dropped considerably with '4 threads and 8 connections'. I'm wondering about that, as I would think that 'Requests/sec' would see a max, but it wouldn't drop much if one hit the server with more requests. In this case, it is really dropping. Still trying to figure out the read errors... |
The benchmark is designed specifically the way it is so that you're putting some pressure on the concurrency of the Puma process. With 4 connections coming from wrk and just 4 threads in Puma read to receive them, you're going to see some contention for the worker pool (which will be 100% utilized) that you won't see with 10 puma threads. |
Re wrk, I started using clients/connections of multiples of 100, like 100, 200, or 300. Some applications of Puma might only have four concurrent clients (maybe a REST server?), but I think that's probably rare. Re ssl, most of what this PR affects is the time/resources needed to initialize a new ssl client (the handshake), so using wrk won't really show the benefit. |
e9e914b
to
f3d124f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a couple little things
ext/puma_http11/mini_ssl.c
Outdated
@@ -28,6 +28,8 @@ typedef struct { | |||
int bytes; | |||
} ms_cert_buf; | |||
|
|||
static VALUE eError; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do this a a different way. static VALUEs
are bad form.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that was existing code, I didn't look over all of that. I'll remove. Thanks.
Removed.
@@ -409,7 +437,7 @@ VALUE engine_extract(VALUE self) { | |||
ms_conn* conn; | |||
int bytes; | |||
size_t pending; | |||
char buf[512]; | |||
char buf[4096]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did this change size?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I started with just the context change, and was testing with small response bodies. Nate noticed an issue that appeared with longer response bodies, I then started debugging MiniSSL::Socket#write
and the various read/write operations in it.
I didn't note it in the code, but the OpenSSL 1.1.1 man pages show 4096 is the default, and I verified that length was the max length returned by MiniSSL::Engine#extract
.
I didn't look at the corresponding operation on the read side, thought I'd leave it as is.
JFYI, I'm working on a test update, and it has an ssl test with 200 kB response bodies...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe just throw a comment above this line why it's this size. I should have done that with the 512 actually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Maybe I got a little AR, comments below, easy to find next time...
// https://www.openssl.org/docs/manmaster/man3/BIO_f_buffer.html
// crypto/bio/bf_buff.c DEFAULT_BUFFER_SIZE
7672e1c
to
d253869
Compare
Contexts are currently created for every accepted ssl client. That isn't needed, but more importantly, Contexts read two files. For a busy server, this results in slow clients. Update to create the Context in MinSSL::Server.new, and pass it to MinSSL::Engine.server. Only affects MRI builds.
d253869
to
a46423e
Compare
Contexts are currently created for every accepted ssl client. That isn't needed, but more importantly, Contexts read two files. For a busy server, this results in slow clients. Update to create the Context in MinSSL::Server.new, and pass it to MinSSL::Engine.server. Only affects MRI builds.
Description
While working on an update to the test suite, client connections were slow with ssl tests.
Currently, a new context is created for each accepted client connection. A context also reads two files for the cert and key.
Refactored to use one context created in MiniSSL::Server.new.
Code is included in a
misc
folder that reports on the times needed to open the socket and write a simple request. It logs the time of each client's GET request, each having about a 100k body, then sorts the times, and outputs percentile times. See the socket_times.md in the misc folder for info.Times for this PR (see 'test ssl connection times` step):
https://github.com/MSP-Greg/puma/actions/runs/453868127
Times for current master:
https://github.com/MSP-Greg/puma/actions/runs/453894649
The CI for the PR has an additional 'test tcp connection times' step, which helps tell the two apart.
Times for this PR are approx two to three times faster than master on both Ubuntu & macOS.
But, there are a few odd things:
Checking it locally with Windows WSL2/Ubuntu, I noticed a larger speed increase, but that may be specific to that implementation. Oddly, the tcp times were almost the same as Actions CI.
The longest macOS times are really high with the PR, compared to master. Not sure about what could be causing the issue.
I'd appreciate if anyone running *nix or macOS natively can check the results.
To run locally, compile Puma, then (from the Puma folder) run
misc/run_socket_times.sh
. Save the misc folder in master, and the tests will also run there. You may need to run the following:For more info on running
misc/run_socket_times.sh
, seesocket_times.md
Note that the last three commits may be removed for the PR...
Thanks, and Happy New Year, Greg
Your checklist for this pull request
[changelog skip]
or[ci skip]
to the pull request title.[ci skip]
to the title of the PR.#issue
" to the PR description or my commit messages.