Massive performance degradation in OpenSsl 3.0 if used in a heavy multi threaded server application #17064
It would be really helpful if you could describe the concrete operations you perform in the multiple parallel threads.
Are you able to get stack traces for particularly "hot" locks? It would be useful to see which codepaths are encountering this; that might suggest ways to improve performance. For example, algorithm fetching can be quite an expensive operation. Doing explicit fetches up front and then using the pre-fetched algorithm can have performance benefits and avoid a lot of code requiring locking.
Thank you for your fast response. Since our server is a kind of media server, we use the crypto functionality mainly for SRTP.
The stack traces don't quite supply enough "depth" to see where they are being called from. However it does look like it is related to fetching. If you have code like this:

```c
void *thread_worker(void *arg)
{
    for (int i = 0; i < 1000; i++) {
        EVP_CipherInit_ex(ctx, EVP_aes_128_cbc(), NULL, key, iv, e);
        /* more stuff */
    }
}

int main(void)
{
    /* Create lots of threads based on thread_worker */
}
```

then you might want to consider refactoring it to look more like this:

```c
EVP_CIPHER *aes128cbc;

void *thread_worker(void *arg)
{
    for (int i = 0; i < 1000; i++) {
        EVP_CipherInit_ex(ctx, aes128cbc, NULL, key, iv, e);
        /* more stuff */
    }
}

int main(void)
{
    aes128cbc = EVP_CIPHER_fetch(NULL, "AES-128-CBC", NULL);
    /* Create lots of threads based on thread_worker */
}
```

Similarly with digest fetches.
Thank you for your suggestion!
This sounds a lot like it belongs in either a FAQ or the migration guide.
The other option is that we add a cache, which will probably still involve taking a read lock.
The underlying problem is implicit fetching, not caching. Implicit fetching is done late and always calls fetch. Fetches are cached but can be expensive; here it is lock contention causing the problems. I suspect that there is only one library context, accesses to it are locked, and some locks are write locks: hence contention. This is expected in a heavily multi-threaded server. This was a known trade-off made a long time back in order to maintain compatibility: old applications will run, but not necessarily fast. The update to calling fetch up front wasn't considered to be too onerous.
I tried what @mattcaswell suggested. I could try to find out what is calling all this locking code.
Hmm. Are you able to expand out …
OK... keep going! The presence of …
The previous screenshots are from the original version.
Seems the contention is on fetching the HMAC algorithm from the legacy EVP_PKEY_CTX implementation, and on the fetch of the digest algorithm for the HMAC. The first could be removed by using the EVP_MAC API directly. Unfortunately the digest algorithm fetch would still be there. Perhaps preinitializing the EVP_MAC_CTX once and duplicating it in the threads would help?
Yes, as @t8m says it appears to be due mostly to the EVP_DigestSignInit call (see https://www.openssl.org/docs/man3.0/man3/EVP_MAC_init.html). Avoiding the digest fetch is trickier, but @t8m's method should work, i.e. create an … There also seems to be an EVP_CIPHER_fetch still happening. This is happening in an …
@mattcaswell The function … I will also try the rest of your suggestions. Even if this works, I see it only as half a workaround. We also have the requirement to use the FIPS provider, which could be enabled at runtime. Additionally, we use another third-party library, the Google gRPC library, which itself also uses OpenSSL. And every other client working in this way has to overcome this problem with all these workarounds.
I suppose that to make the fetches lockless we would have to add some kind of OSSL_LIB_CTX_freeze() API that would disallow load/unload of providers into the libctx. Then the cached implementations could be just under a read lock.
I actually think we should consider refactoring the OSSL_LIB_CTX code. It's based around CRYPTO_EX_DATA, which makes conceptual sense, but it makes the locking considerably harder than it needs to be. Take a look at the locking in …
I commented out too much code for testing purposes. :(
They should not need to be called. In the code snippet I posted way back above in this thread, the only fetch that should be happening is at startup. Subsequent to that …
That call only happens if the cipher isn't from a provider, i.e. it is one that hasn't been fetched already. If you pre-fetch the cipher, that code path is not taken. Line 166 is where this magic happens.
@thkdev2 - are you sure you are passing the pre-fetched cipher to the EVP_CipherInit_ex call? As @paulidale says, on line 166 …
Thank you both for the hint. |
It's unbearable. OpenSSL 1.1.1 is now considered outdated and v3 is unusable.
Is there a status update that I can get here? Reading through this, it seems like the problem still exists in some form, but several improvements have been made. I resurrected the perf tests that @hlandau made earlier and re-ran them on my system, and performance for those seems to be at parity.
I would like performance numbers, please, but while they are interesting, I'm not sure they're relevant to this issue (though I'm not 100% sure). From what I read on this issue, this particular problem might be constrained to Windows. Part of the difficulty in re-reading this is that I feel like we've lost the focus of the thread here. We started with a description of the problem, fixed a bunch of things in the interim, and now I'm not sure what the disposition of the issue is. So as much as new numbers, I was hoping to get a refresh on the problem statement: what are the original reporters experiencing in terms of performance, on what platforms, with what workloads, etc.
I think you are right @nhorman. I haven't tested 3.3.0 extensively yet, but my preliminary tests indicate that concurrency is now on the same order of magnitude as 1.1.1. My use cases are around using Python with gevent/threads and doing loads of HTTP requests. I wrote a Dockerfile to be able to build Python against either OpenSSL version and then run a simple test script: https://gist.github.com/glic3rinu/0878f9c2d1e72dc07932325bca8f6a4a
Oh, that is helpful @glic3rinu, thank you. That makes these larger-scale test cases significantly more accessible. I think if I can reproduce similar results using that, we can perhaps consider this issue closed and open new issues for subsequent performance degradation.
I ran some quick tests against www.google.com: 3K requests with 300 concurrent threads. (I was being rate limited by Google, so those are all 302 body-less responses; more variation could probably be observed with bigger HTTP responses, but I'm not sure how to test that quickly.)
OK, so I'm reading that as a 27% decrease from 1.1.1 in 3.3.0, which is much better than 3.0.12 but still not great. I don't suppose that in that testing harness you are able to run perf on the 3.3.0 server and report the flame graph back, are you?
Unfortunately I am not very familiar with C profiling. I did try to profile it with Python's builtin cProfile, and as expected the major timing differences are happening inside the OpenSSL C bindings, but I don't get to see what's going on inside, just the entry calls. E.g.
@nhorman this is an actionable thing we can work on. Creation of an SSL_CTX is known to be expensive. At the time of writing 3.0 we implemented certain caches in the SSL_CTX with the thinking that SSL object creation is common, but SSL_CTX creation is less frequent. Since then we have come to realise that many applications create an SSL_CTX per connection. A better model is probably to move the caches into the OSSL_LIB_CTX instead. With the changes made by @Sashan in #24414 we now have a mechanism for accessing lib ctx data indexes from inside libssl, which was previously a barrier to pursuing this.
We should also have a variant of our handshake performance test that reflects this model of connection creation.
I've reproduced the results that @glic3rinu got (or close to it: 281 req/sec on OpenSSL 1.1.1v, 230 req/sec on OpenSSL 3.3.0). Notable hotspots as reported by perf, using the reproducer in the gist here https://gist.github.com/glic3rinu/0878f9c2d1e72dc07932325bca8f6a4a openssl 3.3.0:
openssl 1.1.1v:
This suggests to me that we're slowing down most significantly in our loading and decoding of X509 certificates. I believe that @vdukhovni merged a patch here #24140 that may help with this. Going to try the head of the master branch for comparison.
Hmm, similar results on the master branch, so no help there. Looking a little more closely at where our cycle samples are coming from: openssl 1.1.1v:
openssl-master:
Need to look at this more closely, but it appears to suggest that the addition of the decoder code is primarily responsible for the slowdown here, specifically the OSSL_DECODER_from_data function and its subordinates.
Hey, I updated the test script to be able to reuse the SSL context across calls (using the urllib library instead of requests). I've also made a few other changes to make the runtime more consistent (timeouts and a better thread pool). Preliminary results between both versions seem to be on par when reusing the context 🥳
I concur; I've rerun here with the new tests and gotten similar results (387 req/s on 1.1.1v and 386 req/s with 3.3.0). So it seems @mattcaswell was correct that the lack of context reuse was the slowdown here (perf, perhaps corroborating this, shows a significant reduction in the number of cycles spent in OSSL_DECODER_from_data). So I suppose the takeaways here are:
Am I missing anything here? Is there additional work to be done on this ticket, or can it be closed?
This was discussed in another bug, but this is not good general advice. SSL_CTX sharing reflects application semantics.
Simply saying "reuse SSL_CTX whenever possible" forgets that there was a purpose to this API. OpenSSL 3.x reused it to work around some performance regressions, but it is only a workaround. It does not work for applications whose semantics demand that they make separate SSL_CTXs. Making new SSL_CTXs is an important use case for OpenSSL 3.x to support performantly, if it wishes to be a backwards-compatible, robust, and full-featured toolkit for TLS.
@nhorman as I said above, "A better model is probably to move the caches into the OSSL_LIB_CTX instead" - so we should ensure we have an issue to do this work.
The DECODER slowdown also still warrants further work. While we can do optimisations to avoid calling decoders in certain circumstances, there will still be workloads that need to use them.
@mattcaswell I agree on the decoder work, but aren't we already tracking that here: openssl/project#100
Ah - yes, right.
We have updated our multi-threaded server application from OpenSSL 1.0.2 to 3.0, which works functionally but is nevertheless unusable in practice.
We are mainly using the AES and SHA algorithms.
Our code is running on Windows and Linux.
We always use a self built library of a release version.
When we run a load test, we have many parallel instances working in different threads.
Using OpenSSL 1.0.2 on a 32-core machine results in a load of about 7% (of 100%) on Windows and about 800% (of 3200%) on Linux.
Merely switching to OpenSSL 3.0.0 results in a load of about 21% (of 100%) on Windows and about 3000% (of 3200%) on Linux with the same test.
The Linux version performs so badly that our overload/blocking checker kills the server!
So OpenSSL 3.0 is absolutely unusable from our perspective.
I took a look with a profiler on Linux and found that most of the time is spent in the pthread wait functions called by CRYPTO_THREAD_read_lock, CRYPTO_THREAD_unlock and CRYPTO_THREAD_write_lock, called by ossl_lib_ctx_get_data and others.
I know that the time is also counted while the threads are sleeping.
But of course it's clear that all the threads are colliding on such a hot code path, which is de facto single-threaded now.
And this is a major change from the 1.x versions, where the instances running in different threads really ran in parallel.
I don't know if there is something we can do in our usage to improve the situation.
Otherwise we see this as a serious bug preventing us from using OpenSSL 3.0.
Let me know if you need further information.