3.0 performance degraded due to locking #20286

Open · Tracked by #453
richsalz opened this issue Feb 14, 2023 · 183 comments
Labels: branch: master, branch: 3.0, branch: 3.1, branch: 3.2, help wanted, triaged: bug, triaged: performance, triaged: refactor

Comments

@richsalz
Contributor

Using some internal QUIC tests, we see this timing in 1.1:

$ USE_GQUIC_VERSIONS=1 perf record -F 999 -g ./quic_lib_tests  --gtest_repeat=100  --gtest_filter=*ZeroRttDisabled*
$ perf report

+   31.73%    31.55%  quic_lib_tests  quic_lib_tests       [.] bn_sqr8x_internal
+    9.28%     9.28%  quic_lib_tests  quic_lib_tests       [.] mul4x_internal
+    4.91%     4.91%  quic_lib_tests  quic_lib_tests       [.] sha256_block_data_order_avx

In 3.0 we see this:


$ USE_GQUIC_VERSIONS=1 perf record -F 999 -g ./quic_lib_tests  --gtest_repeat=100  --gtest_filter=*ZeroRttDisabled*
$ perf report

+   11.02%    10.99%  quic_lib_tests  quic_lib_tests       [.] bn_sqr8x_internal
+    8.38%     8.08%  quic_lib_tests  libpthread-2.31.so   [.] __pthread_rwlock_rdlock
+    7.65%     7.51%  quic_lib_tests  libpthread-2.31.so   [.] __pthread_rwlock_unlock
+    4.98%     4.78%  quic_lib_tests  quic_lib_tests       [.] getrn
+    4.14%     4.11%  quic_lib_tests  quic_lib_tests       [.] mul4x_internal
+    3.37%     2.57%  quic_lib_tests  quic_lib_tests       [.] ossl_tolower
     3.30%     3.30%  quic_lib_tests  quic_lib_tests       [.] ossl_lh_strcasehash
+    2.72%     2.13%  quic_lib_tests  quic_lib_tests       [.] OPENSSL_strcasecmp
+    2.29%     2.05%  quic_lib_tests  quic_lib_tests       [.] ossl_lib_ctx_get_data
+    1.93%     1.93%  quic_lib_tests  quic_lib_tests       [.] sha256_block_data_order_avx

This seems to be part of OSSL_DECODER_CTX_new_for_pkey: 16% of the time is spent in locking, in a single-threaded binary, and another 10% in a string hashtable lookup.

If anyone on the project is going to look at this, I will try to get a small reproducer. But the time for our QUIC tests is doubling.
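For context, here is a minimal sketch of the call path being profiled. This is not the internal test (or the reproducer below); the X25519 key type, DER input and loop structure are assumptions for illustration only:

```c
/*
 * Hedged sketch: repeated key decoding via the 3.0 decoder API.
 * Each call rebuilds a decoder context, which is where the name-map
 * lookups and locking in the profile above come from.
 */
#include <openssl/decoder.h>
#include <openssl/evp.h>

static EVP_PKEY *decode_x25519(const unsigned char *der, size_t der_len)
{
    EVP_PKEY *pkey = NULL;
    const unsigned char *p = der;
    size_t len = der_len;
    OSSL_DECODER_CTX *dctx = OSSL_DECODER_CTX_new_for_pkey(
        &pkey, "DER", NULL, "X25519", EVP_PKEY_KEYPAIR, NULL, NULL);

    if (dctx == NULL)
        return NULL;
    if (!OSSL_DECODER_from_data(dctx, &p, &len))
        pkey = NULL;                    /* decode failed */
    OSSL_DECODER_CTX_free(dctx);
    return pkey;                        /* caller frees with EVP_PKEY_free() */
}
```

Calling something like this in a tight loop pays for decoder-chain construction and the string-keyed lookups behind it on every iteration.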

@richsalz added the issue: bug report label on Feb 14, 2023
@beldmit
Member

beldmit commented Feb 14, 2023

Yes, a small reproducer would be great. I'll try to look at it.

@richsalz
Contributor Author

Here is a small reproducer in C++, written by an Akamai colleague.

The code is at https://gist.github.com/richsalz/31a98a3095fa36ab6a66082a0c028a6f

Here are the compile commands and timings. This small reproducer is 17 times slower. When used in our QUIC implementation, small transfers average twice as slow. Microsoft's MSQUIC (their production-quality open-source stack on Windows and Azure) avoids 3.0 for this exact reason; they use the 1.1 branch of @quictls/openssl.

$ g++ -O2 -Wall -std=c++17 test_25519.cc -I openssl1/common/include/ openssl1/common/lib/libcrypto.a -lpthread -ldl -o bin.11
$ g++ -O2 -Wall -std=c++17 test_25519.cc -I openssl3/common/include/ openssl3/common/lib/libcrypto.a -lpthread -ldl -o bin.30

$ time ./bin.11

real	0m3.484s
user	0m3.481s
sys	0m0.002s

$ time ./bin.30 

real	0m53.588s
user	0m53.566s
sys	0m0.020s

@beldmit
Member

beldmit commented Feb 16, 2023

I have some good news. Current master (and 3.1, I presume) is 4 times faster than 3.0.

I will still look to see if there is something obvious to fix in 3.1.

@beldmit
Member

beldmit commented Feb 16, 2023

ossl_namemap_name2num_n with its descendants (including locks/unlocks) looks like a hotspot to me - 45% according to callgrind on master. @paulidale @hlandau - could you please take a look?

@beldmit
Member

beldmit commented Feb 16, 2023

E.g., could we avoid the CRYPTO_strndup/OPENSSL_free calls here (~10% in this synthetic example)?

@slontis
Member

slontis commented Feb 17, 2023

NOTE: @beldmit If you are looking at timing, do the calls twice and measure starting from the second one.
The first will cache the fetches.

@beldmit
Member

beldmit commented Feb 17, 2023

I am mostly looking at the number of calls.

@beldmit
Member

beldmit commented Feb 17, 2023

Commenting out the CRYPTO_strndup/OPENSSL_free calls causes a gazillion tests to fail and a 3x speedup. Weird.

@beldmit
Member

beldmit commented Feb 17, 2023

Never mind, I looked into the code :)

@paulidale
Contributor

This is on my to-look-at list, but I'm pretty busy following last week's face-to-face and I have a huge pile of other things to get through.

@beldmit, I have difficulty believing that strndup/free is the culprit. Locking is more likely, as @richsalz reported. 3.1 should have improved things in this regard by avoiding the more egregious locking problems. I'm hopeful, but I cannot say whether any other problems remain at this point. Time will tell.

However, I'm not certain that locking fixes alone will be sufficient. @richsalz's code directly encounters one of the known problems: decoders are slower in 3.0 due to the provider architecture. We need to be able to support a provider loading a key in a format that libcrypto doesn't know about and this means multiple parsing attempts. There simply isn't any other option. That doesn't mean we can't streamline things further -- we intend to look more deeply into this but contractor time is a limited resource and there are many other priorities. Rest assured that performance is definitely on the backlog and is important to the project.

As for MSQUIC, there is a quick (sic) win available by refactoring the QUIC code to not create a new SSL_CTX on every connection. Doing this in 1.1.1 is a moderately heavy operation, doing it in 3.0 is quite a bit heavier. I suspect that avoiding this will improve things significantly, both for 1.1.1 and 3.0/3.1. Unfortunately, I do not see myself getting time to contribute such a patch to MSQUIC any time soon.
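A minimal sketch of that refactoring, for illustration: only the SSL_CTX/SSL calls are real OpenSSL API, the surrounding function names are invented.

```c
/*
 * Sketch: pay for SSL_CTX_new() once at startup and create only the
 * per-connection SSL objects afterwards, instead of building a fresh
 * SSL_CTX for every connection.
 */
#include <openssl/ssl.h>

static SSL_CTX *quic_tls_ctx;

int quic_tls_init(void)
{
    quic_tls_ctx = SSL_CTX_new(TLS_method());
    if (quic_tls_ctx == NULL)
        return 0;
    /* one-time configuration: certificates, ciphersuites, ALPN, ... */
    return 1;
}

SSL *quic_tls_connection_new(void)
{
    /* cheap relative to constructing a new SSL_CTX per connection */
    return SSL_new(quic_tls_ctx);
}

void quic_tls_cleanup(void)
{
    SSL_CTX_free(quic_tls_ctx);
}
```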

Volunteers who can contribute such changes are welcomed of course.

@beldmit
Member

beldmit commented Feb 20, 2023

Yes, a longer run shows that strdup/free are not the culprit. I'll attach the callgrind results later.

@beldmit
Member

beldmit commented Feb 20, 2023

callgrind.out.655458.txt
What looks strange to me is 153 million calls to ossl_decoder_fast_is_a, accounting for 20% of overall time (with descendants).

@mattcaswell added the triaged: bug label and removed the issue: bug report label on Feb 20, 2023
@richsalz
Contributor Author

We need to be able to support a provider loading a key in a format that libcrypto doesn't know about and this means multiple parsing attempts.

The project went too far. In order to support things that aren't currently available, the project is hurting current users -- a lot. That is the wrong trade-off. (This kind of "make it flexible, we can't see future needs" thinking has been an endemic problem in OpenSSL code since the beginning, going back to SSLeay.)

@nibanks you might want to take a look at Pauli's suggestions above. My feeling, however, is that "quite a bit heavier" for flexibility that nobody can currently use is a mistake.

@nibanks

nibanks commented Feb 20, 2023

cc @wfurt to possibly help here

Perf with 3.0 is really a huge issue for MsQuic, and I'm not sure we can seriously adopt it as is. We can't just artificially work around things for the sake of making a perf test look good. We need to make sure we exercise real-world scenarios.

@beldmit
Member

beldmit commented Feb 20, 2023

Dear @nibanks
I'd split things apart.

  1. The redesign to providers causes a significant slowdown in some specific areas. That is a problem to be solved at the OpenSSL level, whether for corner cases or in general.

  2. The architectural difference between the 1.1.1 and 3.0 series is significant in itself, so approaches that worked well for 1.1.1 should be reconsidered for 3.0. @paulidale provided a completely relevant recommendation, and ignoring it looks like a suboptimal decision to me.

@wfurt

wfurt commented Feb 20, 2023

Is there a list of known deficiencies, @beldmit? I looked at the linked microsoft/msquic#2588 and it makes sense. We could look for patterns like that if it is clear what the caveats are. Aside from QUIC, I also work on SslStream, which powers all TLS operations in .NET. While we did the work to support OpenSSL 3, we did not run any benchmarks AFAIK, as at that time no distro we support was using it. I'm going to put this on my TODO list as well.

@beldmit
Member

beldmit commented Feb 20, 2023

@wfurt #17627 contains some of these issues. Loading keys/certificates is currently much slower than in 1.1.1, but in real life it's usually a one-time operation.

@wtarreau

On our side, we spent some time trying to use a simple HTTP client / load generator on Ubuntu 22 and were stunned to see it saturate at around 400 connections per second with 48 threads, while on other machines it worked fine. We noticed something completely odd regarding locking: it was impossible for the process to use much more than one CPU, and most of the time was spent in futex_wake() and friends. After trying many things we finally built 1.1.1 locally and linked against it, and the performance suddenly reached 74000 cps, almost 200 times more.

Additionally, on the server side we already noticed an extreme use of locking, with chains starting like this one and involving pthread_rwlocks:

  92.80%  [kernel]                  [k] native_queued_spin_lock_slowpath.part.0

I instrumented the code in the hope of finding "the" culprit and found something like 80 calls to some flavor of pthread_rwlock_*() in a single accept/get/response/close cycle! 1.1.1 already had quite a few of them (I don't remember how many) but significantly less overhead, as if they were held over shorter sections. In the end we reimplemented our own locks (on the same principle as the lock callbacks that were unfortunately dropped in 1.1) and could more than double the server-side performance thanks to using much lighter locks.

So yes, there definitely is a big locking problem there. The rule of thumb before calling any pthread_* function should normally be "am I really willing to give up the CPU and switch to another task, possibly thrashing all the L1 cache here?". Of course that's not exact, but in terms of cost and impact it's extremely close to reality. pthread_* functions should only be used in high-level code (browsers etc.) and rare cases; they're way too expensive for low-level or library code.

I failed to find a single occurrence of an atomic operation in the whole codebase, which makes me think they are all implemented by hand around a lock. That's a dead end when you consider that scalable code normally tries hard to avoid atomic ops because they slow things down, and that pthread_* functions involve atomic ops plus, in the contended case, a syscall that is at least 100 times slower than the atomic op itself...

For now we're condemned to stick with 1.1.1, as 3.0 is not just "degraded": for us it's simply not realistically usable at all for anything beyond the openssl command-line utility. This will pose security issues in the field over time once 1.1.1 stops being supported, as power users are only now starting to discover the big regression by progressively switching to newer distros. But for now we have no other choice.

@paulidale
Contributor

@wtarreau, there was a bug in 3.0 where multiple threads ended up completely serialised, which killed multiprocessor performance, and from your description I think you hit this. It was fixed by #18151, which was merged into later 3.0 releases.

OpenSSL does use atomic operations for reference counting (mostly). It was quite a performance win when it was introduced.
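For reference, a hedged sketch of that reference-counting pattern using the public CRYPTO_atomic_add() helper; the struct here is illustrative, not an actual OpenSSL object:

```c
/*
 * CRYPTO_atomic_add() uses a native atomic where the platform supports
 * one and only falls back to taking the supplied lock otherwise.
 */
#include <openssl/crypto.h>

struct widget {
    int references;
    CRYPTO_RWLOCK *lock;    /* fallback for platforms without atomics */
};

int widget_up_ref(struct widget *w)
{
    int count = 0;

    return CRYPTO_atomic_add(&w->references, 1, &count, w->lock);
}
```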

@wfurt, the rule of thumb is to pre-load what you're going to use. Fetching is an additional cost over 1.1.1, so pre-fetch. Certificate/key loading is slower, so pre-load where possible. Avoid convenience calls like EVP_sha256(), because these do a (late) fetch internally.
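A minimal sketch of that pre-fetch advice, assuming a long-lived application context; function names other than the OpenSSL calls are invented:

```c
/* Fetch the digest once at startup instead of relying on EVP_sha256()
 * on every operation. */
#include <openssl/evp.h>

static EVP_MD *md_sha256;

int crypto_init(OSSL_LIB_CTX *libctx)
{
    md_sha256 = EVP_MD_fetch(libctx, "SHA-256", NULL);
    return md_sha256 != NULL;
}

int crypto_sha256(const unsigned char *in, size_t inlen,
                  unsigned char *out, unsigned int *outlen)
{
    /* uses the pre-fetched EVP_MD, so no per-call fetch or name lookup */
    return EVP_Digest(in, inlen, out, outlen, md_sha256, NULL);
}

void crypto_cleanup(void)
{
    EVP_MD_free(md_sha256);     /* release the fetched reference */
}
```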

@richsalz, the post-quantum provider is using provider-based decoders. I am glad that the flexibility was included to enable that.

@richsalz
Contributor Author

@paulidale, yes, I know about OpenQuantumSafe. It could have been done by having them create EVP_PKEYs, etc., by hand. That might not be as clean, admittedly, but you could have added EX_DATA to those objects, for example. Again, this is hurting everyone for what is still a niche offering.

Or you could do the "freeze" concept I offered in #15600.

Or you could have an internal hard-wired path that avoids lookup for default-enabled keys.

Is "avoid convenience calls like EVP_sha256" documented anywhere?

There are lots of ways to approach this, and the project doesn't seem to be spending much effort on it, unless I am missing something.

@wtarreau

@paulidale, regarding #18151, I doubt it, given that the fix was merged 9 months ago already and the problem was still present in November or so (I don't remember which exact version was the last one by then, probably 3.0.7). And even the last tests we ran on 3.1-dev at that point were less bad than 3.0 but still far from 1.1.1. OK, interesting about the atomic ops; I'm just surprised I wasn't able to spot them.

@slontis
Member

slontis commented Feb 22, 2023

Is "avoid convenience calls like EVP_sha256" documented anywhere?

It appears not. It should be, so here it is:
#20354

@mattcaswell
Member

I instrumented the code in hope to find "the" culprit and found something like 80 calls to any flavor of pthread_rwlock_*() in a single accept/get/response/close cycle!

@wtarreau - I wonder if you have any documentation or notes from that time which might help us identify which specific locks were causing the problem? I'm assuming the "80 calls" were not from 80 different call sites, but certain calls being repeated multiple times?

@wtarreau

Hi Matt,

Most of the notes I had were about 1.1.1, which we'll stick with, and for now I don't have access to that machine. But I think I will get access again relatively soon, so as time permits I'll try to collect more info. The thing is, it's difficult to observe the call sites because the time wasted in locks is not evenly spread, due to context switches and sleeps in the kernel. In my case it was by redefining my own pthread_rwlock_*() functions and simply adding a call counter that I was amazed by the number of calls per connection. I do have some variants of these functions for debugging purposes which disable exponential back-off and make it easier to collect call graphs using perf; they're just a bit less convenient and less precise.
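For anyone who wants to repeat the counting experiment, a minimal LD_PRELOAD-style interposer in the same spirit (not the original code; only rdlock is counted here) could look like this:

```c
/* Count pthread_rwlock_rdlock() calls made by the process and forward
 * each one to the real implementation. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

static unsigned long rdlock_calls;
static int (*real_rdlock)(pthread_rwlock_t *);

static void report(void)
{
    fprintf(stderr, "pthread_rwlock_rdlock() calls: %lu\n", rdlock_calls);
}

int pthread_rwlock_rdlock(pthread_rwlock_t *rwlock)
{
    if (real_rdlock == NULL) {
        /* lazy init; a race here is harmless for a quick experiment */
        real_rdlock = (int (*)(pthread_rwlock_t *))dlsym(RTLD_NEXT,
                                                         "pthread_rwlock_rdlock");
        atexit(report);
    }
    __atomic_add_fetch(&rdlock_calls, 1, __ATOMIC_RELAXED);
    return real_rdlock(rwlock);
}
```

Build it with `gcc -shared -fPIC -o rwlock_count.so rwlock_count.c -ldl` and run the target under `LD_PRELOAD=./rwlock_count.so`; the counter is printed at exit.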

@mattcaswell
Member

But I think I will eventually again relatively soon, so as time permits I'll try to collect more info.

The other thing to bear in mind is that there are numerous performance improvements in 3.1. It's still not at 1.1.1 levels, but it's better than 3.0. If you do re-run the exercise, doing it against 3.1 would be most helpful for us to figure out what we still have left to do.

@nhorman
Contributor

nhorman commented Feb 6, 2024

@richsalz
FWIW, I've got a new RCU-locked hash table implementation here:
https://github.com/nhorman/openssl/tree/new-lhash

Currently it's roughly at parity with 3.0.

master + new hash table:
real 0m4.095s
user 0m4.075s
sys 0m0.003s

but there are several more tables that need to be converted to leverage lockless operation, if anyone would like to try it out.

@richsalz
Contributor Author

richsalz commented Feb 6, 2024

"roughly at parity" Can you post side-by-side comparisons? Also looking forward to seeing lock contention drop dramatically once you convert the various crypto-loookup table stuff. Nice work.

@beldmit
Member

beldmit commented Feb 6, 2024

@nhorman do you really mean at parity with 3.0, not an earlier version?

@nhorman
Contributor

nhorman commented Feb 6, 2024

Sorry, I meant to say that the new hash implementation is roughly at parity with the master branch (sometimes a percent or two faster), but effectively at parity. But stand by: I've noticed something regarding our L1 cache miss rate that I'm chasing down.

@wtarreau

wtarreau commented Feb 6, 2024

Are you sure the L1 cache misses aren't caused by the excess use of atomics in various places, causing cache lines to bounce back and forth between cores?

@nhorman
Contributor

nhorman commented Feb 6, 2024

@wtarreau that's some of it, yes. In particular EVP_MD_up_ref seems to be causing a good deal of contention, but there are several data structures involved in the tests I've been running (the handshake test in the tools repository). I'm observing them with perf record -a -g -e cache-misses if you want to look as well.

@wtarreau

wtarreau commented Feb 6, 2024

OK, I'm not surprised. I noticed that the number of locks per connection dropped between some version ranges from 3.0 to 3.1-dev, and the time spent in locks as well, while performance dropped at the same time! Performance dropping despite less time spent in locks was clearly a sign of increased atomics usage (probably placed inside certain small loops instead of taking a lock once around them for a short time), which further hurts performance and makes the cause much harder to locate at the same time :-/ I can't try it at the moment, but I'll probably give it a try some time in the future.

@nhorman
Contributor

nhorman commented Feb 6, 2024

The new hash table implementation here:
https://github.com/nhorman/openssl/tree/new-lhash
definitely has the potential to add atomic calls, but all use cases are currently operating in lockless mode.

It definitely seems to be becoming clear that when we accelerate hash table lookups in lockless mode we wind up, in the more heavily multithreaded cases, creating more cache-line contention on other reads (atomics and otherwise).

@nhorman
Contributor

nhorman commented Feb 6, 2024

@richsalz I don't have final data for your test case (I just ran it quickly today at @beldmit's request). Mostly I've been working with the handshake test in our tools repo. You can see those numbers here:
openssl/project#440 (comment)

@richsalz
Contributor Author

richsalz commented Feb 6, 2024

Okay thanks. But still, good work!

@t8m
Member

t8m commented Feb 13, 2024

I've tried the test_25519 reproducer with all the current 3.x branches and 1.1.1. Here are the results:

Branch   Real time (avg)   Penalty vs 1.1.1
1.1.1    2.40 s            0%
3.0      27.9 s            1146%
3.1      4.56 s            90%
3.2      3.17 s            32%

I think that with 3.2 we are pretty near the 1.1.1 performance on this particular benchmark, which is basically testing repeated decoding of an X25519 key.

@beldmit
Member

beldmit commented Feb 13, 2024

callgrind.out.986463.txt
FTR, 3.2 callgrind results.

@paulidale
Contributor

Rather than going to RCU with lhash, I suggest replacing lhash with something more recent. While lhash was a great idea at the time, it has serious flaws now:

  1. OpenSSL's use of hash tables is mostly add-only; there are only a couple of instances where item removal is required. Lhash was designed for regular addition and removal. It is really clever about efficiently resizing (both larger and smaller) when required, but our usage generally doesn't require size reduction (which comes at a cost).
  2. Lhash runs with a deliberately high load factor, often above unity, which pretty much guarantees collisions that need to be resolved.
  3. Lhash resolves collisions by searching a linked list of clashing items: linear time in the worst case.
  4. Lhash resizing is purely power-of-two based, which is far from the optimal ϕ-based resizing.

Things have moved forward a lot in the 40+ years since the lhash algorithm was invented. Hash tables these days should run with a lower load factor (memory being cheap), have efficient collision resolution, and be cache efficient. There has been a lot of recent work on cache-friendly hash tables, and also on lockless hash tables. My suggestion is a cache-friendly hash table in preference to a lockless one, but both should be considered.
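For illustration only (this is neither lhash nor any proposed replacement), an open-addressing lookup with linear probing shows what "cache friendly" means here: a probe walks adjacent slots in one array instead of chasing a per-bucket linked list.

```c
/*
 * Generic sketch of open addressing with linear probing (no deletions,
 * load factor kept well below 1 so an empty slot always ends the probe).
 */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct slot {
    const char *key;    /* NULL means the slot is empty */
    void *value;
};

struct table {
    struct slot *slots;
    size_t mask;        /* capacity - 1, capacity a power of two */
};

static void *table_get(const struct table *t, const char *key, uint64_t hash)
{
    for (size_t i = hash & t->mask; t->slots[i].key != NULL;
         i = (i + 1) & t->mask) {
        if (strcmp(t->slots[i].key, key) == 0)
            return t->slots[i].value;   /* found */
    }
    return NULL;                        /* hit an empty slot: not present */
}
```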

There have also been some very recent developments in hash tables that are provably optimal in the space-vs-time trade-off. We must consider these too. E.g. https://www.quantamagazine.org/scientists-find-optimal-balance-of-data-storage-and-time-20240208/

@hlandau
Member

hlandau commented Feb 15, 2024

Replacing LHASH, both API and implementation, would get a thumbs up from me. The existing LHASH code can be left as a near-deprecated public API for applications which still use it. Unfortunately LHASH is used as part of our public API in places so we can't realistically formally deprecate it.

I certainly have some ideas on the design of a replacement (internal) API.

I'd like to see any replacement minimise allocations as much as possible and maybe even support moves for objects which can support it (partially intrusive design). I also want it to generate code which allows hash and equality function pointer calls to be devirtualised and inlined.

Some food for thought:

@hlandau
Member

hlandau commented Feb 15, 2024

Let's move the LHASH replacement discussion here: #23599
