tls: OpenSSL 3/1.1.1 is using shared ERR_STATE in all workers #3695
Comments
RFC: OpenSSL 3 POC branch: https://github.com/kamailio/kamailio/tree/space88man/openssl3-poc |
Thank you for the detailed analysis and also the POC. I remember from a custom module that I maintained for some years that using threads in Kamailio can bring some challenges due to its multi-process architecture, so it probably needs to be discussed more thoroughly. The other option you mentioned would be to just not do the TLS initialization in rank 0, right? If we need to touch all OpenSSL-using modules anyway, maybe this is an easier and less intrusive way? |
One solution is to have each module declare a
Then this thread will disappear after BTW this study explains why even OpenSSL 1.1.1 is so odd - per child replicated I have also gone back to look at the OpenSSL 1.1.1 implementation - by putting all initialization ( To be clear, the dlsym-pthreads stuff( If we fix the handling of rank 0 thread-local variables correctly, then all the pain that Kamailio has experience with OpenSSL over the years might be over! |
@space88man: thanks for digging deep into this one! I am fine to try the proposed approach in the PR #3696 and see how it goes; the only question would be about the impact of other libs that link against libssl behind the scenes, like libcurl or libmysqlclient. Is a similar approach needed for the modules linked with such libraries? Regarding multi-threading in Kamailio, there are a couple of modules already using threads, like |
I have mentioned this before as an issue and it was a driving factor behind creating the API to the curl module. Having multiple modules using Curl that initiates OpenSSL (or non-curl modules initiating OpenSSL) will lead to problems. I remember that Kevin Fleming while working with Asterisk wrote a wrapper library that initialised OpenSSL once only for all modules. I have no idea if that still exists, but if it does, it could be an inspiration. |
The tls module initializes libssl very early, which was ok for libssl 0.9.x and 1.0.x. With libssl 1.1.x we had to import the random number generator to work around some of the libssl-specific multi-threading. With libssl 3.x there seems to be more impact, somehow related to what was introduced in libssl 1.1.x, but expanded to other globals. In other words, I don't think that a wrapper library written long ago, before libssl 1.1.x/3.x, can help nowadays. |
Ok thanks guys for your feedback: I am going to proceed with a series of commits to master. These can then be reverted easily. To each commit message I will also add the label thread-local @miconda you are correct regarding libssl-initialization threads in rank 0; they are run-and-done type and will complete before |
@space88man: one remark regarding the commit message format in the PR #3696, do not make first line like:
Where I assume POC stands for
|
I have implemented no RAND replacement in OpenSSL 1.1.1. The initial set of commits for 1.1.1/3.x are in master. Verification:
OpenSSL 1.1.1:
|
Closing now with commits on master—reopen with separate issues for OpenSSL 1.1.1/OpenSSL 3. |
- the 2nd lock was put in place as defensive programming for shm contention - GH #3695: the underlying issue is early init of thread-locals
Description
References: #3635
Early initialization by tls of OpenSSL 3/1.1.1 in rank 0 results in the use of shared state ERR_STATE *. Under heavy traffic the workers will corrupt the state. This is a race condition which results in intermittent crashes (as in #3635).
Initialization calls such as OPENSSL_init_ssl are responsible for creating and initializing ERR_STATE in rank 0, which is then inherited and reused by all worker processes without reinitialization. This results in memory corruption (not observable on a lightly loaded system).
This bug is much less evident with OpenSSL 1.1.1 as that version of the library has less aggressive dynamic memory management (particularly in crypto/err/err.c). Most kamailio + OpenSSL 3.x bug reports show that SIGSEGV happens mostly inside crypto/err/*.[ch].
[Update] OpenSSL 1.1.1: for a similar reason, a set of thread-local variables (public_drbg, private_drbg) is why there is a need for RAND replacement in tls_rand.c. These variables are inherited by the worker without proper initialization, hence the failure of the RAND system if it is not replaced.
Note: this is not related to shared memory allocation contention, as this is protected by a (multi-process) futex. Ping @miconda re: 1a9b0b6
Since qm/fm et al are already protected by a multi-process futex, this commit is redundant (it puts a pthread mutex around the futex). I have been able to reproduce OpenSSL 3 crashes under heavy load with this commit.
The shared object in question is the thread-local ERR_STATE. It should be thread-local, but the dump under Debugging Data shows that multiple workers are accessing the same object.
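A minimal sketch of the inheritance problem described above. It assumes (the issue does not state this explicitly) that the per-thread state is allocated from memory shared across fork(), which the references to the shm allocator suggest. The names struct fake_err_state, err_key and child_write_visible_in_parent() are hypothetical, and nothing here calls OpenSSL.

```c
#define _DEFAULT_SOURCE
#include <pthread.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for OpenSSL's ERR_STATE; not the real layout. */
struct fake_err_state {
    int top;
};

static pthread_key_t err_key;

/* Returns 1 if a write made by the forked child through its inherited
 * thread-local pointer is visible in the parent, 0 if not, -1 on error. */
int child_write_visible_in_parent(void)
{
    /* Assumption: the state lives in memory shared across fork(). */
    struct fake_err_state *st = mmap(NULL, sizeof(*st),
                                     PROT_READ | PROT_WRITE,
                                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (st == MAP_FAILED)
        return -1;
    st->top = 0;

    if (pthread_key_create(&err_key, NULL) != 0)
        return -1;
    /* "Rank 0" populates its thread-local slot before forking. */
    if (pthread_setspecific(err_key, st) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Worker: the slot is already set, so no fresh state is created. */
        struct fake_err_state *inherited = pthread_getspecific(err_key);
        if (inherited != NULL)
            inherited->top = 42;
        _exit(0);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;

    int visible = (st->top == 42);
    pthread_key_delete(err_key);
    munmap(st, sizeof(*st));
    return visible;
}
```

With several workers doing this concurrently, each one would be mutating the same struct without any lock, which matches the race described here.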
Troubleshooting
N/A
Reproduction
ERR_STATE * is initialized in rank 0
Debugging Data
Dump ERR_STATE * from two different processes: observe that these are identical, meaning both workers are using the same struct.
Log Messages
SIP Traffic
N/A
Possible Solutions
ERR_STATE * is thread-local so will not be propagated to the main thread
tls hooks in a worker thread: this ensures that they get their own copy of ERR_STATE * and will not be affected by rank 0
Additional Information
OpenSSL (1.1.1, 3.x) has initialize-once and initialize-once-per-thread semantics. The (per-thread-singleton) object ERR_STATE is of the type initialize-once-per-thread.
When kamailio does OpenSSL initialization in rank 0, the workers inherit all "initialize-once*" objects. If these objects are intended to be mutable, the workers will contend for the same state.
Due to the design of OpenSSL 3, a lot of this state (static variables/functions, one-time initialization, no accessors) cannot be reset in child processes. Initialize-once-per-thread state can only be "renewed" in new threads.
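The two kinds of semantics can be illustrated with plain pthreads. This is only a sketch of the general pattern, not OpenSSL's code: state_key, get_thread_state() and new_thread_gets_fresh_state() are hypothetical names, and the 64-byte allocation is an arbitrary placeholder for a per-thread object.

```c
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

static pthread_once_t key_once = PTHREAD_ONCE_INIT;
static pthread_key_t state_key;

/* Initialize-once: runs a single time per process. */
static void make_key(void)
{
    pthread_key_create(&state_key, free);
}

/* Initialize-once-per-thread: lazily create the calling thread's state. */
void *get_thread_state(void)
{
    pthread_once(&key_once, make_key);
    void *st = pthread_getspecific(state_key);
    if (st == NULL) {
        st = calloc(1, 64);
        if (st != NULL && pthread_setspecific(state_key, st) != 0) {
            free(st);
            st = NULL;
        }
    }
    return st;
}

/* Thread body: report whether this thread's state differs from `arg`. */
static void *thread_main(void *arg)
{
    void *mine = get_thread_state();
    return (void *)(intptr_t)(mine != NULL && mine != arg);
}

/* Returns 1 if a new thread receives its own state while the caller
 * keeps seeing the same one, 0 if not, -1 on error. */
int new_thread_gets_fresh_state(void)
{
    void *first = get_thread_state();
    void *second = get_thread_state();
    if (first == NULL || first != second)
        return 0;

    pthread_t t;
    void *res = NULL;
    if (pthread_create(&t, NULL, thread_main, first) != 0)
        return -1;
    if (pthread_join(t, &res) != 0)
        return -1;
    return res != NULL;
}
```

The calling thread has no way to get a second, clean state through get_thread_state(); only a new thread does. A process forked from that caller starts with the caller's slot already filled.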
At the point of fork(): all mutable thread-local state created in rank 0 should be uninitialized. Currently (as of OpenSSL 3.2.0) the main culprit is ERR_STATE. I haven't found any other points of contention.
I have a local branch based on master; it skips all OpenSSL initialization in rank 0 (by the time of fork(), pthread_getspecific(err_thread_local) is still returning NULL), and it passes all my load testing. To complicate the scenario I load outbound.so in the config and finesse ERR_STATE * initialization by running most of outbound_mod.c:mod_init in pthread_create.
Rank 0 calls that set ERR_STATE *
In kamailio start-up these are the functions that call ossl_err_get_state_int: this will set ERR_STATE in rank 0 and all child processes. This start-up does not include any other modules that use OpenSSL.
OPENSSL_init_ssl() in tls_h_mod_pre_init_f()
SSL_load_error_strings() in tls_h_mod_pre_init_f()
RAND_xxxxxxxxxxxx calls that enable threading in the randctx in tls_h_mod_pre_init_f()
tls_fix_domains_cfg() in mod_child() for PROC_INIT
Example module using OpenSSL: outbound.so - this calls ossl_err_get_state_int() in mod_init(); to avoid initializing ERR_STATE in the primary thread of rank 0, run the crypto parts of mod_init() in pthread_create(...).
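The pthread_create(...) approach can be sketched as follows, under stated assumptions: err_key stands in for OpenSSL's internal err_thread_local key, touch_err_state() stands in for any library call that lazily creates per-thread error state, and crypto_init() stands in for the crypto parts of a module's mod_init(). All of these names are hypothetical. This is not the code from the POC branch.

```c
#include <pthread.h>
#include <stdlib.h>

static pthread_once_t err_key_once = PTHREAD_ONCE_INIT;
static pthread_key_t err_key;   /* hypothetical stand-in for err_thread_local */

static void make_err_key(void)
{
    pthread_key_create(&err_key, free);
}

/* Hypothetical stand-in for library code that lazily creates the
 * calling thread's error state on first use. */
static int touch_err_state(void)
{
    pthread_once(&err_key_once, make_err_key);
    if (pthread_getspecific(err_key) == NULL) {
        void *st = calloc(1, 64);
        if (st == NULL)
            return -1;
        if (pthread_setspecific(err_key, st) != 0) {
            free(st);
            return -1;
        }
    }
    return 0;
}

/* Hypothetical stand-in for the crypto parts of mod_init(). */
static void *crypto_init(void *arg)
{
    int *rc = arg;
    *rc = touch_err_state();
    return NULL;   /* the thread's state is freed by the key destructor */
}

/* Runs crypto_init() in a run-and-done thread and waits for it.
 * Returns the init result, or -1 if the thread could not be run. */
int run_init_in_thread(void)
{
    pthread_t t;
    int rc = -1;
    if (pthread_create(&t, NULL, crypto_init, &rc) != 0)
        return -1;
    if (pthread_join(t, NULL) != 0)
        return -1;
    return rc;
}

/* Returns 1 if the calling thread still has no per-thread state. */
int caller_state_is_unset(void)
{
    pthread_once(&err_key_once, make_err_key);
    return pthread_getspecific(err_key) == NULL;
}
```

Calling run_init_in_thread() from the primary thread and then checking caller_state_is_unset() before fork() mirrors the check that pthread_getspecific(err_thread_local) still returns NULL in rank 0.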