3.0.12 reveals pkcs11 engine / SoftHSM cleanup issue causing a segfault on exit #22508
Comments
I'll try to look at it. |
@sebastianas could you please run the command in question under valgrind with and without this commit? |
|
That's... interesting. In the valgrind run without the commit, there is a series of leaks, which is quite odd: the commit in question increases the refcount of the private key in the pkcs11 token by 1 (above what it would have been without the commit), so if anything I would have expected to see leaks with the commit rather than without. More interesting is the valgrind run with the commit. In this case there are a number of (effectively) use-after-free cases in the HSM module, triggered by the C++ runtime running an atexit handler that deletes the HSM singleton instance holding all the data the openssl module accesses through the engine API. What's more odd is how the commit in question would have prevented this behavior, given that atexit handlers should always run in reverse order (i.e. the loading of the HSM module should always occur after OPENSSL_init, where we register our atexit handler). So I'm not sure how this doesn't happen regardless of whether the commit in question is included. |
@nhorman why does it increase the refcount? I call the _free function in the commit in question so the refcount shouldn't change |
There are two issues here. Issue 1: softhsm should really, really, really use its own libctx and stop stomping on the default context, which should be reserved for the main application. Issue 2: the atexit() handler, which is often called before the engine tries to close all sessions. One could argue that openssl is wrong to install an atexit() handler without being directed to by the application, which is the only entity that has a chance of knowing in which order atexit() handlers should be installed (as that influences the order in which they are run). But really softhsm should handle this by fixing issue 1; then it can use its own context and not get messed up when it is asked to close sessions after openssl has already destroyed the default library context. Note that softhsm is not the only pkcs11 module that has these issues, so much so that I had to add this PR to the pkcs11-provider, which experienced the same problems as the engine early on: |
@beldmit I might be mistaken, but for every keytype in that patch, you call {keytype}_get1, followed by EVP_set1. Each of those calls ups the refcnt of the keytype. You do call keytype_free, but that only releases one of the two counts. |
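To make the refcount accounting being debated concrete, here is a toy model in C. The `key` and `holder` types and every function name are hypothetical stand-ins, not OpenSSL's API. The model assumes the setter releases the reference the holder previously held; whether the real `EVP_PKEY_set1_*` functions behave this way on this code path is exactly the open question, so treat this as a sketch of one possible accounting rather than a statement about OpenSSL.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins: "key" plays the role of an RSA/EC key object,
 * "holder" plays the role of the wrapper that owns one reference to it. */
typedef struct { int refs; } key;
typedef struct { key *held; } holder;

static void key_up_ref(key *k) { k->refs++; }

/* Toy free: only drops the count; a real implementation would destroy at 0. */
void key_free(key *k)
{
    if (k != NULL)
        k->refs--;
}

/* "get1" style accessor: returns the key with one extra reference. */
key *holder_get1(holder *h)
{
    if (h->held != NULL)
        key_up_ref(h->held);
    return h->held;
}

/* "set1" style setter. ASSUMPTION: it drops the previously held reference.
 * The new reference is taken first so that k == h->held is safe. */
void holder_set1(holder *h, key *k)
{
    key_up_ref(k);
    key_free(h->held);
    h->held = k;
}
```

Under that assumption, a get1/set1/free sequence on the same object leaves the count where it started; if the setter did not release the old reference, the same sequence would leave one extra reference behind.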
@simo5 that all makes sense, but I would be a bit concerned about the fact that the atexit handler in question is/may be part of the C++ runtime (see the free backtrace in the valgrind log). If that's the case, the C++ runtime is freeing global objects that the main application still has references to, and will use in the openssl atexit handler. Honestly I'm not sure how to fix that without abandoning the use of atexit entirely. |
FYI: Please see (closed) issues #17059 and #17537 for more detailed analysis of issues around |
@nhorman If softhsm is setting up an atexit() handler, then they absolutely need to fix that too (but I do not believe this is the case), they are invoked through at least 2/3 layers of indirection, it is not possible to safely use an atexit() handler in that case, and it is not needed in the first place, pkcs11 modules have a C_Finalize() function that dictates when they must and are safe to be unloaded. |
@rsbeckerca yeah, I am painfully aware of those discussions. I think openssl is wrong to set up atexit(); applications need to unload it. |
I tend to agree. Fortunately for my platform, memory leaks during |
People seem to run into this atexit handler regularly. The reason we do this by default is so that something like valgrind doesn't report issues.
We might have to rethink how we should do this. For instance we might need to keep a reference counter of how many times we've been initialized. But that's probably harder than it sounds.
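As a rough illustration of the "reference counter of how many times we've been initialized" idea, here is a minimal sketch in C. The `lib_*` names are hypothetical and do not correspond to any OpenSSL function, and the sketch ignores thread safety and implicit initialization, which are part of why this is harder than it sounds.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy library that counts its initializations. */
static int lib_init_count = 0;
static bool lib_resources_live = false;

/* Each user (application or library) pairs one lib_init()
 * with one lib_cleanup(). */
void lib_init(void)
{
    if (lib_init_count++ == 0)
        lib_resources_live = true;   /* first user sets up global state */
}

/* Returns true only when the last user leaves and global state is released. */
bool lib_cleanup(void)
{
    assert(lib_init_count > 0);      /* unbalanced cleanup is a caller bug */
    if (--lib_init_count > 0)
        return false;                /* another user still needs the library */
    lib_resources_live = false;
    return true;
}

bool lib_is_live(void)
{
    return lib_resources_live;
}
```

With two users, the first `lib_cleanup()` leaves the global state alone and only the second one releases it.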
|
Much harder than it sounds which is why it's not moving forward well. |
From what (little) I can tell, I don't think that atexit handler is getting set up by any other component directly. Based on the stack trace, it looks like the C++ runtime may be using atexit to clean up global objects. Not saying it's right, just saying that I can't see any other way for the delete method of an object to get called directly from an atexit handler |
The suggestion I have for this (used in other situations) is to maintain an opaque structure that the caller has to pass around for cleanup purposes instead of using |
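The opaque-structure approach suggested above might look roughly like the following C sketch. Everything here (`lib_handle`, `lib_handle_new`, `lib_handle_free`) is a hypothetical name for illustration; the point is only that the caller holds the handle and therefore decides when cleanup happens, rather than an exit handler deciding for it.

```c
#include <assert.h>
#include <stdlib.h>

/* Callers only ever see a pointer to this type. */
typedef struct lib_handle lib_handle;

struct lib_handle {
    int *scratch;   /* stands in for state a library would otherwise keep global */
};

static int live_handles = 0;

/* Allocate the state; returns NULL on allocation failure. */
lib_handle *lib_handle_new(void)
{
    lib_handle *h = malloc(sizeof(*h));
    if (h == NULL)
        return NULL;
    h->scratch = calloc(16, sizeof(int));
    if (h->scratch == NULL) {
        free(h);
        return NULL;
    }
    live_handles++;
    return h;
}

/* Release the state at a time chosen by the caller; NULL is a no-op. */
void lib_handle_free(lib_handle *h)
{
    if (h == NULL)
        return;
    free(h->scratch);
    free(h);
    live_handles--;
}

int lib_live_handles(void)
{
    return live_handles;
}
```

The cost of this design is that every caller has to thread the handle through its own code, which is an API change for existing users.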
Anyway let me debug tomorrow. If my patch is wrong and can't easily be fixed, then it will be better to roll it back. Otherwise if it is correct or can easily be fixed, the other problems should be dealt with. |
I do not think you can do that for the default handler, as initialization is implicit and simply does not happen multiple times just because different libraries within the same process call into openssl: the first one will cause initialization and all the others won't. You could keep track of all objects and defer de-initialization until all of them have been freed. But that would be functionally equivalent to not installing an atexit() handler if applications (or libraries calling into openssl) misbehave and do not clean up all the contexts they have allocated at exit. |
It isn't atexit, there is usually special code in the |
However, there is one thing that can be done, and that is to ensure that on atexit() (edit: and on OPENSSL_cleanup()) the correct ordering is used to free internal openssl data. Providers must be the first* thing that is released, and nothing else should be released until the providers are freed in the inverse order in which they were loaded (in case any provider depends on another provider to perform its shutdown). Of course the devil is in the details, as various contexts may hold pointers to *data held by providers, so those contexts need to be freed before the providers are; otherwise, when the providers' vtable ->free calls are made, things will again end in tears. This would at least cover the main issues with softhsm imploding when the pkcs11 provider calls the C_CloseSession or C_Finalize methods. softhsm fails in most of these cases because the default openssl context is already half deallocated before the providers (or engines) call the pkcs11 drivers' finalizers. So there is at least an ordering bug in the openssl cleanup logic that may not be easy to fix without a lot of introspection into what all the objects themselves reference. |
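The "free in the inverse order of loading" rule can be sketched with a small stack-like registry in C. The `module_*` names and the log buffer are hypothetical and are not OpenSSL internals; the sketch only shows the ordering, not the harder problem of contexts that still point at provider-held data.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_MODULES 8

/* Hypothetical registry entry: a name plus its teardown callback. */
typedef struct {
    const char *name;
    void (*teardown)(const char *name);
} module;

static module loaded[MAX_MODULES];
static size_t loaded_count = 0;
static char teardown_log[64];   /* records the order in which teardown ran */

static void log_teardown(const char *name)
{
    strncat(teardown_log, name,
            sizeof(teardown_log) - strlen(teardown_log) - 1);
}

/* Record a module in load order; returns 0 if the registry is full. */
int module_load(const char *name)
{
    if (loaded_count == MAX_MODULES)
        return 0;
    loaded[loaded_count].name = name;
    loaded[loaded_count].teardown = log_teardown;
    loaded_count++;
    return 1;
}

/* Tear modules down last-loaded-first, so a module that depends on an
 * earlier one can still call into it during its own shutdown. */
void module_unload_all(void)
{
    while (loaded_count > 0) {
        module *m = &loaded[--loaded_count];
        m->teardown(m->name);
    }
}

const char *module_teardown_log(void)
{
    return teardown_log;
}
```

Loading A, B, C and then calling `module_unload_all()` tears them down as C, B, A.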
I think you will find that different implementations make |
@paulidale that doesn't quite mesh with what the valgrind log shows:
It's certainly possible that the C++ runtime is doing something special in the main wrapper, but one way or another, it's using the exit handler path to make it happen. |
@rsbeckerca atexit should guarantee that registered functions are called in exactly the reverse of the registration order (i.e. calling atexit(a), atexit(b), atexit(c) implies that when exit is called, the registered functions are called in c, b, a order). That said, I'm struggling to see how this can ever work. I say that because OPENSSL_init_crypto should always be called prior to any dynamic engine being loaded (at least, I think it should). Given that OPENSSL_init_crypto is where the OPENSSL_cleanup handler is registered with atexit, we should be guaranteed that it will run after any handlers registered by subsequent atexit calls. Assuming that softhsm registers an atexit handler (be it via an explicit call, which I can't find, or via some wrapper magic in the C++ runtime), we should be guaranteed that the softhsm atexit handler will run prior to the openssl atexit handler. The implication is that the softhsm singleton object is guaranteed to be freed before the pkcs11 library shuts down, causing the use-after-free error we are seeing. I'm sure I'm missing something, but right now this seems very broken. |
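The reverse-order guarantee described above can be observed with a small C program. This is only a sketch for POSIX systems: it forks a child so that the handlers can run at the child's exit while the parent reads back the order over a pipe. The helper name `atexit_order` and the handler names are made up for this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int report_fd = -1;

/* Each handler writes one letter so the parent can see the call order. */
static void report(const char *s)
{
    if (write(report_fd, s, 1) != 1)
        _exit(1);
}

static void handler_a(void) { report("a"); }
static void handler_b(void) { report("b"); }
static void handler_c(void) { report("c"); }

/* Registers a, b, c in a child process, lets it exit, and stores the order
 * in which the handlers ran in out. Returns 0 on success, -1 on error. */
int atexit_order(char out[4])
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        close(fds[0]);
        report_fd = fds[1];
        atexit(handler_a);
        atexit(handler_b);
        atexit(handler_c);
        exit(0);            /* handlers run here, last registered first */
    }

    close(fds[1]);
    size_t n = 0;
    ssize_t r;
    while (n < 3 && (r = read(fds[0], out + n, 3 - n)) > 0)
        n += (size_t)r;
    out[n] = '\0';
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return 0;
}
```

Applied to this issue, the same rule means that a handler registered by a module loaded after OPENSSL_init_crypto would run before the handler that OPENSSL_init_crypto registered.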
@nhorman "Should" is the operative word. In some exotic architectures (you know of whom I speak), the memory space of the DLL and main program is different, meaning that the lists maintained by |
@rsbeckerca that's the weird thing. POSIX does mandate an unambiguous execution order: But in an ironic twist, it seems to me that this breakage will occur if and only if that required order is honored. I.e. if, on a more esoteric platform where atexit order cannot be guaranteed, the application's atexit handler runs prior to the softhsm atexit handler, this problem would not occur, as openssl would release any reference to the softhsm prior to its implied delete. Clearly I'm missing something here; I need to look at this more closely |
I am talking about the order in which openssl frees its internal structures, I am pretty sure it is not correct, not about atexit ordering |
C_CloseAllSessions, C_CloseSession, or C_Finalize are all called by the engine as it shuts down, and the engine is shut down as part of openssl's atexit() handler. The problem is that by the time it calls the engine's shutdown it is already too late: some of the openssl context has already been freed, and when softhsm calls into openssl as part of closing sessions (likely to free some objects it allocated, like EVP_PKEYs) it ends up double freeing, encountering null pointers, or in general finding the openssl context in a bad state, which ends in tears (and segfaults). |
I don't think that's what's going on. I agree that there is an ordering problem here in terms of what is getting freed when, but if you look at the invalid reads in the valgrind log and the segfault backtrace, both reference data that is class-private to softhsm and doesn't appear to be directly referenced by openssl. I.e. an atexit handler called the delete method from the HSM class; later, the openssl atexit handler attempted to tear down the engine that used the HSM, and wound up trying to read/write data that had already been freed in the HSM class object instance |
I think what you're missing is what happens to global destructors. The |
My initial analysis was wrong. I submitted an update in #22508 (comment). I'm sorry for the confusion. I think that fixing OpenSSL is the proper solution. Executing |
This should definitely never happen, OPENSSL_cleanup() should always happen before the process exit()s. OPENSSL_cleanup() is not something you want to ever call in a library anyway. |
SoftHSM can actually be fixed in a few different ways, they can:
I've got a proposed patch for the latter option here, but I'm not sure what direction they are going to take. It also might be worthwhile to augment the openssl application such that its main routine calls OPENSSL_cleanup immediately prior to calling exit. That's the fix that @sebastianas tested and confirmed to work in this environment. It prevents apps built from the openssl repo from encountering this failure (or any similar failure from a library that manually registers an atexit handler to do cleanup), but leaves other applications to fix this independently (likely in the same way). Fixing openssl more broadly to not use an exit handler, as discussed previously in this issue, is likely the better long-term fix, but it is a heavy lift in terms of potential API changes and not something that can be done immediately. I think the best path forward is:
I believe @hlandau opened a tracker for (2) already, though I can't find it at the moment (@hlandau can you link that here please)? If there are no objections, I'll open a PR for (1) shortly |
You're missing the point. I was talking about the order of executing |
#22544 Opened for review to fix this in our shipped applications |
It would be interesting to know if #22951 changes anything with this issue or it does not have any impact. |
On 2023-12-05 03:48:29 [-0800], Tomáš Mráz wrote:
It would be interesting to know if #22951 changes anything with this issue or it does not have any impact.
I rebuilt 3.1.4 with
ENGINE_load_private_key(): Do foreign key detection without side effects
fixup! ENGINE_load_private_key(): Do foreign key detection without si…
and it crashes. The backtrace looks to be the same. So it does not have
an impact.
Sebastian
|
How do we move forward here? Should pkcs11/SoftHSM be touched in order to solve this? |
I think we need to think about this as a few separate questions:
|
I think this is the only reasonable approach, the fact SoftHSM uses atexit handlers in a loadable module is broken for any application using it directly, not just openssl. The PKCS#11 API contract is that it is the application that controls the module shutdown via the finalizer function, using an atexit handler violates the contract. An application that installs an exit handler to free its resources on exit will face the same segfault issue when using the PKCS#11 API directly as well. So this specific segfault is effectively a SoftHSM issue and not an OpenSSL issue. |
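The contract described above, where the application rather than an exit handler decides when the module shuts down, can be sketched as a toy module in C. The `mod_*` functions and return codes are hypothetical and only loosely modelled on the shape of a PKCS#11 initialize/finalize pair; they are not the real PKCS#11 API. The sketch also shows a defensive choice: calls that arrive after finalization report an error instead of touching freed state.

```c
#include <assert.h>
#include <stdlib.h>

#define MOD_SLOTS 4

/* Hypothetical return codes for the toy module. */
typedef enum {
    MOD_OK = 0,
    MOD_NOT_INITIALIZED,
    MOD_NO_MEMORY,
    MOD_BAD_SLOT
} mod_rv;

static int *session_table = NULL;   /* stands in for the module's private state */

mod_rv mod_initialize(void)
{
    if (session_table != NULL)
        return MOD_OK;               /* already initialized */
    session_table = calloc(MOD_SLOTS, sizeof(int));
    return session_table != NULL ? MOD_OK : MOD_NO_MEMORY;
}

/* Called by the application when it is done with the module.
 * No exit handler is registered anywhere in this module. */
mod_rv mod_finalize(void)
{
    if (session_table == NULL)
        return MOD_NOT_INITIALIZED;
    free(session_table);
    session_table = NULL;
    return MOD_OK;
}

/* A late call after finalize fails cleanly instead of reading freed memory. */
mod_rv mod_close_session(int slot)
{
    if (session_table == NULL)
        return MOD_NOT_INITIALIZED;
    if (slot < 0 || slot >= MOD_SLOTS)
        return MOD_BAD_SLOT;
    session_table[slot] = 0;
    return MOD_OK;
}
```

In this shape, a caller that tears things down late gets an error code back rather than a crash, and the module never frees state behind the caller's back.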
I would agree with that. That said, given that (1) softHSM has completely ignored the issue, despite several people raising it, and (2) any other C++ library would be susceptible to this issue, that would seem to me to constrain our forward-looking actions to:
|
The title is "3.0.12 reveals pkcs11 engine..." with the comment "...the testsuite for libp11-0.4.12 breaks in 3.0.12, works in 3.0.11....". The question is why code like the following leads to a crash:
    RSA *rsa = EVP_PKEY_get1_RSA(pkey);
    EVP_PKEY_set1_RSA(pkey, rsa);
    RSA_free(rsa);
So far so good. Set the method, preset the key, and the engine code should work. Why the crash, then? In the NSS engine and PKIX-SSH code I do not use the finish method. Instead this "application data" is created with its own free method, for example:
    /* ensure RSA context index */
    if (ssh_pkcs11_rsa_ctx_index < 0)
        ssh_pkcs11_rsa_ctx_index = RSA_get_ex_new_index(0,
            NULL, NULL, NULL, CRYPTO_EX_pkcs11_rsa_free);
My conclusion is that #19965 reveals an issue with the PKCS#11 engine code. Note that the engine argument to the openssl utility sets the engine methods as default. |
openssl's engine pkcs11 + softhsm doesn't work with recent openssl: openssl/openssl#22508
Interesting :
Off topic: the IBM cryptoki PKCS#11 engine is also a dead project. I noted this about 4 years ago. Over the last few weeks I spent time testing engine_pkcs11 again, now from the p11 library project. The old status was:
    test_ssh_eng_pkcs11 # NOTE
    # 2016-02-13 :
    # - pass if pkcs11 engine is not registered as default
    # - fail with EC as public key can not be loaded(parsed) yet
    # (opensc limitation)
Note that the above status was the reason to publish PKIX-SSH version 8.8 on 29 Feb 2016 with "pkcs11 module support EC keys". There was no working engine. It seems to me the issues with the engine are now resolved. The engine now works for me without crashing in all cases:
for protocol
    cat > engine-pkcs11 <<EOF
    Engine pkcs11
    VERBOSE
    MODULE_PATH = $pkcs11
    EOF
    cat > openssl-pkcs11.cnf <<EOF
    openssl_conf = config
    [ config ]
    engines = engine_section
    [ engine_section ]
    engine1 = engine_pkcs11
    [engine_pkcs11]
    engine_id = pkcs11
    #init = 1
    #dynamic_path = $engine_pkcs11
    #use-ssh-conf#VERBOSE = EMPTY
    #use-ssh-conf#MODULE_PATH = $pkcs11
    EOF |
About the p11 crash in the regression test.
What is the difference? The pkix-ssh client performs explicit startup and clean-up of the openssl library. It seems to me this is the only difference from the openssl utilities. The openssl utility forces the engine as the default algorithm provider, but this is similar to the engine being loaded by path from the openssl configuration. Tested with pkix-ssh: no crash. I guess that the openssl team will revert the modification that exposes the failure in poorly maintained PKCS#11 engine projects. |
Hi, PKIX-SSH now has the following extra tests in the master repository.
Those tests pass with and without OpenSSL clean-up. Does that mean the cause of the issue is the OpenSSL utility? Off topic: the second commit includes a work-around specific to the PKCS#11 engine from the libp11 project. See the respective commit for technical details. Description: PKIX-SSH checks for some consistency between the optional private key material and the public part. Unfortunately libp11 creates EC keys with a bogus private part. This looks like a work-around for an OpenSSL issue, but in this case it is better to let the OpenSSL utility crash than to provide a fake key. The PKIX-SSH core tries to clean the fake part, as the token is expected not to export the private part. And here yet another OpenSSL regression could be encountered, so to use EC keys from this pkcs11 engine, OpenSSL 1.1.1 must be relatively new. Sample failure:
    > openssl version
    OpenSSL 1.1.1l-fips 24 Aug 2021 SUSE release 150500.17.22.1
    [1] SoftHSM slot ID 0x1 uninitialized, login (no label)
    Found slot: SoftHSM slot ID 0x6b823d3c
    Found token: test0
    Found 4 private keys:
    1 P id=0004 label=p11-ec521
    2 P id=0002 label=p11-ec256
    3 P id=0001 label=p11-rsa
    4 P id=0003 label=p11-ec384
    Cannot clean empty EC private key!
    Load key "engine:pkcs11:0002" - error in libcrypto
If the vendor decides to update to 1.1.1.r .... |
The #23063 PR was merged, this should resolve this problem as a side effect. However the underlying issue is still there. |
- Reenabled testing (closes: #48229). openssl3-3.1.5-alt2 - Backported upstream fix for openssl/openssl#22508.
Fix for openssl/openssl#22508 landed in Sisyphus. TODO: convert this to the patch
Fix for openssl/openssl#22508 landed in Sisyphus.
Hi,
the testsuite for libp11-0.4.12 breaks in 3.0.12, works in 3.0.11.
The testsuite:
https://sources.debian.org/src/libp11/0.4.12-1/debian/tests/engine/
The segfault occurs while generating a certificate request doing
The segfault:
This can be bisected to commit 02b87cc
Sebastian