Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

3.0.12 reveals pkcs11 engine / SoftHSM cleanup issue causing a segfault on exit #22508

Open
sebastianas opened this issue Oct 25, 2023 · 119 comments
Assignees
Labels
branch: master Merge to master branch triaged: bug The issue/pr is/fixes a bug
Milestone

Comments

@sebastianas
Copy link
Contributor

Hi,

the testsuite for libp11-0.4.12 breaks in 3.0.12, works in 3.0.11.
The testsuite:
https://sources.debian.org/src/libp11/0.4.12-1/debian/tests/engine/
The segfault occurs while generating a certificate request doing

  openssl req -engine pkcs11 -new -key 'pkcs11:model=SoftHSM%20v2;object=test-key;pin-value=1234' \
         -keyform engine -out /tmp/tmp.NmBE19AHMC/req.pem -text -x509 -subj /CN=libp11-DEP8

The segfault:

Engine "pkcs11" set.                                                                                                                                                                           
                                                                                                                                                                                               
Program received signal SIGSEGV, Segmentation fault.                                                                                                                                           
SlotManager::getSlot (this=0x0, slotID=slotID@entry=1769803198) at ./src/lib/slot_mgr/SlotManager.cpp:174                                                                                      
174     ./src/lib/slot_mgr/SlotManager.cpp: No such file or directory.                                                                                                                         
(gdb) bt                                                                                                                                                                                       
#0  SlotManager::getSlot (this=0x0, slotID=slotID@entry=1769803198) at ./src/lib/slot_mgr/SlotManager.cpp:174                                                                                  
#1  0x00007ffff7783fd9 in SoftHSM::C_CloseAllSessions (this=0x5555556587d0, slotID=slotID@entry=1769803198) at ./src/lib/SoftHSM.cpp:1386                                                      
#2  0x00007ffff7764414 in C_CloseAllSessions (slotID=1769803198) at ./src/lib/main.cpp:347                                                                                                     
#3  0x00007ffff7f099ba in pkcs11_slot_unref (slot=slot@entry=0x5555556ab730) at ./src/p11_slot.c:433                                                                                           
#4  0x00007ffff7f09a50 in pkcs11_release_slot (slot=0x5555556698d0) at ./src/p11_slot.c:477                                                                                                    
#5  pkcs11_release_all_slots (slots=0x5555556698d0, nslots=<optimized out>) at ./src/p11_slot.c:464                                                                                            
#6  0x00007ffff7f0a358 in PKCS11_release_all_slots (pctx=<optimized out>, slots=<optimized out>, nslots=<optimized out>) at ./src/p11_front.c:111                                              
#7  0x00007ffff7f02f4e in ctx_finish (ctx=0x5555556450e0) at ./src/eng_back.c:352                                                                                                              
#8  0x00007ffff7f00e38 in engine_finish (engine=<optimized out>) at ./src/eng_front.c:163                                                                                                      
#9  0x00007ffff7be3953 in engine_unlocked_finish (e=0x555555657b20, unlock_for_handlers=unlock_for_handlers@entry=0) at ../crypto/engine/eng_init.c:64                                         
#10 0x00007ffff7be5f86 in int_cleanup_cb_doall (p=0x5555556586b0) at ../crypto/engine/eng_table.c:183                                                                                          
#11 int_cleanup_cb_doall (p=0x5555556586b0) at ../crypto/engine/eng_table.c:177                                                                                                                
#12 0x00007ffff7c2e10c in doall_util_fn (arg=0x0, func_arg=0x0, func=0x7ffff7be5f60 <int_cleanup_cb_doall>, use_arg=0, lh=0x555555657230) at ../crypto/lhash/lhash.c:197                       
#13 OPENSSL_LH_doall (lh=0x555555657230, func=func@entry=0x7ffff7be5f60 <int_cleanup_cb_doall>) at ../crypto/lhash/lhash.c:205                                                                 
#14 0x00007ffff7be6365 in lh_ENGINE_PILE_doall (doall=0x7ffff7be5f60 <int_cleanup_cb_doall>, lh=<optimized out>) at ../crypto/engine/eng_local.h:159                                           
#15 engine_table_cleanup (table=0x7ffff7e836f8 <rsa_table>) at ../crypto/engine/eng_table.c:192                                                                                                
#16 0x00007ffff7be3bda in engine_cleanup_cb_free (item=0x555555658a00) at ../crypto/engine/eng_lib.c:169                                                                                       
#17 0x00007ffff7cb0170 in OPENSSL_sk_pop_free (st=0x555555658ba0, func=0x7ffff7be3bd0 <engine_cleanup_cb_free>) at ../crypto/stack/stack.c:426                                                 
#18 0x00007ffff7be3fcd in sk_ENGINE_CLEANUP_ITEM_pop_free (freefunc=0x7ffff7be3bd0 <engine_cleanup_cb_free>, sk=<optimized out>) at ../crypto/engine/eng_local.h:48                            
#19 engine_cleanup_int () at ../crypto/engine/eng_lib.c:176                                                                                                                                    
#20 0x00007ffff7c329ae in OPENSSL_cleanup () at ../crypto/init.c:418                                                                                                                           
#21 OPENSSL_cleanup () at ../crypto/init.c:344                                                                                                                                                 
#22 0x00007ffff785c955 in __run_exit_handlers (status=0, listp=0x7ffff79f1840 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true)                      
    at ./stdlib/exit.c:111                                                                                                                                                                     
#23 0x00007ffff785ca8a in __GI_exit (status=<optimized out>) at ./stdlib/exit.c:141                                                                                                            
#24 0x00005555555962e9 in ?? ()                                                                                                                                                                
#25 0x00007ffff78456ca in __libc_start_call_main (main=main@entry=0x5555555961b0, argc=argc@entry=15, argv=argv@entry=0x7fffffffeb78) at ../sysdeps/nptl/libc_start_call_main.h:58             
#26 0x00007ffff7845785 in __libc_start_main_impl (main=0x5555555961b0, argc=15, argv=0x7fffffffeb78, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>,                    
    stack_end=0x7fffffffeb68) at ../csu/libc-start.c:360                                                                                                                                       
#27 0x0000555555596491 in ?? ()             

This can bisected to commit 02b87cc

Sebastian

@sebastianas sebastianas added the issue: bug report The issue was opened to report a bug label Oct 25, 2023
@beldmit
Copy link
Member

beldmit commented Oct 25, 2023

I'll try to look at it.

@beldmit
Copy link
Member

beldmit commented Oct 25, 2023

@sebastianas could you please run the command in question under valgrind with and without this commit?

@sebastianas
Copy link
Contributor Author

valgrind --leak-check=full --show-leak-kinds=all $cmd

with-commit.txt
without-commit.txt

@nhorman
Copy link
Contributor

nhorman commented Oct 25, 2023

thats....interesting

In the valgrind run without the commit, theres a series of leaks, which is quite odd, given that the commit in question increases the refcount of the private key in the pkcs11 token by 1 (above what it would have been without the commit), so I would have, if anything expected to see leaks with the commit rather than without

More interesting is the valgrind run with the commit

In this case there are a number of (effectively) use after free cases in the hsm module, triggered by the C++ runtime running an atexit handler which deletes the hsm singleton instance that holds all the data access the openssl module accesses through the engine api. Whats more odd is how the commit in question would have prevented this behavior, given that atexit handlers should always run in reverse order (i.e. the loading of the HSM module should always occur after OPENSSL_init where we register our atexit handler), so I'm not sure how this doesn't happen regardless of the inclusion status of the commit in question

@beldmit
Copy link
Member

beldmit commented Oct 25, 2023

@nhorman why does it increase the refcount? I call the _free function in the commit in question so the refcount shouldn't change

@simo5
Copy link
Contributor

simo5 commented Oct 25, 2023

There are 2 issues here.

Issue 1, softhsm should really , really , really, use their own libctx,a nd stop stomping on the default context that should be reserved for the main application.

Issue 2 atexit() handler, which is often called before the engine tries to close all sessions, one could argue that openssl is wrong in installing an atexit() handler w/o being directed by the application which is the only entity that has a chance of knowin in which order atexit() handlers should be installed (as it influences the order in which they are run.

But really softhsm should handle this with fixing issue 1, then it can use its own context and not get messed up when it is requested to close sessions after openssl already destroyed the default library context.

Note that softhsm is not the only pkcs11 module that has these issues, so much so I had to add this PR to the pkcs11-provider which experienced the same problems as the engine early on:
latchset/pkcs11-provider#225

@nhorman
Copy link
Contributor

nhorman commented Oct 25, 2023

@beldmit I might be mistaken, but for every keytype in that patch, you call {keytype}_get1, followed by EVP_set1. Each of those calls ups the refcnt of the keytype. You do call keytype_free, but that only releases one of the two counts.

@nhorman
Copy link
Contributor

nhorman commented Oct 25, 2023

@simo5 that all makes sense, but I would be a bit concerned about the fact that the atexit handler in question is/may be part of the c++ runtime (see the free backtrace in the valgrind log). If that's the case, the c++ runtime is freeing global objects that the main application still has references too, and will use in the openssl atexit handler. Honestly I'm not sure how to fix that without abandoning the use of atexit entirely.

@rsbeckerca
Copy link
Contributor

@simo5:

Issue 2 atexit() handler, which is often called before the engine tries to close all sessions, one could argue that openssl is wrong in installing an atexit() handler w/o being directed by the application which is the only entity that has a chance of knowin in which order atexit() handlers should be installed (as it influences the order in which they are run.

FYI: Please see (closed) issues #17059 and #17537 for more detailed analysis of issues around atexit().

@simo5
Copy link
Contributor

simo5 commented Oct 25, 2023

@nhorman If softhsm is setting up an atexit() handler, then they absolutely need to fix that too (but I do not believe this is the case), they are invoked through at least 2/3 layers of indirection, it is not possible to safely use an atexit() handler in that case, and it is not needed in the first place, pkcs11 modules have a C_Finalize() function that dictates when they must and are safe to be unloaded.
Doing it earlier is definitely wrong.

@simo5
Copy link
Contributor

simo5 commented Oct 25, 2023

@rsbeckerca yeah I am painfully aware of those discussion, I thin openssl is wrong at setting up atexit(), applications need to unload it.
Of course there is the issue that openssl is also invoked via very indirect library calls (libraries calling liabraries calling libraries calling into openssl), for those cases I think it is better to leak memory/handlers that to try to atexit() anyway.
After all we have things like FD_CLOEXEC to avoid resource leaks on exec() and on a real exit all memory resources and file descriptors will be released anyway when the process terminates.

@rsbeckerca
Copy link
Contributor

@rsbeckerca yeah I am painfully aware of those discussion, I thin openssl is wrong at setting up atexit(), applications need to unload it. Of course there is the issue that openssl is also invoked via very indirect library calls (libraries calling liabraries calling libraries calling into openssl), for those cases I think it is better to leak memory/handlers that to try to atexit() anyway. After all we have things like FD_CLOEXEC to avoid resource leaks on exec() and on a real exit all memory resources and file descriptors will be released anyway when the process terminates.

I tend to agree. Fortunately for my platform, memory leaks during atexit() processing are pretty much irrelevant as once a process is gone, the global heap cleans itself up without fear; however, I have sympathy as one of our modules is moving to Windows (shhhh - its a secret that everyone now knows), and a new threading model that will be impacted, and that is going to make my life much more difficult.

@kroeckx
Copy link
Member

kroeckx commented Oct 25, 2023 via email

@paulidale
Copy link
Contributor

Much harder than it sounds which is why it's not moving forward well.

@nhorman
Copy link
Contributor

nhorman commented Oct 25, 2023

From what (little) I can tell, I don't think that atexit handler is getting setup from any other component directly. Based on the stack trace, it looks like the c++ runtime may be using atexit to clean up global objects. Not saying it's right, just saying that I can't see any other way for the delete method of an object to get called directly from an atexit handler

@rsbeckerca
Copy link
Contributor

Much harder than it sounds which is why it's not moving forward well.

The suggestion I have for this (used in other situations) is to maintain an opaque structure that the caller has to pass around for cleanup purposes instead of using atexit(). The caller can then invoke a new variant of OPENSSL_Cleanup(context) during their own atexit() processing. OpenSSL methods can add to this context for things needing to be invoked at exit. Obviously this requires an API change, so out of context until post 3.2. Just my musings, but it gets around handing multiple heaps that messes with atexit().

@beldmit
Copy link
Member

beldmit commented Oct 25, 2023

Anyway let me debug tomorrow. If my patch is wrong and can't easily be fixed, then it will be better to roll it back. Otherwise if it is correct or can easily be fixed, the other problems should be dealt with.

@simo5
Copy link
Contributor

simo5 commented Oct 25, 2023

People seem to run into this atexit handler regularly. The reason we do this by default is so that something like valgrind doesn't report issues. We might have to rethink how we should do this. For instance we might need to keep a reference counter of how many times we've been initialized. But that's probably harder than it sounds.

I do not think you can do that for the default handler, as initialization is implicit and simply does not happen multiple times just because different libraries from within the same process call into openssl, the first one will cause initialization and all the others wont.

You could keep track of all objects and defer de-initialization until all of them have been freed. But that would be functionally equivalent to not installing an atexit() handler if applications (or libraries calling into openssl) are misbehaved and do not clean up all the contexts they have allocated at exit.

@paulidale
Copy link
Contributor

From what (little) I can tell, I don't think that atexit handler is getting setup from any other component directly. Based on the stack trace, it looks like the c++ runtime may be using atexit to clean up global objects. Not saying it's right, just saying that I can't see any other way for the delete method of an object to get called directly from an atexit handler

It isn't atexit, there is usually special code in the main wrapper that calls the dtor list.

@simo5
Copy link
Contributor

simo5 commented Oct 25, 2023

@paulidale @kroeckx

However, there is one thing that can be done, and that is to ensure that on atexit() (edit: and on OPENSSL_cleanup() ) the correct ordering is used to free internal openssl data.

Providers must be the first* thing ever that is released, and nothing else should be released until providers are freed in the inverse order they have been loaded (in case any provider depends on other provider to perform their shutdown).

Of course the devil is in the details as various context may hold pointers to *data hold by providers ... so those context need to be freed before the providers are ... otherwise when the providers vtables ->free calls ae called things will again end up in tears ...

This would at least cover up the main issues with softhsm imploding when the pks11 provider calls C_CLoseSession or C_Finalize methods.

softhsm fails in most of these cases because the default openssl context is already half deallocated before the providers (or engines) call the pkcs11 drivers finalizers.

So there is at least an ordering bug in the openssl cleanup logic that may not be easy to fix without a lot of introspection on what all the objects themselves reference.

@rsbeckerca
Copy link
Contributor

@paulidale @kroeckx

However, there is one thing that can be done, and that is to ensure that on atexit() (edit: and on OPENSSL_cleanup() ) the correct ordering is used to free internal openssl data.

Providers must be the first* thing ever that is released, and nothing else should be released until providers are freed in the inverse order they have been loaded (in case any provider depends on other provider to perform their shutdown).

Of course the devil is in the details as various context may hold pointers to *data hold by providers ... so those context need to be freed before the providers are ... otherwise when the providers vtables ->free calls ae called things will again end up in tears ...

This would at least cover up the main issues with softhsm imploding when the pks11 provider calls C_CLoseSession or C_Finalize methods.

softhsm fails in most of these cases because the default openssl context is already half deallocated before the providers (or engines) call the pkcs11 drivers finalizers.

So there is at least an ordering bug in the openssl cleanup logic that may not be easy to fix without a lot of introspection on what all the objects themselves reference.

I think you will find that different implementations make atexit() order guarantees problematic. That was part of the issue we found in prior cases. There is no portable ordering guarantee.

@nhorman
Copy link
Contributor

nhorman commented Oct 25, 2023

@paulidale that doesn't quite mesh with what the valgrind log shows:

Engine "pkcs11" set.
==162888== Invalid read of size 4
==162888==    at 0x53CE278: detectFork (SoftHSM.cpp:12445)
==162888==    by 0x53CE278: SoftHSM::i() (SoftHSM.cpp:349)
==162888==    by 0x53A2408: C_CloseAllSessions (main.cpp:347)
==162888==    by 0x53779B9: pkcs11_slot_unref (p11_slot.c:433)
==162888==    by 0x5377A4F: pkcs11_release_slot (p11_slot.c:477)
==162888==    by 0x5377A4F: pkcs11_release_all_slots (p11_slot.c:464)
==162888==    by 0x5370F4D: ctx_finish (eng_back.c:352)
==162888==    by 0x536EE37: engine_finish (eng_front.c:163)
==162888==    by 0x4AE1952: engine_unlocked_finish (eng_init.c:64)
==162888==    by 0x4AE3F85: int_cleanup_cb_doall (eng_table.c:183)
==162888==    by 0x4AE3F85: int_cleanup_cb_doall (eng_table.c:177)
==162888==    by 0x4B2C10B: doall_util_fn (lhash.c:197)
==162888==    by 0x4B2C10B: OPENSSL_LH_doall (lhash.c:205)
==162888==    by 0x4AE4364: lh_ENGINE_PILE_doall (eng_local.h:159)
==162888==    by 0x4AE4364: engine_table_cleanup (eng_table.c:192)
==162888==    by 0x4AE1BD9: engine_cleanup_cb_free (eng_lib.c:169)
==162888==    by 0x4BAE16F: OPENSSL_sk_pop_free (stack.c:426)
==162888==  Address 0x4fedfc8 is 136 bytes inside a block of size 144 free'd
==162888==    at 0x484399B: operator delete(void*, unsigned long) (vg_replace_malloc.c:935)
==162888==    by 0x4DC3954: __run_exit_handlers (exit.c:111) <===== HERE
==162888==    by 0x4DC3A89: exit (exit.c:141)
==162888==    by 0x14A2A5: main (in /home/bigeasy/openssl/tb/openssl/build/apps/openssl)
==162888==  Block was alloc'd at
==162888==    at 0x4840F2F: operator new(unsigned long) (vg_replace_malloc.c:422)

Its certainly possible that the C++ runtime is doing something special in the main wrapper, but one way or another, its using the exit handler path to make it happen.

@nhorman
Copy link
Contributor

nhorman commented Oct 25, 2023

@rsbeckerca atexit should guarantee that registered function order is exactly the reverse of the registration order (ie.. calling atexit(a), atexit(b), atexit(c) implies that when exit is called, the registered functions should be called explicitly in c, b, a order.

That said, I'm struggling to see how this can ever work. I say that because OPENSSL_init_crypto should always be called prior to any dynamic engine being loaded (at least, I think it should). Given the OPENSSL_init_crypto is where the OPENSSL_cleanup handler is atexit registered, we should be guararnted that it will run after any subsequent atexit calls are made.

Assuming that softhsm registers an atexit handler (be it via explicit call, which I can't find), or via some wrapper magic in the c++ runtime, we should be guaranteed that the softhsm atexit handler will run prior to the openssl atexit handler. The implication here being that we should be guaranteed that the softhsm singleton object gets freed prior to the pcks11 library shutting down, causing the error the use after free error we are seeing.

I'm sure I'm missing something, but right now this seems very broken.

@rsbeckerca
Copy link
Contributor

@rsbeckerca atexit should guarantee that registered function order is exactly the reverse of the registration order (ie.. calling atexit(a), atexit(b), atexit(c) implies that when exit is called, the registered functions should be called explicitly in c, b, a order.

That said, I'm struggling to see how this can ever work. I say that because OPENSSL_init_crypto should always be called prior to any dynamic engine being loaded (at least, I think it should). Given the OPENSSL_init_crypto is where the OPENSSL_cleanup handler is atexit registered, we should be guararnted that it will run after any subsequent atexit calls are made.

Assuming that softhsm registers an atexit handler (be it via explicit call, which I can't find), or via some wrapper magic in the c++ runtime, we should be guaranteed that the softhsm atexit handler will run prior to the openssl atexit handler. The implication here being that we should be guaranteed that the softhsm singleton object gets freed prior to the pcks11 library shutting down, causing the error the use after free error we are seeing.

I'm sure I'm missing something, but right now this seems very broken.

@nhorman "Should" is the operative word. In some exotic architectures (you know of whom I speak), the memory space of the DLL and main program is different, meaning that the lists maintained by atexit() are in separate spaces for the DLL and the main program. This results in a fundamental inability to reverse-sequence what is added vs. processed. In one case (mine), when a DLL unloads due to dlclose(), atexit() inside the DLL's memory space is processed in the correct order from the point of view of the DLL. This is regardless of what the main program does. When the main program exits, its own atexit() list is processed, which is also correct, from the point of view of the main program, but that list does not know about the DLL. The order is not guaranteed across the memory boundaries of the two (DLL vs. main). If the run-time was more robust (and the POSIX spec required an unambiguous compliance, which the OS dev team proved does not), this might not be an issue. However, there is actually no guarantee the DLL in an MPP architecture, is in the same chip-set as the main program, so coordination is difficult at best, impractical at worse.

@nhorman
Copy link
Contributor

nhorman commented Oct 25, 2023

@rsbeckerca that's the weird thing. POSIX does mandate an unambiguous execution order:
https://pubs.opengroup.org/onlinepubs/007904875/functions/atexit.html

But in an ironic twist, it seems to me that, if and only if that required order is honored, will this breakage occur. I.e. if on a more esoteric platform that atexit order cannot be guaranteed, the execution of an applications atexit handler occurs prior to the softhsm atexit handler, this problem would not occur, as openssl would release any reference to the softhsm prior to its implied delete

Clearly I'm missing something here, need to look at this more closely

@simo5
Copy link
Contributor

simo5 commented Oct 26, 2023

I think you will find that different implementations make atexit() order guarantees problematic. That was part of the issue we found in prior cases. There is no portable ordering guarantee.

I am talking about the order in which openssl frees its internal structures, I am pretty sure it is not correct, not about atexit ordering

@simo5
Copy link
Contributor

simo5 commented Oct 26, 2023

the C_CloseAllSessions, or C_CLoseSession, or C_Finalize are all called byt the engine shutting down, the engine is shut down as part of openssl's atexit(), the problem is that when it calls the engine's shutdown it is already too late, some of the opnessl context has already been freed, and when softhsm calls into openssl as part of closing sessions (likely to free some objects it allocated, like EVP_PKEY's) it ends up double freeing or encountering null pointers or in general bad status in openssl context, which ends in tears (and segfaults).

@nhorman
Copy link
Contributor

nhorman commented Oct 26, 2023

I don't think that's what's going on. I agree that there is an ordering problem here in terms of what is getting freed when,.but if you look at the invalid reads in the valgrind log, and the segfault backtrace, both reference data that are class private to the softhsm, and don't appear to be directly referenced by openssl. I.e. an atexit handler called the delete method form the hsm class, then later,.the openssl atexit handler attempted.to tear down the engine that used the hsm, and.wound up trying to read/write data that had already been freed in the hsm class.object instance

@paulidale
Copy link
Contributor

I'm sure I'm missing something, but right now this seems very broken.

I think what you're missing is what happens to global destructors. The atexit processing is part of the usual exit library call which happens either by calling exit or by returning from main. After this the _exit system call is made and the kernel starts cleaning things up. As part of this process the dynamic loader runs the .fini code for the main program and each loaded shared library (all the elf format things) which iterates through the associated .fini_array section and calls each function in turn. This is where C++ global classes are cleaned up. So these are cleansed on library unload well after atexit processing has finished.

@mtrojnar
Copy link
Contributor

mtrojnar commented Nov 1, 2023

@sebastianas

How do we proceed here? Do we alter the openssl binary as suggeted or mandate something else?

My initial analysis was wrong. I submitted an update in #22508 (comment). I'm sorry for the confusion.

I think that fixing OpenSSL is the proper solution. Executing OPENSSL_cleanup() in an atexit() callback turned out to be a bad idea, not only because its indirect dependencies may be terminated before the OpenSSL's callback (as in this case), and also because the application may register its own atexit() callback that will be executed after the OpenSSL's callback, and the application's callback may still need OpenSSL to gracefully close its encrypted sessions.

@simo5
Copy link
Contributor

simo5 commented Nov 1, 2023

Executing OPENSSL_cleanup() in an atexit() callback

This should definitely never happen, OPENSSL_cleanup() should always happen before the process exit()s. OPENSSL_cleanup() is not something you want to ever call in a library anyway.

@nhorman
Copy link
Contributor

nhorman commented Nov 1, 2023

SoftHSM can actually be fixed in a few different ways, they can:

  1. Stop using global class objects, opting instead to create them dynamically when needed, and deleting them on close. This avoids the registration of the atexit handlers that the c++ compiler emits, but does require significant refactoring
  2. Maintain a global variable tracking the exit status, so that their pkcs11 interface avoids dereferencing the global objects when the global flag indicates they are exiting, opting instead to return an appropriate error code. This is more of a hack, but is fairly easy to implement.

I've got a proposed patch for the latter option here, but I'm not sure what direction they are going to take

It also might be worthwhile to augment the openssl application such that its main routine calls OPENSSL_cleanup immediately prior to calling exit. Thats the fix that @sebastianas tested and confirmed to work in this environment. That prevents openssl repo built apps from encountering this failure (or any other simmilar failure from a library that manually registers an atexit handler to do cleanup), but leaves other applications to fix this independently (likely in the same way)

Fixing openssl to more broadly not use an exit handler, as discussed previously in this issue is likely the better long term fix, but is a heavy lift in terms of potential API changes, and not something that can be done immediately.

I think the best path forward is:

  1. Make the openssl app change to call OPENSSL_cleanup prior to exiting in main right now, just to take care of our own shipped apps
  2. Start tracking the work to investigate ways we can ensure that each application has its own context that it manages the lifetime of to avoid this issue in the future

I believe @hlandau opened a tracker for (2) already, though I can't find it at the moment (@hlandau can you link that here please)?

If there are no objections, I'll open a PR for (1) shortly

@mtrojnar
Copy link
Contributor

mtrojnar commented Nov 1, 2023

@simo5

This should definitely never happen, OPENSSL_cleanup() should always happen before the process exit()s. OPENSSL_cleanup() is not something you want to ever call in a library anyway.

You're missing the point. I was talking about the order of executing atexit() callbacks. See the atexit(3) manual for details.

@nhorman
Copy link
Contributor

nhorman commented Nov 1, 2023

#22544 Opened for review to fix this in our shipped applications

@t8m
Copy link
Member

t8m commented Dec 5, 2023

It would be interesting to know if #22951 changes anything with this issue or it does not have any impact.

@sebastianas
Copy link
Contributor Author

sebastianas commented Dec 7, 2023 via email

@sebastianas
Copy link
Contributor Author

sebastianas commented Dec 14, 2023

How do we move forward here? Should pkcs11/ SoftHSM be touched in order solve this?

@nhorman
Copy link
Contributor

nhorman commented Dec 15, 2023

I think we need to think about this as a few separate questions:

  1. What do we do about this in future openssl releases
    • We've discussed in this ticket the possibility of mandating a call to OPENSSL_cleanup prior to exit. This seems sensible to me, but is going to take time, as applications won't migrate quickly
    • We've discussed moving additional global data into libctx and mandating the use of an allocated libctx for all applications wishing to use c++ modules like this, but I don't think thats going to help (though others may disagree)
  2. What do we do to fix existing releases currently in the wild, or do in the place of future fixes that may not be generally palatable:
    • We could mandate that C++ modules used with openssl not use global singletons (thereby avoiding the c++ runtime atexit registration, and hence the problem writ large. This would work, but is effectively a "not our problem" approach, and may not get lots of traction (see here)
    • We could use setsigjump/longjmp to catch errors and do cleanup. This has been generally met with distaste, but I think we have to recognize that, while its not pallatable, it could be used here in a way that is compatible with released versions.
    • We could modify openssl provided apps to call OPENSSL_cleanup prior to exit, thereby avoiding the problem for applications we ship, in conjunction with documentation indicating this is a strongly recommended best practice. This fixes problems for software openssl ships, but leaves the issue open for downstream consumers, and we would have to rely on communication/referencing of our docs to get said consumers to fix the issue on their own.

@simo5
Copy link
Contributor

simo5 commented Dec 15, 2023

  • We could mandate that C++ modules used with openssl not use global singletons (thereby avoiding the c++ runtime atexit registration, and hence the problem writ large. This would work, but is effectively a "not our problem" approach, and may not get lots of traction (see here)

I think this is the only reasonable approach, the fact SoftHSM uses atexit handlers in a loadable module is broken for any application using it directly, not just openssl.

The PKCS#11 API contract is that it is the application that controls the module shutdown via the finalizer function, using an atexit handler violates the contract.

An application that installs an exit handler to free its resources on exit will face the same segfault issue when using the PKCS#11 API directly as well.

So this specific segfault is effectively a SoftHSM issue and not an OpenSSL issue.

@nhorman
Copy link
Contributor

nhorman commented Dec 15, 2023

I would agree with that.

That said, given the fact that (1) softHSM has completely ignored the issue, despite several people raising it and (2) the fact that any other C++ library would be susceptible to this issues, that would seem to me to constrain our forward looking actions to:

  1. Strongly recommending that users call OPENSSL_cleanup prior to exit (to avoid the issue)
  2. Stating that any users who don't do (1), and encounter a crash of this nature can only fix it by urging the offending c++ library to fix it (i.e. declare this a not-our-problem)

@petrovr
Copy link

petrovr commented Dec 17, 2023

Quite interesting...
I wonder why developers think that crash in a module should be fixed in this module. In this case SoftHSM pkcs#11 module.

FYI PKIX-SSH: Support for EC keys and certificates stored in PKCS#11 tokens is added in version 8.8 (29 Feb 2016).
PKIX-SSH supports keys and certificates provided by loadable modules(engines). Unfortunately PKCS#11 engine is not usable. A lot of reasons - main was crashes if keys are from engine. Engine pretend to be default key method. Delegation of functionality(operation) if key is not engine key. As result engine cannot be used with other engines.
Solution - direct use of EC keys from token and forget this engine.

To test implementation I use SoftHSM. This custom build as it must be build with different OpenSSL versions:
1.1.1 -
изображение, and 3.2 -
изображение

@petrovr
Copy link

petrovr commented Dec 17, 2023

The title is "3.0.12 reveals pkcs11 engine..." with comment "...the testsuite for libp11-0.4.12 breaks in 3.0.12, works in 3.0.11....".

Question is why code like next:

        RSA *rsa = EVP_PKEY_get1_RSA(pkey);
        EVP_PKEY_set1_RSA(pkey, rsa);
        RSA_free(rsa)

leads to crash?
For reference see #17286 and #17092, then #19965 .

So far so good. Set the method, preset the key and engine code should work. Why then crash?
Quick look into PKCS#11 code should that engined defines finish method for RSA key. Method delete "application data" and then calls finish of RSA default method. Which one is "RSA default method"?
Depending from configuration, setup and etc. "RSA default method" != "OpenSSL RSA method". O-o-o-p-p-s!

In NSS engine and PKIX-SSH code I do not use use finish method. Instead this "application data" is created with own free method, sample :

	/* ensure RSA context index */
	if (ssh_pkcs11_rsa_ctx_index < 0)
		ssh_pkcs11_rsa_ctx_index = RSA_get_ex_new_index(0,
		    NULL, NULL, NULL, CRYPTO_EX_pkcs11_rsa_free);

My conclusion is that #19965 reveals issue with PKCS#11 engine code.

Note that engine argument to openssl utility sets engine methods as default.
Hint: see message is "Engine "pkcs11" set." in report above.

stanislavlevin added a commit to stanislavlevin/bind9 that referenced this issue Dec 29, 2023
openssl's engine pkcs11 + softhsm doesn't work with recent openssl:
openssl/openssl#22508
@petrovr
Copy link

petrovr commented Dec 29, 2023

Interesting :

  • opensc does not use p11 library
  • engine_pkcs11 is dead / merged in p11 library 6 years ago
  • p11 library is mostly pkcs#11 engine.

Off topic ibm cryptoki pkcs#11 engine is also death project. I note this about 4 years ago.


Last few weeks I spend time to test again engine_pkcs11 now from p11 library project. The old status was:

test_ssh_eng_pkcs11	# NOTE
				# 2016-02-13 :
				# - pass if pkcs11 engine is not registered as default
				# - fail with EC as public key can not be loaded(parsed) yet
				#   (opensc limitation)

Note that above status was reason to publish on 29 Feb 2016 PKIX-SSH version 8.8 with "pkcs11 module support EC keys". There was no working engine.

It seems to me now issues with engine are resolved:
image

Now engine works for me without to crash in all cases:

  • set from openssl configuration
  • initialised from openssl configuration
  • using pkix-ssh configuration
  • RSA or EC keys

for protocol

cat > engine-pkcs11 <<EOF
Engine pkcs11
VERBOSE
MODULE_PATH = $pkcs11
EOF
cat > openssl-pkcs11.cnf <<EOF
openssl_conf = config

[ config ]
engines = engine_section

[ engine_section ]
engine1 = engine_pkcs11

[engine_pkcs11]
engine_id = pkcs11
#init = 1

#dynamic_path = $engine_pkcs11

#use-ssh-conf#VERBOSE = EMPTY
#use-ssh-conf#MODULE_PATH = $pkcs11
EOF

@petrovr
Copy link

petrovr commented Dec 29, 2023

About p11 crash in regression test.

  • pkix-ssh with direct use of softhsm2 module - does not crash
  • pkix-ssh with pkcs#11 engine that uses softhsm2 module - does not crash

What is difference: pkix-ssh client performs explicit startup and clean-up of openssl library. It seems to me this is only difference to openssl utilities.


Openssl utility forces engine as default algorithm provider. But this is similar if engine is loaded by path from openssl configuration. Tested with pkix-ssh no crash.
So I cannot confirm that reported crash is because p11 code redirect to itself some operations.


I guess that openssl team will revert modification that exposes failure in non-well maintained pksc#11 engine projects.

@petrovr
Copy link

petrovr commented Jan 7, 2024

Hi,

Now PKIX-SSH has following extra tests in master repository.

  1. add X.509 tests with pkcs#11 token: light version of my internal tests
  2. add X.509 tests with pkcs#11 engine: match my internal tests with case when engine is loaded by path

Those tests pass with and without OpenSSL clean-up.
Remark: my tests are with bynaries from "current" projects repositories. OpenSSL includes 3.0, 3.1, 3.2 but not master.

Above means that cause of issue is OpenSSL utiliity ?


Off topic:

Second commit include work-around specific to pkcs#11 engine from libp11 project. For respective commit for technical details.

Description: PKIX-SSH checks for some consitency between optional private key material and public part.

Unfortunately libp11 creates EC keys with bogus private part. Looks like work-around OpenSSL issue. But is this case is better to left OpenSSL utility to crash instead to provide fake key.

PKIX-SSH core try to clean fake part as is expected token do not export private part. And now yet another OpenSSL regression culd be encountered.

So to use EC keys from this pkcs11 engine OpenSSL 1.1.1 must be relative new. Sample failure:

> openssl version
OpenSSL 1.1.1l-fips  24 Aug 2021 SUSE release 150500.17.22.1
[1] SoftHSM slot ID 0x1        uninitialized, login  (no label)
Found slot:  SoftHSM slot ID 0x6b823d3c
Found token: test0
Found 4 private keys:
   1 P  id=0004 label=p11-ec521
   2 P  id=0002 label=p11-ec256
   3 P  id=0001 label=p11-rsa
   4 P  id=0003 label=p11-ec384
Cannot clean empty EC private key!
Load key "engine:pkcs11:0002" - error in libcrypto

If vendor decide to update to 1.1.1.r ....

@t8m
Copy link
Member

t8m commented Jan 31, 2024

The #23063 PR was merged, this should resolve this problem as a side effect. However the underlying issue is still there.

vt-alt pushed a commit to altlinux/specs that referenced this issue Apr 1, 2024
- Reenabled testing (closes: #48229).

openssl3-3.1.5-alt2

- Backported upstream fix for openssl/openssl#22508.
stanislavlevin added a commit to stanislavlevin/bind9 that referenced this issue Apr 2, 2024
Fix for openssl/openssl#22508
landed in Sisyphus.

TODO: convert this to the patch
stanislavlevin added a commit to stanislavlevin/bind9 that referenced this issue Apr 2, 2024
Fix for openssl/openssl#22508
landed in Sisyphus.

TODO: convert this to the patch
stanislavlevin added a commit to stanislavlevin/bind9 that referenced this issue Apr 2, 2024
Fix for openssl/openssl#22508
landed in Sisyphus.

TODO: convert this to the patch
stanislavlevin added a commit to stanislavlevin/bind9 that referenced this issue Apr 2, 2024
Fix for openssl/openssl#22508
landed in Sisyphus.

TODO: convert this to the patch
stanislavlevin added a commit to stanislavlevin/bind9 that referenced this issue Apr 2, 2024
stanislavlevin added a commit to stanislavlevin/bind9 that referenced this issue Apr 3, 2024
stanislavlevin added a commit to stanislavlevin/bind9 that referenced this issue Apr 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch: master Merge to master branch triaged: bug The issue/pr is/fixes a bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.