Bug (crash) when stopping provider #21023

Open
mouse07410 opened this issue May 22, 2023 · 20 comments
Labels
branch: master, branch: 3.0, branch: 3.1, branch: 3.2, triaged: bug

Comments

@mouse07410
Contributor

Detailed story is in this issue, though I'd recommend starting here.

TL;DR
OpenSSL crashes after running liboqs tests, hitting NULL ptr when cleaning up/deallocating things. Great analysis is here. It occurs when several "additional" providers are defined in openssl.cnf - by "additional" I mean pkcs11-provider and oqs-provider, though legacy provider seems to also contribute to this problem when it's defined (I stopped enabling it, precisely because of that).

Quoting from the referring issue:

With both pkcs11-provider and oqs-provider enabled - liboqs tests will pass (reporting == 464 passed, 220 skipped in 40.80s ==), but crash in the end:

Segmentation fault
FAILED: tests/CMakeFiles/run_tests /Users/ur20980/src/liboqs/build/tests/CMakeFiles/run_tests 
cd /Users/ur20980/src/liboqs && /opt/local/bin/cmake -E env OQS_BUILD_DIR=/Users/ur20980/src/liboqs/build python3 -m pytest --verbose --numprocesses=auto --ignore=scripts/copy_from_upstream/repos
ninja: build stopped: subcommand failed.

and crash report:

Termination Reason:    Namespace SIGNAL, Code 11 Segmentation fault: 11
Terminating Process:   exc handler [18302]

VM Region Info: 0 is not in any region.  Bytes before following region: 4310319104
      REGION TYPE                    START - END         [ VSIZE] PRT/MAX SHRMOD  REGION DETAIL
      UNUSED SPACE AT START
--->  
      __TEXT                      100ea4000-100ea8000    [   16K] r-x/r-x SM=COW  .../MacOS/Python

Thread 0 Crashed::  Dispatch queue: com.apple.main-thread
0   libsystem_pthread.dylib       	       0x1a1c948c4 pthread_rwlock_wrlock + 0
1   libcrypto.3.dylib             	       0x10564add4 CRYPTO_THREAD_write_lock + 12 (threads_pthread.c:110)
2   libcrypto.3.dylib             	       0x1055f2c14 ERR_unload_strings + 92 (err.c:314)
3   libcrypto.3.dylib             	       0x105647720 ossl_provider_free + 92 (provider_core.c:688)
4   libcrypto.3.dylib             	       0x1056ad138 OPENSSL_sk_pop_free + 76 (stack.c:439)
5   libcrypto.3.dylib             	       0x105647104 sk_OSSL_PROVIDER_pop_free + 12 (provider_core.c:199) [inlined]
6   libcrypto.3.dylib             	       0x105647104 ossl_provider_store_free + 76 (provider_core.c:295)
7   libcrypto.3.dylib             	       0x105637d6c context_deinit_objs + 124 (context.c:250)
8   libcrypto.3.dylib             	       0x105637564 context_deinit + 16 (context.c:334) [inlined]
9   libcrypto.3.dylib             	       0x105637564 OSSL_LIB_CTX_free + 132 (context.c:465)
10  oqsprovider.0.5.0-dev.dylib   	       0x10436f560 oqsx_freeprovctx + 24 (oqsprov_keys.c:178)
11  oqsprovider.0.5.0-dev.dylib   	       0x10436e9a0 oqsprovider_teardown + 12 (oqsprov.c:553)
12  libcrypto.3.dylib             	       0x104547400 ossl_provider_free + 76
13  libcrypto.3.dylib             	       0x10459b5a0 OPENSSL_sk_pop_free + 60
14  libcrypto.3.dylib             	       0x104546ec0 ossl_provider_store_free + 72
15  libcrypto.3.dylib             	       0x10453b5c4 context_deinit_objs + 124
16  libcrypto.3.dylib             	       0x10453ae5c context_deinit + 32
17  libcrypto.3.dylib             	       0x10453ae2c ossl_lib_ctx_default_deinit + 20
18  libcrypto.3.dylib             	       0x10453dd80 OPENSSL_cleanup + 204
19  libsystem_c.dylib             	       0x1a1b55ed4 __cxa_finalize_ranges + 492
20  libsystem_c.dylib             	       0x1a1b55c4c exit + 44
21  libdyld.dylib                 	       0x1a1cb0554 dyld4::LibSystemHelpers::exit(int) const + 20
22  dyld                          	       0x1a193ff7c start + 2320
@mattcaswell
Member

Do you have the ability to run this in a debugger? Not sure how easy that is given the oqs-provider test framework. It would be really nice to be able to confirm that the write lock is indeed NULL in ERR_unload_strings as per my previous analysis. It would also be nice to know if err_cleanup gets called at any point prior to the crash. Failing that, adding some "printf" statements might do the job.

@mattcaswell added the "triaged: bug" label and removed the "issue: bug report" label on May 22, 2023
@mouse07410
Contributor Author

Do you have the ability to run this in a debugger?

Alas, not really. :-(
MacOS lldb and dtrace remain largely beyond my comprehension, and gdb doesn't really work there.

BTW, am I correct that (at least in several cases) OpenSSL chose not to follow a "safe programming" approach, and declined to validate pointers that are "supposed to" be non-NULL?

@bernd-edlinger
Member

4 libcrypto.3.dylib 0x1056ad138 OPENSSL_sk_pop_free + 76 (stack.c:439)
13 libcrypto.3.dylib 0x10459b5a0 OPENSSL_sk_pop_free + 60

Since the same function appears to be located at two different addresses,
this means your libcrypto.so is likely loaded twice.
This is a known scenario where a crash is very likely.
You may try OPENSSL_init_crypto(OPENSSL_INIT_NO_ATEXIT, NULL);
to work around this issue, which as far as I know is still unresolved.
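
The observation above can be checked mechanically. The following is a minimal sketch, not part of the original discussion: it assumes the macOS crash-report frame layout shown in this issue (frame number, image name, address, `symbol + offset`), and the helper name `find_duplicated_symbols` is hypothetical. It subtracts each frame's offset from its address to estimate where the function starts; if one symbol in one image name has more than one start address, two copies of that image are probably mapped. Frames marked `[inlined]` are skipped because their offsets are relative to the enclosing function.

```python
import re
from collections import defaultdict

# Frame layout assumed from the crash reports in this issue:
#   <n>  <image>  <address>  <symbol> + <offset> [(file:line)]
FRAME_RE = re.compile(
    r"^\s*\d+\s+(?P<image>\S+)\s+(?P<addr>0x[0-9a-fA-F]+)\s+"
    r"(?P<symbol>.+?)\s\+\s(?P<offset>\d+)"
)


def find_duplicated_symbols(report):
    """Return {(image, symbol): [start addresses]} for symbols seen at >1 start."""
    starts = defaultdict(set)
    for line in report.splitlines():
        if "[inlined]" in line:
            continue  # inlined frames share the caller's address; skip them
        match = FRAME_RE.match(line)
        if not match:
            continue
        start = int(match["addr"], 16) - int(match["offset"])
        starts[(match["image"], match["symbol"])].add(start)
    return {key: sorted(vals) for key, vals in starts.items() if len(vals) > 1}


# A few frames copied from the first crash report in this issue.
SAMPLE = """\
3   libcrypto.3.dylib   0x105647720 ossl_provider_free + 92 (provider_core.c:688)
4   libcrypto.3.dylib   0x1056ad138 OPENSSL_sk_pop_free + 76 (stack.c:439)
12  libcrypto.3.dylib   0x104547400 ossl_provider_free + 76
13  libcrypto.3.dylib   0x10459b5a0 OPENSSL_sk_pop_free + 60
20  libsystem_c.dylib   0x1a1b55c4c exit + 44
"""

for (image, symbol), addrs in find_duplicated_symbols(SAMPLE).items():
    print(image, symbol, [hex(a) for a in addrs])
```

A non-empty result is only a hint: it says the same symbol name lives at two places under the same image name, which is what one would expect if two different libcrypto.3.dylib files are loaded into the process.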

@mattcaswell
Member

this means your libcrypto.so is likely loaded twice.

Ah!! The question is how does it end up being loaded twice in this scenario?

@mattcaswell
Member

I guess the oqs test and the oqs provider are somehow linking against different versions of libcrypto....??? I'm not sure how that could occur. I guess I don't know how the macos dynamic linker works here.

@baentsch or @levitte - any thoughts on how that might occur?

@baentsch
Contributor

baentsch commented May 23, 2023

Since the same function appears to be located at two different addresses,
this means your libcrypto.so is likely loaded twice.

Very good observation, @bernd-edlinger !

I guess the oqs test and the oqs provider are somehow linking against different versions of libcrypto....??? I'm not sure how that could occur. I guess I don't know how the macos dynamic linker works here.

@baentsch or @levitte - any thoughts on how that might occur?

Yes: Here's the high-level dependency list:

  1. openssl uses libcrypto.
  2. oqsprovider uses liboqs, which --by default-- in turn depends on/links in libcrypto, too.

Now, one question to @mouse07410 : You once stated you are not building everything "together" as I (& our CI) do: Is it possible that you have linked liboqs against a different libcrypto than the one used during execution of openssl (and the loaded oqsprovider)?

Second, a proposal to @mouse07410 : Could you please build liboqs without the dependency on libcrypto (by setting -DOQS_USE_OPENSSL=OFF when building liboqs)? If the problem then were gone, we probably know what's happening...

Edit/Add: On second thought: Shouldn't all libcrypto symbols remain unresolved in all components, i.e., liboqs, oqsprovider and openssl until the moment that the code gets executed and the dependency gets resolved by the dynamic linker using one (whichever) libcrypto? In other words, how could libcrypto possibly get loaded twice?!? Neither oqsprovider nor liboqs explicitly load libcrypto.... So that's probably not the solution, but well, the test/proposal above may be informative anyway....

@mouse07410
Contributor Author

I guess the oqs test and the oqs provider are somehow linking against different versions of libcrypto.

See for yourself:

$ otool -L /opt/local/lib/liboqs.dylib
/opt/local/lib/liboqs.dylib:
	@rpath/liboqs.2.dylib (compatibility version 2.0.0, current version 0.8.0)
	/opt/local/libexec/openssl3/lib/libcrypto.3.dylib (compatibility version 3.0.0, current version 3.0.0)
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1319.100.3)
$ otool -L /opt/local/lib/ossl-modules/oqsprovider.dylib 
/opt/local/lib/ossl-modules/oqsprovider.dylib:
	@rpath/oqsprovider.1.dylib (compatibility version 1.0.0, current version 0.5.0)
	@rpath/liboqs.2.dylib (compatibility version 2.0.0, current version 0.8.0)
	/opt/local/libexec/openssl3/lib/libcrypto.3.dylib (compatibility version 3.0.0, current version 3.0.0)
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1319.100.3)
$ 
$ otool -L ~/openssl-3/lib/liboqs.dylib
/Users/ur20980/openssl-3/lib/liboqs.dylib:
	@rpath/liboqs.2.dylib (compatibility version 2.0.0, current version 0.8.0)
	/Users/ur20980/openssl-3/lib/libcrypto.3.dylib (compatibility version 3.0.0, current version 3.0.0)
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1319.100.3)
$ otool -L ~/openssl-3/lib/ossl-modules/oqsprovider.dylib 
/Users/ur20980/openssl-3/lib/ossl-modules/oqsprovider.dylib:
	@rpath/oqsprovider.1.dylib (compatibility version 1.0.0, current version 0.5.0)
	@rpath/liboqs.2.dylib (compatibility version 2.0.0, current version 0.8.0)
	/Users/ur20980/openssl-3/lib/libcrypto.3.dylib (compatibility version 3.0.0, current version 3.0.0)
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1319.100.3)
$ 

Two versions of OpenSSL present:

  • Macports-installed (binary) system-wide OpenSSL-3.1.0, residing in /opt/local/libexec/openssl3, symlinked to /opt/local as appropriate;
  • Built-from-source OpenSSL-3.2.0-dev, residing in $HOME/openssl-3 (sources are, unsurprisingly, in $HOME/src/openssl).

openssl.cnf for each of these two refers to providers and such at their corresponding (and different) locations.
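
To compare several binaries at once, the `otool -L` output above can be reduced to just the libcrypto dependency. This is a minimal sketch that works on captured output text only (it does not run `otool` itself); the function name `linked_libcrypto` is hypothetical, and the sample is copied from the output shown above.

```python
def linked_libcrypto(otool_output):
    """Return the libcrypto install paths listed in one `otool -L` output."""
    paths = []
    # The first line names the inspected file; dependencies follow, indented.
    for line in otool_output.splitlines()[1:]:
        entry = line.strip()
        if not entry:
            continue
        # Drop the trailing "(compatibility version ..., current version ...)".
        path = entry.split(" (", 1)[0]
        if "libcrypto" in path.rsplit("/", 1)[-1]:
            paths.append(path)
    return paths


OQSPROVIDER = """\
/opt/local/lib/ossl-modules/oqsprovider.dylib:
\t@rpath/oqsprovider.1.dylib (compatibility version 1.0.0, current version 0.5.0)
\t@rpath/liboqs.2.dylib (compatibility version 2.0.0, current version 0.8.0)
\t/opt/local/libexec/openssl3/lib/libcrypto.3.dylib (compatibility version 3.0.0, current version 3.0.0)
\t/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1319.100.3)
"""

print(linked_libcrypto(OQSPROVIDER))
```

Running the same function over the output for every component in the process (the provider, liboqs, and whatever the test runner itself links against) and comparing the resulting paths shows whether they all agree on one libcrypto.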

Is it possible that you have linked liboqs against a different libcrypto than the one used during execution of openssl (and the loaded oqsprovider)?

IMHO, no, not possible - see above for my reasons.

Could you please build liboqs without the dependency on libcrypto (by setting -DOQS_USE_OPENSSL=OFF when building liboqs)? If the problem then were gone, we probably know what's happening...

Absolutely! Will do and report here.

In other words, how could libcrypto possibly get loaded twice?!?

Maybe, if more than one provider (or something else?) does something like OPENSSL_init_crypto() or whatever the function is called?

@mattcaswell
Member

See for yourself:

This shows us liboqs and oqsprovider - but what about the test application itself. Presumably this is also linked against libcrypto.

Is one of the libcrypto versions on your system built with debug symbols and the other without? I note that at this point in the stack trace...

9   libcrypto.3.dylib             	       0x105637564 OSSL_LIB_CTX_free + 132 (context.c:465)
10  oqsprovider.0.5.0-dev.dylib   	       0x10436f560 oqsx_freeprovctx + 24 (oqsprov_keys.c:178)
11  oqsprovider.0.5.0-dev.dylib   	       0x10436e9a0 oqsprovider_teardown + 12 (oqsprov.c:553)
12  libcrypto.3.dylib             	       0x104547400 ossl_provider_free + 76

We suddenly go from not having source filenames/line numbers to having them.

@mouse07410
Contributor Author

libcrypto.dylib for OpenSSL-3.1.0 has no debug symbols, we're out of luck there. Binary install...

I build libcrypto.dylib for OpenSSL-3.2.0-dev from source, therefore it has the appropriate debug symbols and such.

Here's what's happening when I build liboqs for OpenSSL-3.2.0-dev and run tests. As you see, all tests complete, then Python crashes (through OpenSSL):

[gw2] [ 99%] PASSED tests/test_hash.py::test_hash_sha2_random[sha384] 
[gw0] [100%] PASSED tests/test_hash.py::test_hash_sha2_random[sha3_384] 

================================ 272 passed, 130 skipped in 84.24s (0:01:24) =================================
Segmentation fault
FAILED: tests/CMakeFiles/run_tests /Users/ur20980/src/liboqs/build/tests/CMakeFiles/run_tests 
cd /Users/ur20980/src/liboqs && /opt/local/bin/cmake -E env OQS_BUILD_DIR=/Users/ur20980/src/liboqs/build python3 -m pytest --verbose --numprocesses=auto --ignore=scripts/copy_from_upstream/repos
ninja: build stopped: subcommand failed.
Process:               Python [46986]
Path:                  /opt/local/Library/Frameworks/Python.framework/Versions/3.11/Resources/Python.app/Contents/MacOS/Python
Identifier:            org.python.python
Version:               3.11.3 (3.11.3)
Code Type:             X86-64 (Native)
Parent Process:        Python [46940]
Responsible:           Terminal [8353]
User ID:               501

Date/Time:             2023-05-23 10:22:19.0221 -0400
OS Version:            macOS 13.4 (22F66)
Report Version:        12
Bridge OS Version:     7.5 (20P5058)
Anonymous UUID:        BD844EB9-9C6F-867E-78EB-1ACDA55970A0


Time Awake Since Boot: 310000 seconds

System Integrity Protection: enabled

Crashed Thread:        0  Dispatch queue: com.apple.main-thread

Exception Type:        EXC_BAD_ACCESS (SIGSEGV)
Exception Codes:       KERN_INVALID_ADDRESS at 0x0000000000000000
Exception Codes:       0x0000000000000001, 0x0000000000000000

Termination Reason:    Namespace SIGNAL, Code 11 Segmentation fault: 11
Terminating Process:   exc handler [46997]

VM Region Info: 0 is not in any region.  Bytes before following region: 4365639680
      REGION TYPE                    START - END         [ VSIZE] PRT/MAX SHRMOD  REGION DETAIL
      UNUSED SPACE AT START
--->  
      __TEXT                      104366000-10436a000    [   16K] r-x/r-x SM=COW  .../MacOS/Python

Thread 0 Crashed::  Dispatch queue: com.apple.main-thread
0   libsystem_pthread.dylib       	    0x7ff81b5833c1 pthread_rwlock_wrlock + 0
1   libcrypto.3.dylib             	       0x106111889 CRYPTO_THREAD_write_lock + 9 (threads_pthread.c:110)
2   libcrypto.3.dylib             	       0x1060aecc9 ERR_unload_strings + 57 (err.c:314)
3   libcrypto.3.dylib             	       0x10610dd4e ossl_provider_free + 78 (provider_core.c:688)
4   libcrypto.3.dylib             	       0x10622ea5b OPENSSL_sk_pop_free + 59 (stack.c:439)
5   libcrypto.3.dylib             	       0x10610d774 sk_OSSL_PROVIDER_pop_free + 12 (provider_core.c:199) [inlined]
6   libcrypto.3.dylib             	       0x10610d774 ossl_provider_store_free + 68 (provider_core.c:295)
7   libcrypto.3.dylib             	       0x1060fc8c2 context_deinit_objs + 194 (context.c:250)
8   libcrypto.3.dylib             	       0x1060fc1b6 context_deinit + 16 (context.c:334) [inlined]
9   libcrypto.3.dylib             	       0x1060fc1b6 OSSL_LIB_CTX_free + 118 (context.c:465)
10  oqsprovider.0.5.0-dev.dylib   	       0x1051ce542 oqsx_freeprovctx + 18
11  oqsprovider.0.5.0-dev.dylib   	       0x1051cd399 oqsprovider_teardown + 9
12  libcrypto.3.dylib             	       0x10568bd9a ossl_provider_free + 61
13  libcrypto.3.dylib             	       0x10579ae24 OPENSSL_sk_pop_free + 45
14  libcrypto.3.dylib             	       0x10568b883 ossl_provider_store_free + 63
15  libcrypto.3.dylib             	       0x10568015f context_deinit_objs + 194
16  libcrypto.3.dylib             	       0x10567f9c4 context_deinit + 27
17  libcrypto.3.dylib             	       0x10567f99c ossl_lib_ctx_default_deinit + 16
18  libcrypto.3.dylib             	       0x1056829fa OPENSSL_cleanup + 200
19  libsystem_c.dylib             	    0x7ff81b457ba8 __cxa_finalize_ranges + 416
20  libsystem_c.dylib             	    0x7ff81b4579bb exit + 35
21  libdyld.dylib                 	    0x7ff81b59d8d3 dyld4::LibSystemHelpers::exit(int) const + 11
22  dyld                          	    0x7ff81b22c458 start + 1960

@mattcaswell
Member

So, my assumption is that the Python is picking up the 3.1.0 libcrypto version (no debug symbols) and then loading the oqsprovider which is picking up the 3.2.0-dev libcrypto version. Chaos ensues.

@mattcaswell
Member

Unfortunately I know nothing about python or how to influence what version of libcrypto it uses...
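
For reference, the interpreter can at least report which OpenSSL it has loaded. The sketch below uses the standard `ssl` module attributes `OPENSSL_VERSION` and `OPENSSL_VERSION_INFO`; the lookup of `_ssl.__file__` is a CPython-specific assumption (the attribute is absent when the extension is built into the interpreter), and the function name `interpreter_openssl_info` is hypothetical. It only reports what Python uses; it does not change it.

```python
import ssl
import sys


def interpreter_openssl_info():
    """Describe the OpenSSL library this Python interpreter is using."""
    info = {
        "python": sys.version.split()[0],
        # Version string of the OpenSSL library loaded by the interpreter.
        "openssl_version": ssl.OPENSSL_VERSION,
        "openssl_version_info": ssl.OPENSSL_VERSION_INFO,
    }
    try:
        import _ssl  # CPython's C extension behind the ssl module (assumption)
        info["ssl_extension"] = getattr(_ssl, "__file__", None)
    except ImportError:
        info["ssl_extension"] = None
    return info


for key, value in interpreter_openssl_info().items():
    print(f"{key}: {value}")
```

If `ssl_extension` is a path, running `otool -L` on that file (as done above for liboqs and oqsprovider) would show which libcrypto the interpreter's extension module was linked against.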

@mouse07410
Contributor Author

So, my assumption is that the Python is picking up the 3.1.0 libcrypto version (no debug symbols) and then loading the oqsprovider which is picking up the 3.2.0-dev libcrypto version. Chaos ensues.

Thus, the solution for me would be removing oqs-provider from openssl.cnf for the liboqs tests to run. Which I did, with good results - no more crashes.

Unfortunately I know nothing about python or how to influence what version of libcrypto it uses...

I don't know much about Python - but am pretty sure that it would be impossible to influence, unless one builds it from source and can configure it appropriately. Since I'm using a system-wide binary distribution of Python - it's out of the question.
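
An alternative to editing the system-wide openssl.cnf is to point only the test run at a separate configuration file without the oqs-provider section, via the `OPENSSL_CONF` environment variable that OpenSSL consults when loading its configuration. This is a minimal sketch under that assumption; the helper name `env_with_openssl_conf` and the file name `openssl-no-oqs.cnf` are hypothetical and do not come from this issue.

```python
import os


def env_with_openssl_conf(conf_path, base_env=None):
    """Return a copy of the environment with OPENSSL_CONF set to conf_path."""
    env = dict(os.environ if base_env is None else base_env)
    env["OPENSSL_CONF"] = conf_path
    return env


# Hypothetical usage: run the test suite with a provider-free configuration.
#
#   import subprocess, sys
#   subprocess.run(
#       [sys.executable, "-m", "pytest", "--verbose"],
#       env=env_with_openssl_conf("/path/to/openssl-no-oqs.cnf"),
#       check=False,
#   )
```

This keeps the regular configuration untouched for everything else on the machine, at the cost of maintaining a second config file.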

@mouse07410
Contributor Author

I think we can close this. First, this workaround tested OK, and second - it doesn't look like there's a way to "fix" or otherwise address it in the code.

@mouse07410
Contributor Author

BTW, speaking of OPENSSL_init_crypto() - looks like plenty of code in OpenSSL invokes this function. Here's a small excerpt:

. . .
openssl/crypto/engine/eng_table.c
205:    OPENSSL_init_crypto(OPENSSL_INIT_LOAD_CONFIG, NULL);

openssl/crypto/engine/eng_all.c
15:    OPENSSL_init_crypto(OPENSSL_INIT_ENGINE_ALL_BUILTIN, NULL);

openssl/crypto/objects/obj_dat.c
77:    OPENSSL_init_crypto(OPENSSL_INIT_LOAD_CONFIG, NULL);

openssl/crypto/conf/conf_sap.c
40:    OPENSSL_init_crypto(OPENSSL_INIT_LOAD_CONFIG, &settings);

openssl/crypto/rand/rand_lib.c
457:     OPENSSL_init_crypto(OPENSSL_INIT_BASE_ONLY, NULL);

openssl/crypto/provider_core.c
415:                OPENSSL_init_crypto(OPENSSL_INIT_LOAD_CONFIG, NULL);
1367:        OPENSSL_init_crypto(OPENSSL_INIT_LOAD_CONFIG, NULL);
. . .

@baentsch
Contributor

So, my assumption is that the Python is picking up the 3.1.0 libcrypto version (no debug symbols) and then loading the oqsprovider which is picking up the 3.2.0-dev libcrypto version. Chaos ensues.

I find this improbable: How should oqsprovider "pick up" something that it does not load? (At least on Unix), all libcrypto symbols are undefined in the oqsprovider binary, i.e., must be resolved by the shared linker when it's loaded, right? And as that loading only happens via libcrypto, how should a different library version be loaded than the one that loaded libcrypto itself?

looks like plenty of code in OpenSSL invokes this function.

This in turn looks much more like something worthwhile investigating: This list contains at least one function (OSSL_PROVIDER_do_all) that rings a bell: oqsprovider testing (sic!) needed to be changed to use it because of another bug (#19326)....

May I suggest to not close this issue such as to eventually look at it as and when there's folks meeting that are interested in making providers (and all their conceptual features) a "more regular" OpenSSL capability (there's a call for participation in something like this open for all I know)?

@mouse07410
Contributor Author

May I suggest to not close this issue . . . ?

Sure. It looks like OpenSSL could do something to lower the probability of this issue rearing its ugly head. ;-)

@mouse07410 reopened this on May 24, 2023
@mattcaswell
Member

I find this improbable: How should oqsprovider "pick up" something that it does not load? (At least on Unix)

@baentsch - isn't oqsprovider linked against libcrypto? The output shown in this comment suggests that it is:

#21023 (comment)

And it is making calls to libcrypto API functions. So the dynamic linker will "pick up" libcrypto when it loads the oqsprovider.

@baentsch
Contributor

So the dynamic linker will "pick up" libcrypto when it loads the oqsprovider.

Yes. This is not what I put in question. I just wonder how the same dynamic linker that loaded libcrypto could possibly load a different libcrypto when resolving libcrypto symbols in liboqs (which indeed occurs as a consequence of libcrypto itself loading oqsprovider).

Further, when looking at the code of OPENSSL_init_crypto there are lots of code paths (and some comments) that make me think some review (with the background of providers loading components that need libcrypto) is warranted, things like "At some point we should look at this function with a view to moving most/all of this into OSSL_LIB_CTX." or the many opts....

@mattcaswell
Member

I just wonder how the same dynamic linker that loaded libcrypto could possibly load a different libcrypto when resolving libcrypto symbols in liboqs (which indeed occurs as a consequence of libcrypto itself loading oqsprovider).

I don't think that oqsprovider and liboqs are ending up with different libcrypto versions. I think python and oqsprovider are using different libcrypto versions, i.e. python is using the system libcrypto, and oqsprovider is using a custom libcrypto.

Further, when looking at the code of OPENSSL_init_crypto there are lots of code paths (and some comments) that make me think some review (with the background of providers loading components that need libcrypto) is warranted, things like "At some point we should look at this function with a view to moving most/all of this into OSSL_LIB_CTX." or the many opts....

Undoubtedly it would be desirable to move more stuff into OSSL_LIB_CTX. But I don't believe that is what is at the heart of this issue.

@mouse07410
Contributor Author

I don't think that oqsprovider and liboqs are ending up with different libcrypto versions. I think python and oqsprovider are using different libcrypto versions, i.e. python is using the system libcrypto, and oqsprovider is using a custom libcrypto.

Exactly.

I build both oqsprovider and liboqs - so I can (and do) control which libcrypto they're linked against. And the loader obediently picks that libcrypto to resolve calls from these components. In this context, it's libcrypto from OpenSSL-3.2.0-dev.

Python, on the other hand, comes from a binary distribution and is linked against a different libcrypto - in this case it's libcrypto from OpenSSL-3.1.0 (also installed as a binary by Macports).

That practically guarantees loading two different libcrypto versions when testing or running software built for/with one version of OpenSSL (like 3.2.0+) with tools that are built for/with the "released" version (in this case 3.1.0).
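
Whether a given process has ended up with two libcrypto copies can also be checked from inside the process. The sketch below is an illustration under stated assumptions, not something used in this issue: on macOS it calls the dyld functions `_dyld_image_count` and `_dyld_get_image_name` through `ctypes`, on Linux it reads `/proc/self/maps`, and elsewhere it returns an empty list. The function names `loaded_images` and `loaded_libcrypto_copies` are hypothetical.

```python
import ctypes
import os
import sys


def loaded_images():
    """Return file paths of shared objects mapped into the current process."""
    if sys.platform == "darwin":
        libc = ctypes.CDLL(None)  # handle to the already-loaded process images
        libc._dyld_image_count.restype = ctypes.c_uint32
        libc._dyld_get_image_name.restype = ctypes.c_char_p
        libc._dyld_get_image_name.argtypes = [ctypes.c_uint32]
        names = []
        for index in range(libc._dyld_image_count()):
            name = libc._dyld_get_image_name(index)
            if name:
                names.append(name.decode())
        return names
    if os.path.exists("/proc/self/maps"):
        paths = set()
        with open("/proc/self/maps") as maps:
            for line in maps:
                fields = line.split(None, 5)
                # The sixth field, when present, is the mapped file's path.
                if len(fields) == 6 and fields[5].startswith("/"):
                    paths.add(fields[5].strip())
        return sorted(paths)
    return []  # platform not covered by this sketch


def loaded_libcrypto_copies():
    """Return the distinct libcrypto files currently loaded, sorted."""
    return sorted(
        {p for p in loaded_images() if "libcrypto" in os.path.basename(p)}
    )


import ssl  # makes the interpreter load its own libcrypto, if not already done

print(loaded_libcrypto_copies())
```

More than one entry in the printed list would match the situation described above: one libcrypto pulled in by the interpreter and a second one pulled in by a provider built against a different OpenSSL.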

@t8m added the "branch: 3.0", "branch: 3.1" and "triaged: bug" labels on Oct 9, 2023
@t8m added the "branch: 3.2" label on Oct 26, 2023
5 participants