Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use ERR upcalls in legacy provider #17474

Closed

Conversation

levitte
Copy link
Member

@levitte levitte commented Jan 11, 2022

This reverts 7ee992a, which linked
test/endecode_test statically with the legacy provider to avoid
having ossl_init_register_atexit() called from inside the provider.

We avoid calling ossl_init_register_atexit() from inside the provider
by reimplementing libcrypto functions that end up doing so, in terms of
upcalls. For the legacy provider, that's the set of ERR upcalls.

Any provider module should use the ERR upcalls for error reporting, or they may be lost (if the provider module is linked with a different libcrypto than it's loaded by).

Move them to their own source file, so they end up in a separate
object file.  This allows providers to override their implementation
to use the corresponding OSSL_FUNC upcalls without having to
reimplement everything from crypto/err/err.c.
@levitte levitte marked this pull request as draft January 11, 2022 17:40
@levitte
Copy link
Member Author

levitte commented Jan 11, 2022

I'm temporarly making this a draft, as I got some other weird errors on my laptop

@levitte
Copy link
Member Author

levitte commented Jan 11, 2022

@rsbeckerca, would you mind trying this out on NonStop? Same configuration that caused the atexit() related crash...

@rsbeckerca
Copy link
Contributor

@rsbeckerca, would you mind trying this out on NonStop? Same configuration that caused the atexit() related crash...

In progress at commit 5879a0723d. I'll post here with results (2-ish hours).

@levitte
Copy link
Member Author

levitte commented Jan 11, 2022

@rsbeckerca, would you mind trying this out on NonStop? Same configuration that caused the atexit() related crash...

In progress at commit 5879a07. I'll post here with results (2-ish hours).

There's [ahem] yet another fixup that I pushed just a moment ago. You might want to stop and go with commit 5e46898 instead

@rsbeckerca
Copy link
Contributor

@rsbeckerca, would you mind trying this out on NonStop? Same configuration that caused the atexit() related crash...

In progress at commit 5879a07. I'll post here with results (2-ish hours).

There's [ahem] yet another fixup that I pushed just a moment ago. You might want to stop and go with commit 5e46898 instead

Restarted with that commit.

@rsbeckerca
Copy link
Contributor

@rsbeckerca, would you mind trying this out on NonStop? Same configuration that caused the atexit() related crash...

In progress at commit 5879a07. I'll post here with results (2-ish hours).

There's [ahem] yet another fixup that I pushed just a moment ago. You might want to stop and go with commit 5e46898 instead

Restarted with that commit.

Restarted again on e4d879a

@levitte
Copy link
Member Author

levitte commented Jan 11, 2022

Side note: Windows runs currently fail for unrelated reasons

@levitte
Copy link
Member Author

levitte commented Jan 11, 2022

Restarted again on e4d879a

Thanks for noticing. I think I finally got it in shape to pass all the CI runs

@levitte levitte marked this pull request as ready for review January 11, 2022 18:34
@rsbeckerca
Copy link
Contributor

@levitte sorry to say that this fix does not work. SIGSEGV again.

@levitte
Copy link
Member Author

levitte commented Jan 11, 2022

@levitte sorry to say that this fix does not work. SIGSEGV again.

Ok. Any chance you can get a fall grace for when that happens?

@rsbeckerca
Copy link
Contributor

@levitte sorry to say that this fix does not work. SIGSEGV again.

Ok. Any chance you can get a fall grace for when that happens?

It isn't very informative. Same as before:

(xInspect 2,569):bt
#0  0xfffffffff3bad428 in __process_atexit_functions ()
#1  0xfffffffff3c07871 in CRE_TERMINATOR_ ()
#2  0xfffffffff3d6ac54 in exit ()
#3  0x700c1a0a in _MAIN () at \BLPROD.$RTAL.T8432ACM.CMAIN64C:52

@rsbeckerca
Copy link
Contributor

rsbeckerca commented Jan 11, 2022

@levitte sorry to say that this fix does not work. SIGSEGV again.

Ok. Any chance you can get a fall grace for when that happens?

It isn't very informative. Same as before:

(xInspect 2,569):bt
#0  0xfffffffff3bad428 in __process_atexit_functions ()
#1  0xfffffffff3c07871 in CRE_TERMINATOR_ ()
#2  0xfffffffff3d6ac54 in exit ()
#3  0x700c1a0a in _MAIN () at \BLPROD.$RTAL.T8432ACM.CMAIN64C:52

FYI: This build does a dlopen of legacy.so which auto-loads libcrypto.so as before. What may not be clear is that when dlclose is called on legacy.so, libcrypto.so also unloads prior to dlclose returning on this platform.

@bernd-edlinger
Copy link
Member

ossl_init_load_crypto_nodelete() is there to prevent this kind of crash,
is it somehow disabled on your platform or otherwise not functional?

@levitte
Copy link
Member Author

levitte commented Jan 12, 2022

@rsbeckerca, I asked for the wrong thing, so let's try again. In your crash case, I assume that ossl_init_register_atexit() is called twice, the second time through the legacy provider's OSSL_provider_init(). So it would seem to me that a breakpoint on ossl_init_register_atexit() and getting the call stack at that point would shed some light on what's going on.

@rsbeckerca
Copy link
Contributor

rsbeckerca commented Jan 12, 2022

The first call to ossl_init_register_atexit, made to the static copy at address 0x7012226f is:

#0  ossl_init_register_atexit ()
    at /home/ituglib/randall/openssl-3.0/crypto/init.c:101
#1  0x7012224d in ossl_init_register_atexit_ossl_ ()
    at /home/ituglib/randall/openssl-3.0/crypto/init.c:90
#2  0x70130452 in CRYPTO_THREAD_run_once (once=<value optimized out>,
    init=<value optimized out>)
    at /home/ituglib/randall/openssl-3.0/crypto/threads_none.c:70
#3  0x70122a03 in OPENSSL_init_crypto (opts=64,
    settings=<value optimized out>)
    at /home/ituglib/randall/openssl-3.0/crypto/init.c:531
#4  0x7012d313 in ossl_provider_find (libctx=<value optimized out>,
    name=<value optimized out>, noconfig=<value optimized out>)
    at /home/ituglib/randall/openssl-3.0/crypto/provider_core.c:421
#5  0x7012b15f in OSSL_PROVIDER_try_load (libctx=<value optimized out>,
    name=<value optimized out>, retain_fallbacks=<value optimized out>)
    at /home/ituglib/randall/openssl-3.0/crypto/provider.c:25
#6  0x7012b2ab in OSSL_PROVIDER_load (libctx=<value optimized out>,
    name=<value optimized out>)
    at /home/ituglib/randall/openssl-3.0/crypto/provider.c:56
#7  0x700cf58f in test_get_libctx (libctx=<value optimized out>,
    default_null_prov=0x80c3c50,
    config_file=0x80cf0c8 "../../test/default.cnf",
    provider=<value optimized out>, module_name=0x80cf0e9 "default")
    at /home/ituglib/randall/openssl-3.0/test/testutil/provider.c:27
#8  0x700c73ce in setup_tests ()
    at /home/ituglib/randall/openssl-3.0/test/endecode_test.c:1304
#9  0x700cea45 in main (argc=<value optimized out>, argv=0x80cf000)
    at /home/ituglib/randall/openssl-3.0/test/testutil/main.c:29
#10 0x700c1a00 in _MAIN () at \BLPROD.$RTAL.T8432ACM.CMAIN64C:52

The second call to ossl_init_register_atexit in dynamic library is at the same address, after the dlopen() finishes is:

#0  ossl_init_register_atexit ()
    at /home/ituglib/randall/openssl-3.0/crypto/init.c:101
#1  0x7fca63cd in ossl_init_register_atexit_ossl_ ()
    at /home/ituglib/randall/openssl-3.0/crypto/init.c:90
#2  0x7fcb8532 in CRYPTO_THREAD_run_once (once=<value optimized out>,
    init=<value optimized out>)
    at /home/ituglib/randall/openssl-3.0/crypto/threads_none.c:70
#3  0x7fca6b83 in OPENSSL_init_crypto (opts=12,
    settings=<value optimized out>)
    at /home/ituglib/randall/openssl-3.0/crypto/init.c:531
#4  0x7fca345f in ossl_namemap_stored (libctx=<value optimized out>)
    at /home/ituglib/randall/openssl-3.0/crypto/core_namemap.c:494
#5  0x7fc6e680 in inner_evp_generic_fetch (methdata=0x6fffe160, prov=0x0,
    operation_id=1, name_id=0, name=0x7fff4430 "SHA1",
    properties=<value optimized out>,
    new_method=0x7fc58260 <evp_md_from_algorithm>,
    up_ref_method=0x7fc58840 <evp_md_up_ref>,
    free_method=0x7fc58860 <evp_md_free>)
    at /home/ituglib/randall/openssl-3.0/crypto/evp/evp_fetch.c:249
#6  0x7fc6ec04 in evp_generic_fetch (libctx=<value optimized out>,
    operation_id=1, name=<value optimized out>,
    properties=<value optimized out>, new_method=<value optimized out>,
    up_ref_method=<value optimized out>, free_method=0x7fc58860 <evp_md_free>)
    at /home/ituglib/randall/openssl-3.0/crypto/evp/evp_fetch.c:372
#7  0x7fc5661c in EVP_MD_fetch (ctx=<value optimized out>,
    algorithm=<value optimized out>, properties=<value optimized out>)
    at /home/ituglib/randall/openssl-3.0/crypto/evp/digest.c:1060
#8  0x7ffd78fa in ossl_prov_digest_fetch (pd=<value optimized out>,
    libctx=0x11027c9c0, mdname=<value optimized out>,
    propquery=<value optimized out>)
    at /home/ituglib/randall/openssl-3.0/providers/common/provider_util.c:171
#9  0x7ffd79f4 in ossl_prov_digest_load_from_params (
    pd=<value optimized out>, params=<value optimized out>,
    ctx=<value optimized out>)
    at /home/ituglib/randall/openssl-3.0/providers/common/provider_util.c:196
#10 0x7ffd0173 in kdf_pvk_init (ctx=0x110217e10)
    at /home/ituglib/randall/openssl-3.0/providers/implementations/kdfs/pvkkdf.c:91
#11 0x7ffcfb36 in kdf_pvk_new (provctx=<value optimized out>)
    at /home/ituglib/randall/openssl-3.0/providers/implementations/kdfs/pvkkdf.c:52
#12 0x70284124 in EVP_KDF_CTX_new (kdf=0x110273d40)
    at /home/ituglib/randall/openssl-3.0/crypto/evp/kdf_lib.c:31
#13 0x70137898 in derive_pvk_key (key=<value optimized out>,
    keylen=<value optimized out>, salt=<value optimized out>,
    saltlen=<value optimized out>, pass=<value optimized out>,
    passlen=<value optimized out>, libctx=0x110051710, propq=0x0)
    at /home/ituglib/randall/openssl-3.0/crypto/pem/pvkfmt.c:807
#14 0x7013888b in i2b_PVK (out=0x6fffebd0, pk=<value optimized out>,
    enclevel=<value optimized out>, cb=<value optimized out>,
    u=<value optimized out>, libctx=<value optimized out>, propq=0x0)
    at /home/ituglib/randall/openssl-3.0/crypto/pem/pvkfmt.c:1073
#15 0x70138bd6 in i2b_PVK_bio_ex (out=<value optimized out>,
    pk=<value optimized out>, enclevel=<value optimized out>,
    cb=<value optimized out>, u=<value optimized out>,
    libctx=<value optimized out>, propq=0x0)
    at /home/ituglib/randall/openssl-3.0/crypto/pem/pvkfmt.c:1115
#16 0x700c2c75 in encode_EVP_PKEY_PVK (
    file=0x80028f0 "/home/ituglib/randall/openssl-3.0/test/endecode_test.c",
    line=772, encoded=<value optimized out>,
    encoded_len=<value optimized out>, object=<value optimized out>,
    selection=<value optimized out>, output_type=0x8003340 "PVK",
    output_structure=0x0, pass=0x80028b0 "the holy handgrenade of antioch",
    pcipher=0x0) at /home/ituglib/randall/openssl-3.0/test/endecode_test.c:428
#17 0x700c1cbd in test_encode_decode (
    file=0x80028f0 "/home/ituglib/randall/openssl-3.0/test/endecode_test.c",
    line=772, type=<value optimized out>, pkey=<value optimized out>,
    selection=<value optimized out>, output_type=<value optimized out>,
    output_structure=0x0, pass=0x80028b0 "the holy handgrenade of antioch",
    pcipher=0x0, encode_cb=0x700c2ba0 <encode_EVP_PKEY_PVK>,
    decode_cb=0x700c2360 <decode_EVP_PKEY_prov>,
    test_cb=0x700c2e20 <test_mem>, check_cb=0x700c3a60 <check_PVK>,
    dump_cb=0x700c2f40 <dump_der>, flags=0)
    at /home/ituglib/randall/openssl-3.0/test/endecode_test.c:168
#18 0x700c4177 in test_protected_via_PVK (type=<value optimized out>,
    key=<value optimized out>)
    at /home/ituglib/randall/openssl-3.0/test/endecode_test.c:772
#19 0x700c4cfe in test_protected_DSA_via_PVK ()
    at /home/ituglib/randall/openssl-3.0/test/endecode_test.c:982
#20 0x700cc451 in run_tests (test_prog_name=<value optimized out>)
    at /home/ituglib/randall/openssl-3.0/test/testutil/driver.c:334
#21 0x700cea56 in main (argc=<value optimized out>, argv=0x80cf000)
    at /home/ituglib/randall/openssl-3.0/test/testutil/main.c:30
#22 0x700c1a00 in _MAIN () at \BLPROD.$RTAL.T8432ACM.CMAIN64C:52

These calls are only made once during the execution of the 114 tests. Still OPENSSL_cleanup is registered twice. Note that the address of OPENSSL_init_crypto is different in each stack trace.

@rsbeckerca
Copy link
Contributor

rsbeckerca commented Jan 12, 2022

ossl_init_load_crypto_nodelete() is there to prevent this kind of crash, is it somehow disabled on your platform or otherwise not functional?

There appear to be two instances of static int base_inited, statically and dynamically linked. The static vs. dynamic pass through ossl_init_load_crypto_nodelete() sees a different copy. This is a similar situation to the original issue I reported where the variable registered_atexit was duplicated by the same mechanism. The CRYPTO_ONCE annotation on that declaration does not seem to have an effect on the linker or loader.

@bernd-edlinger
Copy link
Member

Interesting, so there are 2 instances load_crypto_nodelete, but only one ossl_init_load_crypto_nodelete?

static CRYPTO_ONCE load_crypto_nodelete = CRYPTO_ONCE_STATIC_INIT;
DEFINE_RUN_ONCE_STATIC(ossl_init_load_crypto_nodelete)

@levitte
Copy link
Member Author

levitte commented Jan 12, 2022

I found that call stack as well when testing with a no-shared configuration, which effectively results in the same kind of situation with two libcrypto copies in the process space (but since I run on Linux, using atexit() in dynamically loaded modules works "correctly").

That call stack had me realise that the atexit() issue is much more deep seated than I previously understood... running OPENSSL_init_crypto(OPENSSL_INIT_NO_ATEXIT, NULL) in the provider is a solution, but is really just a hack with resulting memory leaks. I do have some ideas for a more practical solution, but will have to experiment a bit. That will happen in a separate PR, and I'll convert this PR back to draft in the mean time.

@levitte levitte marked this pull request as draft January 12, 2022 08:23
@rsbeckerca
Copy link
Contributor

rsbeckerca commented Jan 12, 2022

Interesting, so there are 2 instances load_crypto_nodelete, but only one ossl_init_load_crypto_nodelete?

static CRYPTO_ONCE load_crypto_nodelete = CRYPTO_ONCE_STATIC_INIT;
DEFINE_RUN_ONCE_STATIC(ossl_init_load_crypto_nodelete)

I did not monitor for that address in the first run. The critical part here is that data symbols are always duplicated in the architecture OpenSSL uses in this situation. The code probably is too. Essentially, crypto/init.o exists in two places (main and libcrypto.so) and the invocation depends on how the linker sees symbols when it is putting the program together.

In main:

(xInspect 2,1050):p &load_crypto_nodelete
$1 = (CRYPTO_ONCE *) 0x80cc4a8
(xInspect 2,1050):p ossl_init_load_crypto_nodelete
$2 = {int (void)} 0x70122460 <ossl_init_load_crypto_nodelete>

In libcrypto.so:

(xInspect 2,1050):p ossl_init_load_crypto_nodelete
$4 = {int (void)} 0x7fca65e0 <ossl_init_load_crypto_nodelete>
(xInspect 2,1050):p &load_crypto_nodelete
$5 = (CRYPTO_ONCE *) 0x7ffb3fc8

So both symbols are duplicated.

@bernd-edlinger
Copy link
Member

So both symbols are duplicated.

okay, that is what I would have expected.
But what happens when ossl_init_load_crypto_nodelete
is called in the shared object, it should lock the shared object,
so it should not be unloaded but you have the crash, because it is
unloaded. Are you able to figure out what went wrong here:

        dso = DSO_dsobyaddr(&base_inited, DSO_FLAG_NO_UNLOAD_ON_FREE);
        /*
         * In case of No!, it is uncertain our exit()-handlers can still be
         * called. After dlclose() the whole library might have been unloaded
         * already.
         */
        OSSL_TRACE1(INIT, "obtained DSO reference? %s\n",
                    (dso == NULL ? "No!" : "Yes."));

@rsbeckerca
Copy link
Contributor

I found that call stack as well when testing with a no-shared configuration, which effectively results in the same kind of situation with two libcrypto copies in the process space (but since I run on Linux, using atexit() in dynamically loaded modules works "correctly").

That call stack had me realise that the atexit() issue is much more deep seated than I previously understood... running OPENSSL_init_crypto(OPENSSL_INIT_NO_ATEXIT, NULL) in the provider is a solution, but is really just a hack with resulting memory leaks. I do have some ideas for a more practical solution, but will have to experiment a bit. That will happen in a separate PR, and I'll convert this PR back to draft in the mean time.

As I have commented before, I think the atexit is not the actual root case of the problem although it is definitely manifesting it. The root issue is that the duplication of the registration guards preventing atexit from being called twice on OPENSSL_cleanup based on how the test program is being linked. One way of dealing with that is to register the guard variable address (or some other global memory context) with the DLL once dlopen is called so that the linker is no longer has control of the duplication - assuming the code is reentrant.

@rsbeckerca
Copy link
Contributor

It is hard to see what's going on because of the DEFINEs and optimization even at -O0. In both cases, it looks like the code thinks base_inited was not set prior to the call both times. *once=1 is set both times.

@rsbeckerca
Copy link
Contributor

It is hard to see what's going on because of the DEFINEs and optimization even at -O0. In both cases, it looks like the code thinks base_inited was not set prior to the call both times. *once=1 is set both times.

Operator error: I built with default -O2. Will rebuild with -O0. That will take until 7am EST, so will grab a nap and debug this again later this morning.

@levitte
Copy link
Member Author

levitte commented Jan 12, 2022

@rsbeckerca, the crashes that you see now, they are in test_store, right? Do you not get the same crashes in a master checkout? That should be completely unrelated to the fix from #17345...

This involves the following functions:

ERR_new(), ERR_set_debug(), ERR_set_error(), ERR_vset_error(),
ERR_set_mark(), ERR_clear_last_mark(), ERR_pop_to_mark(void)
@levitte levitte force-pushed the legacy-provider-use-error-upcalls branch from e4d879a to b76295e Compare January 12, 2022 11:47
@levitte
Copy link
Member Author

levitte commented Jan 12, 2022

I've modified this PR to only include the ERR upcall modifications. The atexit() mess is to be handled in a separate PR (I'm working on it)

@levitte levitte marked this pull request as ready for review January 12, 2022 11:50
@rsbeckerca
Copy link
Contributor

@rsbeckerca, the crashes that you see now, they are in test_store, right? Do you not get the same crashes in a master checkout? That should be completely unrelated to the fix from #17345...

test_endecode

@rsbeckerca
Copy link
Contributor

9c31ab8 passes both 04-test_encoder_decoder.t and 04-test_encoder_decoder_legacy.t on NonStop x86

@levitte
Copy link
Member Author

levitte commented Jan 14, 2022

Final approval anyone?

@bernd-edlinger
Copy link
Member

@rsbeckerca I still don't understand why your shared objects have been unloaded, on your target.
in other words, #17474 (comment) is still unanswered.
I assume you did not configure with no-pinshared, therefore, the function ossl_init_load_crypto_nodelete
should execute (in the shared object) immediately after the atexit was called. This does intentionally
leak one reference to the libcrypto, and therefore is supposed to prevent unloading afterwards.
Have you any clue why that is not working, can you debug this code, if there is an obvoious reason
why that fails?

@levitte
Copy link
Member Author

levitte commented Jan 17, 2022

@rsbeckerca I still don't understand why your shared objects have been unloaded, on your target. in other words, #17474 (comment) is still unanswered. I assume you did not configure with no-pinshared, therefore, the function ossl_init_load_crypto_nodelete should execute (in the shared object) immediately after the atexit was called. This does intentionally leak one reference to the libcrypto, and therefore is supposed to prevent unloading afterwards. Have you any clue why that is not working, can you debug this code, if there is an obvoious reason why that fails?

@bernd-edlinger, that discussion isn't relevant in this PR any more.

@rsbeckerca
Copy link
Contributor

@levitte Well, the current master fails on NonStop for 30-test_evp_extra.t, so whatever was pushed is broken. @ Commit 5764533. So I'm not sure the hackery is fully making it.

@rsbeckerca
Copy link
Contributor

@levitte Well, the current master fails on NonStop for 30-test_evp_extra.t, so whatever was pushed is broken. @ Commit 5764533. So I'm not sure the hackery is fully making it.

Do you want a separate Issue for this report?

@levitte
Copy link
Member Author

levitte commented Jan 17, 2022

@levitte Well, the current master fails on NonStop for 30-test_evp_extra.t, so whatever was pushed is broken. @ Commit 5764533. So I'm not sure the hackery is fully making it.

Do you want a separate Issue for this report?

Yes please

@rsbeckerca
Copy link
Contributor

30-test_evp_extra

Opened as #17537

@levitte
Copy link
Member Author

levitte commented Jan 19, 2022

Ping

crypto/err/err_mark.c Outdated Show resolved Hide resolved
@paulidale paulidale added approval: done This pull request has the required number of approvals branch: master Merge to master branch labels Jan 20, 2022
@openssl-machine
Copy link
Collaborator

24 hours has passed since 'approval: done' was set, but as this PR has been updated in that time the label 'approval: ready to merge' is not being automatically set. Please review the updates and set the label manually.

@levitte levitte added approval: ready to merge The 24 hour grace period has passed, ready to merge and removed approval: done This pull request has the required number of approvals labels Jan 21, 2022
openssl-machine pushed a commit that referenced this pull request Jan 21, 2022
Move them to their own source file, so they end up in a separate
object file.  This allows providers to override their implementation
to use the corresponding OSSL_FUNC upcalls without having to
reimplement everything from crypto/err/err.c.

Reviewed-by: Paul Dale <pauli@openssl.org>
(Merged from #17474)
openssl-machine pushed a commit that referenced this pull request Jan 21, 2022
This involves the following functions:

ERR_new(), ERR_set_debug(), ERR_set_error(), ERR_vset_error(),
ERR_set_mark(), ERR_clear_last_mark(), ERR_pop_to_mark(void)

Reviewed-by: Paul Dale <pauli@openssl.org>
(Merged from #17474)
@levitte
Copy link
Member Author

levitte commented Jan 21, 2022

Merged

fbe8870 ERR: Move ERR_set_mark(), ERR_pop_to_mark() and ERR_clear_last_mark()
8c2e588 LEGACY PROV: Reimplement the ERR building blocks in upcall terms

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approval: ready to merge The 24 hour grace period has passed, ready to merge branch: master Merge to master branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants