Do not call RAND_get0_public from within the FIPS provider initialization #14497
|
Nice find. |
|
Hmmm. It feels wrong that we should need to do this. Does this indicate some underlying bug in the init code? |
|
It could be a bug in the init code, but I'm not sure how much effort we need to put into solving it. I don't think that insisting that initialisation takes place single-threaded is onerous, though it ought to be documented. We should at least make an effort to understand the problem. Has @t8m done enough on this front? |
The leaked data are from the initialization of the FIPS provider, invoked from the FIPS provider shared library constructor. The leak does not happen if the FIPS provider is not configured in the config file. So yeah, this PR might be papering over a real issue. The code path is something like this:
The data from 6 are leaked. This does not happen if step 1 is avoided, i.e., if the initial load and implicit initialization are done in the main thread. |
|
I'll try to look into it some more, as I am curious why there is a difference between running this in the main thread and in a created thread. |
|
Something should probably be documented, perhaps in |
|
Marking as draft for now while investigating some more. |
|
24 hours has passed since 'approval: done' was set, but as this PR has been updated in that time the label 'approval: ready to merge' is not being automatically set. Please review the updates and set the label manually. |
|
My summary of investigation after many debugging runs:
Nice... |
|
So, do you now understand this problem sufficiently well to fix it? |
Or, perhaps, what to tell app developers on how to minimize the memory leaks or not worry about them. |
Calling OPENSSL_init_crypto(0, NULL) is a no-op and will not properly initialize locking.
Not yet fully. Unfortunately, the leak caused by RAND_get0_public() without OSSL_PROVIDER_load() is actually a different problem, which I fixed in the pushed commit. Doing the global init is definitely the wrong fix, though, so I've dropped that. |
It is not needed anymore and it causes leaks because it is called when the FIPS provider libctx is not yet properly set up.
|
This fixes the leak problem. However, I suppose the problem would reappear if we decided that we have to always run the KATs. The leak occurred when RAND_get0_public() was called there because prov->provctx on the libcrypto side is not yet initialized when ossl_init_thread_start is called for the ossl_ctx_thread_stop handler on the libctx from within the provider. There may be a better way to fix the underlying problem, but this at least is not just papering over the issue: the RAND_get0_public() call is really not needed there anymore, and removing it solves the leak. |
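To make the ordering problem above concrete, here is a toy model in plain C. It is only a sketch: every `toy_` name is hypothetical, and the idea that the leak comes from a stop handler being registered under a not-yet-set owner pointer is an assumption about one way such an ordering issue can leak, not a description of the actual libcrypto code.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_HANDLERS 8

typedef void (*toy_stop_fn)(void *arg);

/* One registered thread-stop handler (hypothetical layout). */
struct toy_handler {
    const void *owner;  /* key used to find the handler at teardown */
    toy_stop_fn fn;
    void *arg;
};

/* Stand-in for a provider whose context is set only after init returns. */
struct toy_provider {
    void *provctx;      /* NULL while provider init is still running */
};

static struct toy_handler toy_handlers[TOY_MAX_HANDLERS];
static size_t toy_handler_count = 0;

/* Record a handler under whatever owner pointer is known right now. */
static int toy_register_stop(const void *owner, toy_stop_fn fn, void *arg)
{
    if (toy_handler_count == TOY_MAX_HANDLERS)
        return 0;
    toy_handlers[toy_handler_count].owner = owner;
    toy_handlers[toy_handler_count].fn = fn;
    toy_handlers[toy_handler_count].arg = arg;
    toy_handler_count++;
    return 1;
}

/* Run the handlers registered under owner; return how many ran. */
static size_t toy_run_stop(const void *owner)
{
    size_t ran = 0;

    for (size_t i = 0; i < toy_handler_count; i++) {
        if (toy_handlers[i].fn != NULL && toy_handlers[i].owner == owner) {
            toy_handlers[i].fn(toy_handlers[i].arg);
            toy_handlers[i].fn = NULL;  /* each handler runs at most once */
            ran++;
        }
    }
    return ran;
}

/* Marks the resource behind arg as released. */
static void toy_release(void *arg)
{
    *(int *)arg = 0;
}
```

In this model, a handler registered while `provctx` is still NULL is stored under a NULL owner, so a later `toy_run_stop(prov.provctx)` with the real context pointer never reaches it and its resource stays allocated. A handler registered after `provctx` has been set is found and run as expected.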
|
Looks okay. |
|
The only reason this line was still there was that removing it caused errors in the tests at some point. It looks like this issue may have shifted now. |
|
I'm marking this urgent as it fixes the annoying CI failures. However, I'd like to ask @mattcaswell to review before I merge. |
|
Urgent is good. |
|
I am quite sceptical about the need for these OPENSSL_init_crypto calls in the bio, engine and store code. The
bio/engine/store don't care about the init lock. That's entirely internal to the init code. They don't have any assembler code so shouldn't care about cpuid setup, and they don't use any thread local storage. The previous calls where 0 was passed as the option did precisely nothing at all. So I'd like to know if anything breaks if you remove those calls altogether. The calls in the err and rand code do make sense because both of those sub-systems utilise thread local storage. |
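As a rough illustration of why passing 0 as the option achieves nothing, here is a toy option-mask initializer in plain C. It is a sketch only: the `toy_` names and flag values are hypothetical and are not OpenSSL's real flags or implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical option bits; these are not OpenSSL's real flag values. */
#define TOY_INIT_THREAD_LOCALS UINT64_C(0x01)
#define TOY_INIT_LOAD_CONFIG   UINT64_C(0x02)

static int toy_thread_locals_ready = 0;
static int toy_config_loaded = 0;

/*
 * Runs only the setup steps whose bits are present in opts.
 * With opts == 0 no branch is taken, so the call succeeds
 * but leaves every subsystem untouched.
 */
static int toy_init(uint64_t opts)
{
    if (opts & TOY_INIT_THREAD_LOCALS)
        toy_thread_locals_ready = 1;
    if (opts & TOY_INIT_LOAD_CONFIG)
        toy_config_loaded = 1;
    return 1;
}
```

Here `toy_init(0)` returns success without setting anything up, while `toy_init(TOY_INIT_THREAD_LOCALS)` prepares only the thread-local state. That mirrors the distinction drawn above between callers that use thread-local storage and callers that do not.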
|
OK, I removed those as suggested. @paulidale @mattcaswell please re-review. |
|
LGTM. But let's see the CIs go through before merging to make sure the removals haven't had some unexpected side effect. |
…rypto Keeping only the calls that are needed to initialize thread locals and removing the rest.
|
LGTM. Subject to CIs passing. |
|
Aargh, the non-caching acvp test failure is actually valid. |
|
Fortunately, this is just a wrong expectation in acvp_test and is easily fixable. |
There might be more because internal instances of the DRBG might be initialized for the first time and thus self-tested as well.
|
Heh, so re-review needed again. @paulidale @mattcaswell |
Calling OPENSSL_init_crypto(0, NULL) is a no-op and will not properly initialize thread local handling. Only the calls that are needed to initialize thread locals are kept, the rest of the no-op calls are removed. Reviewed-by: Paul Dale <pauli@openssl.org> Reviewed-by: Matt Caswell <matt@openssl.org> (Merged from #14497)
It is not needed anymore and it causes leaks because it is called when the FIPS provider libctx is not yet properly set up. Reviewed-by: Paul Dale <pauli@openssl.org> Reviewed-by: Matt Caswell <matt@openssl.org> (Merged from #14497)
There might be more because internal instances of the DRBG might be initialized for the first time and thus self-tested as well. Reviewed-by: Matt Caswell <matt@openssl.org> (Merged from #14497)
|
Merged to master. This hopefully fixed the remaining intermittent failures. Or at least the common ones. |
|
Thank you for the reviews |
This is necessary to avoid leaks on exit when the FIPS provider is first initialized in a thread other than the main one.
Fixes #14437