Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not call RAND_get0_public from within the FIPS provider initialization #14497

Closed
wants to merge 5 commits into from

Conversation

@t8m
Copy link
Member

@t8m t8m commented Mar 10, 2021

This is necessary to avoid leaks on exit when the FIPS provider is initialized first in other thread than the main one.

Fixes #14437

Copy link
Contributor

@paulidale paulidale left a comment

Nice find.

@mattcaswell
Copy link
Member

@mattcaswell mattcaswell commented Mar 10, 2021

Hmmm. It feels wrong that we should need to do this. Does this indicate some underlying bug in the init code?

@paulidale
Copy link
Contributor

@paulidale paulidale commented Mar 10, 2021

It could be a bug with the init code, I'm not sure how much effort we need to put into solving it. I don't think that insisting that initialisation takes place single threaded is onerous. It ought to be documented.

We should make an effort to at least understand the problem. Has @t8m done enough on this front?

@t8m
Copy link
Member Author

@t8m t8m commented Mar 10, 2021

Hmmm. It feels wrong that we should need to do this. Does this indicate some underlying bug in the init code?

The leaked data are from the initialization of the FIPS provider invoked from the FIPS provider shared library constructor.

The leak does not happen if FIPS provider is not configured in config file.

So yeah, this PR might be papering over a real issue.

The code path is something like this:

  1. new thread created
  2. it loads default provider
  3. this triggers implicit initialization of libcrypto loading the config file
  4. this triggers loading the FIPS provider as it is configured to be loaded
  5. this runs constructor of the libfips.so
  6. this triggers POST routine which initializes a libctx for the FIPS provider and drbg in it

The data from 6 are leaked.

This does not happen if the step 1 is avoided - i.e., the initial load and implicit initialization is done in the main thread.

@t8m
Copy link
Member Author

@t8m t8m commented Mar 10, 2021

I'll try to look into it some more as I am curious why there is the difference between running this in the main thread and a created thread.

@richsalz
Copy link
Contributor

@richsalz richsalz commented Mar 10, 2021

Something shoudl probably be documented, perhaps in doc/man7/openssl-threads.pod. It's not uncommon for multi-threaded programs to require initialization in the main thread.

@t8m t8m marked this pull request as draft Mar 10, 2021
@t8m
Copy link
Member Author

@t8m t8m commented Mar 10, 2021

Marking as draft for now while investigating some more.

@openssl-machine
Copy link

@openssl-machine openssl-machine commented Mar 11, 2021

24 hours has passed since 'approval: done' was set, but as this PR has been updated in that time the label 'approval: ready to merge' is not being automatically set. Please review the updates and set the label manually.

@t8m
Copy link
Member Author

@t8m t8m commented Mar 11, 2021

My summary of investigation after many debugging runs:

  • The leaked data is just associated with missing deallocation of the public and primary drbgs in the FIPS context.
  • It can be reproduced by explicitly loading the FIPS provider as well (if you delete the activate=1 line from fipsmodule.cnf and load "fips" provider by OSSL_PROVIDER_load in a thread).
  • More interestingly there is even bigger memory leak triggered if I just run RAND_get0_public() in the thread workers.
  • But this memory leak does not happen if I first do OSSL_PROVIDER_load/unload("default") in the thread workers.
  • But it happens again if I do RAND_get0_public() without the OSSL_PROVIDER_load/unload("default") in the main thread
  • But it does not happen if I do RAND_get0_public() without the OSSL_PROVIDER_load/unload("default") in the main thread and not do any RAND_get0_public() in the thread workers.

Nice...

@mattcaswell
Copy link
Member

@mattcaswell mattcaswell commented Mar 11, 2021

So, do you now understand this problem sufficiently well to fix it?

@richsalz
Copy link
Contributor

@richsalz richsalz commented Mar 11, 2021

So, do you now understand this problem sufficiently well to fix it?

Or, perhaps, what to tell app developers on how to minimize the memory leaks or not worry about them.

Calling OPENSSL_init_crypto(0, NULL) is a no-op and will
not properly initialize locking.
@t8m t8m force-pushed the t8m:threadtest-init branch from e54eb0a to 6f493e8 Mar 11, 2021
@t8m
Copy link
Member Author

@t8m t8m commented Mar 11, 2021

So, do you now understand this problem sufficiently well to fix it?

Not yet fully. Unfortunately the leak by RAND_get0_public() without OSSL_PROVIDER_load() is actually a different problem which I fixed in the pushed commit.

Doing the global init is definitely a wrong fix though so I've dropped that.

@t8m t8m removed the approval: done label Mar 11, 2021
It is not needed anymore and it causes leaks because
it is called when the FIPS provider libctx is not yet
properly set up.
@t8m t8m force-pushed the t8m:threadtest-init branch from bab291e to 1c2146c Mar 11, 2021
@t8m t8m changed the title threadstest: explicitly initialize OpenSSL Do not call RAND_get0_public from within the FIPS provider initialization Mar 11, 2021
@t8m t8m marked this pull request as ready for review Mar 11, 2021
@t8m t8m requested review from paulidale and slontis Mar 11, 2021
@t8m
Copy link
Member Author

@t8m t8m commented Mar 11, 2021

This fixes the leak problem. However I suppose the problem would reappear if we decided that we have to always run KATs. The reason for the leak if the RAND_get0_public is called there was that the prov->provctx on the libcrypto side is not yet initialized when the ossl_init_thread_start is called for the ossl_ctx_thread_stop handler on the libctx from within the provider.

It is possible there is some better way how to fix the underlying problem however this at least is not just papering over the issue as the RAND_get0_public() is really not needed there anymore and it solves the leak.

Copy link
Contributor

@paulidale paulidale left a comment

Looks okay.

@slontis
Copy link
Contributor

@slontis slontis commented Mar 11, 2021

The only reason this line was still there was that it was causing errors without it in the tests at some point. It looks like this issue may have shifted now..

@t8m
Copy link
Member Author

@t8m t8m commented Mar 12, 2021

I'm marking this urgent as this fixes the annoying CI failures. However I'd like to ask @mattcaswell to review before I'll merge.

@paulidale
Copy link
Contributor

@paulidale paulidale commented Mar 12, 2021

Urgent is good.

@mattcaswell
Copy link
Member

@mattcaswell mattcaswell commented Mar 12, 2021

I am quite sceptical about the need for these OPENSSL_init_crypto calls in the bio, engine and store code. The OPENSSL_INIT_BASE_ONLY option does 3 things:

  • Creates the init lock
  • Runs OPENSSL_cpuid_setup()
  • Initialises the thread system we use for cleaning up thread local resources

bio/engine/store don't care about the init lock. That's entirely internal to the init code. They don't have any assembler code so shouldn't care about cpuid setup, and they don't use any thread local storage.

The previous calls where 0 was passed as the option did precisely nothing at all. So I'd like to know if anything breaks if you remove those calls altogether.

The calls in the err and rand code do make sense because both of those sub-systems utilise thread local storage.

@t8m
Copy link
Member Author

@t8m t8m commented Mar 12, 2021

OK, I removed those as suggested. @paulidale @mattcaswell please re-review.

Copy link
Member

@mattcaswell mattcaswell left a comment

LGTM. But lets see the CIs go through before merging to make sure the removals haven't had some unexpected side effect.

crypto/store/store_init.c Outdated Show resolved Hide resolved
…rypto

Keeping only the calls that are needed to initialize thread locals and
removing the rest.
Copy link
Member

@mattcaswell mattcaswell left a comment

LGTM. Subject to CIs passing.

@t8m
Copy link
Member Author

@t8m t8m commented Mar 12, 2021

Aargh, the non-caching acvp test failure is actually valid.

@t8m
Copy link
Member Author

@t8m t8m commented Mar 12, 2021

Fortunately this is just wrong expectation in acvp_test and is easily fixable.

There might be more because internal instances of the DRBG
might be initialized for the first time and thus
self-tested as well.
@t8m
Copy link
Member Author

@t8m t8m commented Mar 12, 2021

Heh, so re-review needed again. @paulidale @mattcaswell

openssl-machine pushed a commit that referenced this pull request Mar 12, 2021
Calling OPENSSL_init_crypto(0, NULL) is a no-op and will
not properly initialize thread local handling.

Only the calls that are needed to initialize thread locals
are kept, the rest of the no-op calls are removed.

Reviewed-by: Paul Dale <pauli@openssl.org>
Reviewed-by: Matt Caswell <matt@openssl.org>
(Merged from #14497)
openssl-machine pushed a commit that referenced this pull request Mar 12, 2021
It is not needed anymore and it causes leaks because
it is called when the FIPS provider libctx is not yet
properly set up.

Reviewed-by: Paul Dale <pauli@openssl.org>
Reviewed-by: Matt Caswell <matt@openssl.org>
(Merged from #14497)
openssl-machine pushed a commit that referenced this pull request Mar 12, 2021
There might be more because internal instances of the DRBG
might be initialized for the first time and thus
self-tested as well.

Reviewed-by: Matt Caswell <matt@openssl.org>
(Merged from #14497)
@t8m
Copy link
Member Author

@t8m t8m commented Mar 12, 2021

Merged to master. This hopefully fixed the remaining intermittent failures. Or at least the common ones.

@t8m
Copy link
Member Author

@t8m t8m commented Mar 12, 2021

Thank you for the reviews

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

6 participants