
Fix the DRBG reseed propagation [1.1.1] #12759

Conversation


@mspncp mspncp commented Aug 31, 2020

Before explaining the issue and the fix, let me recap briefly what reseed propagation is all about. In a nutshell, it is a compatibility feature to support the traditional way of (re-)seeding manually by calling 'RAND_add()' before 'RAND_bytes()'. I'll explain it for the 'public' DRBG; the situation for the 'private' DRBG is analogous.

Reseed Propagation

For performance reasons, every thread has its own 'public' DRBG instance, and all public DRBGs are chained to a common 'primary' DRBG, which seeds from the operating system. The primary DRBG is shared by all threads and guarded by a mutex. Every DRBG reseeds independently when its number of generate requests or its elapsed time interval since the last (re-)seeding exceeds a certain threshold. This is the 'normal' way to reseed according to the NIST standard.

When a thread of the application calls RAND_add() to seed the CSPRNG, it expects an immediate effect on the output of a subsequent RAND_bytes() call within the same thread. However, the former triggers a reseeding of the primary DRBG, whereas the latter generates the output using the thread-local 'public' DRBG. In this specific situation, OpenSSL needs to guarantee an immediate propagation of the reseeding to the 'public' DRBG within the same thread.

The original implementation

Again for performance reasons, the seed propagation was implemented in a non-blocking fashion, i.e. without taking the lock of the primary DRBG: the primary DRBG maintains a reseed_counter (not to be confused with the generate_counter, which counts the generate requests), which is incremented automatically when it reseeds.

```c
if (drbg->reseed_counter > 0 && drbg->parent != NULL) {
    if (drbg->reseed_counter != drbg->parent->reseed_counter)
        reseed_required = 1;
}
```

The secondary DRBG has a copy of the reseed_counter, and whenever the two counters are out of sync, the secondary DRBG will reseed automatically.

```c
if (drbg->reseed_counter > 0) {
    if (drbg->parent == NULL)
        drbg->reseed_counter++;
    else
        drbg->reseed_counter = drbg->parent->reseed_counter;
}
```

In other words, the reseed_counter was used like a revision counter to detect and propagate the reseeding. Note that eventually the reseeding will propagate to the public DRBGs of all threads, but it is not guaranteed that it will happen immediately.

The issue

Part 1: the thread sanitizer warning

The original implementation did not use atomics, because I assumed that integer read and write operations were atomic with relaxed semantics, which would be sufficient for synchronization within the same thread. (In fact, I did not know much about memory ordering and acquire/release/relaxed semantics at that time.) This assumption turned out to be naive when @bernd-edlinger reported a thread sanitizer (TSAN) warning about a data race in RAND_DRBG_generate() in issue #7394:

This was reported by thread sanitizer once (using openssl 1.1.1 release tar ball):

```text
==================
WARNING: ThreadSanitizer: data race (pid=17059)
  Read of size 4 at 0x7b480001e980 by thread T8 (mutexes: write M145, write M33):
    #0 RAND_DRBG_generate crypto/rand/drbg_lib.c:613 (CPPTestServer+0x11295f0)
    #1 RAND_DRBG_bytes crypto/rand/drbg_lib.c:658 (CPPTestServer+0x11295f0)
    #2 drbg_bytes crypto/rand/drbg_lib.c:946 (CPPTestServer+0x11295f0)
    #3 RAND_bytes crypto/rand/rand_lib.c:776 (CPPTestServer+0x112c1f8)
    ...

  Previous write of size 4 at 0x7b480001e980 by thread T27 (mutexes: write M494, write M152, write M158):
    #0 RAND_DRBG_reseed crypto/rand/drbg_lib.c:441 (CPPTestServer+0x1127f1b)
    #1 rand_drbg_restart crypto/rand/drbg_lib.c:542 (CPPTestServer+0x11281c2)
    #2 drbg_add crypto/rand/drbg_lib.c:974 (CPPTestServer+0x11284a7)
    #3 drbg_seed crypto/rand/drbg_lib.c:985 (CPPTestServer+0x11284a7)
    #4 RAND_seed crypto/rand/rand_lib.c:738 (CPPTestServer+0x112bee2)
    ...

  Location is heap block of size 352 at 0x7b480001e900 allocated by main thread:
    ...
    #5 CRYPTO_secure_zalloc crypto/mem_sec.c:145 (CPPTestServer+0x11013d8)
    #6 rand_drbg_new crypto/rand/drbg_lib.c:175 (CPPTestServer+0x11266eb)
    #7 RAND_DRBG_secure_new crypto/rand/drbg_lib.c:243 (CPPTestServer+0x1129bdd)
    #8 drbg_setup crypto/rand/drbg_lib.c:853 (CPPTestServer+0x1129bdd)
    ...

  Mutex M158 (0x7b1000050e80) created at:
    #0 pthread_rwlock_init ../../../../gcc-9-20181007/libsanitizer/tsan/tsan_interceptors.cc:1224 (libtsan.so.0+0x2fd7e)
    #1 CRYPTO_THREAD_lock_new crypto/threads_pthread.c:29 (CPPTestServer+0x1166030)
    #2 rand_drbg_enable_locking crypto/rand/drbg_lib.c:813 (CPPTestServer+0x11290c5)
    #3 drbg_setup crypto/rand/drbg_lib.c:858 (CPPTestServer+0x1129bf1)
    ...

SUMMARY: ThreadSanitizer: data race crypto/rand/drbg_lib.c:613 in RAND_DRBG_generate
==================
```

The data race affects the aforementioned operations happening in different threads:

thread 8, frame 0

```c
if (drbg->reseed_counter > 0 && drbg->parent != NULL) {
    if (drbg->reseed_counter != drbg->parent->reseed_counter)
        reseed_required = 1;
}
```

thread 27, frame 0

```c
if (drbg->reseed_counter > 0) {
    if (drbg->parent == NULL)
        drbg->reseed_counter++;
    else
        drbg->reseed_counter = drbg->parent->reseed_counter;
}
```

It was considered that the data race might be harmless, but it would nevertheless be better to fix the warning (see #7394 (comment) and ff.).

Part 2: another data race

During the analysis of the issue, @bernd-edlinger noticed another potential race

The problem is that parent->reseed_prop_counter is read twice,
once before get_entropy from parent, and once after get_entropy.
So what can happen is, and it will not be fixed by acquire/release semantics:
When we are interrupted after the get_entropy call, when the
parent's mutex is already released, but before parent->reseed_prop_counter
is read for the second time, the dependent DRBG will not have a reason
to call get_entropy again, and that state is persistent.

To fix that, we need to get the parent's reseed_prop_counter before
getting the entropy from the parent. Atomic with relaxed semantic
should be sufficient, because the parent's mutex has the necessary
acquire/release barriers.

The fix

Bernd's pull request #7399 fixed the data races, but at the price that the original simplicity of the implementation got lost, in particular through the introduction of the new 'reseed_next_counter' member. Unfortunately, https://github.com/openssl/openssl/pull/7399/files#diff-9181ac017a6177a5f2619f65c9b7a346R411 also introduced a breaking change of the original semantics of the seed propagation: instead of copying the value from the parent, the secondary DRBGs now increment their counter by themselves. IIUC, a secondary DRBG which is behind the primary DRBG by three counts will reseed three times until it considers itself in-sync with the primary DRBG.

Alternative fix

My alternative proposal intends to fix the data race while retaining the original simplicity.

Note that in the meantime, reseed_counter has been renamed to reseed_prop_counter and generate_counter to reseed_gen_counter (which is another issue to be fixed).

Atomics with relaxed semantics should be sufficient

Since the incrementing of the reseed_counter of the primary DRBG is protected by the lock, my understanding is that it should be sufficient to read the value using an atomic operation with relaxed semantics; see also the comments by @kroeckx and @bernd-edlinger.

#7394 (comment)
#7394 (comment)
#7394 (comment)

See also this discussion between @dot-asm and @bernd-edlinger:

#7399 (comment) and ff.

Simplify the counter wrap

The counter wrap was complicated unnecessarily by the fact that reseed_counter == 0 had the special meaning of turning seed propagation off. I removed that extra complexity by adding a separate enable_reseed_propagation member.

Remove the 'reseed_next_counter'

The 'reseed_next_counter' was introduced to fix a race between two different threads. But it is not the intention of the seed propagation to enforce reseed propagation across threads.

Restore the original names

A lot of additional confusion between the two counters generate_counter and reseed_counter has been created by the renaming to reseed_gen_counter and reseed_prop_counter on master, which subsequently has partially been reverted. This turned out to be a footgun during the provider replumbing (I hope you remember that, @paulidale.) For that reason, I am restoring the original names here on 1.1.1 and I will do the same on master.

What about the master branch?

On the master branch, the implementation has deviated even more from the simple original idea.

```c
/*
 * Counts the number of reseeds since instantiation.
 * This value is ignored if it is zero.
 *
 * This counter is used only for seed propagation from the <master> DRBG
 * to its two children, the <public> and <private> DRBG. This feature is
 * very special and its sole purpose is to ensure that any randomness which
 * is added by PROV_add() or PROV_seed() will have an immediate effect on
 * the output of PROV_bytes() resp. PROV_priv_bytes().
 */
TSAN_QUALIFIER unsigned int reseed_prop_counter;
unsigned int reseed_next_counter;
```

It seems that 'get_parent_reseed_count()' nowadays takes the parent's lock before reading the reseed count, which completely thwarts the original intent of having a lock-free implementation.

```c
if (!drbg_lock_parent(drbg)) {
    ERR_raise(ERR_LIB_PROV, RAND_R_UNABLE_TO_LOCK_PARENT);
    goto err;
}
if (!OSSL_get_OP_rand_get_ctx_params(pfunc)(parent, params)) {
    drbg_unlock_parent(drbg);
    ERR_raise(ERR_LIB_PROV, RAND_R_UNABLE_TO_GET_RESEED_PROP_CTR);
    goto err;
}
drbg_unlock_parent(drbg);
```

I am working on a fix for master which will hopefully be in time for the beta1 freeze. I chose to raise the pull request for 1.1.1 first, because that's the root cause of the problem and it was easier to start with.

I'd be happy to get some feedback not only from @paulidale, but also from @bernd-edlinger and @kroeckx.


@mspncp mspncp commented Sep 1, 2020

The two commits ba5b2dfaa46c and d86bb47aab98 were left separated intentionally to facilitate comparing the old and the new fix. They will be squashed in the end, unless you prefer them to remain separated.

```c
else
    tsan_store(&drbg->reseed_counter,
               tsan_load(&drbg->parent->reseed_counter));
}
```


bernd-edlinger Sep 1, 2020
Member

After all, all what was missing from my original solution was to convert some increments and assignments to atomic operations.

No, that completely misses the point. The race happens here: drbg->parent->reseed_counter can be incremented after get_entropy returns. So the reseed_counter no longer tells you which generation was used above.


mspncp Sep 1, 2020
Author Contributor

I don't care about race conditions between different threads. All I need is that a call to RAND_add() triggers an immediate reseed on a subsequent RAND_bytes() call within the same thread.


mspncp Sep 1, 2020
Author Contributor

And this is what my solution achieves, doesn't it?


bernd-edlinger Sep 1, 2020
Member

Okay, if you don't care about other threads using outdated seed material, then that is fine.


mspncp Sep 1, 2020
Author Contributor

Assume a multithreaded application had called RAND_add() in one of several threads in OpenSSL <= 1.1.0. Then, unless the application provided its own synchronization, there wouldn't have been any guarantee either whether another thread would pick up the fresh entropy or not.


mspncp Sep 1, 2020
Author Contributor

No, I don't think we talk past each other: I understand that the RAND_bytes() call of thread 2 can fail to pick up the entropy provided by thread 1 if it is executed between the reseeding (holding the lock) and the incrementing of the counter (using only atomics). But my point is that thread 2 can also fail to pick up the entropy of thread 1 in RAND_bytes(), if thread 1 is paused in RAND_add() immediately before the reseeding.

Because calling RAND_bytes() and RAND_seed() concurrently from two different threads is inherently a race condition, no matter how you slice it (pun intended). In order to ensure that the entropy of thread 1 is picked up by thread 2 before the latter does something important, you need extra synchronization by the application.


mspncp Sep 1, 2020
Author Contributor

I also understand that in your case the failure is more permanent, because the secondary DRBG will think it's "up to date". But who cares, it is properly seeded and will reseed automatically when its generate counter reaches the reseed interval.


bernd-edlinger Sep 1, 2020
Member

Yes, but as I said: when thread 2 continues to call RAND_bytes(), it will not call get_entropy again, since the reseed_counters agree, but the entropy was from a previous seeding. The reseed interval may be hours...


mspncp Sep 1, 2020
Author Contributor

With due respect, I think now you are missing the point. Your setup is a little bit academic and does not represent a real-life use case: a CTR-DRBG which has been seeded with 256 bits of entropy is considered cryptographically secure for up to 2^48 generate requests before a reseeding is mandated by NIST SP 800-90Ar1. In libcrypto, the threshold for a public DRBG before it reseeds from the primary DRBG is way below that limit, namely 2^16 generate requests or 7 minutes.

https://github.com/openssl/openssl/blob/73bce28b5dcb95e68ac1208d0bf43b3a3bc900d6/crypto/rand/rand_local.h#L30-L34

So even if the thread missed the entropy shower poured by another unrelated thread, that lapse won't last longer than 7 minutes. If you consider that too long, you can lower the threshold as you like.

The main purpose of regular reseeding is to provide a countermeasure against compromise of the DRBG's internal state (e.g. by stealing it or by some side channel attack), because in that case the entire future output of the DRBG becomes fully predictable (though not the past output). To mitigate this problem, the internal state needs to be refreshed periodically. For normal usage of RAND_bytes(), the above thresholds are fully sufficient, and it is irrelevant whether the public DRBG misses some entropy shower of an unrelated thread or not.

Conversely, if you are generating a top secret master key for the next thirty years and you feed extra entropy into the CSPRNG using RAND_add() to improve prediction resistance (the DRBG has better ways to do that), then your entropy dose had better take immediate effect on the subsequent RAND_bytes() call. It's not enough to succeed on the second attempt, because there won't be a second try. If you fail the first attempt, you lose.

OpenSSL 1.1.1 guarantees that the call sequence RAND_add(); RAND_bytes(); continues to work as it did in older versions, but the guarantee holds only within the same thread, for the reasons given above and because this can be implemented in a simple lock-free fashion. And I believe that it is important for the performance of a multithreaded web application to avoid locking the primary DRBG; otherwise it might become a bottleneck.


paulidale Sep 8, 2020
Contributor

An aside: RAND_bytes() locked in 1.0.2 and thus every request would be reseeded after RAND_add(). Such as it was reseeded.

@bernd-edlinger I don't think this change in semantics is a major issue because of the automatic reseeding @mspncp mentioned. Still, it was worthwhile raising it.


@mspncp mspncp commented Sep 1, 2020

(Note: I unresolved the two conversations again because I don't want an essential part of the discussion to be hidden from the reader by GitHub.)


@paulidale paulidale left a comment

LGTM. I thought my head would hurt by looking at this and it does.

crypto/rand/drbg_lib.c
```c
if (drbg->parent == NULL)
    tsan_counter(&drbg->reseed_counter);
else
    tsan_store(&drbg->reseed_counter,
```


paulidale Sep 8, 2020
Contributor

I don't understand why tsan_store is required since only this thread accesses this counter. Still, it's harmless to have it here.


mspncp Sep 9, 2020
Author Contributor

I also pondered whether it is necessary, but then I left it because I had the case in mind where the child DRBG itself is accessed concurrently by "grandchildren" DRBGs.

This does not occur in our standard DRBG triple setup, but might be a setup in some application. (The new EVP_RAND does support chaining by the application, doesn't it?) Of course the application needs to guard concurrent access to the child DRBG (in the middle of the chain) by a lock, but that does not guarantee an atomic write operation of the reseed_counter, if the grandchildren use only atomics with relaxed semantics and not a read lock (of an rwlock). At least that's how I understood @bernd-edlinger's explanations.


paulidale Sep 9, 2020
Contributor

Good point.


@openssl-machine openssl-machine commented Sep 9, 2020

This pull request is ready to merge

mspncp added 2 commits Aug 31, 2020
[will be squashed with the previous commit using the following commit message]

In a nutshell, reseed propagation is a compatibility feature with the sole
purpose to support the traditional way of (re-)seeding manually by calling
'RAND_add()' before 'RAND_bytes()'. It ensures that the former has an immediate
effect on the latter *within the same thread*, but it does not care about
immediate reseed propagation to other threads. The implementation is lock-free,
i.e., it works without taking the lock of the primary DRBG.

Pull request #7399 not only fixed the data race issue #7394 but also changed
the original implementation of the seed propagation unnecessarily.
This commit reverts most of the changes of commit 1f98527 and intends to
fix the data race while retaining the original simplicity of the seed propagation.

- use atomics with relaxed semantics to load and store the seed counter
- add a new member drbg->enable_reseed_propagation to simplify the
  overflow treatment of the seed propagation counter
- don't handle races between different threads

This partially reverts commit 1f98527.
@mspncp mspncp force-pushed the mspncp:pr-drbg-fix-reseed-propagation branch Sep 9, 2020

@mspncp mspncp commented Sep 9, 2020

I had to rebase to resolve two merge conflicts which were caused by whitespace changes (`if(` -> `if (`) inside hunks which this pull request removes (resolution: use incoming). These were the only changes I made.

```text
<<<<<<< variant A
    drbg->reseed_next_counter = tsan_load(&drbg->reseed_prop_counter);
    if (drbg->reseed_next_counter) {
        drbg->reseed_next_counter++;
        if (!drbg->reseed_next_counter)
            drbg->reseed_next_counter = 1;
    }

>>>>>>> variant B
####### Ancestor
    drbg->reseed_next_counter = tsan_load(&drbg->reseed_prop_counter);
    if (drbg->reseed_next_counter) {
        drbg->reseed_next_counter++;
        if(!drbg->reseed_next_counter)
            drbg->reseed_next_counter = 1;
    }

======= end
<<<<<<< variant A

    drbg->reseed_next_counter = tsan_load(&drbg->reseed_prop_counter);
    if (drbg->reseed_next_counter) {
        drbg->reseed_next_counter++;
        if (!drbg->reseed_next_counter)
            drbg->reseed_next_counter = 1;
    }

>>>>>>> variant B
####### Ancestor

    drbg->reseed_next_counter = tsan_load(&drbg->reseed_prop_counter);
    if (drbg->reseed_next_counter) {
        drbg->reseed_next_counter++;
        if(!drbg->reseed_next_counter)
            drbg->reseed_next_counter = 1;
    }

======= end
```

@paulidale paulidale left a comment

👍

mspncp added 2 commits Sep 9, 2020
The original names were more intuitive: the generate_counter counts the
number of generate requests, and the reseed_counter counts the number
of reseedings (of the principal DRBG).

    reseed_gen_counter  -> generate_counter
    reseed_prop_counter -> reseed_counter

This partially reverts commit 35a3450.
@mspncp mspncp force-pushed the mspncp:pr-drbg-fix-reseed-propagation branch to f1a5d7c Sep 9, 2020

@mspncp mspncp commented Sep 9, 2020

The last force-push replaces the last two commits, because I commuted the fixup with the renaming. These are the three changes I made.


@mspncp mspncp commented Sep 9, 2020

Sorry, your approval came too quick :-)


@paulidale paulidale left a comment

SLGTM


@paulidale paulidale commented Sep 9, 2020

I'm okay with the 24-hour wait being considered done at this point. The subsequent changes are trivial or were asked for.


@mspncp mspncp commented Sep 9, 2020

Thanks for the review and the exemption from the grace period. Nevertheless, I'll sleep over it and do the merge tomorrow morning. :-)


@mspncp mspncp commented Sep 9, 2020

I'm so sorry, but I had a last look and found another tiny issue, which is fixed in e0eed1a.


@paulidale paulidale left a comment

Good catch.


@openssl-machine openssl-machine commented Sep 10, 2020

24 hours has passed since 'approval: done' was set, but as this PR has been updated in that time the label 'approval: ready to merge' is not being automatically set. Please review the updates and set the label manually.

openssl-machine pushed a commit that referenced this pull request Sep 10, 2020
In a nutshell, reseed propagation is a compatibility feature with the sole
purpose to support the traditional way of (re-)seeding manually by calling
'RAND_add()' before 'RAND_bytes()'. It ensures that the former has an immediate
effect on the latter *within the same thread*, but it does not care about
immediate reseed propagation to other threads. The implementation is lock-free,
i.e., it works without taking the lock of the primary DRBG.

Pull request #7399 not only fixed the data race issue #7394 but also changed
the original implementation of the seed propagation unnecessarily.
This commit reverts most of the changes of commit 1f98527 and intends to
fix the data race while retaining the original simplicity of the seed propagation.

- use atomics with relaxed semantics to load and store the seed counter
- add a new member drbg->enable_reseed_propagation to simplify the
  overflow treatment of the seed propagation counter
- don't handle races between different threads

This partially reverts commit 1f98527.

Reviewed-by: Paul Dale <paul.dale@oracle.com>
(Merged from #12759)
openssl-machine pushed a commit that referenced this pull request Sep 10, 2020
The original names were more intuitive: the generate_counter counts the
number of generate requests, and the reseed_counter counts the number
of reseedings (of the principal DRBG).

    reseed_gen_counter  -> generate_counter
    reseed_prop_counter -> reseed_counter

This partially reverts commit 35a3450.

Reviewed-by: Paul Dale <paul.dale@oracle.com>
(Merged from #12759)

@mspncp mspncp commented Sep 10, 2020

Merged, thanks for your patient review!

8380f45 Revert two renamings backported from master
958fec7 Fix the DRBG seed propagation

@mspncp mspncp closed this Sep 10, 2020
@mspncp mspncp deleted the mspncp:pr-drbg-fix-reseed-propagation branch Sep 10, 2020