Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance issues with OpenSSL 1.1.1c caused by DEVRANDOM_WAIT #9078

Closed
mspncp opened this issue Jun 4, 2019 · 66 comments
Closed

Performance issues with OpenSSL 1.1.1c caused by DEVRANDOM_WAIT #9078

mspncp opened this issue Jun 4, 2019 · 66 comments
Labels
branch: master Merge to master branch branch: 1.1.1 Merge to OpenSSL_1_1_1-stable branch

Comments

@mspncp
Copy link
Contributor

mspncp commented Jun 4, 2019

In pull request #8251, a select() call on /dev/random was added before reading from /dev/urandom. Let's call it the DEVRANDOM_WAIT feature, named after the preprocessor macro involved.

#ifdef DEVRANDOM_WAIT
static int wait_done = 0;
/*
* On some implementations reading from /dev/urandom is possible
* before it is initialized. Therefore we wait for /dev/random
* to be readable to make sure /dev/urandom is initialized.
*/
if (!wait_done && bytes_needed > 0) {
int f = open(DEVRANDOM_WAIT, O_RDONLY);
if (f >= 0) {
fd_set fds;
FD_ZERO(&fds);
FD_SET(f, &fds);
while (select(f+1, &fds, NULL, NULL, NULL) < 0
&& errno == EINTR);
close(f);
}
wait_done = 1;
}
#endif

The intention of this change was to mitigate a weakness of the /dev/urandom device, namely that it does not block until being initially seeded, contrary to the getrandom() system call, see issue #8215. Note that this change does not affect the seeding using the getrandom() source, which is the preferred method on newer operating systems.

There was a some discussion on the issue thread before PR #8251 was raised. While people unanimously agreed that using getentropy() is fine, because it blocks at boot time (see also @kroeckx's comment) until the kernel CSPRNG is properly seeded, there was some dissent about whether the /dev/urandom weakness should be addressed by OpenSSL.

@t8m and I agreed that it is actually the business of the OS (resp. its init system) to take care of this and not OpenSSL's business. (see #8215 (comment) and #8215 (comment)). Pull request #8251 emerged as a compromise; the idea was to wait for /dev/random to become readable (without reading any bytes) before reading from /dev/urandom. Nobody protested anymore, except for @t8m (here and here), who was no committer yet at that time, and his protests went unnoticed.

However, there is a recent issue report on openssl-users that this change introduced a regression and has unwanted performance impacts.

I built OpenSSL 1.1.1c from the recent release, but have noticed what
seems like a significant performance drop compared with 1.1.1b. I
notice this when starting lighttpd. With 1.1.1b, lighttpd starts in a
few seconds, but with 1.1.1c, it takes several minutes.

I think I have tracked down the change in 1.1.1c that is causing this.
It is the addition of the DEVRANDOM_WAIT functionality for linux in
e_os.h and crypto/rand/rand_unix.c. lighttpd (libcrypto) is waiting in
a select() call on /dev/random. After this eventually wakes up, it then
reads from /dev/urandom. OpenSSL 1.1.1b did not do this, but instead
just read from /dev/urandom. Is there more information about this
change (i.e., a rationale)? I did not see anything in the CHANGES file
about it.

Jay

I think we need to reconsider pull request #8251 and decide whether it was really good idea to fix a problem of the kernel resp. the init system in the OpenSSL library, or whether the negative side-effects outweigh the benefits. Currently, I'm leaning towards reverting the DEVRANDOM_WAIT commits on master and 1.1.1.

What's your opinion @openssl/committers?

@mspncp mspncp added branch: 1.1.1 Merge to OpenSSL_1_1_1-stable branch branch: master Merge to master branch labels Jun 4, 2019
@mspncp
Copy link
Contributor Author

mspncp commented Jun 4, 2019

Side Note: @t8m noticed that this feature can be disabled by the following 'hack': just add an explicit -DDEVRANDOM=... compiler argument, then the following conditional code will not be compiled and DEVRANDOM_WAIT will be undefined.

openssl/e_os.h

Lines 25 to 34 in f308fa2

# ifndef DEVRANDOM
/*
* set this to a comma-separated list of 'random' device files to try out. By
* default, we will try to read at least one of these files
*/
# define DEVRANDOM "/dev/urandom", "/dev/random", "/dev/hwrng", "/dev/srandom"
# ifdef __linux
# define DEVRANDOM_WAIT "/dev/random"
# endif
# endif

@mspncp
Copy link
Contributor Author

mspncp commented Jun 4, 2019

I pinged the original reporter on openssl-users and hope he will join our discussion here.

@t8m
Copy link
Member

t8m commented Jun 4, 2019

So my opinion is of course still the same, i.e. we should not attempt to outsmart the kernel and/or init system.

@richsalz
Copy link
Contributor

richsalz commented Jun 4, 2019

I understand not trying to outsmart the local platform, but most OpenSSL users are naive about cryptography, and especially about CSPRNG seeding. We should avoid simple cases where they can make mistakes that result in guessable key material, as in https://crypto.stanford.edu/RealWorldCrypto/slides/nadia.pdf et al.

@kroeckx
Copy link
Member

kroeckx commented Jun 4, 2019 via email

@mspncp
Copy link
Contributor Author

mspncp commented Jun 4, 2019

Switching to getentropy() also had the effect that people got long
delays during boot or starting services the first time. So we're
now in the situation that using getentropy() or /dev/urandom
will both make us wait until the kernel tells it's ready.

There is one important difference between getentropy() and the DEVRANDOM_WAIT method IMO: with the former, you only wait once in the early boot phase, until the initial seeding has completed, while the select() call can block at any time afterwards if the entropy_avail count goes down to zero. So you might not be starving other processes, because you don't actually read from /dev/random, but you might be starved by other processes hogging the entropy count of the random pool.

@t-j-h
Copy link
Member

t-j-h commented Jun 4, 2019

I would revert the commit. @tm8 is correct in that this isn't where we should be trying to fix things.

@richsalz
Copy link
Contributor

richsalz commented Jun 4, 2019 via email

@kroeckx
Copy link
Member

kroeckx commented Jun 4, 2019 via email

@richsalz
Copy link
Contributor

richsalz commented Jun 5, 2019

Okay, so tell me what Apache's mod_ssl should do that's portable for all the OpenSSL versions that it supports. And also what should OpenSSL do when running on an old kernel, for unaware developers porting packages?

Right now, OpenSSL is safe but maybe slow to startup, and a careful consumer can rebuild openssl to not be slow. That seems to me to be exactly the right trade-off.

Remind me again why you wanted RSA 2K keys as the minimum, Kurt? :)

@paulidale
Copy link
Contributor

I think OpenSSL should be attempting to ensure good entropy.

Ideally the kernel/OS would but not all do, so having a fallback of libcrypto trying seems prudent.

There is one important difference between getentropy() and the DEVRANDOM_WAIT method IMO: with the former, you only wait once in the early boot phase, until the initial seeding has completed, while the select() call can block at any time afterwards if the entropy_avail count goes down to zero. So you might not be starving other processes, because you don't actually read from /dev/random, but you might be starved by other processes hogging the entropy count of the random pool.

Adding a flag to only wait once would avoid this difference. Once seeded /dev/urandom should be good essentially forever. Instead of a flag it could be in a run once RNG code path.

@vdukhovni
Copy link

We can try to make things as simple as possible for the user, BUT no simpler.
This change crossed the boundary and needs to be reverted. Ensuring early boot entropy is an O/S platform task, OpenSSL cannot solve this problem, and must not try.

@mspncp
Copy link
Contributor Author

mspncp commented Jun 5, 2019

Adding a flag to only wait once would avoid this difference. Once seeded /dev/urandom should be good essentially forever. Instead of a flag it could be in a run once RNG code path.

There is already such a flag, it is called wait_done.

if (!wait_done && bytes_needed > 0) {
int f = open(DEVRANDOM_WAIT, O_RDONLY);

Maybe my #9078 (comment) gave the wrong impression that forking processes would be waiting several times. Actually, we don't have that much information about the exact circumstances, apart form the facts cited from the original report.

I built OpenSSL 1.1.1c from the recent release, but have noticed what
seems like a significant performance drop compared with 1.1.1b. I
notice this when starting lighttpd. With 1.1.1b, lighttpd starts in a
few seconds, but with 1.1.1c, it takes several minutes.

In particular, we don't know whether the massive delay occurs only at early boot time or always. And no information about OS/kernel version. I hope the original reporter shows up and provides more detail.

@paulidale
Copy link
Contributor

@mspncp I should have looked at the code :(

@paulidale
Copy link
Contributor

-1 (vote forcing). This one might come down to an OMC vote since it seems two in favour and two agin at present.

@paulidale
Copy link
Contributor

I also wonder what happens in FIPS mode where /dev/urandom is an acceptable entropy source only if the vendor must first demonstrate that the initial call (...) to /dev/urandom returns the claimed amount of entropy.

I'm not sure how this can be done generally. Refer to IG 7.15.

@vdukhovni
Copy link

Best I can determine, select on the Linux /dev/random will block any time there is insufficient "fresh entropy" in the kernel entropy pool, not just initially at boot. While OpenSSL will block at most once per process, the delay is unbounded, unpredictable and unacceptable. If /dev/random were to return "ready" for select even though a read(2) would block for lack of entropy, that would be a kernel bug.

The original PR must be reverted. If someone wants to block on /dev/random, they'll have to do that explicitly at the application layer. Perhaps ssh-keygen and/or openssl genpkey could elect to do that, but this must not be inflicted on unsuspecting applications by default.

@paulidale
Copy link
Contributor

I agree that select(2) on /dev/random isn't ideal.
What other metric can we use to determine that /dev/urandom is adequately seeded?

@mspncp
Copy link
Contributor Author

mspncp commented Jun 5, 2019

-1 (vote forcing). This one might come down to an OMC vote since it seems two in favour and two agin at present.

I agree. Given the current progress of the discussion, an OMC vote seems inevitable. If there is no hurry, I would prefer however that we collect more information about the case first. I just sent another invitation to the reporter on the mailing list.

@mspncp
Copy link
Contributor Author

mspncp commented Jun 5, 2019

The original PR must be reverted.

Commit cff8ff87ad76b93874922dd8c99f5f0fbd09e757 actually contains two parts and only the first part needs to be reverted IMO:

1) Use select to wait for /dev/random in readable state,
   but do not actually read anything from /dev/random,
   use /dev/urandom first.

2) Use linux define __NR_getrandom instead of the
   glibc define SYS_getrandom, in case the kernel headers
   are more current than the glibc headers.

(For the second part, see also #8215 (comment) and #8215 (comment)).

@paulidale
Copy link
Contributor

I'm not aware that calling a vote requires a strict timeline (I could easily be wrong).

I'd also prefer to see a better method to ensure that /dev/urandom has been seeded. I'm not aware of one at this point.

@mspncp
Copy link
Contributor Author

mspncp commented Jun 5, 2019

Best I can determine, select on the Linux /dev/random will block any time there is insufficient "fresh entropy" in the kernel entropy pool, not just initially at boot.

FWIW: The draining effect is a fact which actually can be verified by simple experiments:

In one terminal, run

watch -n 1 cat /proc/sys/kernel/random/entropy_avail

and in the other terminal, run either

dd if=/dev/urandom of=/dev/null bs=64 count=1

or

dd if=/dev/random of=/dev/null bs=64 count=1

several times. Reading from /dev/urandom does not affect the entropy count of the random pool, while reading from /dev/random drains it, such that the read finally blocks. The DEVRANDOM_WAIT case could equally be simulated using a small C program that selects on /dev/random before reading from /dev/urandom. But I havn't done it yet.

@mspncp
Copy link
Contributor Author

mspncp commented Jun 5, 2019

I'm not aware that calling a vote requires a strict timeline (I could easily be wrong).

Adding a '-1' of course doesn't. But an official announcement of the vote on the mailing list would include a timeline. I just attempted to preempt the latter.

@vdukhovni
Copy link

  1. Use select to wait for /dev/random in readable state,
    but do not actually read anything from /dev/random,
    use /dev/urandom first.

  2. Use linux define __NR_getrandom instead of the
    glibc define SYS_getrandom, in case the kernel headers
    are more current than the glibc headers.

Yes, of course, we don't need to revert part 2, that's a feature, not a bug. Many Linux systems have kernels that are substantially newer than the userland.

@slontis
Copy link
Member

slontis commented Jun 5, 2019 via email

@mspncp
Copy link
Contributor Author

mspncp commented Jun 5, 2019

It may be required for FIPS_MODE,

If I understood Pauli correctly, the FIPS lab is perfectly happy with getrandom(), which blocks until the kernel CSPRNG is initially seeded, but not afterwards anymore. So we could simply make it a minimum requirement for FIPS mode, that the kernel is new enough to support it, instead of worrying about the quirks of /dev/urandom.

@vdukhovni
Copy link

Perhaps FIPS-mode applications deserve all the punishment they signed up for. :-) If no other options are available the FIPS RNG can wait for /dev/random if required. FIPS aside, let's revert "part 1".

@mspncp
Copy link
Contributor Author

mspncp commented Jun 5, 2019

It may be required for FIPS_MODE,

If at some time in the (hopefully near) future we finally have the ability to configure our seed sources, the DEVRANDOM_WAIT feature could be reintroduced as an optional feature which is disabled by default:

rand_seed=getrandom,devurandom_wait,devrandom

@mspncp
Copy link
Contributor Author

mspncp commented Jun 5, 2019

No problem. It's always a pleasure to fight a battle of words with you ;-)

And once again I'm amazed how much emotions such a dry subject like a random generator can raise... :-P

@hyc
Copy link

hyc commented Jun 5, 2019

If you revert this wait, you will flunk most commercial security audits.
https://paragonie.com/blog/2016/05/how-generate-secure-random-numbers-in-various-programming-languages

I'm not saying they're correct. Just saying, some time back, "everyone" agreed this was the right thing to do. Fwiw I never agreed with it back then either. Just be aware that you will meet political resistance to a change like this.

@vdukhovni
Copy link

Just be aware that you will meet political resistance to a change like this.

That's OK, asbestos underwear is on. Reverting to 1.1.1 as released originally and 1.1.1[ab] is quite defensible.

@richsalz
Copy link
Contributor

richsalz commented Jun 5, 2019

So @mspncp, in 1.1.1c things got more secure and this removes that putting the onus on the end-user or application. So, I guess I was right?

@mspncp
Copy link
Contributor Author

mspncp commented Jun 5, 2019

Sigh... I'm exhausted. You are wasting your energy on me, in the end it will be the OMC which makes the decisions.

Apropos energy: your energy would probably be better spent convincing Theodore Ts'o to rethink his decision not to block /dev/urandom during early boot in order not to break userland programs (see his commit message c6e9d6f38894). This is the right place were you could advertise your ideas to make /dev/urandom more secure. ;-)

@mspncp
Copy link
Contributor Author

mspncp commented Jun 5, 2019

But we are turning in circles. The kernel already has been improved and people just need to move on to using getrandom() instead of /dev/urandom. Which is what OpenSSL did.

@mspncp mspncp closed this as completed Jun 5, 2019
@mspncp mspncp reopened this Jun 5, 2019
@richsalz
Copy link
Contributor

richsalz commented Jun 5, 2019

I'll stop; I certainly don't want to (heh) exhaust a vital resource like yourself.

"Just upgrade kernels" isn't always feasible, and I think backing out of a security improvement is premature. Shrug. YMMV.

mspncp added a commit to mspncp/openssl that referenced this issue Jun 6, 2019
The DEVRANDOM_WAIT feature added a select() call to wait for the
`/dev/random` device to become readable before reading from the
`/dev/urandom` device. It was introduced in commit 38023b8
in order to mitigate the fact that the `/dev/urandom` device
does not block until the initial seeding of the kernel CSPRNG
has completed, contrary to the behaviour of the `getrandom()`
system call.

It turned out that this change had negative side effects on the
performance which were not acceptable. After some discussion it
was decided to revert this feature and leave it up to the OS
resp. the platform maintainer to ensure a proper initialization
during early boot time.

Fixes openssl#9078

This partially reverts commit 38023b8.
@t8m
Copy link
Member

t8m commented Jun 6, 2019

Apropos energy: your energy would probably be better spent convincing Theodore Ts'o to rethink his decision not to block /dev/urandom during early boot in order not to break userland programs (see his commit message c6e9d6f38894). This is the right place were you could advertise your ideas to make /dev/urandom more secure. ;-)

Nope, this is not a good use of anyone's energy either :), find another thing to spend the energy on. The non-blockability of /dev/urandom on Linux is set in stone and it will not change. And moreover even if the iface was not set in stone it does not make any sense to change it as everyone should be moving to getrandom() anyway which has the desired property and also more advantages such as not needing a file descriptor and a device on system.

@mspncp
Copy link
Contributor Author

mspncp commented Jun 6, 2019

Nope, this is not a good use of anyone's energy either :)

This occurred to me too after writing it. That's why I posted #9078 (comment).

mspncp added a commit to mspncp/openssl that referenced this issue Jun 9, 2019
The DEVRANDOM_WAIT feature added a select() call to wait for the
`/dev/random` device to become readable before reading from the
`/dev/urandom` device. It was introduced in commit 38023b8
in order to mitigate the fact that the `/dev/urandom` device
does not block until the initial seeding of the kernel CSPRNG
has completed, contrary to the behaviour of the `getrandom()`
system call.

It turned out that this change had negative side effects on the
performance which were not acceptable. After some discussion it
was decided to revert this feature and leave it up to the OS
resp. the platform maintainer to ensure a proper initialization
during early boot time.

Fixes openssl#9078

This partially reverts commit 38023b8.

(cherry picked from commit c19c5a6)
@levitte levitte closed this as completed in a08714e Jun 9, 2019
levitte pushed a commit that referenced this issue Jun 9, 2019
The DEVRANDOM_WAIT feature added a select() call to wait for the
`/dev/random` device to become readable before reading from the
`/dev/urandom` device. It was introduced in commit 38023b8
in order to mitigate the fact that the `/dev/urandom` device
does not block until the initial seeding of the kernel CSPRNG
has completed, contrary to the behaviour of the `getrandom()`
system call.

It turned out that this change had negative side effects on
performance which were not acceptable. After some discussion it
was decided to revert this feature and leave it up to the OS
resp. the platform maintainer to ensure a proper initialization
during early boot time.

Fixes #9078

This partially reverts commit 38023b8.

Reviewed-by: Tim Hudson <tjh@openssl.org>
Reviewed-by: Viktor Dukhovni <viktor@openssl.org>

(cherry picked from commit a08714e)

(Merged from #9118)
@aalexand
Copy link

aalexand commented Aug 1, 2019

As a data point, I've spent good amount of time chasing sudden hangs of npm when running in a virtual machine on Google Cloud. It turned out to be related to upgrade of Node version to 10.16.1 which upgrades to OpenSSL 1.1.1c and this issue apparently is the cause - virtual machines have less entropy available in them I think.

@bernd-edlinger
Copy link
Member

which linux version is that?

@kroeckx
Copy link
Member

kroeckx commented Aug 2, 2019 via email

@bernd-edlinger
Copy link
Member

I think the described behaviour can most likely occur when a recent linux version is used
but the cross compiler does not have a define for __NR_getrandom under low entropy.
Then the all entropy from the input pool is periodically transferred to the CRNG.
But in such systems the /dev/random is not supposed to be used.
Can someone confirm if I am right with that theory ?
I would need the linux version and if __NR_getrandom is available when compiling OpenSSL.

@jasonster17
Copy link

When was this issue manifesting, for TCP connections? Was the delay only happening on the handshake process? Or did it occur repeatedly on data transfer?

@paulidale
Copy link
Contributor

Library startup.

@jasonster17
Copy link

Library startup.

Thanks for the info!

@guozhaohui
Copy link

guozhaohui commented Nov 12, 2020

I can still confirm those DEVRANDOM_WAIT related implementation in ver.1.1.1d and face the performance issue, i.e. my program takes several minutes to start.

By read this long discussion, I found that this performance issues was solved by the following commit.
commit ad416c8
Author: Dr. Matthias St. Pierre Matthias.St.Pierre@ncp-e.com
Date: Wed Jun 5 11:09:46 2019 +0200

Revert the DEVRANDOM_WAIT feature

But after checked the log, I found that the revert was restored by the following commit.
commit 3ff98f5
Author: Pauli paul.dale@oracle.com
Date: Tue Aug 20 16:19:20 2019 +1000

Start up DEVRANDOM entropy improvement for older Linux devices.

@paulidale
Copy link
Contributor

This isn't so much a performance issue, it is a security issue.

High quality random numbers are critical to maintain security. What the waiting is doing is getting such high quality randomness. Not waiting means you won't have as much (or often any) level of security.

@t8m
Copy link
Member

t8m commented Nov 12, 2020

It was always a little bit controversial topic. In my opinion having the wait does not belong to openssl library itself but should be handled by external ways. In case the shared memory marker does not work for some reason openssl will always unnecessarily wait although the system RNG was already fully seeded with entropy. But yes, having the kernel RNG sufficiently seeded with quality entropy on the system boot is absolutely critical to security.

@paulidale
Copy link
Contributor

In most situations, OpenSSL won't wait even if the shared memory marker doesn't work.

Depending on the kernel and operating system: select(/dev/random) will succeed immediately and read(/dev/random, 1) will consume one byte. Once everything is properly seeded of course.

@t8m
Copy link
Member

t8m commented Nov 12, 2020

The read(/dev/random, 1) will consume all the available random bytes quickly if openssl is started many times at once. Unless you have rngd/jitterentropy or some HW entropy source.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch: master Merge to master branch branch: 1.1.1 Merge to OpenSSL_1_1_1-stable branch
Projects
None yet
Development

Successfully merging a pull request may close this issue.