Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Intermittent hangs with 10.16.1 #28932

Closed
ofrobots opened this issue Aug 2, 2019 · 12 comments

Comments

@ofrobots
Copy link
Contributor

commented Aug 2, 2019

Over at Google Cloud we are noticing intermittent hangs on some virtual machines with the latest 10.16.1. This is likely due to the OpenSSL upgrade due to 1.1.1c (see openssl/openssl#9078).

Hangs seem to coincide with when the system has low-entropy and has the following stack:

(gdb) bt
#0  0x00007fb4b2d7403f in __GI___select (nfds=18, readfds=0x7ffdd2bf5c30, writefds=0x0, exceptfds=0x0, timeout=0x0) at ../sysdeps/unix/sysv/linux/select.c:41
#1  0x00000000014d3c4c in rand_pool_acquire_entropy ()
#2  0x00000000014d2058 in rand_drbg_get_entropy ()
#3  0x00000000014cf5c4 in RAND_DRBG_instantiate ()
#4  0x00000000014cf8fe in drbg_setup ()
#5  0x00000000014cfbc7 in do_rand_drbg_init_ossl_ ()
#6  0x00007fb4b305d827 in __pthread_once_slow (once_control=0x2667838 <rand_drbg_init>, init_routine=0x14cfb80 <do_rand_drbg_init_ossl_>) at pthread_once.c:116
#7  0x00000000014febe9 in CRYPTO_THREAD_run_once ()
#8  0x00000000014ceb66 in drbg_status ()
#9  0x0000000000a104a0 in node::crypto::EntropySource(unsigned char*, unsigned long) ()
#10 0x00000000017aa753 in v8::base::RandomNumberGenerator::RandomNumberGenerator() ()
#11 0x00000000017a9901 in v8::base::CallOnceImpl(long*, std::function<void ()>) ()
#12 0x00000000017ac19f in v8::base::OS::GetRandomMmapAddr() ()
#13 0x0000000000af7947 in v8::internal::GetRandomMmapAddr() ()
#14 0x0000000000f037af in v8::internal::Heap::SetUp() ()
#15 0x0000000000fb6713 in v8::internal::Isolate::Init(v8::internal::StartupDeserializer*) ()
#16 0x00000000011ffd3a in v8::internal::Snapshot::Initialize(v8::internal::Isolate*) ()
#17 0x0000000000b14988 in v8::Isolate::Initialize(v8::Isolate*, v8::Isolate::CreateParams const&) ()
#18 0x0000000000b149da in v8::Isolate::New(v8::Isolate::CreateParams const&) ()
#19 0x0000000000902954 in node::Start(int, char**) ()
#20 0x00007fb4b2c7eb97 in __libc_start_main (main=0x8bbd50 <main>, argc=2, argv=0x7ffdd2bf6788, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffdd2bf6778) at ../csu/libc-start.c:310
#21 0x00000000008bbe85 in _start ()

@addaleax addaleax added the openssl label Aug 2, 2019

@ofrobots

This comment has been minimized.

Copy link
Contributor Author

commented Aug 2, 2019

Here's my experiment on some GCE virtual machines:

for i in $(seq 1 50) ; do echo run $i; cat /proc/sys/kernel/random/entropy_avail ; time npm >/dev/null; sleep 1; done
...
172

real	0m0.257s
user	0m0.222s
sys	0m0.032s
run 6
173

real	0m0.264s
user	0m0.216s
sys	0m0.044s
run 7
5
^C
real	33m50.490s
...

Note the entropy value of 5 at the time of the hang.

@addaleax

This comment has been minimized.

Copy link
Member

commented Aug 2, 2019

I think this is openssl/openssl#9078 / openssl/openssl#9084, and cherry-picking the latter would resolve this?

@ofrobots

This comment has been minimized.

Copy link
Contributor Author

commented Aug 2, 2019

I believe so. I will try this out in the failing environment tomorrow.

@targos

This comment has been minimized.

Copy link
Member

commented Aug 2, 2019

I'm pretty sure I already saw an issue here about this a few days ago.

Edit: forget it, it was before 10.16.1

@MylesBorins

This comment has been minimized.

Copy link
Member

commented Aug 5, 2019

Should we escalate a 10.16.2 to fix this? @ofrobots is there an easy way to test if I can get an rc together?

@MylesBorins

This comment has been minimized.

Copy link
Member

commented Aug 5, 2019

ofrobots added a commit to ofrobots/node that referenced this issue Aug 5, 2019

deps:openssl: cherry-pick c19c5a6 from upstream
Original commit message:
    Revert the DEVRANDOM_WAIT feature

    The DEVRANDOM_WAIT feature added a select() call to wait for the
    `/dev/random` device to become readable before reading from the
    `/dev/urandom` device. It was introduced in commit 38023b8
    in order to mitigate the fact that the `/dev/urandom` device
    does not block until the initial seeding of the kernel CSPRNG
    has completed, contrary to the behaviour of the `getrandom()`
    system call.

    It turned out that this change had negative side effects on the
    performance which were not acceptable. After some discussion it
    was decided to revert this feature and leave it up to the OS
    resp. the platform maintainer to ensure a proper initialization
    during early boot time.

    Fixes 9078

    This partially reverts commit 38023b8.

Refs: openssl/openssl#9084
Fixes: nodejs#28932
@ofrobots

This comment has been minimized.

Copy link
Contributor Author

commented Aug 5, 2019

BethGriggs added a commit that referenced this issue Aug 6, 2019

deps: cherry-pick c19c5a6 from openssl upstream
Original commit message:
    Revert the DEVRANDOM_WAIT feature

    The DEVRANDOM_WAIT feature added a select() call to wait for the
    `/dev/random` device to become readable before reading from the
    `/dev/urandom` device. It was introduced in commit 38023b8
    in order to mitigate the fact that the `/dev/urandom` device
    does not block until the initial seeding of the kernel CSPRNG
    has completed, contrary to the behaviour of the `getrandom()`
    system call.

    It turned out that this change had negative side effects on the
    performance which were not acceptable. After some discussion it
    was decided to revert this feature and leave it up to the OS
    resp. the platform maintainer to ensure a proper initialization
    during early boot time.

    Fixes 9078

    This partially reverts commit 38023b8.

Refs: openssl/openssl#9084
Fixes: #28932

PR-URL: #28983
Reviewed-By: Shigeki Ohtsu <ohtsu@ohtsu.org>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Beth Griggs <Bethany.Griggs@uk.ibm.com>
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>

BethGriggs added a commit that referenced this issue Aug 6, 2019

2019-08-06, Version 10.16.2 'Dubnium' (LTS)
Notable changes:

This release patches a regression in the OpenSSL upgrade to 1.1.1c that
 causes intermittent hangs in machines that have low entropy.

Refs: #28932

BethGriggs added a commit that referenced this issue Aug 6, 2019

2019-08-06, Version 10.16.2 'Dubnium' (LTS)
Notable changes:

This release patches a regression in the OpenSSL upgrade to 1.1.1c that
 causes intermittent hangs in machines that have low entropy.

Refs: #28932

PR-URL: #29021

BethGriggs added a commit that referenced this issue Aug 6, 2019

2019-08-06, Version 10.16.2 'Dubnium' (LTS)
Notable changes:

This release patches a regression in the OpenSSL upgrade to 1.1.1c that
 causes intermittent hangs in machines that have low entropy.

Refs: #28932

PR-URL: #29021
@BethGriggs

This comment has been minimized.

Copy link
Member

commented Aug 7, 2019

This should be fixed with the v10.16.2 release, please reopen if not.

@BethGriggs BethGriggs closed this Aug 7, 2019

@bernd-edlinger

This comment has been minimized.

Copy link

commented Aug 18, 2019

@ofrobots I am curious which linux version did you use where the available entropy
dropped to zero again?

@ofrobots

This comment has been minimized.

Copy link
Contributor Author

commented Aug 19, 2019

This was a Ubuntu instance IIRC.

@bernd-edlinger

This comment has been minimized.

Copy link

commented Aug 19, 2019

I ask because the code with the code with the select should not be executed unless the
linux kernel is older than v3.17, that is rather old, ubuntu 14.04 uses linux v3.13 for instance,
but that is rather old, could you log in and try cat /proc/version, what does that look like?

@bernd-edlinger

This comment has been minimized.

Copy link

commented Aug 19, 2019

The other way round I observed a similar behaviour that you described, when
linux v4.8 or newer is used in an arm machine, and every 3-5 minutes when
the CRNG gets reseeded it pulls all entroy from the input pool.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants
You can’t perform that action at this time.