rand usage regression 2.29.2 -> ~2.30 #496

matthew-l-weber · 2017-08-12T02:58:51Z

New logic was added at configure time, along with new conditional code
in lib/randutils.c, between versions 2.29.2 and >= 2.30. The logic
determines whether the glibc or the syscall API should be used for
rand calls. We have seen this cause problems with a 4.1 kernel
and glibc 2.25. A tool like parted, when run at boot, hangs for
~40x the usual time, and gdb shows it blocking on the getrandom()
call in util-linux, even though an entropy check (fed by a hardware
RNG via rngd) reports adequate entropy before parted is run.

We did notice that if we ran parted under strace and let all of that output hit the console, it didn't block for the full 40 sec before returning. So our theory was that the UART output was generating entropy. We noticed something similar when we enabled networking: the tool returned much faster. So we wonder whether these commits use an API that doesn't benefit from the hardware RNG output, since our setup reported a value of ~3000 when we checked the available entropy.

Reverted the following commits against 2.30.2 for my Buildroot build.
https://git.kernel.org/pub/scm/utils/util-linux/util-linux.git/commit/?h=stable/v2.30&id=b192dd6943e5bb5d2a3773b2c9b06cbd4eb28258
https://git.kernel.org/pub/scm/utils/util-linux/util-linux.git/commit/?h=stable/v2.30&id=cc01c2dca4f62e36505570d5cb15f868aa44bf54

Bubu · 2017-08-12T12:22:55Z

@kerolasa Maybe you have some ideas for this? I recently debugged a similar problem with dbus blocking in libexpat during boot due to a call to getrandom() [1].

[1] https://git.buildroot.net/buildroot/commit/?id=5a5e76381f8b000baa09c902ca89d45725c47f04

karelzak · 2017-08-14T07:43:15Z

getrandom() uses the /dev/urandom pool. The current status of the pool is available in:

  /proc/sys/kernel/random/entropy_avail
  /proc/sys/kernel/random/poolsize
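
For anyone who wants to log these two values from a boot script or a small test tool, here is a minimal sketch in C. The helper names `read_proc_long` and `print_entropy_status` are made up for this example and are not part of util-linux; the only thing taken from the thread is the two procfs paths.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: read one integer from a procfs file.
 * Returns -1 if the file cannot be opened or parsed. */
long read_proc_long(const char *path)
{
    assert(path != NULL);

    FILE *f = fopen(path, "r");
    long value = -1;

    if (!f)
        return -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}

/* Print the two counters mentioned above (-1 means "could not read"). */
void print_entropy_status(void)
{
    printf("entropy_avail: %ld\n",
           read_proc_long("/proc/sys/kernel/random/entropy_avail"));
    printf("poolsize:      %ld\n",
           read_proc_long("/proc/sys/kernel/random/poolsize"));
}
```

Calling `print_entropy_status()` right before the slow tool runs gives a snapshot comparable to the "Entropy Avail" / "Pool Size" numbers quoted later in this thread.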

In util-linux we ask for relatively small amounts of random data. The question is why parted asks so many times.

Anyway, I guess you have to initialize the urandom pool. For example, systemd provides systemd-random-seed.service; do you have this service enabled?

I'll improve util-linux's getrandom() usage. Right now it cannot keep a partially filled buffer and repeat the syscall for the rest. That's a mistake. The old /dev/urandom code was friendlier to the kernel.

The getrandom() does not have to return all requested bytes (missing entropy or when interrupted by signal). The current implementation in util-linux stupidly asks for all random data again, rather than only for missing bytes. The current code also does not care if we repeat our requests for ever; that's bad.

This patch uses the same way as we already use for reading from /dev/urandom. It means:

* repeat getrandom() for only missing bytes
* limit number of unsuccessful request (16 times)
* fallback to /dev/urandom on ENOSYS (old kernel or so...)

Addresses: #496
Signed-off-by: Karel Zak <kzak@redhat.com>
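
The three points in that commit message can be sketched roughly as follows. This is not the actual util-linux patch: the names `fill_random`, `sys_getrandom`, `fill_from_urandom` and `MAX_FAILED_ATTEMPTS` are hypothetical, the flags value of 0 is an assumption, and resetting the failure counter after each successful chunk is one possible reading of "limit number of unsuccessful request". The raw syscall is used so the sketch does not depend on glibc providing a getrandom() wrapper.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_FAILED_ATTEMPTS 16  /* limit named in the commit message */

/* Hypothetical wrapper around the raw syscall; reports ENOSYS when the
 * build headers do not know the syscall number at all. */
ssize_t sys_getrandom(void *buf, size_t len, unsigned int flags)
{
#ifdef SYS_getrandom
    return syscall(SYS_getrandom, buf, len, flags);
#else
    (void)buf; (void)len; (void)flags;
    errno = ENOSYS;
    return -1;
#endif
}

/* Fallback path: read /dev/urandom, asking only for the missing bytes. */
int fill_from_urandom(unsigned char *p, size_t n)
{
    int fd = open("/dev/urandom", O_RDONLY | O_CLOEXEC);
    int failures = 0;

    if (fd < 0)
        return -1;
    while (n > 0 && failures < MAX_FAILED_ATTEMPTS) {
        ssize_t r = read(fd, p, n);
        if (r <= 0) {
            failures++;
            continue;
        }
        p += r;
        n -= (size_t)r;
        failures = 0;
    }
    close(fd);
    return n == 0 ? 0 : -1;
}

/* Fill buf with nbytes of random data; returns 0 on success, -1 on failure. */
int fill_random(void *buf, size_t nbytes)
{
    unsigned char *p = buf;
    size_t n = nbytes;
    int failures = 0;

    assert(buf != NULL || nbytes == 0);

    while (n > 0) {
        ssize_t r = sys_getrandom(p, n, 0);  /* flags = 0 is an assumption */

        if (r > 0) {                 /* partial result: keep it, ask for the rest */
            p += r;
            n -= (size_t)r;
            failures = 0;
            continue;
        }
        if (r < 0 && errno == ENOSYS)    /* old kernel: use /dev/urandom */
            return fill_from_urandom(p, n);
        if (++failures >= MAX_FAILED_ATTEMPTS)
            return -1;               /* give up instead of looping forever */
    }
    return 0;
}
```

Note that a bounded retry loop like this only protects against repeated short or failed calls. It does not help when a single getrandom() call blocks because the kernel pool is not yet initialized.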

karelzak · 2017-08-14T08:54:02Z

@Bubu please, git-pull and test util-linux from the master branch in your environment.

Now the getrandom() code behaves like the old /dev/urandom-based code. So it's friendlier to the kernel and does not repeat unsuccessful requests forever...

matthew-l-weber · 2017-08-14T13:37:39Z

We do start rngd, as our first S00 startup script, before using the parted tool. Below are our status values after it starts. (It is running in the background at this point.)
Entropy Avail [3095]
Pool Size [4096]

With the following commits applied to 2.30.1, I still see the same ~40-50sec delay.
0001-lib-randutils.c-Fall-back-gracefully-when-kernel-doe.patch
0002-lib-randutils.c-More-paranoia-in-getrandom-call.patch
0003-lib-randutils-improve-getrandom-usage.patch

I did notice while testing, though, that things started to work after "random: nonblocking pool is initialized" was printed to the screen (after the delay). It's interesting that the entropy avail value can read high while the pool stays in an uninitialized state for that long. I'll continue to investigate, but any ideas are appreciated.
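
One way to observe this state from userspace, independent of the entropy_avail reading, is to call getrandom() with GRND_NONBLOCK: per the getrandom(2) man page, the call fails with EAGAIN while the urandom source is not yet initialized instead of blocking. A minimal sketch follows; the function name `getrandom_nonblock_status` and its return convention are invented for this example, and the fallback `#define` assumes the flag value from the kernel's `<linux/random.h>`.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef GRND_NONBLOCK
#define GRND_NONBLOCK 0x0001  /* assumed value, as in <linux/random.h> */
#endif

/* Hypothetical probe:
 *   1  -> getrandom() delivered data, so the pool is initialized
 *   0  -> EAGAIN, the pool is not initialized yet (a blocking call would hang)
 *  -1  -> cannot tell (no syscall number, ENOSYS, or another error) */
int getrandom_nonblock_status(void *buf, size_t len)
{
    assert(buf != NULL && len > 0);

#ifdef SYS_getrandom
    long r = syscall(SYS_getrandom, buf, len, GRND_NONBLOCK);

    if (r > 0)
        return 1;
    if (r < 0 && errno == EAGAIN)
        return 0;
#endif
    return -1;
}
```

Polling this from an early init script (for example once per second, asking for a single byte) should show 0 until the kernel prints its "pool is initialized" message and 1 afterwards.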

matthew-l-weber · 2017-08-14T14:18:34Z

Found the issue. I'm using a 4.1 kernel with this bug:
https://www.spinics.net/lists/linux-crypto/msg24584.html

With the kernel patched, I don't require any util-linux patches. Sorry for the confusion on this; a bump to glibc 2.25 plus the util-linux update uncovered this issue in my system.

karelzak · 2017-08-15T09:49:12Z

No problem, the issue forced me to review and improve our getrandom()-based code. So... thanks! ;-)

karelzak closed this as completed Aug 15, 2017

karelzak mentioned this issue Sep 20, 2017

Revert "utils/util-linux: Update to 2.30.1" lede-project/source#1330

Closed
