Introduce a random sleep in the Netdata updater #9079

prologic · 2020-05-18T10:59:15Z

Summary

ssia

Component Name

area/packaging

Test Plan

Tested the shell expression:

sleep $(((RANDOM % 10) + 1))s

Additional Information

This will hopefully aliviate the stampede like effect we're seeing in Netdata
Cloud Production where we see spikes of disconnections and reconenctions.
The timing lines up with the default Debian crontab entry for "daily" jobs
and thus adding a random sleep into the updater should help spread this out
a bit and reduce load on our system9s) in short intervals.

squash-labs · 2020-05-18T10:59:23Z

Manage this branch in Squash

Test this branch here: https://prologicrandom-sleep-updater-mobde.squash.io

cakrit · 2020-05-18T18:13:02Z

The sleep command is too early in the script, there are a couple of possible failure conditions that would prevent installation further down. It should probably go before the actual running of the installation code, after the checksum validation.

This is a bad hack, I don't like it at all. Have we really exhausted all other options and have to resort to this?

Ferroin · 2020-05-18T18:40:36Z

This is a bad hack, I don't like it at all. Have we really exhausted all other options and have to resort to this?

There is zero functionality to provide this in any of the task schedulers we do or could use for our updater. The expectation is that if a job needs random jitter like this, it will handle it itself.. I'm not hugely fond of it myself (it should ideally be set up so it only runs when run non-interactively, and I'd go with a much longer range (more like 5-10 minutes, at least)), but it's better than nothing.

Aside from the stampede effect we're seeing in Netdata Cloud due to updates, it's also good practice to help avoid putting undue load on our update infrastructure.

Also, this is generally expected behavior for automated update systems. See for example FreeBSD's update tool (it refuses to run non-interactively without special handling, and the cron job mode sleeps a random number of seconds up to a full hour before actually doing it's job).

cakrit · 2020-05-18T18:45:35Z

Fine, go ahead. I see we are trying to spread out the downloads themselves as well, so the sleep can't go much later than where it is...

packaging/installer/netdata-updater.sh

prologic · 2020-05-26T00:49:42Z

Sanity check before merging...

root@ubuntu2004:~# tail -f /var/log/syslog
May 26 00:45:01 ubuntu2004 CRON[81223]: (root) CMD (/usr/libexec/netdata/netdata-updater.sh)
root@ubuntu2004:~# ps aux | grep netdata-updater
root       81223  0.0  0.0   2608   544 ?        Ss   00:45   0:00 /bin/sh -c /usr/libexec/netdata/netdata-updater.sh
root       81224  0.0  0.3   7024  3412 ?        S    00:45   0:00 bash /usr/libexec/netdata/netdata-updater.sh
root       81227  0.0  0.0   6436   740 pts/0    S+   00:45   0:00 grep --color=auto netdata-updater
root@ubuntu2004:~# strace -p 81224
strace: Process 81224 attached
wait4(-1, ^Cstrace: Process 81224 detached
 <detached ...>
root@ubuntu2004:~# ps aux | grep sleep
root       81225  0.0  0.0   5476   592 ?        S    00:45   0:00 sleep 336s
root       81234  0.0  0.0   6432   724 pts/0    S+   00:46   0:00 grep --color=auto sleep
root@ubuntu2004:~# pstree -p 81223
sh(81223)───bash(81224)───sleep(81225)

Behaving as intended via CronD.

…etween 30m-60m

packaging/installer/netdata-updater.sh

prologic · 2020-05-26T01:36:50Z

Travis CI failures are becuase of:

00:00:49.878291  Running BATS file..
00:00:49.896820  1..3
00:02:22.894917  ok 1 install stable netdata using kickstart
No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.
Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#build-times-out-because-no-output-was-received
The build has been terminated

Unintended side-effect of introducing a random delay.

prologic · 2020-05-26T01:45:24Z

This can be fixed by disabling Bash's $RANDOM special variable by doing:

unset RANDOM; export RANODM=0

This permeates to subshells too;

$ unset RANDOM; export RANDOM=0; ( echo $RANDOM ); /bin/sh -c 'echo $RANDOM'
0
14598

This reverts commit cea8a3f.

…ata#9079)" (netdata#9161)" This reverts commit e92d2ce.

…ly. (#9245) * Revert "Revert "Introduce a random sleep in the Netdata updater (#9079)" (#9161)" This reverts commit e92d2ce. * Add option to updater to disable randomized delay. Primarily intended for CI, also useful for automated deployment tools like Ansible. * Use correct paths in CI. * Mke variable name match option name.

* Introduce a random sleep in the Netdata updater * Only sleep if we're not a tty (e.g: cron) and use a random interval between 30m-60m * Set lower bound to 1s * Disable random sleep / netdata-updater splay in lifecycle tests

…" (netdata#9161) This reverts commit cea8a3f.

…ly. (netdata#9245) * Revert "Revert "Introduce a random sleep in the Netdata updater (netdata#9079)" (netdata#9161)" This reverts commit e92d2ce. * Add option to updater to disable randomized delay. Primarily intended for CI, also useful for automated deployment tools like Ansible. * Use correct paths in CI. * Mke variable name match option name.

prologic requested a review from cosmix May 18, 2020 10:59

prologic self-assigned this May 18, 2020

prologic requested review from Ferroin, knatsakis, gobwas and hmoragrega and removed request for gobwas May 18, 2020 10:59

prologic marked this pull request as ready for review May 18, 2020 11:00

prologic requested a review from ncmans as a code owner May 18, 2020 11:00

github-actions bot added the area/packaging Packaging and operating systems support label May 18, 2020

Ferroin approved these changes May 18, 2020

View reviewed changes

ncmans removed their request for review May 19, 2020 01:34

Ferroin requested changes May 19, 2020

View reviewed changes

packaging/installer/netdata-updater.sh Outdated Show resolved Hide resolved

prologic requested review from Ferroin, ilyam8 and mfundul May 20, 2020 12:34

Ferroin reviewed May 20, 2020

View reviewed changes

packaging/installer/netdata-updater.sh Outdated Show resolved Hide resolved

Ferroin previously approved these changes May 20, 2020

View reviewed changes

knatsakis previously approved these changes May 25, 2020

View reviewed changes

prologic added 3 commits May 26, 2020 10:52

Introduce a random sleep in the Netdata updater

8c31928

Only sleep if we're not a tty (e.g: cron) and use a random interval b…

d2b01d8

…etween 30m-60m

Set lower bound to 1s

eb6459e

prologic dismissed stale reviews from knatsakis and Ferroin via eb6459e May 26, 2020 00:52

prologic requested a review from Ferroin May 26, 2020 00:54

prologic requested a review from knatsakis May 26, 2020 00:54

Ferroin reviewed May 26, 2020

View reviewed changes

packaging/installer/netdata-updater.sh Show resolved Hide resolved

Ferroin previously approved these changes May 26, 2020

View reviewed changes

thiagoftsm previously approved these changes May 26, 2020

View reviewed changes

Disable random sleep / netdata-updater splay in lifecycle tests

b51fe00

prologic dismissed stale reviews from thiagoftsm and Ferroin via b51fe00 May 26, 2020 01:48

prologic requested a review from ktsaou as a code owner May 26, 2020 01:48

github-actions bot added the area/tests label May 26, 2020

thiagoftsm requested review from Ferroin and thiagoftsm May 26, 2020 02:10

thiagoftsm approved these changes May 26, 2020

View reviewed changes

knatsakis approved these changes May 26, 2020

View reviewed changes

prologic merged commit cea8a3f into netdata:master May 26, 2020

prologic deleted the random_sleep_updater branch May 26, 2020 05:58

prologic added a commit that referenced this pull request May 26, 2020

Revert "Introduce a random sleep in the Netdata updater (#9079)"

ec84b18

This reverts commit cea8a3f.

prologic mentioned this pull request May 26, 2020

Revert "Introduce a random sleep in the Netdata updater" #9161

Merged

prologic added a commit that referenced this pull request May 26, 2020

Revert "Introduce a random sleep in the Netdata updater (#9079)" (#9161)

e92d2ce

This reverts commit cea8a3f.

Ferroin added a commit to Ferroin/netdata that referenced this pull request Jun 1, 2020

Revert "Revert "Introduce a random sleep in the Netdata updater (netd…

810a8da

…ata#9079)" (netdata#9161)" This reverts commit e92d2ce.

Ferroin mentioned this pull request Jun 1, 2020

Add a random offset to the update script when running non-interactively. #9245

Merged

prologic mentioned this pull request Jun 17, 2020

Automatic changelog generation doesn't work well #8117

Closed

Saruspete pushed a commit to Saruspete/netdata that referenced this pull request Jun 17, 2020

Revert "Introduce a random sleep in the Netdata updater (netdata#9079)…

5710018

…" (netdata#9161) This reverts commit cea8a3f.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Introduce a random sleep in the Netdata updater #9079

Introduce a random sleep in the Netdata updater #9079

prologic commented May 18, 2020

squash-labs bot commented May 18, 2020

cakrit commented May 18, 2020

Ferroin commented May 18, 2020

cakrit commented May 18, 2020

prologic commented May 26, 2020

prologic commented May 26, 2020

prologic commented May 26, 2020

Introduce a random sleep in the Netdata updater #9079

Introduce a random sleep in the Netdata updater #9079

Conversation

prologic commented May 18, 2020

Summary

Component Name

Test Plan

Additional Information

squash-labs bot commented May 18, 2020

Manage this branch in Squash

cakrit commented May 18, 2020

Ferroin commented May 18, 2020

cakrit commented May 18, 2020

prologic commented May 26, 2020

prologic commented May 26, 2020

prologic commented May 26, 2020