Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce a random sleep in the Netdata updater #9079

Merged
merged 4 commits into from
May 26, 2020
Merged

Introduce a random sleep in the Netdata updater #9079

merged 4 commits into from
May 26, 2020

Conversation

prologic
Copy link
Contributor

Summary

ssia

Component Name
  • area/packaging
Test Plan

Tested the shell expression:

sleep $(((RANDOM % 10) + 1))s
Additional Information

This will hopefully aliviate the stampede like effect we're seeing in Netdata
Cloud Production where we see spikes of disconnections and reconenctions.
The timing lines up with the default Debian crontab entry for "daily" jobs
and thus adding a random sleep into the updater should help spread this out
a bit and reduce load on our system9s) in short intervals.

@squash-labs
Copy link

squash-labs bot commented May 18, 2020

Manage this branch in Squash

Test this branch here: https://prologicrandom-sleep-updater-mobde.squash.io

@prologic prologic requested a review from cosmix May 18, 2020 10:59
@prologic prologic self-assigned this May 18, 2020
@prologic prologic requested review from Ferroin, knatsakis, gobwas and hmoragrega and removed request for gobwas May 18, 2020 10:59
@prologic prologic marked this pull request as ready for review May 18, 2020 11:00
@prologic prologic requested a review from ncmans as a code owner May 18, 2020 11:00
@github-actions github-actions bot added the area/packaging Packaging and operating systems support label May 18, 2020
@cakrit
Copy link
Contributor

cakrit commented May 18, 2020

The sleep command is too early in the script, there are a couple of possible failure conditions that would prevent installation further down. It should probably go before the actual running of the installation code, after the checksum validation.

This is a bad hack, I don't like it at all. Have we really exhausted all other options and have to resort to this?

@Ferroin
Copy link
Member

Ferroin commented May 18, 2020

This is a bad hack, I don't like it at all. Have we really exhausted all other options and have to resort to this?

There is zero functionality to provide this in any of the task schedulers we do or could use for our updater. The expectation is that if a job needs random jitter like this, it will handle it itself.. I'm not hugely fond of it myself (it should ideally be set up so it only runs when run non-interactively, and I'd go with a much longer range (more like 5-10 minutes, at least)), but it's better than nothing.

Aside from the stampede effect we're seeing in Netdata Cloud due to updates, it's also good practice to help avoid putting undue load on our update infrastructure.

Also, this is generally expected behavior for automated update systems. See for example FreeBSD's update tool (it refuses to run non-interactively without special handling, and the cron job mode sleeps a random number of seconds up to a full hour before actually doing it's job).

@cakrit
Copy link
Contributor

cakrit commented May 18, 2020

Fine, go ahead. I see we are trying to spread out the downloads themselves as well, so the sleep can't go much later than where it is...

@ncmans ncmans removed their request for review May 19, 2020 01:34
packaging/installer/netdata-updater.sh Outdated Show resolved Hide resolved
Ferroin
Ferroin previously approved these changes May 20, 2020
knatsakis
knatsakis previously approved these changes May 25, 2020
@prologic
Copy link
Contributor Author

Sanity check before merging...

root@ubuntu2004:~# tail -f /var/log/syslog
May 26 00:45:01 ubuntu2004 CRON[81223]: (root) CMD (/usr/libexec/netdata/netdata-updater.sh)
root@ubuntu2004:~# ps aux | grep netdata-updater
root       81223  0.0  0.0   2608   544 ?        Ss   00:45   0:00 /bin/sh -c /usr/libexec/netdata/netdata-updater.sh
root       81224  0.0  0.3   7024  3412 ?        S    00:45   0:00 bash /usr/libexec/netdata/netdata-updater.sh
root       81227  0.0  0.0   6436   740 pts/0    S+   00:45   0:00 grep --color=auto netdata-updater
root@ubuntu2004:~# strace -p 81224
strace: Process 81224 attached
wait4(-1, ^Cstrace: Process 81224 detached
 <detached ...>
root@ubuntu2004:~# ps aux | grep sleep
root       81225  0.0  0.0   5476   592 ?        S    00:45   0:00 sleep 336s
root       81234  0.0  0.0   6432   724 pts/0    S+   00:46   0:00 grep --color=auto sleep
root@ubuntu2004:~# pstree -p 81223
sh(81223)───bash(81224)───sleep(81225)

Behaving as intended via CronD.

@prologic prologic dismissed stale reviews from knatsakis and Ferroin via eb6459e May 26, 2020 00:52
@prologic prologic requested a review from Ferroin May 26, 2020 00:54
@prologic prologic requested a review from knatsakis May 26, 2020 00:54
Ferroin
Ferroin previously approved these changes May 26, 2020
thiagoftsm
thiagoftsm previously approved these changes May 26, 2020
@prologic
Copy link
Contributor Author

Travis CI failures are becuase of:

00:00:49.878291  Running BATS file..
00:00:49.896820  1..3
00:02:22.894917  ok 1 install stable netdata using kickstart
No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.
Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#build-times-out-because-no-output-was-received
The build has been terminated

Unintended side-effect of introducing a random delay.

@prologic
Copy link
Contributor Author

This can be fixed by disabling Bash's $RANDOM special variable by doing:

unset RANDOM; export RANODM=0

This permeates to subshells too;

$ unset RANDOM; export RANDOM=0; ( echo $RANDOM ); /bin/sh -c 'echo $RANDOM'
0
14598

@prologic prologic dismissed stale reviews from thiagoftsm and Ferroin via b51fe00 May 26, 2020 01:48
@prologic prologic requested a review from ktsaou as a code owner May 26, 2020 01:48
@prologic prologic merged commit cea8a3f into netdata:master May 26, 2020
@prologic prologic deleted the random_sleep_updater branch May 26, 2020 05:58
prologic added a commit that referenced this pull request May 26, 2020
prologic added a commit that referenced this pull request May 26, 2020
Ferroin added a commit to Ferroin/netdata that referenced this pull request Jun 1, 2020
Ferroin added a commit that referenced this pull request Jun 3, 2020
…ly. (#9245)

* Revert "Revert "Introduce a random sleep in the Netdata updater (#9079)" (#9161)"

This reverts commit e92d2ce.

* Add option to updater to disable randomized delay.

Primarily intended for CI, also useful for automated deployment tools
like Ansible.

* Use correct paths in CI.

* Mke variable name match option name.
Saruspete pushed a commit to Saruspete/netdata that referenced this pull request Jun 17, 2020
* Introduce a random sleep in the Netdata updater

* Only sleep if we're not a tty (e.g: cron) and use a random interval between 30m-60m

* Set lower bound to 1s

* Disable random sleep / netdata-updater splay in lifecycle tests
Saruspete pushed a commit to Saruspete/netdata that referenced this pull request Jun 17, 2020
Saruspete pushed a commit to Saruspete/netdata that referenced this pull request Jun 17, 2020
…ly. (netdata#9245)

* Revert "Revert "Introduce a random sleep in the Netdata updater (netdata#9079)" (netdata#9161)"

This reverts commit e92d2ce.

* Add option to updater to disable randomized delay.

Primarily intended for CI, also useful for automated deployment tools
like Ansible.

* Use correct paths in CI.

* Mke variable name match option name.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/packaging Packaging and operating systems support area/tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants