Introduce a random sleep in the Netdata updater #9079
Manage this branch in Squash. Test this branch here: https://prologicrandom-sleep-updater-mobde.squash.io |
The sleep command is too early in the script; there are a couple of possible failure conditions further down that would prevent installation. It should probably go just before the installation code actually runs, after the checksum validation. This is a bad hack, and I don't like it at all. Have we really exhausted all other options and have to resort to this? |
There is zero functionality to provide this in any of the task schedulers we do or could use for our updater. The expectation is that if a job needs random jitter like this, it will handle it itself. I'm not hugely fond of it myself (ideally it would only sleep when run non-interactively, and I'd go with a much longer range, more like 5-10 minutes at least), but it's better than nothing. Aside from the stampede effect we're seeing in Netdata Cloud due to updates, it's also good practice to help avoid putting undue load on our update infrastructure. Also, this is generally expected behavior for automated update systems. See for example FreeBSD's update tool (it refuses to run non-interactively without special handling, and the cron job mode sleeps a random number of seconds up to a full hour before actually doing its job). |
Fine, go ahead. I see we are trying to spread out the downloads themselves as well, so the sleep can't go much later than where it is... |
Sanity check before merging...
Behaving as intended via CronD. |
Travis CI failures are because of:
Unintended side-effect of introducing a random delay. |
This can be fixed by disabling Bash's
This permeates to subshells too;
|
…ata#9079)" (netdata#9161)" This reverts commit e92d2ce.
…ly. (#9245) * Revert "Revert "Introduce a random sleep in the Netdata updater (#9079)" (#9161)" This reverts commit e92d2ce. * Add option to updater to disable randomized delay. Primarily intended for CI, also useful for automated deployment tools like Ansible. * Use correct paths in CI. * Make variable name match option name.
* Introduce a random sleep in the Netdata updater * Only sleep if we're not a tty (e.g: cron) and use a random interval between 30m-60m * Set lower bound to 1s * Disable random sleep / netdata-updater splay in lifecycle tests
…" (netdata#9161) This reverts commit cea8a3f.
Summary
SSIA (subject says it all).
Component Name
Test Plan
Tested the shell expression:
Additional Information
This will hopefully alleviate the stampede-like effect we're seeing in Netdata
Cloud Production, where we see spikes of disconnections and reconnections.
The timing lines up with the default Debian crontab entry for "daily" jobs,
so adding a random sleep to the updater should help spread this out
a bit and reduce load on our system(s) in short intervals.
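
For illustration only, here is a minimal sketch of the kind of jitter discussed in this thread; it is not the code that was merged. The function names `random_delay` and `maybe_sleep`, the `NO_UPDATER_DELAY` opt-out variable, and the one-hour default upper bound are all assumptions made for this example. It picks a delay of at least 1 second using Bash's `$RANDOM`, and skips the sleep when standard output is a terminal or the opt-out variable is set.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print a delay between 1 and MAX seconds (default 3600).
# Bash's $RANDOM yields 0..32767, so this only covers MAX values up to 32768.
random_delay() {
  local max="${1:-3600}"
  echo $(( (RANDOM % max) + 1 ))
}

# Hypothetical helper: sleep for a random interval, but only when running
# non-interactively (stdout is not a tty, e.g. under cron) and only if the
# assumed opt-out variable NO_UPDATER_DELAY is empty or unset.
maybe_sleep() {
  local max="${1:-3600}"
  if [ -t 1 ] || [ -n "${NO_UPDATER_DELAY:-}" ]; then
    return 0
  fi
  sleep "$(random_delay "${max}")"
}

# Example call site (commented out so sourcing this file never blocks):
# maybe_sleep 3600
```

An opt-out of this kind is what lets CI jobs and deployment tools such as Ansible run the updater without waiting, while cron-driven runs are still spread out.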