Added a bit of jitter to the metric sending worker period. #7642
alesstimec changed the base branch from develop to master Jul 14, 2017
alesstimec changed the base branch from master to develop Jul 14, 2017
"In case all units regain connection to the controller, this change will prevent all of them from sending metrics at the same time. Instead they will send metrics with a randomized period of 5 minutes +/- 20%." How would adding jitter to the period help here? The units are unlikely to have come up at the same time, so their sending intervals are likely to be staggered anyway.
@@ -16,8 +17,18 @@ import (
 // and that the function was able to stop cleanly
 var ErrKilled = errors.New("worker killed")
+type PeriodicWorkerOption func(w *periodicWorker)
In my experience that isn't true. There are a lot of events (upgrading a
controller, for example) that cause synchronization among units. Randomizing
their intervals forces them to stay reasonably evenly distributed even when
something would otherwise cause them to synchronize.
John
=:->
…
!!build!!
@jameinel it really isn't my place to say - but wouldn't it be easier to tackle the problem of the controller synchronising the units rather than adding jitter to all the workers?
tasdomas approved these changes Jul 18, 2017
LGTM
A more elegant solution would probably be to make period a func() time.Duration and have constant and jittered implementations.
 	w := &periodicWorker{newTimer: timerFunc}
+	for _, option := range options {
+		option(w)
+	}
mjs
Jul 18, 2017
Contributor
This is a little odd/cute/unexpected. You could have also replaced the period arg with a func() time.Duration.
I'm not going to fight it too hard though.
+		window := (2.0 * w.jitter) * float64(period)
+		offset := float64(r.Int63n(int64(window)))
+		p = time.Duration(lower + offset)
+	}
+		c.Fatalf("The doWork function should have been called by now")
+	}
+	c.Assert(float64(actualPeriod)/float64(defaultPeriod) <= 1.2, jc.IsTrue)
+	c.Assert(float64(actualPeriod)/float64(defaultPeriod) >= 0.8, jc.IsTrue)
mjs
Jul 18, 2017
Contributor
I'm concerned that this is going to end up being flaky. I've seen Go's timers be quite inaccurate in our test suites, going off both much later and earlier than requested, especially on busy or underprovisioned hosts. Notice how none of the other tests specifically check the period. The way that the worker is currently structured, this can't be reliably tested.
Possible solutions:
- set wide margins on the allowed period - but I think they'll actually need to be so wide that they'll be meaningless
- use an internal test which ensures that w.jitter is set and has the required effect (by factoring out nextPeriod as suggested)
- rework the worker in terms of an injected Clock like we do elsewhere so that tests aren't involved with wall time (the best solution but also the most time consuming)
!!build!!
+	case <-funcHasRun:
+		c.Fatalf("After the kill we don't expect anymore calls to the function")
+	case <-time.After(defaultFireOnceWait):
+	}
mjs
Jul 19, 2017
Contributor
This is much better. Thanks. A test of the default nextPeriod implementation would be nice too. Maybe just run it a number of times with and without jitter and check that numbers in the correct range come out?
!!build!!
$$merge$$
Status: merge request accepted. Url: http://juju-ci.vapour.ws:8080/job/github-merge-juju
alesstimec commented Jul 14, 2017
Description of change
Why is this change needed?
In case all units regain connection to the controller, this change will prevent all of them from sending metrics at the same time. Instead they will send metrics with a randomized period of 5 minutes +/- 20%.
QA steps
Metrics collection should still work as expected. The small jitter introduced to the period should not affect that. Use "juju metrics" to verify that metrics are still being collected as expected.
Documentation changes
No documentation change required.
Bug reference
N/A