Added a bit of jitter to the metric sending worker period. #7642
Conversation
"In case all units regain connection to the controller, this change will prevent all of them from sending metrics at the same time. Instead they will send with a rand metrics with a period of 5minutes +- 20%.." How would adding jitter to the period help here? The units are unlikely to have come up at the same time, so their sending intervals are likely to be staggered anyway.
LGTM
worker/periodicworker.go
Outdated
@@ -16,8 +17,18 @@ import (
// and that the function was able to stop cleanly
var ErrKilled = errors.New("worker killed")

type PeriodicWorkerOption func(w *periodicWorker)
godoc please
In my experience that isn't true. There are a lot of events (upgrading a controller, for example) that cause synchronization among units. Randomizing their intervals forces them to stay reasonably evenly distributed even when something happens that would cause them to want to synchronize.
John
=:->
!!build!!
Force-pushed from fd276e0 to ce345de
@jameinel it really isn't my place to say - but wouldn't it be easier to tackle the problem of the controller synchronising the units rather than adding jitter to all the workers?
LGTM
A more elegant solution would probably be to make period a func() time.Duration and have constant and jittered implementations.
worker/periodicworker.go
Outdated
window := (2.0 * w.jitter) * float64(period)
offset := float64(r.Int63n(int64(window)))
p = time.Duration(lower + offset)
}
It would be cleaner to have all this in a nextPeriod() method.
w := &periodicWorker{newTimer: timerFunc}
for _, option := range options {
	option(w)
}
This is a little odd/cute/unexpected. You could also have replaced the period arg with a func() time.Duration.
I'm not going to fight it too hard though.
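For context, the loop quoted above is the standard functional-options pattern. A minimal self-contained sketch, where the Jitter option and the zero-value defaults are assumptions for illustration (only the PeriodicWorkerOption type and the option loop appear in the diff):

```go
package main

import "fmt"

// periodicWorker is a stripped-down stand-in for the real worker.
type periodicWorker struct {
	jitter float64
}

// PeriodicWorkerOption configures a periodicWorker at construction.
type PeriodicWorkerOption func(w *periodicWorker)

// Jitter returns an option that makes the worker's period vary by
// +/- the given fraction (e.g. 0.2 for 20%). Hypothetical name.
func Jitter(j float64) PeriodicWorkerOption {
	return func(w *periodicWorker) { w.jitter = j }
}

func newPeriodicWorker(options ...PeriodicWorkerOption) *periodicWorker {
	w := &periodicWorker{} // defaults, then apply each option in order
	for _, option := range options {
		option(w)
	}
	return w
}

func main() {
	w := newPeriodicWorker(Jitter(0.2))
	fmt.Println(w.jitter) // → 0.2
}
```

The pattern keeps the constructor signature stable while letting callers opt in to jitter; the alternative the reviewer mentions, passing a func() time.Duration directly, trades that flexibility for a smaller API surface.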
worker/periodicworker_test.go
Outdated
c.Fatalf("The doWork function should have been called by now")
}
c.Assert(float64(actualPeriod)/float64(defaultPeriod) <= 1.2, jc.IsTrue)
c.Assert(float64(actualPeriod)/float64(defaultPeriod) >= 0.8, jc.IsTrue)
I'm concerned that this is going to end up being flaky. I've seen Go's timers be quite inaccurate in our test suites, going off both much later and earlier than requested, especially on busy or underprovisioned hosts. Notice how none of the other tests specifically check the period. The way that the worker is currently structured, this can't be reliably tested.
Possible solutions:
- set wide margins for the allowed period - but I think they'll actually need to be so wide that they'll be meaningless
- use an internal test which ensures that w.jitter is set and has the required effect (by factoring out nextPeriod as suggested)
- rework the worker in terms of an injected Clock like we do elsewhere so that tests aren't involved with wall time (the best solution but also the most time consuming)
Force-pushed from ce345de to 160c712
!!build!!
Still LGTM
!!build!!
case <-funcHasRun:
	c.Fatalf("After the kill we don't expect anymore calls to the function")
case <-time.After(defaultFireOnceWait):
}
This is much better. Thanks. A test of the default nextPeriod implementation would be nice too. Maybe just run it a number of times with and without jitter and check that numbers in the correct range come out?
done
Force-pushed from 160c712 to a3e59c2
Force-pushed from a3e59c2 to 31b135c
!!build!!
Status: merge request accepted. Url: http://juju-ci.vapour.ws:8080/job/github-merge-juju |
Description of change
Why is this change needed?
If all units regain their connection to the controller at the same time, this change prevents them from all sending metrics simultaneously. Instead, each unit sends its metrics with a randomized period of 5 minutes +/- 20%.
Note: this already landed in develop (#7642).
QA steps
Metrics collection should still work as expected; the small jitter introduced to the period should not affect it. Use "juju metrics" to verify that metrics are still being collected as expected.
Documentation changes
No documentation change required.
Bug reference
N/A