Added a bit of jitter to the metric sending worker period. #7642

Merged
merged 1 commit into from Jul 19, 2017

Conversation

Projects
None yet
7 participants
Member

alesstimec commented Jul 14, 2017

Description of change

Why is this change needed?

In case all units regain connection to the controller, this change will prevent all of them from sending metrics at the same time. Instead they will send with a rand metrics with a period of 5minutes +- 20%..

QA steps

Metrics collection should still work as expected. The small jitter introduced to the period should not affect that. Use "juju metrics" to verify that metrics are still being collected as expected.

Documentation changes

No documentation change required.

Bug reference

N/A

@alesstimec alesstimec changed the base branch from develop to master Jul 14, 2017

@alesstimec alesstimec changed the base branch from master to develop Jul 14, 2017

Member

mattyw commented Jul 14, 2017

"In case all units regain connection to the controller, this change will prevent all of them from sending metrics at the same time. Instead they will send with a rand metrics with a period of 5minutes +- 20%.."

How would adding jitter to the period help here? The units are unlikely to have come up at the same time so their sending intervals are likely to be staggered anyway

cmars approved these changes Jul 14, 2017

LGTM

worker/periodicworker.go
@@ -16,8 +17,18 @@ import (
// and that the function was able to stop cleanly
var ErrKilled = errors.New("worker killed")
+type PeriodicWorkerOption func(w *periodicWorker)
@cmars

cmars Jul 14, 2017

Owner

godoc please

Owner

jameinel commented Jul 15, 2017

Owner

cmars commented Jul 15, 2017

!!build!!

Member

mattyw commented Jul 17, 2017

@jameinel it really isn't my place to say - but wouldn't it be easier to tackle the problem of the controller synchronising the units rather than adding jitter to all the workers?

LGTM

A more elegant solution would probably be to make period a func () time.Duration and have a constant and jittered implementations.

w := &periodicWorker{newTimer: timerFunc}
+ for _, option := range options {
+ option(w)
+ }
@mjs

mjs Jul 18, 2017

Contributor

This is a little odd/cute/unexpected. You could have also replaced the period arg with a func() time.Duration.

I'm not going to fight it too hard though.

worker/periodicworker.go
+ window := (2.0 * w.jitter) * float64(period)
+ offset := float64(r.Int63n(int64(window)))
+ p = time.Duration(lower + offset)
+ }
@mjs

mjs Jul 18, 2017

Contributor

it would be cleaner to have all this in a nextPeriod() method

worker/periodicworker_test.go
+ c.Fatalf("The doWork function should have been called by now")
+ }
+ c.Assert(float64(actualPeriod)/float64(defaultPeriod) <= 1.2, jc.IsTrue)
+ c.Assert(float64(actualPeriod)/float64(defaultPeriod) >= 0.8, jc.IsTrue)
@mjs

mjs Jul 18, 2017

Contributor

I'm concerned that this is going to end up being flaky. I've seen Go's timers be quite inaccurate in our test suites, going off both much later and earlier than requested, especially on busy or underprovisioned hosts. Notice how none of the other tests specifically check the period. The way that the worker is currently structured, this can't be reliably tested.

Possible solutions:

  • set wide margins the allowed period - but I think they'll actually need to be so wide that they'll be meaningless
  • use an internal test which ensures that w.jitter is set and as the required effect (by factoring out nextPeriod as suggested)
  • rework the worker in terms of an injected Clock like we do elsewhere so that tests aren't involved with wall time (the best solution but also the most time consuming)
Member

alesstimec commented Jul 18, 2017

!!build!!

cmars approved these changes Jul 18, 2017

Still LGTM

Member

tasdomas commented Jul 18, 2017

!!build!!

Member

tasdomas commented Jul 18, 2017

!!build!!

Member

alesstimec commented Jul 19, 2017

!!build!!

Member

alesstimec commented Jul 19, 2017

!!build!!

mjs approved these changes Jul 19, 2017

+ case <-funcHasRun:
+ c.Fatalf("After the kill we don't expect anymore calls to the function")
+ case <-time.After(defaultFireOnceWait):
+ }
@mjs

mjs Jul 19, 2017

Contributor

This is much better. Thanks. A test of the default nextPeriod implementation would be nice too. Maybe just run it a number of times with and without jitter and check that numbers in the correct range come out?

Member

alesstimec commented Jul 19, 2017

!!build!!

Member

alesstimec commented Jul 19, 2017

$$merge$$

Contributor

jujubot commented Jul 19, 2017

Status: merge request accepted. Url: http://juju-ci.vapour.ws:8080/job/github-merge-juju

@jujubot jujubot merged commit 1f3f66e into juju:develop Jul 19, 2017

1 check passed

github-check-merge-juju Use !!.*!! to request another build. IE, !!build!!, !!retry!!
Details

jujubot added a commit that referenced this pull request Jul 20, 2017

Merge pull request #7657 from alesstimec/periodic-worker-jitter-2.2
Added a bit of jitter to the metric sending worker period.

Why is this change needed?

In case all units regain connection to the controller, this change will prevent all of them from sending metrics at the same time. Instead they will send with a rand metrics with a period of 5minutes +- 20%..

Note: This already landed in develop (#7642)

QA steps

Metrics collection should still work as expected. The small jitter introduced to the period should not affect that. Use "juju metrics" to verify that metrics are still being collected as expected.

Documentation changes

No documentation change required.

Bug reference

N/A
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment