Added a bit of jitter to the metric sending worker period. #7642
Conversation
"In case all units regain connection to the controller, this change will prevent all of them from sending metrics at the same time. Instead they will send with a rand metrics with a period of 5minutes +- 20%.." How would adding jitter to the period help here? The units are unlikely to have come up at the same time, so their sending intervals are likely to be staggered anyway.
LGTM
worker/periodicworker.go
Outdated
@@ -16,8 +17,18 @@ import (
// and that the function was able to stop cleanly
var ErrKilled = errors.New("worker killed")

type PeriodicWorkerOption func(w *periodicWorker)
godoc please
In my experience that isn't true. There are a lot of events (upgrading a controller, for example) that cause synchronization among units. Randomizing their intervals forces them to stay reasonably evenly distributed even when something happens that would cause them to want to synchronize.
John
=:->
!!build!!
Force-pushed from fd276e0 to ce345de
@jameinel it really isn't my place to say - but wouldn't it be easier to tackle the problem of the controller synchronising the units rather than adding jitter to all the workers?
LGTM
A more elegant solution would probably be to make period a func() time.Duration and have constant and jittered implementations.
worker/periodicworker.go
Outdated
window := (2.0 * w.jitter) * float64(period)
offset := float64(r.Int63n(int64(window)))
p = time.Duration(lower + offset)
}
It would be cleaner to have all this in a nextPeriod() method.
w := &periodicWorker{newTimer: timerFunc}
for _, option := range options {
	option(w)
}
This is a little odd/cute/unexpected. You could also have replaced the period arg with a func() time.Duration.
I'm not going to fight it too hard though.
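For context, the loop quoted above is the standard functional-options pattern. A minimal self-contained sketch, where the Jitter option and the zero-value defaults are assumptions for illustration (only the PeriodicWorkerOption type and the option loop appear in the diff):

```go
package main

import "fmt"

// periodicWorker is a stripped-down stand-in for the real worker.
type periodicWorker struct {
	jitter float64
}

// PeriodicWorkerOption configures a periodicWorker at construction.
type PeriodicWorkerOption func(w *periodicWorker)

// Jitter returns an option that makes the worker's period vary by
// +/- the given fraction (e.g. 0.2 for 20%). Hypothetical name.
func Jitter(j float64) PeriodicWorkerOption {
	return func(w *periodicWorker) { w.jitter = j }
}

func newPeriodicWorker(options ...PeriodicWorkerOption) *periodicWorker {
	w := &periodicWorker{} // defaults, then apply each option in order
	for _, option := range options {
		option(w)
	}
	return w
}

func main() {
	w := newPeriodicWorker(Jitter(0.2))
	fmt.Println(w.jitter) // → 0.2
}
```

The pattern keeps the constructor signature stable while letting callers opt in to jitter; the alternative the reviewer mentions, passing a func() time.Duration directly, trades that flexibility for a smaller API surface.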
worker/periodicworker_test.go
Outdated
c.Fatalf("The doWork function should have been called by now")
}
c.Assert(float64(actualPeriod)/float64(defaultPeriod) <= 1.2, jc.IsTrue)
c.Assert(float64(actualPeriod)/float64(defaultPeriod) >= 0.8, jc.IsTrue)
I'm concerned that this is going to end up being flaky. I've seen Go's timers be quite inaccurate in our test suites, going off both much later and earlier than requested, especially on busy or underprovisioned hosts. Notice how none of the other tests specifically check the period. The way that the worker is currently structured, this can't be reliably tested.
Possible solutions:
- set wide margins for the allowed period - but I think they'll actually need to be so wide that they'll be meaningless
- use an internal test which ensures that w.jitter is set and has the required effect (by factoring out nextPeriod as suggested)
- rework the worker in terms of an injected Clock like we do elsewhere so that tests aren't involved with wall time (the best solution but also the most time consuming)
Force-pushed from ce345de to 160c712
!!build!!
Still LGTM
!!build!!
case <-funcHasRun:
	c.Fatalf("After the kill we don't expect anymore calls to the function")
case <-time.After(defaultFireOnceWait):
}
This is much better. Thanks. A test of the default nextPeriod implementation would be nice too. Maybe just run it a number of times with and without jitter and check that numbers in the correct range come out?
done
Force-pushed from 160c712 to a3e59c2
Force-pushed from a3e59c2 to 31b135c
!!build!!
Status: merge request accepted. Url: http://juju-ci.vapour.ws:8080/job/github-merge-juju |
Description of change
Why is this change needed?
If all units regain their connection to the controller at the same time, this change prevents them from all sending metrics simultaneously. Instead, each unit sends its metrics with a randomized period of 5 minutes +/- 20%.
Note: this already landed in develop (#7642).
QA steps
Metrics collection should still work as expected; the small jitter introduced to the period should not affect it. Use "juju metrics" to verify that metrics are still being collected as expected.
Documentation changes
No documentation change required.
Bug reference
N/A