Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added a bit of jitter to the metric sending worker period. #7642

Merged
merged 1 commit into from Jul 19, 2017

Conversation

alesstimec
Copy link
Member

Description of change

Why is this change needed?

In case all units regain connection to the controller, this change will prevent all of them from sending metrics at the same time. Instead they will send with a rand metrics with a period of 5minutes +- 20%..

QA steps

Metrics collection should still work as expected. The small jitter introduced to the period should not affect that. Use "juju metrics" to verify that metrics are still being collected as expected.

Documentation changes

No documentation change required.

Bug reference

N/A

@alesstimec alesstimec changed the base branch from develop to master July 14, 2017 07:13
@alesstimec alesstimec changed the base branch from master to develop July 14, 2017 07:13
@mattyw
Copy link
Contributor

mattyw commented Jul 14, 2017

"In case all units regain connection to the controller, this change will prevent all of them from sending metrics at the same time. Instead they will send with a rand metrics with a period of 5minutes +- 20%.."

How would adding jitter to the period help here? The units are unlikely to have come up at the same time so their sending intervals are likely to be staggered anyway

Copy link
Contributor

@cmars cmars left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -16,8 +17,18 @@ import (
// and that the function was able to stop cleanly
var ErrKilled = errors.New("worker killed")

type PeriodicWorkerOption func(w *periodicWorker)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

godoc please

@jameinel
Copy link
Member

jameinel commented Jul 15, 2017 via email

@cmars
Copy link
Contributor

cmars commented Jul 15, 2017

!!build!!

@mattyw
Copy link
Contributor

mattyw commented Jul 17, 2017

@jameinel it really isn't my place to say - but wouldn't it be easier to tackle the problem of the controller synchronising the units rather than adding jitter to all the workers?

Copy link
Contributor

@tasdomas tasdomas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

A more elegant solution would probably be to make period a func () time.Duration and have a constant and jittered implementations.

window := (2.0 * w.jitter) * float64(period)
offset := float64(r.Int63n(int64(window)))
p = time.Duration(lower + offset)
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be cleaner to have all this in a nextPeriod() method

w := &periodicWorker{newTimer: timerFunc}
for _, option := range options {
option(w)
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a little odd/cute/unexpected. You could have also replaced the period arg with a func() time.Duration.

I'm not going to fight it too hard though.

c.Fatalf("The doWork function should have been called by now")
}
c.Assert(float64(actualPeriod)/float64(defaultPeriod) <= 1.2, jc.IsTrue)
c.Assert(float64(actualPeriod)/float64(defaultPeriod) >= 0.8, jc.IsTrue)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm concerned that this is going to end up being flaky. I've seen Go's timers be quite inaccurate in our test suites, going off both much later and earlier than requested, especially on busy or underprovisioned hosts. Notice how none of the other tests specifically check the period. The way that the worker is currently structured, this can't be reliably tested.

Possible solutions:

  • set wide margins the allowed period - but I think they'll actually need to be so wide that they'll be meaningless
  • use an internal test which ensures that w.jitter is set and as the required effect (by factoring out nextPeriod as suggested)
  • rework the worker in terms of an injected Clock like we do elsewhere so that tests aren't involved with wall time (the best solution but also the most time consuming)

@alesstimec
Copy link
Member Author

!!build!!

Copy link
Contributor

@cmars cmars left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still LGTM

@tasdomas
Copy link
Contributor

!!build!!

3 similar comments
@tasdomas
Copy link
Contributor

!!build!!

@alesstimec
Copy link
Member Author

!!build!!

@alesstimec
Copy link
Member Author

!!build!!

case <-funcHasRun:
c.Fatalf("After the kill we don't expect anymore calls to the function")
case <-time.After(defaultFireOnceWait):
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is much better. Thanks. A test of the default nextPeriod implementation would be nice too. Maybe just run it a number of times with and without jitter and check that numbers in the correct range come out?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@alesstimec
Copy link
Member Author

!!build!!

@alesstimec
Copy link
Member Author

$$merge$$

@jujubot
Copy link
Collaborator

jujubot commented Jul 19, 2017

Status: merge request accepted. Url: http://juju-ci.vapour.ws:8080/job/github-merge-juju

@jujubot jujubot merged commit 1f3f66e into juju:develop Jul 19, 2017
jujubot added a commit that referenced this pull request Jul 20, 2017
Added a bit of jitter to the metric sending worker period.

Why is this change needed?

In case all units regain connection to the controller, this change will prevent all of them from sending metrics at the same time. Instead they will send with a rand metrics with a period of 5minutes +- 20%..

Note: This already landed in develop (#7642)

QA steps

Metrics collection should still work as expected. The small jitter introduced to the period should not affect that. Use "juju metrics" to verify that metrics are still being collected as expected.

Documentation changes

No documentation change required.

Bug reference

N/A
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
7 participants