Model config to control the update-status hook interval #7519

Merged
merged 5 commits into from Jun 21, 2017

Conversation

Projects
None yet
3 participants
Owner

wallyworld commented Jun 20, 2017

Description of change

Provide a model config attribute "update-status-hook-interval" which controls how often to run the update-status hook. Whatever value is provided has a random +/- 20% offset applied to avoid all hooks for all units firing art once.
This implementation doesn't yet listen for changes in the model config value - whatever value is provided as bootstrap via --config is what's used.

QA steps

juju bootstrap --debug
juju deploy mysql
juju debug-log

watch for update-status hooks firing at staggered but regular intervals

Documentation changes

model config doc needs updating

Bug reference

https://bugs.launchpad.net/juju/+bug/1697841
https://bugs.launchpad.net/juju/+bug/1698139

I have a comment about how we compute the window, and I'd like to understand how we find out about changes to the interval so that we can have users changing the value and seeing that we are actually doing something.

I'm happy with the 20% +/- as a range. Its a bit arbitrary, that seems like a good starting point.

environs/config/config.go
+ return errors.Annotatef(err, "update status hook frequency %v cannot be less than 1m", f)
+ }
+ if f > 60*time.Minute {
+ return errors.Annotatef(err, "update status hook frequency %v cannot be less than 60m", f)
@jameinel

jameinel Jun 20, 2017

Owner

I assume this is "greater than"

environs/config/config.go
+// update-status hook.
+func (c *Config) UpdateStatusHookInterval() time.Duration {
+ // Value has already been validated.
+ val, _ := time.ParseDuration(c.mustString(UpdateStatusHookInterval))
@jameinel

jameinel Jun 20, 2017

Owner

Given previous issues, does this handle when the value doesn't exist ahead of time. I also wonder if someone puts just '30' would that be treated as a string, or will this fail and we should we just interpret it as seconds (minutes?)

I definitely want to handle 'if its "" or nil, treat it as default' rather than having failures.

I would be ok not supporting floats/ints in the field, but I'd want it to be a clear error of "value 30 does not have any units" or something like that, rather than "int is not a valid string".

@wallyworld

wallyworld Jun 20, 2017

Owner

There's an upgrade step to add the value to an existing model.
There will not be a case where the value is "" if the upgrade is done properly.

@wallyworld

wallyworld Jun 20, 2017

Owner

The value is a time duration - it's a string eg "30m". There's no ambiguity with units.

@jameinel

jameinel Jun 20, 2017

Owner

My point is that when a user types a value they may just type 30 and the error message is going to be "not a string", which is a terrible user experience.

Given what we saw in the last round of "what happens when we introduce mandatory config", and the fact that IMO it is not ambiguous. An empty value can just be treated as "use the default", I'd really rather we layer defense in depth here so that we don't get bugs around it.

@wallyworld

wallyworld Jun 20, 2017

Owner

I can't repro a scenario where this is an issue, but none the less I've added a second lot of default handling in the getter method. We should remove it once 2.1.x becomes EOL. Previously it was controller config so that could explain any corner case there may have been.

@@ -300,6 +299,12 @@ func (w *RemoteStateWatcher) loop(unitTag names.UnitTag) (err error) {
observedEvent(&seenLeadershipChange)
}
+ // TODO(wallyworld) - listen for changes to this value
@jameinel

jameinel Jun 20, 2017

Owner

What is the plan around this? update-status interval is currently fixed and you can change the setting and restart Jujud?

@wallyworld

wallyworld Jun 20, 2017

Owner

No, we need to properly add apis for listening to well defined model config changes. All we have now is something to listen to any model config change. That would kill the system if lots of uniters had to be notified about any model change.

+ lower := 0.8 * float64(wait)
+ window := 0.4 * float64(wait)
+ offset := float64(r.Int63n(int64(window)))
+ wait = time.Duration(lower + offset)
@jameinel

jameinel Jun 20, 2017

Owner

This seems to randomize the time that any given unit will wait, but not randomize the time that any given wait will trigger. Is that correct?

As in, with this in play, eg mysql/0 will wake up every 3m and trigger, while mysql/1 will wake up every 6m and trigger.

I'd rather see it where on every sleep cycle we recompute a variable wait.
So that mysql/0 will wait 3, then 6, then 4, then 5, etc. That means that all units still average 5min wait times (or whatever was nominally set), but it should still push out when the explicit ticks are happening.

Or maybe that's what you're doing and I'm just confused by the double-nested func returning a func.
It feels like we just want to move the recalculation of specific wait time into the second func rather than this first one.

@wallyworld

wallyworld Jun 20, 2017

Owner

No, the wait is calculated new each time. Each time the channel fires, the function is called again and a new random time is used for the next wait period.

@jameinel

jameinel Jun 20, 2017

Owner

Ideally we would have some sort of proof around that.

@wallyworld

wallyworld Jun 20, 2017

Owner

I refactored a little to add a test for the generation of the random duration

worker/uniter/timer.go
-func NewUpdateStatusTimer() func() <-chan time.Time {
- return updateStatusSignal
+ return func() <-chan time.Time {
+ // TODO(fwereade): 2016-03-17 lp:1558657
@jameinel

jameinel Jun 20, 2017

Owner

Any chance that since you're touching this code now you could have it interact with Clocks instead of 'time.After' ?

@wallyworld

wallyworld Jun 20, 2017

Owner

Thought about it but a lot to try and get done for 2.2.1. I can see if I get time.

@jameinel

jameinel Jun 20, 2017

Owner

This doesn't seem very high priority, to me. It seemed like it might have been low fruit, but if it isn't, don't worry about it. We have a way to inject something different for testing (give it a different func).

worker/uniter/timer_test.go
+ wait := timer(nominal)
+ c.Assert(min, jc.LessThan, wait)
+ c.Assert(wait, jc.LessThan, max)
+ }
@jameinel

jameinel Jun 20, 2017

Owner

Only problem with this test is that it passes if NewUpdateStatusTimer just returns nominal on every request. However, since we're using rand() we can't do a lot. I wanted to do something similar in my 'batch pings' code. This is what I came up with:

pingBatcher := NewPingBatcher(nil, interval)
var lastTime time.Duration
var measuredMinTime time.Duration
var measuredMaxTime time.Duration

for i := 0; i < 1000; i++ {
	next := pingBatcher.nextSleep()
	// We use Assert rather than Check because we don't want 100s of failures
	c.Assert(next, jc.GreaterThan, minTime)
	c.Assert(next, jc.LessThan, maxTime)
	if lastTime == 0 {
		measuredMinTime = next
		measuredMaxTime = next
	} else {
		// We are using a range in 100s of milliseconds at a
		// resolution of nanoseconds. The chance of getting the
		// same random value 2x in a row is sufficiently low that
		// we can just assert the value is changing.
		// (Chance of collision is roughly 1 in 100 Million)
		c.Assert(next, gc.Not(gc.Equals), lastTime)
		if next < measuredMinTime {
			measuredMinTime = next
		}
		if next > measuredMaxTime {
			measuredMaxTime = next
		}
	}
	lastTime = next
}
// Check that we're actually using the full range that was requested.
// Assert that after 1000 tries we've used a good portion of the range
// If we sampled perfectly, then we would have fully sampled the range,
// spread very 1/1000 of the range.
// If we set the required range to 1/100, then a given sample would fail 99%
// of the time, 1000 samples would fail 0.99^1000=4e-5 or ~1-in-20,000 times.
// (actual measurements showed 18 in 20,000, probably due to double ended range vs single ended)
// However, at 1/10 its 0.9^1000=1.7e-46, or 10^41 times less likely to fail.
// In 100,000 runs, a range of 1/10 never failed
expectedCloseness := (maxTime - minTime) / 10
c.Check(measuredMinTime, jc.LessThan, minTime + expectedCloseness)
c.Check(measuredMaxTime, jc.GreaterThan, maxTime - expectedCloseness)

Running the loop 1000 times still is 4ms, so it isn't particularly expensive (I
put an outer loop of 100,000 times to check that in never failed
intermittently).

It is still probablistic, but I've given some really good odds. My
probabilities may be off slightly, but I still expect it to be vanishingly
unlikely that the test will ever intermittently fail.

@wallyworld

wallyworld Jun 20, 2017

Owner

Thanks for the better test - updated.

Owner

wallyworld commented Jun 20, 2017

$$merge$$

Contributor

jujubot commented Jun 20, 2017

Status: merge request accepted. Url: http://juju-ci.vapour.ws:8080/job/github-merge-juju

@jujubot jujubot merged commit 80414fa into juju:2.2 Jun 21, 2017

1 check passed

github-check-merge-juju Ran tests against PR. Use !!.*!! to request another build. IE, !!build!!, !!retry!!
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment