
Introduce retained_scheduling_randomize_window #277

Conversation

@jacobbaungard (Contributor) commented Dec 13, 2018

With use_retained_scheduling_info enabled, checks that were missed by less
than one check_interval would be rescheduled within one interval_length.

This commit introduces a new setting,
retained_scheduling_randomize_window, which allows users to configure
the window in which checks that were missed over a restart are rescheduled.

This can be useful to improve the load balancing done after a restart, and
may help fix CPU load spikes caused by unevenly scheduled checks.

This is part of MON-11418.

Signed-off-by: Jacob Hansen <jhansen@op5.com>
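A minimal C sketch of the idea above, assuming a ranged_urand(low, high) helper like the one mentioned later in this thread; this is illustrative only, not the actual naemon implementation:

#include <stdlib.h>
#include <time.h>

/* Stand-in for the ranged_urand() helper referenced in this thread;
 * the real helper may differ. */
static int ranged_urand(int low, int high)
{
        return low + rand() % (high - low + 1);
}

/* Reschedule a check that was missed while naemon was restarting: spread it
 * randomly over the configured retained_scheduling_randomize_window
 * (in seconds) instead of running it immediately after startup. */
static time_t reschedule_missed_check(time_t current_time, int randomize_window)
{
        return current_time + ranged_urand(0, randomize_window);
}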

@jacobbaungard force-pushed the bugfix/MON-11418-uneven-check-schedule branch from a428016 to f31369a on December 13, 2018 11:05
@sni (Contributor) commented Dec 13, 2018

I don't think this is a good idea. With this, basically all missed checks would be randomly rescheduled within their check_interval, which could be several hours; that is basically the opposite of use_retained_scheduling. Maybe disabling retained scheduling would solve the initial issue here?

@jacobbaungard (Contributor, Author)

Yes, disabling use_retained_scheduling more or less guarantees that Naemon will balance the load evenly (given that all checks take an equal amount of resources), and is a workaround for the problem encountered here.

But with the current implementation I think there is a fairly high chance of uneven load manifesting over time, which is not good. This patch should keep the load more or less as well balanced as it was before the use_retained_scheduling_info behaviour was fixed, while still preserving most check scheduling across restarts.

@sni (Contributor) commented Dec 13, 2018

The current implementation is not that bad, I'd say. It already distributes the load randomly over one minute.
I am unsure which problem the new implementation would solve. It has problems with long check intervals, because a check which only runs every couple of hours could be delayed for a long time, which the current implementation handles better. And checks with a very short interval are already randomized over a minute anyway, so nothing would change for them.
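To put rough numbers on that trade-off (illustrative values only; interval_length=60 and a 4-hour check interval are assumptions):

#include <stdio.h>

int main(void)
{
        int interval_length = 60;   /* seconds per interval unit (assumed) */
        int check_interval = 240;   /* the check runs every 240 intervals = 4 hours */

        /* Randomizing over the whole check_interval can delay the check by hours. */
        int max_delay_randomized = check_interval * interval_length;   /* 14400 s */

        /* The current implementation randomizes over one minute at most. */
        int max_delay_current = 60;

        printf("randomized over check_interval: up to %d s\n", max_delay_randomized);
        printf("current implementation: up to %d s\n", max_delay_current);
        return 0;
}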

@nook24 (Member) commented Dec 13, 2018

Sorry guys, I totally missed that you mentioned me in #259 (unfortunately I haven't tested it yet).

I don't think using interval_length is the best idea. I use an interval_length of 1, so the "load balancing" would effectively be:

delay = ranged_urand(0, 1);

in the current version.

What is the value of temp_service->next_check for newly created hosts and services? 0 or current_time?

@sni (Contributor) commented Dec 13, 2018

Valid point, so my favorite would be the current implementation with a random delay of up to 1 minute.

@jacobbaungard (Contributor, Author) commented Dec 17, 2018

I am OK with hard-coding the value instead of using the interval_length.

However, I still think this is rather likely to cause issues with load balancing. On our side, 5 minutes is the default check interval (which I think is probably common?). So with the current implementation there is already a very significant imbalance if you happen to restart twice at roughly the same point within that 5-minute interval (if you have a large setup where a restart takes 10+ seconds).

With the solution in this PR, there is indeed a chance to miss a check with a long check_period during a restart, and then there could be a significant wait until the next check. However, the chance of this happening more than once in a row is very low, I would think. Missing one check is probably not the end of the world, though still not very nice.

Could an alternative be to have a limit deciding when we schedule within 1 minute and when we randomly reschedule? I.e. all checks missed with a check_interval of less than 5, or perhaps 10, minutes are randomly rescheduled, and all missed checks with longer check intervals are scheduled within 1 minute. That way we still get good load balancing for checks with short check intervals, and missed checks with long intervals are executed shortly after a restart.
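A rough sketch of that alternative, building on the first sketch above (same includes and ranged_urand() stand-in); the 10-minute threshold is an example value, not part of the PR:

/* Example threshold: intervals up to 10 minutes count as "short". */
#define SHORT_INTERVAL_THRESHOLD (10 * 60)

/* Proposed hybrid: spread short-interval checks over their whole interval,
 * but reschedule long-interval checks within one minute of the restart. */
static time_t reschedule_after_restart(time_t now, int check_interval_seconds)
{
        if (check_interval_seconds <= SHORT_INTERVAL_THRESHOLD)
                return now + ranged_urand(0, check_interval_seconds);
        return now + ranged_urand(0, 60);
}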

@jacobbaungard force-pushed the bugfix/MON-11418-uneven-check-schedule branch from f31369a to 9191944 on December 17, 2018 15:53
@jacobbaungard changed the title from "Schedule missed checks randomly after restart" to "Introduce retained_scheduling_randomize_window" on Dec 17, 2018
@jacobbaungard force-pushed the bugfix/MON-11418-uneven-check-schedule branch from 9191944 to c1e87c8 on December 17, 2018 15:55
@jacobbaungard (Contributor, Author)

Updated PR, see edited PR description.

Hope this is OK with everyone.

@nook24 (Member) commented Dec 17, 2018

I will take a look

@jacobbaungard force-pushed the bugfix/MON-11418-uneven-check-schedule branch 2 times, most recently from 53eb10e to 247de84 on December 17, 2018 16:13
@nook24 (Member) commented Dec 17, 2018

@jacobbaungard What is the value of temp_service->next_check for newly created objects?

@jacobbaungard force-pushed the bugfix/MON-11418-uneven-check-schedule branch from 247de84 to f457a5b on December 17, 2018 16:16
@jacobbaungard (Contributor, Author)

Jeez, sorry for all the random force-pushes. I couldn't get an alignment right, and then apparently couldn't spell properly. It should work again now.

@nook24 afaik it is 0, and hence the logic in place will randomly select the first check to happen within the first check_interval after the restart.

One thing to note about this approach, which is less optimal, is that if you have a check_interval < retained_scheduling_randomize_window, the next check after a restart might be more than one check_interval away. Perhaps the setting should act as the maximum window size, but if a check has a check_interval shorter than the window, the check_interval is used instead?
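A small sketch of that cap, which is what the follow-up commit below implements; variable names and the ranged_urand() stand-in from the first sketch are assumptions:

/* Never randomize beyond one check_interval, so the check is always scheduled
 * within its first interval after the restart. */
static time_t randomize_within_window(time_t now, int randomize_window,
                                      int check_interval_seconds)
{
        int window = randomize_window;
        if (check_interval_seconds < window)
                window = check_interval_seconds;
        return now + ranged_urand(0, window);
}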

@sni (Contributor) left a comment

looks good

@nook24 (Member) commented Dec 17, 2018

Looks good to me too.

If the retained_scheduling_randomize_window is larger than the object's
check_interval, then we use the check_interval for scheduling instead.
This ensures that the object is always scheduled within the first
check_interval after a restart.

Signed-off-by: Jacob Hansen <jhansen@op5.com>
@jacobbaungard (Contributor, Author)

Just added a new commit to ensure that if the retained_scheduling_randomize_window is larger than the object's check_interval, we use the check_interval instead.

@sni (Contributor) commented Dec 18, 2018

Agreed, that sounds like a good idea.

@jacobbaungard merged commit ed6e3c1 into naemon:master on Dec 18, 2018
@jacobbaungard deleted the bugfix/MON-11418-uneven-check-schedule branch on December 18, 2018 12:57