
Introduce retained_scheduling_randomize_window #277

Conversation

@jacobbaungard (Contributor) commented Dec 13, 2018

With use_retained_scheduling_info enabled, checks that were missed by less
than one check_interval would be rescheduled within one interval_length.

This commit introduces a new setting,
retained_scheduling_randomize_window, which allows users to configure
the window in which checks that were missed over a restart are rescheduled.

This can be useful to improve the load balancing done after a restart, and
may help fix CPU load spikes caused by unevenly scheduled checks.

This is part of MON-11418.

Signed-off-by: Jacob Hansen <jhansen@op5.com>
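A minimal C sketch of the idea above, assuming a ranged_urand(low, high) helper like the one mentioned later in this thread; this is illustrative only, not the actual naemon implementation:

#include <stdlib.h>
#include <time.h>

/* Stand-in for the ranged_urand() helper referenced in this thread;
 * the real helper may differ. */
static int ranged_urand(int low, int high)
{
        return low + rand() % (high - low + 1);
}

/* Reschedule a check that was missed while naemon was restarting: spread it
 * randomly over the configured retained_scheduling_randomize_window
 * (in seconds) instead of running it immediately after startup. */
static time_t reschedule_missed_check(time_t current_time, int randomize_window)
{
        return current_time + ranged_urand(0, randomize_window);
}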

@jacobbaungard force-pushed the bugfix/MON-11418-uneven-check-schedule branch from a428016 to f31369a on December 13, 2018 11:05
@sni (Contributor) commented Dec 13, 2018

I don't think this is a good idea. With this, basically all missed checks would be randomly rescheduled within their check_interval, which could be several hours; that is basically the opposite of use_retained_scheduling. Maybe disabling retained scheduling would solve the initial issue here?

@jacobbaungard (Contributor, Author)

Yes, disabling use_retained_scheduling more or less guarantees that Naemon will balance the load evenly (given that all checks take an equal amount of resources), and is a workaround for the problem encountered here.

But with the current implementation I think there is a fairly high chance of uneven load manifesting over time, which is not good. This patch should keep the load more or less as well balanced as it was before the use_retained_scheduling_info behaviour was fixed, while still preserving most check scheduling across restarts.

@sni (Contributor) commented Dec 13, 2018

The current implementation is not that bad, I'd say. It already distributes the load randomly over one minute.
I am unsure which problem the new implementation would solve. It has problems with long check intervals, because a check which only runs every couple of hours could be delayed for a long time, which the current implementation handles better. And checks with a very short interval are already randomized over a minute anyway, so nothing would change for them.
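To put rough numbers on that trade-off (illustrative values only; interval_length=60 and a 4-hour check interval are assumptions):

#include <stdio.h>

int main(void)
{
        int interval_length = 60;   /* seconds per interval unit (assumed) */
        int check_interval = 240;   /* the check runs every 240 intervals = 4 hours */

        /* Randomizing over the whole check_interval can delay the check by hours. */
        int max_delay_randomized = check_interval * interval_length;   /* 14400 s */

        /* The current implementation randomizes over one minute at most. */
        int max_delay_current = 60;

        printf("randomized over check_interval: up to %d s\n", max_delay_randomized);
        printf("current implementation: up to %d s\n", max_delay_current);
        return 0;
}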

@nook24 (Member) commented Dec 13, 2018

Sorry guys, I totally missed that you mentioned me in #259 (unfortunately I haven't tested it yet).

I don't think using interval_length is the best idea. I use an interval_length of 1, so the "load balancing" would effectively be:

delay = ranged_urand(0, 1);

in the current version.

What is the value of temp_service->next_check for newly created hosts and services? 0 or current_time?

@sni (Contributor) commented Dec 13, 2018

Valid point, so my favorite would be the current implementation with a random delay of up to 1 minute.

@jacobbaungard (Contributor, Author) commented Dec 17, 2018

I am OK with hard-coding the value instead of using the interval_length.

However, I still think this is rather likely to cause issues with load balancing. On our side, 5 minutes is the default check interval (which I think is probably common?). So with the current implementation there is already a very significant imbalance if you happen to restart twice at roughly the same point within that 5-minute interval (if you have a large setup where a restart takes 10+ seconds).

With the solution in this PR, there is indeed a chance to miss a check with a long check_period during a restart, and then there could be a significant wait until the next check. However, the chance of this happening more than once in a row is very low, I would think. Missing one check is probably not the end of the world, though still not very nice.

Could an alternative be to have a limit deciding when we schedule within 1 minute and when we randomly reschedule? I.e. all checks missed with a check_interval of less than 5, or perhaps 10, minutes are randomly rescheduled, and all missed checks with longer check intervals are scheduled within 1 minute. That way we still get good load balancing for checks with short check intervals, and missed checks with long intervals are executed shortly after a restart.
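A rough sketch of that alternative, building on the first sketch above (same includes and ranged_urand() stand-in); the 10-minute threshold is an example value, not part of the PR:

/* Example threshold: intervals up to 10 minutes count as "short". */
#define SHORT_INTERVAL_THRESHOLD (10 * 60)

/* Proposed hybrid: spread short-interval checks over their whole interval,
 * but reschedule long-interval checks within one minute of the restart. */
static time_t reschedule_after_restart(time_t now, int check_interval_seconds)
{
        if (check_interval_seconds <= SHORT_INTERVAL_THRESHOLD)
                return now + ranged_urand(0, check_interval_seconds);
        return now + ranged_urand(0, 60);
}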

@jacobbaungard force-pushed the bugfix/MON-11418-uneven-check-schedule branch from f31369a to 9191944 on December 17, 2018 15:53
@jacobbaungard changed the title from "Schedule missed checks randomly after restart" to "Introduce retained_scheduling_randomize_window" on Dec 17, 2018
@jacobbaungard force-pushed the bugfix/MON-11418-uneven-check-schedule branch from 9191944 to c1e87c8 on December 17, 2018 15:55
@jacobbaungard (Contributor, Author)

Updated PR, see edited PR description.

Hope this is OK with everyone.

@nook24 (Member) commented Dec 17, 2018

I will take a look

@jacobbaungard force-pushed the bugfix/MON-11418-uneven-check-schedule branch 2 times, most recently from 53eb10e to 247de84 on December 17, 2018 16:13
@nook24 (Member) commented Dec 17, 2018

@jacobbaungard What is the value of temp_service->next_check for newly created objects?

@jacobbaungard force-pushed the bugfix/MON-11418-uneven-check-schedule branch from 247de84 to f457a5b on December 17, 2018 16:16
@jacobbaungard (Contributor, Author)

Jeez, sorry for all the random force-pushes. I couldn't get an alignment right, and then apparently couldn't spell properly. It should work again now.

@nook24 afaik it is 0, and hence the logic in place will randomly select the first check to happen within the first check_interval after the restart.

One thing to note about this approach, which is less optimal, is that if you have a check_interval < retained_scheduling_randomize_window, the next check after a restart might be more than one check_interval away. Perhaps the setting should act as the maximum window size, but if a check has a check_interval shorter than the window, the check_interval is used instead?
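A small sketch of that cap, which is what the follow-up commit below implements; variable names and the ranged_urand() stand-in from the first sketch are assumptions:

/* Never randomize beyond one check_interval, so the check is always scheduled
 * within its first interval after the restart. */
static time_t randomize_within_window(time_t now, int randomize_window,
                                      int check_interval_seconds)
{
        int window = randomize_window;
        if (check_interval_seconds < window)
                window = check_interval_seconds;
        return now + ranged_urand(0, window);
}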

@sni (Contributor) left a comment

looks good

@nook24 (Member) commented Dec 17, 2018

Looks good to me too.

If the retained_scheduling_randomize_window is larger than the object's
check_interval, then we use the check_interval for scheduling instead.
This ensures that the object is always scheduled within the first
check_interval after a restart.

Signed-off-by: Jacob Hansen <jhansen@op5.com>
@jacobbaungard (Contributor, Author)

Just added a new commit to ensure that if the retained_scheduling_randomize_window is larger than the object's check_interval, we use the check_interval instead.

@sni (Contributor) commented Dec 18, 2018

Agreed, that sounds like a good idea.

@jacobbaungard merged commit ed6e3c1 into naemon:master on Dec 18, 2018
@jacobbaungard deleted the bugfix/MON-11418-uneven-check-schedule branch on December 18, 2018 12:57