Introduce retained_scheduling_randomize_window #277
Force-pushed: a428016 to f31369a
I don't think this is a good idea. With this, basically all missed checks would be randomly rescheduled within their check_interval, which could be several hours, which is basically the opposite of
Yes, disabling use_retained_scheduling more or less guarantees that Naemon will balance the load evenly (given that all checks take an equal amount of resources), and it is a workaround for the problem encountered here. But with the current implementation I think there is quite a high chance of uneven load building up over time, which is not good. This patch should keep the load about as well balanced as it was before use_retained_scheduling info was fixed, while we still keep most check scheduling across restarts.
The current implementation is not that bad, I'd say. It already distributes load randomly over one minute.
Sorry guys, I totally missed that you mentioned me in #259 (Unfortunately I haven't tested it now) I think to make use of
in the current version. What is the value of |
Valid point, so my favorite would be the current implementation with a random delay of up to 1 minute.
I am OK with hard-coding the value instead of using the interval_length. However, I still think this is rather likely to cause issues with load balancing. On our side, 5 minutes is the default (which I think is probably common?). So with the current implementation there is already a very significant imbalance if you happen to restart twice at roughly the same point within that 5 minute interval (if you have a large setup where a restart takes 10+ seconds). With the solution in this PR, there is indeed a chance of missing a check with a long check_period during a restart, and then there could be a significant wait until the next check. However, the chance of this happening more than once in a row is very low, I would think. Missing one check is probably not the end of the world, but nonetheless not very nice. Could an alternative be to have a limit that decides whether we schedule within 1 minute or reschedule randomly? I.e. all missed checks with a check_interval of less than 5, or perhaps 10, minutes are randomly scheduled, and all missed checks with longer check intervals are scheduled within 1 minute. That way we still get good load balancing of checks with short check intervals, and missed checks with long intervals are executed shortly after a restart.
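To make the alternative in the comment above concrete, here is a minimal Python sketch of the threshold idea. It is not Naemon code: the function name `pick_delay`, the constants, and the choice to express everything in seconds are assumptions made for illustration only.

```python
import random

# Hypothetical values taken from the comment: "less than 5, or perhaps 10 minutes".
SHORT_INTERVAL_LIMIT = 5 * 60   # seconds; assumed threshold
FALLBACK_WINDOW = 60            # seconds; the "within 1 minute" behaviour


def pick_delay(check_interval, rng=random):
    """Return how many seconds after a restart a missed check should run.

    check_interval is assumed to be given in seconds.
    """
    if check_interval < SHORT_INTERVAL_LIMIT:
        # Short interval: spread the check over its whole check_interval.
        window = check_interval
    else:
        # Long interval: run it soon after the restart instead.
        window = FALLBACK_WINDOW
    return rng.uniform(0, window)
```

With these assumed values, a check with a 2 minute interval lands somewhere in the first 2 minutes after a restart, while a check with a 1 hour interval lands within the first minute.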
Force-pushed: f31369a to 9191944
Force-pushed: 9191944 to c1e87c8
Updated PR, see edited PR description. Hope this is OK with everyone.
I will take a look
Force-pushed: 53eb10e to 247de84
@jacobbaungard What is the value of |
With use_retained_scheduling_info enabled, we would schedule checks that were missed by less than one check_interval within one interval_length. This commit introduces a new setting, retained_scheduling_randomize_window, which allows users to configure the window in which checks that were missed over a restart are rescheduled. This can be useful for improving the load balancing done after a restart, and might help fix CPU load spikes caused by checks being unevenly scheduled. This is part of MON-11418. Signed-off-by: Jacob Hansen <jhansen@op5.com>
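The behaviour described in this commit message can be sketched roughly as follows. This is an illustrative Python model, not the actual patch: the function name, the use of seconds for every value, and the handling of the cases outside the commit message (returning the retained time unchanged, or `None`) are assumptions.

```python
import random


def next_check_after_restart(now, retained_next_check, check_interval,
                             randomize_window, rng=random):
    """Pick the next check time for one object after a restart.

    All arguments are assumed to be in seconds. Returns None for cases
    this sketch does not cover.
    """
    if retained_next_check >= now:
        # Assumption: a check that was not missed keeps its retained time.
        return retained_next_check
    if now - retained_next_check < check_interval:
        # Missed by less than one check_interval: reschedule at a random
        # point inside the configurable window.
        return now + rng.uniform(0, randomize_window)
    # Missed by more than one check_interval: not described in the commit
    # message, so deliberately left out of this sketch.
    return None
```

A larger window spreads the missed checks over more time after the restart, which is the load-balancing effect the commit message is after.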
Force-pushed: 247de84 to f457a5b
Jeez, sorry for all the random force-pushes. Couldn't get an alignment right, and then apparently couldn't spell properly. Should work again now. @nook24 afaik it is 0, and hence the logic in place will randomly select the first check to happen within the first check_interval after the restart. One thing to note about this approach, which is less than optimal, is that if you have a check_interval < retained_scheduling_randomize_window, then the next check after a restart might be more than one check_interval away. Perhaps it should be the maximum window size, but if any check has a check interval less than the window, the check_interval is used instead?
looks good
Looks good to me too.
If retained_scheduling_randomize_window is larger than the object's check_interval, then we use the check_interval for scheduling instead. This ensures that the object is always scheduled within the first check_interval after a restart. Signed-off-by: Jacob Hansen <jhansen@op5.com>
Just added a new commit to ensure that if the retained_scheduling_randomize_window is larger than the object's check_interval, then we use the check_interval instead.
Agreed, that sounds like a good idea.
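The rule in this commit message amounts to clamping the window to the object's check_interval. A minimal Python sketch, assuming both values use the same unit (the function names are hypothetical and do not come from the patch):

```python
import random


def effective_window(randomize_window, check_interval):
    """Never let the randomization window exceed the object's check_interval."""
    return min(randomize_window, check_interval)


def randomized_delay(randomize_window, check_interval, rng=random):
    """Delay after restart, guaranteed to fall within one check_interval."""
    return rng.uniform(0, effective_window(randomize_window, check_interval))
```

For example, with a window of 60 and a check_interval of 30 the effective window is 30, so the object is still checked within its first check_interval after the restart.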
Signed-off-by: Jacob Hansen <jhansen@op5.com>