Retain next_check schedule on restart (#224, #156) #259

Conversation

@jacobbaungard (Contributor) commented Sep 10, 2018

This commit ensures that the next_check schedule for hosts and services
is retained across Naemon restarts, provided that
use_retained_scheduling_info is enabled.

The logic is as follows:

- If use_retained_scheduling_info is disabled, set a random time (as
  before).
- If use_retained_scheduling_info is enabled:
  - If we didn't miss a check during the restart, retain the old
    next_check time.
  - If we missed one check, schedule the service/host within the next
    interval_length (usually 60 seconds).
  - If we missed more than one check, schedule the next check randomly.

We schedule missed checks within 60 seconds rather than immediately, in
order to do some load balancing. This is also the rationale for
scheduling the check randomly when we missed more than one check, since
that indicates Naemon has been down for a longer period of time; see
the sketch below.
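A minimal C sketch of that decision logic follows. The function and
variable names are assumptions for illustration, not Naemon's actual
internals, and the randomization is simplified to rand():

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper, not Naemon's real code: pick next_check for an
 * object after a restart, per the logic described above.
 *   retained_next_check: next_check loaded from retention data
 *   check_interval_s:    check interval, in seconds
 *   interval_length_s:   interval_length, in seconds (usually 60)
 */
static time_t schedule_next_check(time_t now, time_t retained_next_check,
                                  long check_interval_s, long interval_length_s,
                                  int use_retained_scheduling_info)
{
    if (check_interval_s <= 0)
        check_interval_s = interval_length_s; /* defensive fallback */

    if (!use_retained_scheduling_info)
        return now + (rand() % check_interval_s); /* random, as before */

    if (retained_next_check > now)
        return retained_next_check; /* no check missed: keep old schedule */

    if (now - retained_next_check < check_interval_s) {
        /* exactly one check missed: run it within the next interval_length
         * (usually 60s) to spread the load instead of firing at once */
        return now + (rand() % interval_length_s);
    }

    /* more than one check missed (longer downtime): reschedule randomly
     * over a full check interval to avoid a thundering herd */
    return now + (rand() % check_interval_s);
}

int main(void)
{
    srand((unsigned)time(NULL));
    time_t now = time(NULL);
    /* retained next_check is 30s in the future, so it is kept as-is */
    time_t next = schedule_next_check(now, now + 30, 300, 60, 1);
    printf("next check in %ld seconds\n", (long)(next - now));
    return 0;
}
```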

This fixes:
- naemon#224
- naemon#156
- MON-10720 (https://jira.op5.com/browse/MON-10720)

Signed-off-by: Jacob Hansen <jhansen@op5.com>

@sni (Contributor) commented Sep 10, 2018

That's great news, thanks.

This commit adds tests to ensure that next_check is set correctly
after Naemon restarts, verifying that the logic from the previous
commit is correctly followed.

This fixes:
- naemon#224
- naemon#156
- MON-10720 (https://jira.op5.com/browse/MON-10720)

Signed-off-by: Jacob Hansen <jhansen@op5.com>
@jacobbaungard force-pushed the bugfix/MON-10720_retain-next-schedule branch from 98f96b0 to 402706b on September 11, 2018 09:59
@jacobbaungard (Contributor, Author)

I think this is ready now.

- Added tests.
- Fixed an issue from the previous commit (check_interval was in
  minutes where it should have been in seconds; see the sketch below).
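For context, Naemon (like Nagios) stores check_interval in units of
interval_length, 60 seconds by default, so it has to be multiplied out
before any second-based time arithmetic. A minimal sketch of the
conversion implied by that fix, with hypothetical variable names:

```c
#include <stdio.h>

int main(void)
{
    long check_interval  = 5;  /* in units of interval_length, not seconds */
    long interval_length = 60; /* seconds per unit (usually 60)            */

    /* buggy: using check_interval directly treats 5 "minutes" as 5 seconds */
    long window_wrong = check_interval;

    /* fixed: convert interval units to seconds before time comparisons */
    long window_seconds = check_interval * interval_length; /* 300 seconds */

    printf("wrong: %lds, correct: %lds\n", window_wrong, window_seconds);
    return 0;
}
```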

@roengstrom left a comment

Nice fix, looks good to me.

@jacobbaungard (Contributor, Author)

@nook24, @jvigna, how does the logic in this PR sound to you? Do you think it would solve the problems you have reported?

@jvigna commented Sep 26, 2018

Looks very good to me!

@jacobbaungard (Contributor, Author) commented Sep 27, 2018

I will go ahead and merge this this afternoon, unless there are any objections. (Edit: okay, I didn't manage that; I will do so next week. Any comments are still appreciated.)

@sni (Contributor) commented Sep 28, 2018

I was on vacation... :-) No objections.

@jacobbaungard merged commit 183178c into naemon:master on Oct 1, 2018
@jacobbaungard deleted the bugfix/MON-10720_retain-next-schedule branch on October 1, 2018 08:17
jacobbaungard added a commit to jacobbaungard/naemon-core that referenced this pull request Oct 10, 2018
After naemon#259 we now keep the
next_check schedule over restarts if use_retained_scheduling_info is
enabled. However, after this patch, if one lowered the check_interval,
it was possible that after the restart the next check of an object
would be more than one check_interval away.

This commit ensures that if the next_check is more than one
check_interval away, then we randomly schedule the next check, instead
of using the retention data.

This fixes MON-11295 (https://jira.op5.com/browse/MON-11295)

Signed-off-by: Jacob Hansen <jhansen@op5.com>
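A sketch of the guard this follow-up describes, reusing the
illustrative names from the earlier sketch (an assumption, not the
actual patch):

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper, not the actual patch: if the retained next_check
 * lies more than one check_interval in the future (possible when
 * check_interval was lowered before the restart), discard the retained
 * value and reschedule randomly within one interval. */
static time_t clamp_retained_next_check(time_t now, time_t retained,
                                        long check_interval_s)
{
    if (retained - now > check_interval_s)
        return now + (rand() % check_interval_s);
    return retained; /* still within one interval: keep the retained time */
}

int main(void)
{
    srand((unsigned)time(NULL));
    time_t now = time(NULL);
    /* interval lowered from 10 to 5 minutes; retained check is 8 min away */
    time_t next = clamp_retained_next_check(now, now + 480, 300);
    printf("next check in %ld seconds\n", (long)(next - now));
    return 0;
}
```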