
Naemon fake state change events if the host is down #287

Closed
ibering opened this issue Mar 29, 2019 · 16 comments

@ibering

commented Mar 29, 2019

Today my colleagues discovered weird behavior in the code that "fakes" a hard state change for services when their host is down.

What's the issue with this?

It manipulates (and breaks) the data in nebstruct_statechange_data.

| Time | Check # | State | State Type | State Change | Notes |
|------|---------|-------|------------|--------------|-------|
| 0 | 1/5 | DOWN | SOFT | YES | Host change from HARD Up to DOWN SOFT - OK |
| 1 | 1/3 | Critical | SOFT | YES | Service change from Ok HARD to Critical SOFT - OK |
| 2 | 2/3 | Unknown | HARD | YES | Service change from Critical SOFT to Unknown HARD; last_hard_state gets set to 3 - ERROR: should be a SOFT state, and last_hard_state should still be 0 |

*(screenshot: observed state changes, 2019-03-29)*

Expected behavior

| Time | Check # | State | State Type | State Change | Notes |
|------|---------|-------|------------|--------------|-------|
| 0 | 1 | DOWN | SOFT | YES | Host change from HARD Up to DOWN SOFT |
| 1 | 1/3 | Critical | SOFT | YES | Service change from Ok HARD to Critical SOFT |
| 2 | 2/3 | Unknown | SOFT | YES | Service change from Critical SOFT to Unknown SOFT; set a SOFT state, don't touch last_hard_state |

This only happens if the host is in a non-Up state.

If the host is Up, the last_hard_state works as expected:

*(screenshot: expected state changes with the host Up, 2019-03-29)*

I guess this is the code which is causing the issue:

```c
/* if the host is down or unreachable ... */
/* The host might be in a SOFT problem state due to host check retries/caching. Not sure if we should take that into account and do something different or not... */
if (route_result != STATE_UP) {

	log_debug_info(DEBUGL_CHECKS, 2, "Host is not UP, so we mark state changes if appropriate\n");

	/* "fake" a hard state change for the service - well, its not really fake, but it didn't get caught earlier... */
	if (temp_service->last_hard_state != temp_service->current_state) {
		hard_state_change = TRUE;
		nm_log(NSLOG_INFO_MESSAGE, "SERVICE INFO: %s;%s; Service switch to hard down state due to host down.\n", temp_service->host_name, temp_service->description);
	}

	/* update last state change times */
	if (state_change == TRUE || hard_state_change == TRUE)
		temp_service->last_state_change = temp_service->last_check;
	if (hard_state_change == TRUE) {
		temp_service->last_hard_state_change = temp_service->last_check;
		temp_service->state_type = HARD_STATE;
		temp_service->last_hard_state = temp_service->current_state;
	}

	/* put service into a hard state without attempting check retries and don't send out notifications about it */
	temp_service->host_problem_at_last_check = TRUE;
}
```

This looks wrong to me: `temp_service->last_hard_state = temp_service->current_state;`

@jacobbaungard

Contributor

commented Mar 29, 2019

This is by design, basically. The service status is changed to HARD if the associated host is (iirc) in hard down.

The reason for this is to avoid notification storms. A service which goes to hard down does not send out notifications if the host is also down. Imagine a host with 100 services, and then, by some weird coincidence, let's say all the services end up in a hard down state before the host does. Instead of one notification event saying the host is down, you would get 100 service down notifications. I don't think anyone wants that.

Of course it is unlikely that all services go to hard down before the host, but it is not at all unlikely that quite a lot of services would be in hard down before the host gets into hard down, resulting in a lot of notifications.

@ibering

Author

commented Mar 29, 2019

> This is by design basically. The service status is changed to HARD if the associated host is in (iirc) hard down.

Please correct me if I'm wrong, but I can't see a check for whether the host is in a hard state.

Here is the snippet from the naemon.log:

```
[1553854390] HOST ALERT: localhost TEST;DOWN;SOFT;1;Test alert
[1553854427] SERVICE INFO: localhost TEST;Ping; Service switch to hard down state due to host down.
[1553854427] SERVICE ALERT: localhost TEST;Ping;UNKNOWN;HARD;2;Test alert
[1553854427] HOST ALERT: localhost TEST;UP;SOFT;2;OK - 127.0.0.1: rta 0.037ms, lost 0%
```

The log reports that the host is DOWN SOFT, and my service switches to a hard down state due to the host being down.

@jacobbaungard

Contributor

commented Mar 29, 2019

Yes, I wasn't sure whether the switch was done when the host was soft or hard. Although, thinking about it further, it makes sense even when the host is in a soft state. This protects against the scenario where both host and service are 2/3 soft, but the service check happens first (which would then cause a notification event if we only switched the service to hard down once the host reached a hard state). Unless the notification logic doesn't send notifications while the host is in soft down.

@ibering

Author

commented Mar 29, 2019

> This to protect against the scenario where both host and service are 2/3 soft, but the service check happens first (which would then cause a notification event if we only switch the service to hard-down when the host reached a hard state).

I see what you are trying to achieve... However, why does last_hard_state get changed to the current state? Why is this value touched at all?

Logging a notice and suppressing the notification should be enough. The current behavior makes it almost impossible to calculate availability reports, because last_hard_state was Ok and suddenly last_hard_state is Unknown, which never actually occurred.

@ibering

Author

commented Apr 9, 2019

Any news on this? My database contains over 13 million potentially wrong records. :(

@jacobbaungard

Contributor

commented Apr 10, 2019

I have not really had time to look into this in any great detail, but in my opinion last_hard_state should correspond to the last hard state, even if the service state is forced into a hard state due to the host being down. As far as I can see from the table you included in the issue report, that is the case here?

The service returned a non-OK state while its host was in either a down or unreachable state, and as such (and according to the documentation) the service state was changed to hard Unknown, and last_hard_state changed accordingly?

I'm not really sure why the state changed from Critical to Unknown. Hopefully that is according to what the plugin actually returned during that service check.

@nook24

Contributor

commented Apr 10, 2019

The problem is with this line:

`temp_service->last_hard_state = temp_service->current_state;`

Please don't overwrite last_hard_state with the current state; this makes no sense, because last_hard_state is not current_state.
It is OK that the service switches to a hard state, but you should definitely keep last_hard_state as it is.

*(screenshot: state change data, 2019-04-10)*

I guess you just need to remove this line and the problem should be gone...

@jacobbaungard

Contributor

commented Apr 10, 2019

I really don't see the logic behind having a field called last_hard_state which is not equal to the current state, assuming the current state type is in fact hard.

The code you link to changes the current state_type to hard, and therefore the last_hard_state is equal to the current state? That seems super sensible to me.

@jacobbaungard

Contributor

commented Apr 10, 2019

Perhaps what you want is something like previous_hard_state, which always has the last hard_state but not including the current state?

@nook24

Contributor

commented Apr 10, 2019

What is the difference between previous_hard_state and last_hard_state?

I mean, struct service has

```c
int	current_state;
int	last_state;
int	last_hard_state;
int	state_type;
```

What is the job of last_hard_state?
last_hard_state cannot be equal to current_state, because it is in the past?

I'm sorry, but I must be missing something here.

(And objects_service.h mixes up tabs and spaces.)

@jacobbaungard

Contributor

commented Apr 10, 2019

Probably "last" should really be "latest", if we are looking at the English definition of the word.

last_hard_state is different from current_state when the state_type is soft. Having a parameter that behaves like this is useful in its own right, and I am pretty hesitant to change its logic.

I am all for implementing something like previous_hard_state, which would behave differently from last_hard_state in that it does not take the current_state into account, even when the current_state is hard.

@sni

Contributor

commented Apr 10, 2019

Nothing changed at that point; last_hard_state has always been set to current_state as soon as there was a hard state change. The only new thing is that there is a fake hard state change if the host goes down.

@nook24

Contributor

commented Apr 10, 2019

As long as the host is Up, last_hard_state works as expected: it holds the previous hard state.
I tested this using NEBCALLBACK_STATE_CHANGE_DATA.
Usually, in NEBCALLBACK_STATE_CHANGE_DATA, last_hard_state == previous hard state.

If the host is down and the service gets forced into a hard state by the fake hard state code, the behavior slightly differs. In the NEBCALLBACK_STATE_CHANGE_DATA event that gets fired by this fake hard state change, last_hard_state gets set to current_state, so last_hard_state == current_state.

@sni

Contributor

commented Apr 10, 2019

The name is misleading; last_hard_state always contains the current/latest hard state.

@nook24

Contributor

commented Apr 10, 2019

Maybe I found the issue. I will run some tests and send a PR if so.

nook24 added a commit to nook24/naemon-core that referenced this issue Apr 10, 2019

@nook24 nook24 referenced this issue Apr 10, 2019

Merged

Issue 287 #291

jacobbaungard added a commit that referenced this issue Apr 15, 2019

Move last_hard_state change to after event handler
Previously the last_hard_state was changed *after* the event handler in case of a service returning from critical->ok whereas it would be changed *before* the event handler in case of a service going from ok->critical.

This commit ensures that the behavior is consistent for both scenarios, and the last_hard_state is changed after the event handler is run.

This fixes issue #287.
@jacobbaungard

Contributor

commented Apr 15, 2019

Fixed by #291
