Commit

Merge bf96c8c into fc4a0fe

geektophe committed May 10, 2021
2 parents fc4a0fe + bf96c8c commit 795180a
Showing 6 changed files with 265 additions and 183 deletions.
20 changes: 20 additions & 0 deletions doc/source/03_configuration/configmain.rst
@@ -195,6 +195,26 @@ Default:
This option is used to decide whether or not to apply a state change when a host or service is impacted by a root problem (like the service's host going down, or a host's parent being down too). The state will be changed to UNKNOWN for a service and UNREACHABLE for a host until their next scheduled check. This state change does not count as an attempt; it is only for the console, so that users know these objects have problems and their previous states are uncertain.


.. _configuration/configmain#enable_problem_impacts_states_reprocessing:

Enable problem/impacts states reprocessing
-------------------------------------------

Format:

::

enable_problem_impacts_states_reprocessing=<0/1>

Default:

::

enable_problem_impacts_states_reprocessing=0

This option is used to enforce the reprocessing of the problem/impact state after the retention data has been loaded into the scheduler. It's off by default, meaning that the problem/impact related attributes only get modified when a new state change is detected (a new active or passive check result arrives with a different state). When enabled, this feature walks the entire object dependency tree to re-evaluate each object's state after the basic object states have been restored.
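
Since the reprocessing only takes effect when the problem/impacts state change feature is itself enabled (the scheduler checks both flags before walking the dependency tree), the two parameters are typically set together. A minimal example, assuming both are set in the scheduler's main configuration file:

::

    enable_problem_impacts_states_change=1
    enable_problem_impacts_states_reprocessing=1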


.. _configuration/configmain#disable_old_nagios_parameters_whining:

Disable Old Nagios Parameters Whining
22 changes: 17 additions & 5 deletions doc/source/09_architecture/problems-and-impacts.rst
@@ -1,11 +1,11 @@
.. _architecture/problems-and-impacts:

============================================
Problems and impacts correlation management
============================================


What is this correlation?
===========================

The main role of this feature is to allow users to have the same correlation views in the console as they get in the notifications.
@@ -29,10 +29,10 @@ It's important to see that such a state change does not interfere with the HARD/SOFT
Here the gateway is already in DOWN/HARD. We can see that none of the servers have an output: they have not been checked yet, but we have already set the UNREACHABLE state. When they are checked, there will be an output and they will keep this state.


How to enable it?
==================

It's quite easy, all you need to do is enable the parameter

::

@@ -41,7 +41,7 @@ It's quite easy, all you need to do is enable the parameter
See :ref:`enable_problem_impacts_states_change <configuration/configmain#enable_problem_impacts_states_change>` for more information about it.


Dynamic Business Impact
========================

There is a good thing about problems and impacts when you do not set a parent device's Business Impact: your problem will dynamically inherit the maximum business impact of the failed child!
@@ -55,3 +55,15 @@ There are 2 nights:
* the second night, the switch has a problem, but this time it impacts the production environment! This time, the computed impact is set to 5 (the maximum impact of its children, here the production application), so it's higher than the min_criticity of the contact, and the notification is sent. The admin is woken up, and can solve this problem before too many users are impacted :)
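
As a hedged configuration sketch of this scenario (object names are illustrative and only the relevant directives are shown; business_impact on hosts/services and min_business_impact on contacts are assumed to be the parameters behind the min_criticity behaviour described above):

::

    define host{
        host_name            production-app-srv
        business_impact      5    ; top-priority production application
    }

    define contact{
        contact_name         admin
        min_business_impact  4    ; only notify for business impact >= 4
    }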


Enforce problem/impact state calculation
=========================================

The problem/impact calculation that determines whether a failed check is the root cause of a problem or one of its impacts is done when a new check result arrives. When the scheduler restarts or receives a new configuration, it may save and restore the previous object states from the retention data. There are situations where the problem/impact attributes can revert to their default values, even if an object's state is not OK. By default, the attributes are only recalculated when a new check result arrives.

It's possible to enforce the problem/impact reprocessing of all the objects after the retention data has been loaded; note that it only takes effect when the problem/impacts state change feature is itself enabled. To enable this feature, you only have to set the parameter

::

enable_problem_impacts_states_reprocessing=1

See :ref:`enable_problem_impacts_states_reprocessing <configuration/configmain#enable_problem_impacts_states_reprocessing>` for more information about it.
3 changes: 3 additions & 0 deletions shinken/objects/config.py
@@ -567,6 +567,9 @@ class Config(Item):
'enable_problem_impacts_states_change':
BoolProp(default=False, class_inherit=[(Host, None), (Service, None)]),

'enable_problem_impacts_states_reprocessing':
BoolProp(default=False, class_inherit=[(Host, None), (Service, None)]),

# More a running value in fact
'resource_macros_names':
ListProp(default=[]),
194 changes: 138 additions & 56 deletions shinken/objects/schedulingitem.py
@@ -252,14 +252,27 @@ def do_check_freshness(self):
)
return None

def set_myself_as_problem(self, send_brok=True):
"""
Raise all impacts from my error. I'm setting myself as a problem, and
I register myself as such in all hosts/services that depend_on_me,
so they are now my impacts.
This method may be called to correctly reinitialize the object state
after the retention data has been loaded. In such a situation, a
brok should not be emitted if the state is modified. The send_brok
variable reflects this.
:param bool send_brok: Should a brok be emitted if the object state
is modified.
"""
now = time.time()
updated = False

if self.is_problem is False:
self.is_problem = True
updated = True
# we should warn potential impacts of our problem
# and they should be cool to register them so I've got
# my impacts list
@@ -271,21 +284,22 @@ def set_myself_as_problem(self):
# now check if we should bailout because of a
# not good timeperiod for dep
if tp is None or tp.is_time_valid(now):
new_impacts = impact.register_a_problem(self, send_brok)
impacts.extend(new_impacts)

# Only update impacts and create new brok if impacts changed.
s_impacts = set(impacts)
if s_impacts != set(self.impacts):
self.impacts = list(s_impacts)

# We can update our business_impact value now
self.update_business_impact_value()
updated = True

if send_brok is True and updated is True:
# And we register a new brok for update status
b = self.get_update_status_brok()
self.broks.append(b)

# We update our 'business_impact' value with the max of
# the impacts business_impact if we got impacts. And save our 'configuration'
@@ -323,15 +337,25 @@ def update_business_impact_value(self):
if self.my_own_business_impact != -1 and not in_modulation:
self.business_impact = self.my_own_business_impact

def no_more_a_problem(self, send_brok=True):
"""
Look for my impacts, and remove me from their problems list.
This method may be called to correctly reinitialize the object state
after the retention data has been loaded. In such a situation, a
brok should not be emitted if the state is modified. The send_brok
variable reflects this.
:param bool send_brok: Should a brok be emitted if the object state
is modified.
"""
was_pb = self.is_problem
if self.is_problem:
self.is_problem = False

# we warn impacts that we are no more a problem
for impact in self.impacts:
impact.deregister_a_problem(self, send_brok)

# we can just drop our impacts list
self.impacts = []
@@ -341,16 +365,26 @@ def no_more_a_problem(self):

# If we were a problem, we say to everyone
# our new status, with good business_impact value
if send_brok is True and was_pb:
# And we register a new brok for update status
b = self.get_update_status_brok()
self.broks.append(b)

def register_a_problem(self, pb, send_brok=True):
"""
Called recursively by potential impacts so they update their
source_problems list. But do not go below if the problem is not a
real one for me, like if I've got multiple parents for example.
This method may be called to correctly reinitialize the object state
after the retention data has been loaded. In such a situation, a
brok should not be emitted if the state is modified. The send_brok
variable reflects this.
:param Item pb: The source problem
:param bool send_brok: Should a brok be emitted if the object state
is modified.
"""
# Maybe we already have this problem? If so, bailout too
if pb in self.source_problems:
return []
@@ -364,44 +398,54 @@ def register_a_problem(self, pb):
impacts = []
# Ok, if we are impacted, we can add it in our
# problem list
# TODO: remove this unused check
if self.is_impact:
# Maybe I was a problem myself, now I can say: not my fault!
if self.is_problem:
self.no_more_a_problem()

# Ok, we are now an impact, we should take the good state
# but only when we just go in impact state
if not was_an_impact:
self.set_impact_state()

# Ok now we can be a simple impact
impacts.append(self)
if pb not in self.source_problems:
self.source_problems.append(pb)
# we should send this problem to all potential impact that
# depend on us
for (impact, status, dep_type, tp, inh_par) in self.act_depend_of_me:
# Check if the status is ok for impact
for s in status:
if self.is_state(s):
# now check if we should bailout because of a
# not good timeperiod for dep
if tp is None or tp.is_time_valid(now):
new_impacts = impact.register_a_problem(pb)
impacts.extend(new_impacts)

if send_brok is True:
# And we register a new brok for update status
b = self.get_update_status_brok()
self.broks.append(b)

# now we return all impacts (can be void of course)
return impacts

def deregister_a_problem(self, pb, send_brok=True):
"""
Just remove the problem from our problems list and check if we are
still 'impacted'. It's not recursive because the problem has the
list of all its impacts.
This method may be called to correctly reinitialize the object state
after the retention data has been loaded. In such a situation, a
brok should not be emitted if the state is modified. The send_brok
variable reflects this.
:param bool send_brok: Should a brok be emitted if the object state
is modified.
"""
self.source_problems.remove(pb)

# For know if we are still an impact, maybe our dependencies
@@ -842,9 +886,33 @@ def get_snapshot(self):
# ok we can put it in our temp action queue
self.actions.append(e)

def reprocess_state(self):
"""
Resets object state after retention has been reloaded
"""
# Processes the downtime depth from the currently active downtimes
self.reprocess_ack_and_downtimes_state()
# Enforces the problem/impact attributes processing if the feature is
# enabled
enable_problem_impact = getattr(
self,
"enable_problem_impacts_states_change",
False
)
reprocess_problem_impact = getattr(
self,
"enable_problem_impacts_states_reprocessing",
False
)
if enable_problem_impact is True and reprocess_problem_impact is True:
self.reprocess_problem_impact_state()

def reprocess_ack_and_downtimes_state(self):
"""
Force the evaluation of scheduled_downtime_depth and in_scheduled_downtime
attributes
"""
self.scheduled_downtime_depth = 0
for dt in self.downtimes:
if dt.in_scheduled_downtime():
@@ -858,6 +926,19 @@ def reset_ack_and_downtimes_state(self):
else:
self.problem_has_been_acknowledged = False

def reprocess_problem_impact_state(self):
"""
Reprocesses the problem/impact related attributes, which are reset
to their default values after the retention data has been reloaded.
"""
no_action = self.is_no_action_dependent()

if not no_action and self.state_id != 0 and self.state_type == "HARD":
self.set_myself_as_problem(False)
# We recheck just for network_dep. Maybe we are just unreachable
# and we need to override the state_id
self.check_and_set_unreachability()

# Whenever a non-ok hard state is reached, we must check whether this
# host/service has a flexible downtime waiting to be activated
def check_for_flexible_downtime(self):
@@ -1045,6 +1126,7 @@ def consume_state_result(self, c):
# We recheck just for network_dep. Maybe we are just unreachable
# and we need to override the state_id
self.check_and_set_unreachability()

# OK following a previous OK. perfect if we were not in SOFT
if c.exit_status == 0 and self.last_state in (OK_UP, 'PENDING'):
# print "Case 1 (OK following a previous OK):
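
For context, a minimal sketch of how the new reprocess_state() entry point might be driven once retention has been restored. This is a hypothetical helper, not part of the diff: the actual call site is expected in the scheduler code, which is not shown in this excerpt.

    from itertools import chain

    def reprocess_after_retention_load(sched):
        # Hypothetical: iterate every host and service once the retention
        # data has been restored into the scheduler.
        for item in chain(sched.hosts, sched.services):
            # reprocess_state() recomputes the downtime/ack state and, when
            # both enable_problem_impacts_states_change and
            # enable_problem_impacts_states_reprocessing are enabled,
            # rebuilds the problem/impact links without emitting broks
            # (set_myself_as_problem(False) / no_more_a_problem(False)).
            item.reprocess_state()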
