Commit

Merge bf96c8c into fc4a0fe

geektophe committed May 10, 2021
2 parents fc4a0fe + bf96c8c commit 795180a
Showing 6 changed files with 265 additions and 183 deletions.
20 changes: 20 additions & 0 deletions doc/source/03_configuration/configmain.rst
@@ -195,6 +195,26 @@ Default:
This option is used to decide whether or not to apply a state change when a host or service is impacted by a root problem (like the service's host going down, or a host's parent being down too). The state will be changed to UNKNOWN for a service and UNREACHABLE for a host until their next scheduled check. This state change does not count as an attempt; it is only for the console, so that users know these objects have problems and their previous states are uncertain.


.. _configuration/configmain#enable_problem_impacts_states_reprocessing:

Enable problem/impacts states reprocessing
-------------------------------------------

Format:

::

enable_problem_impacts_states_reprocessing=<0/1>

Default:

::

enable_problem_impacts_states_reprocessing=0

This option is used to enforce the reprocessing of the problem/impact state after the retention data has been loaded into the scheduler. It's off by default, meaning that the problem/impact related attributes only get modified when a new state change is detected (a new active or passive check result arrives with a different state). When enabled, this feature walks the entire object dependency tree to re-evaluate each object's state after the basic object states have been restored.
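
Since the reprocessing only takes effect when the problem/impacts state change feature is itself enabled (the scheduler checks both flags before walking the dependency tree), the two parameters are typically set together. A minimal example, assuming both are set in the scheduler's main configuration file:

::

    enable_problem_impacts_states_change=1
    enable_problem_impacts_states_reprocessing=1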


.. _configuration/configmain#disable_old_nagios_parameters_whining:

Disable Old Nagios Parameters Whining
22 changes: 17 additions & 5 deletions doc/source/09_architecture/problems-and-impacts.rst
@@ -1,11 +1,11 @@
.. _architecture/problems-and-impacts:

============================================
Problems and impacts correlation management
============================================


What is this correlation?
===========================

The main role of this feature is to allow users to have the same correlation views in the console as they get in the notifications.
@@ -29,10 +29,10 @@ It's important to see that such a state change does not interfere with the HARD/SOFT
Here the gateway is already in DOWN/HARD. We can see that none of the servers have an output: they have not been checked yet, but we have already set the UNREACHABLE state. When they are checked, there will be an output and they will keep this state.


How to enable it?
==================

It's quite easy, all you need to do is enable the parameter

::

@@ -41,7 +41,7 @@ It's quite easy, all you need to do is enable the parameter
See :ref:`enable_problem_impacts_states_change <configuration/configmain#enable_problem_impacts_states_change>` for more information about it.


Dynamic Business Impact
========================

There is a good thing about problems and impacts when you do not set a parent device's Business Impact: your problem will dynamically inherit the maximum business impact of the failed child!
@@ -55,3 +55,15 @@ There are 2 nights:
* the second night, the switch has a problem, but this time it impacts the production environment! This time, the computed impact is set to 5 (the maximum impact of its children, here the production application), so it's higher than the min_criticity of the contact, and the notification is sent. The admin is woken up, and can solve this problem before too many users are impacted :)
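
As a hedged configuration sketch of this scenario (object names are illustrative and only the relevant directives are shown; business_impact on hosts/services and min_business_impact on contacts are assumed to be the parameters behind the min_criticity behaviour described above):

::

    define host{
        host_name            production-app-srv
        business_impact      5    ; top-priority production application
    }

    define contact{
        contact_name         admin
        min_business_impact  4    ; only notify for business impact >= 4
    }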


Enforce problem/impact state calculation
=========================================

The problem/impact calculation that determines whether a failed check is the root cause of a problem or one of its impacts is done when a new check result arrives. When the scheduler restarts or receives a new configuration, it may save and restore the previous object states from the retention data. There are situations where the problem/impact attributes can revert to their default values, even if an object's state is not OK. By default, the attributes are only recalculated when a new check result arrives.

It's possible to enforce the problem/impact reprocessing of all the objects after the retention data has been loaded; note that it only takes effect when the problem/impacts state change feature is itself enabled. To enable this feature, you only have to set the parameter

::

enable_problem_impacts_states_reprocessing=1

See :ref:`enable_problem_impacts_states_reprocessing <configuration/configmain#enable_problem_impacts_states_reprocessing>` for more information about it.
3 changes: 3 additions & 0 deletions shinken/objects/config.py
@@ -567,6 +567,9 @@ class Config(Item):
'enable_problem_impacts_states_change':
BoolProp(default=False, class_inherit=[(Host, None), (Service, None)]),

'enable_problem_impacts_states_reprocessing':
BoolProp(default=False, class_inherit=[(Host, None), (Service, None)]),

# More a running value in fact
'resource_macros_names':
ListProp(default=[]),
194 changes: 138 additions & 56 deletions shinken/objects/schedulingitem.py
@@ -252,14 +252,27 @@ def do_check_freshness(self):
)
return None

def set_myself_as_problem(self, send_brok=True):
"""
Raise all impacts from my error. I'm setting myself as a problem, and
I register myself as such in all hosts/services that depend_on_me,
so they are now my impacts.
This method may be called to correctly reinitialize the object state
after the retention data has been loaded. In such a situation, a
brok should not be emitted if the state is modified. The send_brok
variable reflects this.
:param bool send_brok: Should a brok be emitted if the object state
is modified.
"""
now = time.time()
updated = False

if self.is_problem is False:
self.is_problem = True
updated = True
# we should warn potential impacts of our problem
# and they should be cool to register them so I've got
# my impacts list
@@ -271,21 +284,22 @@ def set_myself_as_problem(self):
# now check if we should bailout because of a
# not good timeperiod for dep
if tp is None or tp.is_time_valid(now):
new_impacts = impact.register_a_problem(self, send_brok)
impacts.extend(new_impacts)

# Only update impacts and create new brok if impacts changed.
s_impacts = set(impacts)
if s_impacts != set(self.impacts):
self.impacts = list(s_impacts)

# We can update our business_impact value now
self.update_business_impact_value()
updated = True

if send_brok is True and updated is True:
# And we register a new brok for update status
b = self.get_update_status_brok()
self.broks.append(b)

# We update our 'business_impact' value with the max of
# the impacts business_impact if we got impacts. And save our 'configuration'
@@ -323,15 +337,25 @@ def update_business_impact_value(self):
if self.my_own_business_impact != -1 and not in_modulation:
self.business_impact = self.my_own_business_impact

def no_more_a_problem(self, send_brok=True):
"""
Look for my impacts, and remove me from their problems list.
This method may be called to correctly reinitialize the object state
after the retention data has been loaded. In such a situation, a
brok should not be emitted if the state is modified. The send_brok
variable reflects this.
:param bool send_brok: Should a brok be emitted if the object state
is modified.
"""
was_pb = self.is_problem
if self.is_problem:
self.is_problem = False

# we warn impacts that we are no more a problem
for impact in self.impacts:
impact.deregister_a_problem(self, send_brok)

# we can just drop our impacts list
self.impacts = []
@@ -341,16 +365,26 @@ def no_more_a_problem(self):

# If we were a problem, we say to everyone
# our new status, with good business_impact value
if send_brok is True and was_pb:
# And we register a new brok for update status
b = self.get_update_status_brok()
self.broks.append(b)

def register_a_problem(self, pb, send_brok=True):
"""
Called recursively by potential impacts so they update their
source_problems list. But do not go below if the problem is not a
real one for me, like if I've got multiple parents for example.
This method may be called to correctly reinitialize the object state
after the retention data has been loaded. In such a situation, a
brok should not be emitted if the state is modified. The send_brok
variable reflects this.
:param Item pb: The source problem
:param bool send_brok: Should a brok be emitted if the object state
is modified.
"""
# Maybe we already have this problem? If so, bailout too
if pb in self.source_problems:
return []
@@ -364,44 +398,54 @@ def register_a_problem(self, pb):
impacts = []
# Ok, if we are impacted, we can add it in our
# problem list
# TODO: remove this unused check
if self.is_impact:
# Maybe I was a problem myself, now I can say: not my fault!
if self.is_problem:
self.no_more_a_problem()

# Ok, we are now an impact, we should take the good state
# but only when we just go in impact state
if not was_an_impact:
self.set_impact_state()

# Ok now we can be a simple impact
impacts.append(self)
if pb not in self.source_problems:
self.source_problems.append(pb)
# we should send this problem to all potential impact that
# depend on us
for (impact, status, dep_type, tp, inh_par) in self.act_depend_of_me:
# Check if the status is ok for impact
for s in status:
if self.is_state(s):
# now check if we should bailout because of a
# not good timeperiod for dep
if tp is None or tp.is_time_valid(now):
new_impacts = impact.register_a_problem(pb)
impacts.extend(new_impacts)

if send_brok is True:
# And we register a new brok for update status
b = self.get_update_status_brok()
self.broks.append(b)

# now we return all impacts (can be void of course)
return impacts

def deregister_a_problem(self, pb, send_brok=True):
"""
Just remove the problem from our problems list and check if we are
still 'impacted'. It's not recursive because the problem has the
list of all its impacts.
This method may be called to correctly reinitialize the object state
after the retention data has been loaded. In such a situation, a
brok should not be emitted if the state is modified. The send_brok
variable reflects this.
:param bool send_brok: Should a brok be emitted if the object state
is modified.
"""
self.source_problems.remove(pb)

# For know if we are still an impact, maybe our dependencies
@@ -842,9 +886,33 @@ def get_snapshot(self):
# ok we can put it in our temp action queue
self.actions.append(e)

def reprocess_state(self):
"""
Resets object state after retention has been reloaded
"""
# Processes the downtime depth from the currently active downtimes
self.reprocess_ack_and_downtimes_state()
# Enforces the problem/impact attributes processing if the feature is
# enabled
enable_problem_impact = getattr(
self,
"enable_problem_impacts_states_change",
False
)
reprocess_problem_impact = getattr(
self,
"enable_problem_impacts_states_reprocessing",
False
)
if enable_problem_impact is True and reprocess_problem_impact is True:
self.reprocess_problem_impact_state()

def reprocess_ack_and_downtimes_state(self):
"""
Force the evaluation of scheduled_downtime_depth and in_scheduled_downtime
attributes
"""
self.scheduled_downtime_depth = 0
for dt in self.downtimes:
if dt.in_scheduled_downtime():
@@ -858,6 +926,19 @@ def reset_ack_and_downtimes_state(self):
else:
self.problem_has_been_acknowledged = False

def reprocess_problem_impact_state(self):
"""
Reprocesses the problem/impact related attributes, which are reset
to their default values after the retention data has been reloaded.
"""
no_action = self.is_no_action_dependent()

if not no_action and self.state_id != 0 and self.state_type == "HARD":
self.set_myself_as_problem(False)
# We recheck just for network_dep. Maybe we are just unreachable
# and we need to override the state_id
self.check_and_set_unreachability()

# Whenever a non-ok hard state is reached, we must check whether this
# host/service has a flexible downtime waiting to be activated
def check_for_flexible_downtime(self):
@@ -1045,6 +1126,7 @@ def consume_state_result(self, c):
# We recheck just for network_dep. Maybe we are just unreachable
# and we need to override the state_id
self.check_and_set_unreachability()

# OK following a previous OK. perfect if we were not in SOFT
if c.exit_status == 0 and self.last_state in (OK_UP, 'PENDING'):
# print "Case 1 (OK following a previous OK):
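
For context, a minimal sketch of how the new reprocess_state() entry point might be driven once retention has been restored. This is a hypothetical helper, not part of the diff: the actual call site is expected in the scheduler code, which is not shown in this excerpt.

    from itertools import chain

    def reprocess_after_retention_load(sched):
        # Hypothetical: iterate every host and service once the retention
        # data has been restored into the scheduler.
        for item in chain(sched.hosts, sched.services):
            # reprocess_state() recomputes the downtime/ack state and, when
            # both enable_problem_impacts_states_change and
            # enable_problem_impacts_states_reprocessing are enabled,
            # rebuilds the problem/impact links without emitting broks
            # (set_myself_as_problem(False) / no_more_a_problem(False)).
            item.reprocess_state()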
