
Fix clearing of ClusterSync Failing metric #1826

Merged: 1 commit merged into openshift:master on Aug 3, 2022

Conversation

@suhanime (Contributor) commented Jul 20, 2022

The ClusterSync Failing metric must be cleared when a clustersync is no longer
failing; however, the code path in the clustersync controller that was meant to
clear the metric when the failing condition changed to success was somehow
never invoked.

  • This commit moves the clearing logic for this case to the metrics
    controller. That controller assumes that if a clustersync is not failing,
    then the corresponding metric should be cleared.
  • Since the code path in the clustersync controller was never being invoked,
    this commit also reverts that code to its previous form and moves the
    clearing logic to the metrics controller, thereby ensuring the metric is
    only set or cleared in either the metrics controller or the
    clusterdeployment controller.

xref: https://issues.redhat.com/browse/HIVE-1857

/assign @2uasimojo
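
To illustrate the approach described above, here is a minimal sketch of how a metrics controller can report and clear a per-clustersync failing gauge; the package layout, metric name, and function below are illustrative assumptions, not the actual Hive code:

// Minimal sketch (illustrative names, not the actual Hive code) of the
// approach described above: report the gauge while a ClusterSync is failing
// and delete the series once it is not.
package metrics

import "github.com/prometheus/client_golang/prometheus"

var metricClusterSyncFailingSeconds = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "hive_clustersync_failing_seconds",
		Help: "Seconds the ClusterSync has been in a failing state.",
	},
	[]string{"namespaced_name"},
)

// observeClusterSync would be called from the metrics controller's periodic loop.
func observeClusterSync(namespacedName string, failing bool, failingSeconds float64) {
	if failing {
		// While failing, report how long the failure has lasted.
		metricClusterSyncFailingSeconds.WithLabelValues(namespacedName).Set(failingSeconds)
		return
	}
	// Not failing: delete the series so the exporter stops publishing a stale value.
	metricClusterSyncFailingSeconds.DeleteLabelValues(namespacedName)
}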

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 20, 2022
@@ -502,6 +502,13 @@ func (r *ReconcileClusterSync) Reconcile(ctx context.Context, request reconcile.
	}
	// Clear the cluster sync failing metric
	if !failing {
		// Set the metric to zero first since clearing the metric would only stop recording it.
suhanime (Contributor, Author) commented:

I'd like to stress that this is not what we usually do; metrics should always be considered in conjunction with their timestamps. But I also do not see any problem with this, and we might want to consider doing the same for other gauges, particularly the failure-reporting gauges across the codebase.

Member commented:

As we discussed in scrum, I think "usually" we're tallying a discrete event in the past (e.g. "a provision failed") whereas here we're tracking in realtime how long the clustersync has been in the failing state.

codecov bot commented Jul 20, 2022

Codecov Report

Merging #1826 (cedf447) into master (71e4b50) will decrease coverage by 0.00%.
The diff coverage is 61.53%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1826      +/-   ##
==========================================
- Coverage   41.59%   41.58%   -0.01%     
==========================================
  Files         362      362              
  Lines       34120    34118       -2     
==========================================
- Hits        14192    14188       -4     
- Misses      18725    18727       +2     
  Partials     1203     1203              
Impacted Files                                           Coverage Δ
pkg/controller/metrics/metrics.go                        33.51% <50.00%> (+0.36%) ⬆️
.../clusterdeployment/clusterdeployment_controller.go    61.44% <66.66%> (ø)
...g/controller/clustersync/clustersync_controller.go    75.77% <100.00%> (-0.07%) ⬇️

@@ -502,6 +502,13 @@ func (r *ReconcileClusterSync) Reconcile(ctx context.Context, request reconcile.
	}
	// Clear the cluster sync failing metric
	if !failing {
		// Set the metric to zero first since clearing the metric would only stop recording it.
		if cond != nil && hivemetrics.ShouldLogGaugeDurationMetric(hivemetrics.MetricClusterSyncFailingSeconds,
			time.Since(cond.LastTransitionTime.Time).Seconds()) {
			hivemetrics.MetricClusterSyncFailingSeconds.WithLabelValues(
				hivemetrics.GetNameSpacedName(clusterSync.Namespace, clusterSync.Name)).Set(0)
Member commented:

I think we also need to zero the metric in this code path by adding an else branch.

Having looked at that, I'm actually thinking we may want to get rid of handling the metric in this controller at all -- remove this whole block. Thoughts?

suhanime (Contributor, Author) commented:

ShouldLogXDurationMetric requires seconds as a parameter, i.e. the seconds elapsed since the LastTransitionTime. While the clusterSync is still failing, we can make that decision from the actual status; but when we need to clear the metric and set it to zero, we need the LastTransitionTime of the condition before it changes, and that is only available in the clusterSync controller. Adding an else branch in which the metrics controller computes the metric and logs it as zero, regardless of whether we ever logged the metric for that clusterSync, needlessly complicates the logic and ends up with more entries than necessary.
Clearing the metric is a different case, because it clears the metric whether or not it was ever logged.
Besides, the test for the WaitingForCO metric shows how the metrics controller eventually does stop logging the metric, so I do not see this as a deal breaker.
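
For context, the gist of the duration gate being discussed can be sketched roughly as follows; this is a hypothetical illustration of the pattern (the threshold parameter and function name are assumed), not Hive's actual ShouldLogGaugeDurationMetric:

// Hypothetical sketch of the duration-gate pattern under discussion: the
// caller must supply the seconds elapsed since the condition's
// LastTransitionTime, which is why the decision depends on having that
// timestamp available before the condition is overwritten.
func shouldLogGaugeDuration(thresholdSeconds, elapsedSeconds float64) bool {
	// Report the gauge only once the condition has persisted for at least
	// the configured threshold (a zero threshold reports immediately).
	return elapsedSeconds >= thresholdSeconds
}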

Member commented:

Oh. This is a good point. If sync succeeds, setFailedCondition (L487) will set the ClusterSyncFailed condition to False and update the LastTransitionTime to Now(). So ShouldLogGaugeDurationMetric won't trigger (unless configured to zero, which shouldn't be the case), and this branch won't be reached.

Point is, we're going to have to solve this problem regardless.

Since we're deleting it immediately, wouldn't it be okay to set it to zero even if it doesn't exist beforehand? It would take rather extreme timing for it to get scraped in that interval -- and I'm willing to accept that it would be zero in that case.

But wait again. Isn't setting to zero a no-op anyway if we're then going to delete right away? The scraper won't see that zero; it'll just see that the metric isn't there, and use the previously-reported value -- which is the whole problem we're trying to solve with this PR.

Let's sleep on this and discuss tomorrow.

@2uasimojo (Member) commented:

Summary of discussions (hive thread, monitoring thread):

  • Deleting the metric ought to make it stop showing up on the prom side
  • Need to ask QE to adjust test case to validate ⬆️
  • We don't need to do this zeroing exercise at all
  • We still may be able to get rid of the code path as mentioned here
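
As a sanity check on the first bullet above, a small test along these lines (a hedged sketch using client_golang's testutil, not Hive's actual test suite) shows that deleting a label set removes the series from the collector's output entirely, rather than leaving a stale or zero value behind:

// Sketch: after DeleteLabelValues, the series no longer appears in the
// collector's output at all, which is the behavior the "delete the metric"
// decision relies on. Metric and label names are illustrative.
package metrics

import (
	"testing"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/testutil"
)

func TestClearFailingGauge(t *testing.T) {
	g := prometheus.NewGaugeVec(
		prometheus.GaugeOpts{Name: "hive_clustersync_failing_seconds", Help: "test"},
		[]string{"namespaced_name"},
	)
	g.WithLabelValues("hive/my-cluster").Set(42)
	if n := testutil.CollectAndCount(g); n != 1 {
		t.Fatalf("expected 1 series, got %d", n)
	}
	g.DeleteLabelValues("hive/my-cluster")
	if n := testutil.CollectAndCount(g); n != 0 {
		t.Fatalf("expected 0 series after delete, got %d", n)
	}
}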

@suhanime changed the title from "Bring CS Failing metric to zero if success" to "Fix clearing of ClusterSync Failing metric" on Aug 2, 2022
@suhanime (Contributor, Author) commented Aug 2, 2022

@2uasimojo Fixed as per discussion.
Note: There was a bug in the code path in the clusterSync controller, hence I moved the clearing logic to the metrics controller. The execution is now much cleaner and works as expected.

openshift-ci bot commented Aug 2, 2022

@suhanime: all tests passed!


@2uasimojo (Member) left a review:

Nice. Thanks for sticking with this through all the confusion.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 3, 2022
openshift-ci bot commented Aug 3, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: 2uasimojo, suhanime


@openshift-ci openshift-ci bot merged commit f6b7a1c into openshift:master Aug 3, 2022