Fix clearing of ClusterSync Failing metric #1826
Conversation
@@ -502,6 +502,13 @@ func (r *ReconcileClusterSync) Reconcile(ctx context.Context, request reconcile.
	}
	// Clear the cluster sync failing metric
	if !failing {
		// Set the metric to zero first since clearing the metric would only stop recording it.
I'd like to stress that this is not what we usually do; metrics should always be considered in conjunction with their timestamps. But I also do not see any problem with this - and we might want to consider doing so for other gauges, particularly failure-reporting gauges across the codebase.
As we discussed in scrum, I think "usually" we're tallying a discrete event in the past (e.g. "a provision failed") whereas here we're tracking in realtime how long the clustersync has been in the failing state.
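A minimal sketch of that distinction, using the Prometheus client_golang API; the metric names and helper functions below are made up for illustration and are not Hive's actual metrics:

package metricsexample

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	// Discrete past events: a counter is incremented once per failure and never goes back down.
	provisionFailuresTotal = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "example_provision_failures_total",
		Help: "Number of failed provisions (tallied as discrete events).",
	})

	// Realtime state: a gauge tracking how long a clustersync has been failing,
	// which must be updated while failing and cleared once it recovers.
	clusterSyncFailingSeconds = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "example_clustersync_failing_seconds",
		Help: "Seconds the clustersync has been in the failing state.",
	}, []string{"namespaced_name"})
)

func onProvisionFailed() {
	provisionFailuresTotal.Inc()
}

func onClusterSyncReconciled(key string, failing bool, failingSince time.Time) {
	if failing {
		clusterSyncFailingSeconds.WithLabelValues(key).Set(time.Since(failingSince).Seconds())
		return
	}
	// DeleteLabelValues only stops exporting the series; it does not publish a zero,
	// which is the subtlety this PR is wrestling with.
	clusterSyncFailingSeconds.DeleteLabelValues(key)
}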
Codecov Report
@@ Coverage Diff @@
## master #1826 +/- ##
==========================================
- Coverage 41.59% 41.58% -0.01%
==========================================
Files 362 362
Lines 34120 34118 -2
==========================================
- Hits 14192 14188 -4
- Misses 18725 18727 +2
Partials 1203 1203
@@ -502,6 +502,13 @@ func (r *ReconcileClusterSync) Reconcile(ctx context.Context, request reconcile.
	}
	// Clear the cluster sync failing metric
	if !failing {
		// Set the metric to zero first since clearing the metric would only stop recording it.
		if cond != nil && hivemetrics.ShouldLogGaugeDurationMetric(hivemetrics.MetricClusterSyncFailingSeconds,
			time.Since(cond.LastTransitionTime.Time).Seconds()) {
			hivemetrics.MetricClusterSyncFailingSeconds.WithLabelValues(
				hivemetrics.GetNameSpacedName(clusterSync.Namespace, clusterSync.Name)).Set(0)
I think we also need to zero the metric in this code path by adding an else branch.

Having looked at that, I'm actually thinking we may want to get rid of handling the metric in this controller at all -- remove this whole block. Thoughts?
ShouldLogXDurationMetric requires seconds as a parameter - seconds since the lastTransitionTime. While the clusterSync is still failing, we can make that decision from the actual status, but when we need to clear the metric and set it to zero, that's when we need the lastTransitionTime of the condition before we change it - and that is only available in the clusterSync controller. Including an else branch where the metrics controller computes the metric and logs it as zero, regardless of whether we logged the metric for the CS or not, needlessly complicates the logic and ends up with more entries than necessary.

Clearing the metric is a different case, because it clears the metric whether or not it was ever logged.

Besides, the test for the WaitingForCO metric shows how the metrics controller eventually does stop logging the metric, so I do not see this as a deal breaker.
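For context, here is a simplified stand-in for the duration gate being described; the name, signature, and threshold semantics are assumptions for illustration, not the real hivemetrics helper:

package metricsexample

import "time"

// shouldLogDurationMetric mimics the idea described above: only publish a
// duration gauge once the condition has been in its current state for at
// least a configured minimum duration. (Hypothetical helper, not Hive's API.)
func shouldLogDurationMetric(minDuration time.Duration, observedSeconds float64) bool {
	return minDuration > 0 && observedSeconds >= minDuration.Seconds()
}

// While the ClusterSync is failing, the caller would do something like:
//
//	secs := time.Since(cond.LastTransitionTime.Time).Seconds()
//	if shouldLogDurationMetric(threshold, secs) {
//		failingSeconds.WithLabelValues(key).Set(secs)
//	}
//
// Once the condition flips to success, LastTransitionTime is reset, so this
// gate can no longer be evaluated against the old failing window -- which is
// the complication discussed in this thread.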
Oh. This is a good point. If sync succeeds, setFailedCondition (L487) will set the ClusterSyncFailed condition to False and update the LastTransitionTime to Now(). So ShouldLogGaugeDurationMetric won't trigger (unless configured to zero, which shouldn't be the case), and this branch won't be reached.
Point is, we're going to have to solve this problem regardless.
Since we're deleting it immediately, wouldn't it be okay to set it to zero even if it doesn't exist beforehand? It would take rather extreme timing for it to get scraped in that interval -- and I'm willing to accept that it would be zero in that case.
But wait again. Isn't setting to zero a no-op anyway if we're then going to delete right away? The scraper won't see that zero; it'll just see that the metric isn't there, and use the previously-reported value -- which is the whole problem we're trying to solve with this PR.
Let's sleep on this and discuss tomorrow.
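To make the scraper point above concrete, here is a small self-contained check (assuming client_golang; the metric name is invented) showing that a Set(0) immediately followed by a delete is invisible to a later scrape:

package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	reg := prometheus.NewRegistry()
	failingSeconds := prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "example_clustersync_failing_seconds", // invented name for the demo
		Help: "Seconds the clustersync has been failing.",
	}, []string{"namespaced_name"})
	reg.MustRegister(failingSeconds)

	failingSeconds.WithLabelValues("ns/cs").Set(0) // write the terminal zero...
	failingSeconds.DeleteLabelValues("ns/cs")      // ...then immediately delete the series

	families, _ := reg.Gather() // what a scrape arriving now would see
	fmt.Println("metric families exposed:", len(families))
	// Prints 0: the zero vanished along with the series, so a scraper that misses
	// the brief window between Set and Delete never observes it.
}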
Summary of discussions (hive thread, monitoring thread):
ClusterSync Failing metric must be cleared when the clustersync is no longer failing; however, the code path that invoked clearing of that metric in the clustersync controller when the failing condition was changed to success was somehow never invoked.
- This commit moves the clearing logic under this criterion to the metrics controller. The controller assumes that if any clustersync is not failing then the corresponding metric should be cleared.
- Since the code path in the clustersync controller wasn't being invoked, this commit also reverts to the previous code and moves the clearing logic to the metrics controller, thereby ensuring we are only setting or clearing the metric in either the metrics controller or the clusterdeployment controller.
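A rough sketch of that resolution, assuming a periodic metrics pass over all ClusterSyncs; the types and names below (clusterSyncInfo, reconcileFailingMetric, the metric name) are illustrative stand-ins rather than Hive's actual metrics controller:

package metricsexample

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// clusterSyncInfo is a stand-in for the state the metrics controller reads
// from each ClusterSync (derived from its failing condition).
type clusterSyncInfo struct {
	Namespace, Name string
	Failing         bool
	FailingSince    time.Time
}

var failingSecondsGauge = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "example_hive_clustersync_failing_seconds", // assumed name
	Help: "Seconds the ClusterSync has been in a failing state.",
}, []string{"namespaced_name"})

// reconcileFailingMetric implements the rule described above: set the gauge
// for every failing ClusterSync, and delete the series for every ClusterSync
// that is not failing, so the clustersync controller never has to clear it.
func reconcileFailingMetric(syncs []clusterSyncInfo) {
	for _, cs := range syncs {
		key := cs.Namespace + "/" + cs.Name
		if cs.Failing {
			failingSecondsGauge.WithLabelValues(key).Set(time.Since(cs.FailingSince).Seconds())
			continue
		}
		failingSecondsGauge.DeleteLabelValues(key)
	}
}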
@2uasimojo Fixed as per discussion.
@suhanime: all tests passed!
Nice. Thanks for sticking with this through all the confusion.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: 2uasimojo, suhanime.
ClusterSync Failing metric must be cleared when the clustersync is no longer
failing; however, the code path that invoked clearing of that metric in the
clustersync controller when the failing condition was changed to success
was somehow never invoked.

- This commit moves the clearing logic under this criterion to the
  metrics controller. The controller assumes that if any clustersync is
  not failing then the corresponding metric should be cleared.
- Since the code path in the clustersync controller wasn't being invoked,
  this commit also reverts to the previous code, and moves the clearing
  logic to the metrics controller, thereby ensuring we are only setting or
  clearing the metric in either the metrics controller or the
  clusterdeployment controller.
xref: https://issues.redhat.com/browse/HIVE-1857
/assign @2uasimojo