
Fix clearing of ClusterSync Failing metric #1826

Merged: 1 commit merged into openshift:master on Aug 3, 2022

Conversation

@suhanime (Contributor) commented Jul 20, 2022

The ClusterSync Failing metric must be cleared when a clustersync is no longer
failing; however, the code path in the clustersync controller that was meant to
clear the metric when the failing condition changed to success was somehow
never invoked.

  • This commit moves the clearing logic for this case to the metrics
    controller. That controller assumes that if a clustersync is not failing,
    then the corresponding metric should be cleared.
  • Since the code path in the clustersync controller was never being invoked,
    this commit also reverts that code to its previous form and moves the
    clearing logic to the metrics controller, thereby ensuring the metric is
    only set or cleared in either the metrics controller or the
    clusterdeployment controller.

xref: https://issues.redhat.com/browse/HIVE-1857

/assign @2uasimojo
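
To illustrate the approach described above, here is a minimal sketch of how a metrics controller can report and clear a per-clustersync failing gauge; the package layout, metric name, and function below are illustrative assumptions, not the actual Hive code:

// Minimal sketch (illustrative names, not the actual Hive code) of the
// approach described above: report the gauge while a ClusterSync is failing
// and delete the series once it is not.
package metrics

import "github.com/prometheus/client_golang/prometheus"

var metricClusterSyncFailingSeconds = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "hive_clustersync_failing_seconds",
		Help: "Seconds the ClusterSync has been in a failing state.",
	},
	[]string{"namespaced_name"},
)

// observeClusterSync would be called from the metrics controller's periodic loop.
func observeClusterSync(namespacedName string, failing bool, failingSeconds float64) {
	if failing {
		// While failing, report how long the failure has lasted.
		metricClusterSyncFailingSeconds.WithLabelValues(namespacedName).Set(failingSeconds)
		return
	}
	// Not failing: delete the series so the exporter stops publishing a stale value.
	metricClusterSyncFailingSeconds.DeleteLabelValues(namespacedName)
}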

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 20, 2022
@@ -502,6 +502,13 @@ func (r *ReconcileClusterSync) Reconcile(ctx context.Context, request reconcile.
	}
	// Clear the cluster sync failing metric
	if !failing {
		// Set the metric to zero first since clearing the metric would only stop recording it.
suhanime (Contributor, Author) commented:

I'd like to stress that this is not what we usually do; metrics should always be considered in conjunction with their timestamps. But I also do not see any problem with this, and we might want to consider doing the same for other gauges, particularly the failure-reporting gauges across the codebase.

Member commented:

As we discussed in scrum, I think "usually" we're tallying a discrete event in the past (e.g. "a provision failed") whereas here we're tracking in realtime how long the clustersync has been in the failing state.

codecov bot commented Jul 20, 2022

Codecov Report

Merging #1826 (cedf447) into master (71e4b50) will decrease coverage by 0.00%.
The diff coverage is 61.53%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1826      +/-   ##
==========================================
- Coverage   41.59%   41.58%   -0.01%     
==========================================
  Files         362      362              
  Lines       34120    34118       -2     
==========================================
- Hits        14192    14188       -4     
- Misses      18725    18727       +2     
  Partials     1203     1203              
Impacted Files                                           Coverage Δ
pkg/controller/metrics/metrics.go                        33.51% <50.00%> (+0.36%) ⬆️
.../clusterdeployment/clusterdeployment_controller.go    61.44% <66.66%> (ø)
...g/controller/clustersync/clustersync_controller.go    75.77% <100.00%> (-0.07%) ⬇️

@@ -502,6 +502,13 @@ func (r *ReconcileClusterSync) Reconcile(ctx context.Context, request reconcile.
	}
	// Clear the cluster sync failing metric
	if !failing {
		// Set the metric to zero first since clearing the metric would only stop recording it.
		if cond != nil && hivemetrics.ShouldLogGaugeDurationMetric(hivemetrics.MetricClusterSyncFailingSeconds,
			time.Since(cond.LastTransitionTime.Time).Seconds()) {
			hivemetrics.MetricClusterSyncFailingSeconds.WithLabelValues(
				hivemetrics.GetNameSpacedName(clusterSync.Namespace, clusterSync.Name)).Set(0)
Member commented:

I think we also need to zero the metric in this code path by adding an else branch.

Having looked at that, I'm actually thinking we may want to get rid of handling the metric in this controller at all -- remove this whole block. Thoughts?

suhanime (Contributor, Author) commented:

ShouldLogXDurationMetric requires seconds as a parameter, i.e. the seconds elapsed since the LastTransitionTime. While the clusterSync is still failing, we can make that decision from the actual status; but when we need to clear the metric and set it to zero, we need the LastTransitionTime of the condition before it changes, and that is only available in the clusterSync controller. Adding an else branch in which the metrics controller computes the metric and logs it as zero, regardless of whether we ever logged the metric for that clusterSync, needlessly complicates the logic and ends up with more entries than necessary.
Clearing the metric is a different case, because it clears the metric whether or not it was ever logged.
Besides, the test for the WaitingForCO metric shows how the metrics controller eventually does stop logging the metric, so I do not see this as a deal breaker.
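
For context, the gist of the duration gate being discussed can be sketched roughly as follows; this is a hypothetical illustration of the pattern (the threshold parameter and function name are assumed), not Hive's actual ShouldLogGaugeDurationMetric:

// Hypothetical sketch of the duration-gate pattern under discussion: the
// caller must supply the seconds elapsed since the condition's
// LastTransitionTime, which is why the decision depends on having that
// timestamp available before the condition is overwritten.
func shouldLogGaugeDuration(thresholdSeconds, elapsedSeconds float64) bool {
	// Report the gauge only once the condition has persisted for at least
	// the configured threshold (a zero threshold reports immediately).
	return elapsedSeconds >= thresholdSeconds
}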

Member commented:

Oh. This is a good point. If sync succeeds, setFailedCondition (L487) will set the ClusterSyncFailed condition to False and update the LastTransitionTime to Now(). So ShouldLogGaugeDurationMetric won't trigger (unless configured to zero, which shouldn't be the case), and this branch won't be reached.

Point is, we're going to have to solve this problem regardless.

Since we're deleting it immediately, wouldn't it be okay to set it to zero even if it doesn't exist beforehand? It would take rather extreme timing for it to get scraped in that interval -- and I'm willing to accept that it would be zero in that case.

But wait again. Isn't setting to zero a no-op anyway if we're then going to delete right away? The scraper won't see that zero; it'll just see that the metric isn't there, and use the previously-reported value -- which is the whole problem we're trying to solve with this PR.

Let's sleep on this and discuss tomorrow.

@2uasimojo (Member) commented:

Summary of discussions (hive thread, monitoring thread):

  • Deleting the metric ought to make it stop showing up on the prom side
  • Need to ask QE to adjust test case to validate ⬆️
  • We don't need to do this zeroing exercise at all
  • We still may be able to get rid of the code path as mentioned here
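
As a sanity check on the first bullet above, a small test along these lines (a hedged sketch using client_golang's testutil, not Hive's actual test suite) shows that deleting a label set removes the series from the collector's output entirely, rather than leaving a stale or zero value behind:

// Sketch: after DeleteLabelValues, the series no longer appears in the
// collector's output at all, which is the behavior the "delete the metric"
// decision relies on. Metric and label names are illustrative.
package metrics

import (
	"testing"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/testutil"
)

func TestClearFailingGauge(t *testing.T) {
	g := prometheus.NewGaugeVec(
		prometheus.GaugeOpts{Name: "hive_clustersync_failing_seconds", Help: "test"},
		[]string{"namespaced_name"},
	)
	g.WithLabelValues("hive/my-cluster").Set(42)
	if n := testutil.CollectAndCount(g); n != 1 {
		t.Fatalf("expected 1 series, got %d", n)
	}
	g.DeleteLabelValues("hive/my-cluster")
	if n := testutil.CollectAndCount(g); n != 0 {
		t.Fatalf("expected 0 series after delete, got %d", n)
	}
}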

@suhanime changed the title from "Bring CS Failing metric to zero if success" to "Fix clearing of ClusterSync Failing metric" on Aug 2, 2022
@suhanime (Contributor, Author) commented Aug 2, 2022

@2uasimojo Fixed as per discussion.
Note: There was a bug in the code path in the clusterSync controller, hence I moved the clearing logic to the metrics controller. The execution is now much cleaner and works as expected.

openshift-ci bot commented Aug 2, 2022

@suhanime: all tests passed!


@2uasimojo (Member) left a review:

Nice. Thanks for sticking with this through all the confusion.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 3, 2022
openshift-ci bot commented Aug 3, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: 2uasimojo, suhanime


@openshift-ci openshift-ci bot merged commit f6b7a1c into openshift:master Aug 3, 2022