Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

INTLY-8120 Strip down rhmi_status metric to only have the overall stage #857

Merged

Conversation

david-martin
Copy link
Member

@david-martin david-martin commented Jun 4, 2020

Description

https://issues.redhat.com/browse/INTLY-8120

Removes all labels from the rhmi_status metric so the number of series is kept to a minimum (more info in jira & linked document).
These changes will ensure a limit of 8 series (1 for each status).
A metric series will be exposed at all times for the current stage, with a value of 1.
For example:

# HELP rhmi_status RHMI status of an installation
# TYPE rhmi_status gauge
rhmi_status{stage="complete"} 1

An expression to determine if the RHMI install is complete would be:

rhmi_status{stage="complete"} == 1

If you want to get the currently active stage:

rhmi_status == 1

and use the value of the stage label

This approach seems more inline with how metrics are used to expose non numerical data, but comes with caveats (hence the need to limit the possible values of labels).

More info on metric cardinality in https://www.robustperception.io/cardinality-is-key

To verify this change with the operator running locally, run:

while true; do curl localhost:8383/metrics|grep status & sleep 1; done

and verify the output looks similar to below, with the current stage having a value of 1 during the installation

# HELP rhmi_status RHMI status of an installation
# TYPE rhmi_status gauge
rhmi_status{stage="bootstrap"} 1

The stage should change as the CR status.stage field changes

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist

  • This change requires a documentation update
  • I have added tests that prove my fix is effective or that my feature works
  • I have added a test case that will be used to verify my changes
  • Verified independently on a cluster by reviewer

@coveralls
Copy link

coveralls commented Jun 4, 2020

Coverage Status

Coverage increased (+0.2%) to 59.005% when pulling cbb7182 on INTLY-8120-rhmi_status_metrics_cardinality into 3ba5cef on master.

@david-martin
Copy link
Member Author

@jjaferson I'd appreciate your review here as you're familiar with the metrics being modified.

@david-martin
Copy link
Member Author

/hold pending further updates

@david-martin
Copy link
Member Author

/unhold

@david-martin david-martin force-pushed the INTLY-8120-rhmi_status_metrics_cardinality branch from 6703365 to f4c622a Compare June 9, 2020 09:26
productName = "threescale"
RHMIStatus.Reset()
if string(installation.Status.Stage) != "" {
RHMIStatus.With(prometheus.Labels{"stage": string(installation.Status.Stage)}).Set(float64(1))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you have empty stage, you probably still want to report the metric, right? Is there meaning to "installed and this code is running but no stage is set" such that not reporting the value makes sense?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might suggest avoiding reset and just setting the value always to "stage": string(installation.Status.Stage). That feels more natural use of the "status" label. If stage == "" has special meaning, and you want the value of the metric to be zero, then it makes the code more explicit with an else block (instead of reset).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@maleck13 or @philbrookes What are your thoughts on this?
What does an empty stage mean?

I might suggest avoiding reset and just setting the value always to "stage": string(installation.Status.Stage)

@smarterclayton I might be using the prom library incorrectly, but if the metric isn't reset it results in the number of series going up by 1 each time the stage progresses. It's accumulative.

e.g.
scrape after startup

rhmi_status{stage=""} 1

scrape after <1m running

rhmi_status{stage=""} 1
rhmi_status{stage="Preflight Checks"} 1

scrape after >2m running

rhmi_status{stage=""} 1
rhmi_status{stage="Preflight Checks"} 1
rhmi_status{stage="cloud-resources"} 1

and so on until complete & there are 8 series.

The series seem to be held in memory.
So if the operator is killed/restarted, the metric is emptied of all series and starts building them up from whatever the current stage is.
This seems like odd/undesired behaviour vs. only have 1 series exposed at all times with the current stage, or have 0's for series unless it's the current stage.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stage being empty is what would happen before bootstrap has started. This stage shouldn't last long, but could be prolonged by failing preflight checks. I'd be reasonably happy with the statement "" == "preflight"in this sense.

@Boomatang you initially implemented the stage field in the CR, do you agree?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The stage field in the CR is set a the start of each reconcile stage if that stage is in progress. When starting there is only a very short time that the stage has no value till the preflight checks start. If for some reason the preflight checks fail the CR stage value will stay as "Prefight Checks". I don't see any way once the CR stage is set that it can go back to been a blank value.

The one caveat to be aware of is the stage value is set in the CR if the reconciler reports a stage is in progress. For example RHMI is full installed and the CR stage is marked as "completed". During the reconcile loop the "cloud-resources" actions start. Firstly the stage checks if the reconciler reports a "in progress" status for the "cloud-resources" and updates the CR if required. If during the rest of the "cloud-resources" actions the status gets set to "in progress" the CR stage will not be updated till the next pass of the reconcile loop. This is also true when doing an install. The CR stage may report the previous stage while starting to install the next stage but will be updated with the correct values on the next pass of the reconcile loop.

In short I agree with @philbrookes the length of time the CR stage value can be empty is so short there is no need to do anything with this. And sense this is a metric been reported my guess is we would have gotten pass the "preflight" stage I don't see how the CR stage value can ever be set back to a blank value.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, yeah, because you're changing values. Yeah, reset is fine (we generally don't do labels like this so your approach is fine). I would recommend always writing one series, regardless of whether stage is empty. Whether you return 0 for empty stage and 1 for set stage, or 1 for all, I don't care too much.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Thanks for feedback all

@jjaferson
Copy link
Contributor

/lgtm

@philbrookes
Copy link
Member

/approve

@david-martin
Copy link
Member Author

/retest

1 similar comment
@david-martin
Copy link
Member Author

/retest

@david-martin david-martin force-pushed the INTLY-8120-rhmi_status_metrics_cardinality branch from f4c622a to fcaf2d4 Compare June 11, 2020 08:26
@david-martin
Copy link
Member Author

@jjaferson could i get another lgtm? rebased

@philbrookes
Copy link
Member

/lgtm

@david-martin
Copy link
Member Author

/retest

3 similar comments
@david-martin
Copy link
Member Author

/retest

@david-martin
Copy link
Member Author

/retest

@david-martin
Copy link
Member Author

/retest

@david-martin
Copy link
Member Author

@david-martin
Copy link
Member Author

/retest

@david-martin
Copy link
Member Author

last failure is potentially a new test flake

        --- FAIL: TestIntegreatly/integreatly/Test_RHMI_installation_CR_metric (4.47s)
            rhmicr_metrics.go:46: rhmi_status metric stage does not match current stage: solution-explorer != complete

I believe this failure is possible if the latest scrape of metrics in prometheus has an out of date status compared to the current status in the RHMI CR.
Will investigate a way to avoid this.

@david-martin
Copy link
Member Author

/retest

@david-martin
Copy link
Member Author

I believe this failure is possible if the latest scrape of metrics in prometheus has an out of date status compared to the current status in the RHMI CR.

@jjaferson Do you have any ideas here why the metric from the most recent scrape might not match the current value in the RHMI CR?
I thought it might be because of a time lag, but we're calling the metrics endpoint directly from inside the operator pod after we've read the RHMI CR.
However, in this failure the CR status was complete but the metric was solution-explorer

@david-martin david-martin force-pushed the INTLY-8120-rhmi_status_metrics_cardinality branch from fcaf2d4 to 18f4d63 Compare June 16, 2020 13:09
This prevents a bug where the stage gets set to 'solution-explorer',
then quickly set back to 'complete' after.
@david-martin
Copy link
Member Author

@philbrookes can you give this an lgtm again.
I've applied the discussed changes and they solved the failing test with the incorrect stage.

@philbrookes
Copy link
Member

/lgtm
/approve

@openshift-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: philbrookes

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit e8ec6e3 into master Jun 17, 2020
@david-martin david-martin deleted the INTLY-8120-rhmi_status_metrics_cardinality branch June 17, 2020 08:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants