INTLY-8120 Strip down rhmi_status metric to only have the overall stage #857
@jjaferson I'd appreciate your review here as you're familiar with the metrics being modified.
/hold pending further updates
81c9f54 to 6703365
/unhold
6703365 to f4c622a
productName = "threescale"
RHMIStatus.Reset()
if string(installation.Status.Stage) != "" {
	RHMIStatus.With(prometheus.Labels{"stage": string(installation.Status.Stage)}).Set(float64(1))
If you have empty stage, you probably still want to report the metric, right? Is there meaning to "installed and this code is running but no stage is set" such that not reporting the value makes sense?
I might suggest avoiding reset and just setting the value always to "stage": string(installation.Status.Stage). That feels like a more natural use of the "status" label. If stage == "" has special meaning, and you want the value of the metric to be zero, then an else block (instead of reset) makes the code more explicit.
@maleck13 or @philbrookes What are your thoughts on this?
What does an empty stage mean?
> I might suggest avoiding reset and just setting the value always to "stage": string(installation.Status.Stage)
@smarterclayton I might be using the prom library incorrectly, but if the metric isn't reset, the number of series goes up by 1 each time the stage progresses. It's cumulative.
e.g.
scrape after startup
rhmi_status{stage=""} 1
scrape after <1m running
rhmi_status{stage=""} 1
rhmi_status{stage="Preflight Checks"} 1
scrape after >2m running
rhmi_status{stage=""} 1
rhmi_status{stage="Preflight Checks"} 1
rhmi_status{stage="cloud-resources"} 1
and so on until complete & there are 8 series.
The series seem to be held in memory.
So if the operator is killed/restarted, the metric is emptied of all series and starts building them up from whatever the current stage is.
This seems like odd/undesired behaviour vs. only having 1 series exposed at all times with the current stage, or having 0s for every series except the current stage.
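The accumulation described above can be reproduced with a small stand-alone sketch. The gaugeVec type below is a hypothetical, simplified stand-in for a labelled gauge (it is not the Prometheus client library); it only assumes that a series, once set, stays exposed until it is reset.

```go
package main

import (
	"fmt"
	"sort"
)

// gaugeVec is a hypothetical, simplified stand-in for a labelled gauge.
// It keeps one value per "stage" label until Reset is called.
type gaugeVec struct {
	series map[string]float64
}

func newGaugeVec() *gaugeVec {
	return &gaugeVec{series: map[string]float64{}}
}

// Set records a value for the given stage label, creating the series if needed.
func (g *gaugeVec) Set(stage string, v float64) { g.series[stage] = v }

// Reset drops every series.
func (g *gaugeVec) Reset() { g.series = map[string]float64{} }

// Scrape renders the current series as text lines, sorted by stage label.
func (g *gaugeVec) Scrape() []string {
	stages := make([]string, 0, len(g.series))
	for s := range g.series {
		stages = append(stages, s)
	}
	sort.Strings(stages)
	out := make([]string, 0, len(stages))
	for _, s := range stages {
		out = append(out, fmt.Sprintf("rhmi_status{stage=%q} %g", s, g.series[s]))
	}
	return out
}

// report publishes the current stage, optionally clearing old series first.
func report(g *gaugeVec, stage string, reset bool) {
	if reset {
		g.Reset()
	}
	g.Set(stage, 1)
}

func main() {
	stages := []string{"", "Preflight Checks", "cloud-resources"}

	withoutReset := newGaugeVec()
	withReset := newGaugeVec()
	for _, s := range stages {
		report(withoutReset, s, false)
		report(withReset, s, true)
	}
	fmt.Println("without reset:", len(withoutReset.Scrape()), "series")
	fmt.Println("with reset:", len(withReset.Scrape()), "series")
}
```

Without the reset, every stage seen so far remains exposed; with it, only the series for the current stage is left.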
stage being empty is what would happen before bootstrap has started. This stage shouldn't last long, but could be prolonged by failing preflight checks. I'd be reasonably happy with the statement "" == "preflight" in this sense.
@Boomatang you initially implemented the stage field in the CR, do you agree?
The stage field in the CR is set at the start of each reconcile stage if that stage is in progress. When starting, there is only a very short time during which the stage has no value, until the preflight checks start. If for some reason the preflight checks fail, the CR stage value will stay as "Preflight Checks". I don't see any way, once the CR stage is set, that it can go back to being a blank value.
The one caveat to be aware of is that the stage value is set in the CR if the reconciler reports a stage is in progress. For example, RHMI is fully installed and the CR stage is marked as "completed". During the reconcile loop the "cloud-resources" actions start. First, the stage checks whether the reconciler reports an "in progress" status for "cloud-resources" and updates the CR if required. If the status gets set to "in progress" during the rest of the "cloud-resources" actions, the CR stage will not be updated until the next pass of the reconcile loop. This is also true when doing an install: the CR stage may report the previous stage while the next stage starts installing, but it will be updated with the correct value on the next pass of the reconcile loop.
In short, I agree with @philbrookes: the length of time the CR stage value can be empty is so short that there is no need to do anything about it. And since this is a metric being reported, my guess is we would have gotten past the "preflight" stage. I don't see how the CR stage value can ever be set back to a blank value.
Oh, yeah, because you're changing values. Yeah, reset is fine (we generally don't do labels like this so your approach is fine). I would recommend always writing one series, regardless of whether stage is empty. Whether you return 0 for empty stage and 1 for set stage, or 1 for all, I don't care too much.
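A minimal sketch of the "always write one series" recommendation, assuming the 0-for-empty, 1-for-set option. The sample type and currentSample helper are hypothetical names used only for illustration and are not part of the operator code.

```go
package main

import "fmt"

// sample is one exposed series: a stage label and its gauge value.
// Hypothetical type, used only for this illustration.
type sample struct {
	Stage string
	Value float64
}

// currentSample always returns exactly one series for the current stage.
// An empty stage is still reported, here with value 0; any set stage gets 1.
func currentSample(stage string) sample {
	if stage == "" {
		return sample{Stage: "", Value: 0}
	}
	return sample{Stage: stage, Value: 1}
}

func main() {
	for _, s := range []string{"", "Preflight Checks"} {
		got := currentSample(s)
		fmt.Printf("rhmi_status{stage=%q} %g\n", got.Stage, got.Value)
	}
}
```

Reporting 1 for every stage, including the empty one, would only require returning Value: 1 in both branches.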
👍 Thanks for feedback all
/lgtm
/approve
/retest
/retest
f4c622a to fcaf2d4
@jjaferson could I get another lgtm? Rebased.
/lgtm
/retest
/retest
/retest
/retest
/retest
last failure is potentially a new test flake
I believe this failure is possible if the latest scrape of metrics in Prometheus has an out-of-date status compared to the current status in the RHMI CR.
/retest
@jjaferson Do you have any ideas why the metric from the most recent scrape might not match the current value in the RHMI CR?
fcaf2d4 to 18f4d63
This prevents a bug where the stage gets set to 'solution-explorer', then quickly set back to 'complete' after.
@philbrookes can you give this an lgtm again?
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: philbrookes
Description
https://issues.redhat.com/browse/INTLY-8120
Removes all labels from the rhmi_status metric except stage, so the number of series is kept to a minimum (more info in Jira & linked document). These changes will ensure a limit of 8 series (1 for each status).
A metric series will be exposed at all times for the current stage, with a value of 1.
For example:
An expression to determine if the RHMI install is complete would be:
If you want to get the currently active stage:
and use the value of the stage label.
This approach seems more in line with how metrics are used to expose non-numerical data, but comes with caveats (hence the need to limit the possible values of labels).
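As an illustration of reading the stage from the label, here is a hedged Go sketch that scans scraped text for the rhmi_status series with value 1 and returns its stage label. The activeStage helper, the regular expression, and the sample input are assumptions for this example; the "complete" stage name is taken from the discussion above and may differ in practice.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// seriesRe matches lines such as: rhmi_status{stage="cloud-resources"} 1
var seriesRe = regexp.MustCompile(`^rhmi_status\{stage="([^"]*)"\}\s+(\S+)$`)

// activeStage returns the stage label of the first rhmi_status series
// whose value is 1, and false if no such series is present.
func activeStage(scrape string) (string, bool) {
	for _, line := range strings.Split(scrape, "\n") {
		m := seriesRe.FindStringSubmatch(strings.TrimSpace(line))
		if m == nil {
			continue // comments, other metrics, blank lines
		}
		v, err := strconv.ParseFloat(m[2], 64)
		if err == nil && v == 1 {
			return m[1], true
		}
	}
	return "", false
}

func main() {
	// Hypothetical sample of scraped metrics text.
	scrape := "# TYPE rhmi_status gauge\nrhmi_status{stage=\"cloud-resources\"} 1\n"
	stage, ok := activeStage(scrape)
	fmt.Println("active stage:", stage, "found:", ok)
	fmt.Println("install complete:", ok && stage == "complete")
}
```

Because only one series is exposed at a time, the first match is also the only match in the expected case.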
More info on metric cardinality in https://www.robustperception.io/cardinality-is-key
To verify this change with the operator running locally, run:
and verify the output looks similar to below, with the current stage having a value of 1 during the installation
The stage should change as the CR status.stage field changes
Type of change
Checklist