New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug 1845873: Add rhmi_status metric to telemeter #795
Bug 1845873: Add rhmi_status metric to telemeter #795
Conversation
@david-martin: GitHub didn't allow me to request PR reviews from the following users: maleck13. Note that only openshift members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple of questions:
- What is the total number of series you will be sending for each component, as each team gets a budget?
- Did you get Claytons 👍 on this, as per the docs? :)
- If I remember correctly you are not installed by default in every openshift cluster, so this makes it hard to test things?
Also note this can only happen from 4.6 onwards, you are targeting master, but master - 4.5 is closed for these kind of changes so you have to wait until 4.6 master opens.
Please link the docs where this is mentioned in case I've missed anything else.
Yes. |
The docs are linked in both forum-telemetry and forum-monitoring :) https://docs.google.com/document/d/1a6n5iBGM2QaIQRg9Lw4-Npj6QY9--Hpx3XYut-BrUSY/edit
You mean label value or label name? Might be easier to link to the metrics in code, thanks! |
Thanks, I was only looking in this repo.
See both metrics in https://github.com/integr8ly/integreatly-operator/blob/master/pkg/metrics/metrics.go |
https://github.com/integr8ly/integreatly-operator/blob/master/pkg/metrics/metrics.go#L66-L72 Seems like its much more than 1 series here, I have not looked at the various label values you can set, but the following makes me think these are of unbound high cardinality metrics. https://github.com/integr8ly/integreatly-operator/blob/5cdc436d52c04f18e7cc56644b3f1a44060676af/pkg/controller/installation/installation_controller.go#L710 As per the document, we are trying to limit the number of series each component to 1 - 10 series for now. If you get special approval from Clayton on more, thats fine by me. But this looks like you are going to be sending unbound number of series at least for this one metric, did not look at others. Let me know if there is something that I am missing or anything unclear. |
Thanks, this is starting to make more sense to me now. |
Sounds good to me, feel free to rechout to me or any other team members on forum-monitoring if you have any other questions! |
# (rhmi) rhmi_version reports what version of RHMI is running on the cluster. | ||
# Used to identify what versions are on a cluster that is experiencing problems. | ||
- '{__name__="rhmi_version"}' | ||
# (rhmi) rhmi_status reports the status of RHMI (pending, failed, complete) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please also add @openshift/openshift-team-cluster-manager
in here for change notifications, as this metric will be consumed by OCM (for self-service RHMI)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! 👍
/hold while metrics are under review |
@lilic these specific pieces should land part of middleware budget |
/assign pgier |
@pgier do you mind to have a look at this one? |
@david-martin please let us know once your cardinality investigation is done. Do you mind to share the results (numbers) simply here? Thanks! :-) |
@s-urbaniak there is nothing to review yet, I have already reviewed the document of the new proposed metrics, I can take it from here as I know the background:) |
@kbsingh was not aware that is a different budget, they still go to telemtry so similar rules apply? But as discussed these metrics were way over any budget as they were unbound cardinality. |
- rhmi_status - rhmi_version Both will be used for debugging & aggregating reporting on RHMI clusters. rhmi_status will also be used with OCM to inform of RHMI installation progress.
@lilic I've updated the PR to only include the Example of the metric series exposed: EDIT: Now only exposes the current stage (No 0 values)
Let me know if there's anything else I need to do here to get this merged. |
These look good from my perspective, thanks! Last thing is did you get an okay from Clayton on the metrics as well? |
Sorry, I don't know this process - I work on the OCM bits, my project is just a consumer |
You need a valid bugzilla medium or higher priority for backports, this should be against the RHMI component in bugzilla. |
@vkareh as @lilic mentioned you need a BZ against the owning component, but according to @david-martin there seems to be no component owner in OpenShift for RHMI metrics? pinging @bparees @smarterclayton to clarify which component owns RHMI in BZ, this seems to be a black-hole situation if we ship metrics in OpenShift who are not owned by anybody. |
maybe we can use the install component for now, but @sdodson might have better ideas or want to push for a new component(or sub-component) to be added to cover this. |
@@ -205,6 +205,10 @@ data: | |||
# (monitoring-team) monitoring:haproxy_server_http_responses_total:sum tracks the number of times users access | |||
# monitoring routes. | |||
- '{__name__="monitoring:haproxy_server_http_responses_total:sum"}' | |||
# (rhmi, @openshift/openshift-team-cluster-manager) rhmi_status reports the status of an RHMI installation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you be in multiple stages? If not, why not only report a single series?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you be in multiple stages?
No.
If not, why not only report a single series?
I may be misunderstanding how labels impact the number of series, but do you mean only ever expose 1 metric result from the scrape endpoint with whatever the current stage is in the stage label (with a value of 1)?
e.g.
- from 11:30 to 11:35
rhmi_status{stage="bootstrap"} 1
- from 11:35 to 11:39
rhmi_status{stage="products"} 1
- after 11:39
rhmi_status{stage="complete"} 1
Or use the value to represent the stage (no label)?
e.g.
- from 11:30 to 11:35
rhmi_status{} 1
- from 11:35 to 11:39
rhmi_status{} 2
- after 11:39
rhmi_status{} 3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first one - you don't need 8 series with 1 set to 1, and the other 7 set to 0, if the question you are primarily answering is "how many clusters are in this state", "is this cluster in this state". The zero values don't convey any additional info over their absence as far as I can tell
from 11:30 to 11:35 rhmi_status{stage="bootstrap"} 1
from 11:35 to 11:39 rhmi_status{stage="products"} 1
after 11:39 rhmi_status{stage="complete"} 1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you don't need 8 series with 1 set to 1
Good point.
My thinking was trying to mimic what kube-state-metrics does for some metrics like kube_pod_status_phase
e.g.
kube_pod_status_phase{...,phase="Failed",...} | 0
kube_pod_status_phase{...,phase="Pending",...} | 0
kube_pod_status_phase{...,phase="Running",...} | 1
kube_pod_status_phase{...,phase="Succeeded",...} | 0
kube_pod_status_phase{...,phase="Unknown",...} | 0
I'll update the metric upstream in integr8ly/integreatly-operator#857 to omit the 0's.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, if 0 values you don't care about, you could do a recording rule that checks anything larger than 0
or == 1
, and just send the recording rule via telemetry, that will indeed save you few series per cluster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Confirming I've updated the metric being exposed so it doesn't return 0 values (integr8ly/integreatly-operator#857)
NACK on Installer but I really don't have a better suggestion. Ideally there'd be a component in the repo's OWNERS file which could be used to discover. If we only care about appeasing the bot for backport just stick it in Unknown but make sure it's clear that it doesn't need to be triaged to another component. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: david-martin, lilic The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/retest |
/retest Please review the full test history for this PR and help us cut down flakes. |
1 similar comment
/retest Please review the full test history for this PR and help us cut down flakes. |
/cherrypick release-4.5 |
@david-martin: #795 failed to apply on top of branch "release-4.5":
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@david-martin: Bugzilla bug 1845873 is in an unrecognized state (VERIFIED) and will not be moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
New metric to be whitelisted for telemeter
Used with OCM to inform of RHMI installation progress.
The total number of series/cardinality is 8 (based on their being 8 possible stages)