Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug 1845873: Add rhmi_status metric to telemeter #795

Merged
merged 1 commit into from Jun 9, 2020
Merged

Bug 1845873: Add rhmi_status metric to telemeter #795

merged 1 commit into from Jun 9, 2020

Conversation

david-martin
Copy link
Member

@david-martin david-martin commented May 26, 2020

New metric to be whitelisted for telemeter

  • rhmi_status

Used with OCM to inform of RHMI installation progress.
The total number of series/cardinality is 8 (based on their being 8 possible stages)

@david-martin
Copy link
Member Author

/cc @alanmoran @maleck13 @jjaferson

@openshift-ci-robot
Copy link
Contributor

@david-martin: GitHub didn't allow me to request PR reviews from the following users: maleck13.

Note that only openshift members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @alanmoran @maleck13 @jjaferson

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Copy link
Contributor

@lilic lilic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of questions:

  • What is the total number of series you will be sending for each component, as each team gets a budget?
  • Did you get Claytons 👍 on this, as per the docs? :)
  • If I remember correctly you are not installed by default in every openshift cluster, so this makes it hard to test things?

Also note this can only happen from 4.6 onwards, you are targeting master, but master - 4.5 is closed for these kind of changes so you have to wait until 4.6 master opens.

@david-martin
Copy link
Member Author

  • What is the total number of series you will be sending for each component, as each team gets a budget?
  • rhmi_status => only every 1 instance of this metric per cluster (labels may change over time, but still just 1 instance)
  • rhmi_version => only every 1 instance of this metric per cluster (labels may change over time, but still just 1 instance)
  • Did you get Claytons 👍 on this, as per the docs? :)

Please link the docs where this is mentioned in case I've missed anything else.

  • If I remember correctly you are not installed by default in every openshift cluster, so this makes it hard to test things?

Yes.
Enabling the 'rhmi' addon will be enough to start scraping the metrics.
Let me know what can be done here to satisfy testing criteria

@lilic
Copy link
Contributor

lilic commented May 28, 2020

The docs are linked in both forum-telemetry and forum-monitoring :) https://docs.google.com/document/d/1a6n5iBGM2QaIQRg9Lw4-Npj6QY9--Hpx3XYut-BrUSY/edit

(labels may change over time, but still just 1 instance)

You mean label value or label name?

Might be easier to link to the metrics in code, thanks!

@david-martin
Copy link
Member Author

The docs are linked in both forum-telemetry and forum-monitoring

Thanks, I was only looking in this repo.
Reviewing now.

You mean label value or label name?
Label value

Might be easier to link to the metrics in code, thanks

See both metrics in https://github.com/integr8ly/integreatly-operator/blob/master/pkg/metrics/metrics.go

@lilic
Copy link
Contributor

lilic commented May 28, 2020

https://github.com/integr8ly/integreatly-operator/blob/master/pkg/metrics/metrics.go#L66-L72

Seems like its much more than 1 series here, I have not looked at the various label values you can set, but the following makes me think these are of unbound high cardinality metrics. https://github.com/integr8ly/integreatly-operator/blob/5cdc436d52c04f18e7cc56644b3f1a44060676af/pkg/controller/installation/installation_controller.go#L710

As per the document, we are trying to limit the number of series each component to 1 - 10 series for now. If you get special approval from Clayton on more, thats fine by me. But this looks like you are going to be sending unbound number of series at least for this one metric, did not look at others.

Let me know if there is something that I am missing or anything unclear.

@david-martin
Copy link
Member Author

As per the document, we are trying to limit the number of series each component to 1 - 10 series for now

Thanks, this is starting to make more sense to me now.
I'll review in more detail and get a better idea of cardinality, and suggest changes to get that value within the range.

@lilic
Copy link
Contributor

lilic commented May 28, 2020

Sounds good to me, feel free to rechout to me or any other team members on forum-monitoring if you have any other questions!

# (rhmi) rhmi_version reports what version of RHMI is running on the cluster.
# Used to identify what versions are on a cluster that is experiencing problems.
- '{__name__="rhmi_version"}'
# (rhmi) rhmi_status reports the status of RHMI (pending, failed, complete)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please also add @openshift/openshift-team-cluster-manager in here for change notifications, as this metric will be consumed by OCM (for self-service RHMI)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! 👍

@david-martin
Copy link
Member Author

/hold while metrics are under review

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 29, 2020
@kbsingh
Copy link

kbsingh commented Jun 2, 2020

@lilic these specific pieces should land part of middleware budget

@s-urbaniak
Copy link
Contributor

/assign pgier

@s-urbaniak
Copy link
Contributor

@pgier do you mind to have a look at this one?

@s-urbaniak
Copy link
Contributor

@david-martin please let us know once your cardinality investigation is done. Do you mind to share the results (numbers) simply here? Thanks! :-)

@lilic
Copy link
Contributor

lilic commented Jun 3, 2020

@s-urbaniak there is nothing to review yet, I have already reviewed the document of the new proposed metrics, I can take it from here as I know the background:)

@lilic
Copy link
Contributor

lilic commented Jun 3, 2020

@kbsingh was not aware that is a different budget, they still go to telemtry so similar rules apply? But as discussed these metrics were way over any budget as they were unbound cardinality.

@openshift-ci-robot openshift-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jun 4, 2020
- rhmi_status
- rhmi_version

Both will be used for debugging & aggregating reporting on RHMI clusters.
rhmi_status will also be used with OCM to inform of RHMI installation progress.
@david-martin
Copy link
Member Author

david-martin commented Jun 4, 2020

@lilic I've updated the PR to only include the rhmi_status metric.
The PR in the RHMI Operator for this metric is integr8ly/integreatly-operator#857

Example of the metric series exposed:

EDIT: Now only exposes the current stage (No 0 values)

# HELP rhmi_status RHMI status of an installation
# TYPE rhmi_status gauge
rhmi_status{stage="complete"} 1

Let me know if there's anything else I need to do here to get this merged.

@david-martin
Copy link
Member Author

/unhold

@lilic I believe this is ready for another review.
Let me know if anything else needs to be done.
The e2e-aws CI failure looks unrelated to this change

@vkareh What's the right process here to get this backported for a 4.4 z-stream release as well?

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 5, 2020
@lilic
Copy link
Contributor

lilic commented Jun 5, 2020

@david-martin

Let me know if anything else needs to be done.

These look good from my perspective, thanks! Last thing is did you get an okay from Clayton on the metrics as well?

@vkareh
Copy link
Member

vkareh commented Jun 5, 2020

@david-martin

What's the right process here to get this backported for a 4.4 z-stream release as well?

Sorry, I don't know this process - I work on the OCM bits, my project is just a consumer

@lilic
Copy link
Contributor

lilic commented Jun 5, 2020

You need a valid bugzilla medium or higher priority for backports, this should be against the RHMI component in bugzilla.

@s-urbaniak
Copy link
Contributor

@vkareh as @lilic mentioned you need a BZ against the owning component, but according to @david-martin there seems to be no component owner in OpenShift for RHMI metrics?

pinging @bparees @smarterclayton to clarify which component owns RHMI in BZ, this seems to be a black-hole situation if we ship metrics in OpenShift who are not owned by anybody.

@bparees
Copy link
Contributor

bparees commented Jun 8, 2020

pinging @bparees @smarterclayton to clarify which component owns RHMI in BZ, this seems to be a black-hole situation if we ship metrics in OpenShift who are not owned by anybody.

maybe we can use the install component for now, but @sdodson might have better ideas or want to push for a new component(or sub-component) to be added to cover this.

@@ -205,6 +205,10 @@ data:
# (monitoring-team) monitoring:haproxy_server_http_responses_total:sum tracks the number of times users access
# monitoring routes.
- '{__name__="monitoring:haproxy_server_http_responses_total:sum"}'
# (rhmi, @openshift/openshift-team-cluster-manager) rhmi_status reports the status of an RHMI installation.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you be in multiple stages? If not, why not only report a single series?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you be in multiple stages?

No.

If not, why not only report a single series?

I may be misunderstanding how labels impact the number of series, but do you mean only ever expose 1 metric result from the scrape endpoint with whatever the current stage is in the stage label (with a value of 1)?

e.g.

  • from 11:30 to 11:35 rhmi_status{stage="bootstrap"} 1
  • from 11:35 to 11:39 rhmi_status{stage="products"} 1
  • after 11:39 rhmi_status{stage="complete"} 1

Or use the value to represent the stage (no label)?

e.g.

  • from 11:30 to 11:35 rhmi_status{} 1
  • from 11:35 to 11:39 rhmi_status{} 2
  • after 11:39 rhmi_status{} 3

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first one - you don't need 8 series with 1 set to 1, and the other 7 set to 0, if the question you are primarily answering is "how many clusters are in this state", "is this cluster in this state". The zero values don't convey any additional info over their absence as far as I can tell


    from 11:30 to 11:35 rhmi_status{stage="bootstrap"} 1
    from 11:35 to 11:39 rhmi_status{stage="products"} 1
    after 11:39 rhmi_status{stage="complete"} 1

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need 8 series with 1 set to 1

Good point.
My thinking was trying to mimic what kube-state-metrics does for some metrics like kube_pod_status_phase e.g.

kube_pod_status_phase{...,phase="Failed",...} | 0
kube_pod_status_phase{...,phase="Pending",...} | 0
kube_pod_status_phase{...,phase="Running",...} | 1
kube_pod_status_phase{...,phase="Succeeded",...} | 0
kube_pod_status_phase{...,phase="Unknown",...} | 0

I'll update the metric upstream in integr8ly/integreatly-operator#857 to omit the 0's.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, if 0 values you don't care about, you could do a recording rule that checks anything larger than 0 or == 1, and just send the recording rule via telemetry, that will indeed save you few series per cluster.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirming I've updated the metric being exposed so it doesn't return 0 values (integr8ly/integreatly-operator#857)

@sdodson
Copy link
Member

sdodson commented Jun 8, 2020

maybe we can use the install component for now, but @sdodson might have better ideas or want to push for a new component(or sub-component) to be added to cover this.

NACK on Installer but I really don't have a better suggestion. Ideally there'd be a component in the repo's OWNERS file which could be used to discover. If we only care about appeasing the bot for backport just stick it in Unknown but make sure it's clear that it doesn't need to be triaged to another component.

Copy link
Contributor

@lilic lilic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 9, 2020
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: david-martin, lilic

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 9, 2020
@lilic
Copy link
Contributor

lilic commented Jun 9, 2020

/retest

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

1 similar comment
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit caa8496 into openshift:master Jun 9, 2020
@david-martin david-martin deleted the rhmi-metrics branch June 10, 2020 08:44
@david-martin david-martin changed the title Add RHMI metrics to telemeter Bug 1845873 - Add rhmi_status metric to telemeter Jun 10, 2020
@david-martin
Copy link
Member Author

/cherrypick release-4.5

@openshift-cherrypick-robot

@david-martin: #795 failed to apply on top of branch "release-4.5":

error: Failed to merge in the changes.
Using index info to reconstruct a base tree...
M	Documentation/data-collection.md
M	Documentation/sample-metrics.md
M	Documentation/telemeter_query
M	manifests/0000_50_cluster_monitoring_operator_04-config.yaml
Falling back to patching base and 3-way merge...
Auto-merging manifests/0000_50_cluster_monitoring_operator_04-config.yaml
CONFLICT (content): Merge conflict in manifests/0000_50_cluster_monitoring_operator_04-config.yaml
Auto-merging Documentation/telemeter_query
CONFLICT (content): Merge conflict in Documentation/telemeter_query
Auto-merging Documentation/sample-metrics.md
CONFLICT (content): Merge conflict in Documentation/sample-metrics.md
Auto-merging Documentation/data-collection.md
CONFLICT (content): Merge conflict in Documentation/data-collection.md
Patch failed at 0001 Add RHMI metrics to telemeter

In response to this:

/cherrypick release-4.5

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@david-martin david-martin changed the title Bug 1845873 - Add rhmi_status metric to telemeter Bug 1845873: Add rhmi_status metric to telemeter Jun 19, 2020
@openshift-ci-robot
Copy link
Contributor

@david-martin: Bugzilla bug 1845873 is in an unrecognized state (VERIFIED) and will not be moved to the MODIFIED state.

In response to this:

Bug 1845873: Add rhmi_status metric to telemeter

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet