
Expose metrics for the rhmi installation CR #784

Merged

2 commits merged into integr8ly:master from jjaferson:intly-6667 on May 27, 2020

Conversation

jjaferson
Contributor

@jjaferson jjaferson commented May 22, 2020

Description

Exposes two metrics for the RHMI installation CR: rhmi_status and rhmi_info.

https://issues.redhat.com/browse/INTLY-6667

Verification steps

  1. Install the operator via OLM
  2. Go to the rhmi-operator pod
  3. In the terminal tab, curl the metrics endpoint to verify that the metrics are present:
  • curl 0.0.0.0:8383/metrics | grep 'rhmi_status{'
  • curl 0.0.0.0:8383/metrics | grep 'rhmi_info{'
  4. Check that the metrics have all the created labels and that their values match the installation CR:
rhmi_info{installation_type="", master_url="", namespace="", operator_name="", use_cluster_storage=""}
rhmi_status{namespace="", last_error="", operator_name="", preflight_message="", preflight_status="", stage="", <product_name>-status="" ...}
  5. Go to the redhat-rhmi-middleware-monitoring-operator namespace
  6. Access Prometheus and verify that rhmi_status and rhmi_info are available
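As a rough illustration of what the curl commands in the steps above should return, here is a stdlib-only Go sketch of the Prometheus text exposition format. The label values below are placeholders, not what the operator actually emits; the real operator fills them from the RHMI CR.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// formatMetric renders one sample in the Prometheus text exposition format
// returned by `curl 0.0.0.0:8383/metrics`, e.g.
//   rhmi_info{namespace="redhat-rhmi-operator"} 1
func formatMetric(name string, labels map[string]string, value float64) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic label order
	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, fmt.Sprintf("%s=%q", k, labels[k]))
	}
	return fmt.Sprintf("%s{%s} %g", name, strings.Join(pairs, ","), value)
}

func main() {
	// Placeholder label values for illustration only.
	fmt.Println(formatMetric("rhmi_info", map[string]string{
		"installation_type":   "managed",
		"namespace":           "redhat-rhmi-operator",
		"use_cluster_storage": "true",
	}, 1))
	fmt.Println(formatMetric("rhmi_status", map[string]string{
		"namespace": "redhat-rhmi-operator",
		"stage":     "complete",
	}, 1))
}
```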

Type of change

  • New feature (non-breaking change which adds functionality)

Checklist

  • I have added tests that prove my fix is effective or that my feature works
  • I have added a test case that will be used to verify my changes
  • Verified independently on a cluster by reviewer

@jjaferson jjaferson changed the title expose prometheus metrics for the rhmi installation CR Expose metrics for the rhmi installation CR May 22, 2020
@coveralls

coveralls commented May 22, 2020

Coverage Status

Coverage decreased (-0.3%) to 59.407% when pulling ccbefb7 on jjaferson:intly-6667 into 6c6ec8d on integr8ly:master.

@jjaferson jjaferson force-pushed the intly-6667 branch 2 times, most recently from 4599bc5 to aaa2bc0 Compare May 22, 2020 18:04
@jjaferson
Contributor Author

/test images

@jjaferson jjaferson force-pushed the intly-6667 branch 5 times, most recently from 77aaaf7 to 33e7b2f Compare May 25, 2020 13:30
Member

@davidffrench davidffrench left a comment


Looks fantastic @jjaferson. Reviewed the code and verified on a cluster. One requested change on the addition to the CSV: this should be changed in deploy/role.yaml.

rhmi_status - Looks Good

[screenshot]

rhmi_status_available - Looks Good

[screenshot]

rhmi_info - Question

[screenshot]

One question on the rhmi_info metric. You can see from the screenshot there are two metrics being exposed for this. Do you know why this is?

@jjaferson
Contributor Author

> Looks fantastic @jjaferson. Reviewed the code and verified on a cluster. One requested change on the addition to the CSV: this should be changed in deploy/role.yaml.
>
> rhmi_status - Looks Good
>
> [screenshot]
>
> rhmi_status_available - Looks Good
>
> [screenshot]
>
> rhmi_info - Question
>
> [screenshot]
>
> One question on the rhmi_info metric. You can see from the screenshot there are two metrics being exposed for this. Do you know why this is?

I noticed that, but I think it's something that happens by default, as I saw the same metric for other CRDs, like enmasse_addressplan_info and blackboxtarget_info, with the same labels.

enmasse_addressplan_info
[Screenshot from 2020-05-25 14-59-16]

blackboxtarget_info
[Screenshot from 2020-05-25 15-00-16]

@david-martin
Member

> I noticed that but I think it's something that happens by default as I saw the same metric for other CRDs like enmasse_addressplan_info and blackboxtarget_info with the same labels

It looks like the metric is being scraped from 2 different places/endpoints: cr-metrics and http-metrics.
Is there a duplicate metric being exposed?

@jjaferson
Contributor Author

> I noticed that but I think it's something that happens by default as I saw the same metric for other CRDs like enmasse_addressplan_info and blackboxtarget_info with the same labels

> It looks like the metric is being scraped from 2 different places/endpoints. cr-metrics and http-metrics.
> Is there a duplicate metric being exposed?

I was doing some investigation and I found out that there is a gauge metric being exposed on port 8686 of the operator

curl 0.0.0.0:8686/metrics
# HELP rhmi_info Information about the RHMI custom resource.
# TYPE rhmi_info gauge
rhmi_info{namespace="redhat-rhmi-operator",rhmi="rhmi"} 1

https://github.com/integr8ly/integreatly-operator/blob/master/cmd/manager/main.go#L148

Not sure where the metric is created

@jjaferson
Contributor Author

> I noticed that but I think it's something that happens by default as I saw the same metric for other CRDs like enmasse_addressplan_info and blackboxtarget_info with the same labels
>
> It looks like the metric is being scraped from 2 different places/endpoints. cr-metrics and http-metrics.
> Is there a duplicate metric being exposed?
>
> I was doing some investigation and I found out that there is a gauge metric being exposed on port 8686 of the operator
>
> curl 0.0.0.0:8686/metrics
> # HELP rhmi_info Information about the RHMI custom resource.
> # TYPE rhmi_info gauge
> rhmi_info{namespace="redhat-rhmi-operator",rhmi="rhmi"} 1
>
> https://github.com/integr8ly/integreatly-operator/blob/master/cmd/manager/main.go#L148
>
> Not sure where the metric is created

The metric is created by default by the operator-sdk:

> Custom resource specific metrics
>
> By default operator will expose info metrics based on the number of the current instances of an operator’s custom resources in the cluster. It leverages kube-state-metrics as a library to generate those metrics. Metrics initialization lives in the cmd/manager/main.go file of the operator in the serveCRMetrics function. Its arguments are a custom resource’s group, version, and kind to generate the metrics. The metrics are served on 0.0.0.0:8686/metrics by default.

https://sdk.operatorframework.io/docs/golang/monitoring/prometheus/
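One way to see why Prometheus shows two rhmi_info series instead of merging them: a time series is identified by the metric name plus its complete label set, and the two scrape targets (cr-metrics on :8686, http-metrics on :8383) produce different label sets. The stdlib-only Go sketch below illustrates this; the "endpoint" label and its values are assumptions modeled on the screenshots discussed above, not code from the operator.

```go
package main

import (
	"fmt"
	"sort"
)

// seriesKey builds the identity Prometheus uses for a time series:
// the metric name plus the full, sorted label set.
func seriesKey(name string, labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	out := name
	for _, k := range keys {
		out += fmt.Sprintf(`,%s=%q`, k, labels[k])
	}
	return out
}

func main() {
	// The SDK-generated gauge scraped from :8686 (cr-metrics)...
	sdk := seriesKey("rhmi_info", map[string]string{
		"endpoint": "cr-metrics", "namespace": "redhat-rhmi-operator", "rhmi": "rhmi",
	})
	// ...and the hand-written gauge scraped from :8383 (http-metrics).
	custom := seriesKey("rhmi_info", map[string]string{
		"endpoint": "http-metrics", "namespace": "redhat-rhmi-operator",
	})
	// Same metric name, different label sets: two distinct series.
	fmt.Println(sdk != custom) // prints true
}
```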

@david-martin
Member

> Not sure where the metric is created

Looks like a CR metric exposed by the operator sdk for any/all CRD instances that exist.

I think we need to change the name of the metric that we're in control of (not the sdk-generated one).

@jjaferson
Contributor Author

> Not sure where the metric is created
>
> Looks like a CR metric exposed by the operator sdk for any/all CRD instances that exist.
>
> I think we need to change the name of the metric that we're in control of (not the sdk generated one)

Yup, what about rhmi_installation_info?
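The rename discussed above could also be guarded at registration time. Below is a hypothetical helper (not part of the operator) that builds a metric name from the CR kind and a suffix, and rejects the bare "info" suffix because the operator-sdk already exposes a <kind>_info gauge on the cr-metrics endpoint:

```go
package main

import (
	"fmt"
	"strings"
)

// safeMetricName is a hypothetical helper: it derives a metric name from
// the CR kind and a suffix, and refuses the bare "info" suffix, which
// would collide with the SDK-generated <kind>_info gauge.
func safeMetricName(kind, suffix string) (string, error) {
	name := strings.ToLower(kind) + "_" + suffix
	if suffix == "info" {
		return "", fmt.Errorf("%s collides with the sdk-generated CR metric", name)
	}
	return name, nil
}

func main() {
	if name, err := safeMetricName("RHMI", "installation_info"); err == nil {
		fmt.Println(name) // prints rhmi_installation_info
	}
	if _, err := safeMetricName("RHMI", "info"); err != nil {
		fmt.Println("rejected:", err)
	}
}
```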

@jjaferson jjaferson force-pushed the intly-6667 branch 2 times, most recently from 03eebaf to 85c5830 Compare May 25, 2020 17:41
@jjaferson
Contributor Author

/test e2e

failed to create e2e pod

"failed to fetch Cluster: failed to fetch dependency of "Cluster": failed to generate asset "Platform Permissions Check": validate AWS credentials: mint credentials check: error simulating policy: Throttling: Rate exceeded\n\tstatus code: 400, request id: 557a3694-d525-4c45-a76b-e95ce8ed683e"

@jjaferson
Contributor Author

/retest

1 similar comment
@jjaferson
Contributor Author

/retest

@davidffrench
Member

/lgtm
/approve

@jjaferson Great work. Please make sure to update the JIRA with the name change from rhmi_info to rhmi_spec.

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: davidffrench

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jjaferson
Contributor Author

/test test-cases-lint

@davidffrench davidffrench dismissed their stale review May 27, 2020 10:08

Approved changes now, dismissing requested changes

@openshift-merge-robot openshift-merge-robot merged commit 8f437be into integr8ly:master May 27, 2020
@david-martin
Member

david-martin commented May 28, 2020

Unable to tag @openshift/openshift-team-monitoring,
so leaving a comment that these are the code changes that expose rhmi_status metric.
See openshift/cluster-monitoring-operator#795 for cluster-monitoring-operator PR to whitelist this metric.


6 participants