Enable prometheus metrics for controllers #609

Danil-Grigorev · 2020-06-04T15:49:31Z

Enables functionality introduced in #590

Depends on

Demo:

iamemilio · 2020-06-11T20:12:53Z

/test e2e-openstack

Danil-Grigorev · 2020-06-17T14:32:45Z

/hold Waiting on

Danil-Grigorev · 2020-06-18T06:19:21Z

/hold cancel
/retest

enxebre · 2020-06-18T09:04:36Z

@Danil-Grigorev can you please coordinate with openstack/metal3/ovirt providers so they are aware of this?
You might want to split this PR to expose only in-tree controllers and have a separate one for machine controllers.

Danil-Grigorev · 2020-06-19T10:52:08Z

/retest

Danil-Grigorev · 2020-06-19T10:53:40Z

/test e2e-baremetal

openshift-ci-robot · 2020-06-19T10:53:53Z

@Danil-Grigorev: The specified target(s) for /test were not found.
The following commands are available to trigger jobs:

/test e2e-aws
/test e2e-aws-operator
/test e2e-aws-operator-tech-preview
/test e2e-aws-scaleup-rhel7
/test e2e-aws-upgrade
/test e2e-azure
/test e2e-azure-operator
/test e2e-gcp
/test e2e-gcp-operator
/test e2e-openstack
/test goimports
/test golint
/test govet
/test images
/test unit
/test yaml-lint

Use /test all to run the following jobs:

pull-ci-openshift-machine-api-operator-master-e2e-aws
pull-ci-openshift-machine-api-operator-master-e2e-aws-operator
pull-ci-openshift-machine-api-operator-master-e2e-aws-scaleup-rhel7
pull-ci-openshift-machine-api-operator-master-e2e-aws-upgrade
pull-ci-openshift-machine-api-operator-master-e2e-azure
pull-ci-openshift-machine-api-operator-master-e2e-azure-operator
pull-ci-openshift-machine-api-operator-master-e2e-gcp
pull-ci-openshift-machine-api-operator-master-e2e-gcp-operator
pull-ci-openshift-machine-api-operator-master-goimports
pull-ci-openshift-machine-api-operator-master-golint
pull-ci-openshift-machine-api-operator-master-govet
pull-ci-openshift-machine-api-operator-master-images
pull-ci-openshift-machine-api-operator-master-unit
pull-ci-openshift-machine-api-operator-master-yaml-lint

In response to this:

/test e2e-baremetal


- Include PR openshift/machine-api-operator#609 introducing metrics support

Danil-Grigorev · 2020-07-14T14:02:50Z

#640 depends on this PR.

enxebre · 2020-07-14T14:20:29Z

How do we know this works?

JoelSpeed · 2020-07-17T10:38:16Z

@Danil-Grigorev Looks like at this point we are just waiting on the baremetal PR to merge for this to be ready? That PR currently needs a rebase; can you work on getting it merged, please?

Also, have you taken this for a test run on a real cluster? It would be good to get some proof in the description of this PR that we are seeing metrics coming through (screenshot from Prometheus?).

Danil-Grigorev · 2020-07-17T10:44:41Z

> @Danil-Grigorev Looks like at this point we are just waiting on the baremetal PR to merge for this to be ready? That PR currently needs a rebase; can you work on getting it merged, please?
>
> Also, have you taken this for a test run on a real cluster? It would be good to get some proof in the description of this PR that we are seeing metrics coming through (screenshot from Prometheus?).

I'll add the screenshots soon. The main proof that the code is functional is that the CI tests expect all created ServiceMonitor endpoints to be healthy. If the port weren't exposed, every metrics collection request would return 503, failing the tests.

JoelSpeed · 2020-07-17T10:53:50Z

> I'll add the screenshots soon. The main proof that the code is functional is that the CI tests expect all created ServiceMonitor endpoints to be healthy. If the port weren't exposed, every metrics collection request would return 503, failing the tests.

That doesn't necessarily mean that we are exposing metrics there; we could just be serving an nginx welcome page for all the E2E tests know 😉 It's good to be in the habit of manually testing features like this: while our test suites are reasonably thorough, there's a lot they can't test. Also, since the tests are quite flaky at the moment, having tested manually helps you tell whether a failure is a CI issue or your issue.

Danil-Grigorev · 2020-07-17T13:00:40Z

GCP metric triggered by a wrong machine configuration, which resulted in an error within the API call:

Danil-Grigorev · 2020-07-17T13:07:45Z

> > I'll add the screenshots soon. The main proof that the code is functional is that the CI tests expect all created ServiceMonitor endpoints to be healthy. If the port weren't exposed, every metrics collection request would return 503, failing the tests.
>
> That doesn't necessarily mean that we are exposing metrics there; we could just be serving an nginx welcome page for all the E2E tests know 😉 It's good to be in the habit of manually testing features like this: while our test suites are reasonably thorough, there's a lot they can't test. Also, since the tests are quite flaky at the moment, having tested manually helps you tell whether a failure is a CI issue or your issue.

That is true, but you can verify it manually at any moment. Here is an example of AWS metrics before merging openshift/cluster-api-provider-aws#324. This is also surfaced in CI by the test "Alerts shouldn't report any alerts in firing state", which is not occurring now.

AWS sample metric output:

Sample metric firing, captured by the ServiceMonitor resource:

Danil-Grigorev · 2020-07-17T13:36:29Z

@enxebre @JoelSpeed I extended the description with a demo of the introduced metrics running in a GCP cluster.

This change ensures compatibility with the ServiceMonitor resource introduced in openshift/machine-api-operator#609

Danil-Grigorev · 2020-07-20T13:41:17Z

/retest

JoelSpeed · 2020-07-21T11:06:00Z

/approve

openshift-ci-robot · 2020-07-21T11:06:18Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JoelSpeed

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [JoelSpeed]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Danil-Grigorev · 2020-07-22T07:55:22Z

/retest

Danil-Grigorev · 2020-07-22T08:52:35Z

/test e2e-openstack

enxebre · 2020-07-22T09:04:14Z

/lgtm

openshift-bot · 2020-07-22T09:09:53Z

/retest