Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable prometheus metrics for controllers #609

Conversation

@iamemilio
Copy link

/test e2e-openstack

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 17, 2020
@Danil-Grigorev
Copy link
Contributor Author

/hold cancel
/retest

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 18, 2020
@enxebre
Copy link
Member

enxebre commented Jun 18, 2020

@Danil-Grigorev can you please coordinate with openstack/metal3/ovirt providers so they are aware of this?
You might want to split this PR to expose only in-tree controllers and have a separate one for machine controllers.

@Danil-Grigorev
Copy link
Contributor Author

/retest

@Danil-Grigorev
Copy link
Contributor Author

/test e2e-baremetal

@openshift-ci-robot
Copy link
Contributor

@Danil-Grigorev: The specified target(s) for /test were not found.
The following commands are available to trigger jobs:

  • /test e2e-aws
  • /test e2e-aws-operator
  • /test e2e-aws-operator-tech-preview
  • /test e2e-aws-scaleup-rhel7
  • /test e2e-aws-upgrade
  • /test e2e-azure
  • /test e2e-azure-operator
  • /test e2e-gcp
  • /test e2e-gcp-operator
  • /test e2e-openstack
  • /test goimports
  • /test golint
  • /test govet
  • /test images
  • /test unit
  • /test yaml-lint

Use /test all to run the following jobs:

  • pull-ci-openshift-machine-api-operator-master-e2e-aws
  • pull-ci-openshift-machine-api-operator-master-e2e-aws-operator
  • pull-ci-openshift-machine-api-operator-master-e2e-aws-scaleup-rhel7
  • pull-ci-openshift-machine-api-operator-master-e2e-aws-upgrade
  • pull-ci-openshift-machine-api-operator-master-e2e-azure
  • pull-ci-openshift-machine-api-operator-master-e2e-azure-operator
  • pull-ci-openshift-machine-api-operator-master-e2e-gcp
  • pull-ci-openshift-machine-api-operator-master-e2e-gcp-operator
  • pull-ci-openshift-machine-api-operator-master-goimports
  • pull-ci-openshift-machine-api-operator-master-golint
  • pull-ci-openshift-machine-api-operator-master-govet
  • pull-ci-openshift-machine-api-operator-master-images
  • pull-ci-openshift-machine-api-operator-master-unit
  • pull-ci-openshift-machine-api-operator-master-yaml-lint

In response to this:

/test e2e-baremetal

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Danil-Grigorev added a commit to Danil-Grigorev/cluster-api-provider-gcp that referenced this pull request Jul 2, 2020
- Include PR openshift/machine-api-operator#609 introducing metrics support
Danil-Grigorev added a commit to Danil-Grigorev/cluster-api-provider-aws that referenced this pull request Jul 7, 2020
- Include PR openshift/machine-api-operator#609 introducing metrics support
Danil-Grigorev added a commit to Danil-Grigorev/cluster-api-provider-gcp that referenced this pull request Jul 7, 2020
- Include PR openshift/machine-api-operator#609 introducing metrics support
openshift-merge-robot pushed a commit to openshift/cluster-api-provider-azure that referenced this pull request Jul 9, 2020
@Danil-Grigorev
Copy link
Contributor Author

#640 depends on this PR.

@enxebre
Copy link
Member

enxebre commented Jul 14, 2020

how do we know this work?

@JoelSpeed
Copy link
Contributor

@Danil-Grigorev Looks like at this point we are just waiting on the baremetal PR to merge for this to be ready? That PR needs a rebase presently, can you work to try and get that merged please

Also, have you taken this for a test run on a real cluster? Would be good to get some proof in the description of this PR that we are seeing metrics coming through (screenshot from prometheus?)

@Danil-Grigorev
Copy link
Contributor Author

@Danil-Grigorev Looks like at this point we are just waiting on the baremetal PR to merge for this to be ready? That PR needs a rebase presently, can you work to try and get that merged please

Also, have you taken this for a test run on a real cluster? Would be good to get some proof in the description of this PR that we are seeing metrics coming through (screenshot from prometheus?)

I'll put the screenshots soon. The main proof that the code is functional, is that the CI tests expect all created serviceMonitor endpoints to be healthy. If the port isn't exposed, it would return 503 on every metrics collection request, thus failing the tests.

@JoelSpeed
Copy link
Contributor

I'll put the screenshots soon. The main proof that the code is functional, is that the CI tests expect all created serviceMonitor endpoints to be healthy. If the port isn't exposed, it would return 503 on every metrics collection request, thus failing the tests.

That doesn't necessarily mean that we are exposing metrics there, we could just be displaying an nginx welcome page for all the E2E tests know 😉 It's good to be in a habit of manual testing with features like this, while our test suites are reasonably thorough, there's a lot of stuff they can't test. Also, while they're quite flaky at the moment, it helps if you've manually tested to know if it's a CI issue or it is your issue

@Danil-Grigorev Danil-Grigorev force-pushed the add-service-monitor-controller-metrics branch from a63dd19 to 1515a41 Compare July 17, 2020 12:57
@Danil-Grigorev
Copy link
Contributor Author

GCP metric triggering on wrong machine configuration resulted in error within API call:
image

@Danil-Grigorev
Copy link
Contributor Author

Danil-Grigorev commented Jul 17, 2020

I'll put the screenshots soon. The main proof that the code is functional, is that the CI tests expect all created serviceMonitor endpoints to be healthy. If the port isn't exposed, it would return 503 on every metrics collection request, thus failing the tests.

That doesn't necessarily mean that we are exposing metrics there, we could just be displaying an nginx welcome page for all the E2E tests know wink It's good to be in a habit of manual testing with features like this, while our test suites are reasonably thorough, there's a lot of stuff they can't test. Also, while they're quite flaky at the moment, it helps if you've manually tested to know if it's a CI issue or it is your issue

It is true, but you could manually verify it at any moment. Here is an example of AWS metrics before merging openshift/cluster-api-provider-aws#324. This is also displayed in CI with tests "Alerts shouldn't report any alerts in firing state", which is not occurring now.
image

AWS sample metric output:
image

Sample metric firing, captured by the ServiceMonitor resource:
image

@Danil-Grigorev
Copy link
Contributor Author

@enxebre @JoelSpeed I extended the description with a demo for introduced merics running in a GCP cluster.

Danil-Grigorev added a commit to Danil-Grigorev/cluster-api-provider-baremetal that referenced this pull request Jul 20, 2020
This change ensure compatibility with serviceMonitor resource
introduced in openshift/machine-api-operator#609
@Danil-Grigorev
Copy link
Contributor Author

/retest

@JoelSpeed
Copy link
Contributor

/approve

@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JoelSpeed

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 21, 2020
@Danil-Grigorev
Copy link
Contributor Author

/retest

@Danil-Grigorev
Copy link
Contributor Author

/test e2e-openstack

@enxebre
Copy link
Member

enxebre commented Jul 22, 2020

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 22, 2020
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

4 similar comments
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-ci-robot
Copy link
Contributor

openshift-ci-robot commented Jul 22, 2020

@Danil-Grigorev: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-metal-ipi 1515a41 link /test e2e-metal-ipi

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit e10a02c into openshift:master Jul 22, 2020
Gal-Zaidman pushed a commit to Gal-Zaidman/cluster-api-provider-ovirt that referenced this pull request Jul 26, 2020
This change ensures compatibility with the ServiceMonitor resource
introduced in [1]

[1] openshift/machine-api-operator#609

Signed-off-by: Gal-Zaidman <gzaidman@redhat.com>
Prashanth684 added a commit to Prashanth684/cluster-api-provider-libvirt that referenced this pull request Mar 11, 2021
this is needed to pull in the change to metrics that happened in MAO: openshift/machine-api-operator#609
Prashanth684 added a commit to Prashanth684/cluster-api-provider-libvirt that referenced this pull request Mar 11, 2021
This is to comply with openshift/machine-api-operator#609. Without this change alerts are firing as nobody is listening on port 8081 and the
metrics can't be exported.
Prashanth684 added a commit to Prashanth684/cluster-api-provider-libvirt that referenced this pull request Mar 12, 2021
this is needed to pull in the change to metrics that happened in MAO: openshift/machine-api-operator#609
Prashanth684 added a commit to Prashanth684/cluster-api-provider-libvirt that referenced this pull request Mar 12, 2021
This is to comply with openshift/machine-api-operator#609. Without this change alerts are firing as nobody is listening on port 8081 and the
metrics can't be exported.
Prashanth684 added a commit to Prashanth684/cluster-api-provider-libvirt that referenced this pull request Mar 15, 2021
this is needed to pull in the change to metrics that happened in MAO: openshift/machine-api-operator#609
Prashanth684 added a commit to Prashanth684/cluster-api-provider-libvirt that referenced this pull request Mar 15, 2021
This is to comply with openshift/machine-api-operator#609. Without this change alerts are firing as nobody is listening on port 8081 and the
metrics can't be exported.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants