Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NE-76: Integrate operator metrics with Prometheus #122

Conversation

Miciah
Copy link
Contributor

@Miciah Miciah commented Jul 1, 2019

The operator exposes metrics, but they are not collected. Configure Prometheus to collect these metrics.

  • manifests/0000_70_dns-operator_00-namespace.yaml: Add a label for openshift-monitoring.
  • manifests/0000_70_dns-operator_02-deployment.yaml: Add the metrics port.
  • manifests/0000_70_dns-operator_04-service.yaml: New file. Define a metrics service.
  • manifests/0000_90_dns-operator_00_prometheusrole.yaml: New file. Define a role that grants access to services, endpoints, and pods.
  • manifests/0000_90_dns-operator_01_prometheusrolebinding.yaml: New file. Bind the prometheus-k8s service account to the new role.
  • manifests/0000_90_dns-operator_02_servicemonitor.yaml: New file. Define a service monitor for the new service.

@openshift-ci-robot openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jul 1, 2019
@ironcladlou
Copy link
Contributor

@Miciah this got lost a long time ago — I'm inclined to tag it, any reason I shouldn't?

/retest

@ironcladlou
Copy link
Contributor

@Miciah if you still feel good about this one please remove the hold when you're ready.

/lgtm
/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 31, 2019
@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 31, 2019
@ironcladlou
Copy link
Contributor

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 6, 2019
@ironcladlou
Copy link
Contributor

@Miciah bindata might need updated here

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

1 similar comment
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@Miciah
Copy link
Contributor Author

Miciah commented Nov 7, 2019

/hold
I think there was an issue with unsecured endpoints.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 7, 2019
@Miciah Miciah force-pushed the NE-76-integrate-operator-metrics-with-Prometheus branch from 70065a9 to 4bc4cd8 Compare November 16, 2019 04:43
@openshift-ci-robot openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed lgtm Indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 16, 2019
@Miciah
Copy link
Contributor Author

Miciah commented Nov 16, 2019

/hold cancel
The metrics endpoint is now secured using kube-rbac-proxy.

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 16, 2019
@Miciah Miciah force-pushed the NE-76-integrate-operator-metrics-with-Prometheus branch from 4bc4cd8 to e552a26 Compare November 16, 2019 21:45
image: quay.io/openshift/origin-kube-rbac-proxy:latest
args:
- --logtostderr
- -v=8
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this log stuff on every scrape interval? If so is that ok?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will. I'll delete it.

- --secure-listen-address=:9393
- --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256
- --upstream=http://127.0.0.1:60000/
- --tls-cert-file=/etc/tls/private/tls.crt
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this image handle automatically reloading a rotated certificate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, kube-rbac-proxy watches for updates (brancz/kube-rbac-proxy@ef4527d), and I verified manually using the following command:

% oc -n openshift-dns-operator delete secrets/metrics-tls
secret "metrics-tls" deleted

Kube-rbac-proxy logs the following:

I1118 17:33:29.709147       1 reloader.go:98] reloading key /etc/tls/private/tls.key certificate /etc/tls/private/tls.crt

The operator exposes metrics, but they are not collected.  Configure
Prometheus to collect these metrics.

This commit resolves NE-76.

https://jira.coreos.com/browse/NE-76

* manifests/0000_70_dns-operator_00-cluster-role.yaml: Allow
kube-rbac-proxy to create subjectaccessreviews and tokenreviews.
* manifests/0000_70_dns-operator_00-namespace.yaml: Add a label for
openshift-monitoring.
* manifests/0000_70_dns-operator_01-service.yaml: New file.  Define a
metrics service.
* manifests/0000_70_dns-operator_02-deployment.yaml: Use kube-rbac-proxy to
expose the metrics port.
* manifests/0000_90_dns-operator_00_prometheusrole.yaml:  New file.
Define a role that grants access to services, endpoints, and pods.
* manifests/0000_90_dns-operator_01_prometheusrolebinding.yaml:  New
file.  Bind the prometheus-k8s service account to the new role.
* manifests/0000_90_dns-operator_02_servicemonitor.yaml: New file.  Define
a service monitor for the new service.
* manifests/image-references: Add kube-rbac-proxy.
@Miciah Miciah force-pushed the NE-76-integrate-operator-metrics-with-Prometheus branch from e552a26 to ae1b9e6 Compare November 18, 2019 17:39
@Miciah
Copy link
Contributor Author

Miciah commented Nov 18, 2019

Latest push drops -v=8 from the kube-rbac-proxy options.

@ironcladlou
Copy link
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 19, 2019
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ironcladlou, Miciah

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 6e9e960 into openshift:master Nov 19, 2019
@smarterclayton
Copy link
Contributor

This looks like it broke upgrades from 4.2 last week:

Dec 01 00:41:10.988 W clusteroperator/network changed Progressing to False
Dec 01 00:41:10.988 I clusteroperator/network versions: operator 4.2.9 -> 4.3.0-0.ci-2019-11-30-234318
Dec 01 00:41:11.880 I ns/openshift-dns-operator deployment/dns-operator Scaled up replica set dns-operator-5ff9db6dc5 to 1
Dec 01 00:41:11.896 I ns/openshift-dns-operator pod/dns-operator-5ff9db6dc5-57m95 node/ created
Dec 01 00:41:11.911 I ns/openshift-dns-operator replicaset/dns-operator-5ff9db6dc5 Created pod: dns-operator-5ff9db6dc5-57m95
Dec 01 00:41:11.921 W ns/openshift-marketplace pod/redhat-operators-6567d7b4c8-nr2nn network is not ready: runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network (13 times)
Dec 01 00:41:11.937 I ns/openshift-dns-operator pod/dns-operator-5ff9db6dc5-57m95 Successfully assigned openshift-dns-operator/dns-operator-5ff9db6dc5-57m95 to ip-10-0-136-246.ec2.internal
Dec 01 00:41:12.124 W ns/openshift-dns-operator pod/dns-operator-5ff9db6dc5-57m95 MountVolume.SetUp failed for volume "metrics-tls" : secret "metrics-tls" not found
Dec 01 00:41:15.922 I node/ip-10-0-135-144.ec2.internal Node ip-10-0-135-144.ec2.internal status is now: NodeReady (4 times)

@@ -36,6 +36,30 @@ spec:
resources:
requests:
cpu: 10m
- name: kube-rbac-proxy
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is super heavyweight vs just importing the library-go filters. Are you not based on library-go?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general no one should be using rbac proxy for metrics if you are go code we own.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are based on controller-runtime. Not sure about what library-go offers here, but seems too late anyway. Is there a reason we shouldn't ship what we have?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general no one should be using rbac proxy for metrics if you are go code we own.

cc @Miciah

Clearly there's some communications breakdown on this subject. Do we need to revert this or not?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, I don't know of any functional bugs with what we have

Copy link
Contributor Author

@Miciah Miciah Dec 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on discussions with the operator-sdk team, I understood kube-rbac-proxy to be the correct approach. I'm unfamiliar with what library-go provides in this area.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants