Bug 1820083: configure metal3 metrics collection #671

Conversation
dhellmann commented Aug 6, 2020
- Refactor some of the proxy container setup so it can be reused.
- Add a metal3-metrics Service
- Add a metal3 ServiceMonitor (a rough sketch of the new Service is included after this list)
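For orientation, here is a minimal sketch of the shape of that Service, assembled from the snippets quoted in the review threads below. It is a sketch, not the PR's exact code: the port value (8085) is assumed from the review discussion, the helper name is invented, and the ServiceMonitor (which selects this Service and scrapes its named "metrics" endpoint) is omitted.

```go
// Sketch only: names and values follow the snippets quoted in the review
// below; the real definition lives in pkg/operator/baremetal_pod.go.
package operator

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// metal3ExposeMetricsPort is assumed to be 8085 based on the review comments;
// the actual constant is defined elsewhere in the operator.
const metal3ExposeMetricsPort = 8085

func newMetal3MetricsServiceSketch() *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "metal3-metrics",
			Namespace: "openshift-machine-api",
		},
		Spec: corev1.ServiceSpec{
			Type: corev1.ServiceTypeNodePort,
			// The selector reuses the shared k8s-app label; see the review
			// discussion below about whether this is too broad.
			Selector: map[string]string{"k8s-app": "controller"},
			Ports: []corev1.ServicePort{{
				Name: "metrics",
				Port: metal3ExposeMetricsPort,
				// TargetPort refers to a named container port, so only pods
				// that declare a port with this name receive endpoints.
				TargetPort: intstr.FromString("metrics"),
			}},
		},
	}
}
```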
@dhellmann: This pull request references Bugzilla bug 1820083, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/cc @JoelSpeed I don't think this is working right, but I'm putting it up for review so I can get some help understanding why.
Force-pushed from f2b7c0e to 8a90498
selector:
  matchLabels:
    k8s-app: controller
Is this not going to select the Machine API core components as well? Is there a specific label that is applied to only the metal3 components?
There is not, right now. I'm exploring whether I can change that as part of this PR in some local changes I haven't published yet. I tried with an earlier version and got an error and undid the changes so I could make progress in other areas.
It turns out that field is immutable, which means we have to delete and recreate the deployment to change it (at least as far as I can tell). It doesn't seem to hurt anything to have it this way for now, and we can do something about renaming when we move all of the metal3 logic into the CBO if it turns into a problem.
My concern is that it is going to try and scrape all of the Machine API controllers, not just the metal3 ones. When you tried this out, did you notice in Prometheus whether it was trying to scrape more than just the one pod? It might have left some errors in the Prometheus targets page showing that it couldn't scrape the machine-api-controllers pod.
This new stuff will only be set up when the platform is "baremetal" though, right? So it will look at the cluster-api-provider-baremetal and metal3. Maybe there are other pods?
If we have to change the label, then I think we're going to have to defer fixing the bug to 4.7 or later, because we're trying to focus on building the new operator to manage metal3 and can't afford the time to build the logic in the MAO to recreate the deployment with the new setting.
Which field do you mean is immutable, exactly, @dhellmann? This selector is for the new service you are defining in this PR (https://github.com/openshift/machine-api-operator/pull/671/files#diff-03a356cadfb070910ca0faa53a4ab4bfR313), so you can choose anything.
It is not possible to change the k8s-app label value in the Deployment.Spec.Selector.MatchLabels field. Attempting to change that with a patch like https://paste.centos.org/view/b0f842e8 causes the MAO to log an error every time it tries to update the Deployment:
W0916 15:42:54.324744 1050505 recorder_logging.go:46] &Event{ObjectMeta:{dummy.16355b11bbdfeaac dummy 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},
InvolvedObject:ObjectReference{Kind:Pod,Namespace:dummy,Name:dummy,UID:,APIVersion:v1,ResourceVersion:,FieldPath:,},
Reason:DeploymentUpdateFailed,
Message:Failed to update Deployment.apps/metal3 -n openshift-machine-api: Deployment.apps "metal3" is invalid: spec.selector: Invalid value:
v1.LabelSelector{MatchLabels:map[string]string{"api":"clusterapi", "k8s-app":"metal3"},
MatchExpressions:[]v1.LabelSelectorRequirement(nil)} field is immutable,
Source:EventSource{Component:,Host:,},
FirstTimestamp:2020-09-16 15:42:54.324665004 -0400 EDT m=+105.798620314,
LastTimestamp:2020-09-16 15:42:54.324665004 -0400 EDT m=+105.798620314,
Count:1,Type:Warning,
EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:,ReportingInstance:,}
I suspect that means adding a different label is going to result in the same error, because it would mean changing the same field in the Deployment.
https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#label-selector-updates
"In API version apps/v1, a Deployment's label selector is immutable after it gets created."
So yeah. This feels pretty wrong though, as requests will go to a bunch of things we aren't intending them to reach.
They don't, though. The label selector is the same, but the port is not, and both are used to filter. Only the metal3 deployment matches both.
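To make that filtering argument concrete, here is a hypothetical illustration (not code from the PR) of how a named targetPort narrows the match, assuming only the metal3 pod declares a container port named "metrics" and the exposed port is 8085 as suggested in the review:

```go
package operator

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// Both deployments carry the "k8s-app: controller" label, but the endpoints
// controller resolves a Service's *named* targetPort per pod, so a pod that
// does not declare a container port with that name never becomes an endpoint.
func metal3PortPairSketch() (corev1.ContainerPort, corev1.ServicePort) {
	containerPort := corev1.ContainerPort{
		Name:          "metrics", // assumed to be declared only on the metal3 pod
		ContainerPort: 8085,      // assumed value from the review discussion
	}
	servicePort := corev1.ServicePort{
		Name:       "metrics",
		Port:       8085,
		TargetPort: intstr.FromString("metrics"), // matched against each pod's named ports
	}
	return containerPort, servicePort
}
```

A numeric targetPort would not give this protection; it is the lookup by port name that excludes pods without a matching declared port.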
Selector: map[string]string{
    "k8s-app": "controller",
},
Is this not going to select the Machine API core components as well? Is there a specific label that is applied to only the metal3 components?
Same concern as above: this is going to try and expose the port on the machine-api-controllers pod as well, isn't it? Or will this only come into effect when they're running in their own namespace?
If you check the endpoints for this service, which pods does it target?
I see it includes the machine-api-controllers pod as well as the metal3 pod. Those both run in the openshift-machine-api namespace.
In that case, we need to verify that Prometheus, and in particular alerts, are happy. IIRC there's an alert that fires if any ServiceMonitor target is down; I would expect machine-api-controllers to show as down in this case, since it won't have the correct port open?
This could maybe match a specific label that should be added/replaced in the baremetal deployment, so that only those endpoints are covered by the service.
@dhellmann Can you elaborate on why you don't think it's working? From the code, there's nothing obvious that screams out that it shouldn't be working.
Well, when I replace the MAO in my cluster with this code, Prometheus doesn't report any of the metrics I expect it to report. I'm currently rebuilding the cluster to try that again.
Haha, OK. I guess we just have to start debugging from the ground up on this one to find where the fault is. I'm trying to bring up a cluster-bot cluster with these changes; if that works, I'll play around with it and see if I can advise. Otherwise I don't have access to a metal cluster to try this on.
Force-pushed from 8a90498 to d851aa7
Force-pushed from d851aa7 to cc76b64
I've rebased the PR and re-tested locally and it seems to be working fine.
/retest
Still concerned about this conflicting with the other pods in the namespace. Any idea if there were issues when testing this?
@@ -14,6 +14,7 @@ const (
 	DefaultHealthCheckMetricsAddress = ":8083"
 	DefaultMachineSetMetricsAddress = ":8082"
 	DefaultMachineMetricsAddress = ":8081"
+	DefaultMetal3MetricsAddress = ":60000"
It would be good to know the motivation behind choosing this number. Any reason not to try 8085 (to match the exposed port)?
Part of the motivation was not changing the port where it was already listening. The pod uses host networking, so it needs to be on a port where it won't conflict with anything else. That port is already being used (https://github.com/openshift/machine-api-operator/blob/master/pkg/operator/baremetal_pod.go#L407), so this is just adding a constant to share the value in the multiple places that now need it.
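As a rough sketch of what "sharing the value" means here (the flag name and the proxy upstream URL below are assumptions for illustration, not taken from the PR):

```go
package operator

import "fmt"

// DefaultMetal3MetricsAddress mirrors the constant in the diff above. Port
// 60000 is kept because the metal3 pod uses host networking and the upstream
// was already listening there.
const DefaultMetal3MetricsAddress = ":60000"

// Hypothetical call sites: the same constant could feed both the controller's
// metrics flag and the kube-rbac-proxy upstream URL, so the value is written
// down exactly once.
func metal3MetricsArgSketch() string {
	return fmt.Sprintf("--metrics-addr=%s", DefaultMetal3MetricsAddress)
}

func metal3ProxyUpstreamSketch() string {
	return fmt.Sprintf("http://localhost%s/", DefaultMetal3MetricsAddress)
}
```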
selector:
  matchLabels:
    k8s-app: controller
My concern is that it is going to try and scrape all of the Machine API controllers, not just the metal3 ones. When you tried this out, did you notice in Prometheus whether it was trying to scrape more than just the one pod? It might have left some errors in the Prometheus targets page showing that it couldn't scrape the machine-api-controllers pod.
Selector: map[string]string{
    "k8s-app": "controller",
},
Same concern as above: this is going to try and expose the port on the machine-api-controllers pod as well, isn't it? Or will this only come into effect when they're running in their own namespace?
If you check the endpoints for this service, which pods does it target?
If we changed the label on the metal3 deployment object (https://github.com/openshift/machine-api-operator/blob/master/pkg/operator/baremetal_pod.go#L280) to something that doesn't clash with other deployments in the MAO and updated the metal3ServiceMonitorDefinition to look for this label, would that address concerns regarding scraping the machine-api-controllers pod?
It would, but that field is immutable. So we would need enough logic in the MAO to delete the Deployment and recreate it when the label values are wrong. See #671 (comment)
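For context, here is a rough sketch of the extra handling the immutable selector would force on the MAO, using a plain client-go clientset and an invented helper name; this is not the operator's actual code:

```go
package operator

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	apiequality "k8s.io/apimachinery/pkg/api/equality"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// ensureMetal3DeploymentSketch is a hypothetical helper: because spec.selector
// is immutable in apps/v1, changing the label selector means deleting the
// existing Deployment and creating a replacement instead of updating in place.
func ensureMetal3DeploymentSketch(ctx context.Context, client kubernetes.Interface, desired *appsv1.Deployment) error {
	deployments := client.AppsV1().Deployments(desired.Namespace)

	existing, err := deployments.Get(ctx, desired.Name, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		_, err = deployments.Create(ctx, desired, metav1.CreateOptions{})
		return err
	}
	if err != nil {
		return err
	}

	if !apiequality.Semantic.DeepEqual(existing.Spec.Selector, desired.Spec.Selector) {
		// Selector changed: delete and recreate. Deletion is asynchronous, so
		// a real controller would requeue and retry the Create until the old
		// object is actually gone rather than creating immediately.
		policy := metav1.DeletePropagationForeground
		return deployments.Delete(ctx, desired.Name, metav1.DeleteOptions{PropagationPolicy: &policy})
	}

	// Same selector: a normal in-place update is enough.
	existing.Spec = desired.Spec
	_, err = deployments.Update(ctx, existing, metav1.UpdateOptions{})
	return err
}
```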
Force-pushed from cc76b64 to a7f7b58
pkg/operator/baremetal_pod.go (outdated)
  matchLabels:
    k8s-app: controller
endpoints:
- port: metrics
metal3-mtrc
pkg/operator/baremetal_pod.go (outdated)
Type: corev1.ServiceTypeNodePort,
Ports: []corev1.ServicePort{
    {
        Name: "metrics",
metal3-mtrc
pkg/operator/baremetal_pod.go (outdated)
Port: metal3ExposeMetricsPort,
TargetPort: intstr.IntOrString{
    Type: intstr.String,
    StrVal: "metrics",
metal3-mtrc
pkg/operator/baremetal_pod.go (outdated)
func newMetal3Containers(config *OperatorConfig, baremetalProvisioningConfig BaremetalProvisioningConfig) []corev1.Container {
    containers := []corev1.Container{
        newKubeProxyContainer(config.Controllers.KubeRBACProxy, "metrics",
metal3-mtrc
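The suggestions above rename the proxy's port from "metrics" to "metal3-mtrc". For readers outside the PR, here is a heavily hedged sketch of the kind of reusable kube-rbac-proxy container helper the PR description mentions; the function name, signature, and flag set are invented, since the real newKubeProxyContainer is only partially visible in the quoted diff:

```go
package operator

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// newKubeRBACProxyContainerSketch is a hypothetical stand-in for the shared
// helper referenced above. It builds a kube-rbac-proxy sidecar that exposes a
// named port and forwards to an upstream metrics endpoint on localhost.
func newKubeRBACProxyContainerSketch(image, portName string, exposedPort int32, upstreamAddr string) corev1.Container {
	return corev1.Container{
		Name:  fmt.Sprintf("kube-rbac-proxy-%s", portName),
		Image: image,
		Args: []string{
			// Typical kube-rbac-proxy flags; the exact flag set used by the
			// MAO may differ.
			fmt.Sprintf("--secure-listen-address=0.0.0.0:%d", exposedPort),
			fmt.Sprintf("--upstream=http://localhost%s/", upstreamAddr),
			"--tls-cert-file=/etc/tls/private/tls.crt",
			"--tls-private-key-file=/etc/tls/private/tls.key",
		},
		Ports: []corev1.ContainerPort{{
			// The port name ("metrics" vs "metal3-mtrc") is what the Service
			// targetPort and ServiceMonitor endpoint refer to.
			Name:          portName,
			ContainerPort: exposedPort,
		}},
	}
}
```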
/retest Please review the full test history for this PR and help us cut down flakes.
2 similar comments
/retest Please review the full test history for this PR and help us cut down flakes.
/retest Please review the full test history for this PR and help us cut down flakes.
/hold @dhellmann This is going to need a rebase to remove
include the necessary volumes and drop the old ContainerPort setting that exposed the metrics directly. Signed-off-by: Doug Hellmann <dhellmann@redhat.com>
Most of the Service and ServiceMonitor resources used for Prometheus data are created by manifests, but we only want to create the ones for metal3 if we are deploying metal3, so that is done in code. The operator therefore needs permission to work with the new types. Signed-off-by: Doug Hellmann <dhellmann@redhat.com>
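A minimal sketch of the gating described in that last commit message, with invented helper names and creator callbacks; it only illustrates "create the metal3 monitoring resources in code, and only on bare metal":

```go
package operator

import (
	"context"
	"fmt"

	osconfigv1 "github.com/openshift/api/config/v1"
)

// syncMetal3MonitoringSketch is hypothetical: unlike the other monitoring
// resources, which ship as static manifests, the metal3 Service and
// ServiceMonitor are created from code, and only when the platform is
// bare metal. The creator callbacks stand in for the real creation logic.
func syncMetal3MonitoringSketch(ctx context.Context, platform osconfigv1.PlatformType,
	createService, createServiceMonitor func(context.Context) error) error {

	if platform != osconfigv1.BareMetalPlatformType {
		// Nothing to do on non-baremetal platforms; the resources (and the
		// extra RBAC the operator needs for them) only matter for metal3.
		return nil
	}
	if err := createService(ctx); err != nil {
		return fmt.Errorf("creating metal3 metrics Service: %w", err)
	}
	if err := createServiceMonitor(ctx); err != nil {
		return fmt.Errorf("creating metal3 ServiceMonitor: %w", err)
	}
	return nil
}
```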
Force-pushed from d04fc03 to 922fb93
Rebased and
I'm not 100% sure if V(3) is the most appropriate log level, but it is consistent with other usages.
/lgtm
/retest
@dhellmann: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/retest
@JoelSpeed the required CI jobs are passing, and the Azure failures look like quota or other platform issues. I'm not sure what's up with GCP, but given that the job built a cluster and then a bunch of tests failed, I don't expect that it's related to this PR. I'm retesting, but I think it's probably safe to remove your hold?
I chatted with @elmiko about the hold on this, and he said it should be OK to remove it. I'm happy to rework anything in a follow-up, if that's necessary. /hold cancel
As the code has been updated with regard to glog, I am removing the hold after talking with Doug.
@dhellmann: All pull requests linked via external trackers have merged: Bugzilla bug 1820083 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Port: metal3ExposeMetricsPort,
TargetPort: intstr.IntOrString{
    Type: intstr.String,
    StrVal: "metal3-mtrc",
This doesn't appear to be a real IANA-registered service type. How does this work?