Bug 1820083: configure metal3 metrics collection #671

Merged

Conversation

dhellmann
Contributor

  • Refactor some of the proxy container setup so it can be reused.
  • Add a metal3-metrics Service
  • Add a metal3 ServiceMonitor

@openshift-ci-robot openshift-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. labels Aug 6, 2020
@openshift-ci-robot
Contributor

@dhellmann: This pull request references Bugzilla bug 1820083, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

In response to this:

WIP: Bug 1820083: configure metal3 metrics collection

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Aug 6, 2020
@dhellmann
Contributor Author

/cc @JoelSpeed

I don't think this is working right, but I'm putting it up for review so I can get some help understanding why.

@dhellmann
Contributor Author

/cc @sadasu @honza

@dhellmann dhellmann force-pushed the bz1820083-bmo-prometheus branch from f2b7c0e to 8a90498 on August 6, 2020
Comment on lines +58 to +96
selector:
matchLabels:
k8s-app: controller
Contributor

Is this not going to select the Machine API core components as well? Is there a specific label that is applied to only the metal3 components?

Contributor Author

There is not, right now. I'm exploring whether I can change that as part of this PR, in some local changes I haven't published yet. I tried with an earlier version, got an error, and undid the changes so I could make progress in other areas.

Contributor Author

It turns out that field is immutable, which means we have to delete and recreate the deployment to change it (at least as far as I can tell). It doesn't seem to hurt anything to have it this way for now, and we can do something about renaming when we move all of the metal3 logic into the CBO if it turns into a problem.

Contributor

My concern is that it is going to try to scrape all of the Machine API controllers, not just the metal3 ones. When you tried this out, did you notice in Prometheus whether it was trying to scrape more than just the one pod? It might have left errors in the Prometheus targets page showing that it couldn't scrape the machine-api-controllers pod.

Contributor Author

This new stuff will only be set up when the platform is "baremetal" though, right? So it will look at the cluster-api-provider-baremetal and metal3. Maybe there are other pods?

If we have to change the label, then I think we're going to have to defer fixing the bug to 4.7 or later, because we're trying to focus on building the new operator to manage metal3 and can't afford the time to build the logic to recreate the deployment with the new setting in the MAO.

Member
@enxebre Sep 16, 2020

Contributor Author

It is not possible to change the k8s-app label value in the Deployment.Spec.Selector.MatchLabels field. Attempting to change that with a patch like https://paste.centos.org/view/b0f842e8 causes the MAO to log an error every time it tries to update the Deployment:

W0916 15:42:54.324744 1050505 recorder_logging.go:46] &Event{ObjectMeta:{dummy.16355b11bbdfeaac  dummy    0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] []  []},
InvolvedObject:ObjectReference{Kind:Pod,Namespace:dummy,Name:dummy,UID:,APIVersion:v1,ResourceVersion:,FieldPath:,},
Reason:DeploymentUpdateFailed,
Message:Failed to update Deployment.apps/metal3 -n openshift-machine-api: Deployment.apps "metal3" is invalid: spec.selector: Invalid value: 
v1.LabelSelector{MatchLabels:map[string]string{"api":"clusterapi", "k8s-app":"metal3"}, 
MatchExpressions:[]v1.LabelSelectorRequirement(nil)} field is immutable,
Source:EventSource{Component:,Host:,}, 
FirstTimestamp:2020-09-16 15:42:54.324665004 -0400 EDT m=+105.798620314,
LastTimestamp:2020-09-16 15:42:54.324665004 -0400 EDT m=+105.798620314,
Count:1,Type:Warning,
EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:,ReportingInstance:,}

I suspect that means adding a different label is going to result in the same error, because it would mean changing the same field in the Deployment.

Member

https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#label-selector-updates

"In API version apps/v1, a Deployment's label selector is immutable after it gets created."

So yeah. This feels pretty wrong though, as requests will go to a bunch of things we aren't intending.

Contributor Author

They don't, though. The label selector is the same, but the port is not, and both are used to filter. Only the metal3 deployment matches both.
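
To make that two-stage filtering concrete, here is a minimal sketch in the style of the operator's Go code. The service name, namespace, selector, NodePort type, and the "metal3-mtrc" port name all appear in this PR; the numeric service port and the helper name are illustrative assumptions, not the PR's exact values.

package operator

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// newMetal3MetricsServiceSketch illustrates how the shared label selector
// and the named target port combine to pick out a single pod.
func newMetal3MetricsServiceSketch() *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "metal3-metrics",
			Namespace: "openshift-machine-api",
		},
		Spec: corev1.ServiceSpec{
			Type: corev1.ServiceTypeNodePort,
			// Filter 1: the label selector matches every pod labelled
			// k8s-app=controller, including machine-api-controllers.
			Selector: map[string]string{"k8s-app": "controller"},
			Ports: []corev1.ServicePort{{
				Name: "metrics",
				Port: 8445, // illustrative; not the PR's exact value
				// Filter 2: a named target port is resolved per pod by the
				// endpoints controller, and pods whose containers do not
				// declare a port named "metal3-mtrc" get no endpoint for it.
				// Only the metal3 pod defines that name, so only it is
				// scraped even though the selector matches more pods.
				TargetPort: intstr.FromString("metal3-mtrc"),
			}},
		},
	}
}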

Comment on lines +178 to +330
Selector: map[string]string{
"k8s-app": "controller",
},
Contributor

Is this not going to select the Machine API core components as well? Is there a specific label that is applied to only the metal3 components?

Contributor

Same concern as above: this is going to try to expose the port on the machine-api-controllers pod as well, isn't it? Or will this only come into effect when they're running in their own namespace?

If you check the endpoints for this service, which pods is it targeting?

Contributor Author

I see it includes the machine-api-controllers pod as well as the metal3 pod. Those both run in the openshift-machine-api namespace.

Contributor

In that case, we need to verify that Prometheus, and in particular its alerts, are happy. IIRC there's an alert that fires if any ServiceMonitor target is down; I would expect machine-api-controllers to show as down in this case, since it won't have the correct port open?

Member

This could maybe match a specific label that is added/replaced in the baremetal deployment, so only those endpoints are covered by the service.

@JoelSpeed
Contributor

@dhellmann Can you elaborate on why you don't think it's working? From the code, there's nothing obvious that screams out that it shouldn't be working.

@dhellmann
Contributor Author

@dhellmann Can you elaborate on why you don't think it's working? From the code, there's nothing obvious that screams out that it shouldn't be working.

Well, when I replace the MAO in my cluster with this code, Prometheus doesn't report any of the metrics I expect it to report. I'm currently rebuilding the cluster to try that again.

@JoelSpeed
Contributor

Well, when I replace the MAO in my cluster with this code, Prometheus doesn't report any of the metrics I expect it to report. I'm currently rebuilding the cluster to try that again.

Haha ok, I guess we just have to start debugging from the ground up on this one to find where the fault is. I'm trying to bring up a cluster-bot cluster with these changes; if that works, I'll play around with it and see if I can advise. Otherwise I don't have access to a metal cluster to try this on.

@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 11, 2020
@dhellmann dhellmann force-pushed the bz1820083-bmo-prometheus branch from 8a90498 to d851aa7 on September 8, 2020
@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 8, 2020
@dhellmann dhellmann force-pushed the bz1820083-bmo-prometheus branch from d851aa7 to cc76b64 on September 9, 2020
@dhellmann dhellmann changed the title WIP: Bug 1820083: configure metal3 metrics collection Bug 1820083: configure metal3 metrics collection Sep 9, 2020
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 9, 2020
@dhellmann
Contributor Author

I've rebased the PR and re-tested locally and it seems to be working fine.

@dhellmann
Contributor Author

/retest

@dhellmann dhellmann requested a review from JoelSpeed September 9, 2020 18:25
Contributor
@JoelSpeed left a comment

Still concerned about this conflicting with the other pods in the namespace. Any idea if there were issues when testing this?

@@ -14,6 +14,7 @@ const (
DefaultHealthCheckMetricsAddress = ":8083"
DefaultMachineSetMetricsAddress = ":8082"
DefaultMachineMetricsAddress = ":8081"
DefaultMetal3MetricsAddress = ":60000"
Contributor

It would be good to know the motivation behind choosing this number. Any reason not to try 8085? (To match the exposed port.)

Contributor Author

Part of the motivation was not changing the port where it was already listening. The pod uses host networking, so it needs to be on a port where it won't conflict with anything else. That port is already being used (https://github.com/openshift/machine-api-operator/blob/master/pkg/operator/baremetal_pod.go#L407), so this is just adding a constant to share the value across the multiple places that now need it.
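
As a rough sketch of the proxy-container reuse being described: kube-rbac-proxy's real --secure-listen-address and --upstream flags let a sidecar serve TLS on a new named port while the metrics server keeps listening on its existing host-network port. Only the :60000 upstream address comes from the PR; the helper name, the 8445 listen port, and the omitted TLS configuration are assumptions.

package operator

import corev1 "k8s.io/api/core/v1"

const DefaultMetal3MetricsAddress = ":60000"

// newMetricsProxyContainerSketch is a hypothetical helper showing the
// shape of the kube-rbac-proxy sidecar; it is not the PR's exact code.
func newMetricsProxyContainerSketch(image string) corev1.Container {
	return corev1.Container{
		Name:  "kube-rbac-proxy-metal3",
		Image: image,
		Args: []string{
			// The TLS endpoint Prometheus scrapes, reached through the
			// named container port below. 8445 is illustrative.
			"--secure-listen-address=0.0.0.0:8445",
			// The metrics server keeps listening where it always has;
			// because the pod uses host networking, this port must not
			// collide with anything else on the node.
			"--upstream=http://localhost" + DefaultMetal3MetricsAddress + "/",
		},
		Ports: []corev1.ContainerPort{{
			Name:          "metal3-mtrc", // the named port the Service targets
			ContainerPort: 8445,
		}},
	}
}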


Contributor
@sadasu left a comment

If we changed the label on the metal3 deployment object (https://github.com/openshift/machine-api-operator/blob/master/pkg/operator/baremetal_pod.go#L280) to something that doesn't clash with other deployments in the MAO, and updated the metal3ServiceMonitorDefinition to look for this label, would that address the concerns regarding scraping the machine-api-controllers pod?

@dhellmann
Contributor Author

If we changed the label on the metal3 deployment object (https://github.com/openshift/machine-api-operator/blob/master/pkg/operator/baremetal_pod.go#L280) to something that doesn't clash with other deployments in the MAO, and updated the metal3ServiceMonitorDefinition to look for this label, would that address the concerns regarding scraping the machine-api-controllers pod?

It would, but that field is immutable. So we would need enough logic in the MAO to delete the Deployment and recreate it when the label values are wrong. See #671 (comment)
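
For a sense of the work being deferred: because spec.selector is immutable in apps/v1, changing it means deleting the Deployment and creating a replacement. A minimal sketch of that logic, using real client-go calls but a hypothetical helper name:

package operator

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	apiequality "k8s.io/apimachinery/pkg/api/equality"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// ensureDeploymentSelector sketches the extra MAO logic being discussed;
// it is not code from this PR.
func ensureDeploymentSelector(ctx context.Context, c kubernetes.Interface, desired *appsv1.Deployment) error {
	deployments := c.AppsV1().Deployments(desired.Namespace)
	existing, err := deployments.Get(ctx, desired.Name, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		_, err = deployments.Create(ctx, desired, metav1.CreateOptions{})
		return err
	}
	if err != nil {
		return err
	}
	if !apiequality.Semantic.DeepEqual(existing.Spec.Selector, desired.Spec.Selector) {
		// spec.selector is immutable, so the only way to change it is to
		// delete the Deployment (and its pods, via foreground propagation)
		// and recreate it. This is disruptive, and the recreate may need
		// to be retried until the foreground deletion completes.
		policy := metav1.DeletePropagationForeground
		if err := deployments.Delete(ctx, desired.Name, metav1.DeleteOptions{PropagationPolicy: &policy}); err != nil {
			return err
		}
		_, err = deployments.Create(ctx, desired, metav1.CreateOptions{})
		return err
	}
	// Otherwise a normal update suffices.
	_, err = deployments.Update(ctx, desired, metav1.UpdateOptions{})
	return err
}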

@dhellmann dhellmann force-pushed the bz1820083-bmo-prometheus branch from cc76b64 to a7f7b58 on September 28, 2020
matchLabels:
k8s-app: controller
endpoints:
- port: metrics
Contributor Author

metal3-mtrc

Type: corev1.ServiceTypeNodePort,
Ports: []corev1.ServicePort{
{
Name: "metrics",
Contributor Author

metal3-mtrc

Port: metal3ExposeMetricsPort,
TargetPort: intstr.IntOrString{
Type: intstr.String,
StrVal: "metrics",
Contributor Author

metal3-mtrc


func newMetal3Containers(config *OperatorConfig, baremetalProvisioningConfig BaremetalProvisioningConfig) []corev1.Container {
containers := []corev1.Container{
newKubeProxyContainer(config.Controllers.KubeRBACProxy, "metrics",
Contributor Author

metal3-mtrc

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

2 similar comments
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@JoelSpeed
Contributor

/hold

@dhellmann This is going to need a rebase to remove glog in favour of klog, as a PR just merged to upgrade from one to the other. The import for klog is k8s.io/klog/v2 if you need it; it should hopefully just be a case of updating it in these locations, looking at the test failures:

# github.com/openshift/machine-api-operator/pkg/operator [github.com/openshift/machine-api-operator/pkg/operator.test]
pkg/operator/sync.go:329:3: undefined: glog
pkg/operator/sync.go:339:3: undefined: glog
# github.com/openshift/machine-api-operator/pkg/operator
pkg/operator/sync.go:329:3: undefined: glog
pkg/operator/sync.go:339:3: undefined: glog
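
The mechanical fix is just swapping the import and the call sites. A sketch, with a hypothetical helper and illustrative messages (the real call sites are in pkg/operator/sync.go and their messages are not shown here):

package operator

// Before (removed):
//
//	import "github.com/golang/glog"
//	glog.Errorf("...: %v", err)
//
// After:
import "k8s.io/klog/v2"

func logSyncResult(err error) {
	if err != nil {
		klog.Errorf("error syncing resources: %v", err) // illustrative message
		return
	}
	// V(3) matches the verbosity guessed from nearby calls, per the
	// discussion later in the thread.
	klog.V(3).Info("resources synced")
}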

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 30, 2020
include the necessary volumes and drop the old ContainerPort setting
that exposed the metrics directly

Signed-off-by: Doug Hellmann <dhellmann@redhat.com>
Most of the Service and ServiceMonitor resources used for prometheus
data are created by manifests but we only want to create the ones for
metal3 if we are deploying metal3, so that is done in code. The operator
therefore needs permission to work with the new types.

Signed-off-by: Doug Hellmann <dhellmann@redhat.com>
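
A hedged sketch of the gating that commit message describes; configv1.BareMetalPlatformType is the real OpenShift platform constant, while the function name and apply helpers are hypothetical stand-ins:

package operator

import (
	"context"
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
)

// applyFn stands in for the operator's generic resource-apply helpers.
type applyFn func(ctx context.Context) error

// syncMetal3Monitoring creates the Service and ServiceMonitor from code
// rather than from static manifests, so they exist only on baremetal.
// The operator's service account needs RBAC for services and for
// monitoring.coreos.com servicemonitors for the applies to succeed.
func syncMetal3Monitoring(ctx context.Context, platform configv1.PlatformType, applyService, applyServiceMonitor applyFn) error {
	if platform != configv1.BareMetalPlatformType {
		return nil // every other platform's monitoring comes from manifests
	}
	if err := applyService(ctx); err != nil {
		return fmt.Errorf("creating metal3-metrics Service: %w", err)
	}
	if err := applyServiceMonitor(ctx); err != nil {
		return fmt.Errorf("creating metal3 ServiceMonitor: %w", err)
	}
	return nil
}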
@dhellmann dhellmann force-pushed the bz1820083-bmo-prometheus branch from d04fc03 to 922fb93 on October 30, 2020
@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Oct 30, 2020
@dhellmann
Contributor Author

Rebased and glog removed. I guessed at the verbosity level based on the other calls in the same file, but let me know if I guessed wrong.

Contributor
@elmiko left a comment

I'm not 100% sure if V(3) is the most appropriate log level, but it is consistent with other usages.
/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 30, 2020
@dhellmann
Contributor Author

/retest

@openshift-merge-robot
Contributor

openshift-merge-robot commented Nov 2, 2020

@dhellmann: The following tests failed, say /retest to rerun all failed tests:

Test name                   Commit   Rerun command
ci/prow/e2e-azure           922fb93  /test e2e-azure
ci/prow/e2e-azure-operator  922fb93  /test e2e-azure-operator
ci/prow/e2e-gcp             922fb93  /test e2e-gcp

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@dhellmann
Contributor Author

/retest

@JoelSpeed the required CI jobs are passing and the Azure failures look like quota or other platform issues. I'm not sure what's up with GCP, but given that the job built a cluster and then a bunch of tests failed I don't expect that it's related to this PR. I'm retesting, but I think it's probably safe to remove your hold?

@dhellmann
Contributor Author

I chatted with @elmiko about the hold on this, and he said it should be OK to remove it. I'm happy to rework anything in a follow-up, if that's necessary.

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 3, 2020
@elmiko
Contributor

elmiko commented Nov 3, 2020

As the code has been updated with regard to glog, I am removing the hold after talking with Doug.
/hold cancel

@openshift-merge-robot openshift-merge-robot merged commit 6bc54bc into openshift:master Nov 3, 2020
@openshift-ci-robot
Contributor

@dhellmann: All pull requests linked via external trackers have merged:

Bugzilla bug 1820083 has been moved to the MODIFIED state.

In response to this:

Bug 1820083: configure metal3 metrics collection

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Port: metal3ExposeMetricsPort,
TargetPort: intstr.IntOrString{
Type: intstr.String,
StrVal: "metal3-mtrc",
Member
@zaneb Nov 16, 2020

This doesn't appear to be a real IANA-registered service type. How does this work?
