OCPBUGS-25766: manifests/0000_31_cluster-baremetal-operator_06_deployment: Enable leader election #395
Conversation
@wking: This pull request references Jira Issue OCPBUGS-25766, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug.
No GitHub users were found matching the public email listed for the QA contact in Jira (jhajyahy@redhat.com), skipping review request. The bug has been updated to refer to the pull request using the external bug tracker.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: wking, zaneb. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Force-pushed from a409b95 to 4e4519d (Compare)
/lgtm
Force-pushed from 4e4519d to 3be7317 (Compare)
Oops, thanks. I've pushed:

$ git --no-pager diff 4e4519d6..3be7317c
diff --git a/main.go b/main.go
index 4846efd1..7af5a72c 100644
--- a/main.go
+++ b/main.go
@@ -98,7 +98,9 @@ func main() {
 		}),
 		NewCache: cache.MultiNamespacedCacheBuilder(
 			[]string{controllers.ComponentNamespace, provisioning.OpenshiftConfigNamespace}),
-		LeaderElection: enableLeaderElection,
+		LeaderElection:          enableLeaderElection,
+		LeaderElectionID:        "cluster-baremetal-operator",
+		LeaderElectionNamespace: os.Getenv("COMPONENT_NAMESPACE"),
 	})
 	if err != nil {
 		klog.ErrorS(err, "unable to start manager")

And added a paragraph to the 3be7317 commit message talking myself through that change.
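For context beyond the diff hunk, here is a minimal, self-contained sketch of the post-patch manager wiring, assuming only the controller-runtime Options fields shown above (the real main.go also configures the multi-namespace cache, webhooks, and metrics, which are omitted here):

package main

import (
	"os"

	"k8s.io/klog/v2"
	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// In the real operator this comes from the -enable-leader-election flag.
	enableLeaderElection := true

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		// Only one replica at a time may hold the Lease named by
		// LeaderElectionID in LeaderElectionNamespace.
		LeaderElection:          enableLeaderElection,
		LeaderElectionID:        "cluster-baremetal-operator",
		LeaderElectionNamespace: os.Getenv("COMPONENT_NAMESPACE"),
	})
	if err != nil {
		klog.ErrorS(err, "unable to start manager")
		os.Exit(1)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		klog.ErrorS(err, "problem running manager")
		os.Exit(1)
	}
}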
manifests/0000_31_cluster-baremetal-operator_06_deployment: Enable leader election

The option has been available for years:

$ git blame main.go | grep enable-leader-election
dcbe86f (Sandhya Dasu 2020-08-18 21:09:29 -0400 72) flag.BoolVar(&enableLeaderElection, "enable-leader-election", false,

and without it overlapping operator pods can crash-loop [1]:

: [sig-cluster-lifecycle] pathological event should not see excessive Back-off restarting failed containers 0s
{ event [namespace/openshift-machine-api node/ip-10-0-62-147.us-west-2.compute.internal pod/cluster-baremetal-operator-574577fbcb-z8nd4 hmsg/bf39bb17ae - Back-off restarting failed container cluster-baremetal-operator in pod cluster-baremetal-operator-574577fbcb-z8nd4_openshift-machine-api(441969c1-b430-412c-b67f-4ae2f7797f4f)] happened 26 times
event [namespace/openshift-machine-api node/ip-10-0-62-147.us-west-2.compute.internal pod/cluster-baremetal-operator-574577fbcb-z8nd4 hmsg/bf39bb17ae - Back-off restarting failed container cluster-baremetal-operator in pod cluster-baremetal-operator-574577fbcb-z8nd4_openshift-machine-api(441969c1-b430-412c-b67f-4ae2f7797f4f)] happened 51 times}

while fighting each other over the same ClusterOperator status:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-aws-ovn-upgrade/1737335551998038016/artifacts/e2e-aws-ovn-upgrade/gather-audit-logs/artifacts/audit-logs.tar | tar -xz --strip-components=2
$ zgrep -h '"resource":"clusteroperators","name":"baremetal"' kube-apiserver/*audit*log.gz | jq -r 'select(.verb == "create" or .verb == "update") | .stageTimestamp + " " + .verb + " " + (.responseStatus.code | tostring) + " " + (.objectRef.subresource) + " " + .user.username + " " + .user.extra["authentication.kubernetes.io/pod-name"][0]' | grep 'T06:08:.*cluster-baremetal-operator' | sort
2023-12-20T06:08:21.757799Z update 200 status system:serviceaccount:openshift-machine-api:cluster-baremetal-operator cluster-baremetal-operator-574577fbcb-z8nd4
2023-12-20T06:08:21.778638Z update 200 status system:serviceaccount:openshift-machine-api:cluster-baremetal-operator cluster-baremetal-operator-7fbb57959b-s9v9g
2023-12-20T06:08:21.780378Z update 409 status system:serviceaccount:openshift-machine-api:cluster-baremetal-operator cluster-baremetal-operator-574577fbcb-z8nd4
2023-12-20T06:08:21.790000Z update 200 status system:serviceaccount:openshift-machine-api:cluster-baremetal-operator cluster-baremetal-operator-7fbb57959b-s9v9g
2023-12-20T06:08:21.802780Z update 200 status system:serviceaccount:openshift-machine-api:cluster-baremetal-operator cluster-baremetal-operator-7fbb57959b-s9v9g

Using a leader lock will avoid this contention, and the system should be able to coast through brief moments after an outgoing leader leaves until a replacement leader picks things back up.

I'm also setting a Recreate strategy [2], because:

1. Incoming pod surged by the default Deployment strategy.
2. Incoming pod attempts to acquire the Lease, but the outgoing pod is holding it.
3. Outgoing pod releases the lease and exits.
4. Incoming pod tries again, and this time acquires the lease.

can be slow in the 3-to-4 phase, while:

1. Outgoing pod releases the lease and exits.
2. Incoming pod created, scheduled, and acquires the lease.

tends to be faster. And again, the component should be able to coast through small durations without a functioning leader.

See openshift/machine-config-operator@7530ded86 (install: Recreate and delayed default ServiceAccount deletion, 2023-08-29, openshift/machine-config-operator#3895) for another example of how Recreate can help that way.

To get the leader election working, I'm also setting LeaderElectionID and LeaderElectionNamespace, pattern-matching from [3]. Using cluster-baremetal-operator as the ID avoids collisions with baremetal-operator [4]. The COMPONENT_NAMESPACE environment variable is from:

- name: COMPONENT_NAMESPACE
  valueFrom:
    fieldRef:
      apiVersion: v1
      fieldPath: metadata.namespace

using the pattern documented in [5]. We've been asking for that environment variable since f368a4f (Add sa, deployment and co manifests, 2020-09-29, openshift#25), although grepping through history, I don't see anything consuming it. Perhaps it's somehow feeding controllers.ComponentNamespace, which is what I'm using.

[1]: https://issues.redhat.com/browse/OCPBUGS-25766
[2]: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#recreate-deployment
[3]: https://github.com/openshift/baremetal-operator/blob/749609dd0fb858be72a783c5d87853cad2d303db/main.go#L207-L208
[4]: https://github.com/openshift/baremetal-operator/blob/749609dd0fb858be72a783c5d87853cad2d303db/main.go#L71
[5]: https://kubernetes.io/docs/tasks/inject-data-application/environment-variable-expose-pod-information/
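For readers who haven't used it: the leader lock the manager takes is a coordination.k8s.io/v1 Lease, and controller-runtime drives it with client-go's leaderelection package. A hedged sketch of roughly what happens under the hood, where the package name, identity string, and timing values are illustrative rather than the operator's actual settings:

package leaderutil // illustrative package name

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// RunWhenLeader blocks until this process wins the election, then calls run.
func RunWhenLeader(cfg *rest.Config, run func(context.Context)) error {
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return err
	}
	// The lock's name and namespace correspond to LeaderElectionID and
	// LeaderElectionNamespace in the diff above.
	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{
			Name:      "cluster-baremetal-operator",
			Namespace: "openshift-machine-api",
		},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: "example-pod-identity"},
	}
	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:            lock,
		LeaseDuration:   15 * time.Second, // how long a crashed leader blocks a successor
		RenewDeadline:   10 * time.Second,
		RetryPeriod:     2 * time.Second,
		ReleaseOnCancel: true, // hand the Lease off on clean shutdown
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: run,
			OnStoppedLeading: func() { /* exit so a replacement can take over */ },
		},
	})
	return nil
}

The release-on-clean-shutdown behavior is what the Recreate reasoning above leans on: an outgoing leader that releases the Lease lets its successor acquire it immediately, while a crashed leader blocks successors for up to LeaseDuration.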
Force-pushed from 3be7317 to 0cfaf39 (Compare)
More pattern-matching consistency:

$ git --no-pager diff 3be7317c..0cfaf393
diff --git a/main.go b/main.go
index 7af5a72c..e6bb56d5 100644
--- a/main.go
+++ b/main.go
@@ -100,7 +100,7 @@ func main() {
 			[]string{controllers.ComponentNamespace, provisioning.OpenshiftConfigNamespace}),
 		LeaderElection:          enableLeaderElection,
 		LeaderElectionID:        "cluster-baremetal-operator",
-		LeaderElectionNamespace: os.Getenv("COMPONENT_NAMESPACE"),
+		LeaderElectionNamespace: controllers.ComponentNamespace,
 	})
 	if err != nil {
 		klog.ErrorS(err, "unable to start manager")
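This assumes the existing package-level constant, roughly (a sketch; the value is inferred from the openshift-machine-api lease errors quoted below):

// Sketch of the constant the diff references; the actual declaration
// lives in the operator's controllers package.
package controllers

const ComponentNamespace = "openshift-machine-api"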
Hrm, I need to extend RBAC:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-baremetal-operator/395/pull-ci-openshift-cluster-baremetal-operator-master-e2e-metal-ipi-ovn-ipv6/1737877679555743744/artifacts/e2e-metal-ipi-ovn-ipv6/gather-extra/artifacts/pods/openshift-machine-api_cluster-baremetal-operator-f9696655-p5dxv_cluster-baremetal-operator.log | tail -n1
E1221 19:31:10.136115 1 leaderelection.go:332] error retrieving resource lock openshift-machine-api/cluster-baremetal-operator: leases.coordination.k8s.io "cluster-baremetal-operator" is forbidden: User "system:serviceaccount:openshift-machine-api:cluster-baremetal-operator" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "openshift-machine-api"
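The missing permission corresponds to a Role rule along these lines, expressed with k8s.io/api/rbac/v1 types (a sketch with an illustrative package name; the verb list mirrors what the leases-capable Roles found later in this thread grant):

package manifests // illustrative package name

import rbacv1 "k8s.io/api/rbac/v1"

// leaseRule is the kind of Role rule the cluster-baremetal-operator
// ServiceAccount needs before it can get, renew, and release its
// leader-election Lease.
var leaseRule = rbacv1.PolicyRule{
	APIGroups: []string{"coordination.k8s.io"},
	Resources: []string{"leases"},
	Verbs:     []string{"get", "list", "watch", "create", "update", "patch", "delete"},
}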
Hrm, this preference for sharding out over a number of Roles and RoleBindings will make this a bit awkward. I'm not sure if I should:
Force-pushed from 8f95164 to f8e3ac9 (Compare)
Trying to understand what's going on with ServiceAccounts and RBAC in the openshift-machine-api namespace:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-baremetal-operator/395/pull-ci-openshift-cluster-baremetal-operator-master-e2e-agnostic-ovn/1737988020168036352/artifacts/e2e-agnostic-ovn/gather-extra/artifacts/pods.json | jq -c '.items[] | select(.metadata.namespace == "openshift-machine-api") | {name: .metadata.name, serviceAccountName: .spec.serviceAccountName}'
{"name":"cluster-autoscaler-operator-6878b89c58-7kf6d","serviceAccountName":"cluster-autoscaler-operator"}
{"name":"cluster-baremetal-operator-5c57b874f5-s9zmq","serviceAccountName":"cluster-baremetal-operator"}
{"name":"control-plane-machine-set-operator-7d5dcfcb76-cshv4","serviceAccountName":"control-plane-machine-set-operator"}
{"name":"machine-api-controllers-54c98b4bb4-sgvhd","serviceAccountName":"machine-api-controllers"}
{"name":"machine-api-operator-7fb4f8c4fb-7gr2t","serviceAccountName":"machine-api-operator"} So each pod has its own ServiceAccount. Good. Checking on bound roles: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-baremetal-operator/395/pull-ci-openshift-cluster-baremetal-operator-master-e2e-agnostic-ovn/1737988020168036352/artifacts/e2e-agnostic-ovn/gather-extra/artifacts/rolebindings.json | jq -c '.items[] | select(.metadata.namespace == "openshift-machine-api") | {name: .metadata.name, subjects, roleRef, description: .metadata.annotations["openshift.io/description"]}'
{"name":"cluster-autoscaler","subjects":[{"kind":"ServiceAccount","name":"cluster-autoscaler","namespace":"openshift-machine-api"}],"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"Role","name":"cluster-autoscaler"},"description":null}
{"name":"cluster-autoscaler-operator","subjects":[{"kind":"ServiceAccount","name":"cluster-autoscaler-operator","namespace":"openshift-machine-api"}],"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"Role","name":"cluster-autoscaler-operator"},"description":null}
{"name":"cluster-baremetal-operator","subjects":[{"kind":"ServiceAccount","name":"cluster-baremetal-operator","namespace":"openshift-machine-api"}],"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"Role","name":"cluster-baremetal-operator"},"description":null}
{"name":"control-plane-machine-set-operator","subjects":[{"kind":"ServiceAccount","name":"control-plane-machine-set-operator","namespace":"openshift-machine-api"}],"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"Role","name":"control-plane-machine-set-operator"},"description":null}
{"name":"machine-api-controllers","subjects":[{"kind":"ServiceAccount","name":"machine-api-controllers","namespace":"openshift-machine-api"}],"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"Role","name":"machine-api-controllers"},"description":null}
{"name":"machine-api-operator","subjects":[{"kind":"ServiceAccount","name":"machine-api-operator","namespace":"openshift-machine-api"}],"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"Role","name":"machine-api-operator"},"description":null}
{"name":"prometheus-k8s-cluster-autoscaler-operator","subjects":[{"kind":"ServiceAccount","name":"prometheus-k8s","namespace":"openshift-monitoring"}],"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"Role","name":"prometheus-k8s-cluster-autoscaler-operator"},"description":null}
{"name":"prometheus-k8s-cluster-baremetal-operator","subjects":[{"kind":"ServiceAccount","name":"prometheus-k8s","namespace":"openshift-monitoring"}],"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"Role","name":"prometheus-k8s-cluster-baremetal-operator"},"description":null}
{"name":"prometheus-k8s-machine-api-operator","subjects":[{"kind":"ServiceAccount","name":"prometheus-k8s","namespace":"openshift-monitoring"}],"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"Role","name":"prometheus-k8s-machine-api-operator"},"description":null}
{"name":"system:deployers","subjects":[{"kind":"ServiceAccount","name":"deployer","namespace":"openshift-machine-api"}],"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"ClusterRole","name":"system:deployer"},"description":"Allows deploymentconfigs in this namespace to rollout pods in this namespace. It is auto-managed by a controller; remove subjects to disable."}
{"name":"system:image-builders","subjects":[{"kind":"ServiceAccount","name":"builder","namespace":"openshift-machine-api"}],"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"ClusterRole","name":"system:image-builder"},"description":"Allows builds in this namespace to push images to this namespace. It is auto-managed by a controller; remove subjects to disable."}
{"name":"system:image-pullers","subjects":[{"apiGroup":"rbac.authorization.k8s.io","kind":"Group","name":"system:serviceaccounts:openshift-machine-api"}],"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"ClusterRole","name":"system:image-puller"},"description":"Allows all pods in this namespace to pull images from this namespace. It is auto-managed by a controller; remove subjects to disable."}
And it seems a bit strange that […]

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-baremetal-operator/395/pull-ci-openshift-cluster-baremetal-operator-master-e2e-agnostic-ovn/1737988020168036352/artifacts/e2e-agnostic-ovn/gather-extra/artifacts/roles.json | jq -c '.items[] | select(.metadata.namespace == "openshift-machine-api") | ([.rules[] | select(.resources | tostring | contains("leases"))]) | select(length > 0)' | sort | uniq -c
1 [{"apiGroups":["coordination.k8s.io"],"resources":["leases"],"verbs":["get","list","create","update"]}]
3 [{"apiGroups":["coordination.k8s.io"],"resources":["leases"],"verbs":["get","list","watch","create","update","patch","delete"]}]
1 [{"apiGroups":[""],"resources":["configmaps","leases","secrets","services"],"verbs":["create","delete","get","list","patch","update","watch"]}] Aha, I'd missed |
…ement to Role

Avoid:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-baremetal-operator/395/pull-ci-openshift-cluster-baremetal-operator-master-e2e-metal-ipi-ovn-ipv6/1737877679555743744/artifacts/e2e-metal-ipi-ovn-ipv6/gather-extra/artifacts/pods/openshift-machine-api_cluster-baremetal-operator-f9696655-p5dxv_cluster-baremetal-operator.log | tail -n1
E1221 19:31:10.136115 1 leaderelection.go:332] error retrieving resource lock openshift-machine-api/cluster-baremetal-operator: leases.coordination.k8s.io "cluster-baremetal-operator" is forbidden: User "system:serviceaccount:openshift-machine-api:cluster-baremetal-operator" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "openshift-machine-api"

I'm also removing some old (and previously unused) leader-election kustomize manifests. It's hard for me to imagine someone needing that kind of composability out of this operator, and collapsing into a single role both simplifies maintenance and removes the need to use unwieldy names like cluster-baremetal-operator-leader-election-role to stand out vs. the other leader-election roles that the openshift-machine-api namespace might hold.

The patchesJson6902 Kustomize syntax is documented in [1]:

  Patches can be used to apply different customizations to Resources. Kustomize supports different patching mechanisms through patchesStrategicMerge and patchesJson6902. ... Not all Resources or fields support strategic merge patches. To support modifying arbitrary fields in arbitrary Resources, Kustomize offers applying JSON patch [2] through patchesJson6902. To find the correct Resource for a Json patch, the group, version, kind and name of that Resource need to be specified in kustomization.yaml. For example, increasing the replica number of a Deployment object can also be done through patchesJson6902.

The /rules/- path means "append the value to the existing rules array", as described in [3,4].

[1]: https://kubernetes.io/docs/tasks/manage-kubernetes-objects/kustomization/#customizing
[2]: https://datatracker.ietf.org/doc/html/rfc6902
[3]: https://datatracker.ietf.org/doc/html/rfc6902#section-4.1
[4]: https://datatracker.ietf.org/doc/html/rfc6902#appendix-A.16
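To make the RFC 6902 append semantics concrete, here is a hedged sketch using the third-party github.com/evanphx/json-patch library (not Kustomize itself; the Role document is trimmed down) showing how an add op with a /rules/- path appends a leases rule instead of replacing the array:

package main

import (
	"fmt"

	jsonpatch "github.com/evanphx/json-patch/v5"
)

func main() {
	// A trimmed-down Role document with one existing rule.
	role := []byte(`{"kind":"Role","rules":[{"apiGroups":[""],"resources":["configmaps"],"verbs":["get"]}]}`)

	// "-" addresses the position after the last array element (RFC 6902
	// section 4.1), so "add" with /rules/- appends rather than replaces.
	patch, err := jsonpatch.DecodePatch([]byte(`[
		{"op": "add", "path": "/rules/-", "value": {
			"apiGroups": ["coordination.k8s.io"],
			"resources": ["leases"],
			"verbs": ["get", "list", "watch", "create", "update", "patch", "delete"]}}
	]`))
	if err != nil {
		panic(err)
	}

	patched, err := patch.Apply(role)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(patched)) // rules now has two entries
}

Kustomize's patchesJson6902 performs the same operation declaratively at manifest-build time.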
Force-pushed from f8e3ac9 to 43f9adb (Compare)
… SetupWithManager

SetupWithManager(mgr) is called before mgr.Start() in main(), so it's running before the leader lease has been acquired. We don't want to be writing to anything before we acquire the lease, because that could cause contention with an actual lease-holder that is managing those resources. We can still perform read-only prechecks in SetupWithManager; instead of writing there, we patiently wait until we have the lease, and update ClusterOperator (and anything else we manage) in the Reconcile() function.

This partially rolls back 2e9d117 (Ensure baremetal CO is completely setup before Reconcile, 2020-11-30, openshift#81), but that addition predated ensureClusterOperator being added early in Reconcile in 4f2d314 (Make sure ensureClusterOperator() is called before its status is updated, 2020-12-15, openshift#71):

$ git log --oneline | grep -n '2e9d1177\|4f2d3141'
468:4f2d3141 Make sure ensureClusterOperator() is called before its status is updated
506:2e9d1177 Ensure baremetal CO is completely setup before Reconcile

So the ensureClusterOperator call in SetupWithManager is no longer needed.

And this partially rolls back 8798044 (Handle case when Provisioning CR is absent on the Baremetal platform, 2020-11-30, openshift#81). That "we're enabled, but there isn't a Provisioning custom resource yet" handling happens continually in Reconcile (where the write will be protected by the operator holding the lease).

Among other improvements, this change will prevent a nominally-successful install where the operator never acquired a lease [1]:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-baremetal-operator/395/pull-ci-openshift-cluster-baremetal-operator-master-e2e-agnostic-ovn/1737988020168036352/artifacts/e2e-agnostic-ovn/gather-extra/artifacts/pods/openshift-machine-api_cluster-baremetal-operator-5c57b874f5-s9zmq_cluster-baremetal-operator.log >cbo.log
$ head -n4 cbo.log
I1222 01:05:34.274563 1 listener.go:44] controller-runtime/metrics "msg"="Metrics server is starting to listen" "addr"=":8080"
I1222 01:05:34.318283 1 webhook.go:104] WebhookDependenciesReady: everything ready for webhooks
I1222 01:05:34.403202 1 clusteroperator.go:217] "new CO status" reason="WaitingForProvisioningCR" processMessage="" message="Waiting for Provisioning CR on BareMetal Platform"
I1222 01:05:34.430552 1 provisioning_controller.go:620] "Network stack calculation" NetworkStack=1
$ tail -n2 cbo.log
E1222 02:36:57.323869 1 leaderelection.go:332] error retrieving resource lock openshift-machine-api/cluster-baremetal-operator: leases.coordination.k8s.io "cluster-baremetal-operator" is forbidden: User "system:serviceaccount:openshift-machine-api:cluster-baremetal-operator" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "openshift-machine-api"
E1222 02:37:00.248442 1 leaderelection.go:332] error retrieving resource lock openshift-machine-api/cluster-baremetal-operator: leases.coordination.k8s.io "cluster-baremetal-operator" is forbidden: User "system:serviceaccount:openshift-machine-api:cluster-baremetal-operator" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "openshift-machine-api"

but still managed to write Available=True (with that 'new CO status' line):

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-baremetal-operator/395/pull-ci-openshift-cluster-baremetal-operator-master-e2e-agnostic-ovn/1737988020168036352/artifacts/e2e-agnostic-ovn/gather-extra/artifacts/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "baremetal").status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
2023-12-22T01:05:34Z Progressing=False WaitingForProvisioningCR:
2023-12-22T01:05:34Z Degraded=False :
2023-12-22T01:05:34Z Available=True WaitingForProvisioningCR: Waiting for Provisioning CR on BareMetal Platform
2023-12-22T01:05:34Z Upgradeable=True :
2023-12-22T01:05:34Z Disabled=False :

"I'll never get this lease, and I need a lease to run all my controllers" doesn't seem very Available=True to me, and with this commit, we won't touch the ClusterOperator and the install will time out.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-baremetal-operator/395/pull-ci-openshift-cluster-baremetal-operator-master-e2e-agnostic-ovn/1737988020168036352
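A minimal sketch of the gating idea, assuming controller-runtime's Manager interface, whose Elected() channel closes once this replica holds the lease (the reconciler type and field are illustrative, not the operator's actual shape):

package controllers

import (
	ctrl "sigs.k8s.io/controller-runtime"
)

// ProvisioningReconciler is illustrative; the real reconciler has more fields.
type ProvisioningReconciler struct {
	Mgr ctrl.Manager
}

// isLeader reports whether this replica currently holds the leader lease.
// mgr.Elected() is closed by controller-runtime once election succeeds, so
// a non-blocking receive distinguishes "elected" from "still waiting", and
// setup-time code paths can refuse to write until it returns true.
func (r *ProvisioningReconciler) isLeader() bool {
	select {
	case <-r.Mgr.Elected():
		return true
	default:
		return false
	}
}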
Force-pushed from 43f9adb to 2f98d8c (Compare)
This ingress flake is unrelated. Otherwise CI is looking pretty good now, so probably ready for a fresh review round when folks have time.
@wking: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/lgtm
Merged a5c81df into openshift:master.
@wking: Jira Issue OCPBUGS-25766: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-25766 has been moved to the MODIFIED state.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
[ART PR BUILD NOTIFIER] This PR has been included in build ose-cluster-baremetal-operator-container-v4.16.0-202401031811.p0.ga5c81df.assembly.stream for distgit ose-cluster-baremetal-operator.
/cherry-pick release-4.15
@rhjanders: new pull request created: #402
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.