
OCPBUGS-25766: manifests/0000_31_cluster-baremetal-operator_06_deployment: Enable leader election #395

Merged
merged 3 commits into openshift:master from the enable-leader-election branch on Jan 3, 2024

Conversation

wking
Member

@wking wking commented Dec 20, 2023

The option has been available for years:

$ git blame main.go | grep enable-leader-election
dcbe86f72 (Sandhya Dasu               2020-08-18 21:09:29 -0400  72)    flag.BoolVar(&enableLeaderElection, "enable-leader-election", false,

and without it, overlapping operator pods can crash-loop:

  : [sig-cluster-lifecycle] pathological event should not see excessive Back-off restarting failed containers	0s
  {  event [namespace/openshift-machine-api node/ip-10-0-62-147.us-west-2.compute.internal pod/cluster-baremetal-operator-574577fbcb-z8nd4 hmsg/bf39bb17ae - Back-off restarting failed container cluster-baremetal-operator in pod cluster-baremetal-operator-574577fbcb-z8nd4_openshift-machine-api(441969c1-b430-412c-b67f-4ae2f7797f4f)] happened 26 times
  event [namespace/openshift-machine-api node/ip-10-0-62-147.us-west-2.compute.internal pod/cluster-baremetal-operator-574577fbcb-z8nd4 hmsg/bf39bb17ae - Back-off restarting failed container cluster-baremetal-operator in pod cluster-baremetal-operator-574577fbcb-z8nd4_openshift-machine-api(441969c1-b430-412c-b67f-4ae2f7797f4f)] happened 51 times}

while fighting each other over the same ClusterOperator status:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-aws-ovn-upgrade/1737335551998038016/artifacts/e2e-aws-ovn-upgrade/gather-audit-logs/artifacts/audit-logs.tar | tar -xz --strip-components=2
$ zgrep -h '"resource":"clusteroperators","name":"baremetal"' kube-apiserver/*audit*log.gz | jq -r 'select(.verb == "create" or .verb == "update") | .stageTimestamp + " " + .verb + " " + (.responseStatus.code | tostring) + " " + (.objectRef.subresource) + " " + .user.username + " " + .user.extra["authentication.kubernetes.io/pod-name"][0]' | grep 'T06:08:.*cluster-baremetal-operator' | sort
2023-12-20T06:08:21.757799Z update 200 status system:serviceaccount:openshift-machine-api:cluster-baremetal-operator cluster-baremetal-operator-574577fbcb-z8nd4
2023-12-20T06:08:21.778638Z update 200 status system:serviceaccount:openshift-machine-api:cluster-baremetal-operator cluster-baremetal-operator-7fbb57959b-s9v9g
2023-12-20T06:08:21.780378Z update 409 status system:serviceaccount:openshift-machine-api:cluster-baremetal-operator cluster-baremetal-operator-574577fbcb-z8nd4
2023-12-20T06:08:21.790000Z update 200 status system:serviceaccount:openshift-machine-api:cluster-baremetal-operator cluster-baremetal-operator-7fbb57959b-s9v9g
2023-12-20T06:08:21.802780Z update 200 status system:serviceaccount:openshift-machine-api:cluster-baremetal-operator cluster-baremetal-operator-7fbb57959b-s9v9g

Using a leader lock will avoid this contention, and the system should be able to coast through brief moments after an outgoing leader leaves until a replacement leader picks things back up.

I'm also setting a Recreate strategy, because:

  1. Incoming pod is surged by the default Deployment strategy.
  2. Incoming pod attempts to acquire the Lease, but the outgoing pod is holding it.
  3. Outgoing pod releases the lease and exits.
  4. Incoming pod tries again, and this time acquires the lease.

can be slow in the 3-to-4 phase, while:

  1. Outgoing pod releases the lease and exits.
  2. Incoming pod is created, scheduled, and acquires the lease.

tends to be faster. And again, the component should be able to coast through small durations without a functioning leader.

See openshift/machine-config-operator@7530ded86 (openshift/machine-config-operator#3895) for another example of how Recreate can help that way.
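
In sketch form, the manifest change this describes amounts to something like the following (the labels and image below are illustrative placeholders rather than the manifest's exact contents):

  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: cluster-baremetal-operator
    namespace: openshift-machine-api
  spec:
    replicas: 1
    strategy:
      type: Recreate  # avoid surging a second pod that would sit waiting on the outgoing leader's Lease
    selector:
      matchLabels:
        k8s-app: cluster-baremetal-operator
    template:
      metadata:
        labels:
          k8s-app: cluster-baremetal-operator
      spec:
        serviceAccountName: cluster-baremetal-operator
        containers:
        - name: cluster-baremetal-operator
          image: registry.example.com/cluster-baremetal-operator:latest  # illustrative placeholder
          args:
          - --enable-leader-election  # the long-available flag being turned on here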

@wking wking changed the title manifests/0000_31_cluster-baremetal-operator_06_deployment: Enable leader election OCPBUGS-25766: manifests/0000_31_cluster-baremetal-operator_06_deployment: Enable leader election Dec 20, 2023
@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. labels Dec 20, 2023
@openshift-ci-robot
Contributor

@wking: This pull request references Jira Issue OCPBUGS-25766, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (jhajyahy@redhat.com), skipping review request.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Dec 20, 2023
@zaneb
Member

zaneb commented Dec 20, 2023

/approve

Contributor

openshift-ci bot commented Dec 20, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wking, zaneb

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 20, 2023
@honza
Member

honza commented Dec 21, 2023

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 21, 2023
@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 18c6614 and 2 for PR HEAD 4e4519d in total

@zaneb
Member

zaneb commented Dec 21, 2023

2023-12-21T03:13:11.148334181Z E1221 03:13:11.148220       1 main.go:104] "unable to start manager" err="LeaderElectionID must be configured"

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Dec 21, 2023
@wking
Member Author

wking commented Dec 21, 2023

...LeaderElectionID must be configured...

Oops, thanks. I've pushed:

$ git --no-pager diff 4e4519d6..3be7317c
diff --git a/main.go b/main.go
index 4846efd1..7af5a72c 100644
--- a/main.go
+++ b/main.go
@@ -98,7 +98,9 @@ func main() {
                }),
                NewCache: cache.MultiNamespacedCacheBuilder(
                        []string{controllers.ComponentNamespace, provisioning.OpenshiftConfigNamespace}),
-               LeaderElection: enableLeaderElection,
+               LeaderElection:          enableLeaderElection,
+               LeaderElectionID:        "cluster-baremetal-operator",
+               LeaderElectionNamespace: os.Getenv("COMPONENT_NAMESPACE"),
        })
        if err != nil {
                klog.ErrorS(err, "unable to start manager")

And added a paragraph to the 3be7317 commit message talking myself through that change.

…ader election

The option has been available for years:

  $ git blame main.go | grep enable-leader-election
  dcbe86f (Sandhya Dasu               2020-08-18 21:09:29 -0400  72)    flag.BoolVar(&enableLeaderElection, "enable-leader-election", false,

and without it, overlapping operator pods can crash-loop [1]:

  : [sig-cluster-lifecycle] pathological event should not see excessive Back-off restarting failed containers	0s
  {  event [namespace/openshift-machine-api node/ip-10-0-62-147.us-west-2.compute.internal pod/cluster-baremetal-operator-574577fbcb-z8nd4 hmsg/bf39bb17ae - Back-off restarting failed container cluster-baremetal-operator in pod cluster-baremetal-operator-574577fbcb-z8nd4_openshift-machine-api(441969c1-b430-412c-b67f-4ae2f7797f4f)] happened 26 times
  event [namespace/openshift-machine-api node/ip-10-0-62-147.us-west-2.compute.internal pod/cluster-baremetal-operator-574577fbcb-z8nd4 hmsg/bf39bb17ae - Back-off restarting failed container cluster-baremetal-operator in pod cluster-baremetal-operator-574577fbcb-z8nd4_openshift-machine-api(441969c1-b430-412c-b67f-4ae2f7797f4f)] happened 51 times}

while fighting each other over the same ClusterOperator status:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-aws-ovn-upgrade/1737335551998038016/artifacts/e2e-aws-ovn-upgrade/gather-audit-logs/artifacts/audit-logs.tar | tar -xz --strip-components=2
  $ zgrep -h '"resource":"clusteroperators","name":"baremetal"' kube-apiserver/*audit*log.gz | jq -r 'select(.verb == "create" or .verb == "update") | .stageTimestamp + " " + .verb + " " + (.responseStatus.code | tostring) + " " + (.objectRef.subresource) + " " + .user.username + " " + .user.extra["authentication.kubernetes.io/pod-name"][0]' | grep 'T06:08:.*cluster-baremetal-operator' | sort
  2023-12-20T06:08:21.757799Z update 200 status system:serviceaccount:openshift-machine-api:cluster-baremetal-operator cluster-baremetal-operator-574577fbcb-z8nd4
  2023-12-20T06:08:21.778638Z update 200 status system:serviceaccount:openshift-machine-api:cluster-baremetal-operator cluster-baremetal-operator-7fbb57959b-s9v9g
  2023-12-20T06:08:21.780378Z update 409 status system:serviceaccount:openshift-machine-api:cluster-baremetal-operator cluster-baremetal-operator-574577fbcb-z8nd4
  2023-12-20T06:08:21.790000Z update 200 status system:serviceaccount:openshift-machine-api:cluster-baremetal-operator cluster-baremetal-operator-7fbb57959b-s9v9g
  2023-12-20T06:08:21.802780Z update 200 status system:serviceaccount:openshift-machine-api:cluster-baremetal-operator cluster-baremetal-operator-7fbb57959b-s9v9g

Using a leader lock will avoid this contention, and the system should
be able to coast through brief moments after an outgoing leader leaves
until a replacement leader picks things back up.

I'm also setting a Recreate strategy [2], because:

1. Incoming pod is surged by the default Deployment strategy.
2. Incoming pod attempts to acquire the Lease, but the outgoing pod is holding it.
3. Outgoing pod releases the lease and exits.
4. Incoming pod tries again, and this time acquires the lease.

can be slow in the 3-to-4 phase, while:

1. Outgoing pod releases the lease and exits.
2. Incoming pod is created, scheduled, and acquires the lease.

tends to be faster.  And again, the component should be able to coast
through small durations without a functioning leader.

See openshift/machine-config-operator@7530ded86 (install: Recreate and
delayed default ServiceAccount deletion, 2023-08-29,
openshift/machine-config-operator#3895) for another example of how
Recreate can help that way.

To get the leader election working, I'm also setting LeaderElectionID
and LeaderElectionNamespace, pattern-matching from [3].  Using
cluster-baremetal-operator as the ID avoids collisions with
baremetal-operator [4].  The COMPONENT_NAMESPACE environment variable comes from:

  - name: COMPONENT_NAMESPACE
    valueFrom:
      fieldRef:
        apiVersion: v1
        fieldPath: metadata.namespace

using the pattern documented in [5].  We've been asking for that
environment variable since f368a4f (Add sa, deployment and co
manifests, 2020-09-29, openshift#25), although grepping through history, I
don't see anything consuming it.  Perhaps it's somehow feeding
controllers.ComponentNamespace, which is what I'm using.
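
For reference, the election these settings configure is tracked in a
coordination.k8s.io Lease object shaped roughly like this (the holder
identity and timing values below are illustrative, assuming
controller-runtime's defaults):

  apiVersion: coordination.k8s.io/v1
  kind: Lease
  metadata:
    name: cluster-baremetal-operator        # LeaderElectionID
    namespace: openshift-machine-api        # LeaderElectionNamespace
  spec:
    holderIdentity: cluster-baremetal-operator-574577fbcb-z8nd4_<uuid>  # illustrative
    leaseDurationSeconds: 15                # controller-runtime default
    renewTime: "2023-12-20T06:08:21.000000Z"  # illustrative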

[1]: https://issues.redhat.com/browse/OCPBUGS-25766
[2]: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#recreate-deployment
[3]: https://github.com/openshift/baremetal-operator/blob/749609dd0fb858be72a783c5d87853cad2d303db/main.go#L207-L208
[4]: https://github.com/openshift/baremetal-operator/blob/749609dd0fb858be72a783c5d87853cad2d303db/main.go#L71
[5]: https://kubernetes.io/docs/tasks/inject-data-application/environment-variable-expose-pod-information/
@wking
Member Author

wking commented Dec 21, 2023

More pattern-matching consistency:

$ git --no-pager diff 3be7317c..0cfaf393
diff --git a/main.go b/main.go
index 7af5a72c..e6bb56d5 100644
--- a/main.go
+++ b/main.go
@@ -100,7 +100,7 @@ func main() {
                        []string{controllers.ComponentNamespace, provisioning.OpenshiftConfigNamespace}),
                LeaderElection:          enableLeaderElection,
                LeaderElectionID:        "cluster-baremetal-operator",
-               LeaderElectionNamespace: os.Getenv("COMPONENT_NAMESPACE"),
+               LeaderElectionNamespace: controllers.ComponentNamespace,
        })
        if err != nil {
                klog.ErrorS(err, "unable to start manager")

@wking
Member Author

wking commented Dec 21, 2023

Hrm, I need to extend RBAC:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-baremetal-operator/395/pull-ci-openshift-cluster-baremetal-operator-master-e2e-metal-ipi-ovn-ipv6/1737877679555743744/artifacts/e2e-metal-ipi-ovn-ipv6/gather-extra/artifacts/pods/openshift-machine-api_cluster-baremetal-operator-f9696655-p5dxv_cluster-baremetal-operator.log | tail -n1
E1221 19:31:10.136115       1 leaderelection.go:332] error retrieving resource lock openshift-machine-api/cluster-baremetal-operator: leases.coordination.k8s.io "cluster-baremetal-operator" is forbidden: User "system:serviceaccount:openshift-machine-api:cluster-baremetal-operator" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "openshift-machine-api"

@wking
Member Author

wking commented Dec 21, 2023

Hrm, the existing preference for sharding RBAC out over a number of Roles and RoleBindings makes this a bit awkward. I'm not sure whether I should:

  • Shuffle that around to stuff everything into one operator-scoped role (losing flexibility for folks building with kustomize but without leader election enabled? Who would do that?), or
  • Rename the many entries from names like leader-election-role, which are probably not unique enough alongside the other openshift-machine-api inhabitants, to more unique but unwieldy names like cluster-baremetal-operator-leader-election-rolebinding.

@wking wking force-pushed the enable-leader-election branch 2 times, most recently from 8f95164 to f8e3ac9 on December 22, 2023
@wking
Member Author

wking commented Dec 22, 2023

Trying to understand what's going on with ServiceAccounts and RBAC in the e2e-agnostic-ovn run:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-baremetal-operator/395/pull-ci-openshift-cluster-baremetal-operator-master-e2e-agnostic-ovn/1737988020168036352/artifacts/e2e-agnostic-ovn/gather-extra/artifacts/pods.json | jq -c '.items[] | select(.metadata.namespace == "openshift-machine-api") | {name: .metadata.name, serviceAccountName: .spec.serviceAccountName}'
{"name":"cluster-autoscaler-operator-6878b89c58-7kf6d","serviceAccountName":"cluster-autoscaler-operator"}
{"name":"cluster-baremetal-operator-5c57b874f5-s9zmq","serviceAccountName":"cluster-baremetal-operator"}
{"name":"control-plane-machine-set-operator-7d5dcfcb76-cshv4","serviceAccountName":"control-plane-machine-set-operator"}
{"name":"machine-api-controllers-54c98b4bb4-sgvhd","serviceAccountName":"machine-api-controllers"}
{"name":"machine-api-operator-7fb4f8c4fb-7gr2t","serviceAccountName":"machine-api-operator"}

So each pod has its own ServiceAccount. Good. Checking on bound roles:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-baremetal-operator/395/pull-ci-openshift-cluster-baremetal-operator-master-e2e-agnostic-ovn/1737988020168036352/artifacts/e2e-agnostic-ovn/gather-extra/artifacts/rolebindings.json | jq -c '.items[] | select(.metadata.namespace == "openshift-machine-api") | {name: .metadata.name, subjects, roleRef, description: .metadata.annotations["openshift.io/description"]}'
{"name":"cluster-autoscaler","subjects":[{"kind":"ServiceAccount","name":"cluster-autoscaler","namespace":"openshift-machine-api"}],"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"Role","name":"cluster-autoscaler"},"description":null}
{"name":"cluster-autoscaler-operator","subjects":[{"kind":"ServiceAccount","name":"cluster-autoscaler-operator","namespace":"openshift-machine-api"}],"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"Role","name":"cluster-autoscaler-operator"},"description":null}
{"name":"cluster-baremetal-operator","subjects":[{"kind":"ServiceAccount","name":"cluster-baremetal-operator","namespace":"openshift-machine-api"}],"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"Role","name":"cluster-baremetal-operator"},"description":null}
{"name":"control-plane-machine-set-operator","subjects":[{"kind":"ServiceAccount","name":"control-plane-machine-set-operator","namespace":"openshift-machine-api"}],"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"Role","name":"control-plane-machine-set-operator"},"description":null}
{"name":"machine-api-controllers","subjects":[{"kind":"ServiceAccount","name":"machine-api-controllers","namespace":"openshift-machine-api"}],"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"Role","name":"machine-api-controllers"},"description":null}
{"name":"machine-api-operator","subjects":[{"kind":"ServiceAccount","name":"machine-api-operator","namespace":"openshift-machine-api"}],"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"Role","name":"machine-api-operator"},"description":null}
{"name":"prometheus-k8s-cluster-autoscaler-operator","subjects":[{"kind":"ServiceAccount","name":"prometheus-k8s","namespace":"openshift-monitoring"}],"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"Role","name":"prometheus-k8s-cluster-autoscaler-operator"},"description":null}
{"name":"prometheus-k8s-cluster-baremetal-operator","subjects":[{"kind":"ServiceAccount","name":"prometheus-k8s","namespace":"openshift-monitoring"}],"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"Role","name":"prometheus-k8s-cluster-baremetal-operator"},"description":null}
{"name":"prometheus-k8s-machine-api-operator","subjects":[{"kind":"ServiceAccount","name":"prometheus-k8s","namespace":"openshift-monitoring"}],"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"Role","name":"prometheus-k8s-machine-api-operator"},"description":null}
{"name":"system:deployers","subjects":[{"kind":"ServiceAccount","name":"deployer","namespace":"openshift-machine-api"}],"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"ClusterRole","name":"system:deployer"},"description":"Allows deploymentconfigs in this namespace to rollout pods in this namespace.  It is auto-managed by a controller; remove subjects to disable."}
{"name":"system:image-builders","subjects":[{"kind":"ServiceAccount","name":"builder","namespace":"openshift-machine-api"}],"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"ClusterRole","name":"system:image-builder"},"description":"Allows builds in this namespace to push images to this namespace.  It is auto-managed by a controller; remove subjects to disable."}
{"name":"system:image-pullers","subjects":[{"apiGroup":"rbac.authorization.k8s.io","kind":"Group","name":"system:serviceaccounts:openshift-machine-api"}],"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"ClusterRole","name":"system:image-puller"},"description":"Allows all pods in this namespace to pull images from this namespace.  It is auto-managed by a controller; remove subjects to disable."}

openshift.io/description seems unnecessarily OpenShift-specific, when upstream declares kubernetes.io/description. I'd expect standardizing on the upstream annotation would make it easier for people and robots to find the description, by sparing them from wondering "will this be a Kubernetes description, or an OpenShift description?". But that's orthogonal to RBAC.

And it seems a bit strange that cluster-autoscaler-operator, cluster-baremetal-operator, and machine-api-operator set up prometheus-k8s-* RoleBindings, while control-plane-machine-set-operator does not. But perhaps CPMS recycles prometheus-k8s-machine-api-operator or something.

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-baremetal-operator/395/pull-ci-openshift-cluster-baremetal-operator-master-e2e-agnostic-ovn/1737988020168036352/artifacts/e2e-agnostic-ovn/gather-extra/artifacts/roles.json | jq -c '.items[] | select(.metadata.namespace == "openshift-machine-api") | ([.rules[] | select(.resources | tostring | contains("leases"))]) | select(length > 0)' | sort | uniq -c
      1 [{"apiGroups":["coordination.k8s.io"],"resources":["leases"],"verbs":["get","list","create","update"]}]
      3 [{"apiGroups":["coordination.k8s.io"],"resources":["leases"],"verbs":["get","list","watch","create","update","patch","delete"]}]
      1 [{"apiGroups":[""],"resources":["configmaps","leases","secrets","services"],"verbs":["create","delete","get","list","patch","update","watch"]}]

Aha, I'd missed "apiGroups":["coordination.k8s.io"]. Hopefully fixed with 8f95164 -> f8e3ac9.
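
In other words, the Role bound to the cluster-baremetal-operator ServiceAccount needs a leases rule along these lines (a sketch mirroring the verbs granted by the other openshift-machine-api roles above; the Role name is illustrative, standing in for whichever Role the ServiceAccount is actually bound to):

  apiVersion: rbac.authorization.k8s.io/v1
  kind: Role
  metadata:
    name: cluster-baremetal-operator  # illustrative
    namespace: openshift-machine-api
  rules:
  - apiGroups:
    - coordination.k8s.io  # the apiGroup I'd missed
    resources:
    - leases
    verbs:
    - get
    - list
    - watch
    - create
    - update
    - patch
    - delete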

…ement to Role

Avoid:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-baremetal-operator/395/pull-ci-openshift-cluster-baremetal-operator-master-e2e-metal-ipi-ovn-ipv6/1737877679555743744/artifacts/e2e-metal-ipi-ovn-ipv6/gather-extra/artifacts/pods/openshift-machine-api_cluster-baremetal-operator-f9696655-p5dxv_cluster-baremetal-operator.log | tail -n1
  E1221 19:31:10.136115       1 leaderelection.go:332] error retrieving resource lock openshift-machine-api/cluster-baremetal-operator: leases.coordination.k8s.io "cluster-baremetal-operator" is forbidden: User "system:serviceaccount:openshift-machine-api:cluster-baremetal-operator" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "openshift-machine-api"

I'm also removing some old (and previously unused) leader-election
kustomize manifests.  It's hard for me to imagine someone needing that
kind of composability out of this operator, and collapsing into a
single role both simplifies maintenance and removes the need to use
unwieldy names like cluster-baremetal-operator-leader-election-role to
stand out vs. the other leader-election roles that the
openshift-machine-api namespace might hold.

The patchesJson6902 Kustomize syntax is documented in [1]:

  Patches can be used to apply different customizations to
  Resources. Kustomize supports different patching mechanisms through
  patchesStrategicMerge and patchesJson6902.
  ...
  Not all Resources or fields support strategic merge patches. To
  support modifying arbitrary fields in arbitrary Resources, Kustomize
  offers applying JSON patch [2] through patchesJson6902. To find the
  correct Resource for a Json patch, the group, version, kind and name
  of that Resource need to be specified in kustomization.yaml. For
  example, increasing the replica number of a Deployment object can
  also be done through patchesJson6902.

The /rules/- path means "append the value to the existing rules
array", as described in [3,4].
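
A hypothetical kustomization fragment using that mechanism would look
roughly like the following (file and Role names here are illustrative,
not this repository's actual layout):

  # kustomization.yaml
  patchesJson6902:
  - target:
      group: rbac.authorization.k8s.io
      version: v1
      kind: Role
      name: cluster-baremetal-operator
    path: add-leases-rule.yaml

  # add-leases-rule.yaml (RFC 6902 operations in YAML form)
  - op: add
    path: /rules/-  # append to the existing rules array
    value:
      apiGroups:
      - coordination.k8s.io
      resources:
      - leases
      verbs:
      - get
      - list
      - watch
      - create
      - update
      - patch
      - delete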

[1]: https://kubernetes.io/docs/tasks/manage-kubernetes-objects/kustomization/#customizing
[2]: https://datatracker.ietf.org/doc/html/rfc6902
[3]: https://datatracker.ietf.org/doc/html/rfc6902#section-4.1
[4]: https://datatracker.ietf.org/doc/html/rfc6902#appendix-A.16
… SetupWithManager

SetupWithManager(mgr) is called before mgr.Start() in main(), so it's
running before the leader lease has been acquired.  We don't want to
be writing to anything before we acquire the lease, because that could
cause contention with an actual lease-holder that is managing those
resources.  We can still perform read-only prechecks in
SetupWithManager.  Instead, we patiently wait until we hold the lease,
and update ClusterOperator (and anything else we manage) from the
Reconcile() function, which only runs once the manager is the leader.

This partially rolls back 2e9d117 (Ensure baremetal CO is
completely setup before Reconcile, 2020-11-30, openshift#81), but that addition
predated ensureClusterOperator being added early in Reconcile in
4f2d314 (Make sure ensureClusterOperator() is called before its
status is updated, 2020-12-15, openshift#71):

  $ git log --oneline | grep -n '2e9d1177\|4f2d3141'
  468:4f2d3141 Make sure ensureClusterOperator() is called before its status is updated
  506:2e9d1177 Ensure baremetal CO is completely setup before Reconcile

So the ensureClusterOperator call in SetupWithManager is no longer
needed.

And this partially rolls back 8798044 (Handle case when
Provisioning CR is absent on the Baremetal platform, 2020-11-30, openshift#81).
That "we're enabled, but there isn't a Provisioning custom resource
yet" handling happens continually in Reconcile (where the write will
be protected by the operator holding the lease).

Among other improvements, this change will prevent a
nominally-successful install where the operator never acquired a lease
[1]:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-baremetal-operator/395/pull-ci-openshift-cluster-baremetal-operator-master-e2e-agnostic-ovn/1737988020168036352/artifacts/e2e-agnostic-ovn/gather-extra/artifacts/pods/openshift-machine-api_cluster-baremetal-operator-5c57b874f5-s9zmq_cluster-baremetal-operator.log >cbo.log
  $ head -n4 cbo.log
  I1222 01:05:34.274563       1 listener.go:44] controller-runtime/metrics "msg"="Metrics server is starting to listen" "addr"=":8080"
  I1222 01:05:34.318283       1 webhook.go:104] WebhookDependenciesReady: everything ready for webhooks
  I1222 01:05:34.403202       1 clusteroperator.go:217] "new CO status" reason="WaitingForProvisioningCR" processMessage="" message="Waiting for Provisioning CR on BareMetal Platform"
  I1222 01:05:34.430552       1 provisioning_controller.go:620] "Network stack calculation" NetworkStack=1
  $ tail -n2 cbo.log
  E1222 02:36:57.323869       1 leaderelection.go:332] error retrieving resource lock openshift-machine-api/cluster-baremetal-operator: leases.coordination.k8s.io "cluster-baremetal-operator" is forbidden: User "system:serviceaccount:openshift-machine-api:cluster-baremetal-operator" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "openshift-machine-api"
  E1222 02:37:00.248442       1 leaderelection.go:332] error retrieving resource lock openshift-machine-api/cluster-baremetal-operator: leases.coordination.k8s.io "cluster-baremetal-operator" is forbidden: User "system:serviceaccount:openshift-machine-api:cluster-baremetal-operator" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "openshift-machine-api"

but still managed to write Available=True (with that 'new CO status' line):

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-baremetal-operator/395/pull-ci-openshift-cluster-baremetal-operator-master-e2e-agnostic-ovn/1737988020168036352/artifacts/e2e-agnostic-ovn/gather-extra/artifacts/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "baremetal").status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
  2023-12-22T01:05:34Z Progressing=False WaitingForProvisioningCR:
  2023-12-22T01:05:34Z Degraded=False :
  2023-12-22T01:05:34Z Available=True WaitingForProvisioningCR: Waiting for Provisioning CR on BareMetal Platform
  2023-12-22T01:05:34Z Upgradeable=True :
  2023-12-22T01:05:34Z Disabled=False :

"I'll never get this lease, and I need a lease to run all my
controllers" doesn't seem very Available=True to me, and with this
commit, we won't touch the ClusterOperator and the install will time
out.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-baremetal-operator/395/pull-ci-openshift-cluster-baremetal-operator-master-e2e-agnostic-ovn/1737988020168036352
@wking
Member Author

wking commented Dec 22, 2023

This ingress flake is unrelated. Otherwise CI is looking pretty good now, so probably ready for a fresh review round when folks have time.

Contributor

openshift-ci bot commented Dec 22, 2023

@wking: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-metal-ipi-upgrade-ovn-ipv6 2f98d8c link false /test e2e-metal-ipi-upgrade-ovn-ipv6
ci/prow/e2e-metal-ipi-sdn-upgrade 2f98d8c link false /test e2e-metal-ipi-sdn-upgrade

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@honza
Member

honza commented Jan 3, 2024

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 3, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit a5c81df into openshift:master Jan 3, 2024
15 of 17 checks passed
@openshift-ci-robot
Contributor

@wking: Jira Issue OCPBUGS-25766: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-25766 has been moved to the MODIFIED state.

In response to this:


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@wking wking deleted the enable-leader-election branch January 3, 2024 15:46
@openshift-bot

[ART PR BUILD NOTIFIER]

This PR has been included in build ose-cluster-baremetal-operator-container-v4.16.0-202401031811.p0.ga5c81df.assembly.stream for distgit ose-cluster-baremetal-operator.
All builds following this will include this PR.

@rhjanders
Contributor

/cherry-pick release-4.15

@openshift-cherrypick-robot

@rhjanders: new pull request created: #402

In response to this:

/cherry-pick release-4.15

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
