
OCPBUGS-25766: manifests/0000_31_cluster-baremetal-operator_06_deployment: Enable leader election #395

Merged
merged 3 commits into openshift:master from the enable-leader-election branch on Jan 3, 2024

Conversation

wking
Member

@wking wking commented Dec 20, 2023

The option has been available for years:

$ git blame main.go | grep enable-leader-election
dcbe86f72 (Sandhya Dasu               2020-08-18 21:09:29 -0400  72)    flag.BoolVar(&enableLeaderElection, "enable-leader-election", false,

and without it, overlapping operator pods can crash-loop:

  : [sig-cluster-lifecycle] pathological event should not see excessive Back-off restarting failed containers	0s
  {  event [namespace/openshift-machine-api node/ip-10-0-62-147.us-west-2.compute.internal pod/cluster-baremetal-operator-574577fbcb-z8nd4 hmsg/bf39bb17ae - Back-off restarting failed container cluster-baremetal-operator in pod cluster-baremetal-operator-574577fbcb-z8nd4_openshift-machine-api(441969c1-b430-412c-b67f-4ae2f7797f4f)] happened 26 times
  event [namespace/openshift-machine-api node/ip-10-0-62-147.us-west-2.compute.internal pod/cluster-baremetal-operator-574577fbcb-z8nd4 hmsg/bf39bb17ae - Back-off restarting failed container cluster-baremetal-operator in pod cluster-baremetal-operator-574577fbcb-z8nd4_openshift-machine-api(441969c1-b430-412c-b67f-4ae2f7797f4f)] happened 51 times}

while fighting each other over the same ClusterOperator status:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-aws-ovn-upgrade/1737335551998038016/artifacts/e2e-aws-ovn-upgrade/gather-audit-logs/artifacts/audit-logs.tar | tar -xz --strip-components=2
$ zgrep -h '"resource":"clusteroperators","name":"baremetal"' kube-apiserver/*audit*log.gz | jq -r 'select(.verb == "create" or .verb == "update") | .stageTimestamp + " " + .verb + " " + (.responseStatus.code | tostring) + " " + (.objectRef.subresource) + " " + .user.username + " " + .user.extra["authentication.kubernetes.io/pod-name"][0]' | grep 'T06:08:.*cluster-baremetal-operator' | sort
2023-12-20T06:08:21.757799Z update 200 status system:serviceaccount:openshift-machine-api:cluster-baremetal-operator cluster-baremetal-operator-574577fbcb-z8nd4
2023-12-20T06:08:21.778638Z update 200 status system:serviceaccount:openshift-machine-api:cluster-baremetal-operator cluster-baremetal-operator-7fbb57959b-s9v9g
2023-12-20T06:08:21.780378Z update 409 status system:serviceaccount:openshift-machine-api:cluster-baremetal-operator cluster-baremetal-operator-574577fbcb-z8nd4
2023-12-20T06:08:21.790000Z update 200 status system:serviceaccount:openshift-machine-api:cluster-baremetal-operator cluster-baremetal-operator-7fbb57959b-s9v9g
2023-12-20T06:08:21.802780Z update 200 status system:serviceaccount:openshift-machine-api:cluster-baremetal-operator cluster-baremetal-operator-7fbb57959b-s9v9g

Using a leader lock will avoid this contention, and the system should be able to coast through brief moments after an outgoing leader leaves until a replacement leader picks things back up.

I'm also setting a Recreate strategy, because:

  1. Incoming pod is surged by the default Deployment strategy.
  2. Incoming pod attempts to acquire the Lease, but the outgoing pod is holding it.
  3. Outgoing pod releases the lease and exits.
  4. Incoming pod tries again, and this time acquires the lease.

can be slow in the 3-to-4 phase, while:

  1. Outgoing pod releases the lease and exits.
  2. Incoming pod is created, scheduled, and acquires the lease.

tends to be faster. And again, the component should be able to coast through small durations without a functioning leader.

See openshift/machine-config-operator@7530ded86 (openshift/machine-config-operator#3895) for another example of how Recreate can help that way.
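
In sketch form, the manifest change this describes amounts to something like the following (the labels and image below are illustrative placeholders rather than the manifest's exact contents):

  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: cluster-baremetal-operator
    namespace: openshift-machine-api
  spec:
    replicas: 1
    strategy:
      type: Recreate  # avoid surging a second pod that would sit waiting on the outgoing leader's Lease
    selector:
      matchLabels:
        k8s-app: cluster-baremetal-operator
    template:
      metadata:
        labels:
          k8s-app: cluster-baremetal-operator
      spec:
        serviceAccountName: cluster-baremetal-operator
        containers:
        - name: cluster-baremetal-operator
          image: registry.example.com/cluster-baremetal-operator:latest  # illustrative placeholder
          args:
          - --enable-leader-election  # the long-available flag being turned on here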

@wking wking changed the title manifests/0000_31_cluster-baremetal-operator_06_deployment: Enable leader election OCPBUGS-25766: manifests/0000_31_cluster-baremetal-operator_06_deployment: Enable leader election Dec 20, 2023
@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. labels Dec 20, 2023
@openshift-ci-robot
Contributor

@wking: This pull request references Jira Issue OCPBUGS-25766, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (jhajyahy@redhat.com), skipping review request.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Dec 20, 2023
@zaneb
Member

zaneb commented Dec 20, 2023

/approve

Contributor

openshift-ci bot commented Dec 20, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wking, zaneb

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 20, 2023
@honza
Member

honza commented Dec 21, 2023

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 21, 2023
@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 18c6614 and 2 for PR HEAD 4e4519d in total

@zaneb
Member

zaneb commented Dec 21, 2023

2023-12-21T03:13:11.148334181Z E1221 03:13:11.148220       1 main.go:104] "unable to start manager" err="LeaderElectionID must be configured"

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Dec 21, 2023
@wking
Member Author

wking commented Dec 21, 2023

...LeaderElectionID must be configured...

Oops, thanks. I've pushed:

$ git --no-pager diff 4e4519d6..3be7317c
diff --git a/main.go b/main.go
index 4846efd1..7af5a72c 100644
--- a/main.go
+++ b/main.go
@@ -98,7 +98,9 @@ func main() {
                }),
                NewCache: cache.MultiNamespacedCacheBuilder(
                        []string{controllers.ComponentNamespace, provisioning.OpenshiftConfigNamespace}),
-               LeaderElection: enableLeaderElection,
+               LeaderElection:          enableLeaderElection,
+               LeaderElectionID:        "cluster-baremetal-operator",
+               LeaderElectionNamespace: os.Getenv("COMPONENT_NAMESPACE"),
        })
        if err != nil {
                klog.ErrorS(err, "unable to start manager")

And added a paragraph to the 3be7317 commit message talking myself through that change.

…ader election

The option has been available for years:

  $ git blame main.go | grep enable-leader-election
  dcbe86f (Sandhya Dasu               2020-08-18 21:09:29 -0400  72)    flag.BoolVar(&enableLeaderElection, "enable-leader-election", false,

and without it, overlapping operator pods can crash-loop [1]:

  : [sig-cluster-lifecycle] pathological event should not see excessive Back-off restarting failed containers	0s
  {  event [namespace/openshift-machine-api node/ip-10-0-62-147.us-west-2.compute.internal pod/cluster-baremetal-operator-574577fbcb-z8nd4 hmsg/bf39bb17ae - Back-off restarting failed container cluster-baremetal-operator in pod cluster-baremetal-operator-574577fbcb-z8nd4_openshift-machine-api(441969c1-b430-412c-b67f-4ae2f7797f4f)] happened 26 times
  event [namespace/openshift-machine-api node/ip-10-0-62-147.us-west-2.compute.internal pod/cluster-baremetal-operator-574577fbcb-z8nd4 hmsg/bf39bb17ae - Back-off restarting failed container cluster-baremetal-operator in pod cluster-baremetal-operator-574577fbcb-z8nd4_openshift-machine-api(441969c1-b430-412c-b67f-4ae2f7797f4f)] happened 51 times}

while fighting each other over the same ClusterOperator status:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-aws-ovn-upgrade/1737335551998038016/artifacts/e2e-aws-ovn-upgrade/gather-audit-logs/artifacts/audit-logs.tar | tar -xz --strip-components=2
  $ zgrep -h '"resource":"clusteroperators","name":"baremetal"' kube-apiserver/*audit*log.gz | jq -r 'select(.verb == "create" or .verb == "update") | .stageTimestamp + " " + .verb + " " + (.responseStatus.code | tostring) + " " + (.objectRef.subresource) + " " + .user.username + " " + .user.extra["authentication.kubernetes.io/pod-name"][0]' | grep 'T06:08:.*cluster-baremetal-operator' | sort
  2023-12-20T06:08:21.757799Z update 200 status system:serviceaccount:openshift-machine-api:cluster-baremetal-operator cluster-baremetal-operator-574577fbcb-z8nd4
  2023-12-20T06:08:21.778638Z update 200 status system:serviceaccount:openshift-machine-api:cluster-baremetal-operator cluster-baremetal-operator-7fbb57959b-s9v9g
  2023-12-20T06:08:21.780378Z update 409 status system:serviceaccount:openshift-machine-api:cluster-baremetal-operator cluster-baremetal-operator-574577fbcb-z8nd4
  2023-12-20T06:08:21.790000Z update 200 status system:serviceaccount:openshift-machine-api:cluster-baremetal-operator cluster-baremetal-operator-7fbb57959b-s9v9g
  2023-12-20T06:08:21.802780Z update 200 status system:serviceaccount:openshift-machine-api:cluster-baremetal-operator cluster-baremetal-operator-7fbb57959b-s9v9g

Using a leader lock will avoid this contention, and the system should
be able to coast through brief moments after an outgoing leader leaves
until a replacement leader picks things back up.

I'm also setting a Recreate strategy [2], because:

1. Incoming pod is surged by the default Deployment strategy.
2. Incoming pod attempts to acquire the Lease, but the outgoing pod is holding it.
3. Outgoing pod releases the lease and exits.
4. Incoming pod tries again, and this time acquires the lease.

can be slow in the 3-to-4 phase, while:

1. Outgoing pod releases the lease and exits.
2. Incoming pod is created, scheduled, and acquires the lease.

tends to be faster.  And again, the component should be able to coast
through small durations without a functioning leader.

See openshift/machine-config-operator@7530ded86 (install: Recreate and
delayed default ServiceAccount deletion, 2023-08-29,
openshift/machine-config-operator#3895) for another example of how
Recreate can help that way.

To get the leader election working, I'm also setting LeaderElectionID
and LeaderElectionNamespace, pattern-matching from [3].  Using
cluster-baremetal-operator as the ID avoids collisions with
baremetal-operator [4].  The COMPONENT_NAMESPACE environment variable comes from:

  - name: COMPONENT_NAMESPACE
    valueFrom:
      fieldRef:
        apiVersion: v1
        fieldPath: metadata.namespace

using the pattern documented in [5].  We've been asking for that
environment variable since f368a4f (Add sa, deployment and co
manifests, 2020-09-29, openshift#25), although grepping through history, I
don't see anything consuming it.  Perhaps it's somehow feeding
controllers.ComponentNamespace, which is what I'm using.
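
For reference, the election these settings configure is tracked in a
coordination.k8s.io Lease object shaped roughly like this (the holder
identity and timing values below are illustrative, assuming
controller-runtime's defaults):

  apiVersion: coordination.k8s.io/v1
  kind: Lease
  metadata:
    name: cluster-baremetal-operator        # LeaderElectionID
    namespace: openshift-machine-api        # LeaderElectionNamespace
  spec:
    holderIdentity: cluster-baremetal-operator-574577fbcb-z8nd4_<uuid>  # illustrative
    leaseDurationSeconds: 15                # controller-runtime default
    renewTime: "2023-12-20T06:08:21.000000Z"  # illustrative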

[1]: https://issues.redhat.com/browse/OCPBUGS-25766
[2]: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#recreate-deployment
[3]: https://github.com/openshift/baremetal-operator/blob/749609dd0fb858be72a783c5d87853cad2d303db/main.go#L207-L208
[4]: https://github.com/openshift/baremetal-operator/blob/749609dd0fb858be72a783c5d87853cad2d303db/main.go#L71
[5]: https://kubernetes.io/docs/tasks/inject-data-application/environment-variable-expose-pod-information/
@wking
Member Author

wking commented Dec 21, 2023

More pattern-matching consistency:

$ git --no-pager diff 3be7317c..0cfaf393
diff --git a/main.go b/main.go
index 7af5a72c..e6bb56d5 100644
--- a/main.go
+++ b/main.go
@@ -100,7 +100,7 @@ func main() {
                        []string{controllers.ComponentNamespace, provisioning.OpenshiftConfigNamespace}),
                LeaderElection:          enableLeaderElection,
                LeaderElectionID:        "cluster-baremetal-operator",
-               LeaderElectionNamespace: os.Getenv("COMPONENT_NAMESPACE"),
+               LeaderElectionNamespace: controllers.ComponentNamespace,
        })
        if err != nil {
                klog.ErrorS(err, "unable to start manager")

@wking
Member Author

wking commented Dec 21, 2023

Hrm, I need to extend RBAC:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-baremetal-operator/395/pull-ci-openshift-cluster-baremetal-operator-master-e2e-metal-ipi-ovn-ipv6/1737877679555743744/artifacts/e2e-metal-ipi-ovn-ipv6/gather-extra/artifacts/pods/openshift-machine-api_cluster-baremetal-operator-f9696655-p5dxv_cluster-baremetal-operator.log | tail -n1
E1221 19:31:10.136115       1 leaderelection.go:332] error retrieving resource lock openshift-machine-api/cluster-baremetal-operator: leases.coordination.k8s.io "cluster-baremetal-operator" is forbidden: User "system:serviceaccount:openshift-machine-api:cluster-baremetal-operator" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "openshift-machine-api"

@wking
Member Author

wking commented Dec 21, 2023

Hrm, the existing preference for sharding RBAC out over a number of Roles and RoleBindings makes this a bit awkward. I'm not sure whether I should:

  • Shuffle that around to stuff everything into one operator-scoped role (losing flexibility for folks building with kustomize but without leader election enabled? Who would do that?), or
  • Rename the many entries from names like leader-election-role, which are probably not unique enough alongside the other openshift-machine-api inhabitants, to more unique but unwieldy names like cluster-baremetal-operator-leader-election-rolebinding.

@wking wking force-pushed the enable-leader-election branch 2 times, most recently from 8f95164 to f8e3ac9 on December 22, 2023
@wking
Member Author

wking commented Dec 22, 2023

Trying to understand what's going on with ServiceAccounts and RBAC in the e2e-agnostic-ovn run:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-baremetal-operator/395/pull-ci-openshift-cluster-baremetal-operator-master-e2e-agnostic-ovn/1737988020168036352/artifacts/e2e-agnostic-ovn/gather-extra/artifacts/pods.json | jq -c '.items[] | select(.metadata.namespace == "openshift-machine-api") | {name: .metadata.name, serviceAccountName: .spec.serviceAccountName}'
{"name":"cluster-autoscaler-operator-6878b89c58-7kf6d","serviceAccountName":"cluster-autoscaler-operator"}
{"name":"cluster-baremetal-operator-5c57b874f5-s9zmq","serviceAccountName":"cluster-baremetal-operator"}
{"name":"control-plane-machine-set-operator-7d5dcfcb76-cshv4","serviceAccountName":"control-plane-machine-set-operator"}
{"name":"machine-api-controllers-54c98b4bb4-sgvhd","serviceAccountName":"machine-api-controllers"}
{"name":"machine-api-operator-7fb4f8c4fb-7gr2t","serviceAccountName":"machine-api-operator"}

So each pod has its own ServiceAccount. Good. Checking on bound roles:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-baremetal-operator/395/pull-ci-openshift-cluster-baremetal-operator-master-e2e-agnostic-ovn/1737988020168036352/artifacts/e2e-agnostic-ovn/gather-extra/artifacts/rolebindings.json | jq -c '.items[] | select(.metadata.namespace == "openshift-machine-api") | {name: .metadata.name, subjects, roleRef, description: .metadata.annotations["openshift.io/description"]}'
{"name":"cluster-autoscaler","subjects":[{"kind":"ServiceAccount","name":"cluster-autoscaler","namespace":"openshift-machine-api"}],"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"Role","name":"cluster-autoscaler"},"description":null}
{"name":"cluster-autoscaler-operator","subjects":[{"kind":"ServiceAccount","name":"cluster-autoscaler-operator","namespace":"openshift-machine-api"}],"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"Role","name":"cluster-autoscaler-operator"},"description":null}
{"name":"cluster-baremetal-operator","subjects":[{"kind":"ServiceAccount","name":"cluster-baremetal-operator","namespace":"openshift-machine-api"}],"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"Role","name":"cluster-baremetal-operator"},"description":null}
{"name":"control-plane-machine-set-operator","subjects":[{"kind":"ServiceAccount","name":"control-plane-machine-set-operator","namespace":"openshift-machine-api"}],"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"Role","name":"control-plane-machine-set-operator"},"description":null}
{"name":"machine-api-controllers","subjects":[{"kind":"ServiceAccount","name":"machine-api-controllers","namespace":"openshift-machine-api"}],"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"Role","name":"machine-api-controllers"},"description":null}
{"name":"machine-api-operator","subjects":[{"kind":"ServiceAccount","name":"machine-api-operator","namespace":"openshift-machine-api"}],"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"Role","name":"machine-api-operator"},"description":null}
{"name":"prometheus-k8s-cluster-autoscaler-operator","subjects":[{"kind":"ServiceAccount","name":"prometheus-k8s","namespace":"openshift-monitoring"}],"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"Role","name":"prometheus-k8s-cluster-autoscaler-operator"},"description":null}
{"name":"prometheus-k8s-cluster-baremetal-operator","subjects":[{"kind":"ServiceAccount","name":"prometheus-k8s","namespace":"openshift-monitoring"}],"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"Role","name":"prometheus-k8s-cluster-baremetal-operator"},"description":null}
{"name":"prometheus-k8s-machine-api-operator","subjects":[{"kind":"ServiceAccount","name":"prometheus-k8s","namespace":"openshift-monitoring"}],"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"Role","name":"prometheus-k8s-machine-api-operator"},"description":null}
{"name":"system:deployers","subjects":[{"kind":"ServiceAccount","name":"deployer","namespace":"openshift-machine-api"}],"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"ClusterRole","name":"system:deployer"},"description":"Allows deploymentconfigs in this namespace to rollout pods in this namespace.  It is auto-managed by a controller; remove subjects to disable."}
{"name":"system:image-builders","subjects":[{"kind":"ServiceAccount","name":"builder","namespace":"openshift-machine-api"}],"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"ClusterRole","name":"system:image-builder"},"description":"Allows builds in this namespace to push images to this namespace.  It is auto-managed by a controller; remove subjects to disable."}
{"name":"system:image-pullers","subjects":[{"apiGroup":"rbac.authorization.k8s.io","kind":"Group","name":"system:serviceaccounts:openshift-machine-api"}],"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"ClusterRole","name":"system:image-puller"},"description":"Allows all pods in this namespace to pull images from this namespace.  It is auto-managed by a controller; remove subjects to disable."}

openshift.io/description seems unnecessarily OpenShift-specific, when upstream declares kubernetes.io/description. I'd expect standardizing on the upstream annotation would make it easier for people and robots to find the description, by sparing them from wondering "will this be a Kubernetes description, or an OpenShift description?". But that's orthogonal to RBAC.

And it seems a bit strange that cluster-autoscaler-operator, cluster-baremetal-operator, and machine-api-operator set up prometheus-k8s-* RoleBindings, while control-plane-machine-set-operator does not. But perhaps CPMS recycles prometheus-k8s-machine-api-operator or something.

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-baremetal-operator/395/pull-ci-openshift-cluster-baremetal-operator-master-e2e-agnostic-ovn/1737988020168036352/artifacts/e2e-agnostic-ovn/gather-extra/artifacts/roles.json | jq -c '.items[] | select(.metadata.namespace == "openshift-machine-api") | ([.rules[] | select(.resources | tostring | contains("leases"))]) | select(length > 0)' | sort | uniq -c
      1 [{"apiGroups":["coordination.k8s.io"],"resources":["leases"],"verbs":["get","list","create","update"]}]
      3 [{"apiGroups":["coordination.k8s.io"],"resources":["leases"],"verbs":["get","list","watch","create","update","patch","delete"]}]
      1 [{"apiGroups":[""],"resources":["configmaps","leases","secrets","services"],"verbs":["create","delete","get","list","patch","update","watch"]}]

Aha, I'd missed "apiGroups":["coordination.k8s.io"]. Hopefully fixed with 8f95164 -> f8e3ac9.
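
In other words, the Role bound to the cluster-baremetal-operator ServiceAccount needs a leases rule along these lines (a sketch mirroring the verbs granted by the other openshift-machine-api roles above; the Role name is illustrative, standing in for whichever Role the ServiceAccount is actually bound to):

  apiVersion: rbac.authorization.k8s.io/v1
  kind: Role
  metadata:
    name: cluster-baremetal-operator  # illustrative
    namespace: openshift-machine-api
  rules:
  - apiGroups:
    - coordination.k8s.io  # the apiGroup I'd missed
    resources:
    - leases
    verbs:
    - get
    - list
    - watch
    - create
    - update
    - patch
    - delete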

…ement to Role

Avoid:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-baremetal-operator/395/pull-ci-openshift-cluster-baremetal-operator-master-e2e-metal-ipi-ovn-ipv6/1737877679555743744/artifacts/e2e-metal-ipi-ovn-ipv6/gather-extra/artifacts/pods/openshift-machine-api_cluster-baremetal-operator-f9696655-p5dxv_cluster-baremetal-operator.log | tail -n1
  E1221 19:31:10.136115       1 leaderelection.go:332] error retrieving resource lock openshift-machine-api/cluster-baremetal-operator: leases.coordination.k8s.io "cluster-baremetal-operator" is forbidden: User "system:serviceaccount:openshift-machine-api:cluster-baremetal-operator" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "openshift-machine-api"

I'm also removing some old (and previously unused) leader-election
kustomize manifests.  It's hard for me to imagine someone needing that
kind of composability out of this operator, and collapsing into a
single role both simplifies maintenance and removes the need to use
unwieldy names like cluster-baremetal-operator-leader-election-role to
stand out vs. the other leader-election roles that the
openshift-machine-api namespace might hold.

The patchesJson6902 Kustomize syntax is documented in [1]:

  Patches can be used to apply different customizations to
  Resources. Kustomize supports different patching mechanisms through
  patchesStrategicMerge and patchesJson6902.
  ...
  Not all Resources or fields support strategic merge patches. To
  support modifying arbitrary fields in arbitrary Resources, Kustomize
  offers applying JSON patch [2] through patchesJson6902. To find the
  correct Resource for a Json patch, the group, version, kind and name
  of that Resource need to be specified in kustomization.yaml. For
  example, increasing the replica number of a Deployment object can
  also be done through patchesJson6902.

The /rules/- path means "append the value to the existing rules
array", as described in [3,4].
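
A hypothetical kustomization fragment using that mechanism would look
roughly like the following (file and Role names here are illustrative,
not this repository's actual layout):

  # kustomization.yaml
  patchesJson6902:
  - target:
      group: rbac.authorization.k8s.io
      version: v1
      kind: Role
      name: cluster-baremetal-operator
    path: add-leases-rule.yaml

  # add-leases-rule.yaml (RFC 6902 operations in YAML form)
  - op: add
    path: /rules/-  # append to the existing rules array
    value:
      apiGroups:
      - coordination.k8s.io
      resources:
      - leases
      verbs:
      - get
      - list
      - watch
      - create
      - update
      - patch
      - delete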

[1]: https://kubernetes.io/docs/tasks/manage-kubernetes-objects/kustomization/#customizing
[2]: https://datatracker.ietf.org/doc/html/rfc6902
[3]: https://datatracker.ietf.org/doc/html/rfc6902#section-4.1
[4]: https://datatracker.ietf.org/doc/html/rfc6902#appendix-A.16
… SetupWithManager

SetupWithManager(mgr) is called before mgr.Start() in main(), so it's
running before the leader lease has been acquired.  We don't want to
be writing to anything before we acquire the lease, because that could
cause contention with an actual lease-holder that is managing those
resources.  We can still perform read-only prechecks in
SetupWithManager.  Instead, we patiently wait until we hold the lease,
and update ClusterOperator (and anything else we manage) from the
Reconcile() function, which only runs once the manager is the leader.

This partially rolls back 2e9d117 (Ensure baremetal CO is
completely setup before Reconcile, 2020-11-30, openshift#81), but that addition
predated ensureClusterOperator being added early in Reconcile in
4f2d314 (Make sure ensureClusterOperator() is called before its
status is updated, 2020-12-15, openshift#71):

  $ git log --oneline | grep -n '2e9d1177\|4f2d3141'
  468:4f2d3141 Make sure ensureClusterOperator() is called before its status is updated
  506:2e9d1177 Ensure baremetal CO is completely setup before Reconcile

So the ensureClusterOperator call in SetupWithManager is no longer
needed.

And this partially rolls back 8798044 (Handle case when
Provisioning CR is absent on the Baremetal platform, 2020-11-30, openshift#81).
That "we're enabled, but there isn't a Provisioning custom resource
yet" handling happens continually in Reconcile (where the write will
be protected by the operator holding the lease).

Among other improvements, this change will prevent a
nominally-successful install where the operator never acquired a lease
[1]:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-baremetal-operator/395/pull-ci-openshift-cluster-baremetal-operator-master-e2e-agnostic-ovn/1737988020168036352/artifacts/e2e-agnostic-ovn/gather-extra/artifacts/pods/openshift-machine-api_cluster-baremetal-operator-5c57b874f5-s9zmq_cluster-baremetal-operator.log >cbo.log
  $ head -n4 cbo.log
  I1222 01:05:34.274563       1 listener.go:44] controller-runtime/metrics "msg"="Metrics server is starting to listen" "addr"=":8080"
  I1222 01:05:34.318283       1 webhook.go:104] WebhookDependenciesReady: everything ready for webhooks
  I1222 01:05:34.403202       1 clusteroperator.go:217] "new CO status" reason="WaitingForProvisioningCR" processMessage="" message="Waiting for Provisioning CR on BareMetal Platform"
  I1222 01:05:34.430552       1 provisioning_controller.go:620] "Network stack calculation" NetworkStack=1
  $ tail -n2 cbo.log
  E1222 02:36:57.323869       1 leaderelection.go:332] error retrieving resource lock openshift-machine-api/cluster-baremetal-operator: leases.coordination.k8s.io "cluster-baremetal-operator" is forbidden: User "system:serviceaccount:openshift-machine-api:cluster-baremetal-operator" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "openshift-machine-api"
  E1222 02:37:00.248442       1 leaderelection.go:332] error retrieving resource lock openshift-machine-api/cluster-baremetal-operator: leases.coordination.k8s.io "cluster-baremetal-operator" is forbidden: User "system:serviceaccount:openshift-machine-api:cluster-baremetal-operator" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "openshift-machine-api"

but still managed to write Available=True (with that 'new CO status' line):

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-baremetal-operator/395/pull-ci-openshift-cluster-baremetal-operator-master-e2e-agnostic-ovn/1737988020168036352/artifacts/e2e-agnostic-ovn/gather-extra/artifacts/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "baremetal").status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
  2023-12-22T01:05:34Z Progressing=False WaitingForProvisioningCR:
  2023-12-22T01:05:34Z Degraded=False :
  2023-12-22T01:05:34Z Available=True WaitingForProvisioningCR: Waiting for Provisioning CR on BareMetal Platform
  2023-12-22T01:05:34Z Upgradeable=True :
  2023-12-22T01:05:34Z Disabled=False :

"I'll never get this lease, and I need a lease to run all my
controllers" doesn't seem very Available=True to me, and with this
commit, we won't touch the ClusterOperator and the install will time
out.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-baremetal-operator/395/pull-ci-openshift-cluster-baremetal-operator-master-e2e-agnostic-ovn/1737988020168036352
@wking
Member Author

wking commented Dec 22, 2023

This ingress flake is unrelated. Otherwise CI is looking pretty good now, so probably ready for a fresh review round when folks have time.

Contributor

openshift-ci bot commented Dec 22, 2023

@wking: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-metal-ipi-upgrade-ovn-ipv6 2f98d8c link false /test e2e-metal-ipi-upgrade-ovn-ipv6
ci/prow/e2e-metal-ipi-sdn-upgrade 2f98d8c link false /test e2e-metal-ipi-sdn-upgrade

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@honza
Member

honza commented Jan 3, 2024

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 3, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit a5c81df into openshift:master Jan 3, 2024
15 of 17 checks passed
@openshift-ci-robot
Contributor

@wking: Jira Issue OCPBUGS-25766: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-25766 has been moved to the MODIFIED state.

In response to this:


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@wking wking deleted the enable-leader-election branch January 3, 2024 15:46
@openshift-bot

[ART PR BUILD NOTIFIER]

This PR has been included in build ose-cluster-baremetal-operator-container-v4.16.0-202401031811.p0.ga5c81df.assembly.stream for distgit ose-cluster-baremetal-operator.
All builds following this will include this PR.

@rhjanders
Contributor

/cherry-pick release-4.15

@openshift-cherrypick-robot

@rhjanders: new pull request created: #402

In response to this:

/cherry-pick release-4.15

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
