OCPBUGS-16905: install: Recreate and delayed default ServiceAccount deletion #3895
Conversation
Single-node updates from 4.13 to 4.14 are having slow leader handoffs during operator Deployment updates [1]. For example [2], this Loki query:

    {invoker="openshift-internal-ci/periodic-ci-openshift-release-master-nightly-4.14-upgrade-from-stable-4.13-e2e-aws-upgrade-ovn-single-node/1692132139799154688"} | unpack | pod="machine-config-operator-74d7475455-cgcmg" |= "leaderelection"

gives:

    machine-config-operator-7987b6bcdd-6zvrt
    I0817 11:26:14.747630 1 leaderelection.go:248] attempting to acquire leader lease openshift-machine-config-operator/machine-config...
    I0817 11:26:14.756376 1 leaderelection.go:258] successfully acquired lease openshift-machine-config-operator/machine-config
    E0817 11:31:15.153289 1 leaderelection.go:330] error retrieving resource lock openshift-machine-config-operator/machine-config: the server was unable to return a response in the time allotted, but may still be processing the request (get configmaps machine-config)
    E0817 11:37:42.204539 1 leaderelection.go:330] error retrieving resource lock openshift-machine-config-operator/machine-config: Get "https://172.30.0.1:443/api/v1/namespaces/openshift-machine-config-operator/configmaps/machine-config": net/http: TLS handshake timeout - error from a previous attempt: unexpected EOF
    E0817 11:53:05.221213 1 leaderelection.go:330] error retrieving resource lock openshift-machine-config-operator/machine-config: Get "https://172.30.0.1:443/api/v1/namespaces/openshift-machine-config-operator/configmaps/machine-config": net/http: TLS handshake timeout - error from a previous attempt: unexpected EOF
    machine-config-operator-74d7475455-cgcmg
    I0817 12:21:41.019562 1 leaderelection.go:122] The leader election gives 4 retries and allows for 30s of clock skew. The kube-apiserver downtime tolerance is 78s. Worst non-graceful lease acquisition is 2m43s. Worst graceful lease acquisition is {26s}.
    I0817 12:21:41.036459 1 leaderelection.go:245] attempting to acquire leader lease openshift-machine-config-operator/machine-config...
    machine-config-operator-7987b6bcdd-6zvrt
    E0817 12:21:41.780276 1 leaderelection.go:306] Failed to release lock: configmaps "machine-config" is forbidden: User "system:serviceaccount:openshift-machine-config-operator:default" cannot update resource "configmaps" in API group "" in the namespace "openshift-machine-config-operator"
    machine-config-operator-74d7475455-cgcmg
    I0817 12:27:27.767746 1 leaderelection.go:255] successfully acquired lease openshift-machine-config-operator/machine-config
    I0817 12:35:29.649939 1 leaderelection.go:122] The leader election gives 4 retries and allows for 30s of clock skew. The kube-apiserver downtime tolerance is 78s. Worst non-graceful lease acquisition is 2m43s. Worst graceful lease acquisition is {26s}.
    I0817 12:35:29.685549 1 leaderelection.go:245] attempting to acquire leader lease openshift-machine-config-operator/machine-config...
    I0817 12:35:29.696394 1 leaderelection.go:255] successfully acquired lease openshift-machine-config-operator/machine-config

That shows two lease handoffs:

1. During the update from 4.13 to 4.14, the cluster-version operator bumps the MCO Deployment to roll out the 4.14 image and other manifest changes.
   a. The outgoing 4.13 pod is using the default ServiceAccount, because it predates the ServiceAccount changes from [3,4].
   b. The CVO deletes the default ServiceAccount, because its manifest filename sorts it before the MCO Deployment.
   c. The CVO bumps the MCO Deployment.
   d. The MCO Deployment defaults to RollingUpdate [5] with a default maxSurge of 25% rounded up [6]. That maxSurge default is very old, dating back to at least the version of Kubernetes vendored by OpenShift 4.1 [7].
   e. The Deployment controller surges the incoming pod, and at 12:21:41.036459 it checks to see if the lease is available, but it isn't, because the outgoing pod is still running.
   f. The Deployment controller TERMs the outgoing containers, and at 12:21:41.780276 the outgoing pod complains that it is no longer authorized to delete the lease ConfigMap, because the default ServiceAccount was deleted in step 1.b.
   g. The incoming pod acquires the lease 5m46s later at 12:27:27, when the outgoing container's final lease goes stale.
2. When the single node goes down for a reboot to roll out the 4.14 RHCOS, the new container coming up at 12:35:29 acquires the lease smoothly.

This commit improves the Deployment rollout with two changes:

* Shifting the default ServiceAccount deletion to a 90 manifest moves it behind 0000_80_machine-config-operator_04_deployment.yaml and 0000_80_machine-config-operator_06_clusteroperator.yaml. Moving behind the ClusterOperator ensures that deletion is not initiated until the incoming operator pod has updated the ClusterOperator to confirm the update has completed. Positioning between the Deployment and the ClusterOperator would have put the deletion after the CVO bumped the Deployment, but might still have raced against the outgoing operator pod actually shutting down. The new position should allow the outgoing pod to gracefully remove the ConfigMap and Lease resources during graceful shutdowns.

* Shifting the Deployment to Recreate [8]. But, as discussed in [8]:

      Note: This will only guarantee Pod termination previous to creation for upgrades. If you upgrade a Deployment, all Pods of the old revision will be terminated immediately. Successful removal is awaited before any Pod of the new revision is created. If you manually delete a Pod, the lifecycle is controlled by the ReplicaSet and the replacement will be created immediately (even if the old Pod is still in a Terminating state). If you need an "at most" guarantee for your Pods, you should consider using a StatefulSet.

  Still, Recreate will avoid the current milliseconds of overlap between the incoming pod's first attempt to acquire the lease and the outgoing pod's attempt to release it during Deployment rollouts. Other MCO container restarts, drains for multi-control-plane rolls, and the like may still have racey overlap, but that should be limited to the "Worst graceful lease acquisition is {26s}" latency (at least with the current 4.14 single-node lease configuration).

[1]: https://issues.redhat.com/browse/OCPBUGS-16905
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-upgrade-from-stable-4.13-e2e-aws-upgrade-ovn-single-node/1692132139799154688
[3]: https://issues.redhat.com/browse/OCPBUGS-10924
[4]: openshift@ace637f
[5]: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#strategy
[6]: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#max-surge
[7]: openshift/api:

    $ git --no-pager grep -B12 maxSurge origin/release-4.1 -- vendor/k8s.io/api/apps/v1/types.go
    origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go- // The maximum number of pods that can be scheduled above the desired number of
    origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go- // pods.
    origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go- // Value can be an absolute number (ex: 5) or a percentage of desired pods (ex: 10%).
    origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go- // This can not be 0 if MaxUnavailable is 0.
    origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go- // Absolute number is calculated from percentage by rounding up.
    origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go- // Defaults to 25%.
    origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go- // Example: when this is set to 30%, the new ReplicaSet can be scaled up immediately when
    origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go- // the rolling update starts, such that the total number of old and new pods do not exceed
    origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go- // 130% of desired pods. Once old pods have been killed,
    origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go- // new ReplicaSet can be scaled up further, ensuring that total number of pods running
    origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go- // at any time during the update is at most 130% of desired pods.
    origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go- // +optional
    origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go: MaxSurge *intstr.IntOrString `json:"maxSurge,omitempty" protobuf:"bytes,2,opt,name=maxSurge"

[8]: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#recreate-deployment
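For illustration, here is a minimal Go sketch of the two changes described above, using the k8s.io/api/apps/v1 types quoted in [7]. The strategy constant is the real API value; the 90 manifest filename is a hypothetical placeholder, and the sort only demonstrates why a 90 manifest is applied after the two 0000_80 manifests (the CVO walks manifests in lexicographic filename order):

```go
// Sketch only: illustrates the Recreate strategy and the manifest-name
// ordering this commit relies on; it is not the actual MCO manifest or CVO code.
package main

import (
	"fmt"
	"sort"

	appsv1 "k8s.io/api/apps/v1"
)

func main() {
	// Second change: Recreate instead of the default RollingUpdate+maxSurge, so
	// the outgoing operator pod exits (and releases its lease) before the
	// incoming pod is created.
	strategy := appsv1.DeploymentStrategy{Type: appsv1.RecreateDeploymentStrategyType}
	fmt.Println("strategy:", strategy.Type) // "Recreate"

	// First change: a 0000_90 filename sorts after the Deployment and
	// ClusterOperator manifests, so the default-ServiceAccount deletion is only
	// reached once the updated operator has reported in via the ClusterOperator.
	manifests := []string{
		"0000_90_machine-config-operator_00_default-serviceaccount-delete.yaml", // hypothetical name
		"0000_80_machine-config-operator_04_deployment.yaml",
		"0000_80_machine-config-operator_06_clusteroperator.yaml",
	}
	sort.Strings(manifests)
	fmt.Println("apply order:", manifests)
}
```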
@wking: This pull request references Jira Issue OCPBUGS-16905, which is invalid. The bug has been updated to refer to the pull request using the external bug tracker.
Testing in Cluster Bot with:

$ oc image info registry.build05.ci.openshift.org/ci-ln-itrhw1t/release:latest | grep Digest
Digest: sha256:0b85c9dc133c734cc348c2a26c7b32f826f4494245fa69de2c0776d3181b8f70
$ oc adm upgrade --force --allow-explicit-upgrade --to-image registry.build05.ci.openshift.org/ci-ln-itrhw1t/release@sha256:0b85c9dc133c734cc348c2a26c7b32f826f4494245fa69de2c0776d3181b8f70

And then watching the original MCO pod:

$ oc -n openshift-machine-config-operator logs -l k8s-app=machine-config-operator --tail -1 --follow
...
I0829 18:46:40.004700 1 helpers.go:93] Shutting down due to: terminated
I0829 18:46:40.004859 1 helpers.go:96] Context cancelled
I0829 18:46:40.004947 1 operator.go:286] Shutting down MachineConfigOperator
I0829 18:46:40.042567 1 start.go:115] Stopped leading. Terminating.

And the incoming MCO pod after the Deployment bump:

$ oc -n openshift-machine-config-operator logs -l k8s-app=machine-config-operator --tail -1 --follow
Defaulted container "machine-config-operator" out of: machine-config-operator, kube-rbac-proxy
I0829 18:46:52.063271 1 start.go:49] Version: 4.14.0-0.test-2023-08-29-180918-ci-ln-itrhw1t-latest (Raw: machine-config-daemon-4.6.0-202006240615.p0-2308-gbc165516-dirty, Hash: bc1655163ad3a944a899e05f85f175b4258907a0)
I0829 18:46:52.063533 1 leaderelection.go:122] The leader election gives 4 retries and allows for 30s of clock skew. The kube-apiserver downtime tolerance is 78s. Worst non-graceful lease acquisition is 2m43s. Worst graceful lease acquisition is {26s}.
I0829 18:46:52.063620 1 metrics.go:74] Registering Prometheus metrics
I0829 18:46:52.063685 1 metrics.go:81] Starting metrics listener on 127.0.0.1:8797
I0829 18:46:52.080335 1 leaderelection.go:245] attempting to acquire leader lease openshift-machine-config-operator/machine-config...
I0829 18:46:52.086623 1 leaderelection.go:255] successfully acquired lease openshift-machine-config-operator/machine-config
...

So that looks like a nice, clean, 12s handoff to me on the Deployment roll. In-cluster Prometheus confirms that this commit is having its intended effect on the Deployment bump. The Cluster Bot job hasn't gotten to the single-node reboot phase yet, but neither of the changes I'm making here is expected to affect that handoff.
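As an aside, all of the numbers in the leaderelection.go:122 line above ("4 retries", "30s of clock skew", "78s", "2m43s", "{26s}") can be reproduced from a single set of lease timings. A small sketch, assuming LeaseDuration=137s, RenewDeadline=107s, and RetryPeriod=26s (my assumption, not values read from the MCO source; they are simply timings that reproduce the logged numbers):

```go
// Sketch: reproduce the numbers printed by the leaderelection.go:122 log line
// from assumed lease timings (LeaseDuration=137s, RenewDeadline=107s,
// RetryPeriod=26s). The derivation formulas are mine, not quoted from the MCO.
package main

import (
	"fmt"
	"time"
)

func main() {
	leaseDuration := 137 * time.Second
	renewDeadline := 107 * time.Second
	retryPeriod := 26 * time.Second

	retries := int(renewDeadline / retryPeriod)                 // 4 retries
	clockSkew := leaseDuration - renewDeadline                  // 30s
	downtimeTolerance := time.Duration(retries-1) * retryPeriod // 78s
	worstNonGraceful := leaseDuration + retryPeriod             // 2m43s
	worstGraceful := retryPeriod                                // 26s

	fmt.Println(retries, clockSkew, downtimeTolerance, worstNonGraceful, worstGraceful)
}
```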
/jira refresh
@wking: This pull request references Jira Issue OCPBUGS-16905, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug. Requesting review from QA contact.
/lgtm
Makes sense for us and should be an otherwise safe fix. I don't think we need to merge through the critical-fixes-only process.
Would be good to do some verification on SNO to see whether this resolves both issues, but I don't think this otherwise blocks the 4.14 release.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wking, yuqi-zhang

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Verified using SNO on GCP. Upgrade from 4.13 to 4.14+fix.

Before the operator pod update, we can see these logs in the machine-config-operator pod:

After the operator pod update, we can see these logs:

It takes 13 seconds from when the old pod released the lease until the new pod took it. After the pod restart when applying the MCs, we can see these logs:

Everything seems ok. We can add the qe-approve label. Thank you @wking for the detailed verification steps, I really appreciate it. Thanks!!!

/label qe-approved
/retest

/jira refresh

The requirements for Jira bugs have changed (Jira issues linked to PRs on main branch need to target different OCP), recalculating validity.
@openshift-bot: This pull request references Jira Issue OCPBUGS-16905, which is invalid.
/jira refresh

@wking: This pull request references Jira Issue OCPBUGS-16905, which is invalid.
/jira refresh

@wking: This pull request references Jira Issue OCPBUGS-16905, which is valid. 3 validation(s) were run on this bug. Requesting review from QA contact.
/cherry-pick release-4.14

@yuqi-zhang: once the present PR merges, I will cherry-pick it on top of release-4.14 in a new PR and assign it to you.
@wking: The following test failed.
Merged commit 0aa8e03 into openshift:master.
@wking: Jira Issue OCPBUGS-16905: All pull requests linked via external trackers have merged. Jira Issue OCPBUGS-16905 has been moved to the MODIFIED state.
@yuqi-zhang: new pull request created: #3920
…ader election

The option has been available for years:

    $ git blame main.go | grep enable-leader-election
    dcbe86f (Sandhya Dasu 2020-08-18 21:09:29 -0400 72) flag.BoolVar(&enableLeaderElection, "enable-leader-election", false,

and without it, overlapping operator pods can crash-loop [1]:

    : [sig-cluster-lifecycle] pathological event should not see excessive Back-off restarting failed containers 0s
    { event [namespace/openshift-machine-api node/ip-10-0-62-147.us-west-2.compute.internal pod/cluster-baremetal-operator-574577fbcb-z8nd4 hmsg/bf39bb17ae - Back-off restarting failed container cluster-baremetal-operator in pod cluster-baremetal-operator-574577fbcb-z8nd4_openshift-machine-api(441969c1-b430-412c-b67f-4ae2f7797f4f)] happened 26 times
      event [namespace/openshift-machine-api node/ip-10-0-62-147.us-west-2.compute.internal pod/cluster-baremetal-operator-574577fbcb-z8nd4 hmsg/bf39bb17ae - Back-off restarting failed container cluster-baremetal-operator in pod cluster-baremetal-operator-574577fbcb-z8nd4_openshift-machine-api(441969c1-b430-412c-b67f-4ae2f7797f4f)] happened 51 times}

while fighting each other over the same ClusterOperator status:

    $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-aws-ovn-upgrade/1737335551998038016/artifacts/e2e-aws-ovn-upgrade/gather-audit-logs/artifacts/audit-logs.tar | tar -xz --strip-components=2
    $ zgrep -h '"resource":"clusteroperators","name":"baremetal"' kube-apiserver/*audit*log.gz | jq -r 'select(.verb == "create" or .verb == "update") | .stageTimestamp + " " + .verb + " " + (.responseStatus.code | tostring) + " " + (.objectRef.subresource) + " " + .user.username + " " + .user.extra["authentication.kubernetes.io/pod-name"][0]' | grep 'T06:08:.*cluster-baremetal-operator' | sort
    2023-12-20T06:08:21.757799Z update 200 status system:serviceaccount:openshift-machine-api:cluster-baremetal-operator cluster-baremetal-operator-574577fbcb-z8nd4
    2023-12-20T06:08:21.778638Z update 200 status system:serviceaccount:openshift-machine-api:cluster-baremetal-operator cluster-baremetal-operator-7fbb57959b-s9v9g
    2023-12-20T06:08:21.780378Z update 409 status system:serviceaccount:openshift-machine-api:cluster-baremetal-operator cluster-baremetal-operator-574577fbcb-z8nd4
    2023-12-20T06:08:21.790000Z update 200 status system:serviceaccount:openshift-machine-api:cluster-baremetal-operator cluster-baremetal-operator-7fbb57959b-s9v9g
    2023-12-20T06:08:21.802780Z update 200 status system:serviceaccount:openshift-machine-api:cluster-baremetal-operator cluster-baremetal-operator-7fbb57959b-s9v9g

Using a leader lock will avoid this contention, and the system should be able to coast through brief moments after an outgoing leader leaves until a replacement leader picks things back up.

I'm also setting a Recreate strategy [2], because:

1. Incoming pod surged by the default Deployment strategy.
2. Incoming pod attempts to acquire the Lease, but the outgoing pod is holding it.
3. Outgoing pod releases the lease and exits.
4. Incoming pod tries again, and this time acquires the lease.

can be slow in the 3-to-4 phase, while:

1. Outgoing pod releases the lease and exits.
2. Incoming pod created, scheduled, and acquires the lease.

tends to be faster. And again, the component should be able to coast through small durations without a functioning leader. See openshift/machine-config-operator@7530ded86 (install: Recreate and delayed default ServiceAccount deletion, 2023-08-29, openshift/machine-config-operator#3895) for another example of how Recreate can help that way.

To get the leader election working, I'm also setting LeaderElectionID and LeaderElectionNamespace, pattern-matching from [3]. Using cluster-baremetal-operator as the ID avoids collisions with baremetal-operator [4]. COMPONENT_NAMESPACE comes from:

    - name: COMPONENT_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace

using the pattern documented in [5]. We've been asking for that environment variable since f368a4f (Add sa, deployment and co manifests, 2020-09-29, openshift#25), although grepping through history, I don't see anything consuming it. Perhaps it's somehow feeding controllers.ComponentNamespace, which is what I'm using.

[1]: https://issues.redhat.com/browse/OCPBUGS-25766
[2]: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#recreate-deployment
[3]: https://github.com/openshift/baremetal-operator/blob/749609dd0fb858be72a783c5d87853cad2d303db/main.go#L207-L208
[4]: https://github.com/openshift/baremetal-operator/blob/749609dd0fb858be72a783c5d87853cad2d303db/main.go#L71
[5]: https://kubernetes.io/docs/tasks/inject-data-application/environment-variable-expose-pod-information/
Carrying and extending #3884.