OCPBUGS-16905: install: Recreate and delayed default ServiceAccount deletion #3895

Merged

Conversation

wking
Member

@wking wking commented Aug 29, 2023

Carrying and extending #3884.

Single-node updates from 4.13 to 4.14 are having slow leader handoffs during operator Deployment updates. For example, this CI run in Loki:

{invoker="openshift-internal-ci/periodic-ci-openshift-release-master-nightly-4.14-upgrade-from-stable-4.13-e2e-aws-upgrade-ovn-single-node/1692132139799154688"} | unpack | pod="machine-config-operator-74d7475455-cgcmg" |= "leaderelection"

gives:

  machine-config-operator-7987b6bcdd-6zvrt
    I0817 11:26:14.747630       1 leaderelection.go:248] attempting to acquire leader lease openshift-machine-config-operator/machine-config...
    I0817 11:26:14.756376       1 leaderelection.go:258] successfully acquired lease openshift-machine-config-operator/machine-config
    E0817 11:31:15.153289       1 leaderelection.go:330] error retrieving resource lock openshift-machine-config-operator/machine-config: the server was unable to return a response in the time allotted, but may still be processing the request (get configmaps machine-config)
    E0817 11:37:42.204539       1 leaderelection.go:330] error retrieving resource lock openshift-machine-config-operator/machine-config: Get "https://172.30.0.1:443/api/v1/namespaces/openshift-machine-config-operator/configmaps/machine-config": net/http: TLS handshake timeout - error from a previous attempt: unexpected EOF
    E0817 11:53:05.221213       1 leaderelection.go:330] error retrieving resource lock openshift-machine-config-operator/machine-config: Get "https://172.30.0.1:443/api/v1/namespaces/openshift-machine-config-operator/configmaps/machine-config": net/http: TLS handshake timeout - error from a previous attempt: unexpected EOF
  machine-config-operator-74d7475455-cgcmg
    I0817 12:21:41.019562       1 leaderelection.go:122] The leader election gives 4 retries and allows for 30s of clock skew. The kube-apiserver downtime tolerance is 78s. Worst non-graceful lease acquisition is 2m43s. Worst graceful lease acquisition is {26s}.
    I0817 12:21:41.036459       1 leaderelection.go:245] attempting to acquire leader lease openshift-machine-config-operator/machine-config...
  machine-config-operator-7987b6bcdd-6zvrt
    E0817 12:21:41.780276       1 leaderelection.go:306] Failed to release lock: configmaps "machine-config" is forbidden: User "system:serviceaccount:openshift-machine-config-operator:default" cannot update resource "configmaps" in API group "" in the namespace "openshift-machine-config-operator"
  machine-config-operator-74d7475455-cgcmg
    I0817 12:27:27.767746       1 leaderelection.go:255] successfully acquired lease openshift-machine-config-operator/machine-config
    I0817 12:35:29.649939       1 leaderelection.go:122] The leader election gives 4 retries and allows for 30s of clock skew. The kube-apiserver downtime tolerance is 78s. Worst non-graceful lease acquisition is 2m43s. Worst graceful lease acquisition is {26s}.
    I0817 12:35:29.685549       1 leaderelection.go:245] attempting to acquire leader lease openshift-machine-config-operator/machine-config...
    I0817 12:35:29.696394       1 leaderelection.go:255] successfully acquired lease openshift-machine-config-operator/machine-config

That shows two lease handoffs:

  1. During the update from 4.13 to 4.14, the cluster-version operator bumps the MCO Deployment to roll out the 4.14 image and other manifest changes.
    1. The outgoing 4.13 operator is using the default ServiceAccount, because it predates OCPBUGS-10924: Switch default SA to machine-config-operator #3740
    2. The CVO deletes the default ServiceAccount, because its manifest filename sorts it before the MCO Deployment.
    3. The CVO bumps the MCO Deployment.
    4. The MCO Deployment defaults to RollingUpdate with a default maxSurge of 25% rounded up. That maxSurge default is very old, dating back to at least the version of Kubernetes vendored by OpenShift 4.1. (A sketch of the effective defaults follows this list.)
    5. The Deployment controller surges the incoming pod, and at 12:21:41.036459 the incoming pod checks whether the lease is available; it isn't, because the outgoing pod is still running.
    6. The Deployment controller TERMs the outgoing containers, and at 12:21:41.780276 the outgoing pod complains that it is no longer authorized to release its lock on the lease ConfigMap, because the default ServiceAccount was deleted in step 2.
    7. The incoming pod acquires the lease 5m46s later at 12:27:27, when the outgoing container's final lease goes stale.
  2. When the single node goes down for a reboot to roll out the 4.14 RHCOS, the new container coming up at 12:35:29 acquires the lease smoothly.
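
For concreteness on step 4: with a single replica, the 25% defaults work out to one surged pod and zero pods allowed unavailable, so the Deployment controller starts the incoming pod while the outgoing pod is still running. A sketch of the effective (defaulted) strategy, for illustration rather than a copy of the 4.13 manifest:

  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%        # ceil(1 * 0.25) = 1: one extra (incoming) pod allowed
      maxUnavailable: 25%  # floor(1 * 0.25) = 0: the old pod cannot be taken down first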

This commit improves the deployment rollout with two changes:

  • Shifting the default ServiceAccount deletion to a 90 manifest moves it behind 0000_80_machine-config-operator_04_deployment.yaml and 0000_80_machine-config-operator_06_clusteroperator.yaml. Moving behind the ClusterOperator ensures that deletion is not initiated until the incoming operator pod has updated the ClusterOperator to confirm the update has completed. Positioning between the Deployment and the ClusterOperator would have put the deletion after the CVO bumped the Deployment, but might still have raced against the outgoing operator pod actually shutting down. This should allow the outgoing pod to remove the ConfigMap and Lease resources during graceful shutdowns.

  • Shifting the Deployment to Recreate. As the Recreate docs caution, though:

    Note: This will only guarantee Pod termination previous to creation for upgrades. If you upgrade a Deployment, all Pods of the old revision will be terminated immediately. Successful removal is awaited before any Pod of the new revision is created. If you manually delete a Pod, the lifecycle is controlled by the ReplicaSet and the replacement will be created immediately (even if the old Pod is still in a Terminating state). If you need an "at most" guarantee for your Pods, you should consider using a StatefulSet.

    Still, Recreate will avoid the current milliseconds of overlap between the incoming pod's first attempt to acquire the lease and the outgoing pod's attempt to release during Deployment rollouts. Other MCO container restarts and drains-for-multi-control-plane-rolls and such may still have racy overlap, but should be limited to the:

      Worst graceful lease acquisition is {26s}.
    

    latency (at least with the current 4.14 single-node lease configuration).

Single-node updates from 4.13 to 4.14 are having slow leader handoffs
during operator Deployment updates [1].  For example [2] in Loki:

  {invoker="openshift-internal-ci/periodic-ci-openshift-release-master-nightly-4.14-upgrade-from-stable-4.13-e2e-aws-upgrade-ovn-single-node/1692132139799154688"} | unpack | pod="machine-config-operator-74d7475455-cgcmg" |= "leaderelection"

gives:

  machine-config-operator-7987b6bcdd-6zvrt
    I0817 11:26:14.747630       1 leaderelection.go:248] attempting to acquire leader lease openshift-machine-config-operator/machine-config...
    I0817 11:26:14.756376       1 leaderelection.go:258] successfully acquired lease openshift-machine-config-operator/machine-config
    E0817 11:31:15.153289       1 leaderelection.go:330] error retrieving resource lock openshift-machine-config-operator/machine-config: the server was unable to return a response in the time allotted, but may still be processing the request (get configmaps machine-config)
    E0817 11:37:42.204539       1 leaderelection.go:330] error retrieving resource lock openshift-machine-config-operator/machine-config: Get "https://172.30.0.1:443/api/v1/namespaces/openshift-machine-config-operator/configmaps/machine-config": net/http: TLS handshake timeout - error from a previous attempt: unexpected EOF
    E0817 11:53:05.221213       1 leaderelection.go:330] error retrieving resource lock openshift-machine-config-operator/machine-config: Get "https://172.30.0.1:443/api/v1/namespaces/openshift-machine-config-operator/configmaps/machine-config": net/http: TLS handshake timeout - error from a previous attempt: unexpected EOF
  machine-config-operator-74d7475455-cgcmg
    I0817 12:21:41.019562       1 leaderelection.go:122] The leader election gives 4 retries and allows for 30s of clock skew. The kube-apiserver downtime tolerance is 78s. Worst non-graceful lease acquisition is 2m43s. Worst graceful lease acquisition is {26s}.
    I0817 12:21:41.036459       1 leaderelection.go:245] attempting to acquire leader lease openshift-machine-config-operator/machine-config...
  machine-config-operator-7987b6bcdd-6zvrt
    E0817 12:21:41.780276       1 leaderelection.go:306] Failed to release lock: configmaps "machine-config" is forbidden: User "system:serviceaccount:openshift-machine-config-operator:default" cannot update resource "configmaps" in API group "" in the namespace "openshift-machine-config-operator"
  machine-config-operator-74d7475455-cgcmg
    I0817 12:27:27.767746       1 leaderelection.go:255] successfully acquired lease openshift-machine-config-operator/machine-config
    I0817 12:35:29.649939       1 leaderelection.go:122] The leader election gives 4 retries and allows for 30s of clock skew. The kube-apiserver downtime tolerance is 78s. Worst non-graceful lease acquisition is 2m43s. Worst graceful lease acquisition is {26s}.
    I0817 12:35:29.685549       1 leaderelection.go:245] attempting to acquire leader lease openshift-machine-config-operator/machine-config...
    I0817 12:35:29.696394       1 leaderelection.go:255] successfully acquired lease openshift-machine-config-operator/machine-config

That shows two lease handoffs:

1. During the update from 4.13 to 4.14, the cluster-version operator
   bumps the MCO Deployment to roll out the 4.14 image and other manifest
   changes.
   a. The outgoing 4.13 is using the default ServiceAccount, because
      it predates [3,4].
   b. The CVO deletes the default ServiceAccount, because its manifest
      filename sorts it before the MCO Deployment.
   c. The CVO bumps the MCO Deployment.
   d. The MCO Deployment defaults to RollingUpdate [5] with a default
      maxSurge of 25% rounded up [6].  That maxSurge default is very
      old, dating back to at least the version of Kubernetes vendored
      by OpenShift 4.1 [7].
   e. The Deployment controller surges the incoming pod, and at
      12:21:41.036459 the incoming pod checks whether the lease is
      available; it is not, because the outgoing pod is still running.
   f. The Deployment controller TERMs the outgoing containers, and at
      12:21:41.780276 the outgoing pod complains that it is no longer
      authorized to release its lock on the lease ConfigMap, because
      the default ServiceAccount was deleted in 1.b.
   g. The incoming pod acquires the lease 5m46s later at 12:27:27,
      when the outgoing container's final lease goes stale.
2. When the single node goes down for a reboot to roll out the 4.14
   RHCOS, the new container coming up at 12:35:29 acquires the lease
   smoothly.

This commit improves the deployment rollout with two changes:

* Shifting the default ServiceAccount deletion to a 90 manifest moves
  it behind 0000_80_machine-config-operator_04_deployment.yaml and
  0000_80_machine-config-operator_06_clusteroperator.yaml.  Moving
  behind the ClusterOperator ensures that deletion is not initiated
  until the incoming operator pod has updated the ClusterOperator to
  confirm the update has completed.  Positioning between the
  Deployment and the ClusterOperator would have put the deletion after
  the CVO bumped the Deployment, but might still have raced against
  the outgoing operator pod actually shutting down.  This should allow
  the outgoing pod to remove the ConfigMap and Lease resources during
  graceful shutdowns.
* Shifting the Deployment to Recreate [8] (a manifest sketch follows
  the reference list below).  As discussed in [8], though:

    Note: This will only guarantee Pod termination previous to
    creation for upgrades. If you upgrade a Deployment, all Pods of
    the old revision will be terminated immediately. Successful
    removal is awaited before any Pod of the new revision is
    created. If you manually delete a Pod, the lifecycle is controlled
    by the ReplicaSet and the replacement will be created immediately
    (even if the old Pod is still in a Terminating state). If you need
    an "at most" guarantee for your Pods, you should consider using a
    StatefulSet.

  Still, Recreate will avoid the current milliseconds of overlap
  between the incoming pod's first attempt to acquire the lease and
  the outgoing pod's attempt to release during Deployment rollouts.
  Other MCO container restarts and
  drains-for-multi-control-plane-rolls and such may still have racy
  overlap, but should be limited to the:

    Worst graceful lease acquisition is {26s}.

  latency (at least with the current 4.14 single-node lease
  configuration).

[1]: https://issues.redhat.com/browse/OCPBUGS-16905
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-upgrade-from-stable-4.13-e2e-aws-upgrade-ovn-single-node/1692132139799154688
[3]: https://issues.redhat.com/browse/OCPBUGS-10924
[4]: openshift@ace637f
[5]: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#strategy
[6]: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#max-surge
[7]: openshift/api $ git --no-pager grep -B12 maxSurge origin/release-4.1 -- vendor/k8s.io/api/apps/v1/types.go
     origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go-  // The maximum number of pods that can be scheduled above the desired number of
     origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go-  // pods.
     origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go-  // Value can be an absolute number (ex: 5) or a percentage of desired pods (ex: 10%).
     origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go-  // This can not be 0 if MaxUnavailable is 0.
     origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go-  // Absolute number is calculated from percentage by rounding up.
     origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go-  // Defaults to 25%.
     origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go-  // Example: when this is set to 30%, the new ReplicaSet can be scaled up immediately when
     origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go-  // the rolling update starts, such that the total number of old and new pods do not exceed
     origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go-  // 130% of desired pods. Once old pods have been killed,
     origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go-  // new ReplicaSet can be scaled up further, ensuring that total number of pods running
     origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go-  // at any time during the update is at most 130% of desired pods.
     origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go-  // +optional
     origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go:  MaxSurge *intstr.IntOrString `json:"maxSurge,omitempty" protobuf:"bytes,2,opt,name=maxSurge"
[8]: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#recreate-deployment
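
For concreteness, a hedged sketch of what the two manifest changes might look like. The strategy stanza is the standard apps/v1 Deployment field; the 90_ filename, the placeholder image, and the CVO deletion annotation are illustrative assumptions rather than copies of this PR's files.

  # 0000_80_machine-config-operator_04_deployment.yaml (excerpt):
  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: machine-config-operator
    namespace: openshift-machine-config-operator
  spec:
    replicas: 1
    selector:
      matchLabels:
        k8s-app: machine-config-operator
    strategy:
      type: Recreate              # replaces the implicit RollingUpdate default
    template:
      metadata:
        labels:
          k8s-app: machine-config-operator
      spec:
        serviceAccountName: machine-config-operator
        containers:
        - name: machine-config-operator
          image: placeholder      # the real manifest substitutes the release image
  ---
  # 0000_90_machine-config-operator_00_default-serviceaccount.yaml (hypothetical
  # name; the 90 prefix sorts it after the Deployment and ClusterOperator manifests):
  apiVersion: v1
  kind: ServiceAccount
  metadata:
    name: default
    namespace: openshift-machine-config-operator
    annotations:
      release.openshift.io/delete: "true"   # assumed spelling of the CVO deletion annotation

With a single replica, Recreate waits for the outgoing pod to terminate before creating the incoming one during Deployment rollouts, which is exactly the overlap the second bullet above is trying to remove.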
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Aug 29, 2023
@openshift-ci-robot
Contributor

@wking: This pull request references Jira Issue OCPBUGS-16905, which is invalid:

  • expected the bug to target the "4.14.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

(Quotes the PR description above.)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Aug 29, 2023
@wking
Member Author

wking commented Aug 29, 2023

Testing in Cluster Bot with:

  • launch 4.13.10 single-node,aws (logs)
  • build 4.14,openshift/machine-config-operator#3895 (logs)
$ oc image info registry.build05.ci.openshift.org/ci-ln-itrhw1t/release:latest | grep Digest
Digest:      sha256:0b85c9dc133c734cc348c2a26c7b32f826f4494245fa69de2c0776d3181b8f70
$ oc adm upgrade --force --allow-explicit-upgrade --to-image registry.build05.ci.openshift.org/ci-ln-itrhw1t/release@sha256:0b85c9dc133c734cc348c2a26c7b32f826f4494245fa69de2c0776d3181b8f70

And then watching the original MCO pod:

$ oc -n openshift-machine-config-operator logs -l k8s-app=machine-config-operator --tail -1 --follow
...
I0829 18:46:40.004700       1 helpers.go:93] Shutting down due to: terminated
I0829 18:46:40.004859       1 helpers.go:96] Context cancelled
I0829 18:46:40.004947       1 operator.go:286] Shutting down MachineConfigOperator
I0829 18:46:40.042567       1 start.go:115] Stopped leading. Terminating.

And the incoming MCO pod after the Deployment bump:

$ oc -n openshift-machine-config-operator logs -l k8s-app=machine-config-operator --tail -1 --follow
Defaulted container "machine-config-operator" out of: machine-config-operator, kube-rbac-proxy
I0829 18:46:52.063271       1 start.go:49] Version: 4.14.0-0.test-2023-08-29-180918-ci-ln-itrhw1t-latest (Raw: machine-config-daemon-4.6.0-202006240615.p0-2308-gbc165516-dirty, Hash: bc1655163ad3a944a899e05f85f175b4258907a0)
I0829 18:46:52.063533       1 leaderelection.go:122] The leader election gives 4 retries and allows for 30s of clock skew. The kube-apiserver downtime tolerance is 78s. Worst non-graceful lease acquisition is 2m43s. Worst graceful lease acquisition is {26s}.
I0829 18:46:52.063620       1 metrics.go:74] Registering Prometheus metrics
I0829 18:46:52.063685       1 metrics.go:81] Starting metrics listener on 127.0.0.1:8797
I0829 18:46:52.080335       1 leaderelection.go:245] attempting to acquire leader lease openshift-machine-config-operator/machine-config...
I0829 18:46:52.086623       1 leaderelection.go:255] successfully acquired lease openshift-machine-config-operator/machine-config
...

So that looks like a nice, clean 12s handoff to me on the Deployment roll. Confirming in the in-cluster Prometheus with:

group by (lease_holder) (kube_lease_owner{namespace="openshift-machine-config-operator",lease="machine-config"})

and:

group by (pod, container_id, image) (kube_pod_container_info{namespace="openshift-machine-config-operator",container="machine-config-operator"})

This confirms that this commit is having its intended effect on the Deployment bump. The Cluster Bot job hasn't gotten to the single-node reboot phase yet, but neither of the changes I'm making here is expected to affect that handoff.

PromQL metrics for lease handoff

@wking
Member Author

wking commented Aug 29, 2023

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Aug 29, 2023
@openshift-ci-robot
Contributor

@wking: This pull request references Jira Issue OCPBUGS-16905, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @sergiordlr

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot requested a review from sergiordlr August 29, 2023 19:35
Contributor

@yuqi-zhang yuqi-zhang left a comment

/lgtm

Makes sense for us and should be an otherwise safe fix. I don't think we need to merge this through the critical-fixes-only process.

Would be good to do some verification on SNO to see whether this resolves both issues, but I don't think this otherwise blocks the 4.14 release.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 29, 2023
@openshift-ci
Contributor

openshift-ci bot commented Aug 29, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wking, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 29, 2023
@sergiordlr

Verified using SNO on GCP.

Upgrade from 4.13 to 4.14+fix:

    history:
    - acceptedRisks: |-
        Target release version="" image="registry.build05.ci.openshift.org/ci-ln-4nbqnkt/release:latest" cannot be verified, but continuing anyway because the update was forced: release images that are not accessed via digest cannot be verified
        Forced through blocking failures: Multiple precondition checks failed:
        * Precondition "ClusterVersionUpgradeable" failed because of "AdminAckRequired": Kubernetes 1.27 and therefore OpenShift 4.14 remove several APIs which require admin consideration. Please see the knowledge article https://access.redhat.com/articles/6958395 for details and instructions.
        * Precondition "EtcdRecentBackup" failed because of "ControllerStarted": RecentBackup: The etcd backup controller is starting, and will decide if recent backups are available or if a backup is required
        * Precondition "ClusterVersionRecommendedUpdate" failed because of "UnknownUpdate": RetrievedUpdates=False (VersionNotFound), so the recommended status of updating from 4.13.0-0.nightly-2023-08-29-102305 to 4.14.0-0.ci.test-2023-08-30-110107-ci-ln-4nbqnkt-latest is unknown.
      completionTime: "2023-08-30T11:53:42Z"
      image: registry.build05.ci.openshift.org/ci-ln-4nbqnkt/release:latest
      startedTime: "2023-08-30T11:04:10Z"
      state: Completed
      verified: false
      version: 4.14.0-0.ci.test-2023-08-30-110107-ci-ln-4nbqnkt-latest
    - completionTime: "2023-08-30T10:37:29Z"
      image: registry.ci.openshift.org/ocp/release@sha256:1c9791305bf0b3c1e77ce0fa7926cf03472b0a1af4e67fc1e2ccac7a173b70d0
      startedTime: "2023-08-30T10:12:17Z"
      state: Completed
      verified: false
      version: 4.13.0-0.nightly-2023-08-29-102305

Before the operator pod update, we can see these logs in the machine-config-operator pod:

W0830 11:38:41.655465       1 warnings.go:70] unknown field "spec.infra.metadata.resourceVersion"
W0830 11:38:41.655467       1 warnings.go:70] unknown field "spec.infra.metadata.uid"
I0830 11:39:00.049819       1 helpers.go:93] Shutting down due to: terminated
I0830 11:39:00.052691       1 helpers.go:96] Context cancelled
I0830 11:39:00.079544       1 operator.go:286] Shutting down MachineConfigOperator
I0830 11:39:00.254136       1 start.go:115] Stopped leading. Terminating.

After the operator pod update, we can see these logs:

2023-08-30T11:39:13.756572375+00:00 stderr F I0830 11:39:13.756493       1 start.go:49] Version: 4.14.0-0.ci.test-2023-08-30-110107-ci-ln-4nbqnkt-latest (Raw: machine-config-daemon-4.6.0-202006240615.p0-2308-gbc165516-dirty, Hash: bc1655163ad3a944a899e05f85f175b4258907a0)
2023-08-30T11:39:13.757062160+00:00 stderr F I0830 11:39:13.757015       1 leaderelection.go:122] The leader election gives 4 retries and allows for 30s of clock skew. The kube-apiserver downtime tolerance is 78s. Worst non-graceful lease acquisition is 2m43s. Worst graceful lease acquisition is {26s}.
2023-08-30T11:39:13.757207344+00:00 stderr F I0830 11:39:13.757163       1 metrics.go:74] Registering Prometheus metrics
2023-08-30T11:39:13.757278350+00:00 stderr F I0830 11:39:13.757266       1 metrics.go:81] Starting metrics listener on 127.0.0.1:8797
2023-08-30T11:39:13.778781067+00:00 stderr F I0830 11:39:13.778675       1 leaderelection.go:245] attempting to acquire leader lease openshift-machine-config-operator/machine-config...
2023-08-30T11:39:13.796313656+00:00 stderr F I0830 11:39:13.796248       1 leaderelection.go:255] successfully acquired lease openshift-machine-config-operator/machine-config
2023-08-30T11:39:13.818883921+00:00 stderr F I0830 11:39:13.812101       1 simple_featuregate_reader.go:171] Starting feature-gate-detector

It took 13 seconds from when the old pod released the lease until the new pod acquired it.

After the pod restart when applying the MCs, we can see these logs:

2023-08-30T11:50:00.264332131+00:00 stderr F I0830 11:50:00.264267       1 start.go:49] Version: 4.14.0-0.ci.test-2023-08-30-110107-ci-ln-4nbqnkt-latest (Raw: machine-config-daemon-4.6.0-202006240615.p0-2308-gbc165516-dirty, Hash: bc1655163ad3a944a899e05f85f175b4258907a0)
2023-08-30T11:50:00.264868367+00:00 stderr F I0830 11:50:00.264808       1 leaderelection.go:122] The leader election gives 4 retries and allows for 30s of clock skew. The kube-apiserver downtime tolerance is 78s. Worst non-graceful lease acquisition is 2m43s. Worst graceful lease acquisition is {26s}.
2023-08-30T11:50:00.265088557+00:00 stderr F I0830 11:50:00.264941       1 metrics.go:74] Registering Prometheus metrics
2023-08-30T11:50:00.265164856+00:00 stderr F I0830 11:50:00.265151       1 metrics.go:81] Starting metrics listener on 127.0.0.1:8797
2023-08-30T11:50:00.291282475+00:00 stderr F I0830 11:50:00.291232       1 leaderelection.go:245] attempting to acquire leader lease openshift-machine-config-operator/machine-config...
2023-08-30T11:50:00.323308097+00:00 stderr F I0830 11:50:00.323260       1 leaderelection.go:255] successfully acquired lease openshift-machine-config-operator/machine-config

Everything seems ok. We can add the qe-approve label.

Thank you @wking for the detailed verification steps, I really appreciate it. Thanks!!!

/label qe-approved

@openshift-ci openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label Aug 30, 2023
@sergiordlr

/retest

@openshift-bot
Contributor

/jira refresh

The requirements for Jira bugs have changed (Jira issues linked to PRs on main branch need to target different OCP), recalculating validity.

@openshift-ci-robot openshift-ci-robot added jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. and removed jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Sep 8, 2023
@openshift-ci-robot
Contributor

@openshift-bot: This pull request references Jira Issue OCPBUGS-16905, which is invalid:

  • expected the bug to target the "4.15.0" version, but it targets "4.14.0" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/jira refresh

The requirements for Jira bugs have changed (Jira issues linked to PRs on main branch need to target different OCP), recalculating validity.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking
Member Author

wking commented Sep 8, 2023

/jira refresh

@openshift-ci-robot
Contributor

@wking: This pull request references Jira Issue OCPBUGS-16905, which is invalid:

  • expected the bug to target the "4.15.0" version, but it targets "4.14.0" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking
Member Author

wking commented Sep 8, 2023

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Sep 8, 2023
@openshift-ci-robot
Contributor

@wking: This pull request references Jira Issue OCPBUGS-16905, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.15.0) matches configured target version for branch (4.15.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @sergiordlr

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@yuqi-zhang
Contributor

/cherry-pick release-4.14

@openshift-cherrypick-robot

@yuqi-zhang: once the present PR merges, I will cherry-pick it on top of release-4.14 in a new PR and assign it to you.

In response to this:

/cherry-pick release-4.14

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 5b821a2 and 2 for PR HEAD 7530ded in total

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD f7333c4 and 1 for PR HEAD 7530ded in total

@openshift-ci
Contributor

openshift-ci bot commented Sep 18, 2023

@wking: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/okd-scos-e2e-aws-ovn 7530ded link false /test okd-scos-e2e-aws-ovn

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD fd0d2d0 and 0 for PR HEAD 7530ded in total

@openshift-merge-robot openshift-merge-robot merged commit 0aa8e03 into openshift:master Sep 18, 2023
12 of 13 checks passed
@openshift-ci-robot
Contributor

@wking: Jira Issue OCPBUGS-16905: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-16905 has been moved to the MODIFIED state.

In response to this:

(Quotes the PR description above.)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-cherrypick-robot

@yuqi-zhang: new pull request created: #3920

In response to this:

/cherry-pick release-4.14

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking wking deleted the single-node-lease-transitions branch September 19, 2023 22:33
wking added a commit to wking/cluster-baremetal-operator that referenced this pull request Dec 20, 2023
…ader election

The option has been available for years:

  $ git blame main.go | grep enable-leader-election
  dcbe86f (Sandhya Dasu               2020-08-18 21:09:29 -0400  72)    flag.BoolVar(&enableLeaderElection, "enable-leader-election", false,

and without it overlapping operator pods can crash-loop [1]:

  : [sig-cluster-lifecycle] pathological event should not see excessive Back-off restarting failed containers	0s
  {  event [namespace/openshift-machine-api node/ip-10-0-62-147.us-west-2.compute.internal pod/cluster-baremetal-operator-574577fbcb-z8nd4 hmsg/bf39bb17ae - Back-off restarting failed container cluster-baremetal-operator in pod cluster-baremetal-operator-574577fbcb-z8nd4_openshift-machine-api(441969c1-b430-412c-b67f-4ae2f7797f4f)] happened 26 times
  event [namespace/openshift-machine-api node/ip-10-0-62-147.us-west-2.compute.internal pod/cluster-baremetal-operator-574577fbcb-z8nd4 hmsg/bf39bb17ae - Back-off restarting failed container cluster-baremetal-operator in pod cluster-baremetal-operator-574577fbcb-z8nd4_openshift-machine-api(441969c1-b430-412c-b67f-4ae2f7797f4f)] happened 51 times}

while fighting each other over the same ClusterOperator status:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-aws-ovn-upgrade/1737335551998038016/artifacts/e2e-aws-ovn-upgrade/gather-audit-logs/artifacts/audit-logs.tar | tar -xz --strip-components=2
  $ zgrep -h '"resource":"clusteroperators","name":"baremetal"' kube-apiserver/*audit*log.gz | jq -r 'select(.verb == "create" or .verb == "update") | .stageTimestamp + " " + .verb + " " + (.responseStatus.code | tostring) + " " + (.objectRef.subresource) + " " + .user.username + " " + .user.extra["authentication.kubernetes.io/pod-name"][0]' | grep 'T06:08:.*cluster-baremetal-operator' | sort
  2023-12-20T06:08:21.757799Z update 200 status system:serviceaccount:openshift-machine-api:cluster-baremetal-operator cluster-baremetal-operator-574577fbcb-z8nd4
  2023-12-20T06:08:21.778638Z update 200 status system:serviceaccount:openshift-machine-api:cluster-baremetal-operator cluster-baremetal-operator-7fbb57959b-s9v9g
  2023-12-20T06:08:21.780378Z update 409 status system:serviceaccount:openshift-machine-api:cluster-baremetal-operator cluster-baremetal-operator-574577fbcb-z8nd4
  2023-12-20T06:08:21.790000Z update 200 status system:serviceaccount:openshift-machine-api:cluster-baremetal-operator cluster-baremetal-operator-7fbb57959b-s9v9g
  2023-12-20T06:08:21.802780Z update 200 status system:serviceaccount:openshift-machine-api:cluster-baremetal-operator cluster-baremetal-operator-7fbb57959b-s9v9g

Using a leader lock will avoid this contention, and the system should
be able to coast through brief moments after an outgoing leader leaves
until a replacement leader picks things back up.

I'm also setting a Recreate strategy [2], because:

1. Incoming pod surged by the default Deployment strategy.
2. Incoming pod attempts to acquire the Lease, but the outgoing pod is holding it.
3. Outgoing pod releases the lease and exits.
4. Incoming pod tries again, and this time acquires the lease.

can be slow in the 3-to-4 phase, while:

1. Outgoing pod releases the lease and exits.
2. Incoming pod created, scheduled, and acquires the lease.

tends to be faster.  And again, the component should be able to coast
through small durations without a functioning leader.

See openshift/machine-config-operator@7530ded86 (install: Recreate and
delayed default ServiceAccount deletion, 2023-08-29,
openshift/machine-config-operator#3895) for another example of how
Recreate can help that way.

[1]: https://issues.redhat.com/browse/OCPBUGS-25766
[2]: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#recreate-deployment
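
A hedged sketch of how both pieces could land in the cluster-baremetal-operator Deployment manifest. Only the -enable-leader-election flag (from the git blame above) and the Recreate strategy come from this commit message; the labels, ServiceAccount name, and placeholder image are assumptions for illustration.

  # Illustrative only, not the actual openshift/cluster-baremetal-operator manifest.
  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: cluster-baremetal-operator
    namespace: openshift-machine-api
  spec:
    replicas: 1
    selector:
      matchLabels:
        k8s-app: cluster-baremetal-operator        # assumed label
    strategy:
      type: Recreate                               # avoid surging a second, competing pod
    template:
      metadata:
        labels:
          k8s-app: cluster-baremetal-operator
      spec:
        serviceAccountName: cluster-baremetal-operator   # matches the audit-log user above
        containers:
        - name: cluster-baremetal-operator
          image: placeholder                       # substituted from the release payload
          args:
          - -enable-leader-election                # flag shown in the git blame above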
wking added a commit to wking/cluster-baremetal-operator that referenced this pull request Dec 20, 2023
…ader election

wking added a commit to wking/cluster-baremetal-operator that referenced this pull request Dec 21, 2023
…ader election

The option has been available for years:

  $ git blame main.go | grep enable-leader-election
  dcbe86f (Sandhya Dasu               2020-08-18 21:09:29 -0400  72)    flag.BoolVar(&enableLeaderElection, "enable-leader-election", false,

and without it overlapping operator pods can crash-loop [1]:

  : [sig-cluster-lifecycle] pathological event should not see excessive Back-off restarting failed containers	0s
  {  event [namespace/openshift-machine-api node/ip-10-0-62-147.us-west-2.compute.internal pod/cluster-baremetal-operator-574577fbcb-z8nd4 hmsg/bf39bb17ae - Back-off restarting failed container cluster-baremetal-operator in pod cluster-baremetal-operator-574577fbcb-z8nd4_openshift-machine-api(441969c1-b430-412c-b67f-4ae2f7797f4f)] happened 26 times
  event [namespace/openshift-machine-api node/ip-10-0-62-147.us-west-2.compute.internal pod/cluster-baremetal-operator-574577fbcb-z8nd4 hmsg/bf39bb17ae - Back-off restarting failed container cluster-baremetal-operator in pod cluster-baremetal-operator-574577fbcb-z8nd4_openshift-machine-api(441969c1-b430-412c-b67f-4ae2f7797f4f)] happened 51 times}

while fighting each other over the same ClusterOperator status:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-aws-ovn-upgrade/1737335551998038016/artifacts/e2e-aws-ovn-upgrade/gather-audit-logs/artifacts/audit-logs.tar | tar -xz --strip-components=2
  $ zgrep -h '"resource":"clusteroperators","name":"baremetal"' kube-apiserver/*audit*log.gz | jq -r 'select(.verb == "create" or .verb == "update") | .stageTimestamp + " " + .verb + " " + (.responseStatus.code | tostring) + " " + (.objectRef.subresource) + " " + .user.username + " " + .user.extra["authentication.kubernetes.io/pod-name"][0]' | grep 'T06:08:.*cluster-baremetal-operator' | sort
  2023-12-20T06:08:21.757799Z update 200 status system:serviceaccount:openshift-machine-api:cluster-baremetal-operator cluster-baremetal-operator-574577fbcb-z8nd4
  2023-12-20T06:08:21.778638Z update 200 status system:serviceaccount:openshift-machine-api:cluster-baremetal-operator cluster-baremetal-operator-7fbb57959b-s9v9g
  2023-12-20T06:08:21.780378Z update 409 status system:serviceaccount:openshift-machine-api:cluster-baremetal-operator cluster-baremetal-operator-574577fbcb-z8nd4
  2023-12-20T06:08:21.790000Z update 200 status system:serviceaccount:openshift-machine-api:cluster-baremetal-operator cluster-baremetal-operator-7fbb57959b-s9v9g
  2023-12-20T06:08:21.802780Z update 200 status system:serviceaccount:openshift-machine-api:cluster-baremetal-operator cluster-baremetal-operator-7fbb57959b-s9v9g

Using a leader lock will avoid this contention, and the system should
be able to coast through brief moments after an outgoing leader leaves
until a replacement leader picks things back up.

I'm also setting a Recreate strategy [2], because:

1. Incoming pod surged by the default Deployment strategy.
2. Incoming pod attempts to acquire the Lease, but the outgoing pod is holding it.
3. Outgoing pod releases the lease and exits.
4. Incoming pod tries again, and this time acquires the lease.

can be slow in the 3-to-4 phase, while:

1. Outgoing pod releases the lease and exits.
2. Incoming pod created, scheduled, and acquires the lease.

tends to be faster.  And again, the component should be able to coast
through small durations without a functioning leader.
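
As a rough illustration only: the actual change lands in this operator's
Deployment configuration, and the Go snippet below is just a sketch of
the same spec.strategy field using the k8s.io/api/apps/v1 types, not
code from this commit:

  package main

  import (
      "fmt"

      appsv1 "k8s.io/api/apps/v1"
  )

  func main() {
      // Sketch: switch a Deployment from the default RollingUpdate
      // strategy to Recreate, so the outgoing pod is torn down before
      // the incoming pod is created and the two never race for the
      // leader lease.
      d := &appsv1.Deployment{}
      d.Spec.Strategy = appsv1.DeploymentStrategy{
          Type: appsv1.RecreateDeploymentStrategyType,
      }
      fmt.Println(d.Spec.Strategy.Type) // prints "Recreate"
  }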

See openshift/machine-config-operator@7530ded86 (install: Recreate and
delayed default ServiceAccount deletion, 2023-08-29,
openshift/machine-config-operator#3895) for another example of how
Recreate can help that way.

To get the leader election working, I'm also setting LeaderElectionID
and LeaderElectionNamespace, pattern-matching from [3].  Using
cluster-baremetal-operator as the ID avoids collisions with
baremetal-operator [4].  COMPONENT_NAMESPACE is from:

  - name: COMPONENT_NAMESPACE
    valueFrom:
      fieldRef:
        apiVersion: v1
        fieldPath: metadata.namespace

using the pattern documented in [5].  We've been asking for that
environment variable since f368a4f (Add sa, deployment and co
manifests, 2020-09-29, openshift#25), although grepping through history, I
don't see anything consuming it.  But with this commit, we now have a
consumer :).
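
A rough sketch of that wiring (this is not the repository's exact
main.go; the scheme, metrics, and health options are omitted, and
reading COMPONENT_NAMESPACE directly via os.Getenv is an assumption of
the sketch):

  package main

  import (
      "os"

      ctrl "sigs.k8s.io/controller-runtime"
  )

  func main() {
      // Sketch: enable leader election with an explicit ID and
      // namespace.  COMPONENT_NAMESPACE comes from the Downward API
      // stanza shown above.
      mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
          LeaderElection:          true,
          LeaderElectionID:        "cluster-baremetal-operator",
          LeaderElectionNamespace: os.Getenv("COMPONENT_NAMESPACE"),
      })
      if err != nil {
          os.Exit(1)
      }
      if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
          os.Exit(1)
      }
  }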

[1]: https://issues.redhat.com/browse/OCPBUGS-25766
[2]: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#recreate-deployment
[3]: https://github.com/openshift/baremetal-operator/blob/749609dd0fb858be72a783c5d87853cad2d303db/main.go#L207-L208
[4]: https://github.com/openshift/baremetal-operator/blob/749609dd0fb858be72a783c5d87853cad2d303db/main.go#L71
[5]: https://kubernetes.io/docs/tasks/inject-data-application/environment-variable-expose-pod-information/
wking added a commit to wking/cluster-baremetal-operator that referenced this pull request Dec 21, 2023
…ader election

The option has been available for years:

  $ git blame main.go | grep enable-leader-election
  dcbe86f (Sandhya Dasu               2020-08-18 21:09:29 -0400  72)    flag.BoolVar(&enableLeaderElection, "enable-leader-election", false,

and without it overlapping operator pods can crash-loop [1]:

  : [sig-cluster-lifecycle] pathological event should not see excessive Back-off restarting failed containers	0s
  {  event [namespace/openshift-machine-api node/ip-10-0-62-147.us-west-2.compute.internal pod/cluster-baremetal-operator-574577fbcb-z8nd4 hmsg/bf39bb17ae - Back-off restarting failed container cluster-baremetal-operator in pod cluster-baremetal-operator-574577fbcb-z8nd4_openshift-machine-api(441969c1-b430-412c-b67f-4ae2f7797f4f)] happened 26 times
  event [namespace/openshift-machine-api node/ip-10-0-62-147.us-west-2.compute.internal pod/cluster-baremetal-operator-574577fbcb-z8nd4 hmsg/bf39bb17ae - Back-off restarting failed container cluster-baremetal-operator in pod cluster-baremetal-operator-574577fbcb-z8nd4_openshift-machine-api(441969c1-b430-412c-b67f-4ae2f7797f4f)] happened 51 times}

while fighting each other over the same ClusterOperator status:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-aws-ovn-upgrade/1737335551998038016/artifacts/e2e-aws-ovn-upgrade/gather-audit-logs/artifacts/audit-logs.tar | tar -xz --strip-components=2
  $ zgrep -h '"resource":"clusteroperators","name":"baremetal"' kube-apiserver/*audit*log.gz | jq -r 'select(.verb == "create" or .verb == "update") | .stageTimestamp + " " + .verb + " " + (.responseStatus.code | tostring) + " " + (.objectRef.subresource) + " " + .user.username + " " + .user.extra["authentication.kubernetes.io/pod-name"][0]' | grep 'T06:08:.*cluster-baremetal-operator' | sort
  2023-12-20T06:08:21.757799Z update 200 status system:serviceaccount:openshift-machine-api:cluster-baremetal-operator cluster-baremetal-operator-574577fbcb-z8nd4
  2023-12-20T06:08:21.778638Z update 200 status system:serviceaccount:openshift-machine-api:cluster-baremetal-operator cluster-baremetal-operator-7fbb57959b-s9v9g
  2023-12-20T06:08:21.780378Z update 409 status system:serviceaccount:openshift-machine-api:cluster-baremetal-operator cluster-baremetal-operator-574577fbcb-z8nd4
  2023-12-20T06:08:21.790000Z update 200 status system:serviceaccount:openshift-machine-api:cluster-baremetal-operator cluster-baremetal-operator-7fbb57959b-s9v9g
  2023-12-20T06:08:21.802780Z update 200 status system:serviceaccount:openshift-machine-api:cluster-baremetal-operator cluster-baremetal-operator-7fbb57959b-s9v9g

Using a leader lock will avoid this contention, and the system should
be able to coast through brief moments after an outgoing leader leaves
until a replacement leader picks things back up.

I'm also setting a Recreate strategy [2], because:

1. Incoming pod is surged in by the default RollingUpdate Deployment strategy.
2. Incoming pod attempts to acquire the Lease, but the outgoing pod is holding it.
3. Outgoing pod releases the lease and exits.
4. Incoming pod tries again, and this time acquires the lease.

can be slow in the 3-to-4 phase, while:

1. Outgoing pod releases the lease and exits.
2. Incoming pod is created, scheduled, and acquires the lease.

tends to be faster.  And again, the component should be able to coast
through small durations without a functioning leader.

See openshift/machine-config-operator@7530ded86 (install: Recreate and
delayed default ServiceAccount deletion, 2023-08-29,
openshift/machine-config-operator#3895) for another example of how
Recreate can help that way.

To get the leader election working, I'm also setting LeaderElectionID
and LeaderElectionNamespace, pattern-matching from [3].  Using
cluster-baremetal-operator as the ID avoids collisions with
baremetal-operator [4].  COMPONENT_NAMESPACE is from:

  - name: COMPONENT_NAMESPACE
    valueFrom:
      fieldRef:
        apiVersion: v1
        fieldPath: metadata.namespace

using the pattern documented in [5].  We've been asking for that
environment variable since f368a4f (Add sa, deployment and co
manifests, 2020-09-29, openshift#25), although grepping through history, I
don't see anything consuming it.  Perhaps it's somehow feeding
controllers.ComponentNamespace, which is what I'm using.
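
Purely as a hedged illustration of that pattern (the helper name below,
componentNamespace, is hypothetical, and I'm not claiming the real
controllers.ComponentNamespace is implemented this way):

  package main

  import (
      "fmt"
      "os"
      "strings"
  )

  // componentNamespace resolves the operator's namespace: prefer the
  // Downward-API COMPONENT_NAMESPACE env var, fall back to the mounted
  // service-account namespace file, and finally to a hard-coded default.
  func componentNamespace() string {
      if ns := os.Getenv("COMPONENT_NAMESPACE"); ns != "" {
          return ns
      }
      if b, err := os.ReadFile("/var/run/secrets/kubernetes.io/serviceaccount/namespace"); err == nil {
          if ns := strings.TrimSpace(string(b)); ns != "" {
              return ns
          }
      }
      return "openshift-machine-api"
  }

  func main() {
      fmt.Println(componentNamespace())
  }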

[1]: https://issues.redhat.com/browse/OCPBUGS-25766
[2]: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#recreate-deployment
[3]: https://github.com/openshift/baremetal-operator/blob/749609dd0fb858be72a783c5d87853cad2d303db/main.go#L207-L208
[4]: https://github.com/openshift/baremetal-operator/blob/749609dd0fb858be72a783c5d87853cad2d303db/main.go#L71
[5]: https://kubernetes.io/docs/tasks/inject-data-application/environment-variable-expose-pod-information/
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cluster-baremetal-operator that referenced this pull request Feb 7, 2024
…ader election