
OCPBUGS-33018: pkg/daemon: Ignore watch failure unless kubelet needs a rebootstrap #4337

Conversation

wking
Member

@wking wking commented Apr 25, 2024

Some adjustments to the logic from 4d447c5 (#4106).

Factor out the common rebootstrap logic into a new function, just to avoid repeating ourselves in two places. This part is a pure refactor, with no user-visible changes.

Significantly for users, 4d447c5's utilruntime.ErrorHandlers handling introduced a race like:

  1. Daemon is rolling an update out onto disk.
  2. Kube API or DNS hiccup causes the daemon's watches, or other Kube API requests, to fail.
  3. utilruntime.ErrorHandlers feeds that failure into errCh.
  4. stopCh is set to trigger graceful shutdowns in the child goroutines.
  5. Child goroutine performing the update moves into its deferred rollbacks (for example, see the code around "error rolling back files writes").
  6. But before that rollback completes, the main goroutine returns ErrAuxiliary, and the container exits 255 without finishing the rollback.
  7. Kubelet launches a replacement container.
  8. Replacement container is upset at the interrupted cleanup, complains "content mismatch for file..." about some of the incoming-but-not-rolled-back content, and exits.
  9. Return to step 7, looping forever.
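
For context on step 3, the hand-off from the process-wide error handlers into errCh looks roughly like the sketch below. This is illustrative rather than the MCO's exact wiring, and it assumes the older []func(error) shape that utilruntime.ErrorHandlers exposed at the time; only the errCh name and the ErrorHandlers hook come from the real change.

package main

import (
	"errors"
	"fmt"

	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
)

func main() {
	// Buffered so the handler never blocks whatever client-go code hit the error.
	errCh := make(chan error, 1)

	// Register a process-wide handler: client-go reports watch and reflector
	// failures via utilruntime.HandleError, which fans out to ErrorHandlers.
	utilruntime.ErrorHandlers = append(utilruntime.ErrorHandlers, func(err error) {
		select {
		case errCh <- err:
		default: // an earlier error is already pending; drop this one
		}
	})

	// Simulate a watch failure being reported somewhere in the client machinery.
	utilruntime.HandleError(errors.New("failed to watch *v1.Node: connection refused"))

	fmt.Println("daemon saw:", <-errCh)
}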

But we ignored these utilruntime.ErrorHandlers errors before 4d447c5, without trouble beyond responding to api-int Certificate Authority rotations. We still want to rebootstrap when we have api-int trouble, but I'm now gating both the rebootstrap and the ErrAuxiliary return on deferKubeletRestart. So now:

  1. Daemon is rolling an update out onto disk.
  2. Kube API or DNS hiccup causes the daemon's watches, or other Kube API requests, to fail.
  3. utilruntime.ErrorHandlers feeds that failure into errCh.
  4. Daemon knows it does not have a pending kubelet rebootstrap, and it ignores the error.

Or we can have:

...
4. Daemon knows it needs to rebootstrap the kubelet, so it does.
5. Daemon returns ErrAuxiliary, and the container exits 255.
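
Putting those two flows together, the decision point looks roughly like the sketch below. It is only a sketch: ErrAuxiliary, deferKubeletRestart, and the error substrings come from the real change (the substrings are visible in the reviewed diff further down this thread), while the daemon struct, handleWatchError, and the injected rebootstrapKubelet callback are hypothetical stand-ins for the refactored helper.

package main

import (
	"errors"
	"fmt"
	"strings"
)

// ErrAuxiliary mirrors the sentinel whose return makes the container exit 255.
var ErrAuxiliary = errors.New("auxiliary error: container should restart")

type daemon struct {
	deferKubeletRestart bool // true when an api-int rotation left a kubelet rebootstrap pending
}

// matchesAPIIntFailure mirrors the substring checks the daemon applies to errors from errCh.
func matchesAPIIntFailure(err error) bool {
	msg := strings.ToLower(err.Error())
	return strings.Contains(msg, "failed to watch") ||
		strings.Contains(msg, "unknown authority") ||
		strings.Contains(msg, "error on the server")
}

// handleWatchError is a hypothetical stand-in for the errCh handling described above:
// ignore transient failures unless a kubelet rebootstrap is pending.
func (dn *daemon) handleWatchError(err error, rebootstrapKubelet func() error) error {
	if !dn.deferKubeletRestart || !matchesAPIIntFailure(err) {
		fmt.Printf("ignoring transient API error: %v\n", err)
		return nil // keep running; do not interrupt an in-flight update
	}
	if rerr := rebootstrapKubelet(); rerr != nil {
		return rerr
	}
	return ErrAuxiliary // caller exits 255 and the kubelet relaunches the container
}

func main() {
	dn := &daemon{deferKubeletRestart: false}
	err := dn.handleWatchError(errors.New("failed to watch *v1.ControllerConfig"),
		func() error { return nil })
	fmt.Println("result:", err) // result: <nil>, the hiccup is ignored
}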

There's still room for a partial-rollback race if an api-int certificate rotation is in the works when the Kube API hiccups while we're rolling update files out to disk, because the current goroutine handling launches children like:

go wait.Until(dn.worker, time.Second, stopCh)

so there's no mechanism for the canceled child goroutine to tell the parent "got your message, and I've finished my graceful shutdown". But this commit means we are only exposed when there's an ongoing api-int rotation, and that reduces our risk significantly (they're less common than generic "any kind of Kube API hiccup matching the daemon's substrings"). MCO-1154 has been filed to track providing that child-to-parent return path.
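
Purely as an illustration of the missing acknowledgement (not necessarily what MCO-1154 will implement), one way to provide that return path is to wrap the wait.Until call in a tracked goroutine and have the parent block on it before exiting; the worker body and print statements below are placeholders.

package main

import (
	"fmt"
	"sync"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// worker stands in for one iteration of dn.worker rolling files onto disk.
func worker() {
	fmt.Println("worker tick")
}

func main() {
	stopCh := make(chan struct{})
	var wg sync.WaitGroup

	// Instead of a bare `go wait.Until(dn.worker, time.Second, stopCh)`, track
	// the child so the parent can wait for its graceful shutdown to finish.
	wg.Add(1)
	go func() {
		defer wg.Done()
		defer fmt.Println("rollback finished") // stands in for the deferred file rollback
		wait.Until(worker, time.Second, stopCh)
	}()

	time.Sleep(2 * time.Second) // let a couple of iterations run

	// ... an unrecoverable error has arrived on errCh ...
	close(stopCh) // ask the child to shut down
	wg.Wait()     // block until the child has actually unwound its defers
	fmt.Println("now safe to return ErrAuxiliary and exit 255")
}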

@wking wking changed the title pkg/daemon: Ignore watch failure unless kubelet needs a rebootstrap OCPBUGS-33018: pkg/daemon: Ignore watch failure unless kubelet needs a rebootstrap Apr 25, 2024
@openshift-ci-robot openshift-ci-robot added the jira/severity-critical (Referenced Jira bug's severity is critical for the branch this PR is targeting) and jira/valid-reference (Indicates that this PR references a valid Jira ticket of any type) labels Apr 25, 2024
@openshift-ci-robot
Contributor

@wking: This pull request references Jira Issue OCPBUGS-33018, which is invalid:

  • expected the bug to target the "4.16.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug (Indicates that a referenced Jira bug is invalid for the branch this PR is targeting) label Apr 25, 2024
@wking
Member Author

wking commented Apr 25, 2024

/Jira refresh

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug (Indicates that a referenced Jira bug is valid for the branch this PR is targeting) label and removed the jira/invalid-bug (Indicates that a referenced Jira bug is invalid for the branch this PR is targeting) label Apr 25, 2024
@openshift-ci-robot
Contributor

@wking: This pull request references Jira Issue OCPBUGS-33018, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @sergiordlr

In response to this:

/Jira refresh


@openshift-ci openshift-ci bot requested a review from sergiordlr April 25, 2024 22:44
@yuqi-zhang
Contributor

/test ?

Contributor

openshift-ci bot commented Apr 25, 2024

@yuqi-zhang: The following commands are available to trigger required jobs:

  • /test 4.12-upgrade-from-stable-4.11-images
  • /test cluster-bootimages
  • /test e2e-aws-ovn
  • /test e2e-aws-ovn-upgrade
  • /test e2e-gcp-op
  • /test e2e-gcp-op-single-node
  • /test e2e-hypershift
  • /test images
  • /test unit
  • /test verify

The following commands are available to trigger optional jobs:

  • /test 4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade
  • /test bootstrap-unit
  • /test e2e-aws-disruptive
  • /test e2e-aws-ovn-fips
  • /test e2e-aws-ovn-fips-op
  • /test e2e-aws-ovn-upgrade-out-of-change
  • /test e2e-aws-ovn-workers-rhel8
  • /test e2e-aws-proxy
  • /test e2e-aws-serial
  • /test e2e-aws-single-node
  • /test e2e-aws-upgrade-single-node
  • /test e2e-aws-workers-rhel8
  • /test e2e-azure
  • /test e2e-azure-ovn-upgrade
  • /test e2e-azure-ovn-upgrade-out-of-change
  • /test e2e-azure-upgrade
  • /test e2e-gcp-op-techpreview
  • /test e2e-gcp-ovn-rt-upgrade
  • /test e2e-gcp-rt
  • /test e2e-gcp-rt-op
  • /test e2e-gcp-single-node
  • /test e2e-gcp-upgrade
  • /test e2e-metal-assisted
  • /test e2e-metal-ipi
  • /test e2e-metal-ipi-ovn-dualstack
  • /test e2e-metal-ipi-ovn-ipv6
  • /test e2e-openstack
  • /test e2e-openstack-dualstack
  • /test e2e-openstack-externallb
  • /test e2e-openstack-parallel
  • /test e2e-ovirt
  • /test e2e-ovirt-upgrade
  • /test e2e-ovn-step-registry
  • /test e2e-vsphere
  • /test e2e-vsphere-upgrade
  • /test e2e-vsphere-upi
  • /test e2e-vsphere-upi-zones
  • /test e2e-vsphere-zones
  • /test okd-e2e-aws
  • /test okd-e2e-gcp-op
  • /test okd-e2e-upgrade
  • /test okd-e2e-vsphere
  • /test okd-images
  • /test okd-scos-images
  • /test security

Use /test all to run the following jobs that were automatically triggered:

  • pull-ci-openshift-machine-config-operator-master-bootstrap-unit
  • pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn
  • pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn-upgrade
  • pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn-upgrade-out-of-change
  • pull-ci-openshift-machine-config-operator-master-e2e-azure-ovn-upgrade-out-of-change
  • pull-ci-openshift-machine-config-operator-master-e2e-gcp-op
  • pull-ci-openshift-machine-config-operator-master-e2e-gcp-op-single-node
  • pull-ci-openshift-machine-config-operator-master-e2e-gcp-op-techpreview
  • pull-ci-openshift-machine-config-operator-master-e2e-hypershift
  • pull-ci-openshift-machine-config-operator-master-images
  • pull-ci-openshift-machine-config-operator-master-okd-images
  • pull-ci-openshift-machine-config-operator-master-security
  • pull-ci-openshift-machine-config-operator-master-unit
  • pull-ci-openshift-machine-config-operator-master-verify

In response to this:

/test ?


pkg/daemon/daemon.go (outdated review thread, resolved)
@wking wking force-pushed the daemon-ignore-watch-failure-unless-kubelet-needs-rebootstrap branch from 1a8ab77 to 66cf5a1 on April 25, 2024 22:55
@wking
Member Author

wking commented Apr 25, 2024

Running one to see if this works:

/payload-job periodic-ci-openshift-release-master-nightly-4.16-upgrade-from-stable-4.15-e2e-metal-ipi-upgrade-ovn-ipv6

If that shows us using the target payload we expect, we probably want to run a /payload-aggregate ... because the periodic is only mostly dead, not completely dead, and we don't want to misread a got-lucky-and-passed racy win as "actually fixed this issue".

Contributor

openshift-ci bot commented Apr 25, 2024

@wking: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.16-upgrade-from-stable-4.15-e2e-metal-ipi-upgrade-ovn-ipv6

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/465e69e0-0359-11ef-8ea6-b7071b44f025-0

Some adjustments to the logic from 4d447c5 (backportable version
of api-int cert work, 2024-01-09, openshift#4106).

Factor out the common rebootstrap logic into a new function, just to
avoid repeating ourselves in two places.  This part is a pure
refactor, with no user-visible changes.

Significantly for users, 4d447c5's utilruntime.ErrorHandlers
handling introduced a race like:

1. Daemon is rolling an update out onto disk.
2. Kube API or DNS hiccup causes the daemon's watches, or other Kube
   API requests, to fail.
3. utilruntime.ErrorHandlers feeds that failure into errCh.
4. stopCh set to trigger graceful shutdowns in the child goroutines.
5. Child goroutine performing the update moves into its deferred
   rollbacks (for example, see code around "error rolling back files
   writes").
6. But before that rollback completes, the main goroutine returns
   ErrAuxiliary, and the container exits 255 without finishing the
   rollback.
7. Kubelet launches a replacement container.
8. Replacement container is upset at the interrupted cleanup,
   complains "content mismatch for file..." about some of the
   incoming-but-not-rolled-back content, and exits.
9. Return to step 7, looping forever.

But we ignored these utilruntime.ErrorHandlers errors before 4d447c5,
without trouble beyond responding to api-int Certificate Authority
rotations.  We still want to rebootstrap when we have api-int trouble,
but I'm now gating both the rebootstrap and the ErrAuxiliary return on
deferKubeletRestart.  So now:

1. Daemon is rolling an update out onto disk.
2. Kube API or DNS hiccup causes the daemon's watches, or other Kube
   API requests, to fail.
3. utilruntime.ErrorHandlers feeds that failure into errCh.
4. Daemon knows it does not have a pending kubelet rebootstrap, and it
   ignores the error.

Or we can have:

...
4. Daemon knows it needs to rebootstrap the kubelet, so it does.
5. Daemon returns ErrAuxiliary, and the container exits 255.

There's still room for a partial-rollback race there if we have an
api-int certificate rotation in the works while the Kube API hiccups
while we're rolling update files out to disk, because the current
goroutine handling launches children like:

  go wait.Until(dn.worker, time.Second, stopCh)

so there's no mechanism for the canceled child goroutine to tell the
parent "got your message, and I've finished my graceful shutdown".
But this commit means we are only exposed when there's an ongoing
api-int rotation, and that reduces our risk significantly (they're
less common than generic "any kind of Kube API hiccup matching the
daemon's substrings").  [1] is up to track providing that
child-to-parent return path.

[1]: https://issues.redhat.com/browse/MCO-1154
@wking wking force-pushed the daemon-ignore-watch-failure-unless-kubelet-needs-rebootstrap branch from 66cf5a1 to 2974e2d on April 25, 2024 23:13
}); err != nil {
return fmt.Errorf("something went wrong while waiting for kubeconfig file to generate: %v", err)
}
if dn.deferKubeletRestart && (strings.Contains(strings.ToLower(err.Error()), "failed to watch") || strings.Contains(strings.ToLower(err.Error()), "unknown authority") || strings.Contains(strings.ToLower(err.Error()), "error on the server")) {
Contributor

For a second I thought this was doing the same thing, but the key difference is that ErrAuxiliary gets wrapped into a kubelet restart. I think this should be safe, so let's give it a try.

@openshift-ci openshift-ci bot added the approved (Indicates a PR has been approved by an approver from all required OWNERS files) label Apr 25, 2024
@wking
Member Author

wking commented Apr 26, 2024

The payload job failed, with:

INFO[2024-04-25T23:12:42Z] Resolved source https://github.com/openshift/machine-config-operator to master@9b99954a, merging: #4337 66cf5a19 @wking

and:

event happened 663 times, something is wrong: namespace/openshift-config-operator node/master-0.ostest.test.metalkube.org pod/openshift-config-operator-55d4f665c8-l9l66 hmsg/b037c7d6d4 - reason/BackOff Back-off pulling image "registry.build05.ci.openshift.org/ci-op-s98n4wp9/stable@sha256:3ac58db589768e083d2a4378dec14a7ebc2329e8629f7bc78ab1a5c691a6e0ef" (04:11:53Z) result=reject }

Not sure what's going on with the config operator. And gather-extra failed to collect pod logs. And no bytes in the pod logs in the must-gather:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/openshift-machine-config-operator-4337-nightly-4.16-upgrade-from-stable-4.15-e2e-metal-ipi-upgrade-ovn-ipv6/1783635215088881664/artifacts/e2e-metal-ipi-upgrade-ovn-ipv6/gather-must-gather/artifacts/must-gather.tar | tar -tvz | grep '/openshift-config-operator.*/logs/'
-rw------- 1012900000/root        0 2024-04-25 21:20 quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-0a2db4d71d7957fc2a92bc07c98918f169650dc0a6d040f40a26313e98bba9c3/namespaces/openshift-config-operator/pods/openshift-config-operator-55d4f665c8-l9l66/openshift-api/openshift-api/logs/current.insecure.log
-rw------- 1012900000/root        0 2024-04-25 21:20 quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-0a2db4d71d7957fc2a92bc07c98918f169650dc0a6d040f40a26313e98bba9c3/namespaces/openshift-config-operator/pods/openshift-config-operator-55d4f665c8-l9l66/openshift-api/openshift-api/logs/current.log
-rw------- 1012900000/root        0 2024-04-25 21:20 quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-0a2db4d71d7957fc2a92bc07c98918f169650dc0a6d040f40a26313e98bba9c3/namespaces/openshift-config-operator/pods/openshift-config-operator-55d4f665c8-l9l66/openshift-api/openshift-api/logs/previous.insecure.log
-rw------- 1012900000/root        0 2024-04-25 21:20 quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-0a2db4d71d7957fc2a92bc07c98918f169650dc0a6d040f40a26313e98bba9c3/namespaces/openshift-config-operator/pods/openshift-config-operator-55d4f665c8-l9l66/openshift-api/openshift-api/logs/previous.log
-rw------- 1012900000/root        0 2024-04-25 21:20 quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-0a2db4d71d7957fc2a92bc07c98918f169650dc0a6d040f40a26313e98bba9c3/namespaces/openshift-config-operator/pods/openshift-config-operator-55d4f665c8-l9l66/openshift-config-operator/openshift-config-operator/logs/current.insecure.log
-rw------- 1012900000/root        0 2024-04-25 21:20 quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-0a2db4d71d7957fc2a92bc07c98918f169650dc0a6d040f40a26313e98bba9c3/namespaces/openshift-config-operator/pods/openshift-config-operator-55d4f665c8-l9l66/openshift-config-operator/openshift-config-operator/logs/current.log
-rw------- 1012900000/root        0 2024-04-25 21:20 quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-0a2db4d71d7957fc2a92bc07c98918f169650dc0a6d040f40a26313e98bba9c3/namespaces/openshift-config-operator/pods/openshift-config-operator-55d4f665c8-l9l66/openshift-config-operator/openshift-config-operator/logs/previous.insecure.log
-rw------- 1012900000/root        0 2024-04-25 21:20 quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-0a2db4d71d7957fc2a92bc07c98918f169650dc0a6d040f40a26313e98bba9c3/namespaces/openshift-config-operator/pods/openshift-config-operator-55d4f665c8-l9l66/openshift-config-operator/openshift-config-operator/logs/previous.log

Ah, because we failed to pull that image:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/openshift-machine-config-operator-4337-nightly-4.16-upgrade-from-stable-4.15-e2e-metal-ipi-upgrade-ovn-ipv6/1783635215088881664/artifacts/e2e-metal-ipi-upgrade-ovn-ipv6/gather-extra/artifacts/pods.json | jq -r '.items[] | select(.metadata.name == "openshift-config-operator-55d4f665c8-l9l66").status.initContainerStatuses[]'
{
  "image": "registry.build05.ci.openshift.org/ci-op-s98n4wp9/stable@sha256:3ac58db589768e083d2a4378dec14a7ebc2329e8629f7bc78ab1a5c691a6e0ef",
  "imageID": "",
  "lastState": {},
  "name": "openshift-api",
  "ready": false,
  "restartCount": 0,
  "started": false,
  "state": {
    "waiting": {
      "message": "Back-off pulling image \"registry.build05.ci.openshift.org/ci-op-s98n4wp9/stable@sha256:3ac58db589768e083d2a4378dec14a7ebc2329e8629f7bc78ab1a5c691a6e0ef\"",
      "reason": "ImagePullBackOff"
    }
  }
}

Because networking fell over? Or some kind of proxy thing? I dunno:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/openshift-machine-config-operator-4337-nightly-4.16-upgrade-from-stable-4.15-e2e-metal-ipi-upgrade-ovn-ipv6/1783635215088881664/artifacts/e2e-metal-ipi-upgrade-ovn-ipv6/gather-extra/artifacts/events.json | jq -r '.items[] | select(.reason == "Failed" and (.message | contains("3ac58db589768e083d2a4378dec14a7ebc2329e8629f7bc78ab1a5c691a6e0ef"))).message'
Failed to pull image "registry.build05.ci.openshift.org/ci-op-s98n4wp9/stable@sha256:3ac58db589768e083d2a4378dec14a7ebc2329e8629f7bc78ab1a5c691a6e0ef": (Mirrors also failed: [virthost.ostest.test.metalkube.org:5000/localimages/local-upgrade-image@sha256:3ac58db589768e083d2a4378dec14a7ebc2329e8629f7bc78ab1a5c691a6e0ef: reading manifest sha256:3ac58db589768e083d2a4378dec14a7ebc2329e8629f7bc78ab1a5c691a6e0ef in virthost.ostest.test.metalkube.org:5000/localimages/local-upgrade-image: manifest unknown]): registry.build05.ci.openshift.org/ci-op-s98n4wp9/stable@sha256:3ac58db589768e083d2a4378dec14a7ebc2329e8629f7bc78ab1a5c691a6e0ef: pinging container registry registry.build05.ci.openshift.org: Get "https://registry.build05.ci.openshift.org/v2/": dial tcp 54.145.168.129:443: connect: network is unreachable
Failed to pull image "registry.build05.ci.openshift.org/ci-op-s98n4wp9/stable@sha256:3ac58db589768e083d2a4378dec14a7ebc2329e8629f7bc78ab1a5c691a6e0ef": (Mirrors also failed: [virthost.ostest.test.metalkube.org:5000/localimages/local-upgrade-image@sha256:3ac58db589768e083d2a4378dec14a7ebc2329e8629f7bc78ab1a5c691a6e0ef: reading manifest sha256:3ac58db589768e083d2a4378dec14a7ebc2329e8629f7bc78ab1a5c691a6e0ef in virthost.ostest.test.metalkube.org:5000/localimages/local-upgrade-image: manifest unknown]): registry.build05.ci.openshift.org/ci-op-s98n4wp9/stable@sha256:3ac58db589768e083d2a4378dec14a7ebc2329e8629f7bc78ab1a5c691a6e0ef: pinging container registry registry.build05.ci.openshift.org: Get "https://registry.build05.ci.openshift.org/v2/": dial tcp 54.165.220.45:443: connect: network is unreachable

But:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/openshift-machine-config-operator-4337-nightly-4.16-upgrade-from-stable-4.15-e2e-metal-ipi-upgrade-ovn-ipv6/1783635215088881664/artifacts/e2e-metal-ipi-upgrade-ovn-ipv6/gather-extra/artifacts/clusterversion.json | jq -r '.items[].status.history[] | .startedTime + " " + (.completionTime // "-") + " " + .state + " " + .version'
2024-04-26T01:41:22Z - Partial 4.16.0-0.ci.test-2024-04-25-231905-ci-op-s98n4wp9-latest
2024-04-26T00:10:33Z 2024-04-26T01:05:53Z Completed 4.15.10

so it looks like it is doing the 4.15 -> freshly-built 4.16 upgrade that we want. I'll run another:

/payload-job periodic-ci-openshift-release-master-nightly-4.16-upgrade-from-stable-4.15-e2e-metal-ipi-upgrade-ovn-ipv6

Contributor

openshift-ci bot commented Apr 26, 2024

@wking: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.16-upgrade-from-stable-4.15-e2e-metal-ipi-upgrade-ovn-ipv6

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/9a858820-0389-11ef-992a-e70884edc1d8-0

@yuqi-zhang
Contributor

We can't seem to pre-merge test this via the payload and it's a safe enough change.

/lgtm
/override ci/prow/e2e-hypershift

@openshift-ci openshift-ci bot added the lgtm (Indicates that a PR is ready to be merged) label Apr 26, 2024
Contributor

openshift-ci bot commented Apr 26, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wking, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

openshift-ci bot commented Apr 26, 2024

@yuqi-zhang: Overrode contexts on behalf of yuqi-zhang: ci/prow/e2e-hypershift

In response to this:

We can't seem to pre-merge test this via the payload and it's a safe enough change.

/lgtm
/override ci/prow/e2e-hypershift


Contributor

openshift-ci bot commented Apr 26, 2024

@wking: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
ci/prow/security | 2974e2d | link | false | /test security

Full PR test history. Your PR dashboard.


@openshift-merge-bot openshift-merge-bot bot merged commit 26c7742 into openshift:master Apr 26, 2024
14 of 15 checks passed
@openshift-ci-robot
Contributor

@wking: Jira Issue OCPBUGS-33018: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-33018 has been moved to the MODIFIED state.


@wking wking deleted the daemon-ignore-watch-failure-unless-kubelet-needs-rebootstrap branch April 26, 2024 15:44
@openshift-merge-robot
Contributor

Fix included in accepted release 4.16.0-0.nightly-2024-04-29-154406

@openshift-merge-robot
Contributor

Fix included in accepted release 4.16.0-0.nightly-2024-05-08-222442
