OCPBUGS-33018: pkg/daemon: Ignore watch failure unless kubelet needs a rebootstrap #4337
Conversation
@wking: This pull request references Jira Issue OCPBUGS-33018, which is invalid:
The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/jira refresh
@wking: This pull request references Jira Issue OCPBUGS-33018, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/test ?
@yuqi-zhang: The following commands are available to trigger required jobs:
The following commands are available to trigger optional jobs:
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Force-pushed from 1a8ab77 to 66cf5a1.
Running one to see if this works:

/payload-job periodic-ci-openshift-release-master-nightly-4.16-upgrade-from-stable-4.15-e2e-metal-ipi-upgrade-ovn-ipv6

If that shows us using the target payload we expect, we probably want to run a
@wking: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/465e69e0-0359-11ef-8ea6-b7071b44f025-0
Some adjustments to the logic from 4d447c5 (backportable version of api-int cert work, 2024-01-09, openshift#4106).

Factor out the common rebootstrap logic into a new function, just to avoid repeating ourselves in two places. This part is a pure refactor, with no user-visible changes.

Significantly for users, 4d447c5's utilruntime.ErrorHandlers handling introduced a race like:

1. Daemon is rolling an update out onto disk.
2. A Kube API or DNS hiccup causes the daemon's watches, or other Kube API requests, to fail.
3. utilruntime.ErrorHandlers feeds that failure into errCh.
4. stopCh is set to trigger graceful shutdowns in the child goroutines.
5. The child goroutine performing the update moves into its deferred rollbacks (for example, see the code around "error rolling back files writes").
6. But before that rollback completes, the main goroutine returns ErrAuxiliary, and the container exits 255 without finishing the rollback.
7. Kubelet launches a replacement container.
8. The replacement container is upset at the interrupted cleanup, complains "content mismatch for file..." about some of the incoming-but-not-rolled-back content, and exits.
9. Return to step 7, looping forever.

But we ignored these utilruntime.ErrorHandlers errors before 4d447c5, without trouble beyond responding to api-int Certificate Authority rotations. We still want to rebootstrap when we have api-int trouble, so I'm now gating both the rebootstrap and the ErrAuxiliary return on deferKubeletRestart. So now:

1. Daemon is rolling an update out onto disk.
2. A Kube API or DNS hiccup causes the daemon's watches, or other Kube API requests, to fail.
3. utilruntime.ErrorHandlers feeds that failure into errCh.
4. Daemon knows it does not have a pending kubelet rebootstrap, and it ignores the error.

Or we can have:

...
4. Daemon knows it needs to rebootstrap the kubelet, so it does.
5. Daemon returns ErrAuxiliary, and the container exits 255.

There's still room for a partial-rollback race there, if an api-int certificate rotation is in the works and the Kube API hiccups while we're rolling update files out to disk, because the current goroutine handling launches children like:

    go wait.Until(dn.worker, time.Second, stopCh)

so there's no mechanism for the canceled child goroutine to tell the parent "got your message, and I've finished my graceful shutdown". But this commit means we are only exposed when there's an ongoing api-int rotation, and that reduces our risk significantly (those rotations are less common than generic "any kind of Kube API hiccup matching the daemon's substrings"). [1] is up to track providing that child-to-parent return path.

[1]: https://issues.redhat.com/browse/MCO-1154
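To make the new flow concrete, here is a minimal sketch of that gating. The names errCh, stopCh, deferKubeletRestart, ErrAuxiliary, and the wait.Until launch come straight from the commit message above; the daemon struct, the loop shape, and the rebootstrapKubelet helper are illustrative assumptions, not the PR's literal code, and it assumes the pre-1.31 []func(error) form of utilruntime.ErrorHandlers:

```go
package main

import (
	"errors"
	"fmt"
	"time"

	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
	"k8s.io/apimachinery/pkg/util/wait"
)

// ErrAuxiliary stands in for the sentinel the daemon returns to force a
// non-zero container exit.
var ErrAuxiliary = errors.New("auxiliary goroutine failed")

type daemon struct {
	deferKubeletRestart bool // set while an api-int rotation is pending
}

// worker stands in for the child goroutine that rolls MachineConfig
// content out to disk (and rolls it back on graceful shutdown).
func (dn *daemon) worker() {}

// rebootstrapKubelet stands in for the rebootstrap helper this commit
// factors out; the real daemon rewrites the kubelet's kubeconfig here.
func (dn *daemon) rebootstrapKubelet() { fmt.Println("rebootstrapping kubelet") }

func (dn *daemon) run() error {
	errCh := make(chan error, 1)
	stopCh := make(chan struct{})

	// Funnel Kube API client failures (failed watches, TLS trust errors,
	// ...) into errCh, assuming the pre-1.31 []func(error) shape of
	// utilruntime.ErrorHandlers.
	utilruntime.ErrorHandlers = append(utilruntime.ErrorHandlers, func(err error) {
		select {
		case errCh <- err:
		default: // drop if an error is already pending
		}
	})

	// wait.Until hands the child stopCh, but gives it no way to report
	// "graceful shutdown finished" back to us; that missing return path
	// is what MCO-1154 tracks.
	go wait.Until(dn.worker, time.Second, stopCh)

	for err := range errCh {
		if !dn.deferKubeletRestart {
			// No pending kubelet rebootstrap: ignore the hiccup instead of
			// exiting mid-update and racing the worker's deferred rollback.
			fmt.Printf("ignoring Kube API error: %v\n", err)
			continue
		}
		// Pending api-int rotation: rebootstrap, then exit non-zero so the
		// replacement container comes up against the rotated certificate.
		close(stopCh)
		dn.rebootstrapKubelet()
		return ErrAuxiliary
	}
	return nil
}

func main() {
	dn := &daemon{deferKubeletRestart: true}
	go func() {
		time.Sleep(100 * time.Millisecond) // let run() register its handler
		utilruntime.HandleError(errors.New("failed to watch *v1.Node"))
	}()
	if err := dn.run(); errors.Is(err, ErrAuxiliary) {
		fmt.Println("container would exit 255 here")
	}
}
```

The property the sketch demonstrates: a Kube API hiccup only escalates to a container exit when deferKubeletRestart is set, and even then close(stopCh) gets no acknowledgment back from the worker before the process exits, which is exactly the residual window MCO-1154 tracks.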
Force-pushed from 66cf5a1 to 2974e2d.
	}); err != nil {
		return fmt.Errorf("something went wrong while waiting for kubeconfig file to generate: %v", err)
	}
	if dn.deferKubeletRestart && (strings.Contains(strings.ToLower(err.Error()), "failed to watch") || strings.Contains(strings.ToLower(err.Error()), "unknown authority") || strings.Contains(strings.ToLower(err.Error()), "error on the server")) {
For a second I thought this was doing the same thing, but the key difference is that ErrAuxiliary gets wrapped into a kubelet restart. I think this should be safe, so let's give it a try.
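For readers skimming the diff, the gating condition above can be read as a small error classifier. Here is a self-contained sketch that extracts it into a helper; the name isRebootstrapWorthy is hypothetical (not from the PR), but the three substrings are exactly the ones on the reviewed line:

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// isRebootstrapWorthy reports whether an error from the daemon's Kube API
// machinery looks like the api-int connectivity/trust failures that the
// reviewed condition matches on.
func isRebootstrapWorthy(err error) bool {
	msg := strings.ToLower(err.Error())
	return strings.Contains(msg, "failed to watch") ||
		strings.Contains(msg, "unknown authority") ||
		strings.Contains(msg, "error on the server")
}

func main() {
	for _, err := range []error{
		errors.New("failed to watch *v1.Node: x509: certificate signed by unknown authority"),
		errors.New("context deadline exceeded"), // generic hiccup: not rebootstrap-worthy
	} {
		fmt.Printf("%q -> rebootstrap-worthy: %v\n", err, isRebootstrapWorthy(err))
	}
}
```

Even when an error matches, the daemon only acts on it if dn.deferKubeletRestart is set, which is the difference the review comment calls out.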
Not sure what's going on with the config operator. And gather-extra failed to collect pod logs. And no bytes in the pod logs in the must-gather:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/openshift-machine-config-operator-4337-nightly-4.16-upgrade-from-stable-4.15-e2e-metal-ipi-upgrade-ovn-ipv6/1783635215088881664/artifacts/e2e-metal-ipi-upgrade-ovn-ipv6/gather-must-gather/artifacts/must-gather.tar | tar -tvz | grep '/openshift-config-operator.*/logs/'
-rw------- 1012900000/root 0 2024-04-25 21:20 quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-0a2db4d71d7957fc2a92bc07c98918f169650dc0a6d040f40a26313e98bba9c3/namespaces/openshift-config-operator/pods/openshift-config-operator-55d4f665c8-l9l66/openshift-api/openshift-api/logs/current.insecure.log
-rw------- 1012900000/root 0 2024-04-25 21:20 quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-0a2db4d71d7957fc2a92bc07c98918f169650dc0a6d040f40a26313e98bba9c3/namespaces/openshift-config-operator/pods/openshift-config-operator-55d4f665c8-l9l66/openshift-api/openshift-api/logs/current.log
-rw------- 1012900000/root 0 2024-04-25 21:20 quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-0a2db4d71d7957fc2a92bc07c98918f169650dc0a6d040f40a26313e98bba9c3/namespaces/openshift-config-operator/pods/openshift-config-operator-55d4f665c8-l9l66/openshift-api/openshift-api/logs/previous.insecure.log
-rw------- 1012900000/root 0 2024-04-25 21:20 quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-0a2db4d71d7957fc2a92bc07c98918f169650dc0a6d040f40a26313e98bba9c3/namespaces/openshift-config-operator/pods/openshift-config-operator-55d4f665c8-l9l66/openshift-api/openshift-api/logs/previous.log
-rw------- 1012900000/root 0 2024-04-25 21:20 quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-0a2db4d71d7957fc2a92bc07c98918f169650dc0a6d040f40a26313e98bba9c3/namespaces/openshift-config-operator/pods/openshift-config-operator-55d4f665c8-l9l66/openshift-config-operator/openshift-config-operator/logs/current.insecure.log
-rw------- 1012900000/root 0 2024-04-25 21:20 quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-0a2db4d71d7957fc2a92bc07c98918f169650dc0a6d040f40a26313e98bba9c3/namespaces/openshift-config-operator/pods/openshift-config-operator-55d4f665c8-l9l66/openshift-config-operator/openshift-config-operator/logs/current.log
-rw------- 1012900000/root 0 2024-04-25 21:20 quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-0a2db4d71d7957fc2a92bc07c98918f169650dc0a6d040f40a26313e98bba9c3/namespaces/openshift-config-operator/pods/openshift-config-operator-55d4f665c8-l9l66/openshift-config-operator/openshift-config-operator/logs/previous.insecure.log
-rw------- 1012900000/root 0 2024-04-25 21:20 quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-0a2db4d71d7957fc2a92bc07c98918f169650dc0a6d040f40a26313e98bba9c3/namespaces/openshift-config-operator/pods/openshift-config-operator-55d4f665c8-l9l66/openshift-config-operator/openshift-config-operator/logs/previous.log

Ah, because we failed to pull that image:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/openshift-machine-config-operator-4337-nightly-4.16-upgrade-from-stable-4.15-e2e-metal-ipi-upgrade-ovn-ipv6/1783635215088881664/artifacts/e2e-metal-ipi-upgrade-ovn-ipv6/gather-extra/artifacts/pods.json | jq -r '.items[] | select(.metadata.name == "openshift-config-operator-55d4f665c8-l9l66").status.initContainerStatuses[]'
{
"image": "registry.build05.ci.openshift.org/ci-op-s98n4wp9/stable@sha256:3ac58db589768e083d2a4378dec14a7ebc2329e8629f7bc78ab1a5c691a6e0ef",
"imageID": "",
"lastState": {},
"name": "openshift-api",
"ready": false,
"restartCount": 0,
"started": false,
"state": {
"waiting": {
"message": "Back-off pulling image \"registry.build05.ci.openshift.org/ci-op-s98n4wp9/stable@sha256:3ac58db589768e083d2a4378dec14a7ebc2329e8629f7bc78ab1a5c691a6e0ef\"",
"reason": "ImagePullBackOff"
}
}
}

Because networking fell over? Or some kind of proxy thing? I dunno:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/openshift-machine-config-operator-4337-nightly-4.16-upgrade-from-stable-4.15-e2e-metal-ipi-upgrade-ovn-ipv6/1783635215088881664/artifacts/e2e-metal-ipi-upgrade-ovn-ipv6/gather-extra/artifacts/events.json | jq -r '.items[] | select(.reason == "Failed" and (.message | contains("3ac58db589768e083d2a4378dec14a7ebc2329e8629f7bc78ab1a5c691a6e0ef"))).message'
Failed to pull image "registry.build05.ci.openshift.org/ci-op-s98n4wp9/stable@sha256:3ac58db589768e083d2a4378dec14a7ebc2329e8629f7bc78ab1a5c691a6e0ef": (Mirrors also failed: [virthost.ostest.test.metalkube.org:5000/localimages/local-upgrade-image@sha256:3ac58db589768e083d2a4378dec14a7ebc2329e8629f7bc78ab1a5c691a6e0ef: reading manifest sha256:3ac58db589768e083d2a4378dec14a7ebc2329e8629f7bc78ab1a5c691a6e0ef in virthost.ostest.test.metalkube.org:5000/localimages/local-upgrade-image: manifest unknown]): registry.build05.ci.openshift.org/ci-op-s98n4wp9/stable@sha256:3ac58db589768e083d2a4378dec14a7ebc2329e8629f7bc78ab1a5c691a6e0ef: pinging container registry registry.build05.ci.openshift.org: Get "https://registry.build05.ci.openshift.org/v2/": dial tcp 54.145.168.129:443: connect: network is unreachable
Failed to pull image "registry.build05.ci.openshift.org/ci-op-s98n4wp9/stable@sha256:3ac58db589768e083d2a4378dec14a7ebc2329e8629f7bc78ab1a5c691a6e0ef": (Mirrors also failed: [virthost.ostest.test.metalkube.org:5000/localimages/local-upgrade-image@sha256:3ac58db589768e083d2a4378dec14a7ebc2329e8629f7bc78ab1a5c691a6e0ef: reading manifest sha256:3ac58db589768e083d2a4378dec14a7ebc2329e8629f7bc78ab1a5c691a6e0ef in virthost.ostest.test.metalkube.org:5000/localimages/local-upgrade-image: manifest unknown]): registry.build05.ci.openshift.org/ci-op-s98n4wp9/stable@sha256:3ac58db589768e083d2a4378dec14a7ebc2329e8629f7bc78ab1a5c691a6e0ef: pinging container registry registry.build05.ci.openshift.org: Get "https://registry.build05.ci.openshift.org/v2/": dial tcp 54.165.220.45:443: connect: network is unreachable

But:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/openshift-machine-config-operator-4337-nightly-4.16-upgrade-from-stable-4.15-e2e-metal-ipi-upgrade-ovn-ipv6/1783635215088881664/artifacts/e2e-metal-ipi-upgrade-ovn-ipv6/gather-extra/artifacts/clusterversion.json | jq -r '.items[].status.history[] | .startedTime + " " + (.completionTime // "-") + " " + .state + " " + .version'
2024-04-26T01:41:22Z - Partial 4.16.0-0.ci.test-2024-04-25-231905-ci-op-s98n4wp9-latest
2024-04-26T00:10:33Z 2024-04-26T01:05:53Z Completed 4.15.10

so it looks like it is doing the 4.15 -> freshly-built 4.16 upgrade that we want. I'll run another:

/payload-job periodic-ci-openshift-release-master-nightly-4.16-upgrade-from-stable-4.15-e2e-metal-ipi-upgrade-ovn-ipv6
@wking: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/9a858820-0389-11ef-992a-e70884edc1d8-0
We can't seem to pre-merge test this via the payload and it's a safe enough change.

/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wking, yuqi-zhang

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
@yuqi-zhang: Overrode contexts on behalf of yuqi-zhang: ci/prow/e2e-hypershift

In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@wking: The following test failed, say /retest to rerun all failed tests:

Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Merged commit 26c7742 into openshift:master.
@wking: Jira Issue OCPBUGS-33018: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-33018 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Fix included in accepted release 4.16.0-0.nightly-2024-04-29-154406
Fix included in accepted release 4.16.0-0.nightly-2024-05-08-222442