
OCPBUGS-33018: pkg/daemon: Ignore watch failure unless kubelet needs a rebootstrap #4337

Conversation

wking
Member

@wking wking commented Apr 25, 2024

Some adjustments to the logic from 4d447c5 (#4106).

Factor out the common rebootstrap logic into a new function, just to avoid repeating ourselves in two places. This part is a pure refactor, with no user-visible changes.

Significantly for users, 4d447c5's utilruntime.ErrorHandlers handling introduced a race like:

  1. Daemon is rolling an update out onto disk.
  2. Kube API or DNS hiccup causes the daemon's watches, or other Kube API requests, to fail.
  3. utilruntime.ErrorHandlers feeds that failure into errCh.
  4. stopCh is set to trigger graceful shutdowns in the child goroutines.
  5. Child goroutine performing the update moves into its deferred rollbacks (for example, see the code around "error rolling back files writes").
  6. But before that rollback completes, the main goroutine returns ErrAuxiliary, and the container exits 255 without finishing the rollback.
  7. Kubelet launches a replacement container.
  8. Replacement container is upset at the interrupted cleanup, complains "content mismatch for file..." about some of the incoming-but-not-rolled-back content, and exits.
  9. Return to step 7, looping forever.
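
For context on step 3, the hand-off from the process-wide error handlers into errCh looks roughly like the sketch below. This is illustrative rather than the MCO's exact wiring, and it assumes the older []func(error) shape that utilruntime.ErrorHandlers exposed at the time; only the errCh name and the ErrorHandlers hook come from the real change.

package main

import (
	"errors"
	"fmt"

	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
)

func main() {
	// Buffered so the handler never blocks whatever client-go code hit the error.
	errCh := make(chan error, 1)

	// Register a process-wide handler: client-go reports watch and reflector
	// failures via utilruntime.HandleError, which fans out to ErrorHandlers.
	utilruntime.ErrorHandlers = append(utilruntime.ErrorHandlers, func(err error) {
		select {
		case errCh <- err:
		default: // an earlier error is already pending; drop this one
		}
	})

	// Simulate a watch failure being reported somewhere in the client machinery.
	utilruntime.HandleError(errors.New("failed to watch *v1.Node: connection refused"))

	fmt.Println("daemon saw:", <-errCh)
}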

But we ignored these utilruntime.ErrorHandlers errors before 4d447c5, without trouble beyond responding to api-int Certificate Authority rotations. We still want to rebootstrap when we have api-int trouble, but I'm now gating both the rebootstrap and the ErrAuxiliary return on deferKubeletRestart. So now:

  1. Daemon is rolling an update out onto disk.
  2. Kube API or DNS hiccup causes the daemon's watches, or other Kube API requests, to fail.
  3. utilruntime.ErrorHandlers feeds that failure into errCh.
  4. Daemon knows it does not have a pending kubelet rebootstrap, and it ignores the error.

Or we can have:

...
4. Daemon knows it needs to rebootstrap the kubelet, so it does.
5. Daemon returns ErrAuxiliary, and the container exits 255.
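
Putting those two flows together, the decision point looks roughly like the sketch below. It is only a sketch: ErrAuxiliary, deferKubeletRestart, and the error substrings come from the real change (the substrings are visible in the reviewed diff further down this thread), while the daemon struct, handleWatchError, and the injected rebootstrapKubelet callback are hypothetical stand-ins for the refactored helper.

package main

import (
	"errors"
	"fmt"
	"strings"
)

// ErrAuxiliary mirrors the sentinel whose return makes the container exit 255.
var ErrAuxiliary = errors.New("auxiliary error: container should restart")

type daemon struct {
	deferKubeletRestart bool // true when an api-int rotation left a kubelet rebootstrap pending
}

// matchesAPIIntFailure mirrors the substring checks the daemon applies to errors from errCh.
func matchesAPIIntFailure(err error) bool {
	msg := strings.ToLower(err.Error())
	return strings.Contains(msg, "failed to watch") ||
		strings.Contains(msg, "unknown authority") ||
		strings.Contains(msg, "error on the server")
}

// handleWatchError is a hypothetical stand-in for the errCh handling described above:
// ignore transient failures unless a kubelet rebootstrap is pending.
func (dn *daemon) handleWatchError(err error, rebootstrapKubelet func() error) error {
	if !dn.deferKubeletRestart || !matchesAPIIntFailure(err) {
		fmt.Printf("ignoring transient API error: %v\n", err)
		return nil // keep running; do not interrupt an in-flight update
	}
	if rerr := rebootstrapKubelet(); rerr != nil {
		return rerr
	}
	return ErrAuxiliary // caller exits 255 and the kubelet relaunches the container
}

func main() {
	dn := &daemon{deferKubeletRestart: false}
	err := dn.handleWatchError(errors.New("failed to watch *v1.ControllerConfig"),
		func() error { return nil })
	fmt.Println("result:", err) // result: <nil>, the hiccup is ignored
}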

There's still room for a partial-rollback race if an api-int certificate rotation is in the works when the Kube API hiccups while we're rolling update files out to disk, because the current goroutine handling launches children like:

go wait.Until(dn.worker, time.Second, stopCh)

so there's no mechanism for the canceled child goroutine to tell the parent "got your message, and I've finished my graceful shutdown". But this commit means we are only exposed when there's an ongoing api-int rotation, and that reduces our risk significantly (they're less common than generic "any kind of Kube API hiccup matching the daemon's substrings"). MCO-1154 has been filed to track providing that child-to-parent return path.
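
Purely as an illustration of the missing acknowledgement (not necessarily what MCO-1154 will implement), one way to provide that return path is to wrap the wait.Until call in a tracked goroutine and have the parent block on it before exiting; the worker body and print statements below are placeholders.

package main

import (
	"fmt"
	"sync"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// worker stands in for one iteration of dn.worker rolling files onto disk.
func worker() {
	fmt.Println("worker tick")
}

func main() {
	stopCh := make(chan struct{})
	var wg sync.WaitGroup

	// Instead of a bare `go wait.Until(dn.worker, time.Second, stopCh)`, track
	// the child so the parent can wait for its graceful shutdown to finish.
	wg.Add(1)
	go func() {
		defer wg.Done()
		defer fmt.Println("rollback finished") // stands in for the deferred file rollback
		wait.Until(worker, time.Second, stopCh)
	}()

	time.Sleep(2 * time.Second) // let a couple of iterations run

	// ... an unrecoverable error has arrived on errCh ...
	close(stopCh) // ask the child to shut down
	wg.Wait()     // block until the child has actually unwound its defers
	fmt.Println("now safe to return ErrAuxiliary and exit 255")
}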

@wking wking changed the title pkg/daemon: Ignore watch failure unless kubelet needs a rebootstrap OCPBUGS-33018: pkg/daemon: Ignore watch failure unless kubelet needs a rebootstrap Apr 25, 2024
@openshift-ci-robot openshift-ci-robot added the jira/severity-critical (Referenced Jira bug's severity is critical for the branch this PR is targeting) and jira/valid-reference (Indicates that this PR references a valid Jira ticket of any type) labels Apr 25, 2024
@openshift-ci-robot
Contributor

@wking: This pull request references Jira Issue OCPBUGS-33018, which is invalid:

  • expected the bug to target the "4.16.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug (Indicates that a referenced Jira bug is invalid for the branch this PR is targeting) label Apr 25, 2024
@wking
Member Author

wking commented Apr 25, 2024

/Jira refresh

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug (Indicates that a referenced Jira bug is valid for the branch this PR is targeting) label and removed the jira/invalid-bug (Indicates that a referenced Jira bug is invalid for the branch this PR is targeting) label Apr 25, 2024
@openshift-ci-robot
Contributor

@wking: This pull request references Jira Issue OCPBUGS-33018, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @sergiordlr

In response to this:

/Jira refresh


@openshift-ci openshift-ci bot requested a review from sergiordlr April 25, 2024 22:44
@yuqi-zhang
Contributor

/test ?

Contributor

openshift-ci bot commented Apr 25, 2024

@yuqi-zhang: The following commands are available to trigger required jobs:

  • /test 4.12-upgrade-from-stable-4.11-images
  • /test cluster-bootimages
  • /test e2e-aws-ovn
  • /test e2e-aws-ovn-upgrade
  • /test e2e-gcp-op
  • /test e2e-gcp-op-single-node
  • /test e2e-hypershift
  • /test images
  • /test unit
  • /test verify

The following commands are available to trigger optional jobs:

  • /test 4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade
  • /test bootstrap-unit
  • /test e2e-aws-disruptive
  • /test e2e-aws-ovn-fips
  • /test e2e-aws-ovn-fips-op
  • /test e2e-aws-ovn-upgrade-out-of-change
  • /test e2e-aws-ovn-workers-rhel8
  • /test e2e-aws-proxy
  • /test e2e-aws-serial
  • /test e2e-aws-single-node
  • /test e2e-aws-upgrade-single-node
  • /test e2e-aws-workers-rhel8
  • /test e2e-azure
  • /test e2e-azure-ovn-upgrade
  • /test e2e-azure-ovn-upgrade-out-of-change
  • /test e2e-azure-upgrade
  • /test e2e-gcp-op-techpreview
  • /test e2e-gcp-ovn-rt-upgrade
  • /test e2e-gcp-rt
  • /test e2e-gcp-rt-op
  • /test e2e-gcp-single-node
  • /test e2e-gcp-upgrade
  • /test e2e-metal-assisted
  • /test e2e-metal-ipi
  • /test e2e-metal-ipi-ovn-dualstack
  • /test e2e-metal-ipi-ovn-ipv6
  • /test e2e-openstack
  • /test e2e-openstack-dualstack
  • /test e2e-openstack-externallb
  • /test e2e-openstack-parallel
  • /test e2e-ovirt
  • /test e2e-ovirt-upgrade
  • /test e2e-ovn-step-registry
  • /test e2e-vsphere
  • /test e2e-vsphere-upgrade
  • /test e2e-vsphere-upi
  • /test e2e-vsphere-upi-zones
  • /test e2e-vsphere-zones
  • /test okd-e2e-aws
  • /test okd-e2e-gcp-op
  • /test okd-e2e-upgrade
  • /test okd-e2e-vsphere
  • /test okd-images
  • /test okd-scos-images
  • /test security

Use /test all to run the following jobs that were automatically triggered:

  • pull-ci-openshift-machine-config-operator-master-bootstrap-unit
  • pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn
  • pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn-upgrade
  • pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn-upgrade-out-of-change
  • pull-ci-openshift-machine-config-operator-master-e2e-azure-ovn-upgrade-out-of-change
  • pull-ci-openshift-machine-config-operator-master-e2e-gcp-op
  • pull-ci-openshift-machine-config-operator-master-e2e-gcp-op-single-node
  • pull-ci-openshift-machine-config-operator-master-e2e-gcp-op-techpreview
  • pull-ci-openshift-machine-config-operator-master-e2e-hypershift
  • pull-ci-openshift-machine-config-operator-master-images
  • pull-ci-openshift-machine-config-operator-master-okd-images
  • pull-ci-openshift-machine-config-operator-master-security
  • pull-ci-openshift-machine-config-operator-master-unit
  • pull-ci-openshift-machine-config-operator-master-verify

In response to this:

/test ?


pkg/daemon/daemon.go (outdated review thread, resolved)
@wking wking force-pushed the daemon-ignore-watch-failure-unless-kubelet-needs-rebootstrap branch from 1a8ab77 to 66cf5a1 on April 25, 2024 22:55
@wking
Member Author

wking commented Apr 25, 2024

Running one to see if this works:

/payload-job periodic-ci-openshift-release-master-nightly-4.16-upgrade-from-stable-4.15-e2e-metal-ipi-upgrade-ovn-ipv6

If that shows us using the target payload we expect, we probably want to run a /payload-aggregate ... because the periodic is only mostly dead, not completely dead, and we don't want to misread a got-lucky-and-passed racy win as "actually fixed this issue".

Contributor

openshift-ci bot commented Apr 25, 2024

@wking: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.16-upgrade-from-stable-4.15-e2e-metal-ipi-upgrade-ovn-ipv6

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/465e69e0-0359-11ef-8ea6-b7071b44f025-0

Some adjustments to the logic from 4d447c5 (backportable version
of api-int cert work, 2024-01-09, openshift#4106).

Factor out the common rebootstrap logic into a new function, just to
avoid repeating ourselves in two places.  This part is a pure
refactor, with no user-visible changes.

Significantly for users, 4d447c5's utilruntime.ErrorHandlers
handling introduced a race like:

1. Daemon is rolling an update out onto disk.
2. Kube API or DNS hiccup causes the daemon's watches, or other Kube
   API requests, to fail.
3. utilruntime.ErrorHandlers feeds that failure into errCh.
4. stopCh set to trigger graceful shutdowns in the child goroutines.
5. Child goroutine performing the update moves into its deferred
   rollbacks (for example, see code around "error rolling back files
   writes").
6. But before that rollback completes, the main goroutine returns
   ErrAuxiliary, and the container exits 255 without finishing the
   rollback.
7. Kubelet launches a replacement container.
8. Replacement container is upset at the interrupted cleanup,
   complains "content mismatch for file..." about some of the
   incoming-but-not-rolled-back content, and exits.
9. Return to step 7, looping forever.

But we ignored these utilruntime.ErrorHandlers errors before 4d447c5,
without trouble beyond responding to api-int Certificate Authority
rotations.  We still want to rebootstrap when we have api-int trouble,
but I'm now gating both the rebootstrap and the ErrAuxiliary return on
deferKubeletRestart.  So now:

1. Daemon is rolling an update out onto disk.
2. Kube API or DNS hiccup causes the daemon's watches, or other Kube
   API requests, to fail.
3. utilruntime.ErrorHandlers feeds that failure into errCh.
4. Daemon knows it does not have a pending kubelet rebootstrap, and it
   ignores the error.

Or we can have:

...
4. Daemon knows it needs to rebootstrap the kubelet, so it does.
5. Daemon returns ErrAuxiliary, and the container exits 255.

There's still room for a partial-rollback race there if we have an
api-int certificate rotation in the works while the Kube API hiccups
while we're rolling update files out to disk, because the current
goroutine handling launches children like:

  go wait.Until(dn.worker, time.Second, stopCh)

so there's no mechanism for the canceled child goroutine to tell the
parent "got your message, and I've finished my graceful shutdown".
But this commit means we are only exposed when there's an ongoing
api-int rotation, and that reduces our risk significantly (they're
less common than generic "any kind of Kube API hiccup matching the
daemon's substrings").  [1] is up to track providing that
child-to-parent return path.

[1]: https://issues.redhat.com/browse/MCO-1154
@wking wking force-pushed the daemon-ignore-watch-failure-unless-kubelet-needs-rebootstrap branch from 66cf5a1 to 2974e2d on April 25, 2024 23:13
}); err != nil {
return fmt.Errorf("something went wrong while waiting for kubeconfig file to generate: %v", err)
}
if dn.deferKubeletRestart && (strings.Contains(strings.ToLower(err.Error()), "failed to watch") || strings.Contains(strings.ToLower(err.Error()), "unknown authority") || strings.Contains(strings.ToLower(err.Error()), "error on the server")) {
Contributor

For a second I thought this was doing the same thing, but the key difference is that ErrAuxiliary gets wrapped into a kubelet restart. I think this should be safe, so let's give it a try.

@openshift-ci openshift-ci bot added the approved (Indicates a PR has been approved by an approver from all required OWNERS files) label Apr 25, 2024
@wking
Member Author

wking commented Apr 26, 2024

The payload job failed, with:

INFO[2024-04-25T23:12:42Z] Resolved source https://github.com/openshift/machine-config-operator to master@9b99954a, merging: #4337 66cf5a19 @wking

and:

event happened 663 times, something is wrong: namespace/openshift-config-operator node/master-0.ostest.test.metalkube.org pod/openshift-config-operator-55d4f665c8-l9l66 hmsg/b037c7d6d4 - reason/BackOff Back-off pulling image "registry.build05.ci.openshift.org/ci-op-s98n4wp9/stable@sha256:3ac58db589768e083d2a4378dec14a7ebc2329e8629f7bc78ab1a5c691a6e0ef" (04:11:53Z) result=reject }

Not sure what's going on with the config operator. And gather-extra failed to collect pod logs. And no bytes in the pod logs in the must-gather:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/openshift-machine-config-operator-4337-nightly-4.16-upgrade-from-stable-4.15-e2e-metal-ipi-upgrade-ovn-ipv6/1783635215088881664/artifacts/e2e-metal-ipi-upgrade-ovn-ipv6/gather-must-gather/artifacts/must-gather.tar | tar -tvz | grep '/openshift-config-operator.*/logs/'
-rw------- 1012900000/root        0 2024-04-25 21:20 quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-0a2db4d71d7957fc2a92bc07c98918f169650dc0a6d040f40a26313e98bba9c3/namespaces/openshift-config-operator/pods/openshift-config-operator-55d4f665c8-l9l66/openshift-api/openshift-api/logs/current.insecure.log
-rw------- 1012900000/root        0 2024-04-25 21:20 quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-0a2db4d71d7957fc2a92bc07c98918f169650dc0a6d040f40a26313e98bba9c3/namespaces/openshift-config-operator/pods/openshift-config-operator-55d4f665c8-l9l66/openshift-api/openshift-api/logs/current.log
-rw------- 1012900000/root        0 2024-04-25 21:20 quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-0a2db4d71d7957fc2a92bc07c98918f169650dc0a6d040f40a26313e98bba9c3/namespaces/openshift-config-operator/pods/openshift-config-operator-55d4f665c8-l9l66/openshift-api/openshift-api/logs/previous.insecure.log
-rw------- 1012900000/root        0 2024-04-25 21:20 quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-0a2db4d71d7957fc2a92bc07c98918f169650dc0a6d040f40a26313e98bba9c3/namespaces/openshift-config-operator/pods/openshift-config-operator-55d4f665c8-l9l66/openshift-api/openshift-api/logs/previous.log
-rw------- 1012900000/root        0 2024-04-25 21:20 quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-0a2db4d71d7957fc2a92bc07c98918f169650dc0a6d040f40a26313e98bba9c3/namespaces/openshift-config-operator/pods/openshift-config-operator-55d4f665c8-l9l66/openshift-config-operator/openshift-config-operator/logs/current.insecure.log
-rw------- 1012900000/root        0 2024-04-25 21:20 quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-0a2db4d71d7957fc2a92bc07c98918f169650dc0a6d040f40a26313e98bba9c3/namespaces/openshift-config-operator/pods/openshift-config-operator-55d4f665c8-l9l66/openshift-config-operator/openshift-config-operator/logs/current.log
-rw------- 1012900000/root        0 2024-04-25 21:20 quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-0a2db4d71d7957fc2a92bc07c98918f169650dc0a6d040f40a26313e98bba9c3/namespaces/openshift-config-operator/pods/openshift-config-operator-55d4f665c8-l9l66/openshift-config-operator/openshift-config-operator/logs/previous.insecure.log
-rw------- 1012900000/root        0 2024-04-25 21:20 quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-0a2db4d71d7957fc2a92bc07c98918f169650dc0a6d040f40a26313e98bba9c3/namespaces/openshift-config-operator/pods/openshift-config-operator-55d4f665c8-l9l66/openshift-config-operator/openshift-config-operator/logs/previous.log

Ah, because we failed to pull that image:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/openshift-machine-config-operator-4337-nightly-4.16-upgrade-from-stable-4.15-e2e-metal-ipi-upgrade-ovn-ipv6/1783635215088881664/artifacts/e2e-metal-ipi-upgrade-ovn-ipv6/gather-extra/artifacts/pods.json | jq -r '.items[] | select(.metadata.name == "openshift-config-operator-55d4f665c8-l9l66").status.initContainerStatuses[]'
{
  "image": "registry.build05.ci.openshift.org/ci-op-s98n4wp9/stable@sha256:3ac58db589768e083d2a4378dec14a7ebc2329e8629f7bc78ab1a5c691a6e0ef",
  "imageID": "",
  "lastState": {},
  "name": "openshift-api",
  "ready": false,
  "restartCount": 0,
  "started": false,
  "state": {
    "waiting": {
      "message": "Back-off pulling image \"registry.build05.ci.openshift.org/ci-op-s98n4wp9/stable@sha256:3ac58db589768e083d2a4378dec14a7ebc2329e8629f7bc78ab1a5c691a6e0ef\"",
      "reason": "ImagePullBackOff"
    }
  }
}

Because networking fell over? Or some kind of proxy thing? I dunno:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/openshift-machine-config-operator-4337-nightly-4.16-upgrade-from-stable-4.15-e2e-metal-ipi-upgrade-ovn-ipv6/1783635215088881664/artifacts/e2e-metal-ipi-upgrade-ovn-ipv6/gather-extra/artifacts/events.json | jq -r '.items[] | select(.reason == "Failed" and (.message | contains("3ac58db589768e083d2a4378dec14a7ebc2329e8629f7bc78ab1a5c691a6e0ef"))).message'
Failed to pull image "registry.build05.ci.openshift.org/ci-op-s98n4wp9/stable@sha256:3ac58db589768e083d2a4378dec14a7ebc2329e8629f7bc78ab1a5c691a6e0ef": (Mirrors also failed: [virthost.ostest.test.metalkube.org:5000/localimages/local-upgrade-image@sha256:3ac58db589768e083d2a4378dec14a7ebc2329e8629f7bc78ab1a5c691a6e0ef: reading manifest sha256:3ac58db589768e083d2a4378dec14a7ebc2329e8629f7bc78ab1a5c691a6e0ef in virthost.ostest.test.metalkube.org:5000/localimages/local-upgrade-image: manifest unknown]): registry.build05.ci.openshift.org/ci-op-s98n4wp9/stable@sha256:3ac58db589768e083d2a4378dec14a7ebc2329e8629f7bc78ab1a5c691a6e0ef: pinging container registry registry.build05.ci.openshift.org: Get "https://registry.build05.ci.openshift.org/v2/": dial tcp 54.145.168.129:443: connect: network is unreachable
Failed to pull image "registry.build05.ci.openshift.org/ci-op-s98n4wp9/stable@sha256:3ac58db589768e083d2a4378dec14a7ebc2329e8629f7bc78ab1a5c691a6e0ef": (Mirrors also failed: [virthost.ostest.test.metalkube.org:5000/localimages/local-upgrade-image@sha256:3ac58db589768e083d2a4378dec14a7ebc2329e8629f7bc78ab1a5c691a6e0ef: reading manifest sha256:3ac58db589768e083d2a4378dec14a7ebc2329e8629f7bc78ab1a5c691a6e0ef in virthost.ostest.test.metalkube.org:5000/localimages/local-upgrade-image: manifest unknown]): registry.build05.ci.openshift.org/ci-op-s98n4wp9/stable@sha256:3ac58db589768e083d2a4378dec14a7ebc2329e8629f7bc78ab1a5c691a6e0ef: pinging container registry registry.build05.ci.openshift.org: Get "https://registry.build05.ci.openshift.org/v2/": dial tcp 54.165.220.45:443: connect: network is unreachable

But:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/openshift-machine-config-operator-4337-nightly-4.16-upgrade-from-stable-4.15-e2e-metal-ipi-upgrade-ovn-ipv6/1783635215088881664/artifacts/e2e-metal-ipi-upgrade-ovn-ipv6/gather-extra/artifacts/clusterversion.json | jq -r '.items[].status.history[] | .startedTime + " " + (.completionTime // "-") + " " + .state + " " + .version'
2024-04-26T01:41:22Z - Partial 4.16.0-0.ci.test-2024-04-25-231905-ci-op-s98n4wp9-latest
2024-04-26T00:10:33Z 2024-04-26T01:05:53Z Completed 4.15.10

so it looks like it is doing the 4.15 -> freshly-built 4.16 upgrade that we want. I'll run another:

/payload-job periodic-ci-openshift-release-master-nightly-4.16-upgrade-from-stable-4.15-e2e-metal-ipi-upgrade-ovn-ipv6

Contributor

openshift-ci bot commented Apr 26, 2024

@wking: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.16-upgrade-from-stable-4.15-e2e-metal-ipi-upgrade-ovn-ipv6

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/9a858820-0389-11ef-992a-e70884edc1d8-0

@yuqi-zhang
Contributor

We can't seem to pre-merge test this via the payload and it's a safe enough change.

/lgtm
/override ci/prow/e2e-hypershift

@openshift-ci openshift-ci bot added the lgtm (Indicates that a PR is ready to be merged) label Apr 26, 2024
Contributor

openshift-ci bot commented Apr 26, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wking, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

openshift-ci bot commented Apr 26, 2024

@yuqi-zhang: Overrode contexts on behalf of yuqi-zhang: ci/prow/e2e-hypershift

In response to this:

We can't seem to pre-merge test this via the payload and it's a safe enough change.

/lgtm
/override ci/prow/e2e-hypershift


Contributor

openshift-ci bot commented Apr 26, 2024

@wking: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
ci/prow/security | 2974e2d | link | false | /test security

Full PR test history. Your PR dashboard.


@openshift-merge-bot openshift-merge-bot bot merged commit 26c7742 into openshift:master Apr 26, 2024
14 of 15 checks passed
@openshift-ci-robot
Contributor

@wking: Jira Issue OCPBUGS-33018: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-33018 has been moved to the MODIFIED state.


@wking wking deleted the daemon-ignore-watch-failure-unless-kubelet-needs-rebootstrap branch April 26, 2024 15:44
@openshift-merge-robot
Contributor

Fix included in accepted release 4.16.0-0.nightly-2024-04-29-154406

@openshift-merge-robot
Contributor

Fix included in accepted release 4.16.0-0.nightly-2024-05-08-222442
