OCPBUGS-8287: [release-4.10] UPSTREAM: 116376: node: device-mgr: Handle recovery by checking if healthy device exists #1566

swatisehgal · 2023-05-04T17:17:24Z

What this PR does / why we need it:

This is a targeted cherry pick containing only the core fix from kubernetes#116376.

Rationale for backporting the bugfix:
During the allocation phase, the resources requested by a pod should be allocated only if the device plugin has registered itself with the kubelet AND healthy devices are present on the node.
If these conditions are not satisfied (which can happen after a node reboot or kubelet restart, because there is no way to control the pod restart order), the pod fails with an UnexpectedAdmissionError.

For more details on the rationale, refer to the comment here: kubernetes#109595 (comment)

Additional Notes for the reviewers:
Validation of the change can be performed using openshift/origin#27902.

openshift-ci · 2023-05-04T17:17:32Z

@swatisehgal: No Bugzilla bug is referenced in the title of this pull request.
To reference a bug, add 'Bug XXX:' to the title of this pull request and request another bug refresh with /bugzilla refresh.

In response to this:

[release-4.10] UPSTREAM: 116376: node: device-mgr: Handle recovery by checking if healthy device exists

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci-robot · 2023-05-04T17:17:34Z

@swatisehgal: the contents of this pull request could be automatically validated.

The following commits are valid:

0f1feac|UPSTREAM: 116376: node: device-mgr: Handle recovery by checking if healthy devices exist: the upstream PR kubernetes/kubernetes#116376 has merged

Comment /validate-backports to re-evaluate validity of the upstream PRs, for example when they are merged upstream.

openshift-ci · 2023-05-04T17:34:18Z

@swatisehgal: No Bugzilla bug is referenced in the title of this pull request.
To reference a bug, add 'Bug XXX:' to the title of this pull request and request another bug refresh with /bugzilla refresh.

In response to this:

OCPBUGS-2180: [release-4.10] UPSTREAM: 116376: node: device-mgr: Handle recovery by checking if healthy device exists

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci-robot · 2023-05-04T17:34:20Z

@swatisehgal: This pull request references Jira Issue OCPBUGS-2180, which is invalid:

expected the bug to target the "4.10.z" version, but it targets "4.13.0" instead
expected Jira Issue OCPBUGS-2180 to depend on a bug targeting a version in 4.11.0, 4.11.z and in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), but no dependents were found

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

What this PR does / why we need it:

This is a targeted cherry pick containing only the core fix from https://github.com/kubernetes/kubernetes/pull//116337.

Rationale for backporting the bugfix:
During the allocation phase, the resources requested by a pod should be allocated only if the device plugin has registered itself with the kubelet AND healthy devices are present on the node.
If these conditions are not satisfied (which can happen after a node reboot or kubelet restart, because there is no way to control the pod restart order), the pod fails with an UnexpectedAdmissionError.

For more details on the rationale, refer to the comment here: kubernetes#109595 (comment)

Additional Notes for the reviewers:
Validation of the change can be performed using openshift/origin#27902.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci · 2023-05-04T17:38:27Z

@swatisehgal: No Bugzilla bug is referenced in the title of this pull request.
To reference a bug, add 'Bug XXX:' to the title of this pull request and request another bug refresh with /bugzilla refresh.

In response to this:

[release-4.10] UPSTREAM: 116376: node: device-mgr: Handle recovery by checking if healthy device exists

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci-robot · 2023-05-04T17:38:30Z

@swatisehgal: No Jira issue is referenced in the title of this pull request.
To reference a jira issue, add 'XYZ-NNN:' to the title of this pull request and request another refresh with /jira refresh.

In response to this:

What this PR does / why we need it:

This is a targeted cherry pick containing only the core fix from https://github.com/kubernetes/kubernetes/pull//116337.

Rationale for backporting the bugfix:
During the allocation phase, the resources requested by a pod should be allocated only if the device plugin has registered itself with the kubelet AND healthy devices are present on the node.
If these conditions are not satisfied (which can happen after a node reboot or kubelet restart, because there is no way to control the pod restart order), the pod fails with an UnexpectedAdmissionError.

For more details on the rationale, refer to the comment here: kubernetes#109595 (comment)

Additional Notes for the reviewers:
Validation of the change can be performed using openshift/origin#27902.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci · 2023-05-04T17:52:52Z

@swatisehgal: No Bugzilla bug is referenced in the title of this pull request.
To reference a bug, add 'Bug XXX:' to the title of this pull request and request another bug refresh with /bugzilla refresh.

In response to this:

[release-4.10] UPSTREAM: 116376: node: device-mgr: Handle recovery by checking if healthy device exists

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci-robot · 2023-05-05T10:25:43Z

@swatisehgal: the contents of this pull request could be automatically validated.

The following commits are valid:

df4c0e9|UPSTREAM: 116376: node: device-mgr: Handle recovery by checking if healthy devices exist: the upstream PR kubernetes/kubernetes#116376 has merged

Comment /validate-backports to re-evaluate validity of the upstream PRs, for example when they are merged upstream.

…althy devices exist

In case of node reboot/kubelet restart, the flow of events involves obtaining the state from the checkpoint file and then resetting `healthyDevices`/`unhealthyDevices` to their zero values. This is done to allow the device plugin to re-register itself so that capacity can be updated appropriately.

During the allocation phase, we need to check that the resources requested by the pod have been registered AND that healthy devices are present on the node to be allocated.

We also need to move this check above the `needed==0` check, where `needed` is the required count minus the devices already allocated to the container (obtained from the checkpoint file). Even when no additional devices have to be allocated (because they were pre-allocated), we still need to make sure the devices that were previously allocated are healthy.

Signed-off-by: Swati Sehgal <swsehgal@redhat.com>

openshift-ci-robot · 2023-05-05T12:50:04Z

@swatisehgal: the contents of this pull request could be automatically validated.

The following commits are valid:

613c8d1|UPSTREAM: 116376: node: device-mgr: Handle recovery by checking if healthy devices exist: the upstream PR kubernetes/kubernetes#116376 has merged

Comment /validate-backports to re-evaluate validity of the upstream PRs, for example when they are merged upstream.

swatisehgal · 2023-05-05T15:07:30Z

/test e2e-aws-cgroupsv2

swatisehgal · 2023-05-05T15:08:03Z

/test e2e-gcp

openshift-ci · 2023-05-15T12:35:42Z

@swatisehgal: No Bugzilla bug is referenced in the title of this pull request.
To reference a bug, add 'Bug XXX:' to the title of this pull request and request another bug refresh with /bugzilla refresh.

In response to this:

[release-4.10] UPSTREAM: 116376: node: device-mgr: Handle recovery by checking if healthy device exists

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

swatisehgal · 2023-05-18T16:35:27Z

/test k8s-e2e-aws

openshift-ci · 2023-05-18T18:10:10Z

@swatisehgal: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

swatisehgal · 2023-05-26T10:14:49Z

/jira refresh

openshift-ci-robot · 2023-05-26T10:14:52Z

@swatisehgal: This pull request references Jira Issue OCPBUGS-8287, which is invalid:

expected dependent Jira Issue OCPBUGS-14140 to be in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), but it is New instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

swatisehgal · 2023-05-30T11:31:17Z

/cc @soltysh

soltysh

/lgtm
/approve
/label backport-risk-assessed

openshift-ci · 2023-05-30T16:14:42Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: soltysh, swatisehgal

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/kubelet/DOWNSTREAM_OWNERS~~ [soltysh]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

swatisehgal · 2023-06-01T08:43:13Z

/jira refresh

openshift-ci-robot · 2023-06-01T08:43:16Z

@swatisehgal: This pull request references Jira Issue OCPBUGS-8287, which is invalid:

expected dependent Jira Issue OCPBUGS-14140 to be in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), but it is New instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-bot · 2023-08-30T09:00:16Z

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot · 2023-09-30T00:30:33Z

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-ci-robot · 2023-09-30T00:30:41Z

@swatisehgal: This pull request references Jira Issue OCPBUGS-8287, which is invalid:

expected dependent Jira Issue OCPBUGS-14140 to be in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA), but it is POST instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

What this PR does / why we need it:

This is a targeted cherry pick containing only the core fix from kubernetes#116376.

Rationale for backporting the bugfix:
During the allocation phase, the resources requested by a pod should be allocated only if the device plugin has registered itself with the kubelet AND healthy devices are present on the node.
If these conditions are not satisfied (which can happen after a node reboot or kubelet restart, because there is no way to control the pod restart order), the pod fails with an UnexpectedAdmissionError.

For more details on the rationale, refer to the comment here: kubernetes#109595 (comment)

Additional Notes for the reviewers:
Validation of the change can be performed using openshift/origin#27902.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

swatisehgal · 2023-10-09T11:46:14Z

/close
We no longer plan to backport the fix to the 4.10 branch.

openshift-ci · 2023-10-09T11:46:30Z

@swatisehgal: Closed this PR.

In response to this:

/close
We no longer plan to backport the fix to the 4.10 branch.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci-robot · 2023-10-09T11:46:32Z

@swatisehgal: This pull request references Jira Issue OCPBUGS-8287. The bug has been updated to no longer refer to the pull request using the external bug tracker.

In response to this:

What this PR does / why we need it:

This is a targeted cherry pick containing only the core fix from kubernetes#116376.

Rationale for backporting the bugfix:
During the allocation phase, the resources requested by a pod should be allocated only if the device plugin has registered itself with the kubelet AND healthy devices are present on the node.
If these conditions are not satisfied (which can happen after a node reboot or kubelet restart, because there is no way to control the pod restart order), the pod fails with an UnexpectedAdmissionError.

For more details on the rationale, refer to the comment here: kubernetes#109595 (comment)

Additional Notes for the reviewers:
Validation of the change can be performed using openshift/origin#27902.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci-robot added the backports/validated-commits Indicates that all commits come to merged upstream PRs. label May 4, 2023

openshift-ci bot requested review from rphillips and sjenning May 4, 2023 17:18

swatisehgal changed the title ~~[release-4.10] UPSTREAM: 116376: node: device-mgr: Handle recovery by checking if healthy device exists~~ OCPBUGS-2180: [release-4.10] UPSTREAM: 116376: node: device-mgr: Handle recovery by checking if healthy device exists May 4, 2023

swatisehgal changed the title ~~OCPBUGS-2180: [release-4.10] UPSTREAM: 116376: node: device-mgr: Handle recovery by checking if healthy device exists~~ [release-4.10] UPSTREAM: 116376: node: device-mgr: Handle recovery by checking if healthy device exists May 4, 2023

swatisehgal force-pushed the devicemgr-fix-4.10 branch from 0f1feac to df4c0e9 Compare May 5, 2023 10:25

swatisehgal force-pushed the devicemgr-fix-4.10 branch from df4c0e9 to 613c8d1 Compare May 5, 2023 12:49

swatisehgal changed the title ~~[release-4.10] UPSTREAM: 116376: node: device-mgr: Handle recovery by checking if healthy device exists~~ OCPBUGS-8287 [release-4.10] UPSTREAM: 116376: node: device-mgr: Handle recovery by checking if healthy device exists May 26, 2023

swatisehgal changed the title ~~OCPBUGS-8287 [release-4.10] UPSTREAM: 116376: node: device-mgr: Handle recovery by checking if healthy device exists~~ OCPBUGS-8287: [release-4.10] UPSTREAM: 116376: node: device-mgr: Handle recovery by checking if healthy device exists May 26, 2023

openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label May 26, 2023

openshift-ci bot requested a review from soltysh May 30, 2023 11:31

soltysh approved these changes May 30, 2023

openshift-ci bot added the backport-risk-assessed Indicates a PR to a release branch has been evaluated and considered safe to accept. label May 30, 2023

openshift-ci bot assigned duanwei33, gangwgr, kasturinarra, sunilcio, wangke19, xingxingxia, zhouying7780 and soltysh May 30, 2023

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 30, 2023

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 30, 2023

openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 30, 2023

openshift-ci bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 30, 2023

openshift-ci bot added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Sep 30, 2023

openshift-ci bot closed this Oct 9, 2023
