OCPBUGS-8287: [release-4.10] UPSTREAM: 116376: node: device-mgr: Handle recovery by checking if healthy device exists #1566

Closed

@swatisehgal swatisehgal commented May 4, 2023

What this PR does / why we need it:

This is a targeted cherry pick containing only the core fix from kubernetes#116376.

Rationale for backporting the bugfix
During the allocation phase, the resources requested by a pod should be allocated only if the device plugin has registered itself with the kubelet AND healthy devices are present on the node.
If these conditions are not satisfied (which can happen after a node reboot or kubelet restart, because there is no way to control the pod restart order), the pod would fail with an UnexpectedAdmissionError.

For more details on the rationale, refer to the comment here: kubernetes#109595 (comment)

Additional Notes for the reviewers:
Validation of the change can be performed using openshift/origin#27902.

@openshift-ci-robot openshift-ci-robot added the backports/validated-commits Indicates that all commits come to merged upstream PRs. label May 4, 2023
@openshift-ci

openshift-ci bot commented May 4, 2023

@swatisehgal: No Bugzilla bug is referenced in the title of this pull request.
To reference a bug, add 'Bug XXX:' to the title of this pull request and request another bug refresh with /bugzilla refresh.

In response to this:

[release-4.10] UPSTREAM: 116376: node: device-mgr: Handle recovery by checking if healthy device exists

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot

@swatisehgal: the contents of this pull request could be automatically validated.

The following commits are valid:

Comment /validate-backports to re-evaluate validity of the upstream PRs, for example when they are merged upstream.

@openshift-ci openshift-ci bot requested review from rphillips and sjenning May 4, 2023 17:18
@swatisehgal swatisehgal changed the title [release-4.10] UPSTREAM: 116376: node: device-mgr: Handle recovery by checking if healthy device exists OCPBUGS-2180: [release-4.10] UPSTREAM: 116376: node: device-mgr: Handle recovery by checking if healthy device exists May 4, 2023
@openshift-ci

openshift-ci bot commented May 4, 2023

@swatisehgal: No Bugzilla bug is referenced in the title of this pull request.
To reference a bug, add 'Bug XXX:' to the title of this pull request and request another bug refresh with /bugzilla refresh.

In response to this:

OCPBUGS-2180: [release-4.10] UPSTREAM: 116376: node: device-mgr: Handle recovery by checking if healthy device exists

@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 4, 2023
@openshift-ci-robot

@swatisehgal: This pull request references Jira Issue OCPBUGS-2180, which is invalid:

  • expected the bug to target the "4.10.z" version, but it targets "4.13.0" instead
  • expected Jira Issue OCPBUGS-2180 to depend on a bug targeting a version in 4.11.0, 4.11.z and in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), but no dependents were found

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

@swatisehgal swatisehgal changed the title OCPBUGS-2180: [release-4.10] UPSTREAM: 116376: node: device-mgr: Handle recovery by checking if healthy device exists [release-4.10] UPSTREAM: 116376: node: device-mgr: Handle recovery by checking if healthy device exists May 4, 2023
@openshift-ci

openshift-ci bot commented May 4, 2023

@swatisehgal: No Bugzilla bug is referenced in the title of this pull request.
To reference a bug, add 'Bug XXX:' to the title of this pull request and request another bug refresh with /bugzilla refresh.

In response to this:

[release-4.10] UPSTREAM: 116376: node: device-mgr: Handle recovery by checking if healthy device exists

@openshift-ci-robot openshift-ci-robot removed jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 4, 2023
@openshift-ci-robot

@swatisehgal: No Jira issue is referenced in the title of this pull request.
To reference a jira issue, add 'XYZ-NNN:' to the title of this pull request and request another refresh with /jira refresh.

@openshift-ci

openshift-ci bot commented May 4, 2023

@swatisehgal: No Bugzilla bug is referenced in the title of this pull request.
To reference a bug, add 'Bug XXX:' to the title of this pull request and request another bug refresh with /bugzilla refresh.

In response to this:

[release-4.10] UPSTREAM: 116376: node: device-mgr: Handle recovery by checking if healthy device exists

@openshift-ci-robot

@swatisehgal: the contents of this pull request could be automatically validated.

The following commits are valid:

Comment /validate-backports to re-evaluate validity of the upstream PRs, for example when they are merged upstream.

…althy devices exist

In case of node reboot/kubelet restart, the flow of events involves
obtaining the state from the checkpoint file followed by setting
`healthyDevices`/`unhealthyDevices` to their zero values. This is
done to allow the device plugin to re-register itself so that
capacity can be updated appropriately.

During the allocation phase, we need to check that the resources
requested by the pod have been registered AND that healthy devices
are present on the node to be allocated.

We also need to move this check above the `needed==0` check, where
`needed` is the number of required devices minus the devices already
allocated to the container (obtained from the checkpoint file).
Even in cases where no additional devices have to be allocated
(as they were pre-allocated), we still need to make sure the
devices that were previously allocated are healthy.

Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
@openshift-ci-robot

@swatisehgal: the contents of this pull request could be automatically validated.

The following commits are valid:

Comment /validate-backports to re-evaluate validity of the upstream PRs, for example when they are merged upstream.

@swatisehgal
Author

/test e2e-aws-cgroupsv2

@swatisehgal
Author

/test e2e-gcp

@openshift-ci

openshift-ci bot commented May 15, 2023

@swatisehgal: No Bugzilla bug is referenced in the title of this pull request.
To reference a bug, add 'Bug XXX:' to the title of this pull request and request another bug refresh with /bugzilla refresh.

In response to this:

[release-4.10] UPSTREAM: 116376: node: device-mgr: Handle recovery by checking if healthy device exists

@swatisehgal
Author

/test k8s-e2e-aws

@openshift-ci

openshift-ci bot commented May 18, 2023

@swatisehgal: all tests passed!

Full PR test history. Your PR dashboard.

@swatisehgal swatisehgal changed the title [release-4.10] UPSTREAM: 116376: node: device-mgr: Handle recovery by checking if healthy device exists OCPBUGS-8287 [release-4.10] UPSTREAM: 116376: node: device-mgr: Handle recovery by checking if healthy device exists May 26, 2023
@swatisehgal swatisehgal changed the title OCPBUGS-8287 [release-4.10] UPSTREAM: 116376: node: device-mgr: Handle recovery by checking if healthy device exists OCPBUGS-8287: [release-4.10] UPSTREAM: 116376: node: device-mgr: Handle recovery by checking if healthy device exists May 26, 2023
@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label May 26, 2023
@swatisehgal
Author

/jira refresh

@openshift-ci-robot

@swatisehgal: This pull request references Jira Issue OCPBUGS-8287, which is invalid:

  • expected dependent Jira Issue OCPBUGS-14140 to be in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), but it is New instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/jira refresh

@swatisehgal
Author

/cc @soltysh

@openshift-ci openshift-ci bot requested a review from soltysh May 30, 2023 11:31
Member

@soltysh soltysh left a comment

/lgtm
/approve
/label backport-risk-assessed

@openshift-ci openshift-ci bot added the backport-risk-assessed Indicates a PR to a release branch has been evaluated and considered safe to accept. label May 30, 2023
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 30, 2023
@openshift-ci

openshift-ci bot commented May 30, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: soltysh, swatisehgal

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 30, 2023
@swatisehgal
Author

/jira refresh

@openshift-ci-robot

@swatisehgal: This pull request references Jira Issue OCPBUGS-8287, which is invalid:

  • expected dependent Jira Issue OCPBUGS-14140 to be in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), but it is New instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/jira refresh

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 30, 2023
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 30, 2023
@openshift-ci-robot

@swatisehgal: This pull request references Jira Issue OCPBUGS-8287, which is invalid:

  • expected dependent Jira Issue OCPBUGS-14140 to be in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA), but it is POST instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

@openshift-ci openshift-ci bot added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Sep 30, 2023
@swatisehgal
Author

/close
We no longer plan to backport the fix to the 4.10 branch.

@openshift-ci openshift-ci bot closed this Oct 9, 2023
@openshift-ci

openshift-ci bot commented Oct 9, 2023

@swatisehgal: Closed this PR.

In response to this:

/close
We no longer plan to backport the fix to the 4.10 branch.

@openshift-ci-robot

@swatisehgal: This pull request references Jira Issue OCPBUGS-8287. The bug has been updated to no longer refer to the pull request using the external bug tracker.

Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
backport-risk-assessed: Indicates a PR to a release branch has been evaluated and considered safe to accept.
backports/validated-commits: Indicates that all commits come to merged upstream PRs.
jira/invalid-bug: Indicates that a referenced Jira bug is invalid for the branch this PR is targeting.
jira/severity-important: Referenced Jira bug's severity is important for the branch this PR is targeting.
jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
lgtm: Indicates that a PR is ready to be merged.
lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.