New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OCPBUGS-8287: [release-4.10] UPSTREAM: 116376: node: device-mgr: Handle recovery by checking if healthy device exists #1566
Conversation
@swatisehgal: No Bugzilla bug is referenced in the title of this pull request. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@swatisehgal: the contents of this pull request could be automatically validated. The following commits are valid:
Comment |
@swatisehgal: No Bugzilla bug is referenced in the title of this pull request. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@swatisehgal: This pull request references Jira Issue OCPBUGS-2180, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@swatisehgal: No Bugzilla bug is referenced in the title of this pull request. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@swatisehgal: No Jira issue is referenced in the title of this pull request. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@swatisehgal: No Bugzilla bug is referenced in the title of this pull request. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
0f1feac
to
df4c0e9
Compare
@swatisehgal: the contents of this pull request could be automatically validated. The following commits are valid:
Comment |
…althy devices exist In case of node reboot/kubelet restart, the flow of events involves obtaining the state from the checkpoint file followed by setting the `healthDevices`/`unhealthyDevices` to its zero value. This is done to allow the device plugin to re-register itself so that capacity can be updated appropriately. During the allocation phase, we need to check if the resources requested by the pod have been registered AND healthy devices are present on the node to be allocated. Also we need to move this check above `needed==0` where needed is required - devices allocated to the container (which is obtained from the checkpoint file) because even in cases where no additional devices have to be allocated (as they were pre-allocated), we still need to make sure he devices that were previously allocated are healthy. Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
df4c0e9
to
613c8d1
Compare
@swatisehgal: the contents of this pull request could be automatically validated. The following commits are valid:
Comment |
/test e2e-aws-cgroupsv2 |
/test e2e-gcp |
@swatisehgal: No Bugzilla bug is referenced in the title of this pull request. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/test k8s-e2e-aws |
@swatisehgal: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
/jira refresh |
@swatisehgal: This pull request references Jira Issue OCPBUGS-8287, which is invalid:
Comment In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/cc @soltysh |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/approve
/label backport-risk-assessed
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: soltysh, swatisehgal The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/jira refresh |
@swatisehgal: This pull request references Jira Issue OCPBUGS-8287, which is invalid:
Comment In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting If this issue is safe to close now please do so with /lifecycle stale |
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting If this issue is safe to close now please do so with /lifecycle rotten |
@swatisehgal: This pull request references Jira Issue OCPBUGS-8287, which is invalid:
Comment In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/close |
@swatisehgal: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@swatisehgal: This pull request references Jira Issue OCPBUGS-8287. The bug has been updated to no longer refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
What this PR does / why we need it:
This is a targeted cherry pick containing only the core fix from kubernetes#116376.
Rationale for the backporting the bugfix
During the allocation phase, we need to ensure that the resources requested by the pod should only be allocated if the device plugin has registered itself to kubelet AND healthy devices are present on the node to be allocated.
If these conditions are not satisfied (which can happen in case of node reboot/kubelet restart because there is no way to control the pod restart order), the pod would fail with
UnexpectedAdmissionError
error.For more details on the rationale, refer to the comment here: kubernetes#109595 (comment)
Additional Notes for the reviewers:
Validation of the change can be performed using openshift/origin#27902.