Fix a bug in the gpu device plugin #77035
Hi @chardch. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/cc @jiayingz
/lgtm
/kind bug
/ok-to-test
/priority backlog
/test pull-kubernetes-e2e-gce-device-plugin-gpu
@chardch if this fails again, something may be wrong with the changed image.
Thanks, I'll check the image again.
/test pull-kubernetes-e2e-gce-device-plugin-gpu
@neolit123 It looks like the pod error was caused by the NVIDIA driver version being too old for the new CUDA 10 vector addition container. Thanks to @jiayingz for helping to diagnose the issue here.
/test pull-kubernetes-e2e-gce-device-plugin-gpu
I've been debugging this, and so far the driver version (410.79) and the CUDA runtime version (10.0.130) from the failing container look like they should be compatible.
@chardch: GitHub didn't allow me to assign the following users: Robert, Bailey. Note that only kubernetes members and repo collaborators can be assigned and that issues/PRs can only have 10 assignees at the same time.
/assign mikedanese
Update: The failing test looks to be due to the cuda10 container not having LD_LIBRARY_PATH set.
/test pull-kubernetes-e2e-gce-device-plugin-gpu
The issue was that the NVIDIA base image used in cuda-vector-add:2.0 was stale and did not include the update that sets the LD_LIBRARY_PATH environment variable in the image. As a result, the container using cuda-vector-add:2.0 could not find the NVIDIA libraries, because this PR also includes GoogleCloudPlatform/container-engine-accelerators#111, which reverted setting LD_LIBRARY_PATH because of that change in the base NVIDIA image.
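For anyone hitting a similar failure, a quick check run inside the container can confirm whether the library directory is on the search path. This is only a minimal sketch: the directory `/usr/local/nvidia/lib64` and the helper name `libDirListed` are assumptions for illustration and are not taken from this PR or from the image.

```go
package main

import (
	"fmt"
	"os"
	"path"
	"strings"
)

// libDirListed reports whether dir appears as an entry in a colon-separated
// search path such as the value of LD_LIBRARY_PATH.
// (Hypothetical helper, not part of the device plugin.)
func libDirListed(searchPath, dir string) bool {
	want := path.Clean(dir)
	for _, entry := range strings.Split(searchPath, ":") {
		// Skip empty entries produced by leading, trailing, or doubled colons.
		if entry != "" && path.Clean(entry) == want {
			return true
		}
	}
	return false
}

func main() {
	// Assumed library directory, for illustration only.
	const nvidiaLibDir = "/usr/local/nvidia/lib64"
	if libDirListed(os.Getenv("LD_LIBRARY_PATH"), nvidiaLibDir) {
		fmt.Println("LD_LIBRARY_PATH includes", nvidiaLibDir)
	} else {
		fmt.Println("LD_LIBRARY_PATH is missing", nvidiaLibDir)
	}
}
```

If the variable is unset, as it was with the stale base image, the check reports the directory as missing.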
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: chardch, mikedanese
What type of PR is this?
What this PR does / why we need it:
The GPU device plugin did not register GPU devices that were installed after the plugin server started. As a result, fewer GPUs were registered with the kubelet whenever the plugin server started before all GPU devices had finished installing.
This PR uses the updated GPU device plugin, which periodically checks for new GPU devices.
Refer to GoogleCloudPlatform/container-engine-accelerators#110
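The periodic check described above could look roughly like the sketch below. It is not the code from container-engine-accelerators#110: the function names, the `/dev` scan, the `nvidiaN` naming pattern, and the 100 ms interval are all assumptions chosen to keep the example short and runnable.

```go
package main

import (
	"fmt"
	"os"
	"regexp"
	"sort"
	"time"
)

// gpuNodeRE matches device node names such as nvidia0, nvidia1, ...
// (the naming pattern is an assumption for this sketch).
var gpuNodeRE = regexp.MustCompile(`^nvidia[0-9]+$`)

// scanDevices lists the GPU device nodes currently present in devDir.
func scanDevices(devDir string) ([]string, error) {
	entries, err := os.ReadDir(devDir)
	if err != nil {
		return nil, err
	}
	var found []string
	for _, e := range entries {
		if gpuNodeRE.MatchString(e.Name()) {
			found = append(found, e.Name())
		}
	}
	sort.Strings(found)
	return found, nil
}

// addNew records every name in current that is not yet in known and
// returns the newly seen names in sorted order.
func addNew(known map[string]bool, current []string) []string {
	var added []string
	for _, name := range current {
		if !known[name] {
			known[name] = true
			added = append(added, name)
		}
	}
	sort.Strings(added)
	return added
}

func main() {
	known := map[string]bool{}
	// A real plugin would loop until shutdown; three short rounds keep
	// this sketch finite.
	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()
	for i := 0; i < 3; i++ {
		current, err := scanDevices("/dev")
		if err != nil {
			fmt.Println("scan failed:", err)
		} else if added := addNew(known, current); len(added) > 0 {
			// A real plugin would report these devices to the kubelet here.
			fmt.Println("newly discovered devices:", added)
		}
		<-ticker.C
	}
}
```

Rescanning on a timer means devices that finish installing after the plugin server starts are picked up on a later round instead of being missed.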
Does this PR introduce a user-facing change?: