Fix a bug in the gpu device plugin #77035

chardch · 2019-04-25T01:06:06Z

What type of PR is this?

/kind bug

What this PR does / why we need it:
The gpu device plugin was not registering any gpu devices that were installed after the plugin server started. As a result, fewer gpus were registered with the kubelet whenever the plugin server started before all gpu devices had finished installing.

This PR uses the updated gpu device plugin that periodically checks for new gpu devices.
Refer to GoogleCloudPlatform/container-engine-accelerators#110
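
For readers unfamiliar with the failure mode, here is a minimal sketch of the "periodically check for new devices" idea. It is not the plugin's actual code (which lives in the linked repository); the `/dev/nvidiaN` naming pattern, the `GpuRegistry` class, and the 10-second interval are all assumptions made for illustration.

```python
import os
import re
import threading

# Assumption: GPU device nodes appear as /dev/nvidia0, /dev/nvidia1, ...
NVIDIA_DEVICE_RE = re.compile(r"^nvidia[0-9]+$")


def discover_gpus(dev_dir="/dev"):
    """Return the set of GPU device names currently present under dev_dir."""
    try:
        names = os.listdir(dev_dir)
    except FileNotFoundError:
        return set()
    return {name for name in names if NVIDIA_DEVICE_RE.match(name)}


class GpuRegistry:
    """Hypothetical registry that keeps rescanning instead of scanning once."""

    def __init__(self, discover=discover_gpus):
        self._discover = discover
        self._lock = threading.Lock()
        self.devices = set()

    def rescan(self):
        """Record newly found devices and return only the new ones."""
        found = self._discover()
        with self._lock:
            new = found - self.devices
            self.devices |= new
        return new

    def run(self, stop_event, interval_seconds=10.0):
        """Rescan until stop_event is set, sleeping between scans."""
        while not stop_event.is_set():
            new = self.rescan()
            if new:
                print("found new GPU devices:", sorted(new))
            stop_event.wait(interval_seconds)
```

A one-shot scan at startup corresponds to calling `rescan()` once; a device installed afterwards is picked up only if the scan is repeated, which is what the loop in `run` does.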

Does this PR introduce a user-facing change?:

NONE

k8s-ci-robot · 2019-04-25T01:06:14Z

Hi @chardch. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

chardch · 2019-04-25T17:45:46Z

/cc @jiayingz

jiayingz · 2019-04-25T23:09:56Z

/lgtm

neolit123 · reviewed Apr 26, 2019

/kind bug
/ok-to-test
/priority backlog

/test pull-kubernetes-e2e-gce-device-plugin-gpu
@chardch if this fails again, possibly something is not right with the changed image.

chardch · 2019-04-26T21:19:27Z

Thanks, I'll check the image again

chardch · 2019-04-26T21:22:36Z

/test pull-kubernetes-e2e-gce-device-plugin-gpu

chardch · 2019-04-26T23:10:36Z

/test pull-kubernetes-e2e-gce-device-plugin-gpu

chardch · 2019-04-26T23:52:33Z

@neolit123 It looks like the pod error was due to the nvidia driver version being insufficient for the new CUDA10 vector addition container. Thanks to @jiayingz for helping to diagnose the issue here.

chardch · 2019-04-26T23:57:10Z

The device plugin image looks correct.

chardch · 2019-04-29T18:16:15Z

/test pull-kubernetes-e2e-gce-device-plugin-gpu

chardch · 2019-04-29T23:21:40Z

Been trying to debug this, but so far it looks like the driver version (410.79) and CUDA runtime version (10.0.130) from the failing container should be compatible.
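
A quick way to sanity-check the compatibility claim is to compare the driver version against the minimum the CUDA release requires. The sketch below assumes a minimum Linux driver of 410.48 for CUDA 10.0.130, a figure recalled from NVIDIA's release notes rather than stated in this thread, so verify it against the official table before relying on it. The function names are hypothetical.

```python
# Assumption: minimum Linux driver for CUDA 10.0.130, per NVIDIA's release
# notes as recalled; not stated anywhere in this PR thread.
ASSUMED_MIN_DRIVER_FOR_CUDA_10_0 = "410.48"


def parse_version(version):
    """Turn a dotted version string such as '410.79' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def driver_satisfies(driver_version, minimum=ASSUMED_MIN_DRIVER_FOR_CUDA_10_0):
    """Return True if driver_version is at least the required minimum."""
    return parse_version(driver_version) >= parse_version(minimum)
```

Under that assumption, `driver_satisfies("410.79")` is true, which matches the observation that the pair should be compatible and that the "insufficient driver" error points elsewhere.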

k8s-ci-robot · 2019-04-29T23:48:53Z

@chardch: GitHub didn't allow me to assign the following users: Robert, Bailey.

Note that only kubernetes members and repo collaborators can be assigned and that issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign Robert Bailey

This PR only touches the device plugin image, which has been verified to work. The test is currently failing because the Nvidia driver version (410.79) is reported as insufficient for the CUDA runtime version (10.0.130), even though the two should be a compatible pair as listed here: https://docs.nvidia.com/deploy/cuda-compatibility/index.html#binary-compatibility

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

chardch · 2019-04-29T23:52:01Z

/assign mikedanese

update: The failing test looks to be due to the cuda10 container not having LD_LIBRARY_PATH set.

chardch · 2019-05-01T01:03:44Z

/test pull-kubernetes-e2e-gce-device-plugin-gpu

chardch · 2019-05-01T03:31:15Z

The issue was that the Nvidia base image used in cuda-vector-add:2.0 was stale and didn't include the update that set the LD_LIBRARY_PATH environment variable in the image.

As a result, the container using cuda-vector-add:2.0 could not find the nvidia libraries: this PR also includes GoogleCloudPlatform/container-engine-accelerators#111, which reverted setting LD_LIBRARY_PATH because the updated base nvidia image sets it.
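
To illustrate the kind of check that would have surfaced this earlier, here is a minimal sketch that inspects a container's environment for the driver library directory. The directory `/usr/local/nvidia/lib64` is an assumption (the thread does not name the path), and the helper names are hypothetical. The check only looks at `LD_LIBRARY_PATH`; the dynamic loader also consults its cache and default directories, which this sketch ignores.

```python
import os

# Assumption: where the driver libraries are mounted inside the container.
ASSUMED_NVIDIA_LIB_DIR = "/usr/local/nvidia/lib64"


def search_dirs(env):
    """Split LD_LIBRARY_PATH into directories; empty entries are ignored here."""
    value = env.get("LD_LIBRARY_PATH", "")
    return [entry for entry in value.split(":") if entry]


def nvidia_libs_on_path(env, lib_dir=ASSUMED_NVIDIA_LIB_DIR):
    """Return True if lib_dir is one of the LD_LIBRARY_PATH entries in env."""
    wanted = os.path.normpath(lib_dir)
    return any(os.path.normpath(entry) == wanted for entry in search_dirs(env))
```

Calling `nvidia_libs_on_path(dict(os.environ))` inside the container would return false for an image built on the stale base, since nothing sets the variable there.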

mikedanese · 2019-05-02T02:36:01Z

/approve

k8s-ci-robot · 2019-05-02T02:36:30Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: chardch, mikedanese

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~cluster/OWNERS~~ [mikedanese]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Fix a bug in the gpu device plugin where not all devices were registered. Refer to GoogleCloudPlatform/container-engine-accelerators#110

2c68133

k8s-ci-robot requested review from Katharine and mikedanese April 25, 2019 01:06

k8s-ci-robot added the sig/cluster-lifecycle label and removed the needs-sig label Apr 25, 2019

k8s-ci-robot requested a review from jiayingz April 25, 2019 17:45

k8s-ci-robot assigned jiayingz Apr 25, 2019

k8s-ci-robot added the lgtm label Apr 25, 2019

neolit123 reviewed Apr 26, 2019

k8s-ci-robot assigned mikedanese Apr 29, 2019

k8s-ci-robot added the approved label May 2, 2019

k8s-ci-robot merged commit 206eb91 into kubernetes:master May 2, 2019
