Fix a bug in the gpu device plugin #77035

Merged (1 commit) on May 2, 2019

Contributor

@chardch chardch commented Apr 25, 2019

What type of PR is this?

/kind bug

What this PR does / why we need it:
The GPU device plugin did not register GPU devices that were installed after the plugin server started. As a result, when the plugin server started before all GPU devices had finished installing, fewer GPUs were registered with the kubelet than were actually present.

This PR uses the updated GPU device plugin, which periodically checks for new GPU devices.
Refer to GoogleCloudPlatform/container-engine-accelerators#110
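
To illustrate what "periodically checks for new gpu devices" can look like, here is a minimal Go sketch. It is not the plugin's actual code: the `/dev` directory, the `nvidia[0-9]+` naming pattern, the 500 ms interval, the three-round limit, and the helper names `listGPUs` and `newDevices` are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"os"
	"regexp"
	"time"
)

// gpuDeviceRE matches device nodes such as nvidia0 or nvidia1.
// The naming pattern is an assumption for this sketch.
var gpuDeviceRE = regexp.MustCompile(`^nvidia[0-9]+$`)

// listGPUs returns the names of entries in devDir that look like GPU device nodes.
func listGPUs(devDir string) ([]string, error) {
	entries, err := os.ReadDir(devDir)
	if err != nil {
		return nil, err
	}
	var names []string
	for _, e := range entries {
		if gpuDeviceRE.MatchString(e.Name()) {
			names = append(names, e.Name())
		}
	}
	return names, nil
}

// newDevices returns the entries of found that are not yet in known,
// and records them in known so they are reported only once.
func newDevices(known map[string]bool, found []string) []string {
	var added []string
	for _, name := range found {
		if !known[name] {
			known[name] = true
			added = append(added, name)
		}
	}
	return added
}

func main() {
	known := map[string]bool{}
	ticker := time.NewTicker(500 * time.Millisecond) // hypothetical interval
	defer ticker.Stop()

	// Scan a few rounds only, so the sketch terminates.
	for round := 0; round < 3; round++ {
		found, err := listGPUs("/dev")
		if err != nil {
			fmt.Println("scan failed:", err)
		} else if added := newDevices(known, found); len(added) > 0 {
			// A real plugin would report these devices to the kubelet here.
			fmt.Println("would register new devices:", added)
		}
		<-ticker.C
	}
}
```

The point of the sketch is the repeated scan: a device that appears after the first pass is still picked up on a later pass, instead of being missed because discovery ran only once at startup.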

Does this PR introduce a user-facing change?:

NONE

@k8s-ci-robot k8s-ci-robot added labels on Apr 25, 2019:
  release-note-none: Denotes a PR that doesn't merit a release note.
  size/XS: Denotes a PR that changes 0-9 lines, ignoring generated files.
  cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
  needs-kind: Indicates a PR lacks a `kind/foo` label and requires one.
  needs-sig: Indicates an issue or PR lacks a `sig/foo` label and requires one.
  needs-priority: Indicates a PR lacks a `priority/foo` label and requires one.
  needs-ok-to-test: Indicates a PR that requires an org member to verify it is safe to test.
@k8s-ci-robot
Contributor

Hi @chardch. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot changed labels on Apr 25, 2019:
  added sig/cluster-lifecycle: Categorizes an issue or PR as relevant to SIG Cluster Lifecycle.
  removed needs-sig: Indicates an issue or PR lacks a `sig/foo` label and requires one.
@chardch
Contributor Author

chardch commented Apr 25, 2019

/cc @jiayingz

@jiayingz
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 25, 2019
Member

@neolit123 neolit123 left a comment

/kind bug
/ok-to-test
/priority backlog

/test pull-kubernetes-e2e-gce-device-plugin-gpu
@chardch if this fails again, possibly something is not right with the changed image.

@k8s-ci-robot k8s-ci-robot changed labels on Apr 26, 2019:
  added kind/bug: Categorizes issue or PR as related to a bug.
  added priority/backlog: Higher priority than priority/awaiting-more-evidence.
  added ok-to-test: Indicates a non-member PR verified by an org member that is safe to test.
  removed needs-ok-to-test: Indicates a PR that requires an org member to verify it is safe to test.
  removed needs-kind: Indicates a PR lacks a `kind/foo` label and requires one.
  removed needs-priority: Indicates a PR lacks a `priority/foo` label and requires one.
@chardch
Contributor Author

chardch commented Apr 26, 2019

Thanks, I'll check the image again

@chardch
Contributor Author

chardch commented Apr 26, 2019

/test pull-kubernetes-e2e-gce-device-plugin-gpu

@chardch
Contributor Author

chardch commented Apr 26, 2019

/test pull-kubernetes-e2e-gce-device-plugin-gpu

@chardch
Contributor Author

chardch commented Apr 26, 2019

@neolit123 It looks like the pod error was due to the nvidia driver version being insufficient for the new CUDA10 vector addition container. Thanks to @jiayingz for helping to diagnose the issue here.
[image: driver_version_pod_error]

@chardch
Contributor Author

chardch commented Apr 26, 2019

The device plugin image looks correct.
[image: device-plugin]

@chardch
Contributor Author

chardch commented Apr 29, 2019

/test pull-kubernetes-e2e-gce-device-plugin-gpu

@chardch
Contributor Author

chardch commented Apr 29, 2019

Been trying to debug this, but so far it looks like the driver version (410.79) and CUDA runtime version (10.0.130) from the failing container should be compatible.
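
For reference, the compatibility check being reasoned about here amounts to a component-wise version comparison. The Go sketch below is only an illustration: the helper names `parseVersion` and `atLeast` are made up for the example, and the minimum driver version of 410.48 for CUDA 10.0 on Linux is my reading of NVIDIA's compatibility table, so treat it as an assumption and verify it against the linked documentation.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion splits a dotted version such as "410.79" into its numeric components.
func parseVersion(v string) ([]int, error) {
	parts := strings.Split(v, ".")
	nums := make([]int, len(parts))
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return nil, fmt.Errorf("invalid version %q: %w", v, err)
		}
		nums[i] = n
	}
	return nums, nil
}

// atLeast reports whether version have is greater than or equal to version want.
// Components are compared left to right; missing components count as 0.
func atLeast(have, want string) (bool, error) {
	h, err := parseVersion(have)
	if err != nil {
		return false, err
	}
	w, err := parseVersion(want)
	if err != nil {
		return false, err
	}
	for i := 0; i < len(h) || i < len(w); i++ {
		var a, b int
		if i < len(h) {
			a = h[i]
		}
		if i < len(w) {
			b = w[i]
		}
		if a != b {
			return a > b, nil
		}
	}
	return true, nil
}

func main() {
	// Assumed minimum Linux driver for CUDA 10.0; check NVIDIA's table before relying on it.
	const minDriverForCUDA10 = "410.48"

	ok, err := atLeast("410.79", minDriverForCUDA10)
	if err != nil {
		fmt.Println("version check failed:", err)
		return
	}
	fmt.Println("driver sufficient:", ok)
}
```

Under that assumption, 410.79 passes the check, which is consistent with the observation that the versions themselves should not be the cause of the failure.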

@k8s-ci-robot
Contributor

@chardch: GitHub didn't allow me to assign the following users: Robert, Bailey.

Note that only kubernetes members and repo collaborators can be assigned and that issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign Robert Bailey

This PR only touches the device plugin image, which has been verified to work. The test currently fails because the Nvidia driver version (410.79) is reported as insufficient for the CUDA runtime version (10.0.130), even though the two should be a compatible pair, as listed here: https://docs.nvidia.com/deploy/cuda-compatibility/index.html#binary-compatibility

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@chardch
Contributor Author

chardch commented Apr 29, 2019

/assign mikedanese

Update: the test failure appears to be caused by the cuda10 container not having LD_LIBRARY_PATH set.

@chardch
Contributor Author

chardch commented May 1, 2019

/test pull-kubernetes-e2e-gce-device-plugin-gpu

@chardch
Contributor Author

chardch commented May 1, 2019

The issue was that the Nvidia base image used in cuda-vector-add:2.0 was stale and didn't include the update that set the LD_LIBRARY_PATH environment variable in the image.

As a result, the container using cuda-vector-add:2.0 could not find the Nvidia libraries, because this PR also includes GoogleCloudPlatform/container-engine-accelerators#111, which reverted setting LD_LIBRARY_PATH in response to the change in the base Nvidia image.
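
A quick way to spot this kind of failure is to check whether the container's environment lists the expected library directory in LD_LIBRARY_PATH. The Go sketch below is a hypothetical diagnostic, not part of this PR: the helper names `libraryDirs` and `hasLibraryDir` are invented for the example, and `/usr/local/nvidia/lib64` is an assumed install location that may differ in a given image.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// libraryDirs returns the directories listed in LD_LIBRARY_PATH,
// given an environment in "KEY=value" form (as returned by os.Environ).
func libraryDirs(env []string) []string {
	const key = "LD_LIBRARY_PATH="
	for _, kv := range env {
		if strings.HasPrefix(kv, key) {
			return filepath.SplitList(strings.TrimPrefix(kv, key))
		}
	}
	return nil
}

// hasLibraryDir reports whether dir appears in LD_LIBRARY_PATH within env.
func hasLibraryDir(env []string, dir string) bool {
	want := filepath.Clean(dir)
	for _, d := range libraryDirs(env) {
		// Skip empty entries produced by stray separators.
		if d != "" && filepath.Clean(d) == want {
			return true
		}
	}
	return false
}

func main() {
	// Assumed location of the Nvidia libraries; adjust for the image in question.
	const nvidiaLibDir = "/usr/local/nvidia/lib64"

	if hasLibraryDir(os.Environ(), nvidiaLibDir) {
		fmt.Println("LD_LIBRARY_PATH includes", nvidiaLibDir)
	} else {
		fmt.Println("LD_LIBRARY_PATH does not include", nvidiaLibDir)
	}
}
```

Run inside the failing container, a check like this would distinguish "libraries missing" from "libraries present but not on the search path", which is the situation described above.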

@mikedanese
Member

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: chardch, mikedanese

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 2, 2019
@k8s-ci-robot k8s-ci-robot merged commit 206eb91 into kubernetes:master May 2, 2019
Labels
  approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
  kind/bug: Categorizes issue or PR as related to a bug.
  lgtm: "Looks good to me", indicates that a PR is ready to be merged.
  ok-to-test: Indicates a non-member PR verified by an org member that is safe to test.
  priority/backlog: Higher priority than priority/awaiting-more-evidence.
  release-note-none: Denotes a PR that doesn't merit a release note.
  sig/cluster-lifecycle: Categorizes an issue or PR as relevant to SIG Cluster Lifecycle.
  size/XS: Denotes a PR that changes 0-9 lines, ignoring generated files.
5 participants