
nvidia-device-plugin breaking after latest Docker 19.03.9 release #9148

Closed
aychang95 opened this issue May 20, 2020 · 11 comments · Fixed by #10067

Comments

@aychang95

1. What kops version are you running?
v1.16.0

2. What Kubernetes version are you running?
v1.15.11

3. What cloud provider are you using?
aws

4. What commands did you run? What is the simplest way to reproduce this issue?
Just follow the "preferred" approach for the nvidia-device-plugin hook here:
https://github.com/kubernetes/kops/tree/master/hooks/nvidia-device-plugin
Generate the image with the Makefile, or use fifarzh's Docker registry image for CUDA 10.

5. What happened after the commands executed?
GPU nodes with the nvidia plugin hook fail with errors like

Warning  ContainerGCFailed        29s                    kubelet, ip-172-20-57-161.ec2.internal     rpc error: code = Unknown desc = Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

and the node endlessly loops with the Docker service not working (see the quick check commands at the end of this report).

6. What did you expect to happen?
p2 or p3 GPU nodes come up with NVIDIA drivers and CUDA 10 configured.

7. Please provide your cluster manifest.
This issue seems to be isolated to the hook, so I don't think my manifest is needed.

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

The verbose logs only show generic Docker API inaccessibility errors.

9. Anything else we need to know?
I did some digging, and it's most likely due to the latest Docker v19.03.9 release (May 18th, 2020).
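
For anyone else hitting this, you can confirm the daemon failure on an affected node with standard systemd commands, nothing kops-specific:

systemctl status docker                   # is dockerd running or crash-looping?
journalctl -u docker --no-pager -n 100    # recent daemon logs usually show why it failed to start
docker info                               # fails with the same "Cannot connect to the Docker daemon" error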

@aychang95
Author

aychang95 commented May 20, 2020

I addressed this issue by downgrading Docker from 19.03.9 to 19.03.8 on the GPU nodes, by editing the https://github.com/kubernetes/kops/blob/master/hooks/nvidia-device-plugin/image/files/02-nvidia-docker.sh file.
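
The change basically amounts to pinning the Docker package version instead of installing whatever "latest" resolves to. I'm not reproducing the exact script here, but as a rough sketch (assuming an apt-based install; the version string below is only an example, check apt-cache madison docker-ce for the right one on your distro):

DOCKER_VERSION="5:19.03.8~3-0~ubuntu-xenial"   # example pin, verify for your distro/release
apt-get update
apt-get install -y --allow-downgrades \
  docker-ce="${DOCKER_VERSION}" \
  docker-ce-cli="${DOCKER_VERSION}"
apt-mark hold docker-ce docker-ce-cli          # keep upgrades from pulling 19.03.9 back in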

You can try using the image I built and pushed:
aychang/nvidia-device-plugin:0.2.0-cuda10.0

Example in ig group:

spec:
  hooks:
  - execContainer:
      image: aychang/nvidia-device-plugin:0.2.0-cuda10.0

It takes a little longer, but GPU nodes now go back to being configured with NVIDIA drivers and CUDA 10.

@aychang95 changed the title from "nvidia-device-plugin breaking after latest 19.03.9 release" to "nvidia-device-plugin breaking after latest Docker 19.03.9 release" on May 21, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Aug 19, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Sep 18, 2020
@olemarkus
Member

/remove-lifecycle rotten

It may be that this hook should be removed. I'm not sure we really need it anymore given things like NVIDIA's GPU operator.

@k8s-ci-robot removed the lifecycle/rotten label on Sep 19, 2020
@aychang95
Author

@olemarkus
As someone who would love to move to an automated management system like the GPU operator, I just want to check whether this is aligned with the kops roadmap.

We've been using kops hooks to apply the device plugins for each GPU software component, which gets tedious and hard to manage pretty fast, especially with how quickly NVIDIA updates come out. I wouldn't mind this hook being removed in favor of easier alternatives.

@rifelpet
Member

rifelpet commented Oct 6, 2020

I'm not aware of any Kops maintainers using GPUs in their clusters, and Kops doesn't have automated e2e tests that cover GPU usage, so any changes to Kops' GPU support would need testing (or contribution) from the community. If there's a better solution like the GPU operator, we'd be happy to review any docs or code updates :)

@dalefwillis

@olemarkus are there any succinct details on setting up NVIDIA's gpu-operator with kops? In the past (pre-1.15) I have set up GPU support using the hooks approach described by @aychang95.

With a new cluster (1.16), and given some of the issue comments here, I spent the day fighting the good fight trying to set up the GPU operator on AWS, without success (if I need to, I'll open a separate issue). Not that I expect the kops community to be the ones required to maintain this kind of documentation for outside operators; I'd just appreciate any insights you can provide (my current problem is that the OS version seems to cause some issues).

@olemarkus
Member

I have not had the chance to look into NVIDIA's operator just yet, but GPU support is something that is interesting for my employer, so I suspect I'll be able to put together some form of support/documentation in the not-too-distant future.

I would at least use the Ubuntu AMIs if trying to use the operator. That will work on newer kops regardless of k8s version.
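
For example, point the GPU instance group at an Ubuntu image (the image name below is a placeholder; look up the current Canonical AMI for your region and release):

kops edit ig nodes-gpu   # "nodes-gpu" is whatever your GPU instance group is called
# then set spec.image to an Ubuntu image, along the lines of:
#   image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-<date>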

@dalefwillis

@olemarkus can you give a quick opinion on which versions of Ubuntu, kops, and k8s would be the best config? I'm running kops 1.16.0-alpha.2 with k8s v1.16.3. I tried Ubuntu AMIs yesterday; 16.04, 18.04, and 20.04 all gave different variations of problems. (It very well could be my kops version, but I have Istio running and I've been a bit scared of upgrading too much and having the popsicle sticks come tumbling down.)

@aychang95
Author

@dalefwillis I've been using nvidia.com/gpu resources by setting the gpu-operator up with kops 1.18.x and kubectl 1.17.x, with Ubuntu 18.04 nodes (all matching kernel versions). I have Istio running as well. What issues are you running into?

@rifelpet
Overall, using the gpu-operator with kops is very low-lift since it uses the operator framework. For getting the NVIDIA container runtime, device plugin, and drivers set up on GPU instances, it's definitely a lot less manual Kubernetes labor than the current kops hook and plugin approach.

I have an e2e setup using the gpu-operator in a kops-managed cluster, but I think the question is: would you want a guide in the kops documentation for something as independent as the gpu-operator? I could see it as an addon, like the ambassador addon, and it would be nice to have more visuals/materials on kops with GPUs.
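
For reference, the install itself is just a Helm chart; roughly something like this (a sketch only, check NVIDIA's gpu-operator docs for the current chart repo and values):

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia   # chart repo at the time of writing, verify against the docs
helm repo update
helm install --wait --generate-name nvidia/gpu-operator   # add --set overrides per the operator docs if you need non-defaults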

@rifelpet
Member

rifelpet commented Oct 7, 2020

@aychang95 I think if there's any kops-specific info that would be beneficial to someone installing the operator, then we could have a docs page for it; otherwise we just tell users to follow the official installation docs for the operator.
