
nvidia-device-plugin breaking after latest Docker 19.03.9 release #9148

Closed
aychang95 opened this issue May 20, 2020 · 11 comments · Fixed by #10067

Comments

@aychang95

1. What kops version are you running?
v1.16.0

2. What Kubernetes version are you running?
v1.15.11

3. What cloud provider are you using?
aws

4. What commands did you run? What is the simplest way to reproduce this issue?
Just follow the "preferred" approach for the nvidia-device-plugin hook here:
https://github.com/kubernetes/kops/tree/master/hooks/nvidia-device-plugin
Generate the image with the Makefile, or use fifarzh's Docker registry image for CUDA 10.

5. What happened after the commands executed?
GPU nodes with the nvidia plugin hook fail with errors like

Warning  ContainerGCFailed        29s                    kubelet, ip-172-20-57-161.ec2.internal     rpc error: code = Unknown desc = Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

and the node endlessly loops with the Docker service not working (see the quick check commands at the end of this report).

6. What did you expect to happen?
p2 or p3 GPU nodes come up with NVIDIA drivers and CUDA 10 configured.

7. Please provide your cluster manifest.
This issue seems to be isolated to the hook, so I don't think my manifest is needed.

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

The verbose logs only show generic Docker API inaccessibility errors.

9. Anything else we need to know?
I did some digging, and it's most likely due to the latest Docker v19.03.9 release (May 18th, 2020).
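
For anyone else hitting this, you can confirm the daemon failure on an affected node with standard systemd commands, nothing kops-specific:

systemctl status docker                   # is dockerd running or crash-looping?
journalctl -u docker --no-pager -n 100    # recent daemon logs usually show why it failed to start
docker info                               # fails with the same "Cannot connect to the Docker daemon" error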

@aychang95
Author

aychang95 commented May 20, 2020

I addressed this issue by downgrading Docker from 19.03.9 to 19.03.8 on the GPU nodes, by editing the https://github.com/kubernetes/kops/blob/master/hooks/nvidia-device-plugin/image/files/02-nvidia-docker.sh file.
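
The change basically amounts to pinning the Docker package version instead of installing whatever "latest" resolves to. I'm not reproducing the exact script here, but as a rough sketch (assuming an apt-based install; the version string below is only an example, check apt-cache madison docker-ce for the right one on your distro):

DOCKER_VERSION="5:19.03.8~3-0~ubuntu-xenial"   # example pin, verify for your distro/release
apt-get update
apt-get install -y --allow-downgrades \
  docker-ce="${DOCKER_VERSION}" \
  docker-ce-cli="${DOCKER_VERSION}"
apt-mark hold docker-ce docker-ce-cli          # keep upgrades from pulling 19.03.9 back in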

You can try using the image I built and pushed:
aychang/nvidia-device-plugin:0.2.0-cuda10.0

Example in ig group:

spec:
  hooks:
  - execContainer:
      image: aychang/nvidia-device-plugin:0.2.0-cuda10.0

It takes a little longer, but GPU nodes now go back to being configured with NVIDIA drivers and CUDA 10.

@aychang95 changed the title from "nvidia-device-plugin breaking after latest 19.03.9 release" to "nvidia-device-plugin breaking after latest Docker 19.03.9 release" on May 21, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Aug 19, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Sep 18, 2020
@olemarkus
Member

/remove-lifecycle rotten

It may be that this hook should be removed. I'm not sure we really need it anymore given things like NVIDIA's GPU operator.

@k8s-ci-robot removed the lifecycle/rotten label on Sep 19, 2020
@aychang95
Author

@olemarkus
As someone who would love to move to an automated management system like the GPU operator, I just want to check whether this is aligned with the kops roadmap.

We've been using kops hooks to apply the device plugins for each GPU software component, which gets tedious and hard to manage pretty fast, especially with how quickly NVIDIA updates come out. I wouldn't mind this hook being removed in favor of easier alternatives.

@rifelpet
Member

rifelpet commented Oct 6, 2020

I'm not aware of any Kops maintainers using GPUs in their clusters, and Kops doesn't have automated e2e tests that cover GPU usage, so any changes to Kops' GPU support would need testing (or contribution) from the community. If there's a better solution like the GPU operator, we'd be happy to review any docs or code updates :)

@dalefwillis

@olemarkus are there any succinct details on setting up NVIDIA's gpu-operator with kops? In the past (pre-1.15) I have set up GPU support using the hooks approach described by @aychang95.

With a new cluster (1.16), and given some of the issue comments here, I spent the day fighting the good fight trying to set up the GPU operator on AWS, without success (if I need to, I'll open a separate issue). Not that I expect the kops community to be the ones required to maintain this kind of documentation for outside operators; I'd just appreciate any insights you can provide (my current problem is that the OS version seems to cause some issues).

@olemarkus
Member

I have not had the chance to look into NVIDIA's operator just yet, but GPU support is something that is interesting for my employer, so I suspect I'll be able to put together some form of support/documentation in the not-too-distant future.

I would at least use the Ubuntu AMIs if trying to use the operator. That will work on newer kops regardless of k8s version.
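
For example, point the GPU instance group at an Ubuntu image (the image name below is a placeholder; look up the current Canonical AMI for your region and release):

kops edit ig nodes-gpu   # "nodes-gpu" is whatever your GPU instance group is called
# then set spec.image to an Ubuntu image, along the lines of:
#   image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-<date>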

@dalefwillis

@olemarkus can you give a quick opinion on which versions of Ubuntu, kops, and k8s would be the best config? I'm running kops 1.16.0-alpha.2 with k8s v1.16.3. I tried Ubuntu AMIs yesterday; 16.04, 18.04, and 20.04 all gave different variations of problems. (It very well could be my kops version, but I have Istio running and I've been a bit scared of upgrading too much and having the popsicle sticks come tumbling down.)

@aychang95
Author

@dalefwillis I've been using nvidia.com/gpu resources by setting the gpu-operator up with kops 1.18.x and kubectl 1.17.x, with Ubuntu 18.04 nodes (all matching kernel versions). I have Istio running as well. What issues are you running into?

@rifelpet
Overall, using the gpu-operator with kops is very low-lift since it uses the operator framework. For getting the NVIDIA container runtime, device plugin, and drivers set up on GPU instances, it's definitely a lot less manual Kubernetes labor than the current kops hook and plugin approach.

I have an e2e setup using the gpu-operator in a kops-managed cluster, but I think the question is: would you want a guide in the kops documentation for something as independent as the gpu-operator? I could see it as an addon, like the ambassador addon, and it would be nice to have more visuals/materials on kops with GPUs.
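
For reference, the install itself is just a Helm chart; roughly something like this (a sketch only, check NVIDIA's gpu-operator docs for the current chart repo and values):

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia   # chart repo at the time of writing, verify against the docs
helm repo update
helm install --wait --generate-name nvidia/gpu-operator   # add --set overrides per the operator docs if you need non-defaults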

@rifelpet
Member

rifelpet commented Oct 7, 2020

@aychang95 I think if there's any kops-specific info that would be beneficial to someone installing the operator, then we could have a docs page for it; otherwise we just tell users to follow the official installation docs for the operator.
