Running runsc with containerd and --nvproxy=true removes NVIDIA drivers from container in Kubernetes #9368
Comments
Adding the logs for the container:
Thanks for the very detailed report! Apologies for the delay. nvproxy is not supported with the k8s-device-plugin setup yet. We are currently focused on establishing support in GKE. GKE uses a different GPU+container stack; it does not use k8s-device-plugin. To summarize, nvproxy works in the following environments: [...]
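For context, one environment where nvproxy does work at this point is plain Docker with the NVIDIA container toolkit installed (the "docker mode" mentioned below). A typical invocation looks roughly like the sketch here, assuming runsc has been registered as a Docker runtime with --nvproxy=true among its runtimeArgs in /etc/docker/daemon.json; the exact set of flags needed can vary by gVisor version, so treat this as an illustration rather than the thread's setup:

```sh
# Run a CUDA image under the runsc runtime with the GPUs exposed by the
# NVIDIA container toolkit; nvidia-smi should see the devices via nvproxy.
docker run --rm --runtime=runsc --gpus=all \
  nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
```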
Thanks for the follow-up @ayushr2. In the meantime I've made some progress just by using [...].
I tried to also run this under [...].
Yeah, I don't think it will work just yet. In GKE, the container spec defines which GPUs to expose in [...]. My best guess is that k8s-device-plugin is creating bind mounts of [...] instead. In docker mode, the GPU devices are explicitly exposed in the container spec. In GKE, the device files are automatically created by gVisor because [...].
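To make "explicitly exposed in the container spec" concrete: the OCI runtime spec carries a `linux.devices` list, and in docker mode each GPU device node shows up there as an entry roughly like the sketch below. The major/minor numbers are placeholders (195 is the usual NVIDIA character-device major, but /dev/nvidia-uvm gets a dynamically allocated major on the host), so the real values depend on the machine. A plain bind mount of the same paths would not put anything in this list, which is presumably why the device-plugin approach does not reach nvproxy.

```json
{
  "linux": {
    "devices": [
      { "path": "/dev/nvidia0",    "type": "c", "major": 195, "minor": 0,   "fileMode": 438, "uid": 0, "gid": 0 },
      { "path": "/dev/nvidiactl",  "type": "c", "major": 195, "minor": 255, "fileMode": 438, "uid": 0, "gid": 0 },
      { "path": "/dev/nvidia-uvm", "type": "c", "major": 508, "minor": 0,   "fileMode": 438, "uid": 0, "gid": 0 }
    ]
  }
}
```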
Thanks for the detailed reply @ayushr2! Though I'm a bit out of my depth here, your guidance has been very helpful. I'm trying to better understand the differences for GKE; could you please point me to where the container spec/sandbox is defined? I'm not sure if it's possible to try to port that configuration over to Amazon Linux or if I should just try to add the feature directly to the gVisor code you pointed me to.
I've tried very naively to add the following snippet to [...]
in order to try to mount the devices during runtime, but it seems like even this isn't enough.
You probably also want [...]. Also note that the minor number of /dev/nvidia-uvm is different inside the sandbox, so just copying it from the host won't work.
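For anyone comparing the two sides, the device numbers are straightforward to inspect; the same commands run inside the sandbox (if the nodes exist there) will show gVisor's own numbering instead of the host's. This is a generic inspection sketch, not something taken from the thread:

```sh
# On the host: list NVIDIA device nodes with their major,minor numbers.
ls -l /dev/nvidia*

# Print "major:minor" (in hex) for the UVM device specifically.
stat -c '%t:%T %n' /dev/nvidia-uvm

# The dynamically allocated nvidia-uvm major can also be read from the kernel.
grep nvidia /proc/devices
```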
Yeah, from reading the code and looking at the logs it seems like gVisor automatically assigns a minor number to the device. Unfortunately your suggestion still didn't work. I'll leave the logs for the container here in case you (or anyone who comes across this issue) want to use them for debugging (note that I had already added a [...]).
Got it, thanks for working with me on this. Just to set expectations, adding support for k8s-device-plugin is currently not on our roadmap. We are focused on maturing GPU support in GKE first. OSS contributions for GPU support in additional environments are appreciated in the meantime!
No worries! In the meantime, we don't have a strict requirement for having NVIDIA working with gVisor, so we can work around it. I'd love to help bring in this feature, but I would still need to get more familiar with gVisor first. I'll help in any way I can!
A friendly reminder that this issue had no activity for 120 days.
@PedroRibeiro95 Have you done any new research on this? I was doing some research on this and it looks like it should work with the following configuration: k8s-device-plugin does have a config option called [...]. I never tested anything; everything mentioned above is just a pure guess from me, but let me know whether my reasoning makes sense.
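The specific option the commenter had in mind isn't preserved above. As a hedged illustration only: the NVIDIA k8s-device-plugin takes a config file with plugin flags such as `passDeviceSpecs` (whether the plugin returns explicit device specs in its Allocate response, so the runtime creates the device nodes rather than relying on the nvidia-container-runtime hook) and `deviceListStrategy` (how allocated devices are communicated to the runtime). A configuration along those lines might look like the sketch below; the schema and values should be verified against the plugin release in use, since these names are an assumption rather than something confirmed in this thread.

```yaml
# Assumed sketch of an NVIDIA k8s-device-plugin config file; verify the
# schema against the plugin version you run before relying on it.
version: v1
flags:
  migStrategy: none
  failOnInitError: true
  plugin:
    # Return explicit device specs (path/major/minor) to the kubelet so the
    # runtime adds the devices to the OCI spec instead of bind-mounting them.
    passDeviceSpecs: true
    deviceListStrategy: envvar
    deviceIDStrategy: uuid
```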
Hey @sfc-gh-hyu, thanks for the detailed instructions. I haven't revisited this in the meantime as other priorities came up, but I will be testing it again very soon. I will try to follow what you suggested and I will report back with more details.
A friendly reminder that this issue had no activity for 120 days.
Updating this issue to report that we managed to get it working by using [...].
Description
Hello. I'm trying to get gVisor to work with NVIDIA drivers in Kubernetes, using the regular AWS EKS Amazon Linux 2 AMI (not the GPU one). I can confirm that both work separately; however, I'm having a lot of trouble getting gVisor to work with the NVIDIA drivers. When I try to run the nvidia/cuda image using the gVisor runtime class, I can see that the environment variables are correctly set, but the nvidia-smi binary is gone. These are all the files I'm using:

config.toml
runsc.toml
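The attached config.toml and runsc.toml contents are not reproduced here. As a rough sketch of what a containerd + runsc setup with nvproxy typically involves (section names and paths follow gVisor's containerd shim conventions, but treat them as assumptions rather than the reporter's actual files):

```toml
# /etc/containerd/config.toml (excerpt): register a runsc runtime handler
# and point its shim at a separate runsc configuration file.
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runsc]
  runtime_type = "io.containerd.runsc.v1"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runsc.options]
  TypeUrl = "io.containerd.runsc.v1.options"
  ConfigPath = "/etc/containerd/runsc.toml"
```

```toml
# /etc/containerd/runsc.toml: flags passed to runsc by the shim,
# including the nvproxy flag from the issue title.
[runsc_config]
  nvproxy = "true"
  debug = "true"
  debug-log = "/var/log/runsc/%ID%/"
```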
Test pod:
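The original manifest isn't shown above. A minimal pod of the shape being described, assuming a runtime class named gvisor for runsc and the NVIDIA device plugin advertising nvidia.com/gpu, would be something like this (name and image tag are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-test              # hypothetical name
spec:
  runtimeClassName: gvisor     # assumed runtime class name for runsc
  containers:
    - name: cuda
      image: nvidia/cuda:11.8.0-base-ubuntu22.04   # assumed tag
      command: ["sh", "-c", "nvidia-smi && sleep 3600"]
      resources:
        limits:
          nvidia.com/gpu: 1
```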
Execing into the pod:
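The exec output itself isn't captured above; the check being described is along these lines (pod name hypothetical), with the report being that the environment variables look right but nvidia-smi is simply not present in the container under runsc:

```sh
kubectl exec cuda-test -- env | grep -i nvidia   # env vars are set correctly
kubectl exec cuda-test -- nvidia-smi             # reported missing under runsc
```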
I have the NVIDIA Plugin DaemonSet running using the nvidia runtime class.

Steps to reproduce
runsc version
uname
Linux ip-10-253-32-249.ec2.internal 5.10.186-179.751.amzn2.x86_64 #1 SMP Tue Aug 1 20:51:38 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
kubectl (if using Kubernetes)