The GPU cannot be mounted correctly #26
Comments
I see you are using k8s 1.23. GPUMounter has a known issue in k8s v1.20+, refer to #19.
OK, I've seen it. Do you have any plans to fix this soon?
You can use @cool9203's branch at cool9203@5ca4e5c in k8s v1.20+.
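For anyone following along, a minimal sketch of how such a fork might be built and deployed. The repository URL, image tag, and manifest paths below are assumptions, not confirmed by this thread:

```sh
# Hypothetical sketch: build an image from the patched fork (commit from
# the comment above) and point the deploy manifests at it.
git clone https://github.com/cool9203/GPUMounter.git
cd GPUMounter
git checkout 5ca4e5c
docker build -t gpu-mounter:cool9203-5ca4e5c .
# Then update the image field in the deploy manifests and re-apply, e.g.:
# sed -i 's|image: .*gpu-mounter.*|image: gpu-mounter:cool9203-5ca4e5c|' deploy/*.yaml
# kubectl apply -f deploy/
```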
OK, I can allocate the GPU now, but there are a few errors during execution. Output from my application's docker container is included below.

Work log:
2024-05-13T09:41:37.399Z DEBUG collector/collector.go:130 GPU: /dev/nvidia4 allocated to Pod: gpu-test-7c9d8f59-jgwwx-slave-pod-d71533 in Namespace gpu-pool
2024-05-13T09:41:37.399Z INFO collector/collector.go:136 GPU status update successfully
2024-05-13T09:41:37.399Z INFO gpu-mount/server.go:81 Start mounting, Total: 1 Current: 1
2024-05-13T09:41:37.399Z INFO util/util.go:19 Start mount GPU: {"MinorNumber":4,"DeviceFilePath":"/dev/nvidia4","UUID":"GPU-a9f53ecd-233a-01d6-12e3-7f63bcd0054d","State":"GPU_ALLOCATED_STATE","PodName":"gpu-test-7c9d8f59-jgwwx-slave-pod-d71533","Namespace":"gpu-pool"} to Pod: gpu-test-7c9d8f59-jgwwx
2024-05-13T09:41:37.399Z INFO util/util.go:24 Pod :gpu-test-7c9d8f59-jgwwx container ID: e12e13eb92c5cfbff030a8cb34c0d892b76ed51fc4f3ee5b7b82ce81e0ead82b
2024-05-13T09:41:37.399Z INFO util/util.go:35 Successfully get cgroup path: /kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podbc3a50ad_e861_4af1_a259_551397fc0253.slice/docker-e12e13eb92c5cfbff030a8cb34c0d892b76ed51fc4f3ee5b7b82ce81e0ead82b.scope for Pod: gpu-test-7c9d8f59-jgwwx
2024-05-13T09:41:37.400Z INFO util/util.go:41 Successfully add GPU: {"MinorNumber":4,"DeviceFilePath":"/dev/nvidia4","UUID":"GPU-a9f53ecd-233a-01d6-12e3-7f63bcd0054d","State":"GPU_ALLOCATED_STATE","PodName":"gpu-test-7c9d8f59-jgwwx-slave-pod-d71533","Namespace":"gpu-pool"} permisssion for Pod: gpu-test-7c9d8f59-jgwwx
2024-05-13T09:41:37.401Z INFO util/util.go:57 Successfully get PID: 44365 of Pod: gpu-test-7c9d8f59-jgwwx Container: e12e13eb92c5cfbff030a8cb34c0d892b76ed51fc4f3ee5b7b82ce81e0ead82b
2024-05-13T09:41:37.405Z ERROR namespace/namespace.go:171 Failed to execute cmd: mknod -m 666 /dev/nvidia4 c 195 4
2024-05-13T09:41:37.405Z ERROR namespace/namespace.go:172 Std Output:
2024-05-13T09:41:37.405Z ERROR namespace/namespace.go:173 Err Output: mknod: cannot set permissions of '/dev/nvidia4': Operation not supported
2024-05-13T09:41:37.405Z ERROR util/util.go:65 Failed to create device file in Target PID Namespace: 44365 Pod: gpu-test-7c9d8f59-jgwwx Namespace: gpu-pool
2024-05-13T09:41:37.405Z ERROR gpu-mount/server.go:84 Mount GPU: {"MinorNumber":4,"DeviceFilePath":"/dev/nvidia4","UUID":"GPU-a9f53ecd-233a-01d6-12e3-7f63bcd0054d","State":"GPU_ALLOCATED_STATE","PodName":"gpu-test-7c9d8f59-jgwwx-slave-pod-d71533","Namespace":"gpu-pool"} to Pod: gpu-test-7c9d8f59-jgwwx in Namespace: gpu-pool failed
2024-05-13T09:41:37.405Z ERROR gpu-mount/server.go:85 Error while executing command: exit status 1

Inside the container:
(base) root@gpu-test-7c9d8f59-jgwwx:~# nvidia-smi -L
GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-419ba880-653e-81b2-6994-d388621d2168)
GPU 1: NVIDIA GeForce RTX 3090 (UUID: GPU-a9f53ecd-233a-01d6-12e3-7f63bcd0054d)
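For reference, the failing step in the log can be reproduced by hand from the node, using the container PID that GPUMounter logged (44365) and the device numbers from the mknod command. This is a diagnostic sketch only, assuming nsenter is available on the node:

```sh
# Enter the Pod's mount namespace (PID 44365 from the log above) and
# retry the device-node creation that GPUMounter attempted.
nsenter -t 44365 -m -- mknod -m 666 /dev/nvidia4 c 195 4

# If "cannot set permissions" persists, create the node and set the mode
# in two separate steps to see which operation actually fails:
nsenter -t 44365 -m -- mknod /dev/nvidia4 c 195 4
nsenter -t 44365 -m -- chmod 666 /dev/nvidia4
```

Since switching the container image resolved the error (see the next comment), the failure is likely specific to the filesystem backing /dev in that image rather than to GPUMounter itself.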
I switched to a different image and it worked. Thanks!
Problem description
After GPUMounter is deployed, the service starts normally. I created a test Pod to check whether the GPU was allocated correctly. Following the documentation, the log shows that the GPU has been assigned to the Pod, but when I exec into the Pod, there is still no GPU.
Execution process
root@pt15:~/app/GPUMounter-master/deploy# kubectl get pod
NAME                       READY   STATUS    RESTARTS   AGE
dynamic-7db67d5cf5-qdg4p   1/1     Running   0          60m
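A quick way to double-check the symptom from outside, using the Pod name from the output above (this assumes the Pod image ships nvidia-smi):

```sh
# List the GPUs visible inside the test Pod.
kubectl exec dynamic-7db67d5cf5-qdg4p -- nvidia-smi -L

# Check whether the NVIDIA device files were actually created in the Pod.
kubectl exec dynamic-7db67d5cf5-qdg4p -- ls -l /dev/ | grep nvidia
```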
Environment
Log