The GPU cannot be mounted correctly #26

Closed
bilbilmyc opened this issue May 10, 2024 · 5 comments

@bilbilmyc

Problem description

After GPUMounter is deployed, the service starts normally. I created a test Pod to check whether the GPU is allocated correctly. Following the documentation, the log shows that the GPU has been assigned to the Pod, but when I exec into the Pod there is still no GPU.

Execution process

root@pt15:~/app/GPUMounter-master/deploy# kubectl get pod
NAME                        READY   STATUS    RESTARTS   AGE
dynamic-7db67d5cf5-qdg4p    1/1     Running   0          60m
root@pt15:~/app/GPUMounter-master/deploy# curl -X GET 'http://12.2.100.15:32688/addgpu/namespace/default/pod/dynamic-7db67d5cf5-qdg4p/gpu/2/isEntireMount/true'
Add GPU Success
root@pt15:~/app/GPUMounter-master/deploy# kubectl exec -it dynamic-7db67d5cf5-qdg4p -- nvidia-smi -L
No devices found.
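
A quick sanity check that could narrow this down (commands are illustrative, Pod name taken from the output above): list the device files inside the Pod and look for the slave Pod that GPUMounter creates.

kubectl exec -it dynamic-7db67d5cf5-qdg4p -- ls -l /dev/nvidia*
kubectl get pod -A | grep slave-pod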

Environment

  1. k8s 1.23.17
  2. docker 20

Log

  1. This is the worker log

The log shows that the GPU has been allocated correctly:

2024-05-10T07:41:59.383Z	INFO	gpu-mount/server.go:35	AddGPU Service Called
2024-05-10T07:41:59.383Z	INFO	gpu-mount/server.go:36	request: pod_name:"dynamic-7db67d5cf5-qdg4p" namespace:"default" gpu_num:2 is_entire_mount:true
2024-05-10T07:41:59.399Z	INFO	gpu-mount/server.go:55	Successfully get Pod: default in cluster
2024-05-10T07:41:59.399Z	INFO	allocator/allocator.go:159	Get pod default/dynamic-7db67d5cf5-qdg4p mount type
2024-05-10T07:41:59.399Z	INFO	collector/collector.go:91	Updating GPU status
2024-05-10T07:41:59.400Z	INFO	collector/collector.go:136	GPU status update successfully
2024-05-10T07:41:59.409Z	INFO	allocator/allocator.go:59	Creating GPU Slave Pod: dynamic-7db67d5cf5-qdg4p-slave-pod-047a26 for Owner Pod: dynamic-7db67d5cf5-qdg4p
2024-05-10T07:41:59.409Z	INFO	allocator/allocator.go:238	Checking Pods: dynamic-7db67d5cf5-qdg4p-slave-pod-047a26 state
2024-05-10T07:41:59.412Z	INFO	allocator/allocator.go:264	Pod: dynamic-7db67d5cf5-qdg4p-slave-pod-047a26 creating
2024-05-10T07:42:04.187Z	INFO	allocator/allocator.go:252	Not Found....
2024-05-10T07:42:04.187Z	INFO	allocator/allocator.go:277	Pods: dynamic-7db67d5cf5-qdg4p-slave-pod-047a26 are running
2024-05-10T07:42:04.187Z	INFO	allocator/allocator.go:84	Successfully create Slave Pod: dynamic-7db67d5cf5-qdg4p-slave-pod-047a26, for Owner Pod: dynamic-7db67d5cf5-qdg4p
2024-05-10T07:42:04.187Z	INFO	collector/collector.go:91	Updating GPU status
2024-05-10T07:42:04.188Z	INFO	collector/collector.go:136	GPU status update successfully
2024-05-10T07:42:04.188Z	INFO	gpu-mount/server.go:97	Successfully mount all GPU to Pod: dynamic-7db67d5cf5-qdg4p in Namespace: default
@pokerfaceSad
Owner

I see you are using k8s 1.23. GPUMounter has a known issue on k8s v1.20+; refer to #19.

@bilbilmyc
Author

OK, I have seen it. May I ask whether you have any plans to fix this soon?

@pokerfaceSad
Owner

You can use @cool9203's branch at cool9203@5ca4e5c on k8s v1.20+.
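
A rough sketch of one way to run that branch, assuming a Dockerfile at the repo root and a worker DaemonSet named gpu-mounter-worker in kube-system (both names are guesses; adjust to your deploy/ manifests):

git clone https://github.com/cool9203/GPUMounter.git && cd GPUMounter
git checkout 5ca4e5c
docker build -t gpu-mounter-worker:k8s-1.20-fix .
kubectl -n kube-system set image daemonset/gpu-mounter-worker gpu-mounter-worker=gpu-mounter-worker:k8s-1.20-fix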

@bilbilmyc
Author

OK, I can allocate GPUs now. However, there is a small error during the process. My application's Docker container does include the mknod command, but when a GPU is allocated the log reports an error that this command cannot be found. Still, when I enter the container I can see the GPU card normally. I checked the FAQ and confirmed that my container contains the mknod command.

Worker log

2024-05-13T09:41:37.399Z	DEBUG	collector/collector.go:130	GPU: /dev/nvidia4 allocated to Pod: gpu-test-7c9d8f59-jgwwx-slave-pod-d71533 in Namespace gpu-pool
2024-05-13T09:41:37.399Z	INFO	collector/collector.go:136	GPU status update successfully
2024-05-13T09:41:37.399Z	INFO	gpu-mount/server.go:81	Start mounting, Total: 1 Current: 1
2024-05-13T09:41:37.399Z	INFO	util/util.go:19	Start mount GPU: {"MinorNumber":4,"DeviceFilePath":"/dev/nvidia4","UUID":"GPU-a9f53ecd-233a-01d6-12e3-7f63bcd0054d","State":"GPU_ALLOCATED_STATE","PodName":"gpu-test-7c9d8f59-jgwwx-slave-pod-d71533","Namespace":"gpu-pool"} to Pod: gpu-test-7c9d8f59-jgwwx
2024-05-13T09:41:37.399Z	INFO	util/util.go:24	Pod :gpu-test-7c9d8f59-jgwwx container ID: e12e13eb92c5cfbff030a8cb34c0d892b76ed51fc4f3ee5b7b82ce81e0ead82b
2024-05-13T09:41:37.399Z	INFO	util/util.go:35	Successfully get cgroup path: /kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podbc3a50ad_e861_4af1_a259_551397fc0253.slice/docker-e12e13eb92c5cfbff030a8cb34c0d892b76ed51fc4f3ee5b7b82ce81e0ead82b.scope for Pod: gpu-test-7c9d8f59-jgwwx
2024-05-13T09:41:37.400Z	INFO	util/util.go:41	Successfully add GPU: {"MinorNumber":4,"DeviceFilePath":"/dev/nvidia4","UUID":"GPU-a9f53ecd-233a-01d6-12e3-7f63bcd0054d","State":"GPU_ALLOCATED_STATE","PodName":"gpu-test-7c9d8f59-jgwwx-slave-pod-d71533","Namespace":"gpu-pool"} permisssion for Pod: gpu-test-7c9d8f59-jgwwx
2024-05-13T09:41:37.401Z	INFO	util/util.go:57	Successfully get PID: 44365 of Pod: gpu-test-7c9d8f59-jgwwx Container: e12e13eb92c5cfbff030a8cb34c0d892b76ed51fc4f3ee5b7b82ce81e0ead82b
2024-05-13T09:41:37.405Z	ERROR	namespace/namespace.go:171	Failed to execute cmd: mknod -m 666 /dev/nvidia4 c 195 4
2024-05-13T09:41:37.405Z	ERROR	namespace/namespace.go:172	Std Output:
2024-05-13T09:41:37.405Z	ERROR	namespace/namespace.go:173	Err Output: mknod: cannot set permissions of '/dev/nvidia4': Operation not supported

2024-05-13T09:41:37.405Z	ERROR	util/util.go:65	Failed to create device file in Target PID Namespace: 44365 Pod: gpu-test-7c9d8f59-jgwwx Namespace: gpu-pool
2024-05-13T09:41:37.405Z	ERROR	gpu-mount/server.go:84	Mount GPU: {"MinorNumber":4,"DeviceFilePath":"/dev/nvidia4","UUID":"GPU-a9f53ecd-233a-01d6-12e3-7f63bcd0054d","State":"GPU_ALLOCATED_STATE","PodName":"gpu-test-7c9d8f59-jgwwx-slave-pod-d71533","Namespace":"gpu-pool"} to Pod: gpu-test-7c9d8f59-jgwwx in Namespace: gpu-pool failed
2024-05-13T09:41:37.405Z	ERROR	gpu-mount/server.go:85	Error while executing command: exit status 1
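
A quick check that could help here is whether the device node was created despite the error and which mknod the image ships (Pod name and namespace taken from the log above; purely illustrative):

kubectl -n gpu-pool exec -it gpu-test-7c9d8f59-jgwwx -- ls -l /dev/nvidia4
kubectl -n gpu-pool exec -it gpu-test-7c9d8f59-jgwwx -- which mknod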

Inside the container

(base) root@gpu-test-7c9d8f59-jgwwx:~# nvidia-smi  -L
GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-419ba880-653e-81b2-6994-d388621d2168)
GPU 1: NVIDIA GeForce RTX 3090 (UUID: GPU-a9f53ecd-233a-01d6-12e3-7f63bcd0054d)


@bilbilmyc
Author

I switched to a different image and it worked. Thanks a lot.
