
Cannot use GPUMounter on k8s #20

Open
Crazybean-lwb opened this issue Feb 18, 2022 · 6 comments

Crazybean-lwb commented Feb 18, 2022

environment:

  • k8s 1.16.15
  • docker 20.10.10

problem: following QuickStart.md, I installed GPUMounter successfully in my k8s cluster. However, requests to remove or add a GPU never succeed.

I pasted some logs from the gpu-mounter-master container:

remove gpu
2022-02-18T03:44:55.184Z INFO GPUMounter-master/main.go:120 access remove gpu service
2022-02-18T03:44:55.184Z INFO GPUMounter-master/main.go:134 GPU-5d237016-9ea5-77bd-8c2f-2b3fd4bfa2cd
2022-02-18T03:44:55.184Z INFO GPUMounter-master/main.go:135 GPU-5d237016-9ea5-77bd-8c2f-2b3fd4bfa2cd
2022-02-18T03:44:55.184Z INFO GPUMounter-master/main.go:146 Pod: jupyter-lab-54d76f5d58-rlklh Namespace: default UUIDs: GPU-5d237016-9ea5-77bd-8c2f-2b3fd4bfa2cd force: true
2022-02-18T03:44:55.188Z INFO GPUMounter-master/main.go:169 Found Pod: jupyter-lab-54d76f5d58-rlklh in Namespace: default on Node: dev06.ucd.qzm.stonewise.cn
2022-02-18T03:44:55.193Z INFO GPUMounter-master/main.go:265 Worker: gpu-mounter-workers-fbfj8 Node: dev05.ucd.qzm.stonewise.cn
2022-02-18T03:44:55.193Z INFO GPUMounter-master/main.go:265 Worker: gpu-mounter-workers-kwmsn Node: dev06.ucd.qzm.stonewise.cn
2022-02-18T03:44:55.201Z ERROR GPUMounter-master/main.go:217 Invalid UUIDs: GPU-5d237016-9ea5-77bd-8c2f-2b3fd4bfa2cd

add gpu
2022-02-18T03:42:22.897Z INFO GPUMounter-master/main.go:25 access add gpu service
2022-02-18T03:42:22.898Z INFO GPUMounter-master/main.go:30 Pod: jupyter-lab-54d76f5d58-rlklh Namespace: default GPU Num: 4 Is entire mount: false
2022-02-18T03:42:22.902Z INFO GPUMounter-master/main.go:66 Found Pod: jupyter-lab-54d76f5d58-rlklh in Namespace: default on Node: dev06.ucd.qzm.stonewise.cn
2022-02-18T03:42:22.907Z INFO GPUMounter-master/main.go:265 Worker: gpu-mounter-workers-fbfj8 Node: dev05.ucd.qzm.stonewise.cn
2022-02-18T03:42:22.907Z INFO GPUMounter-master/main.go:265 Worker: gpu-mounter-workers-kwmsn Node: dev06.ucd.qzm.stonewise.cn
2022-02-18T03:42:22.921Z ERROR GPUMounter-master/main.go:98 Failed to call add gpu service
2022-02-18T03:42:22.921Z ERROR GPUMounter-master/main.go:99 rpc error: code = Unknown desc = FailedCreated
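
One way to cross-check the "Invalid UUIDs" error above is to compare the UUID passed to the remove request with the GPU UUIDs that the pod and the node actually report. This is only a generic sketch (the pod name and node are taken from the logs above); whether a UUID mismatch is the actual cause here is not confirmed in this thread:

# GPU UUIDs visible inside the target pod
kubectl exec -n default jupyter-lab-54d76f5d58-rlklh -- nvidia-smi -L

# GPU UUIDs on the node itself, for comparison (run on dev06.ucd.qzm.stonewise.cn)
nvidia-smi -L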

Crazybean-lwb (Author)

I do not have a slave pod in my gpu-pool namespace.
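
A minimal way to verify this (assuming the slave pods are created in the gpu-pool namespace as described in QuickStart.md, and that they follow the <owner-pod>-slave-pod-<suffix> naming seen in the worker logs later in this thread):

# list any GPUMounter slave pods; a successful mount is expected to leave a slave pod here
kubectl get pods -n gpu-pool -o wide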

pokerfaceSad (Owner)

@liuweibin6566396837
Thanks for your issue.
Please show more relevant logs from gpu-mounter-worker (/etc/GPUMounter/log/GPUMounter-worker.log).

pokerfaceSad (Owner) commented Feb 18, 2022

It seems that you edited the k8s version in this issue.
What is your k8s version?
In the current version, GPUMounter has a known bug on k8s v1.20+, mentioned in #19 (comment).

Crazybean-lwb (Author)

> It seems that you edited the k8s version in this issue. What is your k8s version? In the current version, GPUMounter has a known bug on k8s v1.20+, mentioned in #19 (comment).

Thanks for your reply. I had already fixed the problem earlier; just make sure the container spec sets

env:
  - name: NVIDIA_VISIBLE_DEVICES
    value: "none"

and the problem will be solved.
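
For context, a minimal sketch of where this env entry sits in a container spec. Everything other than the env entry is a placeholder (the pod name, container name, and image are not from this issue), and where exactly the variable needs to be set may depend on your deployment:

apiVersion: v1
kind: Pod
metadata:
  name: jupyter-lab            # placeholder pod name
spec:
  containers:
    - name: notebook           # placeholder container name
      image: your-image:tag    # placeholder image
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "none"        # the setting from the comment above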

Crazybean-lwb (Author)

> It seems that you edited the k8s version in this issue. What is your k8s version? In the current version, GPUMounter has a known bug on k8s v1.20+, mentioned in #19 (comment).

Now I have met a new bug in a cluster with:

  • k8s 1.20.11
  • docker 20.10.10

bug: when I request addgpu, it returns "Add GPU Success", however there is no slave pod in gpu-pool.
I found some unusual logs in the worker's pod; some of them follow:

2022-03-15T13:12:43.240Z INFO collector/collector.go:136 GPU status update successfully
2022-03-15T13:12:46.402Z INFO allocator/allocator.go:59 Creating GPU Slave Pod: base-0-slave-pod-595282 for Owner Pod: base-0
2022-03-15T13:12:46.403Z INFO allocator/allocator.go:238 Checking Pods: base-0-slave-pod-595282 state
2022-03-15T13:12:50.450Z INFO allocator/allocator.go:252 Not Found....
2022-03-15T13:12:50.450Z INFO allocator/allocator.go:277 Pods: base-0-slave-pod-595282 are running
2022-03-15T13:12:50.450Z INFO allocator/allocator.go:84 Successfully create Slave Pod: base-0-slave-pod-595282, for Owner Pod: base-0
2022-03-15T13:12:50.450Z INFO collector/collector.go:91 Updating GPU status
2022-03-15T13:12:50.452Z DEBUG collector/collector.go:130 GPU: /dev/nvidia0 allocated to Pod: xiaoxuan-fbdd-0 in Namespace shixiaoxuan
2022-03-15T13:12:50.452Z DEBUG collector/collector.go:130 GPU: /dev/nvidia1 allocated to Pod: zwbgpu-pytorch-1-6-0 in Namespace zhouwenbiao
2022-03-15T13:12:50.452Z DEBUG collector/collector.go:130 GPU: /dev/nvidia7 allocated to Pod: admet-predict-0 in Namespace liqinze
2022-03-15T13:12:50.452Z DEBUG collector/collector.go:130 GPU: /dev/nvidia5 allocated to Pod: xiaoxuan-test-d2m-0 in Namespace shixiaoxuan
2022-03-15T13:12:50.452Z DEBUG collector/collector.go:130 GPU: /dev/nvidia2 allocated to Pod: bf-dev-2-0 in Namespace baifang
2022-03-15T13:12:50.452Z DEBUG collector/collector.go:130 GPU: /dev/nvidia3 allocated to Pod: bf-dev-2-0 in Namespace baifang
2022-03-15T13:12:50.452Z DEBUG collector/collector.go:130 GPU: /dev/nvidia6 allocated to Pod: minisomdimsanbai-0 in Namespace yangdeai
2022-03-15T13:12:50.452Z INFO collector/collector.go:136 GPU status update successfully
2022-03-15T13:12:50.452Z INFO gpu-mount/server.go:97 Successfully mount all GPU to Pod: base-0 in Namespace: liuweibin
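
To see what happened to a slave pod that the allocator reports as running, the events recorded for it can be inspected (the pod name is taken from the logs above; the gpu-pool namespace is assumed, as earlier in this thread):

# check whether the slave pod still exists
kubectl get pods -n gpu-pool | grep slave-pod

# inspect the events recorded for that specific pod
kubectl get events -n gpu-pool --field-selector involvedObject.name=base-0-slave-pod-595282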

pokerfaceSad (Owner)

Thanks for your report.
It seems that you have hit the unfixed issue mentioned in #19.
GPUMounter cannot work well on k8s v1.20+ in the current version.
