Hello, I am elihe from Zhihu. I read your article on Zhihu, and after reading the code I have a question:
Why is each slave pod bound to only one GPU in the GetAvailableGPU method of pkg/util/gpu/allocator/allocator.go?
As I see it, in a large-scale cluster this adds load on the master node (many more pod creation requests). Worse, creating multiple single-card pods lets two competing GPU mount requests both fail: with 4 available GPUs and two requests for 4 cards each, one request may create slave pods 1 and 2 while the other creates slave pods 3 and 4, and then neither can obtain any more resources. An all-or-nothing reservation would avoid this, as the sketch below illustrates.
If you agree, may I submit a merge request to optimize this?
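To make the failure mode concrete, here is a minimal Go sketch of the all-or-nothing reservation I have in mind. This is not the project's real GetAvailableGPU; the Allocator type and Reserve method are hypothetical, and real state would live in the cluster rather than in memory:

```go
// All-or-nothing GPU reservation: a request either gets every GPU it
// asked for in one atomic step, or gets none, so two competing 4-card
// requests can never each hold 2 cards and starve each other.
package main

import (
	"fmt"
	"sync"
)

// Allocator is a hypothetical in-memory stand-in for the allocator.
type Allocator struct {
	mu   sync.Mutex
	free map[int]bool // GPU index -> available
}

func NewAllocator(n int) *Allocator {
	free := make(map[int]bool, n)
	for i := 0; i < n; i++ {
		free[i] = true
	}
	return &Allocator{free: free}
}

// Reserve atomically claims count GPUs, or fails without claiming any.
func (a *Allocator) Reserve(count int) ([]int, error) {
	a.mu.Lock()
	defer a.mu.Unlock()

	var picked []int
	for id, ok := range a.free {
		if ok {
			picked = append(picked, id)
			if len(picked) == count {
				break
			}
		}
	}
	if len(picked) < count {
		// Not enough GPUs: claim nothing, so a competing request can still win.
		return nil, fmt.Errorf("need %d GPUs, only %d free", count, len(picked))
	}
	for _, id := range picked {
		a.free[id] = false
	}
	return picked, nil
}

func main() {
	a := NewAllocator(4)
	if gpus, err := a.Reserve(4); err == nil {
		fmt.Println("request A got GPUs", gpus)
	}
	if _, err := a.Reserve(4); err != nil {
		fmt.Println("request B failed cleanly:", err) // no partial hold left behind
	}
}
```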
In my opinion, binding only one GPU to each slave pod is really a trade-off: if we requested all the GPUs through a single slave pod, unmounting would become complicated.
In the current implementation, unmounting a GPU is just deleting its slave pod. But if one slave pod held all the GPUs, then during an unmount we would have to tell kubelet and kube-scheduler that the unmounted GPU is free, which would probably need some hack. The sketch below shows how simple the per-GPU path is.
So I see it as a deliberate trade-off.
Please feel free to correct me if my opinion is unreasonable.
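To illustrate the point, here is a rough sketch of the one-pod-per-GPU unmount path, assuming client-go; unmountGPU and the gpuToPod bookkeeping map are hypothetical names, not code from this repo, and the main function uses client-go's fake clientset just so the sketch runs standalone:

```go
// With one slave pod per GPU, freeing GPU i is just deleting its slave
// pod; the pod's resource request is released, and kubelet/kube-scheduler
// see the GPU come back without any extra signaling.
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/fake"
)

// unmountGPU frees a single GPU by deleting the slave pod that holds it.
// gpuToPod is a hypothetical bookkeeping map: GPU index -> slave pod name.
func unmountGPU(ctx context.Context, cs kubernetes.Interface, ns string,
	gpuToPod map[int]string, gpu int) error {
	pod, ok := gpuToPod[gpu]
	if !ok {
		return fmt.Errorf("no slave pod recorded for GPU %d", gpu)
	}
	// Deleting the pod is the whole unmount operation.
	if err := cs.CoreV1().Pods(ns).Delete(ctx, pod, metav1.DeleteOptions{}); err != nil {
		return err
	}
	delete(gpuToPod, gpu)
	return nil
}

func main() {
	// Fake clientset pre-loaded with one slave pod, standing in for a cluster.
	cs := fake.NewSimpleClientset(&corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "slave-gpu-3", Namespace: "default"},
	})
	gpuToPod := map[int]string{3: "slave-gpu-3"}
	if err := unmountGPU(context.Background(), cs, "default", gpuToPod, 3); err != nil {
		fmt.Println("unmount failed:", err)
		return
	}
	fmt.Println("GPU 3 freed by deleting its slave pod")
}
```

If one slave pod held all the GPUs instead, there would be nothing this cheap to delete; releasing a single card would mean mutating a running pod's resources, which is exactly the hack mentioned above.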