[BUG] the DAGScheduling and GangScheduling(volcano) conflict in mpijob #180
Comments
@HeGaoYuan Thanks, GaoYuan. In the current implementation DAGScheduling does conflict with GangScheduling; you can temporarily disable DAGScheduling via the startup flags. Gang scheduling should be split up when DAG is enabled. For example, for a TFJob with 1 ps and 10 workers, there should be 2 PodGroups, grouped by replica type. Please assign an issue to me and I'll implement it later.
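As an illustration of the per-replica grouping idea, here is a minimal sketch using volcano's `scheduling.volcano.sh/v1beta1` PodGroup API; the object names are hypothetical and KubeDL's actual naming scheme may differ.

```yaml
# Hypothetical per-replica PodGroups for a TFJob with 1 ps and 10 workers.
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: tfjob-sample-ps          # gang for the PS replica (name is illustrative)
spec:
  minMember: 1                   # the single PS pod must be schedulable
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: tfjob-sample-worker      # gang for the worker replicas
spec:
  minMember: 10                  # all 10 workers must be schedulable together
```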
In my understanding, the reason we use GangScheduling is to avoid deadlock when multiple job instances request resources. I think 2 PodGroups may also lead to a deadlock problem. Maybe we should think of a better method to solve this?
In this case, the launcher pod is only a single pod and requires fewer resources. To keep it simple, we can make all workers a gang and exclude the launcher pod. This won't cover every case, but it is simple and worth trying if it works for most cases in practice. What do you think?
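A rough sketch of that approach, assuming volcano associates pods with a PodGroup through the `scheduling.k8s.io/group-name` annotation; the names and replica count below are illustrative, not KubeDL's actual output.

```yaml
# One gang covering only the workers; the launcher is left out of it.
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: mpijob-sample-workers    # hypothetical name
spec:
  minMember: 10                  # all workers must fit before any is bound
---
# Worker pod template snippet: joins the gang and uses the volcano scheduler.
metadata:
  annotations:
    scheduling.k8s.io/group-name: mpijob-sample-workers
spec:
  schedulerName: volcano
# The launcher pod would simply omit the group-name annotation, so it is
# scheduled on its own and never blocks (or is blocked by) the worker gang.
```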
In my opinion, because we have the launcherRunsWorkload ability, the launcher pod may request more resources.
OK, in that case we may disable DAG scheduling if launcherRunsWorkload is enabled. @HeGaoYuan @SimonCqk, opinions?
In my opinion, we may need to refactor the DAG scheduling. Similar to the implementation of mpijob's initContainers, we could implement DAG scheduling by injecting an initContainer that waits for the dependency pod to be running.
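To make the idea concrete, a hedged sketch of such an injected initContainer is shown below; the image, the dependency pod name, and the assumption that the pod's ServiceAccount is allowed to `get` pods are all illustrative choices, not KubeDL's actual implementation.

```yaml
# Illustrative only: an initContainer injected into a dependent replica that
# blocks until its dependency pod (e.g. the ps or launcher) is Running.
initContainers:
- name: wait-for-dependency
  image: bitnami/kubectl:latest            # assumed image; any kubectl-capable image works
  env:
  - name: DEPENDENCY_POD
    value: tfjob-sample-ps-0               # hypothetical dependency pod name
  command:
  - sh
  - -c
  - |
    # Poll until the dependency pod reports phase Running (requires RBAC to get pods).
    until [ "$(kubectl get pod "$DEPENDENCY_POD" -o jsonpath='{.status.phase}')" = "Running" ]; do
      echo "waiting for $DEPENDENCY_POD to be running..."
      sleep 3
    done
```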
Then two anomalies arise:
Is launcherRunsWorkload a valid use case? Why is it needed? @HeGaoYuan
launcherRunsWorkload was not introduced by me, but I use this ability in practice. Its only advantage is that it reduces the number of pods.
Good consideration!
Good point, actually the global flag
@HeGaoYuan I posted an issue and will refactor it soon: #194
@HeGaoYuan Hi, I've posted a pull request to fix it, and a new image tagged `daily` will be pushed to Docker Hub. If you want to try it as soon as possible, please pull the latest commits on the master branch :)
Great! I will try it.
What happened:
The mpijob worker pods are pending, and there is no launcher pod.
The events of the worker pod are as follows:
I think the root cause is that DAGScheduling and GangScheduling (volcano) conflict in mpijob.
I can fix this problem by adding these args to the kubedl deployment.
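The exact args were not quoted above. Purely as an illustration, assuming KubeDL exposes a feature-gate style startup flag for DAG scheduling (the real flag name may differ; check the controller's `--help`), the deployment args might look roughly like this:

```yaml
# Hypothetical snippet of the kubedl controller Deployment: disable DAG
# scheduling via a startup flag. The flag name here is an assumption.
spec:
  template:
    spec:
      containers:
      - name: kubedl
        args:
        - --feature-gates=DAGScheduling=false
```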
What you expected to happen:
No pending pods.
How to reproduce it:
Enable DAGScheduling and GangScheduling (volcano), then run an mpijob.
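For reference, a rough MPIJob skeleton is sketched below; the apiVersion, field layout, and image are assumptions (they follow the common kubeflow-style MPIJob shape and may differ across KubeDL versions), and the exact way gang scheduling is switched on (controller flags vs. job annotations) is omitted because it depends on the KubeDL release.

```yaml
# Illustrative MPIJob skeleton only.
apiVersion: training.kubedl.io/v1alpha1    # assumed API group/version
kind: MPIJob
metadata:
  name: mpi-sample
spec:
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - name: mpi
            image: mpioperator/mpi-pi      # assumed example image
            command: ["mpirun", "-n", "2", "/home/mpiuser/pi"]
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - name: mpi
            image: mpioperator/mpi-pi
```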
Anything else we need to know?:
Environment:
- Kubernetes version (use `kubectl version`):
- OS (e.g.: `cat /etc/os-release`):
- Kernel (e.g. `uname -a`):