
[BUG] DAGScheduling and GangScheduling (volcano) conflict in MPIJob #180

Closed
HeGaoYuan opened this issue Sep 22, 2021 · 15 comments

@HeGaoYuan
Contributor

HeGaoYuan commented Sep 22, 2021

What happened:
The MPIJob worker pods stay Pending, and no launcher pod is created.

mpi-demo-worker-0             0/1     Pending     0          13s
mpi-demo-worker-1             0/1     Pending     0          13s

The events of a worker pod are as follows:

Events:
  Type     Reason            Age   From     Message
  ----     ------            ----  ----     -------
  Warning  FailedScheduling  65s   volcano  3/2 tasks in gang unschedulable: pod group is not ready, 2 Pending, 3 minAvailable.

I think the root cause is that DAGScheduling and GangScheduling (volcano) conflict in MPIJob: DAG scheduling holds back the launcher pod until the workers are running, while the gang's minAvailable of 3 still counts the launcher, so the PodGroup can never become ready.

I can work around this problem by adding these args to the kubedl deployment:

- --feature-gates
- DAGScheduling=false
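
For context, a minimal sketch of where these args sit in the kubedl controller Deployment (the Deployment name, namespace, and image below are illustrative and may differ in your installation):

  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: kubedl                      # illustrative; use the name of your kubedl controller Deployment
    namespace: kubedl-system          # illustrative
  spec:
    template:
      spec:
        containers:
          - name: kubedl
            image: kubedl/kubedl:latest     # whichever image you already deploy
            args:
              - --feature-gates
              - DAGScheduling=false         # disable DAG scheduling so the whole gang is created at once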

What you expected to happen:

The pods should not stay Pending.

How to reproduce it:
Enable DAGScheduling and GangScheduling (volcano), then run an MPIJob.

Anything else we need to know?:

Environment:

  • KubeDL version:
  • Kubernetes version (use kubectl version):
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@SimonCqk
Collaborator

@HeGaoYuan Thanks GaoYuan. In the current implementation, DAGScheduling does conflict with GangScheduling. You can disable DAGScheduling temporarily by setting the start-up flag --feature-gates=DAGScheduling=false.

Gang scheduling should be split up when DAG is enabled. For example, for a TFJob with 1 ps and 10 workers there should be 2 PodGroups, grouped by replica type. Please assign an issue to me and I'll implement it later.
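
For illustration, a sketch of what two replica-scoped volcano PodGroups could look like for that TFJob (resource names and values are illustrative, not KubeDL's actual output):

  apiVersion: scheduling.volcano.sh/v1beta1
  kind: PodGroup
  metadata:
    name: tfjob-demo-ps          # hypothetical: gang for the ps replicas
  spec:
    minMember: 1
  ---
  apiVersion: scheduling.volcano.sh/v1beta1
  kind: PodGroup
  metadata:
    name: tfjob-demo-worker      # hypothetical: gang for the worker replicas
  spec:
    minMember: 10

Each replica type then forms its own gang, so the DAG can create one replica type before the other without a single job-level PodGroup blocking both.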

@HeGaoYuan
Contributor Author

HeGaoYuan commented Sep 23, 2021

> @HeGaoYuan Thanks GaoYuan. In the current implementation, DAGScheduling does conflict with GangScheduling. You can disable DAGScheduling temporarily by setting the start-up flag --feature-gates=DAGScheduling=false.
>
> Gang scheduling should be split up when DAG is enabled. For example, for a TFJob with 1 ps and 10 workers there should be 2 PodGroups, grouped by replica type. Please assign an issue to me and I'll implement it later.

In my understanding, the reason we use GangScheduling is to avoid deadlock when multiple job instances request resources. I think 2 PodGroups may still lead to a deadlock. Maybe we should think of a better way to solve this problem?

@jian-he
Collaborator

jian-he commented Sep 30, 2021

In this case, the launcher pod is a single pod and requires fewer resources. To keep it simple, we can make all the workers a gang and exclude the launcher pod.

This won't solve all cases, but it is simple and worth trying if it works for most real-world cases. What do you think?
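
As a sketch of this idea for the MPIJob above (2 workers plus 1 launcher), the PodGroup would only count the workers, and the launcher pod would simply not be attached to the gang (names and values are illustrative):

  apiVersion: scheduling.volcano.sh/v1beta1
  kind: PodGroup
  metadata:
    name: mpi-demo               # hypothetical PodGroup for the worker gang
  spec:
    minMember: 2                 # only the 2 workers; the launcher is excluded from the gang

The launcher, being a single small pod, would then be scheduled independently once the worker gang is up.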

@HeGaoYuan
Contributor Author

HeGaoYuan commented Sep 30, 2021

> In this case, the launcher pod is a single pod and requires fewer resources. To keep it simple, we can make all the workers a gang and exclude the launcher pod.
>
> This won't solve all cases, but it is simple and worth trying if it works for most real-world cases. What do you think?

In my opinion, because we have the launcherRunsWorkload ability, the launcher pod may request more resources than a plain launcher would.

@jian-he
Collaborator

jian-he commented Oct 1, 2021

OK, in that case we could disable DAG scheduling when launcherRunsWorkload is enabled. @HeGaoYuan @SimonCqk, opinions?

@HeGaoYuan
Contributor Author

HeGaoYuan commented Oct 12, 2021

> OK, in that case we could disable DAG scheduling when launcherRunsWorkload is enabled. @HeGaoYuan @SimonCqk, opinions?

In my opinion, we may need to refactor DAG scheduling. Like the implementation of MPIJob's initContainers, we could implement DAG scheduling by injecting an initContainer that waits for the dependency pod to be running.
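
A rough sketch of such an injected init container (the image, variable name, and wait loop below are hypothetical, not KubeDL's actual implementation; the pod's service account would also need RBAC permission to get pods):

  initContainers:
    - name: wait-for-upstream
      image: bitnami/kubectl:latest      # any image that ships kubectl would do
      env:
        - name: DEPENDENCY_POD
          value: mpi-demo-worker-0       # illustrative upstream pod
      command:
        - sh
        - -c
        - |
          # Block until the dependency pod reports phase Running.
          until kubectl get pod "$DEPENDENCY_POD" -o jsonpath='{.status.phase}' | grep -q Running; do
            echo "waiting for $DEPENDENCY_POD to be Running"
            sleep 5
          done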

@SimonCqk
Collaborator

> In my opinion, we may need to refactor DAG scheduling. Like the implementation of MPIJob's initContainers, we could implement DAG scheduling by injecting an initContainer that waits for the dependency pod to be running.

Then two problems arise:

  • Thousands of init containers would be created, each watching or polling the apiserver, which puts heavy overhead on the control plane.
  • Downstream pods may be scheduled too early (not only upstream Running triggers downstream, but also upstream Succeeded triggers downstream), consuming resources while they keep stalling.

@jian-he
Collaborator

jian-he commented Oct 13, 2021

Is launcherRunsWorkload a valid use case? Why is it needed? @HeGaoYuan

@HeGaoYuan
Contributor Author

> Is launcherRunsWorkload a valid use case? Why is it needed? @HeGaoYuan

launcherRunsWorkload was not introduced by me, but I use this ability in practice. Its only advantage is that it reduces the number of pods.

@HeGaoYuan
Contributor Author

Speaking of launcherRunsWorkload: right now launcherRunsWorkload is a global variable of kubedl. I suggest changing it to a field of the job. What do you think? @jian-he @SimonCqk

@HeGaoYuan
Contributor Author

> In my opinion, we may need to refactor DAG scheduling. Like the implementation of MPIJob's initContainers, we could implement DAG scheduling by injecting an initContainer that waits for the dependency pod to be running.
>
> Then two problems arise:
>
>   • Thousands of init containers would be created, each watching or polling the apiserver, which puts heavy overhead on the control plane.
>   • Downstream pods may be scheduled too early (not only upstream Running triggers downstream, but also upstream Succeeded triggers downstream), consuming resources while they keep stalling.

Good consideration!

@SimonCqk
Collaborator

> Speaking of launcherRunsWorkload: right now launcherRunsWorkload is a global variable of kubedl. I suggest changing it to a field of the job. What do you think? @jian-he @SimonCqk

Good point. Actually, the global flag launcherRunsWorkload can be removed: an mpiReplicaSpecs entry with the Launcher role already indicates that the MPIJob will be driven by the launcher pod, which is exactly the launcherRunsWorkload semantics.
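
For reference, a hedged sketch of how the Launcher role appears in an MPIJob spec (the apiVersion, image, and names below are illustrative; check the API group/version shipped with your KubeDL release):

  apiVersion: training.kubedl.io/v1alpha1   # illustrative; verify against your installed CRDs
  kind: MPIJob
  metadata:
    name: mpi-demo
  spec:
    mpiReplicaSpecs:
      Launcher:
        replicas: 1
        template:
          spec:
            containers:
              - name: mpi-launcher
                image: my-mpi-image:latest   # illustrative; launches (or also runs) the workload
      Worker:
        replicas: 2
        template:
          spec:
            containers:
              - name: mpi-worker
                image: my-mpi-image:latest   # illustrative

The presence of the Launcher entry is what encodes the "launcher drives the job" semantics, so a separate global flag is redundant.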

@SimonCqk
Collaborator

@HeGaoYuan I posted an issue and will refactor it soon: #194

@SimonCqk
Collaborator

@HeGaoYuan Hi, I've posted a pull request to fix it, and a new image tagged daily will be pushed to Docker Hub. If you want to try it as soon as possible, please pull the latest commits from the master branch :)

@HeGaoYuan
Contributor Author

> @HeGaoYuan Hi, I've posted a pull request to fix it, and a new image tagged daily will be pushed to Docker Hub. If you want to try it as soon as possible, please pull the latest commits from the master branch :)

Great! I will try it.
