
[BUG] DAGScheduling and GangScheduling (volcano) conflict in MPIJob #180

Closed
HeGaoYuan opened this issue Sep 22, 2021 · 15 comments

@HeGaoYuan
Contributor

HeGaoYuan commented Sep 22, 2021

What happened:
The MPIJob worker pods stay Pending, and no launcher pod is created.

mpi-demo-worker-0             0/1     Pending     0          13s
mpi-demo-worker-1             0/1     Pending     0          13s

The events of a worker pod are as follows:

Events:
  Type     Reason            Age   From     Message
  ----     ------            ----  ----     -------
  Warning  FailedScheduling  65s   volcano  3/2 tasks in gang unschedulable: pod group is not ready, 2 Pending, 3 minAvailable.

I think the root cause is that DAGScheduling and GangScheduling (volcano) conflict in MPIJob: DAG scheduling holds back the launcher pod until the workers are running, while the gang's minAvailable of 3 still counts the launcher, so the PodGroup can never become ready.

I can work around this problem by adding these args to the kubedl deployment:

- --feature-gates
- DAGScheduling=false
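
For context, a minimal sketch of where these args sit in the kubedl controller Deployment (the Deployment name, namespace, and image below are illustrative and may differ in your installation):

  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: kubedl                      # illustrative; use the name of your kubedl controller Deployment
    namespace: kubedl-system          # illustrative
  spec:
    template:
      spec:
        containers:
          - name: kubedl
            image: kubedl/kubedl:latest     # whichever image you already deploy
            args:
              - --feature-gates
              - DAGScheduling=false         # disable DAG scheduling so the whole gang is created at once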

What you expected to happen:

The pods should not stay Pending.

How to reproduce it:
Enable DAGScheduling and GangScheduling (volcano), then run an MPIJob.

Anything else we need to know?:

Environment:

  • KubeDL version:
  • Kubernetes version (use kubectl version):
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@SimonCqk
Collaborator

@HeGaoYuan Thanks GaoYuan. In the current implementation, DAGScheduling does conflict with GangScheduling. You can disable DAGScheduling temporarily by setting the start-up flag --feature-gates=DAGScheduling=false.

Gang scheduling should be split up when DAG is enabled. For example, for a TFJob with 1 ps and 10 workers there should be 2 PodGroups, grouped by replica type. Please assign an issue to me and I'll implement it later.
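
For illustration, a sketch of what two replica-scoped volcano PodGroups could look like for that TFJob (resource names and values are illustrative, not KubeDL's actual output):

  apiVersion: scheduling.volcano.sh/v1beta1
  kind: PodGroup
  metadata:
    name: tfjob-demo-ps          # hypothetical: gang for the ps replicas
  spec:
    minMember: 1
  ---
  apiVersion: scheduling.volcano.sh/v1beta1
  kind: PodGroup
  metadata:
    name: tfjob-demo-worker      # hypothetical: gang for the worker replicas
  spec:
    minMember: 10

Each replica type then forms its own gang, so the DAG can create one replica type before the other without a single job-level PodGroup blocking both.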

@HeGaoYuan
Contributor Author

HeGaoYuan commented Sep 23, 2021

> @HeGaoYuan Thanks GaoYuan. In the current implementation, DAGScheduling does conflict with GangScheduling. You can disable DAGScheduling temporarily by setting the start-up flag --feature-gates=DAGScheduling=false.
>
> Gang scheduling should be split up when DAG is enabled. For example, for a TFJob with 1 ps and 10 workers there should be 2 PodGroups, grouped by replica type. Please assign an issue to me and I'll implement it later.

In my understanding, the reason we use GangScheduling is to avoid deadlock when multiple job instances request resources. I think 2 PodGroups may still lead to a deadlock. Maybe we should think of a better way to solve this problem?

@jian-he
Collaborator

jian-he commented Sep 30, 2021

In this case, the launcher pod is a single pod and requires fewer resources. To keep it simple, we can make all the workers a gang and exclude the launcher pod.

This won't solve all cases, but it is simple and worth trying if it works for most real-world cases. What do you think?
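
As a sketch of this idea for the MPIJob above (2 workers plus 1 launcher), the PodGroup would only count the workers, and the launcher pod would simply not be attached to the gang (names and values are illustrative):

  apiVersion: scheduling.volcano.sh/v1beta1
  kind: PodGroup
  metadata:
    name: mpi-demo               # hypothetical PodGroup for the worker gang
  spec:
    minMember: 2                 # only the 2 workers; the launcher is excluded from the gang

The launcher, being a single small pod, would then be scheduled independently once the worker gang is up.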

@HeGaoYuan
Contributor Author

HeGaoYuan commented Sep 30, 2021

> In this case, the launcher pod is a single pod and requires fewer resources. To keep it simple, we can make all the workers a gang and exclude the launcher pod.
>
> This won't solve all cases, but it is simple and worth trying if it works for most real-world cases. What do you think?

In my opinion, because we have the launcherRunsWorkload ability, the launcher pod may request more resources than a plain launcher would.

@jian-he
Collaborator

jian-he commented Oct 1, 2021

OK, in that case we could disable DAG scheduling when launcherRunsWorkload is enabled. @HeGaoYuan @SimonCqk, opinions?

@HeGaoYuan
Contributor Author

HeGaoYuan commented Oct 12, 2021

> OK, in that case we could disable DAG scheduling when launcherRunsWorkload is enabled. @HeGaoYuan @SimonCqk, opinions?

In my opinion, we may need to refactor DAG scheduling. Like the implementation of MPIJob's initContainers, we could implement DAG scheduling by injecting an initContainer that waits for the dependency pod to be running.
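
A rough sketch of such an injected init container (the image, variable name, and wait loop below are hypothetical, not KubeDL's actual implementation; the pod's service account would also need RBAC permission to get pods):

  initContainers:
    - name: wait-for-upstream
      image: bitnami/kubectl:latest      # any image that ships kubectl would do
      env:
        - name: DEPENDENCY_POD
          value: mpi-demo-worker-0       # illustrative upstream pod
      command:
        - sh
        - -c
        - |
          # Block until the dependency pod reports phase Running.
          until kubectl get pod "$DEPENDENCY_POD" -o jsonpath='{.status.phase}' | grep -q Running; do
            echo "waiting for $DEPENDENCY_POD to be Running"
            sleep 5
          done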

@SimonCqk
Collaborator

> In my opinion, we may need to refactor DAG scheduling. Like the implementation of MPIJob's initContainers, we could implement DAG scheduling by injecting an initContainer that waits for the dependency pod to be running.

Then two problems arise:

  • Thousands of init containers would be created, each watching or polling the apiserver, which puts heavy overhead on the control plane.
  • Downstream pods may be scheduled too early (not only upstream Running triggers downstream, but also upstream Succeeded triggers downstream), consuming resources while they keep stalling.

@jian-he
Collaborator

jian-he commented Oct 13, 2021

Is launcherRunsWorkload a valid use case? Why is it needed? @HeGaoYuan

@HeGaoYuan
Contributor Author

> Is launcherRunsWorkload a valid use case? Why is it needed? @HeGaoYuan

launcherRunsWorkload was not introduced by me, but I use this ability in practice. Its only advantage is that it reduces the number of pods.

@HeGaoYuan
Contributor Author

Speaking of launcherRunsWorkload: right now launcherRunsWorkload is a global variable of kubedl. I suggest changing it to a field of the job. What do you think? @jian-he @SimonCqk

@HeGaoYuan
Contributor Author

> In my opinion, we may need to refactor DAG scheduling. Like the implementation of MPIJob's initContainers, we could implement DAG scheduling by injecting an initContainer that waits for the dependency pod to be running.
>
> Then two problems arise:
>
>   • Thousands of init containers would be created, each watching or polling the apiserver, which puts heavy overhead on the control plane.
>   • Downstream pods may be scheduled too early (not only upstream Running triggers downstream, but also upstream Succeeded triggers downstream), consuming resources while they keep stalling.

Good consideration!

@SimonCqk
Collaborator

> Speaking of launcherRunsWorkload: right now launcherRunsWorkload is a global variable of kubedl. I suggest changing it to a field of the job. What do you think? @jian-he @SimonCqk

Good point. Actually, the global flag launcherRunsWorkload can be removed: an mpiReplicaSpecs entry with the Launcher role already indicates that the MPIJob will be driven by the launcher pod, which is exactly the launcherRunsWorkload semantics.
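
For reference, a hedged sketch of how the Launcher role appears in an MPIJob spec (the apiVersion, image, and names below are illustrative; check the API group/version shipped with your KubeDL release):

  apiVersion: training.kubedl.io/v1alpha1   # illustrative; verify against your installed CRDs
  kind: MPIJob
  metadata:
    name: mpi-demo
  spec:
    mpiReplicaSpecs:
      Launcher:
        replicas: 1
        template:
          spec:
            containers:
              - name: mpi-launcher
                image: my-mpi-image:latest   # illustrative; launches (or also runs) the workload
      Worker:
        replicas: 2
        template:
          spec:
            containers:
              - name: mpi-worker
                image: my-mpi-image:latest   # illustrative

The presence of the Launcher entry is what encodes the "launcher drives the job" semantics, so a separate global flag is redundant.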

@SimonCqk
Collaborator

@HeGaoYuan I posted an issue and will refactor it soon: #194

@SimonCqk
Collaborator

@HeGaoYuan Hi, I've posted a pull request to fix it, and a new image tagged daily will be pushed to Docker Hub. If you want to try it as soon as possible, please pull the latest commits from the master branch :)

@HeGaoYuan
Contributor Author

> @HeGaoYuan Hi, I've posted a pull request to fix it, and a new image tagged daily will be pushed to Docker Hub. If you want to try it as soon as possible, please pull the latest commits from the master branch :)

Great! I will try it.
