MPIJob worker still running when NotEnoughResources with enable-gang-scheduling==true? #1617

Closed

goodpp opened this issue Jun 17, 2022 · 5 comments · Fixed by #1621

goodpp commented Jun 17, 2022

A question: with gang scheduling enabled, when there are not enough resources, the MPIJob still shows a Running state. Is this a bug?

== MPIJOB
NAME AGE STATE
hvd-tf1-mnist 16m Running

=== POD
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
hvd-tf1-mnist-launcher 0/1 Pending 0 14m
hvd-tf1-mnist-worker-0 1/1 Running 0 14m 10.42.1.172 openpai-212
hvd-tf1-mnist-worker-1 0/1 Pending 0 14m

==== PodGroup Status
status:
  conditions:
  - lastTransitionTime: "2022-06-17T10:21:33Z"
    message: '2/1 tasks in gang unschedulable: pod group is not ready, 1 Running,
      3 minAvailable'
    reason: NotEnoughResources
    status: "True"
    transitionID: cd024380-e518-43f0-9c44-3664ebb10429
    type: Unschedulable
  phase: Unknown
  running: 1

==== MPIJOB Status
status:
  conditions:
  - lastTransitionTime: "2022-06-17T10:06:10Z"
    lastUpdateTime: "2022-06-17T10:06:10Z"
    message: MPIJob aios/hvd-tf1-mnist is created.
    reason: MPIJobCreated
    status: "True"
    type: Created
  - lastTransitionTime: "2022-06-17T10:06:11Z"
    lastUpdateTime: "2022-06-17T10:06:11Z"
    message: MPIJob hvd-tf1-mnist is running.
    reason: JobRunning
    status: "True"
    type: Running
  replicaStatuses:
    Launcher: {}
    Worker:
      active: 1

goodpp commented Jun 17, 2022

training-operator version (v1.4.0)

public.ecr.aws/j1r0q0g6/training/training-operator:174e8813666951ded505daf334a37f60fd50c18d

zw0610 commented Jun 20, 2022

A brief translation: the MPIJob shows its status as 'Running' while not every related Pod is running.

It seems two issues come up at once:

  1. Even with enable-gang-scheduling=true, some Worker Pods still get scheduled. (This seems to be a gang-scheduler issue rather than a training-operator one; see the PodGroup sketch below.)
  2. With some Worker Pods still in Pending status, the MPIJob appears in Running status. (If reproduced, this is an operator issue.)

I shall try to reproduce the issue and fix it.
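
For reference, here is a rough sketch of the PodGroup contract that gang scheduling relies on (the field values are illustrative and taken from the report above, i.e. 1 launcher plus 2 workers, assuming the volcano scheduler is in use):

apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: hvd-tf1-mnist
  namespace: aios
spec:
  # Gang semantics: no pod in the group should be bound until all
  # minMember pods can be scheduled together (1 launcher + 2 workers here).
  minMember: 3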

goodpp commented Jun 20, 2022

@zw0610 Yes, there are indeed two issues.

goodpp commented Jun 22, 2022

Adding some logs. The MPIJob starts 8 workers, each worker requests one GPU, and only 4 GPUs are free.
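
For context, the MPIJob spec looks roughly like the sketch below (the container name and image are illustrative; the replica count and GPU request follow the description above):

apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: tjob-horovod-test-demo-12-1-0-9
  namespace: aios
spec:
  slotsPerWorker: 1
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - name: mpi
            image: horovod/horovod:latest   # illustrative image
    Worker:
      replicas: 8
      template:
        spec:
          containers:
          - name: mpi
            image: horovod/horovod:latest   # illustrative image
            resources:
              limits:
                nvidia.com/gpu: 1   # 8 workers x 1 GPU, but only 4 GPUs free in the cluster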

== MPIJOB
NAME AGE STATE
tjob-horovod-test-demo-12-1-0-9 4m4s Running

=== POD
tjob-horovod-test-demo-12-1-0-9-launcher 0/1 Init:0/1 0 4m28s
tjob-horovod-test-demo-12-1-0-9-worker-0 1/1 Running 0 4m28s
tjob-horovod-test-demo-12-1-0-9-worker-1 1/1 Running 0 4m28s
tjob-horovod-test-demo-12-1-0-9-worker-2 1/1 Running 0 4m28s
tjob-horovod-test-demo-12-1-0-9-worker-3 1/1 Running 0 4m28s
tjob-horovod-test-demo-12-1-0-9-worker-4 0/1 Pending 0 4m28s
tjob-horovod-test-demo-12-1-0-9-worker-5 0/1 Pending 0 4m28s
tjob-horovod-test-demo-12-1-0-9-worker-6 0/1 Pending 0 4m28s
tjob-horovod-test-demo-12-1-0-9-worker-7 0/1 Pending 0 4m28s

== MPIJOB status
status:
  conditions:
  - lastTransitionTime: "2022-06-22T03:16:24Z"
    lastUpdateTime: "2022-06-22T03:16:24Z"
    message: MPIJob aios/tjob-horovod-test-demo-12-1-0-9 is created.
    reason: MPIJobCreated
    status: "True"
    type: Created
  - lastTransitionTime: "2022-06-22T03:16:26Z"
    lastUpdateTime: "2022-06-22T03:16:26Z"
    message: MPIJob tjob-horovod-test-demo-12-1-0-9 is running.
    reason: JobRunning
    status: "True"
    type: Running
  replicaStatuses:
    Launcher: {}
    Worker:
      active: 4

=== PodGroup status
status:
  conditions:
  - lastTransitionTime: "2022-06-22T03:21:14Z"
    message: '4/5 tasks in gang unschedulable: pod group is not ready, 1 Bound,
      4 Running, 9 minAvailable'
    reason: NotEnoughResources
    status: "True"
    transitionID: d3456190-5eee-4254-86a4-94f477fb5fdc
    type: Unschedulable
  phase: Unknown
  running: 4

zw0610 commented Jun 23, 2022

/assign @hackerboy01
/assign @Garrybest
