TFJob with 1 replicas can't use gang-scheduling #922

Closed
zionwu opened this issue Jan 29, 2019 · 4 comments

@zionwu
Contributor

zionwu commented Jan 29, 2019

When I enable gang scheduling, I expect all TFJobs to be scheduled by "kube-batch", so that all jobs share the same scheduling policy.

However, if I submit a TFJob with 1 replica and set schedulerName to "kube-batch", the job stays pending. The cause is that tf-operator does not create a PDB (PodDisruptionBudget) when the job has fewer than 2 replicas:

func (jc *JobController) SyncPdb(job metav1.Object, minAvailableReplicas int32) (*v1beta1.PodDisruptionBudget, error) {
	labelJobName := jc.Controller.GetJobNameLabelKey()
	// Non-distributed training is not required gang scheduling
	if minAvailableReplicas < 2 {
		return nil, nil
	}
	// ...

Can we remove this check to make the scheduling policy for all jobs consistent?
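
A minimal sketch of the decision in question, assuming the early return is the only thing that changes. The helper names `shouldSyncPDBCurrent` and `shouldSyncPDBProposed` are hypothetical and are not part of tf-operator; they only isolate the choice `SyncPdb` makes before it creates a PodDisruptionBudget.

```go
package main

import "fmt"

// shouldSyncPDBCurrent mirrors the check quoted above: a job whose
// minAvailableReplicas is below 2 is skipped, so no PDB is created for it.
// Hypothetical helper, not part of tf-operator.
func shouldSyncPDBCurrent(minAvailableReplicas int32) bool {
	return minAvailableReplicas >= 2
}

// shouldSyncPDBProposed models the behaviour once the check is removed.
// Assumption (not from the issue): a job with zero replicas still gets no PDB.
func shouldSyncPDBProposed(minAvailableReplicas int32) bool {
	return minAvailableReplicas >= 1
}

func main() {
	// Compare both policies for a single-replica job and two distributed jobs.
	for _, replicas := range []int32{1, 2, 4} {
		fmt.Printf("replicas=%d current=%t proposed=%t\n",
			replicas,
			shouldSyncPDBCurrent(replicas),
			shouldSyncPDBProposed(replicas))
	}
}
```

The two policies differ only for the single-replica job, which is the case reported here.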

@gaocegege
Member

gaocegege commented Jan 29, 2019

/cc @k82cn

I think we could remove it, although we are going to replace the PDB with a PodGroup. Maybe we could fix it after that change.

@k82cn
Collaborator

k82cn commented Jan 29, 2019

> I think we could remove it, although we are going to replace the PDB with a PodGroup. Maybe we could fix it after that change.

+1, kube-batch also supports a single pod in a PodGroup; it's safe to remove this check :)
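
As a rough illustration of why a single-pod group is unproblematic, here is a toy model of gang admission: a group starts only when at least `minMember` of its pods can be placed at once. With `minMember` equal to 1, the rule reduces to ordinary per-pod scheduling. The `toyPodGroup` type and the `admit` function are illustrative assumptions and do not reflect kube-batch's actual types or API.

```go
package main

import "fmt"

// toyPodGroup is a simplified stand-in for a gang of pods.
// Hypothetical type, not the kube-batch PodGroup resource.
type toyPodGroup struct {
	name      string
	minMember int // pods that must be placeable together
	pods      int // total pods in the group
}

// admit reports whether the group may start when freeSlots pod slots are free.
func admit(g toyPodGroup, freeSlots int) bool {
	placeable := g.pods
	if freeSlots < placeable {
		placeable = freeSlots
	}
	return placeable >= g.minMember
}

func main() {
	single := toyPodGroup{name: "single-replica", minMember: 1, pods: 1}
	distributed := toyPodGroup{name: "distributed", minMember: 4, pods: 4}

	// Three free slots: enough for the single pod, not for the 4-pod gang.
	fmt.Println(single.name, admit(single, 3))
	fmt.Println(distributed.name, admit(distributed, 3))
}
```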

@jlewi
Contributor

jlewi commented Feb 4, 2019

Anybody want to submit a fix?

@johnugeorge
Member

/assign
