When I enable gang scheduling, I expect all TFJobs to be scheduled by "kube-batch", so that every job follows the same scheduling policy.
However, if I submit a TFJob with 1 replica and set schedulerName to "kube-batch", the job stays pending. The cause is that TF-operator does not create a PDB (PodDisruptionBudget) when the job has fewer than 2 replicas:
func (jc *JobController) SyncPdb(job metav1.Object, minAvailableReplicas int32) (*v1beta1.PodDisruptionBudget, error) {
	labelJobName := jc.Controller.GetJobNameLabelKey()
	// Non-distributed training is not required gang scheduling
	if minAvailableReplicas < 2 {
		return nil, nil
	}
.....
Can we remove this check to make the scheduling policy for all jobs consistent?
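To make the request concrete, here is a rough, self-contained sketch of the difference between the current behavior and the proposed one. It is not the operator's actual code: `pdbSpec`, `syncPdbSketch`, and the `skipSingleReplica` flag are hypothetical stand-ins introduced only for illustration, and the real function returns a `*v1beta1.PodDisruptionBudget`.

```go
package main

import "fmt"

// pdbSpec is a hypothetical, simplified stand-in for the fields of
// v1beta1.PodDisruptionBudget that matter for this discussion.
type pdbSpec struct {
	JobName      string
	MinAvailable int32
}

// syncPdbSketch mirrors the early return quoted in the snippet above.
// skipSingleReplica=true reproduces the current check (no PDB below 2 replicas);
// skipSingleReplica=false models the proposal (always create a PDB).
func syncPdbSketch(jobName string, minAvailableReplicas int32, skipSingleReplica bool) *pdbSpec {
	if skipSingleReplica && minAvailableReplicas < 2 {
		return nil
	}
	return &pdbSpec{JobName: jobName, MinAvailable: minAvailableReplicas}
}

func main() {
	// Current behavior: a 1-replica job gets no PDB.
	fmt.Println(syncPdbSketch("single", 1, true) == nil)
	// Proposed behavior: a 1-replica job gets a PDB with MinAvailable 1.
	fmt.Println(syncPdbSketch("single", 1, false) != nil)
}
```

With the check removed, a single-replica job would get a PDB just like a distributed one, which is what I would expect when every job is handed to "kube-batch".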