Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

training-operator set scheduler error #1447

Closed
qiankunli opened this issue Oct 14, 2021 · 15 comments
Closed

training-operator set scheduler error #1447

qiankunli opened this issue Oct 14, 2021 · 15 comments

Comments

@qiankunli
Copy link
Contributor

training-operator v1.3.0

if spec.Template.Spec.SchedulerName is null, IsGangSchedulerSet will return false, podTemplate.Spec.SchedulerName will not set gangSchedulerName

// pkg/controller.v1/tensorflow/tfjob_controller.go
	if r.Config.EnableGangScheduling {
		if !util.IsGangSchedulerSet(replicas, gangSchedulerName) {
			errMsg := "Another scheduler is specified when gang-scheduling is enabled and it will not be overwritten"
			logger.Warning(errMsg)
			r.Recorder.Event(tfjob, v1.EventTypeWarning, podTemplateSchedulerNameReason, errMsg)
		} else {
			podTemplate.Spec.SchedulerName = gangSchedulerName
		}
         ...
// pkg/common/util/scheduler.go
func IsGangSchedulerSet(replicas map[commonv1.ReplicaType]*commonv1.ReplicaSpec, schedulerName string) bool {
	if len(schedulerName) == 0 {
		schedulerName = DefaultGangSchedulerName
	}

	for _, spec := range replicas {
		logrus.Info(fmt.Sprintf("Spec %v, SchedulerName %v",spec,spec.Template.Spec.SchedulerName))
		if spec.Template.Spec.SchedulerName != "" && spec.Template.Spec.SchedulerName == schedulerName {
			return true
		}
	}
	return false
}

the same code in tf-operator v1.2.1

// pkg/controller.v1/tensorflow/pod.go
	if isNonGangSchedulerSet(replicas) {
			errMsg := "Another scheduler is specified when gang-scheduling is enabled and it will not be overwritten"
			logger.Warning(errMsg)
			tc.Recorder.Event(tfjob, v1.EventTypeWarning, podTemplateSchedulerNameReason, errMsg)
		} else {
			podTemplate.Spec.SchedulerName = gangSchedulerName
		}
func isNonGangSchedulerSet(replicas map[commonv1.ReplicaType]*commonv1.ReplicaSpec) bool {
	for _, spec := range replicas {
		if spec.Template.Spec.SchedulerName != "" && spec.Template.Spec.SchedulerName != gangSchedulerName {
			return true
		}
	}
	return false
}
@gaocegege
Copy link
Member

/assign @Jeffwan

@Jeffwan
Copy link
Member

Jeffwan commented Oct 14, 2021

if spec.Template.Spec.SchedulerName is null, IsGangSchedulerSet will return false, podTemplate.Spec.SchedulerName will not set gangSchedulerName

em. I agree the logic here is kind of confusing, however current patch solve missing problem but doesn't make sense to me.

We need to define the correct behavior here. With gang enabled,

  1. If there's different scheduler name, then we file warning events.
  2. If there's empty scheduler name, we set to default
  3. If there's already volcano scheduler name, we keep it same.

IsGangSchedulerSet can be refactored to two methods, IsGangSchedulerSet and IsGangSchedulerMismatch, Does this sounds better? @qiankunli

@gaocegege
Copy link
Member

If there's empty scheduler name, we set to default

I think it should be: If there is empty scheduler name or null

@qiankunli
Copy link
Contributor Author

qiankunli commented Oct 15, 2021

If there's empty scheduler name, we set to default

I think it should be: If there is empty scheduler name or null

@Jeffwan I think it is a better idea. IsGangSchedulerSet have two logic

  1. is replica set schedulerName
  2. is schedulerName is same with gangSchedulerName

we can remove logic 2

if r.Config.EnableGangScheduling {
                schedulerNameFromRequest = util.parseGangScheduler(replicas)
		if len(schedulerNameFromRequest) > 0 {
			errMsg := fme.Sprintf("Another scheduler = %s  is specified when gang-scheduling is enabled and it will not be overwritten",schedulerNameFromRequest)
                         logger.Warning(errMsg)
			r.Recorder.Event(tfjob, v1.EventTypeWarning, podTemplateSchedulerNameReason, errMsg)
		} else {
			podTemplate.Spec.SchedulerName = gangSchedulerName
		}
}

or just simply

if r.Config.EnableGangScheduling {
    if util.IsGangSchedulerSet(replicas) {
	    errMsg := "Another scheduler is specified when gang-scheduling is enabled and it will not be overwritten"
	    logger.Warning(errMsg)
	    r.Recorder.Event(tfjob, v1.EventTypeWarning, podTemplateSchedulerNameReason, errMsg)
    } else {
	    podTemplate.Spec.SchedulerName = gangSchedulerName
    }

@gaocegege
Copy link
Member

Personally, I prefer the latter. WDYT @Jeffwan

@Jeffwan
Copy link
Member

Jeffwan commented Oct 23, 2021

Personally, I prefer the latter. WDYT @Jeffwan

@gaocegege @qiankunli

Agree. The only thing I feel a little bit misleading is the method name util.IsGangSchedulerSet. Does util.IsNonDefaultGangSchedulerSet sounds better?

Let's move forward with latter.

@qiankunli
Copy link
Contributor Author

qiankunli commented Oct 25, 2021

@Jeffwan is util.IsNonDefaultGangSchedulerSet means pod.spec.SchedulerName = nil or pod.spec.SchedulerName != volcano?

and the code is

const gangSchedulerName = "volcano"
if r.Config.EnableGangScheduling {
    if util.IsNonDefaultGangSchedulerSet(replicas, gangSchedulerName){
         if  pod.spec.SchedulerName == nil {
                podTemplate.Spec.SchedulerName = gangSchedulerName
         }else{
                 errMsg := "Another scheduler is specified when gang-scheduling is enabled and it will not be overwritten"
	        logger.Warning(errMsg)
	        r.Recorder.Event(tfjob, v1.EventTypeWarning, podTemplateSchedulerNameReason, errMsg)
         }
    }
}

@whybeyoung
Copy link

The same error here.

.org/v1","resourceVersion":"20490546"}, "reason": "SettedPodTemplateSchedulerName", "message": "Another scheduler is specified when gang-scheduling is enabled and it will not be overwritten"}
time="2021-11-04T06:14:09Z" level=info msg="Controller ctr-multi-train-3r1h2 created pod ctr-multi-train-3r1h2-worker-0" job=.ctr-multi-train-3r1h2 pod=.ctr-multi-train-3r1h2-worker-0 uid=
time="2021-11-04T06:14:09Z" level=info msg="Need to create new pod: worker-1" job=ai-ctr.ctr-multi-train-3r1h2 uid=aca068ca-09a3-416e-8d3b-d3f3905b1a6a
time="2021-11-04T06:14:09Z" level=warning msg="Another scheduler is specified when gang-scheduling is enabled and it will not be overwritten" job=ai-ctr.ctr-multi-train-3r1h2 replica-type=worker uid=aca068ca-09a3-416e-8d3b-d3f3905b1a6a

so, does this mean the volcano is not working?

@gaocegege
Copy link
Member

@Jeffwan Can you please help review the PR #1448 ?

@berlinsaint Personally, I think it does not work with this bug not fixed.

@Jeffwan
Copy link
Member

Jeffwan commented Nov 4, 2021

@gaocegege Sounds good. I ping @qiankunli for renaming

@qiankunli
Copy link
Contributor Author

@Jeffwan

i add a GetSchedulerName method, it may be simple to read

if r.Config.EnableGangScheduling {
		podSchedulerName := util.GetSchedulerName(replicas)
		if len(podSchedulerName) == 0 {
			podTemplate.Spec.SchedulerName = gangSchedulerName
		} else if strings.Compare(podSchedulerName, gangSchedulerName) != 0 {
			errMsg := "Another scheduler is specified when gang-scheduling is enabled and it will not be overwritten"
			logger.Warning(errMsg)
			r.Recorder.Event(tfjob, v1.EventTypeWarning, podTemplateSchedulerNameReason, errMsg)
		}
}

@gaocegege
Copy link
Member

@berlinsaint The patch is merged, you can use the lastest code. The bug should be fixed.

@whybeyoung
Copy link

@berlinsaint The patch is merged, you can use the lastest code. The bug should be fixed.

Thanks. In fact , i have compiled them myself this afternoon..

@gaocegege
Copy link
Member

Awesome. Sorry for the late fix.

@stale
Copy link

stale bot commented Mar 2, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the lifecycle/stale label Mar 2, 2022
@stale stale bot closed this as completed Apr 17, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants