error in judging status of mxjob #32
Comments
@jokerwenxiao mxnet-operator is currently designed for distributed training only. In other words, it does not support the configuration "scheduler replicas: 0, server replicas: 0, worker replicas: 1". The behavior you observed is a bug and will have to be fixed later. For your case, why not run your container as a plain pod? |
@suleisl2000 With tf-operator, I can use worker-0 alone to run a non-distributed training job. I wonder whether mxnet-operator could support this in the future. Thank you! |
@jokerwenxiao OK, we'd like to keep the same behavior as tf-operator. We will handle it later. |
Did you modify the CRD of mxnet-operator? I can't create an mxjob with the same settings: its CRD sets the minimum number of replicas to 1, and tf-operator does something similar, too. |
@KingOnTheStar |
I set scheduler replicas: 0, server replicas: 0, and worker replicas: 1 to run a simple (non-distributed) MXNet training script. As soon as I created the mxjob, its status became "Succeeded", even though the worker pod was still running.
The mxjob details are as follows:
}
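
For readers trying to picture how a job can be reported as "Succeeded" while its only worker is still running, here is a minimal sketch of one possible cause. It is not mxnet-operator's actual code: the function names, the pod-phase dictionary, and the rule that scheduler pods decide success are all assumptions made for illustration. The point is only that a check over an empty replica list is vacuously true.

```python
from typing import Dict, List

# Hypothetical illustration only; mxnet-operator's real status logic may differ.


def naive_job_succeeded(pod_phases: Dict[str, List[str]]) -> bool:
    """Assumed rule: the job is done once every scheduler pod has succeeded."""
    schedulers = pod_phases.get("scheduler", [])
    # all() over an empty list returns True, so zero scheduler replicas
    # makes the job look finished immediately.
    return all(phase == "Succeeded" for phase in schedulers)


def guarded_job_succeeded(pod_phases: Dict[str, List[str]]) -> bool:
    """Assumed fix: fall back to the workers when there are no scheduler pods."""
    deciding = pod_phases.get("scheduler") or pod_phases.get("worker", [])
    # Require at least one deciding pod so the check is never vacuously true.
    return bool(deciding) and all(phase == "Succeeded" for phase in deciding)


# The configuration from this issue: no scheduler, no server, one running worker.
phases = {"scheduler": [], "server": [], "worker": ["Running"]}
```

With the `phases` value above, the naive check reports success while the guarded one does not, which matches the symptom described in this issue.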