error in judging status of mxjob #32

jokerwenxiao · 2019-05-07T05:56:18Z

I set scheduler replicas: 0, server replicas: 0, and worker replicas: 1 to run a simple MXNet training script (not distributed). As soon as I created the MXJob, its status became "Succeeded", even though the worker pod was still running.

The MXJob details are as follows:

{

"apiVersion": "kubeflow.org/v1beta1",
"kind": "MXJob",
"metadata": {
    "creationTimestamp": "2019-05-07T05:36:22Z",
    "generation": 1,
    "name": "mxnet-0da7ef98",
    "namespace": "86872fb0-64d1-11e9-96c1-1ee8ba783f60",
    "resourceVersion": "44498209",
    "selfLink": "/apis/kubeflow.org/v1beta1/namespaces/86872fb0-64d1-11e9-96c1-1ee8ba783f60/mxjobs/mxnet-0da7ef98",
    "uid": "0c639d60-708a-11e9-991a-005056ae0022"
},
"spec": {
    "cleanPodPolicy": "None",
    "jobMode": "MXTrain",
    "mxReplicaSpecs": {
        "Scheduler": {
            "replicas": 0,
            "restartPolicy": "Never",
            "template": {
                "metadata": {
                    "creationTimestamp": null
                },
                "spec": {
                    "containers": [
                        {
                            "args": [
                                "python3",
                                "/mxnet_test/example/image-classification/train_mnist.py",
                                "--num-layers=2",
                                "--num-epochs=20"
                            ],
                            "env": [
                                {
                                    "name": "NVIDIA_VISIBLE_DEVICES",
                                    "value": "none"
                                }
                            ],
                            "image": "100.2.28.186/admin/mxnet:cpu",
                            "name": "mxnet",
                            "ports": [
                                {
                                    "containerPort": 9091,
                                    "name": "mxjob-port"
                                }
                            ],
                            "resources": {
                                "limits": {
                                    "cpu": "0",
                                    "memory": "0"
                                },
                                "requests": {
                                    "cpu": "0",
                                    "memory": "0"
                                }
                            },
                            "volumeMounts": [
                                {
                                    "mountPath": "/mxnet_test/example/image-classification/",
                                    "name": "host-mount"
                                },
                                {
                                    "mountPath": "/mxnet_test/data",
                                    "name": "data-mount"
                                }
                            ]
                        }
                    ],
                    "nodeSelector": {
                        "group_label": "24c489bc-64d1-11e9-96c1-1ee8ba783f60"
                    },
                    "restartPolicy": "OnFailure",
                    "volumes": [
                        {
                            "hostPath": {
                                "path": "/mxnet_test/example/image-classification/",
                                "type": "Directory"
                            },
                            "name": "host-mount"
                        },
                        {
                            "hostPath": {
                                "path": "/mxnet_test/data",
                                "type": "Directory"
                            },
                            "name": "data-mount"
                        }
                    ]
                }
            }
        },
        "Server": {
            "replicas": 0,
            "restartPolicy": "Never",
            "template": {
                "metadata": {
                    "creationTimestamp": null
                },
                "spec": {
                    "containers": [
                        {
                            "args": [
                                "python3",
                                "/mxnet_test/example/image-classification/train_mnist.py",
                                "--num-layers=2",
                                "--num-epochs=20"
                            ],
                            "env": [
                                {
                                    "name": "NVIDIA_VISIBLE_DEVICES",
                                    "value": "none"
                                }
                            ],
                            "image": "100.2.28.186/admin/mxnet:cpu",
                            "name": "mxnet",
                            "ports": [
                                {
                                    "containerPort": 9091,
                                    "name": "mxjob-port"
                                }
                            ],
                            "resources": {
                                "limits": {
                                    "cpu": "0",
                                    "memory": "0"
                                },
                                "requests": {
                                    "cpu": "0",
                                    "memory": "0"
                                }
                            },
                            "volumeMounts": [
                                {
                                    "mountPath": "/mxnet_test/example/image-classification/",
                                    "name": "host-mount"
                                },
                                {
                                    "mountPath": "/mxnet_test/data",
                                    "name": "data-mount"
                                }
                            ]
                        }
                    ],
                    "nodeSelector": {
                        "group_label": "24c489bc-64d1-11e9-96c1-1ee8ba783f60"
                    },
                    "restartPolicy": "OnFailure",
                    "volumes": [
                        {
                            "hostPath": {
                                "path": "/mxnet_test/example/image-classification/",
                                "type": "Directory"
                            },
                            "name": "host-mount"
                        },
                        {
                            "hostPath": {
                                "path": "/mxnet_test/data",
                                "type": "Directory"
                            },
                            "name": "data-mount"
                        }
                    ]
                }
            }
        },
        "Worker": {
            "replicas": 1,
            "restartPolicy": "Never",
            "template": {
                "metadata": {
                    "creationTimestamp": null
                },
                "spec": {
                    "containers": [
                        {
                            "args": [
                                "python3",
                                "/mxnet_test/example/image-classification/train_mnist.py",
                                "--num-layers=2",
                                "--num-epochs=20"
                            ],
                            "env": [
                                {
                                    "name": "NVIDIA_VISIBLE_DEVICES",
                                    "value": "none"
                                }
                            ],
                            "image": "100.2.28.186/admin/mxnet:cpu",
                            "name": "mxnet",
                            "ports": [
                                {
                                    "containerPort": 9091,
                                    "name": "mxjob-port"
                                }
                            ],
                            "resources": {
                                "limits": {
                                    "cpu": "1",
                                    "memory": "1Gi"
                                },
                                "requests": {
                                    "cpu": "1",
                                    "memory": "1Gi"
                                }
                            },
                            "volumeMounts": [
                                {
                                    "mountPath": "/mxnet_test/example/image-classification/",
                                    "name": "host-mount"
                                },
                                {
                                    "mountPath": "/mxnet_test/data",
                                    "name": "data-mount"
                                }
                            ]
                        }
                    ],
                    "nodeSelector": {
                        "group_label": "24c489bc-64d1-11e9-96c1-1ee8ba783f60"
                    },
                    "restartPolicy": "OnFailure",
                    "volumes": [
                        {
                            "hostPath": {
                                "path": "/mxnet_test/example/image-classification/",
                                "type": "Directory"
                            },
                            "name": "host-mount"
                        },
                        {
                            "hostPath": {
                                "path": "/mxnet_test/data",
                                "type": "Directory"
                            },
                            "name": "data-mount"
                        }
                    ]
                }
            }
        }
    }
},
"status": {
    "completionTime": "2019-05-07T05:36:22Z",
    "conditions": [
        {
            "lastTransitionTime": "2019-05-07T05:36:22Z",
            "lastUpdateTime": "2019-05-07T05:36:22Z",
            "message": "MXJob mxnet-0da7ef98 is created.",
            "reason": "MXJobCreated",
            "status": "True",
            "type": "Created"
        },
        {
            "lastTransitionTime": "2019-05-07T05:36:22Z",
            "lastUpdateTime": "2019-05-07T05:36:22Z",
            "message": "MXJob mxnet-0da7ef98 is successfully completed.",
            "reason": "MXJobSucceeded",
            "status": "True",
            "type": "Succeeded"
        }
    ],
    "mxReplicaStatuses": {
        "Scheduler": {},
        "Server": {},
        "Worker": {}
    },
    "startTime": "2019-05-07T05:36:22Z"
}

}
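
The inconsistency in the object above (a `Succeeded` condition while `mxReplicaStatuses.Worker` is empty and the Worker spec asks for 1 replica) can be checked mechanically. Here is a minimal sketch in Python, assuming the MXJob has been loaded into a dict, for example from the output of `kubectl get mxjob <name> -o json`. The function name `succeeded_without_replica_status` is hypothetical and not part of mxnet-operator.

```python
def succeeded_without_replica_status(mxjob: dict) -> bool:
    """Return True if the job reports Succeeded although some replica type
    with replicas > 0 has no recorded status (the situation in this report).

    Hypothetical helper for illustration; not part of mxnet-operator.
    """
    status = mxjob.get("status", {})
    succeeded = any(
        c.get("type") == "Succeeded" and c.get("status") == "True"
        for c in status.get("conditions", [])
    )
    if not succeeded:
        return False

    specs = mxjob.get("spec", {}).get("mxReplicaSpecs", {})
    statuses = status.get("mxReplicaStatuses", {})
    for rtype, spec in specs.items():
        # An empty dict ({}) means no pod of this type was ever counted.
        if spec.get("replicas", 0) > 0 and not statuses.get(rtype):
            return True
    return False
```

Passing the object from this report to the helper would flag it, because the Worker spec requests one replica but its status entry is `{}`.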


suleisl2000 · 2019-05-07T08:21:59Z

@jokerwenxiao mxnet-operator is currently designed for distributed training. In other words, it doesn't support the configuration "scheduler replicas: 0, server replicas: 0, worker replicas: 1". The behavior you observed is a bug and will have to be fixed later. For your case, why not run your container directly as a pod?
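
The thread does not identify the cause of the bug. One way such a result can arise, offered purely as an illustration and not taken from the mxnet-operator source, is a completion check that compares the number of succeeded pods with the expected replica count of a role: with `replicas: 0` the comparison holds before any pod has run. Both functions below are hypothetical.

```python
def naive_is_succeeded(replicas: int, succeeded: int) -> bool:
    # Holds vacuously when replicas == 0 and no pod has finished yet.
    return replicas - succeeded == 0


def guarded_is_succeeded(replicas: int, succeeded: int) -> bool:
    # Only a role that actually has replicas may decide job completion.
    return replicas > 0 and succeeded >= replicas
```

Under this assumption, a role with zero replicas would need to be skipped when deciding completion, so that the Worker role determines the job status in a worker-only job.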

jokerwenxiao · 2019-05-07T08:57:50Z

@suleisl2000 With tf-operator, I can use worker-0 alone to train a non-distributed job. I wonder whether mxnet-operator could support this in the future. Thank you!

suleisl2000 · 2019-05-07T10:21:24Z

@jokerwenxiao OK, we'd like to keep the same behavior as tf-operator. We will handle it later.

KingOnTheStar · 2019-05-12T14:35:49Z

Did you modify the CRD of mxnet-operator? I can't create an MXJob with the same settings: its CRD sets the minimum number of replicas to 1, and tf-operator does the same.

jokerwenxiao · 2019-05-13T01:19:30Z

@KingOnTheStar
I haven't modified the CRD of mxnet-operator since deploying Kubeflow.
For tf-operator, I deleted the ps replicaSpec and left only the worker replicaSpec.

jokerwenxiao closed this as completed May 8, 2019
