This repository has been archived by the owner on Feb 1, 2022. It is now read-only.

error in judging status of mxjob #32

Closed
jokerwenxiao opened this issue May 7, 2019 · 5 comments

Comments

@jokerwenxiao

jokerwenxiao commented May 7, 2019

I set scheduler replicas: 0, server replicas: 0, and worker replicas: 1 to run a simple, non-distributed MXNet training script. As soon as I created the MXJob, its status became "Succeeded", even though the worker pod was still running.

The MXJob details are as follows:

{
"apiVersion": "kubeflow.org/v1beta1",
"kind": "MXJob",
"metadata": {
    "creationTimestamp": "2019-05-07T05:36:22Z",
    "generation": 1,
    "name": "mxnet-0da7ef98",
    "namespace": "86872fb0-64d1-11e9-96c1-1ee8ba783f60",
    "resourceVersion": "44498209",
    "selfLink": "/apis/kubeflow.org/v1beta1/namespaces/86872fb0-64d1-11e9-96c1-1ee8ba783f60/mxjobs/mxnet-0da7ef98",
    "uid": "0c639d60-708a-11e9-991a-005056ae0022"
},
"spec": {
    "cleanPodPolicy": "None",
    "jobMode": "MXTrain",
    "mxReplicaSpecs": {
        "Scheduler": {
            "replicas": 0,
            "restartPolicy": "Never",
            "template": {
                "metadata": {
                    "creationTimestamp": null
                },
                "spec": {
                    "containers": [
                        {
                            "args": [
                                "python3",
                                "/mxnet_test/example/image-classification/train_mnist.py",
                                "--num-layers=2",
                                "--num-epochs=20"
                            ],
                            "env": [
                                {
                                    "name": "NVIDIA_VISIBLE_DEVICES",
                                    "value": "none"
                                }
                            ],
                            "image": "100.2.28.186/admin/mxnet:cpu",
                            "name": "mxnet",
                            "ports": [
                                {
                                    "containerPort": 9091,
                                    "name": "mxjob-port"
                                }
                            ],
                            "resources": {
                                "limits": {
                                    "cpu": "0",
                                    "memory": "0"
                                },
                                "requests": {
                                    "cpu": "0",
                                    "memory": "0"
                                }
                            },
                            "volumeMounts": [
                                {
                                    "mountPath": "/mxnet_test/example/image-classification/",
                                    "name": "host-mount"
                                },
                                {
                                    "mountPath": "/mxnet_test/data",
                                    "name": "data-mount"
                                }
                            ]
                        }
                    ],
                    "nodeSelector": {
                        "group_label": "24c489bc-64d1-11e9-96c1-1ee8ba783f60"
                    },
                    "restartPolicy": "OnFailure",
                    "volumes": [
                        {
                            "hostPath": {
                                "path": "/mxnet_test/example/image-classification/",
                                "type": "Directory"
                            },
                            "name": "host-mount"
                        },
                        {
                            "hostPath": {
                                "path": "/mxnet_test/data",
                                "type": "Directory"
                            },
                            "name": "data-mount"
                        }
                    ]
                }
            }
        },
        "Server": {
            "replicas": 0,
            "restartPolicy": "Never",
            "template": {
                "metadata": {
                    "creationTimestamp": null
                },
                "spec": {
                    "containers": [
                        {
                            "args": [
                                "python3",
                                "/mxnet_test/example/image-classification/train_mnist.py",
                                "--num-layers=2",
                                "--num-epochs=20"
                            ],
                            "env": [
                                {
                                    "name": "NVIDIA_VISIBLE_DEVICES",
                                    "value": "none"
                                }
                            ],
                            "image": "100.2.28.186/admin/mxnet:cpu",
                            "name": "mxnet",
                            "ports": [
                                {
                                    "containerPort": 9091,
                                    "name": "mxjob-port"
                                }
                            ],
                            "resources": {
                                "limits": {
                                    "cpu": "0",
                                    "memory": "0"
                                },
                                "requests": {
                                    "cpu": "0",
                                    "memory": "0"
                                }
                            },
                            "volumeMounts": [
                                {
                                    "mountPath": "/mxnet_test/example/image-classification/",
                                    "name": "host-mount"
                                },
                                {
                                    "mountPath": "/mxnet_test/data",
                                    "name": "data-mount"
                                }
                            ]
                        }
                    ],
                    "nodeSelector": {
                        "group_label": "24c489bc-64d1-11e9-96c1-1ee8ba783f60"
                    },
                    "restartPolicy": "OnFailure",
                    "volumes": [
                        {
                            "hostPath": {
                                "path": "/mxnet_test/example/image-classification/",
                                "type": "Directory"
                            },
                            "name": "host-mount"
                        },
                        {
                            "hostPath": {
                                "path": "/mxnet_test/data",
                                "type": "Directory"
                            },
                            "name": "data-mount"
                        }
                    ]
                }
            }
        },
        "Worker": {
            "replicas": 1,
            "restartPolicy": "Never",
            "template": {
                "metadata": {
                    "creationTimestamp": null
                },
                "spec": {
                    "containers": [
                        {
                            "args": [
                                "python3",
                                "/mxnet_test/example/image-classification/train_mnist.py",
                                "--num-layers=2",
                                "--num-epochs=20"
                            ],
                            "env": [
                                {
                                    "name": "NVIDIA_VISIBLE_DEVICES",
                                    "value": "none"
                                }
                            ],
                            "image": "100.2.28.186/admin/mxnet:cpu",
                            "name": "mxnet",
                            "ports": [
                                {
                                    "containerPort": 9091,
                                    "name": "mxjob-port"
                                }
                            ],
                            "resources": {
                                "limits": {
                                    "cpu": "1",
                                    "memory": "1Gi"
                                },
                                "requests": {
                                    "cpu": "1",
                                    "memory": "1Gi"
                                }
                            },
                            "volumeMounts": [
                                {
                                    "mountPath": "/mxnet_test/example/image-classification/",
                                    "name": "host-mount"
                                },
                                {
                                    "mountPath": "/mxnet_test/data",
                                    "name": "data-mount"
                                }
                            ]
                        }
                    ],
                    "nodeSelector": {
                        "group_label": "24c489bc-64d1-11e9-96c1-1ee8ba783f60"
                    },
                    "restartPolicy": "OnFailure",
                    "volumes": [
                        {
                            "hostPath": {
                                "path": "/mxnet_test/example/image-classification/",
                                "type": "Directory"
                            },
                            "name": "host-mount"
                        },
                        {
                            "hostPath": {
                                "path": "/mxnet_test/data",
                                "type": "Directory"
                            },
                            "name": "data-mount"
                        }
                    ]
                }
            }
        }
    }
},
"status": {
    "completionTime": "2019-05-07T05:36:22Z",
    "conditions": [
        {
            "lastTransitionTime": "2019-05-07T05:36:22Z",
            "lastUpdateTime": "2019-05-07T05:36:22Z",
            "message": "MXJob mxnet-0da7ef98 is created.",
            "reason": "MXJobCreated",
            "status": "True",
            "type": "Created"
        },
        {
            "lastTransitionTime": "2019-05-07T05:36:22Z",
            "lastUpdateTime": "2019-05-07T05:36:22Z",
            "message": "MXJob mxnet-0da7ef98 is successfully completed.",
            "reason": "MXJobSucceeded",
            "status": "True",
            "type": "Succeeded"
        }
    ],
    "mxReplicaStatuses": {
        "Scheduler": {},
        "Server": {},
        "Worker": {}
    },
    "startTime": "2019-05-07T05:36:22Z"
}
}
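
The mismatch in the dump above (a "Succeeded" condition while `mxReplicaStatuses` is empty for a replica type that requested one replica) can be checked mechanically. The following is a minimal sketch, not part of mxnet-operator: the function name `find_status_inconsistencies` is hypothetical, and it assumes that a per-replica status, when populated, carries a `succeeded` counter (the dump above only shows empty objects, so that field name is an assumption).

```python
def find_status_inconsistencies(mxjob: dict) -> list:
    """Return warnings when a Succeeded condition looks premature.

    Hypothetical helper: compares the requested replica counts in
    spec.mxReplicaSpecs with the (assumed) `succeeded` counters in
    status.mxReplicaStatuses.
    """
    status = mxjob.get("status", {})
    conditions = status.get("conditions", [])
    succeeded = any(
        c.get("type") == "Succeeded" and c.get("status") == "True"
        for c in conditions
    )
    if not succeeded:
        return []  # nothing to cross-check

    warnings = []
    replica_specs = mxjob.get("spec", {}).get("mxReplicaSpecs", {})
    replica_statuses = status.get("mxReplicaStatuses", {})
    for rtype, spec in replica_specs.items():
        wanted = spec.get("replicas", 0)
        # Assumption: a missing counter means no replica has succeeded yet.
        done = replica_statuses.get(rtype, {}).get("succeeded", 0)
        if wanted > 0 and done < wanted:
            warnings.append(f"{rtype}: {done}/{wanted} replicas reported succeeded")
    return warnings


# Trimmed-down version of the MXJob shown above.
job = {
    "spec": {
        "mxReplicaSpecs": {
            "Scheduler": {"replicas": 0},
            "Server": {"replicas": 0},
            "Worker": {"replicas": 1},
        }
    },
    "status": {
        "conditions": [
            {"type": "Created", "status": "True"},
            {"type": "Succeeded", "status": "True"},
        ],
        "mxReplicaStatuses": {"Scheduler": {}, "Server": {}, "Worker": {}},
    },
}

print(find_status_inconsistencies(job))
# → ['Worker: 0/1 replicas reported succeeded']
```

In practice the input would come from something like `kubectl get mxjob <name> -o json` parsed with `json.loads`.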

@suleisl2000
Contributor

@jokerwenxiao mxnet-operator is currently designed for distributed training. In other words, it doesn't support the configuration "scheduler replicas: 0, server replicas: 0, worker replicas: 1". The behavior you observed is a bug and will have to be fixed later. For your case, why not run your container as a plain pod?
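
As a rough illustration of the plain-pod workaround, the Worker's pod template from the MXJob can be lifted into a standalone `v1` Pod manifest. This is only a sketch under assumptions: the helper name `pod_from_replica_template` and the `-standalone` name suffix are hypothetical, and the resulting manifest would still need to be applied with `kubectl apply -f` or a Kubernetes client.

```python
import copy
import json


def pod_from_replica_template(mxjob: dict, replica_type: str = "Worker") -> dict:
    """Build a standalone Pod manifest from one replica's pod template.

    Hypothetical helper: it copies template.spec unchanged, so the
    container args, volumes, nodeSelector and restartPolicy are kept.
    """
    template = mxjob["spec"]["mxReplicaSpecs"][replica_type]["template"]
    meta = mxjob.get("metadata", {})
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            # The "-standalone" suffix is an illustrative choice.
            "name": f"{meta['name']}-standalone",
            "namespace": meta.get("namespace", "default"),
        },
        "spec": copy.deepcopy(template["spec"]),
    }


# Trimmed-down MXJob using values from the issue above.
mxjob = {
    "metadata": {"name": "mxnet-0da7ef98", "namespace": "86872fb0-64d1-11e9-96c1-1ee8ba783f60"},
    "spec": {
        "mxReplicaSpecs": {
            "Worker": {
                "replicas": 1,
                "template": {
                    "spec": {
                        "containers": [
                            {"name": "mxnet", "image": "100.2.28.186/admin/mxnet:cpu"}
                        ],
                        "restartPolicy": "OnFailure",
                    }
                },
            }
        }
    },
}

pod = pod_from_replica_template(mxjob)
print(json.dumps(pod, indent=2))
```

The trade-off is that a bare pod has no MXJob status or conditions, so completion has to be read from the pod phase instead.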

@jokerwenxiao
Author

@suleisl2000 With tf-operator, I can use worker-0 alone to train a non-distributed job. I wonder whether mxnet-operator could support this in the future. Thank you!

@suleisl2000
Contributor

@jokerwenxiao OK, we'd like to keep the same behavior as tf-operator; we will handle it later.

@KingOnTheStar
Contributor

Did you modify the CRD of mxnet-operator? I can't create an MXJob with the same settings: its CRD sets the minimum number of replicas to 1, and tf-operator does the same.

@jokerwenxiao
Author

@KingOnTheStar
After deploying Kubeflow, I haven't modified the CRD of mxnet-operator.
For tf-operator, I deleted the ps replicaSpec and left only the worker replicaSpec.
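
The tf-operator approach described here (removing the unused replica spec rather than setting its replicas to 0) can be sketched as a small transformation on the replica-spec map. This is illustrative only: `drop_empty_replica_specs` is a hypothetical helper, treating a missing `replicas` field as 1 is an assumption, and nothing in this thread confirms that mxnet-operator accepts a worker-only spec.

```python
import copy


def drop_empty_replica_specs(replica_specs: dict) -> dict:
    """Return a copy of the replica-spec map without zero-replica entries.

    Hypothetical helper. Assumption: an entry with no `replicas` field
    is kept, as if it requested one replica.
    """
    return {
        rtype: copy.deepcopy(spec)
        for rtype, spec in replica_specs.items()
        if spec.get("replicas", 1) > 0
    }


# Replica counts taken from the MXJob in this issue.
specs = {
    "Scheduler": {"replicas": 0},
    "Server": {"replicas": 0},
    "Worker": {"replicas": 1},
}

print(drop_empty_replica_specs(specs))
# → {'Worker': {'replicas': 1}}
```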
