This repository has been archived by the owner on Feb 1, 2022. It is now read-only.

the status of worker-0 is error, but the status of mxjob is Succeeded #38

Open
jokerwenxiao opened this issue May 21, 2019 · 2 comments

@jokerwenxiao

kubeflow version: 0.5.0
mxnet-operator version: v1beta1

Kubernetes dashboard display (screenshot: worker-0 pod shown in Error state):

worker-0 log:
INFO:root:start with arguments Namespace(add_stn=False, batch_size=64, data_dir='/admin/public/model/mxnet_distributed/data', disp_batches=10, dtype='float32', gc_threshold=0.5, gc_type='none', gpus='0', image_shape='1, 28, 28', initializer='default', kv_store='dist_device_sync', load_epoch=None, loss='', lr=0.05, lr_factor=0.1, lr_step_epochs='10', macrobatch_size=0, model_prefix=None, mom=0.9, monitor=0, network='mlp', num_classes=10, num_epochs=2, num_examples=6000, num_layers=2, optimizer='sgd', profile_server_suffix='', profile_worker_suffix='', save_period=1, test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear', wd=0.0001)
Traceback (most recent call last):
  File "/admin/public/model/mxnet_model/mxnet_distributed/train_mnist.py", line 99, in <module>
    fit.fit(args, sym, get_mnist_iter)
  File "/admin/public/model/mxnet_model/mxnet_distributed/common/fit.py", line 180, in fit
    (train, val) = data_loader(args, kv)
  File "/admin/public/model/mxnet_model/mxnet_distributed/train_mnist.py", line 57, in get_mnist_iter
    'train-labels-idx1-ubyte.gz', 'train-images-idx3-ubyte.gz')
  File "/admin/public/model/mxnet_model/mxnet_distributed/train_mnist.py", line 37, in read_data
    with gzip.open(os.path.join(args.data_dir,label)) as flbl:
  File "/opt/conda/lib/python3.6/gzip.py", line 53, in open
    binary_file = GzipFile(filename, gz_mode, compresslevel)
  File "/opt/conda/lib/python3.6/gzip.py", line 163, in __init__
    fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
FileNotFoundError: [Errno 2] No such file or directory: '/admin/public/model/mxnet_distributed/data/train-labels-idx1-ubyte.gz'
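
For what it is worth, the worker failure itself is just a missing dataset on the worker's volume, not something the operator did. Below is a minimal sketch of a more defensive read_data; the download mirror and the ensure_file helper are illustrative assumptions, not part of the original train_mnist.py.

import gzip
import os
import struct
import urllib.request

import numpy as np

# Assumption: a reachable MNIST mirror; point this at whatever source your cluster can access.
MNIST_MIRROR = "http://yann.lecun.com/exdb/mnist/"

def ensure_file(data_dir, name):
    """Download `name` into `data_dir` if it is not already present."""
    path = os.path.join(data_dir, name)
    if not os.path.exists(path):
        os.makedirs(data_dir, exist_ok=True)
        urllib.request.urlretrieve(MNIST_MIRROR + name, path)
    return path

def read_data(data_dir, label_name, image_name):
    """Same shape as the script's read_data, but fetches (or fails clearly)
    before gzip.open ever sees a missing path."""
    with gzip.open(ensure_file(data_dir, label_name)) as flbl:
        _, num = struct.unpack(">II", flbl.read(8))
        label = np.frombuffer(flbl.read(), dtype=np.int8)
    with gzip.open(ensure_file(data_dir, image_name)) as fimg:
        _, num, rows, cols = struct.unpack(">IIII", fimg.read(16))
        image = np.frombuffer(fimg.read(), dtype=np.uint8).reshape(num, rows, cols)
    return label, image

With a guard like this the pod would either recover by downloading the files or fail with an explicit message, but the status question below stands either way.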

mxjob status:

{
    "status": {
        "completionTime": "2019-05-21T08:37:24Z",
        "conditions": [
            {
                "lastTransitionTime": "2019-05-21T08:36:41Z",
                "lastUpdateTime": "2019-05-21T08:36:41Z",
                "message": "MXJob mxnet-8d1f211e is created.",
                "reason": "MXJobCreated",
                "status": "True",
                "type": "Created"
            },
            {
                "lastTransitionTime": "2019-05-21T08:36:41Z",
                "lastUpdateTime": "2019-05-21T08:36:46Z",
                "message": "MXJob mxnet-8d1f211e is running.",
                "reason": "MXJobRunning",
                "status": "False",
                "type": "Running"
            },
            {
                "lastTransitionTime": "2019-05-21T08:36:41Z",
                "lastUpdateTime": "2019-05-21T08:37:24Z",
                "message": "MXJob mxnet-8d1f211e is successfully completed.",
                "reason": "MXJobSucceeded",
                "status": "True",
                "type": "Succeeded"
            }
        ],
        "mxReplicaStatuses": {
            "Scheduler": {},
            "Server": {},
            "Worker": {}
        },
        "startTime": "2019-05-21T08:36:44Z"
    }
}
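
For anyone reproducing this, the mismatch can also be seen without the dashboard by reading both the pod phases and the MXJob conditions from the API. A rough sketch with the Kubernetes Python client follows; the namespace, label selector, and the CRD group/version/plural are assumptions, so adjust them to your installation.

from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running inside the cluster

NAMESPACE = "default"        # assumption: namespace the job was submitted to
JOB_NAME = "mxnet-8d1f211e"  # name taken from the conditions above

# Pod phases for this job (label selector is an assumption; check your pods' labels).
core = client.CoreV1Api()
pods = core.list_namespaced_pod(NAMESPACE, label_selector=f"mxnet_job_name={JOB_NAME}")
for pod in pods.items:
    print(pod.metadata.name, pod.status.phase)

# MXJob conditions as reported by the operator (CRD coordinates are assumptions).
crd = client.CustomObjectsApi()
mxjob = crd.get_namespaced_custom_object(
    group="kubeflow.org", version="v1beta1",
    namespace=NAMESPACE, plural="mxjobs", name=JOB_NAME)
for cond in mxjob.get("status", {}).get("conditions", []):
    print(cond["type"], cond["status"], cond["reason"])

On this cluster the pod listing would show worker-0 in a failed phase while the MXJob conditions still end with Succeeded, which is the discrepancy in the title.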
@KingOnTheStar
Contributor

Does this error occur sporadically, or does it only appear under a specific operation? I ran some tests and found that the status only goes wrong when a worker breaks down and the scheduler completes at the same time. Does your scheduler stop together with the worker?

@KingOnTheStar
Contributor

Scheduler completion leads to the mxjob being marked as succeeded. It is possible that when the scheduler completed, the worker was still running rather than in an error state, so the mxjob status was set to Succeeded.
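
To make the described behaviour concrete (this is only a toy illustration of the logic as reported in this thread, not the operator's actual Go code): if the job-level condition is derived from scheduler completion before worker failures are consulted, a worker error that lands in the same reconciliation pass is masked.

def reconcile_status(replica_statuses):
    """Toy model of the reported behaviour: the job is declared Succeeded
    as soon as the scheduler completes, regardless of worker state.
    `replica_statuses` maps replica type -> counts of pod outcomes."""
    if replica_statuses.get("Scheduler", {}).get("succeeded", 0) > 0:
        return "Succeeded"  # worker errors are never reached in this branch
    if replica_statuses.get("Worker", {}).get("failed", 0) > 0:
        return "Failed"
    return "Running"

# A worker failure that arrives together with scheduler completion is masked:
print(reconcile_status({"Scheduler": {"succeeded": 1}, "Worker": {"failed": 1}}))  # Succeeded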
