
[v1alpha2] The state of distributed model training. #544

Closed
yph152 opened this issue Apr 18, 2018 · 10 comments

Comments

@yph152
Contributor

yph152 commented Apr 18, 2018

@gaocegege @ScorpioCPH @DjangoPeng
Regarding how to judge the state of model training, I think there are usually several situations:
1. The best case is that a non-chief worker task fails, because these tasks are effectively stateless. When such a worker task is restored, it reconnects to its PS tasks and resumes the work that was previously interrupted.
2. A worse situation is that a PS task fails. This is a problem because PS tasks are stateful: all worker tasks depend on them to send their gradients and fetch updated parameter values. In this case the chief worker task is responsible for detecting the error; when it occurs, the chief interrupts the entire training run and restores all PS tasks from the previous checkpoint.
3. The worst case is that the chief worker task fails. Since we make it responsible for all the other tasks, once it has failed we have to make sure every task in the cluster is brought back to a good state, so what we do is interrupt the training.
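For context, a minimal TF 1.x sketch of the kind of cluster this analysis assumes (the addresses and role layout are placeholders, not something the operator prescribes):

```python
import tensorflow as tf  # TF 1.x API

# Placeholder local addresses; in a real TFJob these come from TF_CONFIG.
cluster = tf.train.ClusterSpec({
    "chief":  ["localhost:2222"],
    "worker": ["localhost:2223", "localhost:2224"],
    "ps":     ["localhost:2225"],
})

# Each pod starts a server for its own role and index.
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# PS tasks only serve variables (the model state), which is why losing a PS
# forces a restore from the last checkpoint, while a lost non-chief worker
# can simply reconnect and keep going.
```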

Based on the above analysis, I think it can be divided into several types:

  1. The user explicitly sets a chief; when the chief node fails, I think the distributed training fails;

  2. The user does not set a chief, which leaves a few sub-cases:

(1) the user may use worker-0 as the chief node, in which case a failure of worker-0 means the distributed training fails;

(2) the user does not use worker-0 as the chief node, in which case a PS failure means the distributed training fails;

In these two sub-cases we have no way to tell whether the user is treating worker-0 as the chief node, so when no chief is set I think a failure of either worker-0 or any PS node should be treated as a failure of the distributed training.

  3. In the case of a job that consists only of workers, a failure of any worker means the distributed training fails.
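To make the proposal concrete, a rough sketch of the rule described above; the function and argument names are made up for illustration and are not the operator's actual API:

```python
def job_failed(has_chief, only_workers, failed_replica):
    """Illustrative decision rule for the cases above (names are hypothetical).

    has_chief: the user explicitly declared a chief replica.
    only_workers: the job has worker replicas only (no PS, no chief).
    failed_replica: e.g. "chief", "worker-0", "worker-3", "ps-1".
    """
    if has_chief:
        # Type 1: only a chief failure fails the job.
        return failed_replica == "chief"
    if only_workers:
        # Type 3: any worker failure fails the job.
        return failed_replica.startswith("worker")
    # Type 2: no explicit chief. We cannot tell whether worker-0 is acting
    # as the chief, so a failure of worker-0 or of any PS fails the job.
    return failed_replica == "worker-0" or failed_replica.startswith("ps")
```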

What do you think?

@DjangoPeng
Member

/area operator
/kind discussion
/priority p0

/cc @ddysher @jlewi

@gaocegege
Member

Personally, I think we should not make decisions for users. We should consider the job failed if one worker fails and the user does not specify a chief worker.

@ScorpioCPH
Member

The user does not set chief.

We will use worker-0 as chief worker by default.

Training failure is a big topic we have discussed several times. It depends on many factors:

  • Synchronous training or asynchronous training.
  • Whether the user has written custom code to save/reload the model from a checkpoint file (see the sketch below).
  • And some other cases.
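On the checkpoint point: whether a job can recover after a PS or worker restart largely depends on user-side code along these lines (standard TF 1.x MonitoredTrainingSession; the checkpoint directory and the trivial train op are placeholders):

```python
import tensorflow as tf  # TF 1.x API

# Trivial stand-in for the user's training op; real jobs build a full graph.
global_step = tf.train.get_or_create_global_step()
train_op = tf.assign_add(global_step, 1)

# If a checkpoint exists in checkpoint_dir, the session restores it and
# training resumes from there after a restart; if the user never configures
# checkpointing, a PS failure loses all progress, so whether "the training
# failed" really depends on this user code.
with tf.train.MonitoredTrainingSession(
        checkpoint_dir="/tmp/tfjob-ckpt",                   # placeholder path
        hooks=[tf.train.StopAtStepHook(last_step=100)],
        save_checkpoint_secs=60) as sess:
    while not sess.should_stop():
        sess.run(train_op)
```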

@DjangoPeng
Member

Training failure is a big topic we have discussed several times.

How many cases do we support now? Or what's the default policy for distributed TFJob?

@DjangoPeng
Member

@yph152 See also:

#283
#333

@yph152
Contributor Author

yph152 commented Apr 19, 2018

@DjangoPeng @gaocegege @ScorpioCPH
Distributed model training really does have many kinds of situations, and it is very difficult for the back end to determine every state. Could we instead leave the judgment of the overall training result to the user, and only report back the state that follows from the settings the user chose? For example:

1. If the user sets the container restart policy to Always, then once the pod is running we keep considering the job Running even if the pod later fails; whether the training ultimately went well is left to the user's own judgment;

2. If the user sets the container restart policy to OnFailure, it is basically the same as Always;

3. If the user sets Never, it is similar in spirit: we report Running while the pod is running and Failed once it fails;
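A rough sketch of the mapping proposed above (the policy names are the Kubernetes restartPolicy values; the function itself is hypothetical):

```python
def report_job_state(restart_policy, pod_phase):
    """Illustrative only: derive the reported job state from a pod's phase.

    restart_policy: "Always", "OnFailure", or "Never" (Kubernetes values).
    pod_phase: "Running", "Failed", or "Succeeded".
    """
    if restart_policy in ("Always", "OnFailure"):
        # Once running, keep reporting Running even if the pod later fails;
        # the pod gets restarted and the user judges the final result.
        return "Succeeded" if pod_phase == "Succeeded" else "Running"
    # "Never": report exactly what happened to the pod.
    return pod_phase
```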

WDYT?

@wenzhel101
Contributor

IMO, the overall state for distributed training should be:

  • succeeded: master/chief succeeded.
  • failed: any replica got a permanent error (see "Only identify specific exit codes as retryable error" #518). Any worker/PS container can exit with a permanent error before the master/chief does; in that case a restart won't help, so the controller should fail the job and terminate all the replicas. For retryable exit statuses, the controller should respect the user's choice (RestartPolicy). A sketch follows below.
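A hedged sketch of that rule; the concrete set of retryable exit codes here is an assumption for illustration only (the real list is what #518 is about):

```python
# Assumed set of retryable exit codes, purely for illustration;
# the actual codes are the subject of #518.
RETRYABLE_EXIT_CODES = {130, 137, 143}

def on_replica_exit(replica_type, exit_code, restart_policy):
    """Illustrative controller reaction when a replica's container terminates."""
    if replica_type in ("master", "chief") and exit_code == 0:
        return "mark job Succeeded"
    if exit_code != 0 and exit_code not in RETRYABLE_EXIT_CODES:
        # Permanent error on any replica: restarting won't help, so fail the
        # job and terminate all remaining replicas.
        return "mark job Failed and terminate all replicas"
    if exit_code != 0 and restart_policy in ("Always", "OnFailure", "ExitCode"):
        # Retryable error: respect the user's RestartPolicy.
        return "restart this replica"
    return "no job-level change"
```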

@yph152
Contributor Author

yph152 commented Apr 20, 2018

@0olwzo0 My understanding is:

  • If the user sets RestartPolicy to Never, OnFailure, or Always, we use the default Kubernetes policy;
  • When the user sets RestartPolicy to ExitCode, we handle it according to the exit code.

Is that right?

@gaocegege
Member

Ref #562

@gaocegege
Member

dup with #562
