
[v1alpha2] The state of distributed model training. #544

Closed
yph152 opened this issue Apr 18, 2018 · 10 comments

Comments

@yph152
Contributor

yph152 commented Apr 18, 2018

@gaocegege @ScorpioCPH @DjangoPeng
Regarding how to judge the state of model training, I think there are usually several situations:
1. The best case is that a non-chief worker task fails, because these tasks are effectively stateless. When such a worker task is restored, it reconnects to its PS tasks and resumes the work that was previously interrupted.
2. A worse situation is that a PS task fails. This is a problem because PS tasks are stateful: all worker tasks depend on them to send their gradients and fetch updated parameter values. In this case the chief worker task is responsible for detecting the error; when it occurs, the chief interrupts the entire training run and restores all PS tasks from the previous checkpoint.
3. The worst case is that the chief worker task fails. Since we make it responsible for all the other tasks, once it has failed we have to make sure every task in the cluster is brought back to a good state, so what we do is interrupt the training.
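For context, a minimal TF 1.x sketch of the kind of cluster this analysis assumes (the addresses and role layout are placeholders, not something the operator prescribes):

```python
import tensorflow as tf  # TF 1.x API

# Placeholder local addresses; in a real TFJob these come from TF_CONFIG.
cluster = tf.train.ClusterSpec({
    "chief":  ["localhost:2222"],
    "worker": ["localhost:2223", "localhost:2224"],
    "ps":     ["localhost:2225"],
})

# Each pod starts a server for its own role and index.
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# PS tasks only serve variables (the model state), which is why losing a PS
# forces a restore from the last checkpoint, while a lost non-chief worker
# can simply reconnect and keep going.
```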

Based on the above analysis, I think it can be divided into several types:

  1. The user explicitly sets a chief; when the chief node fails, I think the distributed training fails;

  2. The user does not set a chief, which leaves a few sub-cases:

(1) the user may use worker-0 as the chief node, in which case a failure of worker-0 means the distributed training fails;

(2) the user does not use worker-0 as the chief node, in which case a PS failure means the distributed training fails;

In these two sub-cases we have no way to tell whether the user is treating worker-0 as the chief node, so when no chief is set I think a failure of either worker-0 or any PS node should be treated as a failure of the distributed training.

  3. In the case of a job that consists only of workers, a failure of any worker means the distributed training fails.
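To make the proposal concrete, a rough sketch of the rule described above; the function and argument names are made up for illustration and are not the operator's actual API:

```python
def job_failed(has_chief, only_workers, failed_replica):
    """Illustrative decision rule for the cases above (names are hypothetical).

    has_chief: the user explicitly declared a chief replica.
    only_workers: the job has worker replicas only (no PS, no chief).
    failed_replica: e.g. "chief", "worker-0", "worker-3", "ps-1".
    """
    if has_chief:
        # Type 1: only a chief failure fails the job.
        return failed_replica == "chief"
    if only_workers:
        # Type 3: any worker failure fails the job.
        return failed_replica.startswith("worker")
    # Type 2: no explicit chief. We cannot tell whether worker-0 is acting
    # as the chief, so a failure of worker-0 or of any PS fails the job.
    return failed_replica == "worker-0" or failed_replica.startswith("ps")
```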

What do you think?

@DjangoPeng
Member

/area operator
/kind discussion
/priority p0

/cc @ddysher @jlewi

@gaocegege
Member

Personally, I think we should not make decisions for users. We should consider the job failed if one worker fails and the user does not specify a chief worker.

@ScorpioCPH
Member

The user does not set chief.

We will use worker-0 as chief worker by default.

Training failure is a big topic we have discussed several times. It depends on many factors:

  • Synchronous training or asynchronous training.
  • Whether the user has written custom code to save/reload the model from a checkpoint file (see the sketch below).
  • And some other cases.
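On the checkpoint point: whether a job can recover after a PS or worker restart largely depends on user-side code along these lines (standard TF 1.x MonitoredTrainingSession; the checkpoint directory and the trivial train op are placeholders):

```python
import tensorflow as tf  # TF 1.x API

# Trivial stand-in for the user's training op; real jobs build a full graph.
global_step = tf.train.get_or_create_global_step()
train_op = tf.assign_add(global_step, 1)

# If a checkpoint exists in checkpoint_dir, the session restores it and
# training resumes from there after a restart; if the user never configures
# checkpointing, a PS failure loses all progress, so whether "the training
# failed" really depends on this user code.
with tf.train.MonitoredTrainingSession(
        checkpoint_dir="/tmp/tfjob-ckpt",                   # placeholder path
        hooks=[tf.train.StopAtStepHook(last_step=100)],
        save_checkpoint_secs=60) as sess:
    while not sess.should_stop():
        sess.run(train_op)
```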

@DjangoPeng
Member

Training failure is a big topic we have discussed several times.

How many cases do we support now? Or what's the default policy for distributed TFJob?

@DjangoPeng
Member

@yph152 See also:

#283
#333

@yph152
Contributor Author

yph152 commented Apr 19, 2018

@DjangoPeng @gaocegege @ScorpioCPH
Distributed model training really does have many kinds of situations, and it is very difficult for the back end to determine every state. Could we instead leave the judgment of the overall training result to the user, and only report back the state that follows from the settings the user chose? For example:

1. If the user sets the container restart policy to Always, then once the pod is running we keep considering the job Running even if the pod later fails; whether the training ultimately went well is left to the user's own judgment;

2. If the user sets the container restart policy to OnFailure, it is basically the same as Always;

3. If the user sets Never, it is similar in spirit: we report Running while the pod is running and Failed once it fails;
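A rough sketch of the mapping proposed above (the policy names are the Kubernetes restartPolicy values; the function itself is hypothetical):

```python
def report_job_state(restart_policy, pod_phase):
    """Illustrative only: derive the reported job state from a pod's phase.

    restart_policy: "Always", "OnFailure", or "Never" (Kubernetes values).
    pod_phase: "Running", "Failed", or "Succeeded".
    """
    if restart_policy in ("Always", "OnFailure"):
        # Once running, keep reporting Running even if the pod later fails;
        # the pod gets restarted and the user judges the final result.
        return "Succeeded" if pod_phase == "Succeeded" else "Running"
    # "Never": report exactly what happened to the pod.
    return pod_phase
```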

WDYT?

@wenzhel101
Contributor

IMO, the overall state for distributed training should be:

  • succeeded: master/chief succeeded.
  • failed: any replica got a permanent error (see "Only identify specific exit codes as retryable error" #518). Any worker/PS container can exit with a permanent error before the master/chief does; in that case a restart won't help, so the controller should fail the job and terminate all the replicas. For retryable exit statuses, the controller should respect the user's choice (RestartPolicy). A sketch follows below.
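A hedged sketch of that rule; the concrete set of retryable exit codes here is an assumption for illustration only (the real list is what #518 is about):

```python
# Assumed set of retryable exit codes, purely for illustration;
# the actual codes are the subject of #518.
RETRYABLE_EXIT_CODES = {130, 137, 143}

def on_replica_exit(replica_type, exit_code, restart_policy):
    """Illustrative controller reaction when a replica's container terminates."""
    if replica_type in ("master", "chief") and exit_code == 0:
        return "mark job Succeeded"
    if exit_code != 0 and exit_code not in RETRYABLE_EXIT_CODES:
        # Permanent error on any replica: restarting won't help, so fail the
        # job and terminate all remaining replicas.
        return "mark job Failed and terminate all replicas"
    if exit_code != 0 and restart_policy in ("Always", "OnFailure", "ExitCode"):
        # Retryable error: respect the user's RestartPolicy.
        return "restart this replica"
    return "no job-level change"
```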

@yph152
Contributor Author

yph152 commented Apr 20, 2018

@0olwzo0 My understanding is:

  • If the user sets RestartPolicy to Never, OnFailure, or Always, we use the default Kubernetes policy;
  • When the user sets RestartPolicy to ExitCode, we handle it according to the exit code.

Is that right?

@gaocegege
Member

Ref #562

@gaocegege
Member

dup with #562
