Scale-down of PyTorchJob causes worker pods to restart #1509
Hi, I ran the imagenet example with PyTorch elastic.

1. First, I created a PyTorchJob with the following elasticPolicy (the spec is sketched below).
2. I scaled down to 2 nodes by editing Worker.replicas to 2.
3. I scaled back up to 3 nodes by editing Worker.replicas to 3.

Is it the original design that scaling down causes the worker pods to restart? What considerations make scale-down different from scale-up? And is the maximum number of scale-down operations limited to maxRestarts?
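The original elasticPolicy was not captured in this thread, so here is a minimal, unverified sketch of what such a PyTorchJob might look like (the job name, image, and concrete field values are placeholders; the field names follow the Kubeflow training-operator v1 API):

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: elastic-imagenet              # hypothetical name
spec:
  elasticPolicy:
    rdzvBackend: c10d                 # rendezvous backend for the elastic agent
    minReplicas: 1                    # lower bound for elastic scaling
    maxReplicas: 3                    # upper bound for elastic scaling
    maxRestarts: 100                  # restart budget for the worker group
  pytorchReplicaSpecs:
    Worker:
      replicas: 3                     # edited to 2 (scale down), then back to 3 (scale up)
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: example.com/imagenet-elastic:latest   # placeholder image
```

Editing Worker.replicas as in steps 2 and 3 can be done, for instance, with `kubectl edit pytorchjob elastic-imagenet`, or with a JSON patch such as `kubectl patch pytorchjob elastic-imagenet --type=json -p '[{"op": "replace", "path": "/spec/pytorchReplicaSpecs/Worker/replicas", "value": 2}]'`.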
Comments

Yes, PyTorch elastic relies on checkpointing and restarting. Sometimes the agent handles the restart itself, in which case the pod does not restart.
Yeah, it's reasonable that processes restart when a scale-down happens, but I can't figure out why the pod needs to restart.
There are two processes in the pod: the agent and the worker. In most cases the agent will not restart, while the worker will.
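Roughly, the layout inside each worker pod looks like the sketch below (the process-tree comments are my description of the standard torchrun agent model, not something stated in this thread; nProcPerNode is the actual training-operator field):

```yaml
spec:
  elasticPolicy:
    nProcPerNode: 1   # worker processes the agent spawns in each pod
# Assumed process tree inside each worker pod:
#
#   container entrypoint: elastic agent (torch.distributed.run)
#   └── worker process(es)   # relaunched in place by the agent on a
#                            # membership change, with no pod restart
#
# If the agent itself exits (for example, rendezvous cannot complete), the
# container exits, and the pod's restartPolicy brings it back up; that is
# what shows up as a pod restart.
```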
Could you be more specific about the trigger conditions for an agent restart (the agent restarting its worker processes, that is) versus a pod restart?
For example, if a worker cannot find the other workers, the agent and the worker will both exit, and the pod will restart. This does not count against elasticPolicy.maxRestarts. elasticPolicy.maxRestarts corresponds to the max_restarts setting of torch.distributed.run: https://github.com/pytorch/pytorch/blob/df11e2d6f9782fc3995e17ef09a5ef3812da041d/torch/distributed/run.py#L205
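In config terms, my reading of the link above is that the value passes straight through to the agent (an assumption, not confirmed in the thread):

```yaml
spec:
  elasticPolicy:
    maxRestarts: 10   # analogous to `torchrun --max_restarts=10`: how many
                      # times the agent relaunches its worker group before
                      # giving up entirely
```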
It is a feature of PyTorch elastic. I think it is meant to avoid retrying permanent failures: for example, if the worker always fails, we should not retry forever.
Thanks for your reply, I got it. So under this mechanism, a pod restart is inevitable when scaling down.
I think it may be possible to avoid, but in the current implementation it is inevitable. I am not sure whether the behavior will be the same in the next PyTorch release (1.11 or 1.10.2).