Scale down of PyTorchJob causes worker pods to restart #1509

Closed
tingweiwu opened this issue Dec 14, 2021 · 10 comments


tingweiwu commented Dec 14, 2021

Hi, I am running the imagenet example with PyTorch elastic.

1. First, I create a PyTorchJob with the following elasticPolicy:

spec:
  elasticPolicy:
    rdzvBackend: c10d
    minReplicas: 2
    maxReplicas: 10
    maxRestarts: 1

NAME                                  READY   STATUS      RESTARTS   AGE
elastic-example-imagenet-2-worker-0   1/1     Running     0          15s
elastic-example-imagenet-2-worker-1   1/1     Running     0          15s
elastic-example-imagenet-2-worker-2   1/1     Running     0          15s

 kubectl logs elastic-example-imagenet-2-worker-2
Namespace(log_dir='/log', master_addr='127.0.0.1', master_port=29500, max_restarts=1, module=False, monitor_interval=5, nnodes='2:10', no_python=False, node_rank=0, nproc_per_node='1', rdzv_backend='c10d', rdzv_conf='', rdzv_endpoint='elastic-example-imagenet-2-worker-0:23456', rdzv_id='none', redirects='0', role='default', run_path=False, standalone=False, start_method='spawn', tee='0', training_script='/code/imagenet.py', training_script_args=['--arch=resnet18', '--epochs=20', '--batch-size=32', '--workers=0', '--checkpoint-file=/log/checkpoint.pth.tar', '/data/tiny-imagenet-200'])
args Namespace(log_dir='/log', master_addr='127.0.0.1', master_port=29500, max_restarts=1, module=False, monitor_interval=5, nnodes='2:10', no_python=False, node_rank=0, nproc_per_node='1', rdzv_backend='c10d', rdzv_conf='', rdzv_endpoint='elastic-example-imagenet-2-worker-0:23456', rdzv_id='none', redirects='0', role='default', run_path=False, standalone=False, start_method='spawn', tee='0', training_script='/code/imagenet.py', training_script_args=['--arch=resnet18', '--epochs=20', '--batch-size=32', '--workers=0', '--checkpoint-file=/log/checkpoint.pth.tar', '/data/tiny-imagenet-200']) 
config: LaunchConfig(min_nodes=2, max_nodes=10, nproc_per_node=1, run_id='none', role='default', rdzv_endpoint='elastic-example-imagenet-2-worker-0:23456', rdzv_backend='c10d', rdzv_configs={'timeout': 900}, rdzv_timeout=-1, max_restarts=1, monitor_interval=5, start_method='spawn', log_dir='/log', redirects=<Std.NONE: 0>, tee=<Std.NONE: 0>, metrics_cfg={}), cmd:  /opt/conda/bin/python , cmd_args ['-u', '/code/imagenet.py', '--arch=resnet18', '--epochs=20', '--batch-size=32', '--workers=0', '--checkpoint-file=/log/checkpoint.pth.tar', '/data/tiny-imagenet-200']
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
  entrypoint       : /code/imagenet.py
  min_nodes        : 2
  max_nodes        : 10
  nproc_per_node   : 1
  run_id           : none
  rdzv_backend     : c10d
  rdzv_endpoint    : elastic-example-imagenet-2-worker-0:23456
  rdzv_configs     : {'timeout': 900}
  max_restarts     : 1
  monitor_interval : 5
  log_dir          : /log
  metrics_cfg      : {}

INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /log/none_unuhjzmc
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=elastic-example-imagenet-2-worker-0
  master_port=34818
  group_rank=2
  group_world_size=3
  local_ranks=[0]
  role_ranks=[2]
  global_ranks=[2]
  role_world_sizes=[3]
  global_world_sizes=[3]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /log/none_unuhjzmc/attempt_0/0/error.json
Namespace(arch='resnet18', batch_size=32, checkpoint_file='/log/checkpoint.pth.tar', data='/data/tiny-imagenet-200', dist_backend='gloo', epochs=20, lr=0.1, momentum=0.9, print_freq=10, weight_decay=0.0001, workers=0)
=> creating model: resnet18
=> no workers have checkpoints, starting from epoch 0
=> start_epoch: 0, best_acc1: 0
Epoch: [0][   0/1042]	Time  6.849 ( 6.849)	Data  0.430 ( 0.430)	Loss 6.9316e+00 (6.9316e+00)	Acc@1   0.00 (  0.00)	Acc@5   0.00 (  0.00)

2. Scale down to 2 nodes by editing Worker.replicas to 2:
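
(For reference, the same edit can also be applied programmatically with the Kubernetes Python client; a minimal sketch, assuming the job is named elastic-example-imagenet-2 in the default namespace, both inferred from the pod names above:)

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig
# (use config.load_incluster_config() when running inside the cluster).
config.load_kube_config()

# PyTorchJob is a custom resource in the kubeflow.org/v1 API group.
api = client.CustomObjectsApi()
api.patch_namespaced_custom_object(
    group="kubeflow.org",
    version="v1",
    namespace="default",                # assumption: adjust to the job's namespace
    plural="pytorchjobs",
    name="elastic-example-imagenet-2",  # assumption: job name inferred from the pod names
    body={"spec": {"pytorchReplicaSpecs": {"Worker": {"replicas": 2}}}},
)
```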

# kubectl logs elastic-example-imagenet-2-worker-0
.............
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /log/none_4wzushrn
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=elastic-example-imagenet-2-worker-0
  master_port=34818
  group_rank=0
  group_world_size=3
  local_ranks=[0]
  role_ranks=[0]
  global_ranks=[0]
  role_world_sizes=[3]
  global_world_sizes=[3]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /log/none_4wzushrn/attempt_0/0/error.json
Namespace(arch='resnet18', batch_size=32, checkpoint_file='/log/checkpoint.pth.tar', data='/data/tiny-imagenet-200', dist_backend='gloo', epochs=20, lr=0.1, momentum=0.9, print_freq=10, weight_decay=0.0001, workers=0)
=> creating model: resnet18
=> no workers have checkpoints, starting from epoch 0
=> start_epoch: 0, best_acc1: 0
Epoch: [0][   0/1042]	Time  6.850 ( 6.850)	Data  0.795 ( 0.795)	Loss 7.2169e+00 (7.2169e+00)	Acc@1   0.00 (  0.00)	Acc@5   0.00 (  0.00)
Epoch: [0][  10/1042]	Time  5.161 ( 5.376)	Data  0.297 ( 0.328)	Loss 6.2373e+00 (6.4351e+00)	Acc@1   3.12 (  0.85)	Acc@5   3.12 (  2.27)
Traceback (most recent call last):
  File "/code/imagenet.py", line 589, in <module>
    main()
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/code/imagenet.py", line 190, in main
    train(train_loader, model, criterion, optimizer, epoch, print_freq)
  File "/code/imagenet.py", line 462, in train
    loss.backward()
  File "/opt/conda/lib/python3.7/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/__init__.py", line 156, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: [/opt/conda/conda-bld/pytorch_1634272168290/work/third_party/gloo/gloo/transport/tcp/pair.cc:598] Connection closed by peer [10.140.0.110]:62863

# kubectl get pods
NAME                                  READY   STATUS      RESTARTS   AGE
elastic-example-imagenet-2-worker-0   1/1     Running     1          2m22s
elastic-example-imagenet-2-worker-1   1/1     Running     1          2m22s


# kubectl logs elastic-example-imagenet-2-worker-0
Namespace(log_dir='/log', master_addr='127.0.0.1', master_port=29500, max_restarts=1, module=False, monitor_interval=5, nnodes='2:10', no_python=False, node_rank=0, nproc_per_node='1', rdzv_backend='c10d', rdzv_conf='', rdzv_endpoint='elastic-example-imagenet-2-worker-0:23456', rdzv_id='none', redirects='0', role='default', run_path=False, standalone=False, start_method='spawn', tee='0', training_script='/code/imagenet.py', training_script_args=['--arch=resnet18', '--epochs=20', '--batch-size=32', '--workers=0', '--checkpoint-file=/log/checkpoint.pth.tar', '/data/tiny-imagenet-200'])
args Namespace(log_dir='/log', master_addr='127.0.0.1', master_port=29500, max_restarts=1, module=False, monitor_interval=5, nnodes='2:10', no_python=False, node_rank=0, nproc_per_node='1', rdzv_backend='c10d', rdzv_conf='', rdzv_endpoint='elastic-example-imagenet-2-worker-0:23456', rdzv_id='none', redirects='0', role='default', run_path=False, standalone=False, start_method='spawn', tee='0', training_script='/code/imagenet.py', training_script_args=['--arch=resnet18', '--epochs=20', '--batch-size=32', '--workers=0', '--checkpoint-file=/log/checkpoint.pth.tar', '/data/tiny-imagenet-200']) 
config: LaunchConfig(min_nodes=2, max_nodes=10, nproc_per_node=1, run_id='none', role='default', rdzv_endpoint='elastic-example-imagenet-2-worker-0:23456', rdzv_backend='c10d', rdzv_configs={'timeout': 900}, rdzv_timeout=-1, max_restarts=1, monitor_interval=5, start_method='spawn', log_dir='/log', redirects=<Std.NONE: 0>, tee=<Std.NONE: 0>, metrics_cfg={}), cmd:  /opt/conda/bin/python , cmd_args ['-u', '/code/imagenet.py', '--arch=resnet18', '--epochs=20', '--batch-size=32', '--workers=0', '--checkpoint-file=/log/checkpoint.pth.tar', '/data/tiny-imagenet-200']
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
  entrypoint       : /code/imagenet.py
  min_nodes        : 2
  max_nodes        : 10
  nproc_per_node   : 1
  run_id           : none
  rdzv_backend     : c10d
  rdzv_endpoint    : elastic-example-imagenet-2-worker-0:23456
  rdzv_configs     : {'timeout': 900}
  max_restarts     : 1
  monitor_interval : 5
  log_dir          : /log
  metrics_cfg      : {}

INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /log/none_ghgeo4qi
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=elastic-example-imagenet-2-worker-0
  master_port=57700
  group_rank=0
  group_world_size=2
  local_ranks=[0]
  role_ranks=[0]
  global_ranks=[0]
  role_world_sizes=[2]
  global_world_sizes=[2]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /log/none_ghgeo4qi/attempt_0/0/error.json
Namespace(arch='resnet18', batch_size=32, checkpoint_file='/log/checkpoint.pth.tar', data='/data/tiny-imagenet-200', dist_backend='gloo', epochs=20, lr=0.1, momentum=0.9, print_freq=10, weight_decay=0.0001, workers=0)
=> creating model: resnet18
=> no workers have checkpoints, starting from epoch 0
=> start_epoch: 0, best_acc1: 0
Epoch: [0][   0/1563]	Time  7.108 ( 7.108)	Data  0.796 ( 0.796)	Loss 6.9429e+00 (6.9429e+00)	Acc@1   0.00 (  0.00)	Acc@5   0.00 (  0.00)
Epoch: [0][  10/1563]	Time  5.229 ( 5.409)	Data  0.227 ( 0.338)	Loss 6.6570e+00 (6.7249e+00)	Acc@1   0.00 (  0.85)	Acc@5   6.25 (  2.56)


3. Scale up to 3 nodes by editing Worker.replicas to 3:

 kubectl get pods
NAME                                  READY   STATUS              RESTARTS   AGE
elastic-example-imagenet-2-worker-0   1/1     Running             1          5m23s
elastic-example-imagenet-2-worker-1   1/1     Running             1          5m23s
elastic-example-imagenet-2-worker-2   0/1     ContainerCreating   0          5s


 kubectl logs elastic-example-imagenet-2-worker-0
.............
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /log/none_ghgeo4qi
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=elastic-example-imagenet-2-worker-0
  master_port=57700
  group_rank=0
  group_world_size=2
  local_ranks=[0]
  role_ranks=[0]
  global_ranks=[0]
  role_world_sizes=[2]
  global_world_sizes=[2]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /log/none_ghgeo4qi/attempt_0/0/error.json
Namespace(arch='resnet18', batch_size=32, checkpoint_file='/log/checkpoint.pth.tar', data='/data/tiny-imagenet-200', dist_backend='gloo', epochs=20, lr=0.1, momentum=0.9, print_freq=10, weight_decay=0.0001, workers=0)
=> creating model: resnet18
=> no workers have checkpoints, starting from epoch 0
=> start_epoch: 0, best_acc1: 0
Epoch: [0][   0/1563]	Time  7.108 ( 7.108)	Data  0.796 ( 0.796)	Loss 6.9429e+00 (6.9429e+00)	Acc@1   0.00 (  0.00)	Acc@5   0.00 (  0.00)
Epoch: [0][  10/1563]	Time  5.229 ( 5.409)	Data  0.227 ( 0.338)	Loss 6.6570e+00 (6.7249e+00)	Acc@1   0.00 (  0.85)	Acc@5   6.25 (  2.56)
INFO:torch.distributed.elastic.agent.server.api:[default] Detected 1 new nodes from group_rank=0; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 13 closing signal SIGTERM
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=elastic-example-imagenet-2-worker-0
  master_port=44591
  group_rank=0
  group_world_size=3
  local_ranks=[0]
  role_ranks=[0]
  global_ranks=[0]
  role_world_sizes=[3]
  global_world_sizes=[3]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /log/none_ghgeo4qi/attempt_0/0/error.json
Namespace(arch='resnet18', batch_size=32, checkpoint_file='/log/checkpoint.pth.tar', data='/data/tiny-imagenet-200', dist_backend='gloo', epochs=20, lr=0.1, momentum=0.9, print_freq=10, weight_decay=0.0001, workers=0)
=> creating model: resnet18
=> no workers have checkpoints, starting from epoch 0
=> start_epoch: 0, best_acc1: 0
Epoch: [0][   0/1042]	Time  6.091 ( 6.091)	Data  0.380 ( 0.380)	Loss 6.9973e+00 (6.9973e+00)	Acc@1   0.00 (  0.00)	Acc@5   0.00 (  0.00)
Epoch: [0][  10/1042]	Time  5.451 ( 5.355)	Data  0.254 ( 0.286)	Loss 6.3559e+00 (6.4264e+00)	Acc@1   3.12 (  0.57)	Acc@5   6.25 (  2.84)


Is it the original design that scaling down causes the worker pods to restart? What are the considerations that make it different from scaling up? And is the maximum number of scale-down operations limited by maxRestarts?


zw0610 commented Dec 14, 2021

@gaocegege

@gaocegege (Member)

> Is it the original design that scaling down causes the worker pods to restart?

Yes, PyTorch elastic relies on checkpointing and restarting.
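
The imagenet example in the logs above is launched with --checkpoint-file=/log/checkpoint.pth.tar; a minimal sketch of the save/resume pattern this relies on (simplified, not the exact example code):

```python
import os
import torch

CHECKPOINT_FILE = "/log/checkpoint.pth.tar"  # matches --checkpoint-file in the logs above

def save_checkpoint(model, optimizer, epoch, path=CHECKPOINT_FILE):
    # Persist enough state to resume after the worker group is restarted.
    torch.save(
        {
            "epoch": epoch,
            "state_dict": model.state_dict(),
            "optimizer": optimizer.state_dict(),
        },
        path,
    )

def load_checkpoint(model, optimizer, path=CHECKPOINT_FILE):
    # On every (re)start the workers try to resume from the shared checkpoint,
    # which is why training continues from a recent epoch instead of from scratch.
    if not os.path.isfile(path):
        return 0  # no checkpoint yet: start from epoch 0
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["state_dict"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["epoch"] + 1  # resume from the next epoch
```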

> What are the considerations that make it different from scaling up? And is the maximum number of scale-down operations limited by maxRestarts?

Sometimes the agent handles the restart itself, so the pod does not restart.

@tingweiwu (Author)

> Yes, PyTorch elastic relies on checkpointing and restarting.

Yeah, it's reasonable that the processes restart when scaling down happens, but I can't figure out why the pod needs to restart.

@gaocegege (Member)

> Yes, PyTorch elastic relies on checkpointing and restarting.

> Yeah, it's reasonable that the processes restart when scaling down happens, but I can't figure out why the pod needs to restart.

There are two processes in the pod: the agent and the worker. In most cases the agent does not restart; only the worker restarts.

@tingweiwu (Author)

> Sometimes the agent handles the restart itself, so the pod does not restart.

Could you be more specific about the trigger conditions for an agent restart (actually, the agent restarting its worker processes) versus a pod restart? And what is the relationship with elasticPolicy.maxRestarts?

@gaocegege (Member)

For example, if the worker cannot find the other workers, the agent and the worker will both exit, and the pod will restart. This does not affect elasticPolicy.maxRestarts.

elasticPolicy.maxRestarts corresponds to the --max_restarts option of torch.distributed.run: https://github.com/pytorch/pytorch/blob/df11e2d6f9782fc3995e17ef09a5ef3812da041d/torch/distributed/run.py#L205
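
For illustration, here is a minimal sketch of how the values printed in the worker logs above map onto the launcher API, with elasticPolicy.maxRestarts surfacing as max_restarts (a simplified example, not the operator's actual launch code):

```python
from torch.distributed.launcher.api import LaunchConfig, elastic_launch

def trainer():
    # Placeholder for the real training entrypoint (imagenet.py's main in this example).
    pass

# Values mirror the LaunchConfig printed in the worker logs above.
config = LaunchConfig(
    min_nodes=2,
    max_nodes=10,
    nproc_per_node=1,
    run_id="none",
    rdzv_backend="c10d",
    rdzv_endpoint="elastic-example-imagenet-2-worker-0:23456",
    rdzv_configs={"timeout": 900},
    max_restarts=1,      # <- elasticPolicy.maxRestarts
    monitor_interval=5,
    start_method="spawn",
)

if __name__ == "__main__":
    # The elastic agent process performs the rendezvous, spawns the worker subprocess,
    # and restarts the worker group when membership changes or a worker fails
    # (failures are retried up to max_restarts times).
    elastic_launch(config, trainer)()
```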

@gaocegege (Member)

> What are the considerations that make it different from scaling up? Is the maximum number of scale-down operations limited by maxRestarts?

It is a feature of PyTorch elastic. I think it is there to avoid permanent failure: for example, if the worker always fails, we should not retry forever.

gaocegege self-assigned this Dec 14, 2021
@tingweiwu (Author)

> For example, if the worker cannot find the other workers, the agent and the worker will both exit, and the pod will restart. This does not affect elasticPolicy.maxRestarts.

Thanks for your reply, I got it. So under this mechanism, a pod restart is inevitable when scaling down.

@gaocegege (Member)

> Thanks for your reply, I got it. So under this mechanism, a pod restart is inevitable when scaling down.

I think it may be possible to avoid, but in the current implementation it is inevitable. I am not sure whether the behavior will be the same in the next PyTorch release (1.11 or 1.10.2).


stale bot commented Apr 16, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot closed this as completed Apr 30, 2022