Scale down of PyTorchJob causes worker pods to restart #1509

Closed
tingweiwu opened this issue Dec 14, 2021 · 10 comments


tingweiwu commented Dec 14, 2021

Hi, I am running the imagenet example with PyTorch elastic.

1. First, I create a PyTorchJob with the following elasticPolicy:

spec:
  elasticPolicy:
    rdzvBackend: c10d
    minReplicas: 2
    maxReplicas: 10
    maxRestarts: 1

NAME                                  READY   STATUS      RESTARTS   AGE
elastic-example-imagenet-2-worker-0   1/1     Running     0          15s
elastic-example-imagenet-2-worker-1   1/1     Running     0          15s
elastic-example-imagenet-2-worker-2   1/1     Running     0          15s

 kubectl logs elastic-example-imagenet-2-worker-2
Namespace(log_dir='/log', master_addr='127.0.0.1', master_port=29500, max_restarts=1, module=False, monitor_interval=5, nnodes='2:10', no_python=False, node_rank=0, nproc_per_node='1', rdzv_backend='c10d', rdzv_conf='', rdzv_endpoint='elastic-example-imagenet-2-worker-0:23456', rdzv_id='none', redirects='0', role='default', run_path=False, standalone=False, start_method='spawn', tee='0', training_script='/code/imagenet.py', training_script_args=['--arch=resnet18', '--epochs=20', '--batch-size=32', '--workers=0', '--checkpoint-file=/log/checkpoint.pth.tar', '/data/tiny-imagenet-200'])
args Namespace(log_dir='/log', master_addr='127.0.0.1', master_port=29500, max_restarts=1, module=False, monitor_interval=5, nnodes='2:10', no_python=False, node_rank=0, nproc_per_node='1', rdzv_backend='c10d', rdzv_conf='', rdzv_endpoint='elastic-example-imagenet-2-worker-0:23456', rdzv_id='none', redirects='0', role='default', run_path=False, standalone=False, start_method='spawn', tee='0', training_script='/code/imagenet.py', training_script_args=['--arch=resnet18', '--epochs=20', '--batch-size=32', '--workers=0', '--checkpoint-file=/log/checkpoint.pth.tar', '/data/tiny-imagenet-200']) 
config: LaunchConfig(min_nodes=2, max_nodes=10, nproc_per_node=1, run_id='none', role='default', rdzv_endpoint='elastic-example-imagenet-2-worker-0:23456', rdzv_backend='c10d', rdzv_configs={'timeout': 900}, rdzv_timeout=-1, max_restarts=1, monitor_interval=5, start_method='spawn', log_dir='/log', redirects=<Std.NONE: 0>, tee=<Std.NONE: 0>, metrics_cfg={}), cmd:  /opt/conda/bin/python , cmd_args ['-u', '/code/imagenet.py', '--arch=resnet18', '--epochs=20', '--batch-size=32', '--workers=0', '--checkpoint-file=/log/checkpoint.pth.tar', '/data/tiny-imagenet-200']
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
  entrypoint       : /code/imagenet.py
  min_nodes        : 2
  max_nodes        : 10
  nproc_per_node   : 1
  run_id           : none
  rdzv_backend     : c10d
  rdzv_endpoint    : elastic-example-imagenet-2-worker-0:23456
  rdzv_configs     : {'timeout': 900}
  max_restarts     : 1
  monitor_interval : 5
  log_dir          : /log
  metrics_cfg      : {}

INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /log/none_unuhjzmc
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=elastic-example-imagenet-2-worker-0
  master_port=34818
  group_rank=2
  group_world_size=3
  local_ranks=[0]
  role_ranks=[2]
  global_ranks=[2]
  role_world_sizes=[3]
  global_world_sizes=[3]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /log/none_unuhjzmc/attempt_0/0/error.json
Namespace(arch='resnet18', batch_size=32, checkpoint_file='/log/checkpoint.pth.tar', data='/data/tiny-imagenet-200', dist_backend='gloo', epochs=20, lr=0.1, momentum=0.9, print_freq=10, weight_decay=0.0001, workers=0)
=> creating model: resnet18
=> no workers have checkpoints, starting from epoch 0
=> start_epoch: 0, best_acc1: 0
Epoch: [0][   0/1042]	Time  6.849 ( 6.849)	Data  0.430 ( 0.430)	Loss 6.9316e+00 (6.9316e+00)	Acc@1   0.00 (  0.00)	Acc@5   0.00 (  0.00)

2. Scale down to 2 nodes by editing Worker.replicas to 2:
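
(For reference, the same edit can also be applied programmatically with the Kubernetes Python client; a minimal sketch, assuming the job is named elastic-example-imagenet-2 in the default namespace, both inferred from the pod names above:)

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig
# (use config.load_incluster_config() when running inside the cluster).
config.load_kube_config()

# PyTorchJob is a custom resource in the kubeflow.org/v1 API group.
api = client.CustomObjectsApi()
api.patch_namespaced_custom_object(
    group="kubeflow.org",
    version="v1",
    namespace="default",                # assumption: adjust to the job's namespace
    plural="pytorchjobs",
    name="elastic-example-imagenet-2",  # assumption: job name inferred from the pod names
    body={"spec": {"pytorchReplicaSpecs": {"Worker": {"replicas": 2}}}},
)
```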

# kubectl logs elastic-example-imagenet-2-worker-0
.............
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /log/none_4wzushrn
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=elastic-example-imagenet-2-worker-0
  master_port=34818
  group_rank=0
  group_world_size=3
  local_ranks=[0]
  role_ranks=[0]
  global_ranks=[0]
  role_world_sizes=[3]
  global_world_sizes=[3]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /log/none_4wzushrn/attempt_0/0/error.json
Namespace(arch='resnet18', batch_size=32, checkpoint_file='/log/checkpoint.pth.tar', data='/data/tiny-imagenet-200', dist_backend='gloo', epochs=20, lr=0.1, momentum=0.9, print_freq=10, weight_decay=0.0001, workers=0)
=> creating model: resnet18
=> no workers have checkpoints, starting from epoch 0
=> start_epoch: 0, best_acc1: 0
Epoch: [0][   0/1042]	Time  6.850 ( 6.850)	Data  0.795 ( 0.795)	Loss 7.2169e+00 (7.2169e+00)	Acc@1   0.00 (  0.00)	Acc@5   0.00 (  0.00)
Epoch: [0][  10/1042]	Time  5.161 ( 5.376)	Data  0.297 ( 0.328)	Loss 6.2373e+00 (6.4351e+00)	Acc@1   3.12 (  0.85)	Acc@5   3.12 (  2.27)
Traceback (most recent call last):
  File "/code/imagenet.py", line 589, in <module>
    main()
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/code/imagenet.py", line 190, in main
    train(train_loader, model, criterion, optimizer, epoch, print_freq)
  File "/code/imagenet.py", line 462, in train
    loss.backward()
  File "/opt/conda/lib/python3.7/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/__init__.py", line 156, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: [/opt/conda/conda-bld/pytorch_1634272168290/work/third_party/gloo/gloo/transport/tcp/pair.cc:598] Connection closed by peer [10.140.0.110]:62863

# kubectl get pods
NAME                                  READY   STATUS      RESTARTS   AGE
elastic-example-imagenet-2-worker-0   1/1     Running     1          2m22s
elastic-example-imagenet-2-worker-1   1/1     Running     1          2m22s


# kubectl logs elastic-example-imagenet-2-worker-0
Namespace(log_dir='/log', master_addr='127.0.0.1', master_port=29500, max_restarts=1, module=False, monitor_interval=5, nnodes='2:10', no_python=False, node_rank=0, nproc_per_node='1', rdzv_backend='c10d', rdzv_conf='', rdzv_endpoint='elastic-example-imagenet-2-worker-0:23456', rdzv_id='none', redirects='0', role='default', run_path=False, standalone=False, start_method='spawn', tee='0', training_script='/code/imagenet.py', training_script_args=['--arch=resnet18', '--epochs=20', '--batch-size=32', '--workers=0', '--checkpoint-file=/log/checkpoint.pth.tar', '/data/tiny-imagenet-200'])
args Namespace(log_dir='/log', master_addr='127.0.0.1', master_port=29500, max_restarts=1, module=False, monitor_interval=5, nnodes='2:10', no_python=False, node_rank=0, nproc_per_node='1', rdzv_backend='c10d', rdzv_conf='', rdzv_endpoint='elastic-example-imagenet-2-worker-0:23456', rdzv_id='none', redirects='0', role='default', run_path=False, standalone=False, start_method='spawn', tee='0', training_script='/code/imagenet.py', training_script_args=['--arch=resnet18', '--epochs=20', '--batch-size=32', '--workers=0', '--checkpoint-file=/log/checkpoint.pth.tar', '/data/tiny-imagenet-200']) 
config: LaunchConfig(min_nodes=2, max_nodes=10, nproc_per_node=1, run_id='none', role='default', rdzv_endpoint='elastic-example-imagenet-2-worker-0:23456', rdzv_backend='c10d', rdzv_configs={'timeout': 900}, rdzv_timeout=-1, max_restarts=1, monitor_interval=5, start_method='spawn', log_dir='/log', redirects=<Std.NONE: 0>, tee=<Std.NONE: 0>, metrics_cfg={}), cmd:  /opt/conda/bin/python , cmd_args ['-u', '/code/imagenet.py', '--arch=resnet18', '--epochs=20', '--batch-size=32', '--workers=0', '--checkpoint-file=/log/checkpoint.pth.tar', '/data/tiny-imagenet-200']
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
  entrypoint       : /code/imagenet.py
  min_nodes        : 2
  max_nodes        : 10
  nproc_per_node   : 1
  run_id           : none
  rdzv_backend     : c10d
  rdzv_endpoint    : elastic-example-imagenet-2-worker-0:23456
  rdzv_configs     : {'timeout': 900}
  max_restarts     : 1
  monitor_interval : 5
  log_dir          : /log
  metrics_cfg      : {}

INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /log/none_ghgeo4qi
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=elastic-example-imagenet-2-worker-0
  master_port=57700
  group_rank=0
  group_world_size=2
  local_ranks=[0]
  role_ranks=[0]
  global_ranks=[0]
  role_world_sizes=[2]
  global_world_sizes=[2]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /log/none_ghgeo4qi/attempt_0/0/error.json
Namespace(arch='resnet18', batch_size=32, checkpoint_file='/log/checkpoint.pth.tar', data='/data/tiny-imagenet-200', dist_backend='gloo', epochs=20, lr=0.1, momentum=0.9, print_freq=10, weight_decay=0.0001, workers=0)
=> creating model: resnet18
=> no workers have checkpoints, starting from epoch 0
=> start_epoch: 0, best_acc1: 0
Epoch: [0][   0/1563]	Time  7.108 ( 7.108)	Data  0.796 ( 0.796)	Loss 6.9429e+00 (6.9429e+00)	Acc@1   0.00 (  0.00)	Acc@5   0.00 (  0.00)
Epoch: [0][  10/1563]	Time  5.229 ( 5.409)	Data  0.227 ( 0.338)	Loss 6.6570e+00 (6.7249e+00)	Acc@1   0.00 (  0.85)	Acc@5   6.25 (  2.56)


3. Scale up to 3 nodes by editing Worker.replicas to 3:

 kubectl get pods
NAME                                  READY   STATUS              RESTARTS   AGE
elastic-example-imagenet-2-worker-0   1/1     Running             1          5m23s
elastic-example-imagenet-2-worker-1   1/1     Running             1          5m23s
elastic-example-imagenet-2-worker-2   0/1     ContainerCreating   0          5s


 kubectl logs elastic-example-imagenet-2-worker-0
.............
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /log/none_ghgeo4qi
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=elastic-example-imagenet-2-worker-0
  master_port=57700
  group_rank=0
  group_world_size=2
  local_ranks=[0]
  role_ranks=[0]
  global_ranks=[0]
  role_world_sizes=[2]
  global_world_sizes=[2]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /log/none_ghgeo4qi/attempt_0/0/error.json
Namespace(arch='resnet18', batch_size=32, checkpoint_file='/log/checkpoint.pth.tar', data='/data/tiny-imagenet-200', dist_backend='gloo', epochs=20, lr=0.1, momentum=0.9, print_freq=10, weight_decay=0.0001, workers=0)
=> creating model: resnet18
=> no workers have checkpoints, starting from epoch 0
=> start_epoch: 0, best_acc1: 0
Epoch: [0][   0/1563]	Time  7.108 ( 7.108)	Data  0.796 ( 0.796)	Loss 6.9429e+00 (6.9429e+00)	Acc@1   0.00 (  0.00)	Acc@5   0.00 (  0.00)
Epoch: [0][  10/1563]	Time  5.229 ( 5.409)	Data  0.227 ( 0.338)	Loss 6.6570e+00 (6.7249e+00)	Acc@1   0.00 (  0.85)	Acc@5   6.25 (  2.56)
INFO:torch.distributed.elastic.agent.server.api:[default] Detected 1 new nodes from group_rank=0; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 13 closing signal SIGTERM
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=elastic-example-imagenet-2-worker-0
  master_port=44591
  group_rank=0
  group_world_size=3
  local_ranks=[0]
  role_ranks=[0]
  global_ranks=[0]
  role_world_sizes=[3]
  global_world_sizes=[3]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /log/none_ghgeo4qi/attempt_0/0/error.json
Namespace(arch='resnet18', batch_size=32, checkpoint_file='/log/checkpoint.pth.tar', data='/data/tiny-imagenet-200', dist_backend='gloo', epochs=20, lr=0.1, momentum=0.9, print_freq=10, weight_decay=0.0001, workers=0)
=> creating model: resnet18
=> no workers have checkpoints, starting from epoch 0
=> start_epoch: 0, best_acc1: 0
Epoch: [0][   0/1042]	Time  6.091 ( 6.091)	Data  0.380 ( 0.380)	Loss 6.9973e+00 (6.9973e+00)	Acc@1   0.00 (  0.00)	Acc@5   0.00 (  0.00)
Epoch: [0][  10/1042]	Time  5.451 ( 5.355)	Data  0.254 ( 0.286)	Loss 6.3559e+00 (6.4264e+00)	Acc@1   3.12 (  0.57)	Acc@5   6.25 (  2.84)


Is it the original design that scaling down causes the worker pods to restart? What are the considerations that make it different from scaling up? And is the maximum number of scale-down operations limited by maxRestarts?


zw0610 commented Dec 14, 2021

@gaocegege

@gaocegege (Member)

> Is it the original design that scaling down causes the worker pods to restart?

Yes, PyTorch elastic relies on checkpointing and restarting.
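
The imagenet example in the logs above is launched with --checkpoint-file=/log/checkpoint.pth.tar; a minimal sketch of the save/resume pattern this relies on (simplified, not the exact example code):

```python
import os
import torch

CHECKPOINT_FILE = "/log/checkpoint.pth.tar"  # matches --checkpoint-file in the logs above

def save_checkpoint(model, optimizer, epoch, path=CHECKPOINT_FILE):
    # Persist enough state to resume after the worker group is restarted.
    torch.save(
        {
            "epoch": epoch,
            "state_dict": model.state_dict(),
            "optimizer": optimizer.state_dict(),
        },
        path,
    )

def load_checkpoint(model, optimizer, path=CHECKPOINT_FILE):
    # On every (re)start the workers try to resume from the shared checkpoint,
    # which is why training continues from a recent epoch instead of from scratch.
    if not os.path.isfile(path):
        return 0  # no checkpoint yet: start from epoch 0
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["state_dict"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["epoch"] + 1  # resume from the next epoch
```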

> What are the considerations that make it different from scaling up? And is the maximum number of scale-down operations limited by maxRestarts?

Sometimes the agent handles the restart itself, so the pod does not restart.

@tingweiwu (Author)

> Yes, PyTorch elastic relies on checkpointing and restarting.

Yeah, it's reasonable that the processes restart when scaling down happens, but I can't figure out why the pod needs to restart.

@gaocegege (Member)

> Yes, PyTorch elastic relies on checkpointing and restarting.

> Yeah, it's reasonable that the processes restart when scaling down happens, but I can't figure out why the pod needs to restart.

There are two processes in the pod: the agent and the worker. In most cases the agent does not restart; only the worker restarts.

@tingweiwu (Author)

> Sometimes the agent handles the restart itself, so the pod does not restart.

Could you be more specific about the trigger conditions for an agent restart (actually, the agent restarting its worker processes) versus a pod restart? And what is the relationship with elasticPolicy.maxRestarts?

@gaocegege (Member)

For example, if the worker cannot find the other workers, the agent and the worker will both exit, and the pod will restart. This does not affect elasticPolicy.maxRestarts.

elasticPolicy.maxRestarts corresponds to the --max_restarts option of torch.distributed.run: https://github.com/pytorch/pytorch/blob/df11e2d6f9782fc3995e17ef09a5ef3812da041d/torch/distributed/run.py#L205
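
For illustration, here is a minimal sketch of how the values printed in the worker logs above map onto the launcher API, with elasticPolicy.maxRestarts surfacing as max_restarts (a simplified example, not the operator's actual launch code):

```python
from torch.distributed.launcher.api import LaunchConfig, elastic_launch

def trainer():
    # Placeholder for the real training entrypoint (imagenet.py's main in this example).
    pass

# Values mirror the LaunchConfig printed in the worker logs above.
config = LaunchConfig(
    min_nodes=2,
    max_nodes=10,
    nproc_per_node=1,
    run_id="none",
    rdzv_backend="c10d",
    rdzv_endpoint="elastic-example-imagenet-2-worker-0:23456",
    rdzv_configs={"timeout": 900},
    max_restarts=1,      # <- elasticPolicy.maxRestarts
    monitor_interval=5,
    start_method="spawn",
)

if __name__ == "__main__":
    # The elastic agent process performs the rendezvous, spawns the worker subprocess,
    # and restarts the worker group when membership changes or a worker fails
    # (failures are retried up to max_restarts times).
    elastic_launch(config, trainer)()
```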

@gaocegege (Member)

> What are the considerations that make it different from scaling up? Is the maximum number of scale-down operations limited by maxRestarts?

It is a feature of PyTorch elastic. I think it is there to avoid permanent failure: for example, if the worker always fails, we should not retry forever.

gaocegege self-assigned this Dec 14, 2021
@tingweiwu (Author)

> For example, if the worker cannot find the other workers, the agent and the worker will both exit, and the pod will restart. This does not affect elasticPolicy.maxRestarts.

Thanks for your reply, I got it. So under this mechanism, a pod restart is inevitable when scaling down.

@gaocegege (Member)

> Thanks for your reply, I got it. So under this mechanism, a pod restart is inevitable when scaling down.

I think it may be possible to avoid, but in the current implementation it is inevitable. I am not sure whether the behavior will be the same in the next PyTorch release (1.11 or 1.10.2).


stale bot commented Apr 16, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot closed this as completed Apr 30, 2022