Description
I have installed the adaptdl scheduler and am trying to use it for distributed training.
When I test CIFAR-10, the created AdaptDLJob fails when auto-scaling happens:
INFO:adaptdl.reducer:rank 0 of 2 connecting to 172.30.133.85 on port 47001
INFO:adaptdl.reducer:Master waiting for connections on 47001
INFO:adaptdl.torch:Initializing torch.distributed using tcp://172.30.133.85:40087?rank=0&world_size=2
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
INFO:adaptdl.torch:torch.distributed initialized
Using downloaded and verified file: ./data/cifar-10-python.tar.gz
Extracting ./data/cifar-10-python.tar.gz to ./data
Files already downloaded and verified
Traceback (most recent call last):
  File "/workspace/test_adaptdl.py", line 144, in <module>
    main()
  File "/workspace/test_adaptdl.py", line 127, in main
    model = adaptdl.torch.AdaptiveDataParallel(model, optimizer)
  File "/opt/conda/lib/python3.7/site-packages/adaptdl/torch/parallel.py", line 89, in __init__
    adaptdl.checkpoint.load_state(self._state)
  File "/opt/conda/lib/python3.7/site-packages/adaptdl/checkpoint.py", line 204, in load_state
    state.load(f)
  File "/opt/conda/lib/python3.7/site-packages/adaptdl/torch/parallel.py", line 228, in load
    self.optimizer.load_state_dict(state_dicts[1])
  File "/opt/conda/lib/python3.7/site-packages/torch/optim/optimizer.py", line 214, in load_state_dict
    self.__setstate__({'state': state, 'param_groups': param_groups})
  File "/opt/conda/lib/python3.7/site-packages/torch/optim/adam.py", line 100, in __setstate__
    step_is_tensor = (len(state_values) != 0) and torch.is_tensor(state_values[0]['step'])
KeyError: 'step'
This error does not occur if I use a stateless optimizer such as SGD instead of Adam. I found that it happens because the optimizer initialized in the new pods also initializes GradientNoiseScale, which stores some default values directly in the optimizer's state. However, the key "step" is missing from those defaults, which makes loading from the checkpoint fail. So I manually added this key and everything works fine now; the patched __init__ is shown below, followed by a minimal standalone repro sketch.
class GradientNoiseScale(object):
    def __init__(self, adp, optimizer,
                 mp_scaler=None,
                 num_replicas=None,
                 accum_scale=None):
        self._adp = adp
        self._optimizer = optimizer
        self._orig_optimizer_zero_grad = optimizer.zero_grad
        self._should_zero_grad = True
        self._mp_scaler = mp_scaler
        self._local_sqr = None
        self._num_replicas = (num_replicas if num_replicas is not None
                              else torch.distributed.get_world_size())
        self._accum_scale = accum_scale or self._num_replicas
        self._prev_grads = None
        self.reset_accumulation()
        self._optimizer.state.setdefault("gns", {
            "progress": 0.0,
            "prev_scale": 0.0,
            "sqr_avg": np.ones(len(optimizer.param_groups)),
            "var_avg": np.zeros(len(optimizer.param_groups)),
            "biased": False,
            # add this line
            "step": torch.tensor(0.),
        })
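For anyone who wants to reproduce this outside of adaptdl, here is a minimal sketch of the mechanism as I understand it: any non-parameter entry stored directly in optimizer.state is serialized by state_dict(), and on PyTorch versions where Adam.__setstate__ inspects state_values[0]['step'] (as in the traceback above), restoring such a checkpoint raises KeyError: 'step' if the entry has no "step" key. The model and the trimmed "gns" dict below are placeholders for illustration, not the actual adaptdl code path.

import torch

model = torch.nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters())

# GradientNoiseScale keeps its bookkeeping directly in optimizer.state, so it
# gets serialized into the checkpoint next to the per-parameter Adam entries.
# (Trimmed to two fields here; the real dict is shown above.)
optimizer.state.setdefault("gns", {"progress": 0.0, "prev_scale": 0.0})

ckpt = optimizer.state_dict()  # the non-parameter "gns" entry is carried along as-is

# A fresh optimizer on the rescaled replica restores the checkpoint.
# load_state_dict() calls Adam.__setstate__, which reads state_values[0]['step'];
# the "gns" entry has no "step" key, hence KeyError: 'step'.
new_model = torch.nn.Linear(4, 2)
new_optimizer = torch.optim.Adam(new_model.parameters())
new_optimizer.load_state_dict(ckpt)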