Description
I have installed the adaptdl scheduler and am trying to use it for distributed training.
When I test CIFAR-10, the created AdaptDLJob fails when auto-scaling happens:
INFO:adaptdl.reducer:rank 0 of 2 connecting to 172.30.133.85 on port 47001
INFO:adaptdl.reducer:Master waiting for connections on 47001
INFO:adaptdl.torch:Initializing torch.distributed using tcp://172.30.133.85:40087?rank=0&world_size=2
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
INFO:adaptdl.torch:torch.distributed initialized
Using downloaded and verified file: ./data/cifar-10-python.tar.gz
Extracting ./data/cifar-10-python.tar.gz to ./data
Files already downloaded and verified
Traceback (most recent call last):
  File "/workspace/test_adaptdl.py", line 144, in <module>
    main()
  File "/workspace/test_adaptdl.py", line 127, in main
    model = adaptdl.torch.AdaptiveDataParallel(model, optimizer)
  File "/opt/conda/lib/python3.7/site-packages/adaptdl/torch/parallel.py", line 89, in __init__
    adaptdl.checkpoint.load_state(self._state)
  File "/opt/conda/lib/python3.7/site-packages/adaptdl/checkpoint.py", line 204, in load_state
    state.load(f)
  File "/opt/conda/lib/python3.7/site-packages/adaptdl/torch/parallel.py", line 228, in load
    self.optimizer.load_state_dict(state_dicts[1])
  File "/opt/conda/lib/python3.7/site-packages/torch/optim/optimizer.py", line 214, in load_state_dict
    self.__setstate__({'state': state, 'param_groups': param_groups})
  File "/opt/conda/lib/python3.7/site-packages/torch/optim/adam.py", line 100, in __setstate__
    step_is_tensor = (len(state_values) != 0) and torch.is_tensor(state_values[0]['step'])
KeyError: 'step'
This error does not occur if I use a stateless optimizer such as SGD instead of Adam. I found that it happens because the optimizer initialized in the new pods also initializes GradientNoiseScale, which stores some default values directly in the optimizer's state. However, the key "step" is missing from those defaults, which makes loading from the checkpoint fail. So I manually added this key and everything works fine now; the patched __init__ is shown below, followed by a minimal standalone repro sketch.
class GradientNoiseScale(object):
    def __init__(self, adp, optimizer,
                 mp_scaler=None,
                 num_replicas=None,
                 accum_scale=None):
        self._adp = adp
        self._optimizer = optimizer
        self._orig_optimizer_zero_grad = optimizer.zero_grad
        self._should_zero_grad = True
        self._mp_scaler = mp_scaler
        self._local_sqr = None
        self._num_replicas = (num_replicas if num_replicas is not None
                              else torch.distributed.get_world_size())
        self._accum_scale = accum_scale or self._num_replicas
        self._prev_grads = None
        self.reset_accumulation()
        self._optimizer.state.setdefault("gns", {
            "progress": 0.0,
            "prev_scale": 0.0,
            "sqr_avg": np.ones(len(optimizer.param_groups)),
            "var_avg": np.zeros(len(optimizer.param_groups)),
            "biased": False,
            # add this line
            "step": torch.tensor(0.),
        })
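For anyone who wants to reproduce this outside of adaptdl, here is a minimal sketch of the mechanism as I understand it: any non-parameter entry stored directly in optimizer.state is serialized by state_dict(), and on PyTorch versions where Adam.__setstate__ inspects state_values[0]['step'] (as in the traceback above), restoring such a checkpoint raises KeyError: 'step' if the entry has no "step" key. The model and the trimmed "gns" dict below are placeholders for illustration, not the actual adaptdl code path.

import torch

model = torch.nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters())

# GradientNoiseScale keeps its bookkeeping directly in optimizer.state, so it
# gets serialized into the checkpoint next to the per-parameter Adam entries.
# (Trimmed to two fields here; the real dict is shown above.)
optimizer.state.setdefault("gns", {"progress": 0.0, "prev_scale": 0.0})

ckpt = optimizer.state_dict()  # the non-parameter "gns" entry is carried along as-is

# A fresh optimizer on the rescaled replica restores the checkpoint.
# load_state_dict() calls Adam.__setstate__, which reads state_values[0]['step'];
# the "gns" entry has no "step" key, hence KeyError: 'step'.
new_model = torch.nn.Linear(4, 2)
new_optimizer = torch.optim.Adam(new_model.parameters())
new_optimizer.load_state_dict(ckpt)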