[BUG]- Training on multiple GPU's gives error in accelerator.backward(loss) #3775

Open

ab6995 opened this issue Jun 20, 2023 · 4 comments

Labels: bug, training

ab6995 commented Jun 20, 2023

Training on multiple GPUs gives an error in accelerator.backward(loss). Full traceback:
/root/Finetuning/peft_qlora_example.py:374 in │
│ │
│ 371 │
│ 372 │
│ 373 if __name__ == "__main__": │
│ ❱ 374 │ main() │
│ 375 │
│ │
│ /root/Finetuning/peft_qlora_example.py:268 in main │
│ │
│ 265 │ │ │ │ outputs = model(**batch) │
│ 266 │ │ │ │ loss = outputs.loss │
│ 267 │ │ │ │ total_loss += loss.detach().float() │
│ ❱ 268 │ │ │ │ accelerator.backward(loss) │
│ 269 │ │ │ │ optimizer.step() │
│ 270 │ │ │ │ lr_scheduler.step() │
│ 271 │ │ │ │ optimizer.zero_grad() │
│ │
│ /opt/conda/lib/python3.9/site-packages/accelerate/accelerator.py:1835 in backward │
│ │
│ 1832 │ │ │ # deepspeed handles loss scaling by gradient_accumulation_steps in its back │
│ 1833 │ │ │ loss = loss / self.gradient_accumulation_steps │
│ 1834 │ │ if self.distributed_type == DistributedType.DEEPSPEED: │
│ ❱ 1835 │ │ │ self.deepspeed_engine_wrapped.backward(loss, **kwargs) │
│ 1836 │ │ elif self.distributed_type == DistributedType.MEGATRON_LM: │
│ 1837 │ │ │ return │
│ 1838 │ │ elif self.scaler is not None: │
│ │
│ /opt/conda/lib/python3.9/site-packages/accelerate/utils/deepspeed.py:167 in backward │
│ │
│ 164 │ │
│ 165 │ def backward(self, loss, **kwargs): │
│ 166 │ │ # runs backpropagation and handles mixed precision │
│ ❱ 167 │ │ self.engine.backward(loss, **kwargs) │
│ 168 │ │
│ 169 │ │ # Deepspeed's `engine.step` performs the following operations: │
│ 170 │ │ # - gradient accumulation check │
│ │
│ /opt/conda/lib/python3.9/site-packages/deepspeed/utils/nvtx.py:11 in wrapped_fn │
│ │
│ 8 │ function call.""" │
│ 9 │ def wrapped_fn(*args, **kwargs): │
│ 10 │ │ get_accelerator().range_push(func.__qualname__) │
│ ❱ 11 │ │ ret_val = func(*args, **kwargs) │
│ 12 │ │ get_accelerator().range_pop() │
│ 13 │ │ return ret_val │
│ 14 │
│ │
│ /opt/conda/lib/python3.9/site-packages/deepspeed/runtime/engine.py:1980 in backward │
│ │
│ 1977 │ │ if self.zero_optimization(): │
│ 1978 │ │ │ self.optimizer.is_gradient_accumulation_boundary = self.is_gradient_accumula │
│ 1979 │ │ │ ) │
│ ❱ 1980 │ │ │ self.optimizer.backward(loss, retain_graph=retain_graph) │
│ 1981 │ │ elif self.amp_enabled(): │
│ 1982 │ │ │ # AMP requires delaying unscale when inside gradient accumulation boundaries │
│ 1983 │ │ │ # https://nvidia.github.io/apex/advanced.html#gradient-accumulation-across-i
│ │
│ /opt/conda/lib/python3.9/site-packages/deepspeed/utils/nvtx.py:11 in wrapped_fn │
│ │
│ 8 │ function call.""" │
│ 9 │ def wrapped_fn(*args, **kwargs): │
│ 10 │ │ get_accelerator().range_push(func.__qualname__) │
│ ❱ 11 │ │ ret_val = func(*args, **kwargs) │
│ 12 │ │ get_accelerator().range_pop() │
│ 13 │ │ return ret_val │
│ 14 │
│ │
│ /opt/conda/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py:2086 in backward │
│ │
│ 2083 │ │ │ scaled_loss = self.external_loss_scale * loss │
│ 2084 │ │ │ scaled_loss.backward() │
│ 2085 │ │ else: │
│ ❱ 2086 │ │ │ self.loss_scaler.backward(loss.float(), retain_graph=retain_graph) │
│ 2087 │ │ │
│ 2088 │ │ self._get_param_coordinator(training=True).reset_step() │
│ 2089 │
│ │
│ /opt/conda/lib/python3.9/site-packages/deepspeed/runtime/fp16/loss_scaler.py:56 in backward │
│ │
│ 53 │ │
│ 54 │ def backward(self, loss, retain_graph=False): │
│ 55 │ │ scaled_loss = loss * self.loss_scale │
│ ❱ 56 │ │ scaled_loss.backward(retain_graph=retain_graph) │
│ 57 │
│ 58 │
│ 59 class LossScaler(LossScalerBase): │
│ │
│ /opt/conda/lib/python3.9/site-packages/torch/_tensor.py:487 in backward │
│ │
│ 484 │ │ │ │ create_graph=create_graph, │
│ 485 │ │ │ │ inputs=inputs, │
│ 486 │ │ │ ) │
│ ❱ 487 │ │ torch.autograd.backward( │
│ 488 │ │ │ self, gradient, retain_graph, create_graph, inputs=inputs │
│ 489 │ │ ) │
│ 490 │
│ │
│ /opt/conda/lib/python3.9/site-packages/torch/autograd/__init__.py:197 in backward │
│ │
│ 194 │ # The reason we repeat same the comment below is that │
│ 195 │ # some Python versions print out the first line of a multi-line function │
│ 196 │ # calls in the traceback and some print out the last line │
│ ❱ 197 │ Variable._execution_engine.run_backward( # Calls into the C++ engine to run the bac │
│ 198 │ │ tensors, grad_tensors_, retain_graph, create_graph, inputs, │
│ 199 │ │ allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to ru │
│ 200 │
│ │
│ /opt/conda/lib/python3.9/site-packages/torch/autograd/function.py:267 in apply │
│ │
│ 264 │ │ │ │ │ │ │ "Function is not allowed. You should only implement one " │
│ 265 │ │ │ │ │ │ │ "of them.") │
│ 266 │ │ user_fn = vjp_fn if vjp_fn is not Function.vjp else backward_fn │
│ ❱ 267 │ │ return user_fn(self, *args) │
│ 268 │ │
│ 269 │ def apply_jvp(self, *args): │
│ 270 │ │ # _forward_cls is defined by derived class │
│ │
│ /opt/conda/lib/python3.9/site-packages/torch/utils/checkpoint.py:141 in backward │
│ │
│ 138 │ │ │ with torch.enable_grad(), \ │
│ 139 │ │ │ │ torch.cuda.amp.autocast(**ctx.gpu_autocast_kwargs), \ │
│ 140 │ │ │ │ torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs): │
│ ❱ 141 │ │ │ │ outputs = ctx.run_function(*detached_inputs) │
│ 142 │ │ │
│ 143 │ │ if isinstance(outputs, torch.Tensor): │
│ 144 │ │ │ outputs = (outputs,) │
│ │
│ /opt/conda/lib/python3.9/site-packages/transformers/models/mt5/modeling_mt5.py:1045 in │
│ custom_forward │
│ │
│ 1042 │ │ │ │ │
│ 1043 │ │ │ │ def create_custom_forward(module): │
│ 1044 │ │ │ │ │ def custom_forward(*inputs): │
│ ❱ 1045 │ │ │ │ │ │ return tuple(module(*inputs, use_cache, output_attentions)) │
│ 1046 │ │ │ │ │ │
│ 1047 │ │ │ │ │ return custom_forward │
│ 1048 │
│ │
│ /opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py:1208 in _call_impl │
│ │
│ 1205 │ │ │ bw_hook = hooks.BackwardHook(self, full_backward_hooks) │
│ 1206 │ │ │ input = bw_hook.setup_input_hook(input) │
│ 1207 │ │ │
│ ❱ 1208 │ │ result = forward_call(input, **kwargs) │
│ 1209 │ │ if _global_forward_hooks or self._forward_hooks: │
│ 1210 │ │ │ for hook in (*_global_forward_hooks.values(), *self._forward_hooks.values()) │
│ 1211 │ │ │ │ hook_result = hook(self, input, result) │
│ │
│ /opt/conda/lib/python3.9/site-packages/accelerate/hooks.py:165 in new_forward │
│ │
│ 162 │ │ │ with torch.no_grad(): │
│ 163 │ │ │ │ output = old_forward(*args, **kwargs) │
│ 164 │ │ else: │
│ ❱ 165 │ │ │ output = old_forward(*args, **kwargs) │
│ 166 │ │ return module._hf_hook.post_forward(module, output) │
│ 167 │ │
│ 168 │ module.forward = new_forward │
│ │
│ /opt/conda/lib/python3.9/site-packages/transformers/models/mt5/modeling_mt5.py:557 in forward │
│ │
│ 554 │ │ else: │
│ 555 │ │ │ self_attn_past_key_value, cross_attn_past_key_value = None, None │
│ 556 │ │ │
│ ❱ 557 │ │ self_attention_outputs = self.layer[0]( │
│ 558 │ │ │ hidden_states, │
│ 559 │ │ │ attention_mask=attention_mask, │
│ 560 │ │ │ position_bias=position_bias, │
│ │
│ /opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py:1208 in _call_impl │
│ │
│ 1205 │ │ │ bw_hook = hooks.BackwardHook(self, full_backward_hooks) │
│ 1206 │ │ │ input = bw_hook.setup_input_hook(input) │
│ 1207 │ │ │
│ ❱ 1208 │ │ result = forward_call(input, **kwargs) │
│ 1209 │ │ if _global_forward_hooks or self._forward_hooks: │
│ 1210 │ │ │ for hook in (*_global_forward_hooks.values(), *self._forward_hooks.values()) │
│ 1211 │ │ │ │ hook_result = hook(self, input, result) │
│ │
│ /opt/conda/lib/python3.9/site-packages/accelerate/hooks.py:165 in new_forward │
│ │
│ 162 │ │ │ with torch.no_grad(): │
│ 163 │ │ │ │ output = old_forward(*args, **kwargs) │
│ 164 │ │ else: │
│ ❱ 165 │ │ │ output = old_forward(*args, **kwargs) │
│ 166 │ │ return module._hf_hook.post_forward(module, output) │
│ 167 │ │
│ 168 │ module.forward = new_forward │
│ │
│ /opt/conda/lib/python3.9/site-packages/transformers/models/mt5/modeling_mt5.py:462 in forward │
│ │
│ 459 │ │ output_attentions=False, │
│ 460 │ ): │
│ 461 │ │ normed_hidden_states = self.layer_norm(hidden_states) │
│ ❱ 462 │ │ attention_output = self.SelfAttention( │
│ 463 │ │ │ normed_hidden_states, │
│ 464 │ │ │ mask=attention_mask, │
│ 465 │ │ │ position_bias=position_bias, │
│ │
│ /opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py:1208 in _call_impl │
│ │
│ 1205 │ │ │ bw_hook = hooks.BackwardHook(self, full_backward_hooks) │
│ 1206 │ │ │ input = bw_hook.setup_input_hook(input) │
│ 1207 │ │ │
│ ❱ 1208 │ │ result = forward_call(input, **kwargs) │
│ 1209 │ │ if _global_forward_hooks or self._forward_hooks: │
│ 1210 │ │ │ for hook in (*_global_forward_hooks.values(), *self._forward_hooks.values()) │
│ 1211 │ │ │ │ hook_result = hook(self, input, result) │
│ │
│ /opt/conda/lib/python3.9/site-packages/accelerate/hooks.py:165 in new_forward │
│ │
│ 162 │ │ │ with torch.no_grad(): │
│ 163 │ │ │ │ output = old_forward(*args, **kwargs) │
│ 164 │ │ else: │
│ ❱ 165 │ │ │ output = old_forward(*args, **kwargs) │
│ 166 │ │ return module._hf_hook.post_forward(module, output) │
│ 167 │ │
│ 168 │ module.forward = new_forward │
│ │
│ /opt/conda/lib/python3.9/site-packages/transformers/models/mt5/modeling_mt5.py:380 in forward │
│ │
│ 377 │ │ │ return hidden_states │
│ 378 │ │ │
│ 379 │ │ # get query states │
│ ❱ 380 │ │ query_states = shape(self.q(hidden_states)) # (batch_size, n_heads, seq_length, │
│ 381 │ │ │
│ 382 │ │ # get key/value states │
│ 383 │ │ key_states = project( │
│ │
│ /opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py:1197 in _call_impl │
│ │
│ 1194 │ │ │ full_backward_hooks, non_full_backward_hooks = self._get_backward_hooks() │
│ 1195 │ │ if _global_forward_pre_hooks or self._forward_pre_hooks: │
│ 1196 │ │ │ for hook in (*_global_forward_pre_hooks.values(), *self._forward_pre_hooks.v │
│ ❱ 1197 │ │ │ │ result = hook(self, input) │
│ 1198 │ │ │ │ if result is not None: │
│ 1199 │ │ │ │ │ if not isinstance(result, tuple): │
│ 1200 │ │ │ │ │ │ result = (result,) │
│ │
│ /opt/conda/lib/python3.9/site-packages/deepspeed/utils/nvtx.py:11 in wrapped_fn │
│ │
│ 8 │ function call.""" │
│ 9 │ def wrapped_fn(*args, **kwargs): │
│ 10 │ │ get_accelerator().range_push(func.__qualname__) │
│ ❱ 11 │ │ ret_val = func(*args, **kwargs) │
│ 12 │ │ get_accelerator().range_pop() │
│ 13 │ │ return ret_val │
│ 14 │
│ │
│ /opt/conda/lib/python3.9/site-packages/deepspeed/runtime/zero/parameter_offload.py:348 in │
│ _pre_forward_module_hook │
│ │
│ 345 │ │ │
│ 346 │ │ @instrument_w_nvtx │
│ 347 │ │ def _pre_forward_module_hook(module, *args): │
│ ❱ 348 │ │ │ self.pre_sub_module_forward_function(module) │
│ 349 │ │ │
│ 350 │ │ @instrument_w_nvtx │
│ 351 │ │ def _post_forward_module_hook(module, input, output): │
│ │
│ /opt/conda/lib/python3.9/site-packages/torch/autograd/grad_mode.py:27 in decorate_context │
│ │
│ 24 │ │ @functools.wraps(func) │
│ 25 │ │ def decorate_context(*args, **kwargs): │
│ 26 │ │ │ with self.clone(): │
│ ❱ 27 │ │ │ │ return func(*args, **kwargs) │
│ 28 │ │ return cast(F, decorate_context) │
│ 29 │ │
│ 30 │ def _wrap_generator(self, func): │
│ │
│ /opt/conda/lib/python3.9/site-packages/deepspeed/runtime/zero/parameter_offload.py:478 in │
│ pre_sub_module_forward_function │
│ │
│ 475 │ │ param_coordinator.trace_prologue(sub_module) │
│ 476 │ │ if param_coordinator.is_record_trace(): │
│ 477 │ │ │ param_coordinator.record_module(sub_module) │
│ ❱ 478 │ │ param_coordinator.fetch_sub_module(sub_module) │
│ 479 │ │ │
│ 480 │ │ see_memory_usage( │
│ 481 │ │ │ f"Before sub module function {sub_module.class.name} after fetch", │
│ │
│ /opt/conda/lib/python3.9/site-packages/deepspeed/utils/nvtx.py:11 in wrapped_fn │
│ │
│ 8 │ function call.""" │
│ 9 │ def wrapped_fn(*args, **kwargs): │
│ 10 │ │ get_accelerator().range_push(func.__qualname__) │
│ ❱ 11 │ │ ret_val = func(*args, **kwargs) │
│ 12 │ │ get_accelerator().range_pop() │
│ 13 │ │ return ret_val │
│ 14 │
│ │
│ /opt/conda/lib/python3.9/site-packages/torch/autograd/grad_mode.py:27 in decorate_context │
│ │
│ 24 │ │ @functools.wraps(func) │
│ 25 │ │ def decorate_context(*args, **kwargs): │
│ 26 │ │ │ with self.clone(): │
│ ❱ 27 │ │ │ │ return func(*args, **kwargs) │
│ 28 │ │ return cast(F, decorate_context) │
│ 29 │ │
│ 30 │ def _wrap_generator(self, func): │
│ │
│ /opt/conda/lib/python3.9/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py:258 in fetch_sub_module │
│ │
│ 255 │ │ # kick off all gather for params in the immediately required submodule │
│ 256 │ │ for param in params_to_fetch: │
│ 257 │ │ │ debug_rank0(f"-fetch: {param.ds_summary()}") │
│ ❱ 258 │ │ self.__all_gather_params(params_to_fetch) │
│ 259 │ │ │
│ 260 │ │ # wait for parameters in the immediately needed submodule to become available │
│ 261 │ │ for param in params_to_fetch: │
│ │
│ /opt/conda/lib/python3.9/site-packages/deepspeed/utils/nvtx.py:11 in wrapped_fn │
│ │
│ 8 │ function call.""" │
│ 9 │ def wrapped_fn(*args, **kwargs): │
│ 10 │ │ get_accelerator().range_push(func.__qualname__) │
│ ❱ 11 │ │ ret_val = func(*args, **kwargs) │
│ 12 │ │ get_accelerator().range_pop() │
│ 13 │ │ return ret_val │
│ 14 │
│ │
│ /opt/conda/lib/python3.9/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py:399 in __all_gather_params │
│ │
│ 396 │ │ │
│ 397 │ │ if partitioned_params: │
│ 398 │ │ │ with get_accelerator().stream(self.__allgather_stream): │
│ ❱ 399 │ │ │ │ handle = partitioned_params[0].all_gather_coalesced(partitioned_params) │
│ 400 │ │ │ │
│ 401 │ │ │ for param in partitioned_params: │
│ 402 │ │ │ │ assert param.ds_status == ZeroParamStatus.INFLIGHT, param.ds_summary() │
│ │
│ /opt/conda/lib/python3.9/site-packages/deepspeed/utils/nvtx.py:11 in wrapped_fn │
│ │
│ 8 │ function call.""" │
│ 9 │ def wrapped_fn(*args, **kwargs): │
│ 10 │ │ get_accelerator().range_push(func.__qualname__) │
│ ❱ 11 │ │ ret_val = func(*args, **kwargs) │
│ 12 │ │ get_accelerator().range_pop() │
│ 13 │ │ return ret_val │
│ 14 │
│ │
│ /opt/conda/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py:860 in │
│ all_gather_coalesced │
│ │
│ 857 │ │ │ else: │
│ 858 │ │ │ │ partition_sz = sum(p.ds_tensor.ds_numel for p in params) │
│ 859 │ │ │ │ flat_tensor = torch.empty(partition_sz * self.world_size, │
│ ❱ 860 │ │ │ │ │ │ │ │ │ │ dtype=get_only_unique_item(p.dtype │
│ 861 │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ for p in params), │
│ 862 │ │ │ │ │ │ │ │ │ │ device=get_accelerator().current_device_name() │
│ 863 │ │ │ │ │ │ │ │ │ │ requires_grad=False) │
│ │
│ /opt/conda/lib/python3.9/site-packages/deepspeed/runtime/utils.py:870 in get_only_unique_item │
│ │
│ 867 def get_only_unique_item(items): │
│ 868 │ item_set = set(items) │
│ 869 │ if len(item_set) != 1: │
│ ❱ 870 │ │ raise RuntimeError(f"expected there to be only one unique element in {items}") │
│ 871 │ unique_item, = item_set │
│ 872 │ │
│ 873 │ return unique_item │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: expected there to be only one unique element in <generator object Init._convert_to_deepspeed_param.<locals>.all_gather_coalesced.<locals>.<genexpr> at 0x7f7019a30890>
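A minimal stand-alone sketch of the failing check: the helper is copied from the deepspeed/runtime/utils.py frame above, while the example dtypes are only an assumption matching a QLoRA setup (quantized int8 base weights next to fp16 adapter weights), not values taken from this run.

```python
import torch

# Copy of get_only_unique_item from the deepspeed/runtime/utils.py frame above:
# it requires every item -- here, every parameter dtype in a coalesced all-gather
# fetch group -- to be identical.
def get_only_unique_item(items):
    item_set = set(items)
    if len(item_set) != 1:
        raise RuntimeError(f"expected there to be only one unique element in {items}")
    unique_item, = item_set
    return unique_item

# Assumed dtypes for illustration: an int8 quantized base weight gathered together
# with an fp16 adapter weight, as can happen in a QLoRA + ZeRO-3 fetch group.
params = [torch.zeros(4, dtype=torch.int8), torch.zeros(4, dtype=torch.float16)]
get_only_unique_item(p.dtype for p in params)  # raises the RuntimeError shown above
```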


Expected behavior
Training should start on multiple GPUs and the loss should be backpropagated to update the model weights.

ds_report output
[2023-06-20 21:30:31,134] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-06-20 21:30:32,061] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)

DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [YES] ...... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/opt/conda/lib/python3.10/site-packages/torch']
torch version .................... 2.0.0
deepspeed install path ........... ['/opt/conda/lib/python3.10/site-packages/DeepSpeed/deepspeed']
deepspeed info ................... 0.9.5+044dd0e2, 044dd0e, master
torch cuda version ............... 11.8
torch hip version ................ None
nvcc version ..................... 11.8
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.8


System info (please complete the following information):

  • GPU count and types: AWS g4dn.12xlarge (4× NVIDIA T4)
  • Python version : 3.10

Launcher context
accelerate launch --config_file peft_lora_example_config.yaml peft_qlora_example.py

Config_yaml:
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 32
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: fp16
num_machines: 1
num_processes: 3
rdzv_backend: static
same_network: true
use_cpu: false

Additional context
The same script runs successfully on a single GPU.

ab6995 added the bug and training labels on Jun 20, 2023
@ShahinNamin

I have the same problem; is there a workaround in the meantime?

@dionman

dionman commented Jul 18, 2023

This happens because with ZeRO-3 some parameters end up with a float dtype while others are int8.
I've updated lines 1273-1280 (the flat_tensor allocation in deepspeed/runtime/zero/partition_parameters.py, i.e. the site shown in the traceback above) as follows, and the code goes through in my case:

flat_tensor = torch.empty(
    partition_sz * world_size,
    # Hard-code a single dtype instead of asserting that all params share one:
    dtype=torch.bfloat16,
    # dtype=get_only_unique_item(p.dtype
    #                            for p in params) if not quant else torch.int8,
    device=get_accelerator().current_device(),
    requires_grad=False,
)

I doubt that this hack is a generic workaround for every ds_config, and it's unclear whether it implies a loss of efficiency. It would be great to get some follow-up from the library developers on why this error is raised.
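Before patching DeepSpeed, it can help to confirm where the dtype mixture comes from. Below is a hedged diagnostic sketch (the helper name and the depth argument are mine, not from this thread): it walks model.named_parameters() before accelerator.prepare and flags submodule prefixes whose parameters mix dtypes, i.e. exactly the condition get_only_unique_item rejects.

```python
from collections import defaultdict

def report_param_dtypes(model, depth=2):
    """Print the set of parameter dtypes per submodule prefix so that
    int8/float mixtures can be spotted before training starts."""
    dtypes_by_prefix = defaultdict(set)
    for name, param in model.named_parameters():
        prefix = ".".join(name.split(".")[:depth])
        dtypes_by_prefix[prefix].add(str(param.dtype))
    for prefix, dtypes in sorted(dtypes_by_prefix.items()):
        marker = "  <-- mixed dtypes" if len(dtypes) > 1 else ""
        print(f"{prefix}: {sorted(dtypes)}{marker}")

# Usage (hypothetical): report_param_dtypes(model) right after building the PEFT model,
# before accelerator.prepare(model, ...)
```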

@pranayj77

> This happens because with ZeRO-3 some parameters end up with a float dtype while others are int8. I've updated lines 1273-1280 as follows, and the code goes through in my case: […]

Where exactly did you make these changes? I am having the same error while using the peft library with LoRA to fine-tune a Llama 2 model.

@CODE-FOR

This is because you are using multiple torch dtypes; try to unify the dtype you are using.
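A minimal sketch of what "unifying" could mean here, under the assumption (mine, not confirmed by the maintainers) that casting all floating-point parameters to one compute dtype before accelerator.prepare, while leaving quantized integer weights untouched, avoids the mixed-dtype fetch group:

```python
import torch

def unify_trainable_dtypes(model, dtype=torch.bfloat16):
    """Cast floating-point parameters to a single dtype; leave quantized
    integer (int8/uint8) base weights untouched."""
    for param in model.parameters():
        if param.is_floating_point() and param.dtype != dtype:
            param.data = param.data.to(dtype)
    return model

# Usage (hypothetical): call before accelerator.prepare(model, ...)
# model = unify_trainable_dtypes(model, dtype=torch.bfloat16)
```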
