[BUG]- Training on multiple GPU's gives error in accelerator.backward(loss) #3775

Open

ab6995 opened this issue Jun 20, 2023 · 4 comments

Labels: bug, training

ab6995 commented Jun 20, 2023

Training on multiple GPUs gives an error in accelerator.backward(loss). Full traceback:
/root/Finetuning/peft_qlora_example.py:374 in │
│ │
│ 371 │
│ 372 │
│ 373 if __name__ == "__main__": │
│ ❱ 374 │ main() │
│ 375 │
│ │
│ /root/Finetuning/peft_qlora_example.py:268 in main │
│ │
│ 265 │ │ │ │ outputs = model(**batch) │
│ 266 │ │ │ │ loss = outputs.loss │
│ 267 │ │ │ │ total_loss += loss.detach().float() │
│ ❱ 268 │ │ │ │ accelerator.backward(loss) │
│ 269 │ │ │ │ optimizer.step() │
│ 270 │ │ │ │ lr_scheduler.step() │
│ 271 │ │ │ │ optimizer.zero_grad() │
│ │
│ /opt/conda/lib/python3.9/site-packages/accelerate/accelerator.py:1835 in backward │
│ │
│ 1832 │ │ │ # deepspeed handles loss scaling by gradient_accumulation_steps in its back │
│ 1833 │ │ │ loss = loss / self.gradient_accumulation_steps │
│ 1834 │ │ if self.distributed_type == DistributedType.DEEPSPEED: │
│ ❱ 1835 │ │ │ self.deepspeed_engine_wrapped.backward(loss, **kwargs) │
│ 1836 │ │ elif self.distributed_type == DistributedType.MEGATRON_LM: │
│ 1837 │ │ │ return │
│ 1838 │ │ elif self.scaler is not None: │
│ │
│ /opt/conda/lib/python3.9/site-packages/accelerate/utils/deepspeed.py:167 in backward │
│ │
│ 164 │ │
│ 165 │ def backward(self, loss, **kwargs): │
│ 166 │ │ # runs backpropagation and handles mixed precision │
│ ❱ 167 │ │ self.engine.backward(loss, **kwargs) │
│ 168 │ │
│ 169 │ │ # Deepspeed's `engine.step` performs the following operations: │
│ 170 │ │ # - gradient accumulation check │
│ │
│ /opt/conda/lib/python3.9/site-packages/deepspeed/utils/nvtx.py:11 in wrapped_fn │
│ │
│ 8 │ function call.""" │
│ 9 │ def wrapped_fn(*args, **kwargs): │
│ 10 │ │ get_accelerator().range_push(func.__qualname__) │
│ ❱ 11 │ │ ret_val = func(*args, **kwargs) │
│ 12 │ │ get_accelerator().range_pop() │
│ 13 │ │ return ret_val │
│ 14 │
│ │
│ /opt/conda/lib/python3.9/site-packages/deepspeed/runtime/engine.py:1980 in backward │
│ │
│ 1977 │ │ if self.zero_optimization(): │
│ 1978 │ │ │ self.optimizer.is_gradient_accumulation_boundary = self.is_gradient_accumula │
│ 1979 │ │ │ ) │
│ ❱ 1980 │ │ │ self.optimizer.backward(loss, retain_graph=retain_graph) │
│ 1981 │ │ elif self.amp_enabled(): │
│ 1982 │ │ │ # AMP requires delaying unscale when inside gradient accumulation boundaries │
│ 1983 │ │ │ # https://nvidia.github.io/apex/advanced.html#gradient-accumulation-across-i
│ │
│ /opt/conda/lib/python3.9/site-packages/deepspeed/utils/nvtx.py:11 in wrapped_fn │
│ │
│ 8 │ function call.""" │
│ 9 │ def wrapped_fn(*args, **kwargs): │
│ 10 │ │ get_accelerator().range_push(func.__qualname__) │
│ ❱ 11 │ │ ret_val = func(*args, **kwargs) │
│ 12 │ │ get_accelerator().range_pop() │
│ 13 │ │ return ret_val │
│ 14 │
│ │
│ /opt/conda/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py:2086 in backward │
│ │
│ 2083 │ │ │ scaled_loss = self.external_loss_scale * loss │
│ 2084 │ │ │ scaled_loss.backward() │
│ 2085 │ │ else: │
│ ❱ 2086 │ │ │ self.loss_scaler.backward(loss.float(), retain_graph=retain_graph) │
│ 2087 │ │ │
│ 2088 │ │ self._get_param_coordinator(training=True).reset_step() │
│ 2089 │
│ │
│ /opt/conda/lib/python3.9/site-packages/deepspeed/runtime/fp16/loss_scaler.py:56 in backward │
│ │
│ 53 │ │
│ 54 │ def backward(self, loss, retain_graph=False): │
│ 55 │ │ scaled_loss = loss * self.loss_scale │
│ ❱ 56 │ │ scaled_loss.backward(retain_graph=retain_graph) │
│ 57 │
│ 58 │
│ 59 class LossScaler(LossScalerBase): │
│ │
│ /opt/conda/lib/python3.9/site-packages/torch/_tensor.py:487 in backward │
│ │
│ 484 │ │ │ │ create_graph=create_graph, │
│ 485 │ │ │ │ inputs=inputs, │
│ 486 │ │ │ ) │
│ ❱ 487 │ │ torch.autograd.backward( │
│ 488 │ │ │ self, gradient, retain_graph, create_graph, inputs=inputs │
│ 489 │ │ ) │
│ 490 │
│ │
│ /opt/conda/lib/python3.9/site-packages/torch/autograd/__init__.py:197 in backward │
│ │
│ 194 │ # The reason we repeat same the comment below is that │
│ 195 │ # some Python versions print out the first line of a multi-line function │
│ 196 │ # calls in the traceback and some print out the last line │
│ ❱ 197 │ Variable._execution_engine.run_backward( # Calls into the C++ engine to run the bac │
│ 198 │ │ tensors, grad_tensors_, retain_graph, create_graph, inputs, │
│ 199 │ │ allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to ru │
│ 200 │
│ │
│ /opt/conda/lib/python3.9/site-packages/torch/autograd/function.py:267 in apply │
│ │
│ 264 │ │ │ │ │ │ │ "Function is not allowed. You should only implement one " │
│ 265 │ │ │ │ │ │ │ "of them.") │
│ 266 │ │ user_fn = vjp_fn if vjp_fn is not Function.vjp else backward_fn │
│ ❱ 267 │ │ return user_fn(self, *args) │
│ 268 │ │
│ 269 │ def apply_jvp(self, *args): │
│ 270 │ │ # _forward_cls is defined by derived class │
│ │
│ /opt/conda/lib/python3.9/site-packages/torch/utils/checkpoint.py:141 in backward │
│ │
│ 138 │ │ │ with torch.enable_grad(), \ │
│ 139 │ │ │ │ torch.cuda.amp.autocast(**ctx.gpu_autocast_kwargs), \ │
│ 140 │ │ │ │ torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs): │
│ ❱ 141 │ │ │ │ outputs = ctx.run_function(*detached_inputs) │
│ 142 │ │ │
│ 143 │ │ if isinstance(outputs, torch.Tensor): │
│ 144 │ │ │ outputs = (outputs,) │
│ │
│ /opt/conda/lib/python3.9/site-packages/transformers/models/mt5/modeling_mt5.py:1045 in │
│ custom_forward │
│ │
│ 1042 │ │ │ │ │
│ 1043 │ │ │ │ def create_custom_forward(module): │
│ 1044 │ │ │ │ │ def custom_forward(*inputs): │
│ ❱ 1045 │ │ │ │ │ │ return tuple(module(*inputs, use_cache, output_attentions)) │
│ 1046 │ │ │ │ │ │
│ 1047 │ │ │ │ │ return custom_forward │
│ 1048 │
│ │
│ /opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py:1208 in _call_impl │
│ │
│ 1205 │ │ │ bw_hook = hooks.BackwardHook(self, full_backward_hooks) │
│ 1206 │ │ │ input = bw_hook.setup_input_hook(input) │
│ 1207 │ │ │
│ ❱ 1208 │ │ result = forward_call(input, **kwargs) │
│ 1209 │ │ if _global_forward_hooks or self._forward_hooks: │
│ 1210 │ │ │ for hook in (*_global_forward_hooks.values(), *self._forward_hooks.values()) │
│ 1211 │ │ │ │ hook_result = hook(self, input, result) │
│ │
│ /opt/conda/lib/python3.9/site-packages/accelerate/hooks.py:165 in new_forward │
│ │
│ 162 │ │ │ with torch.no_grad(): │
│ 163 │ │ │ │ output = old_forward(*args, **kwargs) │
│ 164 │ │ else: │
│ ❱ 165 │ │ │ output = old_forward(*args, **kwargs) │
│ 166 │ │ return module._hf_hook.post_forward(module, output) │
│ 167 │ │
│ 168 │ module.forward = new_forward │
│ │
│ /opt/conda/lib/python3.9/site-packages/transformers/models/mt5/modeling_mt5.py:557 in forward │
│ │
│ 554 │ │ else: │
│ 555 │ │ │ self_attn_past_key_value, cross_attn_past_key_value = None, None │
│ 556 │ │ │
│ ❱ 557 │ │ self_attention_outputs = self.layer[0]( │
│ 558 │ │ │ hidden_states, │
│ 559 │ │ │ attention_mask=attention_mask, │
│ 560 │ │ │ position_bias=position_bias, │
│ │
│ /opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py:1208 in _call_impl │
│ │
│ 1205 │ │ │ bw_hook = hooks.BackwardHook(self, full_backward_hooks) │
│ 1206 │ │ │ input = bw_hook.setup_input_hook(input) │
│ 1207 │ │ │
│ ❱ 1208 │ │ result = forward_call(input, **kwargs) │
│ 1209 │ │ if _global_forward_hooks or self._forward_hooks: │
│ 1210 │ │ │ for hook in (*_global_forward_hooks.values(), *self._forward_hooks.values()) │
│ 1211 │ │ │ │ hook_result = hook(self, input, result) │
│ │
│ /opt/conda/lib/python3.9/site-packages/accelerate/hooks.py:165 in new_forward │
│ │
│ 162 │ │ │ with torch.no_grad(): │
│ 163 │ │ │ │ output = old_forward(*args, **kwargs) │
│ 164 │ │ else: │
│ ❱ 165 │ │ │ output = old_forward(*args, **kwargs) │
│ 166 │ │ return module._hf_hook.post_forward(module, output) │
│ 167 │ │
│ 168 │ module.forward = new_forward │
│ │
│ /opt/conda/lib/python3.9/site-packages/transformers/models/mt5/modeling_mt5.py:462 in forward │
│ │
│ 459 │ │ output_attentions=False, │
│ 460 │ ): │
│ 461 │ │ normed_hidden_states = self.layer_norm(hidden_states) │
│ ❱ 462 │ │ attention_output = self.SelfAttention( │
│ 463 │ │ │ normed_hidden_states, │
│ 464 │ │ │ mask=attention_mask, │
│ 465 │ │ │ position_bias=position_bias, │
│ │
│ /opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py:1208 in _call_impl │
│ │
│ 1205 │ │ │ bw_hook = hooks.BackwardHook(self, full_backward_hooks) │
│ 1206 │ │ │ input = bw_hook.setup_input_hook(input) │
│ 1207 │ │ │
│ ❱ 1208 │ │ result = forward_call(input, **kwargs) │
│ 1209 │ │ if _global_forward_hooks or self._forward_hooks: │
│ 1210 │ │ │ for hook in (*_global_forward_hooks.values(), *self._forward_hooks.values()) │
│ 1211 │ │ │ │ hook_result = hook(self, input, result) │
│ │
│ /opt/conda/lib/python3.9/site-packages/accelerate/hooks.py:165 in new_forward │
│ │
│ 162 │ │ │ with torch.no_grad(): │
│ 163 │ │ │ │ output = old_forward(*args, **kwargs) │
│ 164 │ │ else: │
│ ❱ 165 │ │ │ output = old_forward(*args, **kwargs) │
│ 166 │ │ return module._hf_hook.post_forward(module, output) │
│ 167 │ │
│ 168 │ module.forward = new_forward │
│ │
│ /opt/conda/lib/python3.9/site-packages/transformers/models/mt5/modeling_mt5.py:380 in forward │
│ │
│ 377 │ │ │ return hidden_states │
│ 378 │ │ │
│ 379 │ │ # get query states │
│ ❱ 380 │ │ query_states = shape(self.q(hidden_states)) # (batch_size, n_heads, seq_length, │
│ 381 │ │ │
│ 382 │ │ # get key/value states │
│ 383 │ │ key_states = project( │
│ │
│ /opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py:1197 in _call_impl │
│ │
│ 1194 │ │ │ full_backward_hooks, non_full_backward_hooks = self._get_backward_hooks() │
│ 1195 │ │ if _global_forward_pre_hooks or self._forward_pre_hooks: │
│ 1196 │ │ │ for hook in (*_global_forward_pre_hooks.values(), *self._forward_pre_hooks.v │
│ ❱ 1197 │ │ │ │ result = hook(self, input) │
│ 1198 │ │ │ │ if result is not None: │
│ 1199 │ │ │ │ │ if not isinstance(result, tuple): │
│ 1200 │ │ │ │ │ │ result = (result,) │
│ │
│ /opt/conda/lib/python3.9/site-packages/deepspeed/utils/nvtx.py:11 in wrapped_fn │
│ │
│ 8 │ function call.""" │
│ 9 │ def wrapped_fn(*args, **kwargs): │
│ 10 │ │ get_accelerator().range_push(func.__qualname__) │
│ ❱ 11 │ │ ret_val = func(*args, **kwargs) │
│ 12 │ │ get_accelerator().range_pop() │
│ 13 │ │ return ret_val │
│ 14 │
│ │
│ /opt/conda/lib/python3.9/site-packages/deepspeed/runtime/zero/parameter_offload.py:348 in │
│ _pre_forward_module_hook │
│ │
│ 345 │ │ │
│ 346 │ │ @instrument_w_nvtx │
│ 347 │ │ def _pre_forward_module_hook(module, *args): │
│ ❱ 348 │ │ │ self.pre_sub_module_forward_function(module) │
│ 349 │ │ │
│ 350 │ │ @instrument_w_nvtx │
│ 351 │ │ def _post_forward_module_hook(module, input, output): │
│ │
│ /opt/conda/lib/python3.9/site-packages/torch/autograd/grad_mode.py:27 in decorate_context │
│ │
│ 24 │ │ @functools.wraps(func) │
│ 25 │ │ def decorate_context(*args, **kwargs): │
│ 26 │ │ │ with self.clone(): │
│ ❱ 27 │ │ │ │ return func(*args, **kwargs) │
│ 28 │ │ return cast(F, decorate_context) │
│ 29 │ │
│ 30 │ def _wrap_generator(self, func): │
│ │
│ /opt/conda/lib/python3.9/site-packages/deepspeed/runtime/zero/parameter_offload.py:478 in │
│ pre_sub_module_forward_function │
│ │
│ 475 │ │ param_coordinator.trace_prologue(sub_module) │
│ 476 │ │ if param_coordinator.is_record_trace(): │
│ 477 │ │ │ param_coordinator.record_module(sub_module) │
│ ❱ 478 │ │ param_coordinator.fetch_sub_module(sub_module) │
│ 479 │ │ │
│ 480 │ │ see_memory_usage( │
│ 481 │ │ │ f"Before sub module function {sub_module.class.name} after fetch", │
│ │
│ /opt/conda/lib/python3.9/site-packages/deepspeed/utils/nvtx.py:11 in wrapped_fn │
│ │
│ 8 │ function call.""" │
│ 9 │ def wrapped_fn(*args, **kwargs): │
│ 10 │ │ get_accelerator().range_push(func.__qualname__) │
│ ❱ 11 │ │ ret_val = func(*args, **kwargs) │
│ 12 │ │ get_accelerator().range_pop() │
│ 13 │ │ return ret_val │
│ 14 │
│ │
│ /opt/conda/lib/python3.9/site-packages/torch/autograd/grad_mode.py:27 in decorate_context │
│ │
│ 24 │ │ @functools.wraps(func) │
│ 25 │ │ def decorate_context(*args, **kwargs): │
│ 26 │ │ │ with self.clone(): │
│ ❱ 27 │ │ │ │ return func(*args, **kwargs) │
│ 28 │ │ return cast(F, decorate_context) │
│ 29 │ │
│ 30 │ def _wrap_generator(self, func): │
│ │
│ /opt/conda/lib/python3.9/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py:258 in fetch_sub_module │
│ │
│ 255 │ │ # kick off all gather for params in the immediately required submodule │
│ 256 │ │ for param in params_to_fetch: │
│ 257 │ │ │ debug_rank0(f"-fetch: {param.ds_summary()}") │
│ ❱ 258 │ │ self.__all_gather_params(params_to_fetch) │
│ 259 │ │ │
│ 260 │ │ # wait for parameters in the immediately needed submodule to become available │
│ 261 │ │ for param in params_to_fetch: │
│ │
│ /opt/conda/lib/python3.9/site-packages/deepspeed/utils/nvtx.py:11 in wrapped_fn │
│ │
│ 8 │ function call.""" │
│ 9 │ def wrapped_fn(*args, **kwargs): │
│ 10 │ │ get_accelerator().range_push(func.__qualname__) │
│ ❱ 11 │ │ ret_val = func(*args, **kwargs) │
│ 12 │ │ get_accelerator().range_pop() │
│ 13 │ │ return ret_val │
│ 14 │
│ │
│ /opt/conda/lib/python3.9/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py:399 in __all_gather_params │
│ │
│ 396 │ │ │
│ 397 │ │ if partitioned_params: │
│ 398 │ │ │ with get_accelerator().stream(self.__allgather_stream): │
│ ❱ 399 │ │ │ │ handle = partitioned_params[0].all_gather_coalesced(partitioned_params) │
│ 400 │ │ │ │
│ 401 │ │ │ for param in partitioned_params: │
│ 402 │ │ │ │ assert param.ds_status == ZeroParamStatus.INFLIGHT, param.ds_summary() │
│ │
│ /opt/conda/lib/python3.9/site-packages/deepspeed/utils/nvtx.py:11 in wrapped_fn │
│ │
│ 8 │ function call.""" │
│ 9 │ def wrapped_fn(*args, **kwargs): │
│ 10 │ │ get_accelerator().range_push(func.__qualname__) │
│ ❱ 11 │ │ ret_val = func(*args, **kwargs) │
│ 12 │ │ get_accelerator().range_pop() │
│ 13 │ │ return ret_val │
│ 14 │
│ │
│ /opt/conda/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py:860 in │
│ all_gather_coalesced │
│ │
│ 857 │ │ │ else: │
│ 858 │ │ │ │ partition_sz = sum(p.ds_tensor.ds_numel for p in params) │
│ 859 │ │ │ │ flat_tensor = torch.empty(partition_sz * self.world_size, │
│ ❱ 860 │ │ │ │ │ │ │ │ │ │ dtype=get_only_unique_item(p.dtype │
│ 861 │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ for p in params), │
│ 862 │ │ │ │ │ │ │ │ │ │ device=get_accelerator().current_device_name() │
│ 863 │ │ │ │ │ │ │ │ │ │ requires_grad=False) │
│ │
│ /opt/conda/lib/python3.9/site-packages/deepspeed/runtime/utils.py:870 in get_only_unique_item │
│ │
│ 867 def get_only_unique_item(items): │
│ 868 │ item_set = set(items) │
│ 869 │ if len(item_set) != 1: │
│ ❱ 870 │ │ raise RuntimeError(f"expected there to be only one unique element in {items}") │
│ 871 │ unique_item, = item_set │
│ 872 │ │
│ 873 │ return unique_item │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: expected there to be only one unique element in <generator object Init._convert_to_deepspeed_param.<locals>.all_gather_coalesced.<locals>.<genexpr> at 0x7f7019a30890>
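A minimal stand-alone sketch of the failing check: the helper is copied from the deepspeed/runtime/utils.py frame above, while the example dtypes are only an assumption matching a QLoRA setup (quantized int8 base weights next to fp16 adapter weights), not values taken from this run.

```python
import torch

# Copy of get_only_unique_item from the deepspeed/runtime/utils.py frame above:
# it requires every item -- here, every parameter dtype in a coalesced all-gather
# fetch group -- to be identical.
def get_only_unique_item(items):
    item_set = set(items)
    if len(item_set) != 1:
        raise RuntimeError(f"expected there to be only one unique element in {items}")
    unique_item, = item_set
    return unique_item

# Assumed dtypes for illustration: an int8 quantized base weight gathered together
# with an fp16 adapter weight, as can happen in a QLoRA + ZeRO-3 fetch group.
params = [torch.zeros(4, dtype=torch.int8), torch.zeros(4, dtype=torch.float16)]
get_only_unique_item(p.dtype for p in params)  # raises the RuntimeError shown above
```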


Expected behavior
Training should start on multiple GPUs and the loss should be backpropagated to update the model weights.

ds_report output
[2023-06-20 21:30:31,134] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-06-20 21:30:32,061] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)

DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [YES] ...... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/opt/conda/lib/python3.10/site-packages/torch']
torch version .................... 2.0.0
deepspeed install path ........... ['/opt/conda/lib/python3.10/site-packages/DeepSpeed/deepspeed']
deepspeed info ................... 0.9.5+044dd0e2, 044dd0e, master
torch cuda version ............... 11.8
torch hip version ................ None
nvcc version ..................... 11.8
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.8


System info (please complete the following information):

  • GPU count and types: AWS g4dn.12xlarge (4× NVIDIA T4)
  • Python version : 3.10

Launcher context
accelerate launch --config_file peft_lora_example_config.yaml peft_qlora_example.py

Config_yaml:
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 32
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: fp16
num_machines: 1
num_processes: 3
rdzv_backend: static
same_network: true
use_cpu: false

Additional context
The same script runs successfully on a single GPU.

ab6995 added the bug and training labels on Jun 20, 2023
@ShahinNamin

I have the same problem; is there a workaround in the meantime?

@dionman

dionman commented Jul 18, 2023

This happens because with ZeRO-3 some parameters end up with a float dtype while others are int8.
I've updated lines 1273-1280 (the flat_tensor allocation in deepspeed/runtime/zero/partition_parameters.py, i.e. the site shown in the traceback above) as follows, and the code goes through in my case:

flat_tensor = torch.empty(
    partition_sz * world_size,
    # Hard-code a single dtype instead of asserting that all params share one:
    dtype=torch.bfloat16,
    # dtype=get_only_unique_item(p.dtype
    #                            for p in params) if not quant else torch.int8,
    device=get_accelerator().current_device(),
    requires_grad=False,
)

I doubt that this hack is a generic workaround for every ds_config, and it's unclear whether it implies a loss of efficiency. It would be great to get some follow-up from the library developers on why this error is raised.
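Before patching DeepSpeed, it can help to confirm where the dtype mixture comes from. Below is a hedged diagnostic sketch (the helper name and the depth argument are mine, not from this thread): it walks model.named_parameters() before accelerator.prepare and flags submodule prefixes whose parameters mix dtypes, i.e. exactly the condition get_only_unique_item rejects.

```python
from collections import defaultdict

def report_param_dtypes(model, depth=2):
    """Print the set of parameter dtypes per submodule prefix so that
    int8/float mixtures can be spotted before training starts."""
    dtypes_by_prefix = defaultdict(set)
    for name, param in model.named_parameters():
        prefix = ".".join(name.split(".")[:depth])
        dtypes_by_prefix[prefix].add(str(param.dtype))
    for prefix, dtypes in sorted(dtypes_by_prefix.items()):
        marker = "  <-- mixed dtypes" if len(dtypes) > 1 else ""
        print(f"{prefix}: {sorted(dtypes)}{marker}")

# Usage (hypothetical): report_param_dtypes(model) right after building the PEFT model,
# before accelerator.prepare(model, ...)
```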

@pranayj77

> This happens because with ZeRO-3 some parameters end up with a float dtype while others are int8. I've updated lines 1273-1280 as follows, and the code goes through in my case: […]

Where exactly did you make these changes? I am having the same error while using the peft library with LoRA to fine-tune a Llama 2 model.

@CODE-FOR

This is because you are using multiple torch dtypes; try to unify the dtype you are using.
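A minimal sketch of what "unifying" could mean here, under the assumption (mine, not confirmed by the maintainers) that casting all floating-point parameters to one compute dtype before accelerator.prepare, while leaving quantized integer weights untouched, avoids the mixed-dtype fetch group:

```python
import torch

def unify_trainable_dtypes(model, dtype=torch.bfloat16):
    """Cast floating-point parameters to a single dtype; leave quantized
    integer (int8/uint8) base weights untouched."""
    for param in model.parameters():
        if param.is_floating_point() and param.dtype != dtype:
            param.data = param.data.to(dtype)
    return model

# Usage (hypothetical): call before accelerator.prepare(model, ...)
# model = unify_trainable_dtypes(model, dtype=torch.bfloat16)
```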
