
[BUG] ValueError: max() arg is an empty sequence using bf16 zero stage3 #2820

Closed
sujithjoseph opened this issue Feb 12, 2023 · 20 comments · Fixed by #4277
@sujithjoseph commented Feb 12, 2023

│ /opt/conda/lib/python3.7/site-packages/deepspeed/runtime/zero/stage3.py:307  │
│ in <listcomp>                                                                │
│                                                                              │
│    304 │   │   │   max([                                                     │
│    305 │   │   │   │   max(tensor.numel(),                                   │
│    306 │   │   │   │   │   tensor.ds_numel) for tensor in fp16_partitioned_g │
│ ❱  307 │   │   │   ]) for fp16_partitioned_group in self.fp16_partitioned_gr │
│    308 │   │   ])                                                            │
│    309 │   │   print_rank_0(                                                 │
│    310 │   │   │   f'Largest partitioned param numel = {largest_partitioned_ │
╰──────────────────────────────────────────────────────────────────────────────╯
ValueError: max() arg is an empty sequence
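For context, the exception is raised while computing the largest partitioned parameter numel: if any fp16_partitioned_group is empty, the inner max() has nothing to reduce. A minimal self-contained sketch of that failure mode (the group contents here are made up):

# A partitioned parameter group with no parameters in it reproduces the error.
fp16_partitioned_groups = [[]]

try:
    largest_partitioned_param_numel = max(
        max(max(t.numel(), t.ds_numel) for t in group)
        for group in fp16_partitioned_groups
    )
except ValueError as e:
    print(e)  # max() arg is an empty sequence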

To Reproduce
Steps to reproduce the behavior:
This happened while fine-tuning the FLAN-T5 11B model. Here is the full error gist: https://gist.github.com/sujithjoseph/c410514acfccc76974a8130a8afd2169

Here is the DeepSpeed config: https://gist.github.com/sujithjoseph/92bf27de6bba704b57c3b9eb7aa00365

ds_report output
https://gist.github.com/sujithjoseph/c725de5fb38bb3c20e4fb6fd55f63848

System info (please complete the following information):

  • OS: Debian GNU/Linux 10 (buster)
  • GPU count and types: 1 machine with 4x A100 (40GB each)
  • Python version 3.7

Launcher context
Are you launching your experiment with the deepspeed launcher, MPI, or something else? Accelerate + PEFT

deepspeed_config:

deepspeed_config_file: zero_stage3_offload_config.json
zero3_init_flag: true

Additional context

I assume that the bf16 and fp16 config sections are interchangeable:

    "bf16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    }
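(For reference, the two sections may not be fully interchangeable: the loss-scale options above belong to the fp16 section, and loss scaling is not needed for bf16, so a minimal bf16 section only sets enabled. A sketch in the Python-dict form of a DeepSpeed config; the remaining keys are placeholders:)

# Minimal bf16 section; the loss-scale knobs are fp16-specific and should be
# ignored under bf16.
ds_config = {
    "bf16": {"enabled": True},
    # ... ZeRO stage 3 / offload settings unchanged
}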

sujithjoseph added the bug (Something isn't working) and training labels on Feb 12, 2023
@sujithjoseph (Author)

The error also appears with fp16 instead of bf16 in the DeepSpeed config, and with zero3_init_flag: false in the Accelerate config as well.

@sujithjoseph (Author)

With stage 2 and no offload, I get a different error:

 /opt/conda/lib/python3.7/site-packages/deepspeed/runtime/zero/stage_1_and_2. │
│ py:323 in __init__                                                           │
│                                                                              │
│    320 │   │   │   │   self.flatten_dense_tensors_aligned(                   │
│    321 │   │   │   │   │   self.round_robin_bit16_groups[i],                 │
│    322 │   │   │   │   │   self.nccl_start_alignment_factor *                │
│ ❱  323 │   │   │   │   │   dist.get_world_size(group=self.real_dp_process_gr │
│    324 │   │   │   │   │   │   torch.cuda.current_device()))                 │
│    325 │   │   │   see_memory_usage(f"After flattening and moving param grou │
│    326 │   │   │   │   │   │   │    force=False)                             │
│                                                                              │
│ /opt/conda/lib/python3.7/site-packages/deepspeed/runtime/zero/stage_1_and_2. │
│ py:862 in flatten_dense_tensors_aligned                                      │
│                                                                              │
│    859 │                                                                     │
│    860 │   # create a flat tensor aligned at the alignment boundary          │
│    861 │   def flatten_dense_tensors_aligned(self, tensor_list, alignment):  │
│ ❱  862 │   │   return self.flatten(align_dense_tensors(tensor_list, alignmen │
│    863 │                                                                     │
│    864 │   ############### Independent Partition Gradient ################## │
│    865 │   def reduce_independent_p_g_buckets_and_remove_grads(self, param,  │
╰──────────────────────────────────────────────────────────────────────────────╯
RuntimeError: torch.cat(): expected a non-empty list of Tensors

Could this be an issue with the dataset?
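A self-contained sketch (a frozen Linear layer standing in for a model with no trainable parameters) suggests the empty tensor list, not the dataset, is the more likely culprit:

import torch

model = torch.nn.Linear(4, 4)
for p in model.parameters():
    p.requires_grad = False  # e.g. everything frozen, as with a PEFT config using inference_mode=True

trainable = [p.flatten() for p in model.parameters() if p.requires_grad]
print(f"trainable tensors: {len(trainable)}")  # 0

try:
    torch.cat(trainable)  # roughly what flattening an empty bit16 group boils down to
except RuntimeError as e:
    print(e)  # torch.cat(): expected a non-empty list of Tensors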

@sujithjoseph (Author) commented Feb 13, 2023

Was able to sort it out using the Accelerate + DeepSpeed config below. Now dealing with an OOM issue, but I'm not sure why the previous DeepSpeed config didn't work.

compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: true
dynamo_backend: 'NO'
fsdp_config: {}
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: 'bf16'
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
use_cpu: false

@sujithjoseph (Author)

How can we estimate the number of GPUs needed (each with 40 GB) for flan-t5-11b with CPU param/optimizer offloading? The estimate I get is 0.49GB per GPU with offload_param=cpu, offload_optimizer=cpu, zero_init=1.

Estimated memory needed for params, optim states and gradients for a:
HW: Setup with 1 node, 4 GPUs per node.
SW: Model with 11003M total params, 131M largest layer params.
per CPU | per GPU | Options
276.70GB | 0.49GB | offload_param=cpu , offload_optimizer=cpu , zero_init=1
276.70GB | 0.49GB | offload_param=cpu , offload_optimizer=cpu , zero_init=0
245.95GB | 5.61GB | offload_param=none, offload_optimizer=cpu , zero_init=1
245.95GB | 5.61GB | offload_param=none, offload_optimizer=cpu , zero_init=0
2.94GB | 46.61GB | offload_param=none, offload_optimizer=none, zero_init=1
245.95GB | 46.61GB | offload_param=none, offload_optimizer=none, zero_init=0
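The table above is the output format of DeepSpeed's ZeRO-3 memory estimator; a sketch of how to regenerate it for this setup (loading the checkpoint on CPU first, which itself needs enough host RAM):

from transformers import AutoModelForSeq2SeqLM
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

# Load on CPU just to count parameters, then print the per-CPU / per-GPU
# estimates for each offload / zero_init combination, as in the table above.
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xxl")
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=4, num_nodes=1)

Note these estimates cover params, gradients and optimizer states only; activation memory, which grows with batch size and sequence length, is not included.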

@sujithjoseph (Author) commented Feb 13, 2023

With batch size 1, it works without OOM and I see at most 26903MiB used per GPU. How can we estimate the number of GPUs needed for a batch size of 4 or 8 without trial and error? With batch size 2 on 8x 40GB GPUs, it runs for some time (3-4 hours) with almost all 40GB utilized and then goes OOM. How can I cap the GPU memory used?
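(A note on the capping question: since the estimator above covers params, gradients and optimizer states only, activation memory growing with batch size is the likely OOM source, and a smaller per-device batch with more gradient_accumulation_steps is usually the practical fix. PyTorch does expose a hard cap on the caching allocator; a sketch, with the caveat that allocations beyond the cap simply fail with OOM rather than being throttled:)

import torch

# Cap each visible GPU's caching allocator at ~90% of its device memory.
for device in range(torch.cuda.device_count()):
    torch.cuda.set_per_process_memory_fraction(0.9, device=device)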

@sujithjoseph (Author)

With the following DeepSpeed config:

deepspeed_config:
  gradient_accumulation_steps: 2
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
  bf16:enabled: true

and
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
set in the code, would DeepSpeed use tf32 or bf16?

@tjruwase (Contributor)

@sujithjoseph, deepspeed should use bf16. Are you observing something different?

@sujithjoseph (Author)

@tjruwase, it did work with bf16. The only question I have is: can I use max_memory to restrict the memory used by the model during fine-tuning, like the snippet below used for inference?

max_memory = {0: "25GIB", "cpu": "120GB"}

model = load_checkpoint_and_dispatch(
    model, model_id, device_map="auto", max_memory=max_memory,
    no_split_module_classes=["T5Block"],
)

@tjruwase (Contributor)

Got it. I don't have experience with those memory restriction flags, which seem to be Accelerate flags. I don't think those flags are hooked into deepspeed. Can you please pose this question on their forum? I think we can work with them to enable the desired feature.

@zhenlohuang

@sujithjoseph I faced the same issues as you mentioned above; both the stage 2 and stage 3 errors were the same as yours. Did you find any workaround for this?

@shaowei-su

Ran into the exact same error when running DeepSpeed on Ray. Following this thread.

@SupetZYK

Same error (RuntimeError: torch.cat(): expected a non-empty list of Tensors) during accelerate.prepare. How can this be solved?

@tjruwase (Contributor)

@zhenlohuang, @shaowei-su, @SupetZYK, it seems that @sujithjoseph resolved the original issue with the config in #2820 (comment).

If the workaround does not work for you, please open a new issue and share details to help us repro. Thanks!

@shaowei-su

@tjruwase I was able to run DS + stage 3 + fp16 by disabling the optimizer section in the DS config, but I found this negatively impacts model quality.

If I switch to DS + stage 2, then I get the same runtime error @SupetZYK posted above.

  File "/home/default_user/.conda/envs/user/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1298, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/home/default_user/.conda/envs/user/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1547, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer(
  File "/home/default_user/.conda/envs/user/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 324, in __init__
    self.flatten_dense_tensors_aligned(
  File "/home/default_user/.conda/envs/user/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 867, in flatten_dense_tensors_aligned
    return self.flatten(align_dense_tensors(tensor_list, alignment))
RuntimeError: torch.cat(): expected a non-empty list of Tensors

@tjruwase (Contributor)

@shaowei-su and @SupetZYK, it seems you are both seeing a different error from the original posting. Can you please open a new issue and share details for repro? I will close this in the meantime. Thanks!

@bestpredicts

Same error here, any update?

@seongminp

FWIW, I got this error when I accidentally put my model in inference mode: my PEFT config had inference_mode: True.
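(For anyone else hitting this with PEFT, a sketch of the relevant flag in a LoRA setup; the base checkpoint here is just an example:)

from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xxl")

# inference_mode=True freezes the adapter weights too, leaving ZeRO with an
# empty list of trainable parameters to partition; keep it False for training.
peft_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    inference_mode=False,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
)
model = get_peft_model(base, peft_config)
model.print_trainable_parameters()  # should report a non-zero trainable count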

@Wesley-Jzy

Same error when using loralib with ZeRO stage 2 & 3.

@tjruwase (Contributor) commented Sep 6, 2023

@bestpredicts and @Wesley-Jzy, are you able to provide repro steps?

@awan-10 (Contributor) commented Sep 8, 2023

Can people in this thread please downgrade to HF transformers 4.31.0 and try?
