[BUG] ValueError: max() arg is an empty sequence using bf16 zero stage3 #2820
Comments
The error also appears with fp16 instead of bf16 in the deepspeed config, and with zero3_init_flag: false in the accelerate config as well.
With stage 2 and no offload, I get a different error.
Could this be an issue with the dataset?
I was able to sort it out using the below accelerate + DS config. Now dealing with an OOM issue, but not sure why the previous DeepSpeed config didn't work.
How can we estimate the # of GPUs needed (each with 40 GB) for flan-t5-11b with CPU param/optimizer offloading? The "Estimated memory needed for params, optim states and gradients" output reports 0.49GB per GPU for offload_param=cpu, offload_optimizer=cpu, zero_init=1.
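That per-GPU figure comes from DeepSpeed's built-in estimator. A minimal sketch of invoking it, assuming a recent DeepSpeed release and enough CPU RAM to instantiate the model once (google/flan-t5-xxl is the 11B checkpoint):

```python
# Sketch: print ZeRO-3 memory estimates for the 11B model at several GPU counts.
from transformers import AutoModelForSeq2SeqLM
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xxl")
for num_gpus in (4, 8, 16):
    # Prints per-GPU / per-CPU memory for each offload combination,
    # e.g. "offload_param=cpu, offload_optimizer=cpu, zero_init=1".
    estimate_zero3_model_states_mem_needs_all_live(
        model, num_gpus_per_node=num_gpus, num_nodes=1
    )
```

Note that the estimator only covers model states (params, gradients, optimizer states), not activations, which is why a larger batch size can still OOM even when the estimate looks tiny.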
With batch size 1, it works without OOM. How can we estimate the # of GPUs needed for batch size 4 or 8 without trial and error? With batch size 1, I see at most 26903MiB used per GPU. With batch size 2, it runs for some time (3-4 hours) on 8x 40 GB GPUs with almost all 40 GB utilized, and then OOMs. How can I cap the GPU memory used?
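On capping: I'm not aware of a DeepSpeed-side limit, but PyTorch itself can put a hard ceiling on its caching allocator. A sketch under that assumption (this makes over-limit allocations fail immediately rather than actually reducing memory demand):

```python
# Sketch: hard-cap PyTorch's caching allocator per process (PyTorch >= 1.8).
# This does not shrink memory usage; allocations past the cap raise OOM
# right away instead of hours into training.
import torch

if torch.cuda.is_available():
    for device_id in range(torch.cuda.device_count()):
        # Allow at most ~90% of each 40 GB card (~36 GB), leaving headroom
        # for fragmentation and CUDA context overhead.
        torch.cuda.set_per_process_memory_fraction(0.9, device=device_id)
```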
With the following deepspeed config
and
@sujithjoseph, deepspeed should use bf16. Are you observing something different?
@tjruwase, it did work with bf16. The only question I have is: can I use max_memory to restrict the memory used by the model during fine-tuning, like the snippet below used for inference?

max_memory={0: "25GIB", "cpu":"120GB"}
model = load_checkpoint_and_dispatch(
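For context, that truncated call is Accelerate's big-model inference API. A sketch of how the full pattern usually looks (the checkpoint path and model class are illustrative placeholders, not taken from this thread):

```python
# Sketch of Accelerate's big-model-inference pattern referenced above.
# The checkpoint path and model class are illustrative placeholders.
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForSeq2SeqLM

config = AutoConfig.from_pretrained("google/flan-t5-xxl")
with init_empty_weights():
    model = AutoModelForSeq2SeqLM.from_config(config)  # no weights allocated yet

model = load_checkpoint_and_dispatch(
    model,
    checkpoint="/path/to/sharded/checkpoint",   # placeholder
    device_map="auto",
    max_memory={0: "25GIB", "cpu": "120GB"},    # per-device caps honored at dispatch
    no_split_module_classes=["T5Block"],        # keep residual blocks on one device
)
```

As the next reply notes, these max_memory caps are an Accelerate dispatch feature for inference; they are not hooked into DeepSpeed's training path.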
Got it. I don't have experience with those memory-restriction flags, which seem to be Accelerate flags; I don't think they are hooked into deepspeed. Can you please pose this question on their forum? I think we can work with them to enable the desired feature.
@sujithjoseph I faced the same issues as you mentioned above; both the stage 2 and stage 3 errors were the same as yours. Did you find any workaround?
Ran into the exact same error when running DeepSpeed on Ray. Following this thread.
Same error: RuntimeError: torch.cat(): expected a non-empty list of Tensors during accelerate.prepare. How can this be solved?
@zhenlohuang, @shaowei-su, @SupetZYK, it seems that @sujithjoseph resolved the original issue with the following config. If the workaround does not work for you, please open a new issue and share details to help us repro. Thanks!
@tjruwase I was able to run DS + stage 3 + fp16 by disabling ... If I switch to DS + stage 2, then it's the same runtime error @SupetZYK posted above.
@shaowei-su and @SupetZYK, it seems you are both seeing a different error from the original posting. Can you please open a new issue and share details for repro? I will close this in the meantime. Thanks!
Same error here, any update?
FWIW, I got this error when I accidentally put my model in inference mode.
Same error when using loralib with ZeRO stage 2 & 3.
@bestpredicts and @Wesley-Jzy, are you able to provide repro steps?
Can people in this thread please downgrade to HF transformers 4.31.0 and try?
To Reproduce
Steps to reproduce the behavior:
Happened during fine-tuning of the flan-t5-11b model. Here is the entire error gist: https://gist.github.com/sujithjoseph/c410514acfccc76974a8130a8afd2169
Here is the deepspeed config: https://gist.github.com/sujithjoseph/92bf27de6bba704b57c3b9eb7aa00365
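For readers without access to the gist, a typical shape for a ZeRO stage-3 CPU-offload config is sketched below as a Python dict; the values are common defaults for illustration, not necessarily what the gist contains:

```python
# Sketch of a typical ZeRO stage-3 CPU-offload DeepSpeed config; values are
# common defaults for illustration, not the contents of the linked gist.
zero_stage3_offload_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}
```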
ds_report output
ds_report: https://gist.github.com/sujithjoseph/c725de5fb38bb3c20e4fb6fd55f63848
System info (please complete the following information):
Launcher context
Are you launching your experiment with the deepspeed launcher, MPI, or something else? Accelerate + PEFT

deepspeed_config:
  deepspeed_config_file: zero_stage3_offload_config.json
  zero3_init_flag: true
Additional context
I assume that bf16 configs and fp16 configs are interchangeable
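For what it's worth, the two precision sections in a DeepSpeed config are not quite drop-in interchangeable: fp16 carries loss-scaling knobs that bf16 does not need. A sketch of the two sections as Python dicts (values are common defaults, not taken from the gist above):

```python
# Sketch: fp16 vs. bf16 sections of a DeepSpeed config. Values shown are
# common defaults for illustration, not this issue's actual config.
fp16_section = {
    "fp16": {
        "enabled": True,
        "loss_scale": 0,           # 0 means dynamic loss scaling
        "initial_scale_power": 16,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1,
    }
}

bf16_section = {
    "bf16": {
        "enabled": True            # bf16 needs no loss scaling
    }
}
```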