[BUG] Can't load OPT-30B and OPT-66B through checkpoints.json #2616
Comments
I can confirm that I'm able to replicate this. Interestingly, I'm finding that smaller OPT models load fine with meta tensors. It appears that models whose HuggingFace checkpoints are split into multiple .bin files are causing this error. @RezaYazdaniAminabadi any idea of the cause? I'm guessing we don't catch this in our unit tests because we use small versions of these larger models to save time.
@anselmwang I see you mentioned you are only trying to load the models with meta tensor on your production node. One possible solution (until we determine the cause of this error) would be to create a pre-sharded version of each model on your dev node and copy that over to the production node. I'm able to properly load these models from DeepSpeed-sharded checkpoints. See my comment here on how to generate those sharded checkpoints: #2379 (comment)
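For anyone who wants to try that route, here is a minimal sketch of generating a pre-sharded checkpoint on the dev node (model name, paths, and `mp_size` are illustrative; `save_mp_checkpoint_path` is the DeepSpeed-Inference option the linked comment relies on):

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM

# Load the HF checkpoint once (this step still needs enough CPU RAM).
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-30b", torch_dtype=torch.float16
)

# DeepSpeed injects its kernels and writes tensor-parallel shards to disk;
# run under the launcher, e.g. `deepspeed --num_gpus 4 shard_model.py`.
model = deepspeed.init_inference(
    model,
    mp_size=4,                                   # illustrative GPU count
    dtype=torch.float16,
    replace_with_kernel_inject=True,
    save_mp_checkpoint_path="/data/opt-30b-ds",  # shards land here
)
```

The folder written to `/data/opt-30b-ds` can then be copied to the production node and loaded there without the full-model CPU load.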
I'm experiencing the same issue with the BLOOM models.
Regarding Bloom models, downgrading deepspeed to 0.7.6 works for me. |
Also encountered this when upgrading from 0.7.6 to 0.7.7, with BLOOM 176B. |
Hi, I have fixed some bugs regarding the checkpoint loading for these model architectures. Could you please retry using this PR? You can also try our updated test-suite here. |
Hi @niumanar, @asafkar and @anselmwang, I just wanted to see if you got a chance to try this PR and whether it fixed the issue. Thanks,
@RezaYazdaniAminabadi I can confirm that version 0.8.0 fixed the issue for me. |
@RezaYazdaniAminabadi, @njhill said version 0.8.0 fixed the issue; unfortunately, this version doesn't fix it for me. PR #2662 fixes OPT-30B, but not OPT-66B.
@RezaYazdaniAminabadi apologies, I spoke too soon... it's now working for BLOOM 175B with the pre-sharded fp16 weights, but not with the original weights.
Me too: Any idea when a fix might be available? |
Also, I seem to get the same "NotImplementedError: Cannot copy out of meta tensor; no data!" error even when I roll back to 0.7.6. Is that expected? How can I get this working? P.S.: I am attempting to load a model with checkpoints that are split into two .bin files. |
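For what it's worth, that error is generic PyTorch behavior: a meta tensor has shape and dtype but no storage, so anything that tries to copy its (nonexistent) data raises it. A minimal standalone repro:

```python
import torch

# Meta tensors allocate no data, so any copy out of them fails.
t = torch.empty(4, 4, device="meta")
t.to("cpu")  # NotImplementedError: Cannot copy out of meta tensor; no data!
```

In this issue it means some weights were still on the meta device (i.e. never actually loaded from the checkpoint) when DeepSpeed tried to move them.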
Same on my end.
Same for me. Everything works on 0.7.6 now, and before it didn't. However, 0.8.0 does not resolve the issue and gives similar behavior to what others have shown.
@asafkar @felifri Have you tried with …? But ultimately, what I did that I think got it loading correctly (on 0.8.0) was to load the model once on CPU (and thus RAM), and to re-save the checkpoints to a local folder in sharded form using save_pretrained. I am using Huggingface Accelerate for handling config and initialization, so I am not using deepspeed.initialize() or deepspeed.init_inference() at all; instead I'm simply passing my deepspeed config to the huggingface deepspeed config object (something like …).
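A minimal sketch of that re-save step, assuming a BLOOM-style model (model name, target folder, and shard size are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM

# Load once on a machine with enough CPU RAM...
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)

# ...then re-save locally in sharded form; no shard exceeds the size limit.
model.save_pretrained("/local/bloom-sharded", max_shard_size="10GB")
```

After that, `from_pretrained("/local/bloom-sharded", ...)` picks up the smaller shards.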
On DS 0.9.2, I tried with opt-350m, which only has one .bin file, and it doesn't work (it throws the same NotImplementedError: Cannot copy out of meta tensor; no data! error).
What is low_cpu_mem_usage set to? |
If I set low_cpu_mem_usage, …
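For reference, the flag under discussion is a from_pretrained argument; a minimal sketch with the opt-350m checkpoint mentioned above:

```python
from transformers import AutoModelForCausalLM

# low_cpu_mem_usage=True loads weights shard-by-shard instead of first
# materializing the whole state dict (and a second model copy) in CPU RAM.
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m", low_cpu_mem_usage=True
)
```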
I got the same error with NousResearch/Nous-Capybara-34B.
Describe the bug
I can't load OPT-30B and OPT-66B through checkpoints.json. If I load them with Huggingface from_pretrained, everything works fine. This bug is troublesome because my production nodes have far less memory than my dev node, so they don't have enough CPU memory to load OPT-30B and OPT-66B.

To Reproduce
python 3.7.7
Without checkpoints_json, this command works
date; deepspeed --num_gpus 4 bloom-inference-scripts/bloom-ds-inference.py --name facebook/opt-30b; date
Below is the stack trace when using checkpoints.json
date; deepspeed --num_gpus 4 bloom-inference-scripts/bloom-ds-inference.py --name facebook/opt-30b --use_checkpoints_json; date
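(For context: checkpoints.json is the small file bloom-ds-inference.py writes so deepspeed.init_inference can stream the HF checkpoint shards straight to the GPUs instead of replicating the whole model in CPU RAM first. A minimal sketch of that construction, with an illustrative cache path and the "type"/"version" fields the script uses:)

```python
import json
from pathlib import Path

# Illustrative snapshot path inside the HF cache; "xxx" stands in for the
# actual commit hash of the downloaded model.
ckpt_dir = Path(
    "~/.cache/huggingface/hub/models--facebook--opt-30b/snapshots/xxx"
).expanduser()
checkpoint_files = sorted(str(p) for p in ckpt_dir.glob("*.bin"))

with open("checkpoints.json", "w") as f:
    json.dump(
        {"type": "BLOOM", "checkpoints": checkpoint_files, "version": 1.0}, f
    )

# The script then passes this file via
# deepspeed.init_inference(model, ..., checkpoint="checkpoints.json").
```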
For OPT-66B, this command works
date; deepspeed --num_gpus 4 bloom-inference-scripts/bloom-ds-inference.py --name facebook/opt-66b; date
But when turning on checkpoints.json,
date; deepspeed --num_gpus 4 bloom-inference-scripts/bloom-ds-inference.py --name facebook/opt-66b --use_checkpoints_json; date
below is the stack trace.

Expected behavior
ds_report output
Please run ds_report to give us details about your setup.

Screenshots
If applicable, add screenshots to help explain your problem.
System info (please complete the following information):
Docker context
Are you using a specific docker image that you can share?
Not using Docker.
Additional context
Add any other context about the problem here.