load_best_model_at_end failed due to "size mismatch" when DeepSpeed is used #14628
Comments
cc @stas00
Thank you for this report, @dunalduck0. I hadn't tested (or ever used) this path. I can reproduce the problem with a much faster setup; I will work on solving it and report back when I have something to show.
Please try with this PR #14652 and let me know if that fixes the issue for you. Thank you.
Thank you @stas00. What do I do to get your fix onto my local box? I have a Linux box and a local clone of a fork of HuggingFace/transformers. It looks like your fix has not been merged into HuggingFace/transformers yet.
Indeed, it's not merged yet. I need to add tests first. Here are some of the ways you can try my PR's branch: if you have
Or you can clone my fork and switch to that branch:
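A sketch of that workflow. The fork URL and branch name below are placeholders (the actual ones are shown on the PR #14652 page), and the `pip install` of a PR ref relies on GitHub exposing `refs/pull/<N>/head`:

```shell
# Option A: install the PR branch directly via pip's VCS support
# (refs/pull/14652/head is the ref GitHub publishes for that PR)
pip install "git+https://github.com/huggingface/transformers.git@refs/pull/14652/head"

# Option B: clone the fork and switch to the PR branch
# NOTE: fork URL and branch name are placeholders -- copy the real ones from the PR page
git clone https://github.com/<fork-owner>/transformers.git
cd transformers
git checkout <pr-branch-name>
pip install -e .   # editable install so the branch's code is what gets imported
```

An editable install (option B) is convenient if you expect to pull further fixes onto the branch while testing.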
Update: the PR is ready to be merged, but I want to make sure you validate first that it indeed solves the problem for you.
Will do it tonight.
Perfect. I will merge as soon as you give me green light. |
The fix worked on my box. Thank you, @stas00, for the quick fix. Minor comment: the logging confused me at first. I saw DeepSpeed re-initialize everything and almost thought the program had restarted :P.
Thank you for testing, @dunalduck0. Once DeepSpeed fixes this issue, the full restart will go away.
Environment info
- transformers version: 4.13.0.dev0

Who can help
@stas00
Information
Model(s) I am using: EleutherAI/gpt-neo-1.3B and EleutherAI/gpt-j-6B
The problem arises when using: run_clm.py with DeepSpeed and --load_best_model_at_end
The task I am working on is: a toy fine-tuning run
To reproduce
Steps to reproduce the behavior:
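The exact command was not preserved, but an invocation along these lines reproduces the setup described above (dataset, output path, eval cadence, and the DeepSpeed config file name are illustrative placeholders, not the reporter's exact arguments):

```shell
# Illustrative repro sketch -- all paths/values are placeholders.
# Requires an evaluation+save strategy so that "best model" checkpoints exist
# for --load_best_model_at_end to reload at the end of training.
deepspeed examples/pytorch/language-modeling/run_clm.py \
  --model_name_or_path EleutherAI/gpt-neo-1.3B \
  --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \
  --do_train --do_eval \
  --evaluation_strategy steps --eval_steps 50 \
  --save_strategy steps --save_steps 50 \
  --load_best_model_at_end \
  --metric_for_best_model loss \
  --max_steps 100 \
  --output_dir /tmp/test-clm \
  --deepspeed ds_config_zero3.json
```

With a ZeRO stage-3 config, model parameters are partitioned across ranks, which is why naively reloading the best checkpoint's state dict can hit the "size mismatch" errors this issue reports.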
Error stack:
Expected behavior
--load_best_model_at_end should not crash with DeepSpeed