getting nans with t5-large + fix #10830
Comments
Hi!
We didn't really manage to resolve the problems with t5/mt5 + mixed-precision fp16 (cc @patil-suraj). I'm not sure whether anybody has tried internally to fine-tune t5/mt5 with deepspeed (@stas00 maybe?).
The issue arises without deepspeed, with just the vanilla mt5-small model. I also see similar nans with deepspeed using a slightly modified model based on mt5-small; please see the issue here: #10821 (comment). I think that if the issue with the fp16 option could get resolved, this would hopefully also be more stable with model changes in deepspeed. Thanks a lot.
Indeed, this has nothing to do with deepspeed, other than that deepspeed trains in mixed precision and evals in full fp16 at the moment. I've started studying the bfloat16 vs. float16 numerical properties and their correlation to each other. Once I understand them well, I will try to see if there is some sort of magical remapping that could be done - this is my fantasy, of course. I just need to finish a few other more urgent things with the deepspeed stage3 integration first. But please don't let my comment prevent you from merging the proposed fix if it already solves the problem.
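For reference, the numerical trade-off between the two half-precision formats can be inspected directly in pytorch (a small illustrative snippet, not part of any proposed fix):

```python
import torch

# bfloat16 keeps float32's exponent range (so it essentially never
# overflows to inf), while float16 tops out at 65504 but has more
# mantissa precision per step - the root of the fp16 nan issue above.
fp16 = torch.finfo(torch.float16)
bf16 = torch.finfo(torch.bfloat16)

print(fp16.max)  # 65504.0
print(bf16.max)  # ~3.39e38, same order as float32
print(fp16.eps)  # ~9.77e-4
print(bf16.eps)  # ~7.81e-3, i.e. coarser precision
```

A t5/mt5 checkpoint pretrained in bfloat16 can thus produce activations far beyond 65504, which overflow the moment they are cast to float16.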
I got a similar issue with an mt5 model. @patrickvonplaten, thanks a lot in advance for your help.
@dorost1234 + @yuvalkirstain, please kindly try this branch. There are also a lot of debug statements in the branch, but they will be silent unless nan/inf is detected. I tested that it works on a small sample with t5-small/t5-base/t5-large/google/mt5-small. The main part of the fix is just:
and removing some code. So use the branch first. If it works, I guess we could just monkey-patch this version for AMP or come up with some cleaner solution. Probably with …
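Since the snippet itself is elided above, here is a hedged sketch of the general approach under discussion - keeping the feed-forward computation out of fp16 autocast. The helper name is mine, not from the branch:

```python
import torch

def ff_in_fp32(ff_module: torch.nn.Module,
               hidden_states: torch.Tensor) -> torch.Tensor:
    # Hypothetical helper illustrating the idea: run the T5 feed-forward
    # block with autocast disabled and inputs upcast to fp32, so its
    # large activations cannot overflow fp16's max of 65504 -> inf -> nan.
    with torch.cuda.amp.autocast(enabled=False):
        return ff_module(hidden_states.float())
```

The actual branch patches this inside the model code rather than via a wrapper, so treat this only as a sketch of the mechanism.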
Dear @stas00
I got this error:
I think there is some version mismatch. I removed the model from the input to the collator, as below:
and then here is what I got with the fp16 option:
and here is the loss without fp16:
So I think it is not optimizing the loss well. I would greatly appreciate you having a look. Thanks a lot.
Re: the errors - this is all on master (the source code), and I solve such problems by running locally from source rather than relying on the installed package.
Now, wrt the loss not going down - this is odd; I just ran your code:
Must be your hardware? Try lowering the learning rate? I tried with 1 or 2 GPUs and it worked in both cases.
Hi @stas00
This is such a great, wonderful, amazing fix. Looking forward to using it when it is pushed to the repository.
Thank you for your kind words, I'm so happy to hear that it worked, @dorost1234. I will make a proper PR after I clean this branch up.
@yuvalkirstain, please kindly test if this PR fixes the problem: #10956
Thank you @stas00!
Thank you for validating that, @yuvalkirstain! Indeed, I tried local fixes first, but the problem would just pop up elsewhere. I'm thinking that perhaps we could find out whether it's all calls to FF that lead to the problem or only some of them, and then we could optimize the solution I proposed by disabling it only for those. If you experiment, I recommend you try my branch, since I left the "detector" on and it'll immediately tell you where the first nan/inf appears. What I'm most interested in is some longer runs, to ensure it doesn't start overflowing at a later point. Thank you for your contribution.
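A nan/inf "detector" of the kind described can be sketched with forward hooks. This is only illustrative of the mechanism - the names and exact behavior are not taken from the branch:

```python
import torch

def attach_nan_inf_detector(model: torch.nn.Module):
    # Sketch: register a forward hook on every submodule that reports
    # when a module's output first contains inf or nan, so the offending
    # layer can be located. Hypothetical helper, not the branch's code.
    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and (
                torch.isinf(output).any() or torch.isnan(output).any()
            ):
                print(f"inf/nan detected in the output of {name}")
        return hook

    return [
        module.register_forward_hook(make_hook(name))
        for name, module in model.named_modules()
    ]
```

Each returned handle can be `.remove()`d once debugging is done, so the hooks add no permanent overhead.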
Fine-tuned T5-Base using this branch with the standard T5 fine-tuning HPs on NQ (except batch size - used only ~26k tokens) and didn't get nans (it has been running for over 3 hours and training converged). Thanks again; I guess the issue can be closed for the time being.
Thank you for this validation, @yuvalkirstain. I would still like to see if we can find a more efficient solution before merging it, but it's great that we have one that works. Unfortunately this doesn't help with deepspeed, since it doesn't use pytorch AMP but has its own version, which doesn't use a context manager and so can't be turned off locally in the same way. I linked this issue to the PR, so it'll get closed automatically when the PR is merged.
Well, the nans are back.
The model I used here was T5-large-ssm-nqo.
Yes, please - I'm working in parallel on gpt-neo, which has the same issues, so the more reproducible cases we have, the higher the chances we can find a solid fix. Those would also be good candidates for tests (hoping that we can find a quick way to trigger the overflow).
Let's continue the discussion in the PR that is trying to solve this issue: #10956 |
Hi @dorost1234, could you please tell me how you solved this loss optimization problem? I am facing the same issue.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored. |
So is this fix now in the main version of transformers? |
I found that results are different (the first is better) when you load like this:

```python
model1a_CPU = T5ForConditionalGeneration.from_pretrained(best_model_path, low_cpu_mem_usage=True, torch_dtype=torch.float16).to("cuda")
```

than when you load via:

```python
model1a_CPU = T5ForConditionalGeneration.from_pretrained(best_model_path, low_cpu_mem_usage=True)
```

So this could be a solution; I will compare results on /CPU versus /This versus /Half.
It seems like the solution is already implemented in this call:

```python
model1a_CPU = T5ForConditionalGeneration.from_pretrained(best_model_path, low_cpu_mem_usage=True, torch_dtype=torch.float16).to("cuda")
```

It is probably triggered by torch_dtype=torch.float16. Part of the model is (likely) moved to fp32 from fp16, so it works properly - exactly the same as with fp32, and exactly the same as on CPU. Of course it does use a little more memory: loaded the second way, memory usage is around 2.5 GB for T5-large, while with the first it is around 2.9 GB. It is also around 10-15 percent slower.
Environment info
transformers version: 4.5.0.dev0
Who can help
@patil-suraj @patrickvonplaten
Information
Model I am using (Bert, XLNet ...): t5-large
The problem arises when using:
The tasks I am working on are:
To reproduce
Steps to reproduce the behavior (the fix I'm suggesting is very simple, so perhaps there is no reason to reproduce):
Expected behavior
Training without nans.
Possible fix
I debugged and saw that we get nans in the
modeling_t5.py
script at line 241. By modifying this line to:
It seems to be solved.
BTW it happens in the last layers (this might explain why it wasn't caught in this fix)
seq2seq.zip