Training with fp16 precision gives NaN in LongT5 #17978
Comments
This is due to
@ydshieh Sorry for the late reply. I ran a few experiments and found that fixing both the clamp value and the attention mask seems to work (I should mention that at first I only fixed the attention mask and it still gave NaN; after I also added the clamp value, the NaN disappeared). In more detail, there are lines like longt5 704 that need to be set to
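For readers hitting the same issue, the two fixes described above look roughly like the following. This is a sketch adapted from the clamp guard in modeling_t5.py and the dtype-aware attention mask used elsewhere in transformers; the function names and wrapping are mine, not the actual LongT5 code:

```python
import torch

# Sketch of the T5-style clamp guard (as in modeling_t5.py), applied to each
# sub-layer's output tensor to keep fp16 activations finite.
def clamp_fp16_overflow(hidden_states: torch.Tensor) -> torch.Tensor:
    # clamp inf values to enable fp16 training
    if hidden_states.dtype == torch.float16 and torch.isinf(hidden_states).any():
        clamp_value = torch.finfo(hidden_states.dtype).max - 1000
        hidden_states = torch.clamp(hidden_states, min=-clamp_value, max=clamp_value)
    return hidden_states

# Sketch of the attention-mask fix: build the additive mask from the dtype's
# minimum value instead of a hardcoded -1e9/-1e10, which becomes -inf in fp16.
def extend_attention_mask(attention_mask: torch.Tensor, dtype: torch.dtype) -> torch.Tensor:
    extended = attention_mask[:, None, None, :].to(dtype)
    return (1.0 - extended) * torch.finfo(dtype).min
```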
I also had NaN loss with
Could you provide code snippets (training script) that reproduce this issue? I am a bit surprised that
In general, T5 just doesn't work well with fp16. Note that LongT5, MT5, T5, and ByT5 all use more or less the same architecture.
@ydshieh It was a bit hard to provide a training script, as I'm experimenting on the SCROLLS dataset; there are a lot of setups and I also made some modifications myself. One can probably reference this discussion (stancld-longt5) and use the script there by changing the

@patrickvonplaten Thank you for mentioning more issues regarding this. I have trained the model using

FYI: How can I open a PR? I tried making a branch from main, but when I tried to push it I found that I had no permission to do so. As a workaround, I uploaded my changed file to one of my own repositories here: longt5_fix. My fixes are hardcoded, so they should be adapted for general use.
Please see https://huggingface.co/docs/transformers/contributing#start-contributing-pull-requests (in short: you should create a branch, push it to your fork, and open a PR).
I'm actually against the
Based on my observations, using the
(Not promoting it, though; in inference/eval mode, large values almost always imply something is wrong in the model.)
Indeed, the clamping works around the overflow, but it's not really a sound mathematical approach. I suppose the model learns to deal with this kind of penalty, but it doesn't make training smooth.
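To make the overflow concrete: fp16 can only represent magnitudes up to 65504, so a large activation silently becomes inf, and inf minus inf then produces the NaN that shows up in the loss. A quick illustration in plain PyTorch:

```python
import torch

print(torch.finfo(torch.float16).max)           # 65504.0, the largest finite fp16 value

x = torch.tensor(70000.0, dtype=torch.float16)  # overflows to inf on conversion
print(x)                                        # tensor(inf, dtype=torch.float16)
print(x - x)                                    # tensor(nan, dtype=torch.float16): inf - inf = nan
```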
I also proposed removing clamping in T5: #10956 (comment)
I have the same problem when training with bf16 set to True.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
It appears this is the case for Mistral as well.
System Info
transformers version: 4.10.0.dev0

Who can help?
@patrickvonplaten

Information
Tasks: examples folder (such as GLUE/SQuAD, ...)

Reproduction
I'm currently running the scrolls_benchmark. I'm interested in the performance of the LongT5 model on SCROLLS, so I changed the model name to google/long-t5-tglobal-base and ran training with fp16 enabled (if I run with fp32, I get CUDA OOM errors). However, the output loss is always NaN. I googled for fixes and found this post: t5-fp16-fixed. I searched the transformers repo and found that the modeling_longt5 file doesn't seem to incorporate the clamp_value change. I wonder if this is why fp16 is not working in LongT5? And if so, is there a way to fix it with an approach similar to what you did for T5? Thank you very much!

FYI: you probably noticed that the transformers version is 4.10.0, which does not have LongT5. I manually added the LongT5 files in a forked SCROLLS repo here: longt5_folder. It indeed works properly under a small parameter setting.
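A hypothetical minimal check along these lines (not the actual SCROLLS training setup: the input text and summary below are made up, and this runs the model directly in half precision rather than through Trainer's fp16 AMP):

```python
import torch
from transformers import AutoTokenizer, LongT5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/long-t5-tglobal-base")
model = LongT5ForConditionalGeneration.from_pretrained(
    "google/long-t5-tglobal-base", torch_dtype=torch.float16
).to("cuda")

# A long input to exercise the local / transient-global attention paths.
inputs = tokenizer("summarize: " + "some long document text " * 500,
                   return_tensors="pt").to("cuda")
labels = tokenizer("a short summary", return_tensors="pt").input_ids.to("cuda")

loss = model(**inputs, labels=labels).loss
print(loss)  # reportedly nan in fp16 before the clamp / attention-mask fixes
```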
Expected behavior
The LongT5 model should not produce NaN loss with fp16.