Closed
Description
System Info
This was an issue a while back but seems to have resurfaced: https://discuss.huggingface.co/t/t5-fp16-issue-is-fixed/3139
I have tested exactly the following code on t5-small and t5-base, and both work fine. With t5-large and/or flan-t5-xl, however, the model produces NaN outputs. The cause appears to be half precision alone; the number of GPUs, strategy, etc. can be ignored, since I have tested every other variation:
trainer = pl.Trainer(
    precision="16",
    accelerator='gpu',
    strategy='auto',
    devices=4,
)
I am using transformers == 4.28.1 and lightning == 2.0.0
Any ideas/help appreciated
Thanks!
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
trainer = pl.Trainer(
    precision="16",
    accelerator='gpu',
    strategy='auto',
    devices=4,
)
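For anyone reading along, here is a minimal sketch of how half precision can turn into NaNs at all. It assumes the NaNs come from float16 overflow, as in the older thread linked above; that is not confirmed for this setup. The value 70000.0 is a made-up stand-in for a large activation, not something measured from t5-large or flan-t5-xl.

```python
import torch

# float16 tops out at 65504; anything well beyond that rounds to inf.
fp16_max = torch.finfo(torch.float16).max

# Hypothetical activation value, chosen only because it exceeds the float16 range.
activation = torch.tensor([70000.0])

as_fp16 = activation.to(torch.float16)   # overflows to inf
as_bf16 = activation.to(torch.bfloat16)  # stays finite: coarser precision, wider range

# inf - inf is undefined, so a single overflow can become NaN further downstream.
poisoned = as_fp16 - as_fp16

print("float16 max:", fp16_max)
print("as float16:", as_fp16)
print("as bfloat16:", as_bf16)
print("inf - inf:", poisoned)
```

If overflow is indeed the cause here, it would fit the pattern of the larger checkpoints failing while t5-small and t5-base work, but that is a guess until the first non-finite tensor in the actual run is located.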
Expected behavior
Finite outputs are expected, but the model produces NaNs!