
FP-16 training producing nans on t5-large/flan-t5-xl #23918

@cassianlewis

Description


System Info

This was an issue a while back that seems to have resurfaced: https://discuss.huggingface.co/t/t5-fp16-issue-is-fixed/3139

I have tested the exact code below with t5-small and t5-base, and both work fine. However, with t5-large and flan-t5-xl, the model produces NaN outputs. This is solely a result of using half precision (ignore the multiple GPUs, strategy, etc.; I have tested every other variation):

import pytorch_lightning as pl

trainer = pl.Trainer(
    precision="16",
    accelerator="gpu",
    strategy="auto",
    devices=4,
)

I am using transformers==4.28.1 and lightning==2.0.0.

Any ideas/help appreciated
Thanks!
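For context, this failure mode is consistent with fp16's limited dynamic range: float16 overflows to inf above ~65504, and once an inf appears, operations like inf - inf produce NaN. Larger T5 variants tend to have larger activation magnitudes, which is the usual explanation for why t5-small/base survive fp16 while t5-large and flan-t5-xl do not. A minimal sketch of the mechanism using NumPy (not the model itself, just the arithmetic):

```python
import numpy as np

# float16 tops out at 65504; anything larger overflows to inf.
fp16_max = np.finfo(np.float16).max
print(fp16_max)  # 65504.0

x = np.float16(60000.0)
y = x * np.float16(2.0)  # exceeds fp16_max -> overflows to inf
print(y)                 # inf

# Once an inf appears, subtractions of the kind that occur inside
# softmax or layernorm (e.g. x - max(x)) yield NaN.
print(y - y)             # nan
```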

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

import pytorch_lightning as pl

trainer = pl.Trainer(
    precision="16",
    accelerator="gpu",
    strategy="auto",
    devices=4,
)
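Not part of the original report, but a commonly suggested workaround for T5 fp16 NaNs is bf16 mixed precision, which keeps fp32's exponent range so activation overflows cannot occur. A hedged sketch, assuming Ampere-or-newer GPUs (bf16 requires hardware support):

```python
import pytorch_lightning as pl

# Possible workaround (assumption: GPUs support bf16, e.g. A100/A10/30xx+).
# bf16 trades mantissa precision for fp32's exponent range, avoiding the
# inf/NaN overflows seen under plain fp16.
trainer = pl.Trainer(
    precision="bf16-mixed",  # Lightning 2.x spelling
    accelerator="gpu",
    strategy="auto",
    devices=4,
)
```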

Expected behavior

No NaNs: fp16 training should produce finite outputs on t5-large and flan-t5-xl, just as it does on t5-small and t5-base. Instead, the model outputs NaNs.
