
FP-16 training producing nans on t5-large/flan-t5-xl #23918

@cassianlewis

Description


System Info

This was an issue a while back that seems to have resurfaced: https://discuss.huggingface.co/t/t5-fp16-issue-is-fixed/3139

I have tested the exact code below with t5-small and t5-base, and both work fine. However, with t5-large and flan-t5-xl, the model produces NaN outputs. This is solely a result of using half precision (ignore the multiple GPUs, strategy, etc.; I have tested every other variation):

import pytorch_lightning as pl

trainer = pl.Trainer(
    precision="16",
    accelerator="gpu",
    strategy="auto",
    devices=4,
)

I am using transformers==4.28.1 and lightning==2.0.0.

Any ideas/help appreciated
Thanks!
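For context, this failure mode is consistent with fp16's limited dynamic range: float16 overflows to inf above ~65504, and once an inf appears, operations like inf - inf produce NaN. Larger T5 variants tend to have larger activation magnitudes, which is the usual explanation for why t5-small/base survive fp16 while t5-large and flan-t5-xl do not. A minimal sketch of the mechanism using NumPy (not the model itself, just the arithmetic):

```python
import numpy as np

# float16 tops out at 65504; anything larger overflows to inf.
fp16_max = np.finfo(np.float16).max
print(fp16_max)  # 65504.0

x = np.float16(60000.0)
y = x * np.float16(2.0)  # exceeds fp16_max -> overflows to inf
print(y)                 # inf

# Once an inf appears, subtractions of the kind that occur inside
# softmax or layernorm (e.g. x - max(x)) yield NaN.
print(y - y)             # nan
```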

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

import pytorch_lightning as pl

trainer = pl.Trainer(
    precision="16",
    accelerator="gpu",
    strategy="auto",
    devices=4,
)
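Not part of the original report, but a commonly suggested workaround for T5 fp16 NaNs is bf16 mixed precision, which keeps fp32's exponent range so activation overflows cannot occur. A hedged sketch, assuming Ampere-or-newer GPUs (bf16 requires hardware support):

```python
import pytorch_lightning as pl

# Possible workaround (assumption: GPUs support bf16, e.g. A100/A10/30xx+).
# bf16 trades mantissa precision for fp32's exponent range, avoiding the
# inf/NaN overflows seen under plain fp16.
trainer = pl.Trainer(
    precision="bf16-mixed",  # Lightning 2.x spelling
    accelerator="gpu",
    strategy="auto",
    devices=4,
)
```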

Expected behavior

No NaNs: fp16 training should produce finite outputs on t5-large and flan-t5-xl, just as it does on t5-small and t5-base. Instead, the model outputs NaNs.
