getting nans with t5-large + fix #10830

Closed · 2 tasks · Tracked by #2
yuvalkirstain opened this issue Mar 21, 2021 · 23 comments

@yuvalkirstain (Contributor)

Environment info

  • transformers version: 4.5.0.dev0
  • Platform: Linux-4.15.0-65-generic-x86_64-with-glibc2.10
  • Python version: 3.8.8
  • PyTorch version (GPU?): 1.7.1+cu101 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Who can help

@patil-suraj @patrickvonplaten

Information

Model I am using (Bert, XLNet ...): t5-large

The problem arises when using:

  • my own modified scripts: run_seq2seq with minor modifications (attached)

The task I am working on is:

  • my own task or dataset: Closed-Book Open Domain QA

To reproduce

Steps to reproduce the behavior (the fix I'm suggesting is very simple, so perhaps there is no reason to reproduce):

  1. unzip the attached zip (below).
  2. run
python run_seq2seq.py --model_name_or_path=t5-large \
    --do_train \
    --do_eval \
    --task=qa \
    --train_file=data/PAQ.filtered.regular.16000.json \
    --validation_file=data/PAQ.filtered.regular.16000.json \
    --output_dir=results/5e-5-t5-large-4096000-128-140-1792000-0.1-regular-true-4 \
    --overwrite_output_dir \
    --per_device_train_batch_size=1 \
    --per_device_eval_batch_size=128 \
    --predict_with_generate \
    --fp16 \
    --max_steps=1000 \
    --evaluation_strategy=steps \
    --text_column=question \
    --summary_column=answer \
    --save_total_limit=5 \
    --cache_dir=../.cache \
    --save_steps=500000 \
    --learning_rate=5e-5 \
    --eval_steps=96000 \
    --warmup_steps=100 \
    --run_name=5e-5-t5-large-4096000-128-140-1792000-0.1-regular-true-4 \
    --dropout_rate=0.1 \
    --gradient_accumulation_steps=1 \
    --logging_steps=1

Expected behavior

Training without nans.

Possible fix

I debugged and saw that we get nans in the modeling_t5.py script at line 241:

hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)

By modifying this line to:

clamp_value = torch.finfo(hidden_states.dtype).max - 1000
hidden_states = torch.clamp(hidden_states, min=-clamp_value, max=clamp_value) * torch.rsqrt(variance + self.variance_epsilon)

the problem seems to be solved.

BTW, it happens in the last layers (this might explain why it wasn't caught by this fix).
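
For context, here is a minimal sketch (a simplification, not the exact upstream module) of how the proposed clamp fits into a T5-style RMS layer norm:

import torch
from torch import nn

class T5LayerNormWithClamp(nn.Module):
    # Simplified RMS-style layer norm, as used by T5, with the proposed clamp.
    def __init__(self, hidden_size, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states):
        # RMS norm: no mean subtraction, only scaling by 1/sqrt(mean(x^2) + eps).
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        # Clamp activations just below the dtype max so the normalization
        # product cannot overflow to inf (and then nan) under fp16.
        clamp_value = torch.finfo(hidden_states.dtype).max - 1000
        hidden_states = torch.clamp(hidden_states, min=-clamp_value, max=clamp_value) * torch.rsqrt(
            variance + self.variance_epsilon
        )
        return self.weight * hidden_states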

seq2seq.zip

@dorost1234 commented Mar 21, 2021

Hi,
I also observe a similar issue with mt5 models (#10819); deepspeed is still not working for me due to this issue with mt5 models.
I would greatly appreciate it if you could have a look, @patil-suraj @patrickvonplaten.

@patrickvonplaten (Contributor)

We didn't really manage to resolve the problems with t5/mt5 + mixed precision fp16 (cc @patil-suraj). I'm not sure whether anybody has tried internally to fine-tune t5/mt5 with deepspeed (@stas00 maybe?)

@dorost1234 commented Mar 22, 2021

The issue arises without deepspeed, with just the vanilla mt5-small model. I also see similar nans with deepspeed and a slightly modified model based on mt5-small; please see the issue here: #10821 (comment). I think that if the fp16 issue gets resolved, this will hopefully also be more stable with model changes under deepspeed. Thanks a lot.

@stas00 (Contributor) commented Mar 22, 2021

Indeed, this has nothing to do with deepspeed, other than that deepspeed trains in mixed precision and evals in full fp16 at the moment.

I've started studying the numerical properties of bfloat16 vs. float16 and how they relate to each other. Once I understand them well, I will try to see if there is some sort of magical remapping that could be done - this is my fantasy, of course. I just need to finish a few other more urgent things with the deepspeed stage3 integration first.
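
For illustration (a sketch, not code from the branch), the ranges of the two dtypes can be compared directly in PyTorch:

import torch

# bfloat16 keeps roughly fp32's exponent range (max ~3.4e38) but has fewer
# mantissa bits, while float16 tops out at ~65504 and overflows much earlier.
for dtype in (torch.float16, torch.bfloat16, torch.float32):
    info = torch.finfo(dtype)
    print(f"{str(dtype):<16} max={info.max:.3e}  eps={info.eps:.3e}  tiny={info.tiny:.3e}")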

But please don't let my comment prevent you from merging the proposed fix if it already solves the problem.

@dorooddorood606

I got a similar issue with the mt5 model. @patrickvonplaten, thanks a lot in advance for your help.

@stas00 (Contributor) commented Mar 29, 2021

@dorost1234 + @yuvalkirstain, please kindly try this branch:
https://github.com/huggingface/transformers/tree/t5-fp16-no-nans
and let me know if it solves the problem. It seems that the problem is due to autocast in T5LayerFF, so this branch tries to turn off autocast just for that layer. It also disables the previously added clamping.

There are also a lot of debug statements in the branch, but they will be silent unless nan/inf is detected.

I tested that it works on a small sample with t5-small/t5-base/t5-large/google/mt5-small.

The main part of the fix is just:

class T5LayerFF(nn.Module):
    def forward(self, hidden_states):
        with torch.cuda.amp.autocast(enabled=False):
            forwarded_states = self.layer_norm(hidden_states)
            forwarded_states = self.DenseReluDense(forwarded_states)
            hidden_states = hidden_states + self.dropout(forwarded_states)
        return hidden_states

and removing some code. So use the branch first.

If it works, I guess we could just monkey-patch this version for AMP or come up with some cleaner solution, probably with a torch.is_autocast_enabled() check.
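
One possible shape for that cleaner solution (just a sketch; it assumes the submodules layer_norm, DenseReluDense and dropout are defined as in modeling_t5.py, and the .float() upcast is an assumption rather than code from the branch):

import torch
from torch import nn

class T5LayerFF(nn.Module):
    def forward(self, hidden_states):
        # Only step out of AMP when autocast is actually active.
        if torch.is_autocast_enabled():
            with torch.cuda.amp.autocast(enabled=False):
                # Assumption: upcast the (possibly fp16) activations to fp32.
                return self._ff(hidden_states.float())
        return self._ff(hidden_states)

    def _ff(self, hidden_states):
        forwarded_states = self.layer_norm(hidden_states)
        forwarded_states = self.DenseReluDense(forwarded_states)
        return hidden_states + self.dropout(forwarded_states)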

@dorost1234

Dear @stas00,
Thank you very much for taking the time to look into this issue; it would be really awesome if this fixed it. To test it, I checked out the branch, installed it locally with "python setup.py develop", and then ran this command:

python run_translation.py --model_name_or_path google/mt5-small --do_train --do_eval --source_lang en --target_lang ro --dataset_name wmt16 --dataset_config_name ro-en --output_dir /temp/test --per_device_train_batch_size=4 --per_device_eval_batch_size=4 --overwrite_output_dir --predict_with_generate --logging_step 10 --fp16

I got this error:

Traceback (most recent call last):
  File "run_translation.py", line 562, in <module>
    main()
  File "run_translation.py", line 448, in main
    pad_to_multiple_of=8 if training_args.fp16 else None,
TypeError: __init__() got an unexpected keyword argument 'model'

I think there is some version mismatch. I removed the model from the input to the collator, as below:

    data_collator = DataCollatorForSeq2Seq(
        tokenizer,
        # model=model,
        label_pad_token_id=label_pad_token_id,
        pad_to_multiple_of=8 if training_args.fp16 else None,
    )

and then here is what I got with the fp16 option:

{'loss': 23.3523, 'learning_rate': 4.999890767684712e-05, 'epoch': 0.0}                                                                                                                       
{'loss': 22.5557, 'learning_rate': 4.999781535369424e-05, 'epoch': 0.0}                                                                                                                       
{'loss': 25.9471, 'learning_rate': 4.999672303054136e-05, 'epoch': 0.0}                                                                                                                       
{'loss': 23.0994, 'learning_rate': 4.9995630707388475e-05, 'epoch': 0.0}                                                                                                                      
{'loss': 24.9974, 'learning_rate': 4.999453838423559e-05, 'epoch': 0.0}                                                                                                                       
{'loss': 23.3743, 'learning_rate': 4.999344606108271e-05, 'epoch': 0.0}                                                                                                                       
{'loss': 24.2147, 'learning_rate': 4.999235373792983e-05, 'epoch': 0.0}                                                                                                                       
{'loss': 26.7845, 'learning_rate': 4.9991261414776954e-05, 'epoch': 0.0}                                                                                                                      
{'loss': 25.2277, 'learning_rate': 4.9990169091624065e-05, 'epoch': 0.0}                                                                                                                      
{'loss': 23.3156, 'learning_rate': 4.998907676847119e-05, 'epoch': 0.0}                                                                                                                       
{'loss': 21.275, 'learning_rate': 4.99879844453183e-05, 'epoch': 0.0}                                                                                                                         
{'loss': 23.7031, 'learning_rate': 4.9986892122165426e-05, 'epoch': 0.0}                                                                                                                      
{'loss': 23.8086, 'learning_rate': 4.9985799799012544e-05, 'epoch': 0.0}                                                                                                                      
{'loss': 25.8143, 'learning_rate': 4.998470747585966e-05, 'epoch': 0.0}                                                                                                                       
{'loss': 24.4319, 'learning_rate': 4.998361515270678e-05, 'epoch': 0.0}                                                                                                                       
{'loss': 26.8277, 'learning_rate': 4.99825228295539e-05, 'epoch': 0.0} 

Here is the loss without fp16:

{'loss': 27.0258, 'learning_rate': 4.999890767684712e-05, 'epoch': 0.0}                                                                                                                       
{'loss': 23.141, 'learning_rate': 4.999781535369424e-05, 'epoch': 0.0}                                                                                                                        
{'loss': 21.2312, 'learning_rate': 4.999672303054136e-05, 'epoch': 0.0}                                                                                                                       
{'loss': 19.3567, 'learning_rate': 4.9995630707388475e-05, 'epoch': 0.0}                                                                                                                      
{'loss': 18.7998, 'learning_rate': 4.999453838423559e-05, 'epoch': 0.0}                                                                                                                       
{'loss': 17.9632, 'learning_rate': 4.999344606108271e-05, 'epoch': 0.0}                                                                                                                       
{'loss': 17.2105, 'learning_rate': 4.999235373792983e-05, 'epoch': 0.0}                                                                                                                       
{'loss': 17.5506, 'learning_rate': 4.9991261414776954e-05, 'epoch': 0.0}                                                                                                                      
{'loss': 15.2566, 'learning_rate': 4.9990169091624065e-05, 'epoch': 0.0}                                                                                                                      
{'loss': 14.8667, 'learning_rate': 4.998907676847119e-05, 'epoch': 0.0}                                                                                                                       
{'loss': 13.7132, 'learning_rate': 4.99879844453183e-05, 'epoch': 0.0}                                                                                                                        
{'loss': 13.4058, 'learning_rate': 4.9986892122165426e-05, 'epoch': 0.0}

So I think this is not optimizing the loss well. I would greatly appreciate it if you could have a look. Thanks a lot.

@stas00 (Contributor) commented Mar 29, 2021

Re: the errors - this is all on master, both the source code and run_translation.py. When you install with pip install -e ., sometimes conda/pip don't clean up an old install, so it helps to run pip uninstall transformers -y at least 2 times!

I solve such problems by running locally and not relying on the installed transformers, i.e.:

git clone https://github.com/huggingface/transformers
cd transformers
PYTHONPATH=src python examples/seq2seq/run_translation.py ...

Now you never need to worry about which transformers version is installed in the environment.

Wrt the loss not going down - this is odd, I just ran your code:

PYTHONPATH=src python examples/seq2seq/run_translation.py --model_name_or_path google/mt5-small --do_train --do_eval --source_lang en --target_lang ro --dataset_name wmt16 --dataset_config_name ro-en --output_dir /tmp/test --per_device_train_batch_size=4 --per_device_eval_batch_size=4 --overwrite_output_dir --predict_with_generate --logging_step 10 --fp16

{'loss': 29.7519, 'learning_rate': 4.999781535369424e-05, 'epoch': 0.0}                                                                                           
{'loss': 26.3593, 'learning_rate': 4.9995630707388475e-05, 'epoch': 0.0}                                                                                          
{'loss': 23.4431, 'learning_rate': 4.999344606108271e-05, 'epoch': 0.0}                                                                                           
{'loss': 21.431, 'learning_rate': 4.9991261414776954e-05, 'epoch': 0.0}                                                                                           
{'loss': 19.2445, 'learning_rate': 4.998907676847119e-05, 'epoch': 0.0}                                                                                           
{'loss': 17.8293, 'learning_rate': 4.9986892122165426e-05, 'epoch': 0.0}                                                                                          
{'loss': 16.9441, 'learning_rate': 4.998470747585966e-05, 'epoch': 0.0}                                                                                           
{'loss': 15.7572, 'learning_rate': 4.99825228295539e-05, 'epoch': 0.0}
{'loss': 15.2937, 'learning_rate': 4.9980338183248135e-05, 'epoch': 0.0}
{'loss': 14.4368, 'learning_rate': 4.997815353694237e-05, 'epoch': 0.0}
{'loss': 14.6709, 'learning_rate': 4.997596889063661e-05, 'epoch': 0.0}
{'loss': 13.2806, 'learning_rate': 4.9973784244330843e-05, 'epoch': 0.0}
{'loss': 12.9245, 'learning_rate': 4.997159959802508e-05, 'epoch': 0.0}
{'loss': 12.4647, 'learning_rate': 4.9969414951719316e-05, 'epoch': 0.0}
{'loss': 11.4738, 'learning_rate': 4.996723030541355e-05, 'epoch': 0.0}

Must be your hardware? Try to lower the learning rate?

I tried with 1 or 2 gpus and it worked in both cases.

@dorost1234

Hi @stas00,
Thank you very much for the pointers. I did it as you described, and now I see the loss is going down nicely:

{'loss': 28.1802, 'learning_rate': 4.999890767684712e-05, 'epoch': 0.0}                                                                                                                       
{'loss': 27.4353, 'learning_rate': 4.999781535369424e-05, 'epoch': 0.0}                                                                                                                       
{'loss': 21.3904, 'learning_rate': 4.999672303054136e-05, 'epoch': 0.0}                                                                                                                       
{'loss': 22.8854, 'learning_rate': 4.9995630707388475e-05, 'epoch': 0.0}                                                                                                                      
{'loss': 19.6943, 'learning_rate': 4.999453838423559e-05, 'epoch': 0.0}                                                                                                                       
{'loss': 21.253, 'learning_rate': 4.999344606108271e-05, 'epoch': 0.0}                                                                                                                        
{'loss': 20.1937, 'learning_rate': 4.999235373792983e-05, 'epoch': 0.0}                                                                                                                       
{'loss': 18.6606, 'learning_rate': 4.9991261414776954e-05, 'epoch': 0.0}                                                                                                                      
{'loss': 18.0337, 'learning_rate': 4.9990169091624065e-05, 'epoch': 0.0}                                                                                                                      
{'loss': 16.1259, 'learning_rate': 4.998907676847119e-05, 'epoch': 0.0}                                                                                                                       
{'loss': 15.4007, 'learning_rate': 4.99879844453183e-05, 'epoch': 0.0}                                                                                                                        
{'loss': 15.6753, 'learning_rate': 4.9986892122165426e-05, 'epoch': 0.0}                                                                                                                      
{'loss': 15.0481, 'learning_rate': 4.9985799799012544e-05, 'epoch': 0.0}                                                                                                                      
{'loss': 14.5833, 'learning_rate': 4.998470747585966e-05, 'epoch': 0.0}                                                                                                                       
{'loss': 14.0758, 'learning_rate': 4.998361515270678e-05, 'epoch': 0.0}                                                                                                                       
{'loss': 13.7096, 'learning_rate': 4.99825228295539e-05, 'epoch': 0.0}                                                                                                                        
{'loss': 13.3216, 'learning_rate': 4.998143050640102e-05, 'epoch': 0.0}                                                                                                                       
{'loss': 13.2331, 'learning_rate': 4.9980338183248135e-05, 'epoch': 0.0}                                                                                                                      
{'loss': 12.1556, 'learning_rate': 4.997924586009525e-05, 'epoch': 0.0} 

This is such a great, wonderful, amazing fix. Looking forward to using it when this is pushed to the repository.
For all the hard problems, you are our only hope @stas00
Thank you very much for this great fix.

@stas00 (Contributor) commented Mar 29, 2021

Thank you for your kind words, I'm so happy to hear that it worked, @dorost1234.

I will make a proper PR after I clean this branch up.

@stas00 (Contributor) commented Mar 29, 2021

@yuvalkirstain, please kindly test if this PR fixes the problem: #10956

@yuvalkirstain (Contributor, Author)

Thank you @stas00!
It seems to work where my proposed fix failed with T5-Small. I will now run some additional experiments with T5-Large and update.

@stas00 (Contributor) commented Mar 31, 2021

Thank you for validating that, @yuvalkirstain!

Indeed, I first tried local fixes, but the problem would just pop up elsewhere.

I'm just thinking that perhaps we could find out whether all calls to the FF layer lead to the problem or only some of them, and then optimize the solution I proposed by disabling autocast only in some cases rather than all. I haven't tested that yet.

If you experiment, I recommend trying my branch, since I left the "detector" on and it will immediately tell you when the first inf is encountered.
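
For reference, a minimal version of such a detector (a sketch, not the actual code in the branch) can be built with forward hooks:

import torch

def attach_overflow_detector(model):
    # Register forward hooks that report the first module whose output
    # contains inf or nan.
    def make_hook(name):
        def hook(module, inputs, output):
            outputs = output if isinstance(output, (tuple, list)) else (output,)
            for t in outputs:
                if torch.is_tensor(t) and t.is_floating_point() and not torch.isfinite(t).all():
                    print(f"inf/nan detected in output of {name} ({module.__class__.__name__})")
                    break
        return hook

    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))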

What I'm most interested in is some longer runs to ensure it doesn't start overflowing at a later point.

Thank you for your contribution.

@yuvalkirstain (Contributor, Author) commented Mar 31, 2021

I finetuned T5-Base using this branch with the standard T5 finetuning HPs on NQ (except for batch_size - I used only ~26k tokens) and didn't get nans (it has been running for over 3 hours and training converged). Thanks again; I guess the issue can be closed for the time being.

@stas00 (Contributor) commented Apr 1, 2021

Thank you for this validation, @yuvalkirstain. I would still like to see if we can find a more efficient solution before merging it, but it is great that we have one that works.

This unfortunately doesn't help with deepspeed, since deepspeed doesn't use pytorch AMP but has its own mixed-precision implementation, which doesn't use a context manager and therefore can't be turned off locally the way autocast can. So we hope to find a different solution.

I linked this issue to the PR so it'll get closed automatically when it's merged.

@yuvalkirstain (Contributor, Author)

Well, the nans are back.

T5LayerFF: 1 has inf
T5LayerNorm has inf
T5LayerNorm variance has inf
T5LayerNorm hidden_states has nans
T5LayerNorm hidden_states before return has nans
T5LayerFF: 2 has nans
T5LayerFF: 3 has nans
T5LayerFF: 5 has nans
T5Block after T5LayerFF has nans
T5Stack loop end has nans
T5LayerNorm has nans
T5LayerNorm variance has nans
T5LayerNorm hidden_states has nans
T5LayerNorm hidden_states before return has nans

The model I used here was T5-large-ssm-nqo.
@stas00, if you'd like to replicate this, I can send the relevant training file + command.

@stas00 (Contributor) commented Apr 13, 2021

Yes, please. I'm working in parallel on gpt-neo, which has the same issues, so the more reproducible cases we have, the higher the chances that we can find a solid fix.

Also, those would be good candidates for tests (hoping that we can find a quick way to trigger the overflow).

@stas00 (Contributor) commented Apr 15, 2021

Let's continue the discussion in the PR that is trying to solve this issue: #10956

@Sahajtomar

@dorost1234 Hi, could you please tell me how you solved this loss optimization problem? I am facing the same issue.

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@Oxi84 commented Apr 5, 2023

So is this fix now in the main version of transformers?

@Oxi84 commented Apr 5, 2023

I found that the results are different when you load like this (the first way is better):

model1a_CPU = T5ForConditionalGeneration.from_pretrained(best_model_path, low_cpu_mem_usage=True,torch_dtype=torch.float16).to("cuda")

than when you load via:

model1a_CPU = T5ForConditionalGeneration.from_pretrained(best_model_path, low_cpu_mem_usage=True)
model1a_CPU.half()
model1a_CPU.eval()
model1a_CPU.to("cuda")

So this could be a solution. I will compare the results on CPU versus this versus half().

@Oxi84 commented Apr 6, 2023

It seems like the solution is already implemented in this call:

model1a_CPU = T5ForConditionalGeneration.from_pretrained(best_model_path, low_cpu_mem_usage=True, torch_dtype=torch.float16).to("cuda")

It is probably triggered by torch_dtype=torch.float16, so part of the model is (likely) kept in fp32 instead of fp16, and it works properly - exactly the same as with FP32, and exactly the same as on CPU.

Of course, it does use a little more memory. When you load it the second way, the memory usage is around 2.5 GB for T5-large, while with the first it is around 2.9 GB. It is also around 10-15 percent slower.
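
One way to check which parts of the model end up in which dtype under the two loading paths (a sketch; model_a and model_b are hypothetical handles for the two variants shown above) is to count parameters per dtype:

from collections import Counter

def dtype_summary(model):
    # Count parameters per dtype to see which parts, if any, stayed in fp32.
    return dict(Counter(str(p.dtype) for p in model.parameters()))

# Hypothetical usage with the two loading variants above:
# print(dtype_summary(model_a))  # loaded with torch_dtype=torch.float16
# print(dtype_summary(model_b))  # loaded in fp32 and then .half()-ed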
