
run_pretraining.py - clip gradient error: Found Inf or NaN global norm: Tensor had NaN values #82

Closed
xwzhong opened this issue Nov 8, 2018 · 22 comments


@xwzhong

xwzhong commented Nov 8, 2018

Hi, I get an InvalidArgumentError when running run_pretraining.py; it shows "Found Inf or NaN global norm: Tensor had NaN values".
Using my own data, I set the parameters as follows:
train batch size: 32
max seq length: 64 (99% of articles have at most 46 words)
max predictions per seq: 10
learning rate: 2e-5

At the beginning I googled it; someone suggested using a smaller learning rate, but I found that only delays the InvalidArgumentError, so I don't think the learning rate is the key reason. I also tried tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y) as suggested, but sadly I still get the same error.

Tracing the error: (grads, _) = tf.clip_by_global_norm(grads, clip_norm=1.0) -> clip_ops.py line 259; it shows the global_norm calculation fails.

Why do you think the error happens? Did you ever hit it yourself?
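
A minimal sketch of where the message comes from (an editor's illustration, assuming TensorFlow 1.x as used here): tf.clip_by_global_norm verifies that the global norm is finite, so a single NaN gradient is enough to trigger this InvalidArgumentError at run time.

```python
# Illustration only: a NaN in any gradient makes the global norm NaN, and the
# check inside tf.clip_by_global_norm raises InvalidArgumentError when run.
import tensorflow as tf  # TensorFlow 1.x

grads = [tf.constant([1.0, 2.0]), tf.constant([float("nan"), 3.0])]
clipped, _ = tf.clip_by_global_norm(grads, clip_norm=1.0)

with tf.Session() as sess:
    try:
        sess.run(clipped)
    except tf.errors.InvalidArgumentError as e:
        print(e.message)  # Found Inf or NaN global norm. : Tensor had NaN values
```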

@xwzhong xwzhong changed the title pretrain-clip gradient error: found Inf or NaN global norm: Tensor had NaN values run_pretraining.py - clip gradient error: Found Inf or NaN global norm: Tensor had NaN values Nov 8, 2018
@zkailinzhang

Setting a lower batch_size makes it run OK.

@xwzhong
Author

xwzhong commented Nov 9, 2018

@zkl99999 do you know what causes the error?

@mleonrivas

I am seeing the same error.
I tried bumping the batch size up to 8192, but that just delays the error.
Lowering the batch size makes it happen faster.
Any idea what's happening?

@jacobdevlin-google
Contributor

I think I just realized what the problem might be: are you using a different vocabulary but the same bert_config.json file? The vocabulary size is specified in this file, so if a larger vocabulary is used, the model will do out-of-bounds lookups (which are unchecked on the GPU or TPU).

If you are using the same vocabulary, are you starting from the BERT checkpoint or from scratch?
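
A quick sanity check based on this explanation (an editor's sketch; the file paths are placeholders for your own vocab and config): vocab_size in bert_config.json must be at least as large as the number of entries in vocab.txt, and is typically set equal to it.

```python
# Sketch: compare the number of lines in vocab.txt with vocab_size in
# bert_config.json; a config value smaller than the real vocabulary leads to
# out-of-range embedding lookups and, eventually, NaN gradients.
import json

with open("vocab.txt", encoding="utf-8") as f:
    num_vocab_entries = sum(1 for _ in f)

with open("bert_config.json") as f:
    config_vocab_size = json.load(f)["vocab_size"]

if config_vocab_size < num_vocab_entries:
    raise ValueError(
        "vocab_size in bert_config.json (%d) is smaller than vocab.txt (%d entries)"
        % (config_vocab_size, num_vocab_entries))
```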

@xwzhong
Author

xwzhong commented Nov 12, 2018

Fantastic, you are right: I used the same bert_config.json but changed the vocab file, which created a gap between the vocab_size in bert_config.json and the true vocabulary size. That is when the error happens; after fixing that, it is gone. Thanks a lot.

@jacobdevlin-google
Contributor

Cool, I will make sure to add this in bold font in the pre-training section of the README.

@xwzhong xwzhong closed this as completed Nov 13, 2018
@ohwe

ohwe commented Jan 12, 2019

Hi! I get exactly the same error after global_step=110000 (so, I guess, a misconfiguration is very unlikely).

I did shrink my vocabulary to 16k tokens, but I fixed bert_config.json accordingly and still get the error.

  File "/yt/ssd1/hahn-data/slots/0/sandbox/d/in/script/0_script_unpacked/bert_fp16/run_pretraining.py", line 547, in <module>
    tf.app.run()
  File "/yt/ssd1/hahn-data/slots/0/sandbox/j/.local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/yt/ssd1/hahn-data/slots/0/sandbox/d/in/script/0_script_unpacked/bert_fp16/run_pretraining.py", line 509, in main
    estimator.train(input_fn=train_input_fn, max_steps=FLAGS.num_train_steps)
  File "/yt/ssd1/hahn-data/slots/0/sandbox/j/.local/lib/python2.7/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2394, in train
    saving_listeners=saving_listeners
  File "/yt/ssd1/hahn-data/slots/0/sandbox/j/.local/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 356, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/yt/ssd1/hahn-data/slots/0/sandbox/j/.local/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 1181, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/yt/ssd1/hahn-data/slots/0/sandbox/j/.local/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 1211, in _train_model_default
    features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
  File "/yt/ssd1/hahn-data/slots/0/sandbox/j/.local/lib/python2.7/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2186, in _call_model_fn
    features, labels, mode, config)
  File "/yt/ssd1/hahn-data/slots/0/sandbox/j/.local/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 1169, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/yt/ssd1/hahn-data/slots/0/sandbox/j/.local/lib/python2.7/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2470, in _model_fn
    features, labels, is_export_mode=is_export_mode)
  File "/yt/ssd1/hahn-data/slots/0/sandbox/j/.local/lib/python2.7/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1250, in call_without_tpu
    return self._call_model_fn(features, labels, is_export_mode=is_export_mode)
  File "/yt/ssd1/hahn-data/slots/0/sandbox/j/.local/lib/python2.7/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1524, in _call_model_fn
    estimator_spec = self._model_fn(features=features, **kwargs)
  File "/yt/ssd1/hahn-data/slots/0/sandbox/d/in/script/0_script_unpacked/bert_fp16/run_pretraining.py", line 192, in model_fn
    loss_scale)
  File "/yt/ssd1/hahn-data/slots/0/sandbox/d/in/script/0_script_unpacked/bert_fp16/optimization.py", line 86, in create_optimizer
    (grads, _) = tf.clip_by_global_norm(grads, clip_norm=1.0)
  File "/yt/ssd1/hahn-data/slots/0/sandbox/j/.local/lib/python2.7/site-packages/tensorflow/python/ops/clip_ops.py", line 259, in clip_by_global_norm
    "Found Inf or NaN global norm.")
  File "/yt/ssd1/hahn-data/slots/0/sandbox/j/.local/lib/python2.7/site-packages/tensorflow/python/ops/numerics.py", line 45, in verify_tensor_all_finite
    verify_input = array_ops.check_numerics(t, message=msg)
  File "/yt/ssd1/hahn-data/slots/0/sandbox/j/.local/lib/python2.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 817, in check_numerics
    "CheckNumerics", tensor=tensor, message=message, name=name)
  File "/yt/ssd1/hahn-data/slots/0/sandbox/j/.local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/yt/ssd1/hahn-data/slots/0/sandbox/j/.local/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/yt/ssd1/hahn-data/slots/0/sandbox/j/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3272, in create_op
    op_def=op_def)
  File "/yt/ssd1/hahn-data/slots/0/sandbox/j/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1768, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): Found Inf or NaN global norm. : Tensor had NaN values
	 [[{{node VerifyFinite/CheckNumerics}} = CheckNumerics[T=DT_FLOAT, message="Found Inf or NaN global norm.", _device="/job:localhost/replica:0/task:0/device:GPU:0"](global_norm/global_norm)]]

@eunicechen1987

I have the same error.
I use my own vocab with size 51722, and I updated it in the config.
When I use mixed float (fp16) from the NVIDIA PR, this error happens. When I don't use mixed float, the error doesn't happen!
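
For context, a hedged sketch of the usual fp16 loss-scaling pattern (an editor's illustration, not the NVIDIA PR's actual code; the function and parameter names are made up): with mixed precision the loss is scaled up before computing gradients and the gradients are scaled back down, and a poorly chosen scale can overflow to Inf/NaN, which then surfaces in tf.clip_by_global_norm.

```python
# Sketch of static loss scaling for fp16 training (illustrative only).
import tensorflow as tf  # TensorFlow 1.x

def compute_scaled_gradients(loss, tvars, loss_scale=128.0):
    # Scale the loss so small fp16 gradients do not underflow to zero.
    scaled_loss = loss * loss_scale
    grads = tf.gradients(scaled_loss, tvars)
    # Unscale before clipping/applying; an overflow shows up here as Inf/NaN.
    return [g / loss_scale if g is not None else None for g in grads]
```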

@xwzhong
Author

xwzhong commented Jan 16, 2019

That's odd; in my experiment, after I fixed the config, the error didn't happen again (trained for more than 6 million steps).

@minmummax

Fantastic, you are right: I used the same bert_config.json but changed the vocab file, which created a gap between the vocab_size in bert_config.json and the true vocabulary size. That is when the error happens; after fixing that, it is gone. Thanks a lot.

I recently met this problem too. How did you fix it? Did you change vocab.txt's size, or something else?

@xwzhong
Author

xwzhong commented Jan 18, 2019

Fantastic, you are right: I used the same bert_config.json but changed the vocab file, which created a gap between the vocab_size in bert_config.json and the true vocabulary size. That is when the error happens; after fixing that, it is gone. Thanks a lot.

I recently met this problem too. How did you fix it? Did you change vocab.txt's size, or something else?

I changed "vocab_size" in bert_config.json.

@minmummax

But I still get this problem after changing the JSON file's vocab_size to match the vocab file's size.

@xwzhong
Author

xwzhong commented Jan 18, 2019

But I still get this problem after changing the JSON file's vocab_size to match the vocab file's size.

For now I can't tell you why it happens. I will check my code, and if I find something I will reply here.

@minmummax

But I still get this problem after changing the JSON file's vocab_size to match the vocab file's size.

For now I can't tell you why it happens. I will check my code, and if I find something I will reply here.

All right, thanks!

@yunchaosuper

I still don't understand. Which parameter did you change?

{
"attention_probs_dropout_prob": 0.1,
"directionality": "bidi",
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"max_position_embeddings": 512,
"num_attention_heads": 12,
"num_hidden_layers": 12,
"pooler_fc_size": 768,
"pooler_num_attention_heads": 12,
"pooler_num_fc_layers": 3,
"pooler_size_per_head": 128,
"pooler_type": "first_token_transform",
"type_vocab_size": 2,
"vocab_size": 21128
}

@xwzhong
Author

xwzhong commented Jan 20, 2019

@yunchaosuper I changed "vocab_size".

@yunchaosuper

@xwzhong so you changed vocab_size from 21128 to what? Kindly help with that.

@PeterPanUnderhill

I think I just realized what the problem might be: are you using a different vocabulary but the same bert_config.json file? The vocabulary size is specified in this file, so if a larger vocabulary is used, the model will do out-of-bounds lookups (which are unchecked on the GPU or TPU).

If you are using the same vocabulary, are you starting from the BERT checkpoint or from scratch?

Hi Jacob, I am using pretrained BERT together with other networks, but during fine-tuning I also hit this NaN global norm problem. What do you mean by out-of-bounds lookups? The dataset I use does have OOV words, but what causes the NaN global norm? Only when all the tokens in the sentence are unknown words?

Thanks in advance.

@brightmart

@ohwe were you able to solve the problem? After 110000 steps, the NaN error happened.

@brightmart

brightmart commented Sep 4, 2019

I am pre-training BERT with a large amount of data; after 110000 steps the loss is around 1.4.
But after stopping training and trying to resume (by setting the init checkpoint and restoring from the 110000-step checkpoint), the NaN error happened. I ran it several times; in other runs the NaN came from another layer, not layer_0.

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1356, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: From /job:worker/replica:0/task:0:
Gradient for bert/encoder/layer_0/output/dense/bias:0 is NaN : Tensor had NaN values
	 [[{{node CheckNumerics_18}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run_pretraining.py", line 497, in <module>
    tf.app.run()

Can someone help, or does anyone have an idea?

@xwzhong @zkl99999 @mleonrivas @jacobdevlin-google @ohwe
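
A hedged diagnostic sketch (an editor's suggestion, not from this thread; the checkpoint path is a placeholder): scan the checkpoint you are resuming from for non-finite values, to rule out a corrupted checkpoint as the source of the NaN gradients.

```python
# Sketch: inspect every floating-point variable in a checkpoint for NaN/Inf.
import numpy as np
import tensorflow as tf  # TensorFlow 1.x

reader = tf.train.load_checkpoint("model.ckpt-110000")  # placeholder path
for name in reader.get_variable_to_shape_map():
    value = reader.get_tensor(name)
    if np.issubdtype(value.dtype, np.floating) and not np.all(np.isfinite(value)):
        print("Non-finite values found in", name)
```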

@shirayu

shirayu commented Jan 20, 2020

I faced the same error (Tensor had NaN values) at the beginning of training.
It was caused by an invalid learning rate: I mistakenly set it to 5e5 instead of 5e-5.

So it might be related to the learning rate.

@Yiwen-Yang-666

I think I just realized what the problem might be: are you using a different vocabulary but the same bert_config.json file? The vocabulary size is specified in this file, so if a larger vocabulary is used, the model will do out-of-bounds lookups (which are unchecked on the GPU or TPU).

If you are using the same vocabulary, are you starting from the BERT checkpoint or from scratch?

I added a tensor for the additional vocabulary embeddings, concatenated it with the original embedding tensor for the tokens in the original vocab file, and the problem was solved.

The reason might be that the tf.gather call in the embedding_lookup function has no rows for the additional vocabulary ids, so there are no embeddings for the added tokens, and predicting masked additional tokens cannot be matched with those input tokens.
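
A minimal sketch of that workaround (an editor's illustration with made-up names, assuming TensorFlow 1.x and an original BERT embedding table of shape [orig_vocab_size, hidden_size]): append a trainable table for the new tokens and gather from the concatenated table, so the extended token ids stay in range.

```python
# Sketch: extend the pretrained word-embedding table with rows for new tokens.
import tensorflow as tf  # TensorFlow 1.x

def extended_embedding_lookup(input_ids, orig_embedding_table,
                              num_extra_tokens, hidden_size):
    # New, trainable embeddings for the added vocabulary entries.
    extra_table = tf.get_variable(
        "extra_word_embeddings",
        shape=[num_extra_tokens, hidden_size],
        initializer=tf.truncated_normal_initializer(stddev=0.02))
    # Ids >= orig_vocab_size now resolve to the extra rows instead of
    # going out of bounds.
    full_table = tf.concat([orig_embedding_table, extra_table], axis=0)
    return tf.gather(full_table, input_ids)
```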
