
ResourceExhaustedError (see above for traceback): OOM when allocating tensor #118

Closed
jageshmaharjan opened this issue Nov 14, 2018 · 7 comments


@jageshmaharjan

I created the pre-training tf_record from the sample data ("sample_text.txt") and then tried to run run_pretraining to train on that small dataset, but I got this error:

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[4096,3072] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[Node: bert/encoder/layer_10/intermediate/dense/truediv = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](bert/encoder/layer_10/intermediate/dense/BiasAdd, ConstantFolding/gradients/bert/encoder/layer_0/intermediate/dense/truediv_grad/RealDiv_recip)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

	 [[Node: add_1/_4159 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3653_add_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
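Following the hint above, here is a minimal TF 1.x sketch (not from this repo; the op is just a placeholder) of attaching report_tensor_allocations_upon_oom to RunOptions, so that an OOM also prints the tensors that were allocated at the time:

import tensorflow as tf  # TF 1.x

# Dummy op standing in for the real training op.
x = tf.random_normal([1024, 1024])
y = tf.matmul(x, x)

# Ask TensorFlow to report allocated tensors if this run hits an OOM.
run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)

with tf.Session() as sess:
    sess.run(y, options=run_options)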

My script to create the tf_record data is:
python create_pretraining_data.py --input_file=sample_text.txt --output_file=/tmp/tf.examples.tfrecord --vocab_file=/home/maybe/bert/model/uncased_L-12_H-768_A-12/vocab.txt --do_lower_case=True --max_seq_length=128 --max_predictions_per_seq=20 --masked_lm_prob=0.15 --random_seed=12345 --dupe_factor=5

and the script to pre-train on the small sample data is:

python run_pretraining.py --input_file=/tmp/tf.examples.tfrecord --output_dir=/tmp/pretraining_output --do_train=true --do_eval=true --bert_config_file=/home/maybe/bert/model/uncased_L-12_H-768_A-12/bert_config.json --init_checkpoint=/home/maybe/bert/model/uncased_L-12_H-768_A-12/bert_model.ckpt --train_batch_size=32 --max_seq_length=128 --max_predictions_per_seq=20 --num_train_steps=20 --num_warmup_steps=10 --learning_rate=2e-5

I know this error; I usually encounter it when another program is using the same GPU, or when the TensorFlow graph is being initialized. But at the moment nothing else is running and all the resources are available/free.
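As a quick sanity check (standard NVIDIA tooling, nothing specific to this repo), nvidia-smi lists per-process GPU memory usage, which confirms that no other process is holding memory on the cards:

nvidia-smi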

@jacobdevlin-google
Contributor

The memory usage has almost nothing to do with the size of the input file. The example code assumes your GPU has ~12GB of memory; if it has less, you'll need to use a smaller batch size.
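For example (the batch size of 8 below is just an illustrative value; the right number depends on the card), the same command as above with only --train_batch_size changed:

python run_pretraining.py --input_file=/tmp/tf.examples.tfrecord --output_dir=/tmp/pretraining_output --do_train=true --do_eval=true --bert_config_file=/home/maybe/bert/model/uncased_L-12_H-768_A-12/bert_config.json --init_checkpoint=/home/maybe/bert/model/uncased_L-12_H-768_A-12/bert_model.ckpt --train_batch_size=8 --max_seq_length=128 --max_predictions_per_seq=20 --num_train_steps=20 --num_warmup_steps=10 --learning_rate=2e-5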

@jageshmaharjan
Author

jageshmaharjan commented Nov 14, 2018

Hi @jacobdevlin-google ,
Yes, I do have more than 12GB of GPU memory in total; I am using Tesla M60 (8GB) x 4.
I'll also try with a smaller batch size.

@jageshmaharjan
Author

Surprisingly, it works with the smaller batch size. However, I have plenty of GPU memory in my server.

INFO:tensorflow:Finished evaluation at 2018-11-14-06:07:45
INFO:tensorflow:Saving dict for global step 20: global_step = 20, loss = 0.9312667, masked_lm_accuracy = 0.8223652, masked_lm_loss = 0.9282017, next_sentence_accuracy = 1.0, next_sentence_loss = 0.004190466
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 20: /tmp/pretraining_output/model.ckpt-20
INFO:tensorflow:***** Eval results *****
INFO:tensorflow:  global_step = 20
INFO:tensorflow:  loss = 0.9312667
INFO:tensorflow:  masked_lm_accuracy = 0.8223652
INFO:tensorflow:  masked_lm_loss = 0.9282017
INFO:tensorflow:  next_sentence_accuracy = 1.0
INFO:tensorflow:  next_sentence_loss = 0.004190466
(asr) maybe@maybe1:~/bert$ 

@jacobdevlin-google
Contributor

Each 8GB card is a separate device, right? This code doesn't support multi-GPU, so you would have to modify it or look for a fork that does.
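For anyone looking into such a modification, one hedged sketch (not part of this repository, and run_pretraining.py would need to be rewritten around a plain Estimator rather than TPUEstimator) is TF 1.x's tf.contrib.distribute.MirroredStrategy passed to the RunConfig:

import tensorflow as tf  # TF 1.x

# Hypothetical sketch only: mirror the model across 4 local GPUs.
strategy = tf.contrib.distribute.MirroredStrategy(num_gpus=4)

run_config = tf.estimator.RunConfig(
    model_dir="/tmp/pretraining_output",
    train_distribute=strategy)

# estimator = tf.estimator.Estimator(model_fn=model_fn, config=run_config)
# model_fn here stands for a model function adapted from run_pretraining.py.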

@jageshmaharjan
Author

Yes, they are separate GPU cards with 8GB each. I'll look for a way to use multiple GPUs for this task and will fork it if I find a solution soon. I'll close this issue for now.
Thanks @jacob

@Gpwner

Gpwner commented Jan 5, 2019

So, did you find a solution to use multiple GPUs?

@echan00

echan00 commented Jan 10, 2019

Same here, also looking for a multi-GPU version.
