
ResourceExhaustedError: OOM when allocating tensor with shape[59,128,195370] #301

Closed

ssokhey opened this issue Oct 4, 2017 · 4 comments

ssokhey commented Oct 4, 2017

Before posting this issue, I tried the solutions suggested in #43, #48, and #132.

Device details:

OS: Ubuntu 16.04.3

name: Tesla K40c
major: 3 minor: 5 memoryClockRate (GHz) 0.745
Total memory: 11.17GiB
Free memory: 11.10GiB

Some of the logs are:

INFO:tensorflow:Creating vocabulary lookup table of size 276811
INFO:tensorflow:Creating vocabulary lookup table of size 195370
INFO:tensorflow:Creating BidirectionalRNNEncoder in mode=train
INFO:tensorflow:
BidirectionalRNNEncoder:
  init_scale: 0.04
  rnn_cell:
    cell_class: LSTMCell
    cell_params: {num_units: 256}
    dropout_input_keep_prob: 0.8
    dropout_output_keep_prob: 1.0
    num_layers: 2
    residual_combiner: add
    residual_connections: false
    residual_dense: false

I provided around 600,000 parallel sentences for training, and it fails while allocating a tensor whose last dimension is the target vocabulary size (195,370). After reducing the training data to just 100,000 sentences, training now runs. But is the dataset size really the problem? I plan to train on a much larger dataset in the future, so please let me know what the actual issue is.
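As a rough sanity check (my own back-of-the-envelope estimate, assuming the decoder logits tensor is what fails to allocate): the tensor in the error has shape [59, 128, 195370], i.e. decoder length × batch size × target vocabulary size, and at 4 bytes per float32 (DT_FLOAT) element that single tensor is about 5.5 GiB, which matches the "ran out of memory trying to allocate 5.50GiB" line in the logs below.

# Back-of-the-envelope size of the tensor named in the OOM error.
# Shape is taken from the error message; 4-byte float32 elements assumed.
decode_len = 59       # decoder time steps for this batch
batch_size = 128      # training batch size
vocab_size = 195370   # target vocabulary size reported in the logs

bytes_per_elem = 4    # float32
tensor_bytes = decode_len * batch_size * vocab_size * bytes_per_elem
print("logits tensor: %.2f GiB" % (tensor_bytes / 2**30))  # ~5.50 GiB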

Here are some other logs that might help.

2017-10-05 00:15:23.506374: I tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0 
2017-10-05 00:15:23.506378: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0:   Y 
2017-10-05 00:15:23.506382: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K40c, pci bus id: 0000:01:00.0)
2017-10-05 00:15:38.025378: W tensorflow/core/common_runtime/bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran out of memory trying to allocate 5.50GiB.  Current allocation summary follows.
2017-10-05 00:15:38.025454: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (256): 	Total Chunks: 14, Chunks in use: 0 3.5KiB allocated for chunks. 178B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-10-05 00:15:38.025486: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (512): 	Total Chunks: 9, Chunks in use: 0 5.5KiB allocated for chunks. 1.6KiB client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-10-05 00:15:38.025506: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (1024): 	Total Chunks: 3, Chunks in use: 0 3.5KiB allocated for chunks. 12B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-10-05 00:15:38.025525: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (2048): 	Total Chunks: 2, Chunks in use: 0 5.5KiB allocated for chunks. 8B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-10-05 00:15:38.025542: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (4096): 	Total Chunks: 1, Chunks in use: 0 7.5KiB allocated for chunks. 7.5KiB client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-10-05 00:15:38.025556: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (8192): 	Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-10-05 00:15:38.025579: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (16384): 	Total Chunks: 4, Chunks in use: 0 120.0KiB allocated for chunks. 120.0KiB client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-10-05 00:15:38.025595: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (32768): 	Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-10-05 00:15:38.025611: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (65536): 	Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-10-05 00:15:38.025638: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (131072): 	Total Chunks: 62, Chunks in use: 0 7.75MiB allocated for chunks. 7.75MiB client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-10-05 00:15:38.025669: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (262144): 	Total Chunks: 64, Chunks in use: 0 21.09MiB allocated for chunks. 14.17MiB client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-10-05 00:15:38.025686: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (524288): 	Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-10-05 00:15:38.025704: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (1048576): 	Total Chunks: 1, Chunks in use: 0 1.17MiB allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-10-05 00:15:38.025720: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (2097152): 	Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-10-05 00:15:38.025736: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (4194304): 	Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-10-05 00:15:38.025752: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (8388608): 	Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-10-05 00:15:38.025766: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (16777216): 	Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-10-05 00:15:38.025780: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (33554432): 	Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-10-05 00:15:38.025795: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (67108864): 	Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-10-05 00:15:38.025812: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (134217728): 	Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-10-05 00:15:38.025830: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (268435456): 	Total Chunks: 1, Chunks in use: 0 1000.74MiB allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-10-05 00:15:38.025848: I tensorflow/core/common_runtime/bfc_allocator.cc:660] Bin for 5.50GiB was 256.00MiB, Chunk State: 
2017-10-05 00:15:38.025872: I tensorflow/core/common_runtime/bfc_allocator.cc:666]   Size: 1000.74MiB | Requested Size: 0B | in_use: 0, prev:   Size: 95.40MiB | Requested Size: 95.40MiB | in_use: 1

This is followed by a large number of chunk listings like these:

2017-10-05 00:15:38.025888: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x705ba0000 of size 1280
2017-10-05 00:15:38.025901: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x705ba0500 of size 256
2017-10-05 00:15:38.025911: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x705ba0600 of size 256
2017-10-05 00:15:38.025922: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x705ba0700 of size 4096
2017-10-05 00:15:38.025934: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x705ba1700 of size 256
2017-10-05 00:15:38.025946: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x705ba1800 of size 4096
2017-10-05 00:15:38.025966: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x705ba2800 of size 256
2017-10-05 00:15:38.025978: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x705ba2900 of size 4096
2017-10-05 00:15:38.025990: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x705ba3900 of size 256
2017-10-05 00:15:38.026002: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x705ba3a00 of size 4096
2017-10-05 00:15:38.026015: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x705ba4a00 of size 256
2017-10-05 00:15:38.026026: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x705ba4b00 of size 256
2017-10-05 00:15:38.026038: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x705ba4c00 of size 4096
2017-10-05 00:15:38.026048: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x705ba5c00 of size 256
2017-10-05 00:15:38.026057: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x705ba5d00 of size 256
2017-10-05 00:15:38.026069: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x705ba5e00 of size 256
2017-10-05 00:15:38.026078: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x705ba5f00 of size 256
2017-10-05 00:15:38.026087: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x705ba6000 of size 256
2017-10-05 00:15:38.026101: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x705ba6100 of size 256
2017-10-05 00:15:38.026114: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x705ba6200 of size 256
2017-10-05 00:15:38.026125: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x705ba6300 of size 256
2017-10-05 00:15:38.026137: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x705ba6400 of size 256
2017-10-05 00:15:38.026150: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x705ba6500 of size 283454464
2017-10-05 00:15:38.026163: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x7169f9100 of size 2097152

In the end:

W tensorflow/core/common_runtime/bfc_allocator.cc:277] *******************************************************************************************_________

 W tensorflow/core/framework/op_kernel.cc:1152] Resource exhausted: OOM when allocating tensor with shape[59,128,195370]

 PoolAllocator: After 6333 get requests, put_count=4184 evicted_count=1000 eviction_rate=0.239006 and unsatisfied allocation rate=0.513027
2017-10-04 23:29:10.896694: I tensorflow/core/common_runtime/gpu/pool_allocator.cc:259] Raising pool_size_limit_ from 100 to 110
Traceback (most recent call last):
  File "/home/singh/tensorflow-gpu/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1039, in _do_call
    return fn(*args)
  File "/home/singh/tensorflow-gpu/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1021, in _run_fn
    status, run_metadata)
  File "/usr/lib/python3.5/contextlib.py", line 66, in __exit__
    next(self.gen)
  File "/home/singh/tensorflow-gpu/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[59,128,195370]
	 [[Node: model/att_seq2seq/decode/attention_decoder/decoder/TensorArrayStack/TensorArrayGatherV3 = TensorArrayGatherV3[_class=["loc:@model/att_seq2seq/decode/attention_decoder/decoder/TensorArray"], dtype=DT_FLOAT, element_shape=[128,195370], _device="/job:localhost/replica:0/task:0/gpu:0"](model/att_seq2seq/decode/attention_decoder/decoder/TensorArray, model/att_seq2seq/decode/attention_decoder/decoder/TensorArrayStack/range, model/att_seq2seq/decode/attention_decoder/decoder/while/Exit_1)]]
	 [[Node: model/att_seq2seq/OptimizeLoss/gradients/model/att_seq2seq/decode/attention_decoder/decoder/while/CustomHelperNextInputs/TrainingHelperNextInputs/cond/Merge_grad/tuple/control_dependency/_501 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_1548_model/att_seq2seq/OptimizeLoss/gradients/model/att_seq2seq/decode/attention_decoder/decoder/while/CustomHelperNextInputs/TrainingHelperNextInputs/cond/Merge_grad/tuple/control_dependency", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](^_cloopmodel/att_seq2seq/OptimizeLoss/gradients/model/att_seq2seq/decode/attention_decoder/decoder/while/CustomHelperNextInputs/TrainingHelperNextInputs/cond/TensorArrayReadV3_grad/TensorArrayWrite/TensorArrayWriteV3/Switch/_369)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/singh/project1_nmt/s2s/seq2seq/bin/train.py", line 277, in <module>
    tf.app.run()
  File "/home/singh/tensorflow-gpu/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/home/singh/project1_nmt/s2s/seq2seq/bin/train.py", line 272, in main
    schedule=FLAGS.schedule)
  File "/home/singh/tensorflow-gpu/lib/python3.5/site-packages/tensorflow/contrib/learn/python/learn/learn_runner.py", line 111, in run
    return _execute_schedule(experiment, schedule)
  File "/home/singh/tensorflow-gpu/lib/python3.5/site-packages/tensorflow/contrib/learn/python/learn/learn_runner.py", line 46, in _execute_schedule
    return task()
  File "/home/singh/project1_nmt/s2s/seq2seq/seq2seq/contrib/experiment.py", line 104, in continuous_train_and_eval
    monitors=self._train_monitors)
  File "/home/singh/tensorflow-gpu/lib/python3.5/site-packages/tensorflow/python/util/deprecation.py", line 281, in new_func
    return func(*args, **kwargs)
  File "/home/singh/tensorflow-gpu/lib/python3.5/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 430, in fit
    loss = self._train_model(input_fn=input_fn, hooks=hooks)
  File "/home/singh/tensorflow-gpu/lib/python3.5/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 978, in _train_model
    _, loss = mon_sess.run([model_fn_ops.train_op, model_fn_ops.loss])
  File "/home/singh/tensorflow-gpu/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 484, in run
    run_metadata=run_metadata)
  File "/home/singh/tensorflow-gpu/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 820, in run
    run_metadata=run_metadata)
  File "/home/singh/tensorflow-gpu/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 776, in run
    return self._sess.run(*args, **kwargs)
  File "/home/singh/tensorflow-gpu/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 930, in run
    run_metadata=run_metadata)
  File "/home/singh/tensorflow-gpu/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 776, in run
    return self._sess.run(*args, **kwargs)
  File "/home/singh/tensorflow-gpu/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 778, in run
    run_metadata_ptr)
  File "/home/singh/tensorflow-gpu/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 982, in _run
    feed_dict_string, options, run_metadata)
  File "/home/singh/tensorflow-gpu/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1032, in _do_run
    target_list, options, run_metadata)
  File "/home/singh/tensorflow-gpu/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1052, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[59,128,195370]
ssokhey (Author) commented Oct 5, 2017

Changing the batch size from 128 to 48 works for me now. Still, I would like to know what caused this problem.

serser commented Feb 23, 2018

Probably memory exhaustion due to the batch size.
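To make that concrete (my own estimate, not something from the logs, and assuming the decoder logits tensor dominates the allocation): the failing tensor has shape [decode_len, batch_size, vocab_size], so its size grows linearly with the batch size. A small sketch of the scaling:

# Sketch of how the OOM tensor's size scales with batch size.
# decode_len and vocab_size are taken from the error above; float32 assumed.
def logits_gib(decode_len, batch_size, vocab_size, bytes_per_elem=4):
    return decode_len * batch_size * vocab_size * bytes_per_elem / 2**30

for bs in (128, 48, 24, 12):
    print("batch_size=%3d -> ~%.2f GiB" % (bs, logits_gib(59, bs, 195370)))
# 128 -> ~5.50 GiB, 48 -> ~2.06 GiB, 24 -> ~1.03 GiB, 12 -> ~0.52 GiB

On an 11 GiB card the rest of the graph (embeddings, LSTM weights, activations, gradients, optimizer state) also needs room, which is why a batch size of 128 fails here while smaller batches fit.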

ssokhey closed this as completed May 4, 2018
BanuSelinTosun commented

This is happening to me when I try to run inference via app.py in Google's recommended web app for TF. There is no entry for batch size there. Do you know where I can find it? My config file has the batch size set to 1. Is it set somewhere else as well?

andrewk0617 commented

Thanks ssokhey! I changed the batch size from 48 to 24, which still didn't work, then to 12, which works for me now. Hm, I need a better GPU :)
