I provided around 600,000 parallel sentences for training, and TensorFlow fails to allocate a tensor whose vocabulary dimension is 195,370. I've already tried the solutions suggested in #43, #48, and #132. I finally reduced the training data to just 100,000 sentences, and training is running now. But is it really an issue with the dataset size? I am planning to train on a much larger dataset in the future, so please let me know what the actual problem is.
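For scale, the failing tensor in the logs below has shape [59, 128, 195370], i.e. [decoder steps, batch size, vocabulary size]; its footprint depends on the batch size, sequence length, and vocabulary, not directly on the number of training sentences (shrinking the dataset only helps indirectly, by shrinking the learned vocabulary). A quick back-of-the-envelope check, assuming float32 logits:

```python
# Memory needed for the decoder logits tensor reported in the OOM below.
# Shape is [time, batch, vocab]; float32 = 4 bytes per element.
seq_len, batch_size, vocab_size = 59, 128, 195370
bytes_needed = seq_len * batch_size * vocab_size * 4
print("%.2f GiB" % (bytes_needed / 2.0**30))  # -> 5.50 GiB, matching the log
```

That single tensor already accounts for the 5.50 GiB request the allocator cannot satisfy on a 12 GB K40c once the model's other buffers are resident.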
Here are some other logs that might help.
2017-10-05 00:15:23.506374: I tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0
2017-10-05 00:15:23.506378: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0: Y
2017-10-05 00:15:23.506382: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K40c, pci bus id: 0000:01:00.0)
2017-10-05 00:15:38.025378: W tensorflow/core/common_runtime/bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran out of memory trying to allocate 5.50GiB. Current allocation summary follows.
2017-10-05 00:15:38.025454: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (256): Total Chunks: 14, Chunks in use: 0 3.5KiB allocated for chunks. 178B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-10-05 00:15:38.025486: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (512): Total Chunks: 9, Chunks in use: 0 5.5KiB allocated for chunks. 1.6KiB client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-10-05 00:15:38.025506: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (1024): Total Chunks: 3, Chunks in use: 0 3.5KiB allocated for chunks. 12B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-10-05 00:15:38.025525: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (2048): Total Chunks: 2, Chunks in use: 0 5.5KiB allocated for chunks. 8B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-10-05 00:15:38.025542: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (4096): Total Chunks: 1, Chunks in use: 0 7.5KiB allocated for chunks. 7.5KiB client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-10-05 00:15:38.025556: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (8192): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-10-05 00:15:38.025579: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (16384): Total Chunks: 4, Chunks in use: 0 120.0KiB allocated for chunks. 120.0KiB client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-10-05 00:15:38.025595: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (32768): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-10-05 00:15:38.025611: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (65536): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-10-05 00:15:38.025638: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (131072): Total Chunks: 62, Chunks in use: 0 7.75MiB allocated for chunks. 7.75MiB client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-10-05 00:15:38.025669: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (262144): Total Chunks: 64, Chunks in use: 0 21.09MiB allocated for chunks. 14.17MiB client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-10-05 00:15:38.025686: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (524288): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-10-05 00:15:38.025704: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (1048576): Total Chunks: 1, Chunks in use: 0 1.17MiB allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-10-05 00:15:38.025720: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (2097152): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-10-05 00:15:38.025736: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (4194304): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-10-05 00:15:38.025752: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (8388608): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-10-05 00:15:38.025766: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (16777216): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-10-05 00:15:38.025780: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (33554432): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-10-05 00:15:38.025795: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (67108864): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-10-05 00:15:38.025812: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (134217728): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-10-05 00:15:38.025830: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (268435456): Total Chunks: 1, Chunks in use: 0 1000.74MiB allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-10-05 00:15:38.025848: I tensorflow/core/common_runtime/bfc_allocator.cc:660] Bin for 5.50GiB was 256.00MiB, Chunk State:
2017-10-05 00:15:38.025872: I tensorflow/core/common_runtime/bfc_allocator.cc:666] Size: 1000.74MiB | Requested Size: 0B | in_use: 0, prev: Size: 95.40MiB | Requested Size: 95.40MiB | in_use: 1
After this, a huge number of lines like these follow:
2017-10-05 00:15:38.025888: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x705ba0000 of size 1280
2017-10-05 00:15:38.025901: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x705ba0500 of size 256
2017-10-05 00:15:38.025911: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x705ba0600 of size 256
2017-10-05 00:15:38.025922: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x705ba0700 of size 4096
2017-10-05 00:15:38.025934: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x705ba1700 of size 256
2017-10-05 00:15:38.025946: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x705ba1800 of size 4096
2017-10-05 00:15:38.025966: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x705ba2800 of size 256
2017-10-05 00:15:38.025978: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x705ba2900 of size 4096
2017-10-05 00:15:38.025990: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x705ba3900 of size 256
2017-10-05 00:15:38.026002: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x705ba3a00 of size 4096
2017-10-05 00:15:38.026015: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x705ba4a00 of size 256
2017-10-05 00:15:38.026026: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x705ba4b00 of size 256
2017-10-05 00:15:38.026038: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x705ba4c00 of size 4096
2017-10-05 00:15:38.026048: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x705ba5c00 of size 256
2017-10-05 00:15:38.026057: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x705ba5d00 of size 256
2017-10-05 00:15:38.026069: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x705ba5e00 of size 256
2017-10-05 00:15:38.026078: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x705ba5f00 of size 256
2017-10-05 00:15:38.026087: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x705ba6000 of size 256
2017-10-05 00:15:38.026101: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x705ba6100 of size 256
2017-10-05 00:15:38.026114: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x705ba6200 of size 256
2017-10-05 00:15:38.026125: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x705ba6300 of size 256
2017-10-05 00:15:38.026137: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x705ba6400 of size 256
2017-10-05 00:15:38.026150: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x705ba6500 of size 283454464
2017-10-05 00:15:38.026163: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x7169f9100 of size 2097152
In the end:
W tensorflow/core/common_runtime/bfc_allocator.cc:277] *******************************************************************************************_________
W tensorflow/core/framework/op_kernel.cc:1152] Resource exhausted: OOM when allocating tensor with shape[59,128,195370]
PoolAllocator: After 6333 get requests, put_count=4184 evicted_count=1000 eviction_rate=0.239006 and unsatisfied allocation rate=0.513027
2017-10-04 23:29:10.896694: I tensorflow/core/common_runtime/gpu/pool_allocator.cc:259] Raising pool_size_limit_ from 100 to 110
Traceback (most recent call last):
File "/home/singh/tensorflow-gpu/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1039, in _do_call
return fn(*args)
File "/home/singh/tensorflow-gpu/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1021, in _run_fn
status, run_metadata)
File "/usr/lib/python3.5/contextlib.py", line 66, in __exit__
next(self.gen)
File "/home/singh/tensorflow-gpu/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[59,128,195370]
[[Node: model/att_seq2seq/decode/attention_decoder/decoder/TensorArrayStack/TensorArrayGatherV3 = TensorArrayGatherV3[_class=["loc:@model/att_seq2seq/decode/attention_decoder/decoder/TensorArray"], dtype=DT_FLOAT, element_shape=[128,195370], _device="/job:localhost/replica:0/task:0/gpu:0"](model/att_seq2seq/decode/attention_decoder/decoder/TensorArray, model/att_seq2seq/decode/attention_decoder/decoder/TensorArrayStack/range, model/att_seq2seq/decode/attention_decoder/decoder/while/Exit_1)]]
[[Node: model/att_seq2seq/OptimizeLoss/gradients/model/att_seq2seq/decode/attention_decoder/decoder/while/CustomHelperNextInputs/TrainingHelperNextInputs/cond/Merge_grad/tuple/control_dependency/_501 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_1548_model/att_seq2seq/OptimizeLoss/gradients/model/att_seq2seq/decode/attention_decoder/decoder/while/CustomHelperNextInputs/TrainingHelperNextInputs/cond/Merge_grad/tuple/control_dependency", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](^_cloopmodel/att_seq2seq/OptimizeLoss/gradients/model/att_seq2seq/decode/attention_decoder/decoder/while/CustomHelperNextInputs/TrainingHelperNextInputs/cond/TensorArrayReadV3_grad/TensorArrayWrite/TensorArrayWriteV3/Switch/_369)]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/singh/project1_nmt/s2s/seq2seq/bin/train.py", line 277, in <module>
tf.app.run()
File "/home/singh/tensorflow-gpu/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/home/singh/project1_nmt/s2s/seq2seq/bin/train.py", line 272, in main
schedule=FLAGS.schedule)
File "/home/singh/tensorflow-gpu/lib/python3.5/site-packages/tensorflow/contrib/learn/python/learn/learn_runner.py", line 111, in run
return _execute_schedule(experiment, schedule)
File "/home/singh/tensorflow-gpu/lib/python3.5/site-packages/tensorflow/contrib/learn/python/learn/learn_runner.py", line 46, in _execute_schedule
return task()
File "/home/singh/project1_nmt/s2s/seq2seq/seq2seq/contrib/experiment.py", line 104, in continuous_train_and_eval
monitors=self._train_monitors)
File "/home/singh/tensorflow-gpu/lib/python3.5/site-packages/tensorflow/python/util/deprecation.py", line 281, in new_func
return func(*args, **kwargs)
File "/home/singh/tensorflow-gpu/lib/python3.5/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 430, in fit
loss = self._train_model(input_fn=input_fn, hooks=hooks)
File "/home/singh/tensorflow-gpu/lib/python3.5/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 978, in _train_model
_, loss = mon_sess.run([model_fn_ops.train_op, model_fn_ops.loss])
File "/home/singh/tensorflow-gpu/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 484, in run
run_metadata=run_metadata)
File "/home/singh/tensorflow-gpu/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 820, in run
run_metadata=run_metadata)
File "/home/singh/tensorflow-gpu/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 776, in run
return self._sess.run(*args, **kwargs)
File "/home/singh/tensorflow-gpu/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 930, in run
run_metadata=run_metadata)
File "/home/singh/tensorflow-gpu/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 776, in run
return self._sess.run(*args, **kwargs)
File "/home/singh/tensorflow-gpu/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 778, in run
run_metadata_ptr)
File "/home/singh/tensorflow-gpu/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 982, in _run
feed_dict_string, options, run_metadata)
File "/home/singh/tensorflow-gpu/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1032, in _do_run
target_list, options, run_metadata)
File "/home/singh/tensorflow-gpu/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1052, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[59,128,195370]
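For anyone hitting the same error: the OOM comes from materializing full-vocabulary logits for every decoder step, so the usual mitigations are a smaller batch size (the failing tensor scales linearly with it), a smaller subword vocabulary, or a sampled softmax during training. Below is a minimal, hypothetical sketch of the sampled-softmax idea against the TF 1.x API used in this traceback; it is not the seq2seq library's own code, and all sizes except the vocabulary are made up:

```python
import tensorflow as tf  # TF 1.x, as in the traceback above

vocab_size = 195370   # from the OOM message
hidden_size = 512     # hypothetical decoder output size
num_sampled = 4096    # hypothetical number of sampled classes

# Output-projection variables that would otherwise produce a full
# [time, batch, vocab_size] logits tensor.
proj_w = tf.get_variable("proj_w", [vocab_size, hidden_size])
proj_b = tf.get_variable("proj_b", [vocab_size])

# Flattened decoder outputs and targets: [batch * time, ...].
decoder_outputs = tf.placeholder(tf.float32, [None, hidden_size])
targets = tf.placeholder(tf.int64, [None, 1])

# Compute the training loss over num_sampled classes instead of all
# 195,370, so the full-vocabulary tensor is never allocated.
loss = tf.nn.sampled_softmax_loss(
    weights=proj_w, biases=proj_b,
    labels=targets, inputs=decoder_outputs,
    num_sampled=num_sampled, num_classes=vocab_size)
```

The quickest experiment, though, is simply lowering the training batch size, since the failing tensor shrinks proportionally.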
This is happening to me when I try to run inference via app.py in Google's recommended web app for TF. There is no batch-size entry there. Do you know where I can find it? My config file has the batch size set to 1. Is it located somewhere else as well?
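Not sure where that web app reads its batch size from, but if the inference process dies because TensorFlow reserves the whole GPU up front (rather than because one tensor is too big), the standard TF 1.x session options below may help. This is a generic sketch, not app.py's actual code, and the 0.5 cap is an arbitrary example:

```python
import tensorflow as tf  # TF 1.x

# Allocate GPU memory on demand instead of grabbing it all at startup,
# and optionally cap the process at a fraction of the card's memory.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
config.gpu_options.per_process_gpu_memory_fraction = 0.5  # hypothetical cap

sess = tf.Session(config=config)
```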