Train error #39

WonTaeYeon · 2020-03-25T02:19:16Z

Hi, thank you for your hard work and open sourcing the code!
I tried training, but the following error occurred.

Command:
python main.py --training_file_pattern=tmp/train/train* --model_name=efficientdet-d0 --model_dir=train_model --hparams="use_bfloat16=false" --use_tpu=False

2020-03-25 11:18:29.528543: W tensorflow/core/common_runtime/bfc_allocator.cc:429] ****************************************************************************************************
2020-03-25 11:18:29.528632: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at cwise_ops_common.h:263 : Resource exhausted: OOM when allocating tensor with shape[64,1152,16,16] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
WARNING:tensorflow:Reraising captured error
W0325 11:18:30.869304 140049157998336 error_handling.py:142] Reraising captured error
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1367, in _do_call
return fn(*args)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1352, in _run_fn
target_list, run_metadata)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1445, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[64,1152,16,16] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node efficientnet-b0/model/blocks_13/Sigmoid}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[strided_slice_2/_15357]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

(1) Resource exhausted: OOM when allocating tensor with shape[64,1152,16,16] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node efficientnet-b0/model/blocks_13/Sigmoid}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "main.py", line 385, in <module>
tf.app.run(main)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "main.py", line 246, in main
FLAGS.train_batch_size))
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3035, in train
rendezvous.raise_errors()
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 143, in raise_errors
six.reraise(typ, value, traceback)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/six.py", line 703, in reraise
raise value
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3030, in train
saving_listeners=saving_listeners)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 374, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1164, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1198, in _train_model_default
saving_listeners)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1497, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 778, in run
run_metadata=run_metadata)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1283, in run
run_metadata=run_metadata)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1384, in run
raise six.reraise(*original_exc_info)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/six.py", line 703, in reraise
raise value
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1369, in run
return self._sess.run(*args, **kwargs)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1442, in run
run_metadata=run_metadata)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1200, in run
return self._sess.run(*args, **kwargs)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 960, in run
run_metadata_ptr)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1183, in _run
feed_dict_tensor, options, run_metadata)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1361, in _do_run
run_metadata)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1386, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[64,1152,16,16] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node efficientnet-b0/model/blocks_13/Sigmoid (defined at /home/ubuntu/project_1/automl/efficientdet/backbone/efficientnet_model.py:370) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[strided_slice_2/_15357]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

(1) Resource exhausted: OOM when allocating tensor with shape[64,1152,16,16] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node efficientnet-b0/model/blocks_13/Sigmoid (defined at /home/ubuntu/project_1/automl/efficientdet/backbone/efficientnet_model.py:370) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

Original stack trace for 'efficientnet-b0/model/blocks_13/Sigmoid':
File "main.py", line 385, in <module>
tf.app.run(main)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "main.py", line 246, in main
FLAGS.train_batch_size))
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3030, in train
saving_listeners=saving_listeners)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 374, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1164, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1194, in _train_model_default
features, labels, ModeKeys.TRAIN, self.config)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2857, in _call_model_fn
config)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1152, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3126, in _model_fn
features, labels, is_export_mode=is_export_mode)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 1663, in call_without_tpu
return self._call_model_fn(features, labels, is_export_mode=is_export_mode)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 1994, in _call_model_fn
estimator_spec = self._model_fn(features=features, **kwargs)
File "/home/ubuntu/project_1/automl/efficientdet/det_model_fn.py", line 567, in efficientdet_model_fn
model=efficientdet_arch.efficientdet)
File "/home/ubuntu/project_1/automl/efficientdet/det_model_fn.py", line 399, in _model_fn
cls_outputs, box_outputs = _model_outputs()
File "/home/ubuntu/project_1/automl/efficientdet/det_model_fn.py", line 389, in _model_outputs
return model(features, config=hparams_config.Config(params))
File "/home/ubuntu/project_1/automl/efficientdet/efficientdet_arch.py", line 552, in efficientdet
features = build_backbone(features, config)
File "/home/ubuntu/project_1/automl/efficientdet/efficientdet_arch.py", line 328, in build_backbone
override_params=override_params)
File "/home/ubuntu/project_1/automl/efficientdet/backbone/efficientnet_builder.py", line 324, in build_model_base
features = model(images, training=training, features_only=True)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 778, in __call__
outputs = call_fn(cast_inputs, *args, **kwargs)
File "/home/ubuntu/project_1/automl/efficientdet/backbone/efficientnet_model.py", line 643, in call
for idx, block in enumerate(self._blocks):
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/autograph/operators/control_flow.py", line 339, in for_stmt
return py_for_stmt(iter, extra_test, body, get_state, set_state, init_vars)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/autograph/operators/control_flow.py", line 350, in _py_for_stmt
state = body(target, *state)
File "/home/ubuntu/project_1/automl/efficientdet/backbone/efficientnet_model.py", line 662, in call
outputs = block.call(
File "/home/ubuntu/project_1/automl/efficientdet/backbone/efficientnet_model.py", line 363, in call
if self._block_args.fused_conv:
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/autograph/operators/control_flow.py", line 920, in if_stmt
return _py_if_stmt(cond, body, orelse)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/autograph/operators/control_flow.py", line 1029, in _py_if_stmt
return body() if cond else orelse()
File "/home/ubuntu/project_1/automl/efficientdet/backbone/efficientnet_model.py", line 369, in call
if self._block_args.expand_ratio != 1:
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/autograph/operators/control_flow.py", line 920, in if_stmt
return _py_if_stmt(cond, body, orelse)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/autograph/operators/control_flow.py", line 1029, in _py_if_stmt
return body() if cond else orelse()
File "/home/ubuntu/project_1/automl/efficientdet/backbone/efficientnet_model.py", line 370, in call
x = self._relu_fn(self._bn0(expand_conv_fn(x), training=training))
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/ops/custom_gradient.py", line 256, in __call__
return self._d(self._f, a, k)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/ops/custom_gradient.py", line 212, in decorated
return _graph_mode_decorator(wrapped, args, kwargs)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/ops/custom_gradient.py", line 316, in _graph_mode_decorator
result, grad_fn = f(*args)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/ops/nn_impl.py", line 534, in swish
return features * math_ops.sigmoid(features), grad
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/ops/math_ops.py", line 3154, in sigmoid
return gen_math_ops.sigmoid(x, name=name)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_math_ops.py", line 8750, in sigmoid
"Sigmoid", x=x, name=name)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/framework/op_def_library.py", line 742, in _apply_op_helper
attrs=attr_protos, op_def=op_def)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3322, in _create_op_internal
op_def=op_def)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 1756, in __init__
self._traceback = tf_stack.extract_stack()
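
For scale, the allocation that fails in the log above is a float tensor of shape [64,1152,16,16]. The sketch below works out how large that single tensor is and how it shrinks with the leading dimension. It assumes "float" means 4-byte float32 and that the leading 64 is the batch size; the helper names (tensor_bytes, size_by_batch) are illustrative only and are not part of the repository. One such tensor is a small part of the total: training keeps many activations like it alive at once, so the log line shows where memory ran out, not everything that consumed it.

```python
from math import prod  # Python 3.8+

# Shape reported in the OOM message; assumed layout: [batch, channels, height, width].
OOM_SHAPE = (64, 1152, 16, 16)


def tensor_bytes(shape, bytes_per_element=4):
    """Size in bytes of a dense tensor (assumes float32, 4 bytes per element)."""
    return prod(shape) * bytes_per_element


def size_by_batch(batch_sizes=(64, 32, 16, 8, 4, 2)):
    """Map each batch size to the size in MiB of the same tensor at that batch size."""
    return {b: tensor_bytes((b,) + OOM_SHAPE[1:]) / 2**20 for b in batch_sizes}


if __name__ == "__main__":
    for batch, mib in size_by_batch().items():
        print(f"batch {batch:>2}: {mib:6.2f} MiB")
```

At batch size 64 this one tensor is 72 MiB, and the size falls linearly as the batch size is reduced.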

CraigWang1 · 2020-03-25T03:32:30Z

Try reducing the batch size with --train_batch_size. Step down through 32, 16, 8, 4, or 2 until the OOM error goes away, e.g.
--train_batch_size 32
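
The step-down suggested above could be scripted roughly as follows. This is only a sketch under assumptions: it reuses the flags from the original command, adds --train_batch_size, and treats a zero exit code as success. The function names are hypothetical, and a successful attempt is a full training run, not a quick memory probe.

```python
import subprocess
import sys


def first_batch_size_that_runs(candidates, run):
    """Return the first batch size for which run(batch_size) is truthy, else None."""
    for batch_size in candidates:
        if run(batch_size):
            return batch_size
    return None


def run_training(batch_size):
    """Launch main.py with the issue's flags plus --train_batch_size.

    Returns True when the process exits with code 0 (assumed to mean no OOM).
    """
    cmd = [
        sys.executable, "main.py",
        "--training_file_pattern=tmp/train/train*",
        "--model_name=efficientdet-d0",
        "--model_dir=train_model",
        "--hparams=use_bfloat16=false",
        "--use_tpu=False",
        f"--train_batch_size={batch_size}",
    ]
    return subprocess.run(cmd).returncode == 0


if __name__ == "__main__":
    # Try the suggested sizes from largest to smallest.
    print(first_batch_size_that_runs([32, 16, 8, 4, 2], run_training))
```

Passing the runner in as an argument keeps the search loop separate from the process launch, so the loop can be checked with a stand-in function before any real training is started.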

WonTaeYeon · 2020-03-26T00:38:28Z

Solved, thanks.

qtw1998 · 2020-03-31T16:08:46Z

Reducing the batch size with --train_batch_size helped, thanks!

qtw1998 · 2020-04-01T05:12:16Z

I reduced the batch size with --train_batch_size as suggested, but with 8 × 2080 Ti GPUs and a batch size of 4 I still get the same OOM error.

mingxingtan · 2020-04-05T22:02:54Z

This issue is similar to #85. I am going to close this one and keep that open. Feel free to add your comments to #85 if you still have problems.

mingxingtan closed this as completed Apr 5, 2020
