Train error #39

Closed
WonTaeYeon opened this issue Mar 25, 2020 · 5 comments

@WonTaeYeon

Hi, thank you for your hard work and open sourcing the code!
I tried training, but the following error occurred.

Command:
python main.py --training_file_pattern=tmp/train/train* --model_name=efficientdet-d0 --model_dir=train_model --hparams="use_bfloat16=false" --use_tpu=False

2020-03-25 11:18:29.528543: W tensorflow/core/common_runtime/bfc_allocator.cc:429] ****************************************************************************************************
2020-03-25 11:18:29.528632: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at cwise_ops_common.h:263 : Resource exhausted: OOM when allocating tensor with shape[64,1152,16,16] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
WARNING:tensorflow:Reraising captured error
W0325 11:18:30.869304 140049157998336 error_handling.py:142] Reraising captured error
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1367, in _do_call
return fn(*args)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1352, in _run_fn
target_list, run_metadata)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1445, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[64,1152,16,16] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node efficientnet-b0/model/blocks_13/Sigmoid}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[strided_slice_2/_15357]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

(1) Resource exhausted: OOM when allocating tensor with shape[64,1152,16,16] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node efficientnet-b0/model/blocks_13/Sigmoid}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "main.py", line 385, in <module>
tf.app.run(main)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "main.py", line 246, in main
FLAGS.train_batch_size))
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3035, in train
rendezvous.raise_errors()
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 143, in raise_errors
six.reraise(typ, value, traceback)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/six.py", line 703, in reraise
raise value
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3030, in train
saving_listeners=saving_listeners)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 374, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1164, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1198, in _train_model_default
saving_listeners)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1497, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 778, in run
run_metadata=run_metadata)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1283, in run
run_metadata=run_metadata)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1384, in run
raise six.reraise(*original_exc_info)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/six.py", line 703, in reraise
raise value
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1369, in run
return self._sess.run(*args, **kwargs)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1442, in run
run_metadata=run_metadata)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1200, in run
return self._sess.run(*args, **kwargs)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 960, in run
run_metadata_ptr)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1183, in _run
feed_dict_tensor, options, run_metadata)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1361, in _do_run
run_metadata)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1386, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[64,1152,16,16] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node efficientnet-b0/model/blocks_13/Sigmoid (defined at /home/ubuntu/project_1/automl/efficientdet/backbone/efficientnet_model.py:370) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[strided_slice_2/_15357]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

(1) Resource exhausted: OOM when allocating tensor with shape[64,1152,16,16] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node efficientnet-b0/model/blocks_13/Sigmoid (defined at /home/ubuntu/project_1/automl/efficientdet/backbone/efficientnet_model.py:370) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

Original stack trace for 'efficientnet-b0/model/blocks_13/Sigmoid':
File "main.py", line 385, in <module>
tf.app.run(main)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "main.py", line 246, in main
FLAGS.train_batch_size))
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3030, in train
saving_listeners=saving_listeners)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 374, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1164, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1194, in _train_model_default
features, labels, ModeKeys.TRAIN, self.config)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2857, in _call_model_fn
config)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1152, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3126, in _model_fn
features, labels, is_export_mode=is_export_mode)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 1663, in call_without_tpu
return self._call_model_fn(features, labels, is_export_mode=is_export_mode)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 1994, in _call_model_fn
estimator_spec = self._model_fn(features=features, **kwargs)
File "/home/ubuntu/project_1/automl/efficientdet/det_model_fn.py", line 567, in efficientdet_model_fn
model=efficientdet_arch.efficientdet)
File "/home/ubuntu/project_1/automl/efficientdet/det_model_fn.py", line 399, in _model_fn
cls_outputs, box_outputs = _model_outputs()
File "/home/ubuntu/project_1/automl/efficientdet/det_model_fn.py", line 389, in _model_outputs
return model(features, config=hparams_config.Config(params))
File "/home/ubuntu/project_1/automl/efficientdet/efficientdet_arch.py", line 552, in efficientdet
features = build_backbone(features, config)
File "/home/ubuntu/project_1/automl/efficientdet/efficientdet_arch.py", line 328, in build_backbone
override_params=override_params)
File "/home/ubuntu/project_1/automl/efficientdet/backbone/efficientnet_builder.py", line 324, in build_model_base
features = model(images, training=training, features_only=True)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 778, in call
outputs = call_fn(cast_inputs, *args, **kwargs)
File "/home/ubuntu/project_1/automl/efficientdet/backbone/efficientnet_model.py", line 643, in call
for idx, block in enumerate(self._blocks):
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/autograph/operators/control_flow.py", line 339, in for_stmt
return py_for_stmt(iter, extra_test, body, get_state, set_state, init_vars)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/autograph/operators/control_flow.py", line 350, in _py_for_stmt
state = body(target, *state)
File "/home/ubuntu/project_1/automl/efficientdet/backbone/efficientnet_model.py", line 662, in call
outputs = block.call(
File "/home/ubuntu/project_1/automl/efficientdet/backbone/efficientnet_model.py", line 363, in call
if self._block_args.fused_conv:
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/autograph/operators/control_flow.py", line 920, in if_stmt
return _py_if_stmt(cond, body, orelse)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/autograph/operators/control_flow.py", line 1029, in _py_if_stmt
return body() if cond else orelse()
File "/home/ubuntu/project_1/automl/efficientdet/backbone/efficientnet_model.py", line 369, in call
if self._block_args.expand_ratio != 1:
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/autograph/operators/control_flow.py", line 920, in if_stmt
return _py_if_stmt(cond, body, orelse)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/autograph/operators/control_flow.py", line 1029, in _py_if_stmt
return body() if cond else orelse()
File "/home/ubuntu/project_1/automl/efficientdet/backbone/efficientnet_model.py", line 370, in call
x = self._relu_fn(self._bn0(expand_conv_fn(x), training=training))
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/ops/custom_gradient.py", line 256, in call
return self._d(self._f, a, k)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/ops/custom_gradient.py", line 212, in decorated
return _graph_mode_decorator(wrapped, args, kwargs)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/ops/custom_gradient.py", line 316, in _graph_mode_decorator
result, grad_fn = f(*args)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/ops/nn_impl.py", line 534, in swish
return features * math_ops.sigmoid(features), grad
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/ops/math_ops.py", line 3154, in sigmoid
return gen_math_ops.sigmoid(x, name=name)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_math_ops.py", line 8750, in sigmoid
"Sigmoid", x=x, name=name)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/framework/op_def_library.py", line 742, in _apply_op_helper
attrs=attr_protos, op_def=op_def)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3322, in _create_op_internal
op_def=op_def)
File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 1756, in init
self._traceback = tf_stack.extract_stack()
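
For a sense of scale, the size of the tensor named in the error can be worked out from the log line itself. The following is a minimal sketch, not part of the repository: the helper name `oom_tensor_bytes` and the dtype table are assumptions made for illustration, and the table only covers a few of the dtype names TensorFlow prints.

```python
import re
from math import prod

# Bytes per element for a few dtype names as they appear in TensorFlow OOM
# messages (illustrative subset; extend as needed).
DTYPE_BYTES = {"float": 4, "half": 2, "bfloat16": 2, "double": 8}

# Matches e.g. "OOM when allocating tensor with shape[64,1152,16,16] and type float".
OOM_PATTERN = re.compile(
    r"OOM when allocating tensor with shape\[([\d,]+)\] and type (\w+)"
)


def oom_tensor_bytes(log_line):
    """Return (shape, dtype, size_in_bytes) for an OOM log line, or None."""
    match = OOM_PATTERN.search(log_line)
    if match is None:
        return None
    shape = tuple(int(dim) for dim in match.group(1).split(","))
    dtype = match.group(2)
    # Raises KeyError for dtype names missing from the table above.
    return shape, dtype, prod(shape) * DTYPE_BYTES[dtype]


line = (
    "Resource exhausted: OOM when allocating tensor with "
    "shape[64,1152,16,16] and type float on "
    "/job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc"
)
shape, dtype, size = oom_tensor_bytes(line)
print(f"shape={shape} dtype={dtype} size={size / 2**20:.0f} MiB")
```

For the line above this gives 75,497,472 bytes, i.e. 72 MiB for this one activation. The message names the allocation that failed, not necessarily the largest consumer of GPU memory.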

@CraigWang1

CraigWang1 commented Mar 25, 2020

Try reducing the batch size with --train_batch_size, for example to 32, 16, 8, 4, or 2:
--train_batch_size 32
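
A rough sketch of why a smaller batch helps, assuming the leading dimension of the reported shape [64,1152,16,16] is the batch size: the memory for that activation then scales linearly with --train_batch_size. The function name `activation_mib` is made up for this illustration, and the numbers cover only the single tensor from the error message, not the whole model.

```python
# Elements per example for the tensor in the OOM message, i.e. the reported
# shape [64, 1152, 16, 16] without its leading dimension (assumed: batch).
PER_EXAMPLE_ELEMENTS = 1152 * 16 * 16
BYTES_PER_FLOAT32 = 4  # the log reports "type float"


def activation_mib(batch_size):
    """Size in MiB of this one activation at the given batch size."""
    return batch_size * PER_EXAMPLE_ELEMENTS * BYTES_PER_FLOAT32 / 2**20


for batch_size in (64, 32, 16, 8, 4, 2):
    print(f"--train_batch_size={batch_size:<2d} -> {activation_mib(batch_size):6.2f} MiB")
```

Halving the batch size halves this tensor, and the same holds for every other per-example activation kept for the backward pass, which is why stepping down through 32, 16, 8, ... is a reasonable first thing to try.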

@WonTaeYeon
Author

Solved, thanks.

@qtw1998

qtw1998 commented Mar 31, 2020

> Maybe try reducing the batch size with --train_batch_size 32, 16, 8, 4, or 2
> eg.
> --train_batch_size 32

That helped, thanks!

@qtw1998

qtw1998 commented Apr 1, 2020

> Maybe try reducing the batch size with --train_batch_size 32, 16, 8, 4, or 2
> eg.
> --train_batch_size 32

But I am using 8 × 2080 Ti GPUs with a batch size of 4 and still get the same OOM error.

@mingxingtan
Member

This issue is similar to #85. I am going to close this one and keep that open. Feel free to add your comments to #85 if you still have problems.
