
failed to allocate 11.90G CUDA_ERROR_OUT_OF_MEMORY #48

Closed
zzks opened this issue Mar 15, 2017 · 10 comments

Comments

zzks commented Mar 15, 2017

When I tried the WMT'16 EN-DE example, I encountered the following CUDA_ERROR_OUT_OF_MEMORY:

name: TITAN X (Pascal)
major: 6 minor: 1 memoryClockRate (GHz) 1.531
pciBusID 0000:01:00.0
Total memory: 11.90GiB
Free memory: 11.39GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: TITAN X (Pascal), pci bus id: 0000:01:00.0)
E tensorflow/stream_executor/cuda/cuda_driver.cc:1002] failed to allocate 11.90G (12778405888 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
Traceback (most recent call last):
File "/usr/lib/python3.4/runpy.py", line 170, in _run_module_as_main
"main", mod_spec)
File "/usr/lib/python3.4/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/media/sbai/7A9C9BED9C9BA1E5/DL/seq2seq/bin/train.py", line 251, in
tf.app.run()
File "/home/sbai/tf134/lib/python3.4/site-packages/tensorflow/python/platform/app.py", line 44, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/media/sbai/7A9C9BED9C9BA1E5/DL/seq2seq/bin/train.py", line 246, in main
schedule=FLAGS.schedule)
File "/home/sbai/tf134/lib/python3.4/site-packages/tensorflow/contrib/learn/python/learn/learn_runner.py", line 106, in run
return task()
File "/home/sbai/tf134/lib/python3.4/site-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 459, in train_and_evaluate
self.train(delay_secs=0)
File "/home/sbai/tf134/lib/python3.4/site-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 281, in train
monitors=self._train_monitors + extra_hooks)
File "/home/sbai/tf134/lib/python3.4/site-packages/tensorflow/python/util/deprecation.py", line 280, in new_func
return func(*args, **kwargs)
File "/home/sbai/tf134/lib/python3.4/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 426, in fit
loss = self._train_model(input_fn=input_fn, hooks=hooks)
File "/home/sbai/tf134/lib/python3.4/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 984, in _train_model
_, loss = mon_sess.run([model_fn_ops.train_op, model_fn_ops.loss])
File "/home/sbai/tf134/lib/python3.4/site-packages/tensorflow/python/training/monitored_session.py", line 462, in run
run_metadata=run_metadata)
File "/home/sbai/tf134/lib/python3.4/site-packages/tensorflow/python/training/monitored_session.py", line 786, in run
run_metadata=run_metadata)
File "/home/sbai/tf134/lib/python3.4/site-packages/tensorflow/python/training/monitored_session.py", line 744, in run
return self._sess.run(*args, **kwargs)
File "/home/sbai/tf134/lib/python3.4/site-packages/tensorflow/python/training/monitored_session.py", line 883, in run
feed_dict, options)
File "/home/sbai/tf134/lib/python3.4/site-packages/tensorflow/python/training/monitored_session.py", line 909, in _call_hook_before_run
request = hook.before_run(run_context)
File "/media/sbai/7A9C9BED9C9BA1E5/DL/seq2seq/seq2seq/training/hooks.py", line 239, in before_run
"predicted_tokens": self._pred_dict["predicted_tokens"],
KeyError: 'predicted_tokens'

Env: TF 1.0 GPU & Python 3.4 & Ubuntu 14.04
I changed the batch size and num_units to smaller values, but still encountered the same error.
I also tried the toy data and got the same error.
Is it because I am using Python 3.4?

############################### update ###############
I tried it with Python 3.5 and got the same error on the first try; when I tried again, I got the following error instead:

WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/contrib/learn/python/learn/monitors.py:267: BaseMonitor.init (from tensorflow.contrib.learn.python.learn.monitors) is deprecated and will be removed after 2016-12-05.
Instructions for updating:
Monitors are deprecated. Please use tf.train.SessionRunHook.
*** Error in `python3.5': double free or corruption (!prev): 0x0000000002870d90 ***
Aborted (core dumped)

DaoD commented Mar 15, 2017

Same problem, but I solved it by ignoring the file "train_seq2seq.yml" and using only "nmt_xxxx.yml".
I guess there is some problem in the hooks configuration, but I'm not sure.

zzks commented Mar 15, 2017

@DaoD Thanks for your reply!
However, I tried your method and still got CUDA_ERROR_OUT_OF_MEMORY.
Has anyone managed to run the example?
Could you tell us your configuration and environment?

DaoD commented Mar 15, 2017

Yes, the error still appears, but the training process continues anyway.
I don't know the reason.

papajohn commented Mar 15, 2017

I read some other posts [1], [2] about this issue in other projects. It appears that this error doesn't matter, and the GPU still gets used (just with a memory growth factor that gets adjusted over time).

You can check whether your GPU is being used with the nvidia-smi command-line tool.

[1] http://stackoverflow.com/questions/39465503/cuda-error-out-of-memory-in-tensorflow
[2] tensorflow/tensorflow#6048
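
For what it's worth, here is a minimal sketch (not from this repository, just the generic TensorFlow 1.x session options) of how GPU memory growth can be turned on so TensorFlow allocates memory incrementally instead of reserving the whole card up front:

```python
import tensorflow as tf

# Allocate GPU memory on demand instead of reserving (nearly) all of it
# when the session is created.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
# Alternatively, cap the fraction of GPU memory TensorFlow may use:
# config.gpu_options.per_process_gpu_memory_fraction = 0.9

with tf.Session(config=config) as sess:
    # ... build and run the graph as usual ...
    pass
```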

dennybritz (Contributor) commented:

I think this is a different error and not related to the GPU:

"predicted_tokens": self._pred_dict["predicted_tokens"],
KeyError: 'predicted_tokens'

This is probably the same as #43. I haven't been able to reproduce it, and I'm not sure what is wrong here.

dennybritz (Contributor) commented:

I'm closing this one because it seems like a duplicate of #43 - please discuss there or re-open the issue if it's not a duplicate.

zzks commented Mar 16, 2017

@dennybritz @papajohn thanks for your replies!
I updated to the new seq2seq and got some new errors.
However, I tried @DaoD's method (no buckets) again; it still reports CUDA_ERROR_OUT_OF_MEMORY, but the training process continues.

The new errors when using the buckets config are as follows (Python 3.4):
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 6810 get requests, put_count=8013 evicted_count=2000 eviction_rate=0.249594 and unsatisfied allocation rate=0.119824
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:259] Raising pool_size_limit_ from 212 to 233
INFO:tensorflow:Performing full trace on next step.
I tensorflow/stream_executor/dso_loader.cc:126] Couldn't open CUDA library libcupti.so.8.0. LD_LIBRARY_PATH: :/usr/local/cuda/lib64:/usr/local/cudnn/lib64
F tensorflow/core/platform/default/gpu/cupti_wrapper.cc:59] Check failed: ::tensorflow::Status::OK() == (::tensorflow::Env::Default()->GetSymbolFromLibrary( GetDsoHandle(), kName, &f)) (OK vs. Not found: /home/sbai/tf134/lib/python3.4/site-packages/tensorflow/python/_pywrap_tensorflow.so: undefined symbol: cuptiActivityRegisterCallbacks)could not find cuptiActivityRegisterCallbacksin libcupti DSO
Aborted (core dumped)

Anyway, I can run the sample now.

ayushidalmia commented Mar 20, 2017

I also ran into this. Any solution? A lighter unit test would help.

eugenioclrc commented:

same here...

myagmur01 commented:

Have you checked the processes running on your GPU? The error may be caused by multiple open environments overloading the GPU. I recommend closing all terminals and opening them again. I hope this works for you.
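
As a quick way to see what is holding GPU memory (a sketch assuming the nvidia-smi tool is installed and on PATH, not something from this repository), you can query the running compute processes from Python:

```python
import subprocess

# Ask nvidia-smi for the processes currently holding GPU memory.
# Each CSV row is: pid, process name, used GPU memory.
output = subprocess.check_output(
    ["nvidia-smi",
     "--query-compute-apps=pid,process_name,used_memory",
     "--format=csv,noheader"],
    universal_newlines=True,
)
print(output or "No compute processes found on the GPU.")
```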
