This repository has been archived by the owner on Dec 29, 2022. It is now read-only.

Issues with CUDA_OUT_OF_MEMORY #132

Closed
ghost opened this issue Mar 31, 2017 · 5 comments

Comments

@ghost

ghost commented Mar 31, 2017

Hi

When trying out the pipeline unit test I found here, I got the following two errors:

I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.8095
pciBusID 0000:02:00.0
Total memory: 7.92GiB
Free memory: 7.81GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:590] creating context when one is currently active; existing: 0x3863510
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 1 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.8095
pciBusID 0000:01:00.0
Total memory: 7.92GiB
Free memory: 7.81GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 1: Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:02:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:1) -> (device: 1, name: GeForce GTX 1080, pci bus id: 0000:01:00.0)
E tensorflow/stream_executor/cuda/cuda_driver.cc:1002] failed to allocate 7.92G (8508145664 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
E tensorflow/stream_executor/cuda/cuda_driver.cc:1002] failed to allocate 7.92G (8506048512 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
INFO:tensorflow:Saving checkpoints for 1 into /tmp/tmp4emg5gzs/model.ckpt.

As there is nothing running on the two GPUs (I checked with nvidia-smi) and I am the only person at my internship trying things out on them, I can't find a reasonable explanation. Could someone point me in the right direction? As I'm a newbie to TensorFlow and GPUs in general, I find it hard to know where to start.

Thanks in advance
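
For context, TensorFlow 1.x tries to map nearly all of the free memory on every visible GPU as soon as the first session is created, so a failed first allocation attempt like the two errors above can show up even when nvidia-smi reports the cards as idle. Below is a minimal sketch of asking the allocator to grow on demand instead; it assumes the plain TF 1.x session API rather than this repository's training script:

```python
import tensorflow as tf

# Ask the GPU allocator to claim memory incrementally instead of reserving
# (almost) the whole card up front.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

with tf.Session(config=config) as sess:
    # Build and run the graph as usual; memory is grabbed only as needed.
    sess.run(tf.global_variables_initializer())
```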

@MaksymDel
Contributor

MaksymDel commented Mar 31, 2017

Hi,
I have about the same issue (one GPU)...

@davidecaroselli

Hi, I have the same issue here:

E tensorflow/stream_executor/cuda/cuda_driver.cc:1002] failed to allocate 11.17G (11995578368 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY

no matter which NMT config size I choose: nmt_small.yml, nmt_medium.yml, or nmt_large.yml.
I am running on an AWS p2.xlarge instance with an NVIDIA Tesla K80.

@dennybritz
Contributor

#48

@Sandhya2207

Hi,
While testing the seq2seq model on the toy data set, I am getting the following error:

E tensorflow/core/common_runtime/direct_session.cc:137] Internal: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY; total memory reported: 18446744073709551615
Traceback (most recent call last):
File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
"main", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/home/development/sandhya/installs/tf-seq2seq-google/seq2seq-master/bin/train.py", line 277, in
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 44, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/home/development/sandhya/installs/tf-seq2seq-google/seq2seq-master/bin/train.py", line 272, in main
schedule=FLAGS.schedule)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/learn_runner.py", line 106, in run
return task()
File "seq2seq/contrib/experiment.py", line 104, in continuous_train_and_eval
monitors=self._train_monitors)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 280, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 426, in fit
loss = self._train_model(input_fn=input_fn, hooks=hooks)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 981, in _train_model
config=self.config.tf_config) as mon_sess:
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 315, in MonitoredTrainingSession
return MonitoredSession(session_creator=session_creator, hooks=all_hooks)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 601, in init
session_creator, hooks, should_recover=True)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 434, in init
self._sess = _RecoverableSession(self._coordinated_creator)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 767, in init
_WrappedSession.init(self, self._create_session())
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 772, in _create_session
return self._sess_creator.create_session()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 494, in create_session
self.tf_sess = self._session_creator.create_session()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 375, in create_session
init_fn=self._scaffold.init_fn)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 256, in prepare_session
config=config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 161, in _restore_checkpoint
sess = session.Session(self._target, graph=self._graph, config=config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1176, in init
super(Session, self).init(target, graph, config=config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 552, in init
self._session = tf_session.TF_NewDeprecatedSession(opts, status)
File "/usr/lib/python2.7/contextlib.py", line 24, in exit
self.gen.next()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InternalError: Failed to create session.

I have tried the earlier suggested fix of setting the "gpu_allow_growth" flag to True. Kindly suggest.

Thanks
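
This particular failure happens while the CUDA context itself is being created, before TensorFlow's memory allocator (and therefore the gpu_allow_growth option) comes into play, so it usually means the device memory is already held by another process or the driver is in a bad state. A small sketch for checking that from Python before launching training; the helper name is hypothetical and not part of this repository:

```python
import subprocess

def gpu_memory_report():
    """Hypothetical helper: ask nvidia-smi how much memory each GPU has used/free."""
    return subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=index,memory.used,memory.free",
         "--format=csv"]).decode("utf-8")

if __name__ == "__main__":
    print(gpu_memory_report())
```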

@shubhamagarwal92

Have you tried setting the CUDA_VISIBLE_DEVICES environment variable when calling train/test.py? The problem is that the session cannot be created at all.
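
A minimal sketch of doing that from Python; the device index is illustrative, and the equivalent from the shell is prefixing the command with CUDA_VISIBLE_DEVICES=0 (assuming GPU 0 is the one you want):

```python
import os

# Expose only one physical GPU to this process; this must be set before the
# first TensorFlow session touches CUDA. The index "0" is illustrative.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import tensorflow as tf

with tf.Session() as sess:
    pass  # the session now sees the chosen card as /gpu:0
```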
