This repository has been archived by the owner on Dec 29, 2022. It is now read-only.

Issues with CUDA_OUT_OF_MEMORY #132

Closed
ghost opened this issue Mar 31, 2017 · 5 comments

Comments

@ghost

ghost commented Mar 31, 2017

Hi

When trying out the pipeline unit test I found here, I got the following two errors:

I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.8095
pciBusID 0000:02:00.0
Total memory: 7.92GiB
Free memory: 7.81GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:590] creating context when one is currently active; existing: 0x3863510
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 1 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.8095
pciBusID 0000:01:00.0
Total memory: 7.92GiB
Free memory: 7.81GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 1: Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:02:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:1) -> (device: 1, name: GeForce GTX 1080, pci bus id: 0000:01:00.0)
E tensorflow/stream_executor/cuda/cuda_driver.cc:1002] failed to allocate 7.92G (8508145664 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
E tensorflow/stream_executor/cuda/cuda_driver.cc:1002] failed to allocate 7.92G (8506048512 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
INFO:tensorflow:Saving checkpoints for 1 into /tmp/tmp4emg5gzs/model.ckpt.

As there is nothing running on the two GPUs (I checked with nvidia-smi) and I am the only person at my internship trying things out on them, I can't find a reasonable explanation. Could someone point me in the right direction? As I'm a newbie to TensorFlow and GPUs in general, I find it hard to know where to start.

Thanks in advance
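
For context, TensorFlow 1.x tries to map nearly all of the free memory on every visible GPU as soon as the first session is created, so a failed first allocation attempt like the two errors above can show up even when nvidia-smi reports the cards as idle. Below is a minimal sketch of asking the allocator to grow on demand instead; it assumes the plain TF 1.x session API rather than this repository's training script:

```python
import tensorflow as tf

# Ask the GPU allocator to claim memory incrementally instead of reserving
# (almost) the whole card up front.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

with tf.Session(config=config) as sess:
    # Build and run the graph as usual; memory is grabbed only as needed.
    sess.run(tf.global_variables_initializer())
```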

@MaksymDel
Contributor

MaksymDel commented Mar 31, 2017

Hi,
I have about the same issue (one GPU)...

@davidecaroselli

Hi, I have the same issue here:

E tensorflow/stream_executor/cuda/cuda_driver.cc:1002] failed to allocate 11.17G (11995578368 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY

no matter which NMT config size I choose: nmt_small.yml, nmt_medium.yml, or nmt_large.yml.
I am running on an AWS p2.xlarge instance with an NVIDIA Tesla K80.

@dennybritz
Contributor

#48

@Sandhya2207

Hi,
While testing the seq2seq model on the toy data set, I am getting the following error:

E tensorflow/core/common_runtime/direct_session.cc:137] Internal: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY; total memory reported: 18446744073709551615
Traceback (most recent call last):
File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
"main", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/home/development/sandhya/installs/tf-seq2seq-google/seq2seq-master/bin/train.py", line 277, in
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 44, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/home/development/sandhya/installs/tf-seq2seq-google/seq2seq-master/bin/train.py", line 272, in main
schedule=FLAGS.schedule)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/learn_runner.py", line 106, in run
return task()
File "seq2seq/contrib/experiment.py", line 104, in continuous_train_and_eval
monitors=self._train_monitors)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 280, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 426, in fit
loss = self._train_model(input_fn=input_fn, hooks=hooks)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 981, in _train_model
config=self.config.tf_config) as mon_sess:
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 315, in MonitoredTrainingSession
return MonitoredSession(session_creator=session_creator, hooks=all_hooks)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 601, in init
session_creator, hooks, should_recover=True)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 434, in init
self._sess = _RecoverableSession(self._coordinated_creator)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 767, in init
_WrappedSession.init(self, self._create_session())
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 772, in _create_session
return self._sess_creator.create_session()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 494, in create_session
self.tf_sess = self._session_creator.create_session()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 375, in create_session
init_fn=self._scaffold.init_fn)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 256, in prepare_session
config=config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 161, in _restore_checkpoint
sess = session.Session(self._target, graph=self._graph, config=config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1176, in init
super(Session, self).init(target, graph, config=config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 552, in init
self._session = tf_session.TF_NewDeprecatedSession(opts, status)
File "/usr/lib/python2.7/contextlib.py", line 24, in exit
self.gen.next()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InternalError: Failed to create session.

I have tried the earlier suggested fix of setting the "gpu_allow_growth" flag to True. Kindly suggest.

Thanks
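
This particular failure happens while the CUDA context itself is being created, before TensorFlow's memory allocator (and therefore the gpu_allow_growth option) comes into play, so it usually means the device memory is already held by another process or the driver is in a bad state. A small sketch for checking that from Python before launching training; the helper name is hypothetical and not part of this repository:

```python
import subprocess

def gpu_memory_report():
    """Hypothetical helper: ask nvidia-smi how much memory each GPU has used/free."""
    return subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=index,memory.used,memory.free",
         "--format=csv"]).decode("utf-8")

if __name__ == "__main__":
    print(gpu_memory_report())
```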

@shubhamagarwal92

Have you tried setting the CUDA_VISIBLE_DEVICES environment variable when calling train/test.py? The problem is that the session cannot be created at all.
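
A minimal sketch of doing that from Python; the device index is illustrative, and the equivalent from the shell is prefixing the command with CUDA_VISIBLE_DEVICES=0 (assuming GPU 0 is the one you want):

```python
import os

# Expose only one physical GPU to this process; this must be set before the
# first TensorFlow session touches CUDA. The index "0" is illustrative.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import tensorflow as tf

with tf.Session() as sess:
    pass  # the session now sees the chosen card as /gpu:0
```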
