failed to allocate 11.90G CUDA_ERROR_OUT_OF_MEMORY #48
Comments
Same problem here, but I worked around it by ignoring the file "train_seq2seq.yml" and just using "nmt_xxxx.yml".
@DaoD Thanks for your reply!
Yes, the error still appears, but training can continue.
I read some other posts [1], [2] about this issue in other projects. It appears that such an error doesn't matter: the GPU still gets used, just with a memory growth factor that gets adjusted over time. You can check whether your GPU is actually being used with `nvidia-smi`.
[1] http://stackoverflow.com/questions/39465503/cuda-error-out-of-memory-in-tensorflow
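If the up-front allocation itself is the problem, TensorFlow can be told not to grab the whole GPU at session creation. A minimal sketch, assuming the TF 1.x `ConfigProto` API in use in this thread; this is not a setting the seq2seq YAML configs expose directly, so it would have to be wired into the session setup:

```python
import tensorflow as tf

# Let the allocator start small and grow on demand instead of
# reserving all 11.9 GiB up front.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
# Alternatively, cap the per-process share of GPU memory:
# config.gpu_options.per_process_gpu_memory_fraction = 0.8

sess = tf.Session(config=config)
```

Either option avoids the single 12 GiB allocation that the log below shows failing.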
I think this is a different error and not related to the GPU:
This is probably the same as #43. I haven't been able to reproduce this and I'm not sure what is wrong here.
I'm closing this one because it seems like a duplicate of #43 - please discuss there or re-open the issue if it's not a duplicate.
@dennybritz @papajohn Thanks for your replies! New errors occurred when using the buckets config (Python 3.4). Anyway, I can run the sample now.
I also ran into this. Any solution? A lighter unit test would help.
same here... |
Have you checked the processes running on your GPU? The error can arise when multiple open environments overload the GPU. I recommend closing all terminals and opening them again. I hope this works for you.
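To see what is actually holding GPU memory, `nvidia-smi` lists the processes attached to each GPU. A small helper along these lines (the function name is hypothetical; it degrades gracefully on machines without the NVIDIA driver, and `subprocess.run` with `capture_output` needs Python 3.7+, newer than the 3.4/3.5 used in this thread):

```python
import shutil
import subprocess

def gpu_processes():
    """Return nvidia-smi's output as a string, or a notice if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return "nvidia-smi not found"
    result = subprocess.run(
        ["nvidia-smi"], capture_output=True, text=True, check=False
    )
    return result.stdout or result.stderr

print(gpu_processes())
```

Any stale Python process listed there can be killed to free its memory before restarting training.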
When I tried the WMT'16 EN-DE sample, I encountered the following CUDA_ERROR_OUT_OF_MEMORY:
name: TITAN X (Pascal)
major: 6 minor: 1 memoryClockRate (GHz) 1.531
pciBusID 0000:01:00.0
Total memory: 11.90GiB
Free memory: 11.39GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: TITAN X (Pascal), pci bus id: 0000:01:00.0)
E tensorflow/stream_executor/cuda/cuda_driver.cc:1002] failed to allocate 11.90G (12778405888 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
Traceback (most recent call last):
File "/usr/lib/python3.4/runpy.py", line 170, in _run_module_as_main
"main", mod_spec)
File "/usr/lib/python3.4/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/media/sbai/7A9C9BED9C9BA1E5/DL/seq2seq/bin/train.py", line 251, in
tf.app.run()
File "/home/sbai/tf134/lib/python3.4/site-packages/tensorflow/python/platform/app.py", line 44, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/media/sbai/7A9C9BED9C9BA1E5/DL/seq2seq/bin/train.py", line 246, in main
schedule=FLAGS.schedule)
File "/home/sbai/tf134/lib/python3.4/site-packages/tensorflow/contrib/learn/python/learn/learn_runner.py", line 106, in run
return task()
File "/home/sbai/tf134/lib/python3.4/site-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 459, in train_and_evaluate
self.train(delay_secs=0)
File "/home/sbai/tf134/lib/python3.4/site-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 281, in train
monitors=self._train_monitors + extra_hooks)
File "/home/sbai/tf134/lib/python3.4/site-packages/tensorflow/python/util/deprecation.py", line 280, in new_func
return func(*args, **kwargs)
File "/home/sbai/tf134/lib/python3.4/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 426, in fit
loss = self._train_model(input_fn=input_fn, hooks=hooks)
File "/home/sbai/tf134/lib/python3.4/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 984, in _train_model
_, loss = mon_sess.run([model_fn_ops.train_op, model_fn_ops.loss])
File "/home/sbai/tf134/lib/python3.4/site-packages/tensorflow/python/training/monitored_session.py", line 462, in run
run_metadata=run_metadata)
File "/home/sbai/tf134/lib/python3.4/site-packages/tensorflow/python/training/monitored_session.py", line 786, in run
run_metadata=run_metadata)
File "/home/sbai/tf134/lib/python3.4/site-packages/tensorflow/python/training/monitored_session.py", line 744, in run
return self._sess.run(*args, **kwargs)
File "/home/sbai/tf134/lib/python3.4/site-packages/tensorflow/python/training/monitored_session.py", line 883, in run
feed_dict, options)
File "/home/sbai/tf134/lib/python3.4/site-packages/tensorflow/python/training/monitored_session.py", line 909, in _call_hook_before_run
request = hook.before_run(run_context)
File "/media/sbai/7A9C9BED9C9BA1E5/DL/seq2seq/seq2seq/training/hooks.py", line 239, in before_run
"predicted_tokens": self._pred_dict["predicted_tokens"],
KeyError: 'predicted_tokens'
Env: TF 1.0 (GPU), Python 3.4, Ubuntu 14.04
I reduced the batch size and num_units, but still encountered the same error.
I tried the toy data and hit the same error.
Is it because I am using Python 3.4?
############################### update ###############
I tried it with Python 3.5, got the same error on the first try, and got the following error when I tried again:
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/contrib/learn/python/learn/monitors.py:267: BaseMonitor.init (from tensorflow.contrib.learn.python.learn.monitors) is deprecated and will be removed after 2016-12-05.
Instructions for updating:
Monitors are deprecated. Please use tf.train.SessionRunHook.
*** Error in `python3.5': double free or corruption (!prev): 0x0000000002870d90 ***
Aborted (core dumped)