
failed to allocate 11.90G CUDA_ERROR_OUT_OF_MEMORY #48

Closed
zzks opened this issue Mar 15, 2017 · 10 comments

Comments

zzks commented Mar 15, 2017

When I tried the WMT'16 EN-DE example, I encountered the following CUDA_ERROR_OUT_OF_MEMORY:

name: TITAN X (Pascal)
major: 6 minor: 1 memoryClockRate (GHz) 1.531
pciBusID 0000:01:00.0
Total memory: 11.90GiB
Free memory: 11.39GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: TITAN X (Pascal), pci bus id: 0000:01:00.0)
E tensorflow/stream_executor/cuda/cuda_driver.cc:1002] failed to allocate 11.90G (12778405888 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
Traceback (most recent call last):
File "/usr/lib/python3.4/runpy.py", line 170, in _run_module_as_main
"main", mod_spec)
File "/usr/lib/python3.4/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/media/sbai/7A9C9BED9C9BA1E5/DL/seq2seq/bin/train.py", line 251, in
tf.app.run()
File "/home/sbai/tf134/lib/python3.4/site-packages/tensorflow/python/platform/app.py", line 44, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/media/sbai/7A9C9BED9C9BA1E5/DL/seq2seq/bin/train.py", line 246, in main
schedule=FLAGS.schedule)
File "/home/sbai/tf134/lib/python3.4/site-packages/tensorflow/contrib/learn/python/learn/learn_runner.py", line 106, in run
return task()
File "/home/sbai/tf134/lib/python3.4/site-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 459, in train_and_evaluate
self.train(delay_secs=0)
File "/home/sbai/tf134/lib/python3.4/site-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 281, in train
monitors=self._train_monitors + extra_hooks)
File "/home/sbai/tf134/lib/python3.4/site-packages/tensorflow/python/util/deprecation.py", line 280, in new_func
return func(*args, **kwargs)
File "/home/sbai/tf134/lib/python3.4/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 426, in fit
loss = self._train_model(input_fn=input_fn, hooks=hooks)
File "/home/sbai/tf134/lib/python3.4/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 984, in _train_model
_, loss = mon_sess.run([model_fn_ops.train_op, model_fn_ops.loss])
File "/home/sbai/tf134/lib/python3.4/site-packages/tensorflow/python/training/monitored_session.py", line 462, in run
run_metadata=run_metadata)
File "/home/sbai/tf134/lib/python3.4/site-packages/tensorflow/python/training/monitored_session.py", line 786, in run
run_metadata=run_metadata)
File "/home/sbai/tf134/lib/python3.4/site-packages/tensorflow/python/training/monitored_session.py", line 744, in run
return self._sess.run(*args, **kwargs)
File "/home/sbai/tf134/lib/python3.4/site-packages/tensorflow/python/training/monitored_session.py", line 883, in run
feed_dict, options)
File "/home/sbai/tf134/lib/python3.4/site-packages/tensorflow/python/training/monitored_session.py", line 909, in _call_hook_before_run
request = hook.before_run(run_context)
File "/media/sbai/7A9C9BED9C9BA1E5/DL/seq2seq/seq2seq/training/hooks.py", line 239, in before_run
"predicted_tokens": self._pred_dict["predicted_tokens"],
KeyError: 'predicted_tokens'

Env: TF 1.0 GPU & Python 3.4 & Ubuntu 14.04
I changed the batch size and num_units to smaller values, but still encountered the same error.
I also tried the toy data and got the same error.
Is it because I am using Python 3.4?

############################### update ###############
I tried it with Python 3.5 and got the same error on the first try; when I tried again, I got the following error instead:

WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/contrib/learn/python/learn/monitors.py:267: BaseMonitor.init (from tensorflow.contrib.learn.python.learn.monitors) is deprecated and will be removed after 2016-12-05.
Instructions for updating:
Monitors are deprecated. Please use tf.train.SessionRunHook.
*** Error in `python3.5': double free or corruption (!prev): 0x0000000002870d90 ***
Aborted (core dumped)

DaoD commented Mar 15, 2017

Same problem, but I solved it by ignoring the file "train_seq2seq.yml" and using only "nmt_xxxx.yml".
I guess there is some problem in the hooks configuration, but I'm not sure.

zzks commented Mar 15, 2017

@DaoD Thanks for your reply!
However, I tried your method and still got CUDA_ERROR_OUT_OF_MEMORY.
Has anyone managed to run the example?
Could you tell us your configuration and environment?

DaoD commented Mar 15, 2017

Yes, the error still appears, but the training process continues anyway.
I don't know the reason.

papajohn commented Mar 15, 2017

I read some other posts [1], [2] about this issue in other projects. It appears that this error doesn't matter, and the GPU still gets used (just with a memory growth factor that gets adjusted over time).

You can check whether your GPU is being used with the nvidia-smi command-line tool.

[1] http://stackoverflow.com/questions/39465503/cuda-error-out-of-memory-in-tensorflow
[2] tensorflow/tensorflow#6048
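
For what it's worth, here is a minimal sketch (not from this repository, just the generic TensorFlow 1.x session options) of how GPU memory growth can be turned on so TensorFlow allocates memory incrementally instead of reserving the whole card up front:

```python
import tensorflow as tf

# Allocate GPU memory on demand instead of reserving (nearly) all of it
# when the session is created.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
# Alternatively, cap the fraction of GPU memory TensorFlow may use:
# config.gpu_options.per_process_gpu_memory_fraction = 0.9

with tf.Session(config=config) as sess:
    # ... build and run the graph as usual ...
    pass
```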

dennybritz (Contributor) commented:

I think this is a different error and not related to the GPU:

"predicted_tokens": self._pred_dict["predicted_tokens"],
KeyError: 'predicted_tokens'

This is probably the same as #43. I haven't been able to reproduce it, and I'm not sure what is wrong here.

dennybritz (Contributor) commented:

I'm closing this one because it seems like a duplicate of #43 - please discuss there or re-open the issue if it's not a duplicate.

zzks commented Mar 16, 2017

@dennybritz @papajohn thanks for your replies!
I updated to the new seq2seq and got some new errors.
However, I tried @DaoD's method (no buckets) again; it still reports CUDA_ERROR_OUT_OF_MEMORY, but the training process continues.

The new errors when using the buckets config are as follows (Python 3.4):
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 6810 get requests, put_count=8013 evicted_count=2000 eviction_rate=0.249594 and unsatisfied allocation rate=0.119824
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:259] Raising pool_size_limit_ from 212 to 233
INFO:tensorflow:Performing full trace on next step.
I tensorflow/stream_executor/dso_loader.cc:126] Couldn't open CUDA library libcupti.so.8.0. LD_LIBRARY_PATH: :/usr/local/cuda/lib64:/usr/local/cudnn/lib64
F tensorflow/core/platform/default/gpu/cupti_wrapper.cc:59] Check failed: ::tensorflow::Status::OK() == (::tensorflow::Env::Default()->GetSymbolFromLibrary( GetDsoHandle(), kName, &f)) (OK vs. Not found: /home/sbai/tf134/lib/python3.4/site-packages/tensorflow/python/_pywrap_tensorflow.so: undefined symbol: cuptiActivityRegisterCallbacks)could not find cuptiActivityRegisterCallbacksin libcupti DSO
Aborted (core dumped)

Anyway, I can run the sample now.

ayushidalmia commented Mar 20, 2017

I also ran into this. Any solution? A lighter unit test would help.

eugenioclrc commented:

same here...

myagmur01 commented:

Have you checked the processes running on your GPU? The error may be caused by multiple open environments overloading the GPU. I recommend closing all terminals and opening them again. I hope this works for you.
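
As a quick way to see what is holding GPU memory (a sketch assuming the nvidia-smi tool is installed and on PATH, not something from this repository), you can query the running compute processes from Python:

```python
import subprocess

# Ask nvidia-smi for the processes currently holding GPU memory.
# Each CSV row is: pid, process name, used GPU memory.
output = subprocess.check_output(
    ["nvidia-smi",
     "--query-compute-apps=pid,process_name,used_memory",
     "--format=csv,noheader"],
    universal_newlines=True,
)
print(output or "No compute processes found on the GPU.")
```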
