
could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR #70

Closed
devindrown opened this issue Aug 8, 2019 · 8 comments

@devindrown

I set up Medaka v0.8.1 to run with a GPU, but it consistently crashes at runtime with the error Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR.

I'm seeing references to gpu_options.allow_growth = True online, but I'm not sure how that would be implemented with this code.

System:
Ubuntu 18.04
Cuda 10.1
tensorflow-gpu 1.12 (also tried 1.14 and 2.0.0-beta1)

2019-08-07 22:35:13.583084: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-08-07 22:35:13.584900: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Traceback (most recent call last):
  File "/home/dmdrown/medaka/venv/bin/medaka", line 11, in <module>
    load_entry_point('medaka==0.8.1', 'console_scripts', 'medaka')()
  File "/home/dmdrown/medaka/venv/lib/python3.6/site-packages/medaka-0.8.1-py3.6-linux-x86_64.egg/medaka/medaka.py", line 363, in main
    args.func(args)
  File "/home/dmdrown/medaka/venv/lib/python3.6/site-packages/medaka-0.8.1-py3.6-linux-x86_64.egg/medaka/inference.py", line 462, in predict
    tag_name=args.tag_name, tag_value=args.tag_value, tag_keep_missing=args.tag_keep_missing
  File "/home/dmdrown/medaka/venv/lib/python3.6/site-packages/medaka-0.8.1-py3.6-linux-x86_64.egg/medaka/inference.py", line 388, in run_prediction
    class_probs = model.predict_on_batch(x_data)
  File "/home/dmdrown/medaka/venv/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 1294, in predict_on_batch
    outputs = self.predict_function(inputs)
  File "/home/dmdrown/medaka/venv/lib/python3.6/site-packages/tensorflow/python/keras/backend.py", line 3292, in __call__
    run_metadata=self.run_metadata)
  File "/home/dmdrown/medaka/venv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1458, in __call__
    run_metadata_ptr)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
  (0) Unknown: Fail to find the dnn implementation.
         [[{{node bidirectional/CudnnRNN_1}}]]
         [[classify/truediv/_123]]
  (1) Unknown: Fail to find the dnn implementation.
         [[{{node bidirectional/CudnnRNN_1}}]]

@cjw85
Member

cjw85 commented Aug 8, 2019

Medaka v0.8.x will not work with tensorflow(-gpu) 1.12; you must use 1.14. Tensorflow 2.0.0 is untested and not recommended.

That being said, this is not the cause of your errors above. I would surmise one of two things is happening.

  • python is not in fact loading the tensorflow-gpu package but rather the cpu-only tensorflow. To correct this, try running (within the virtual environment):

     pip uninstall tensorflow-gpu tensorflow medaka
    

    observing any messages that occur. You should get to a situation where python cannot import tensorflow, e.g. python -c "import tensorflow" fails with an import error. Then proceed by amending the requirements.txt file to list tensorflow-gpu==1.14.0 and installing medaka with python setup.py install.

  • tensorflow-gpu is being loaded correctly but is loading an incorrect cudnn version (or none at all). If you were using tensorflow-gpu==1.14.0 the logging would note the versions of the libraries it had found and loaded. The tensorflow docs note the tested build configurations, which are roughly aligned with the requirements of the pypi tensorflow-gpu packages. On my development computer I have (amongst others) cudnn v7.4. A quick way to check which of these two cases applies is sketched below.

See #65 (comment)
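
As a quick check of the two points above, something like the following can be run from within the virtual environment. This is a minimal sketch and not part of medaka; it only reports whether the tensorflow build python is importing was built with CUDA and whether it can see a GPU:

    # tf_check.py -- rough sanity check of the installed tensorflow build
    import tensorflow as tf

    print("tensorflow version:", tf.__version__)

    # True for the GPU-enabled wheel (tensorflow-gpu), False for the
    # CPU-only tensorflow package.
    print("built with CUDA:", tf.test.is_built_with_cuda())

    # In tensorflow 1.14 this attempts to initialise a CUDA device and emits
    # the usual dso_loader messages for libcudart/libcudnn along the way.
    print("GPU available:", tf.test.is_gpu_available(cuda_only=True))

If "built with CUDA" reports False, the CPU-only package is the one being imported and the reinstall route above applies; if the last line fails or reports False, the CUDA/cuDNN libraries are the more likely culprit.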

@devindrown
Author

Thanks for the rapid reply. I appreciate the help troubleshooting the install environment.

I'm pretty sure that tensorflow-gpu is installed in the environment. After I compiled and installed medaka from source, I tried to uninstall tensorflow; the system confirms it is not installed and that tensorflow-gpu is:

(medaka) dmdrown@hikita:~/medaka$ pip uninstall tensorflow
WARNING: Skipping tensorflow as it is not installed.

(medaka) dmdrown@hikita:~/medaka$ pip show tensorflow-gpu
Name: tensorflow-gpu
Version: 1.14.0

I was running an incompatible cudnn version (v7.6). I have downgraded cudnn to v7.4. Running a test case, I can confirm it is functional:

(medaka) dmdrown@hikita:~/cudnn_samples_v7/mnistCUDNN$ ./mnistCUDNN
cudnnGetVersion() : 7402 , CUDNN_VERSION from cudnn.h : 7402 (7.4.2)
Host compiler version : GCC 7.4.0
There are 1 CUDA capable devices on your machine :
device 0 : sms 46  Capabilities 7.5, SmClock 1710.0 Mhz, MemSize (Mb) 7979, MemClock 7000.0 Mhz, Ecc=0, boardGroupID=0
Using device 0

When I run medaka_consensus I do see [11:24:51 - ModelLoad] With cudnn: True

Unfortunately, I'm still seeing the same behavior as before. Medaka is crashing.

[11:24:53 - Sampler] Initializing sampler for consensus of region utg000001c:999000-2000000.
2019-08-08 11:24:54.701962: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-08-08 11:24:54.712180: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Traceback (most recent call last):
  File "/home/dmdrown/medaka/venv/bin/medaka", line 11, in <module>
    load_entry_point('medaka==0.8.2', 'console_scripts', 'medaka')()
  File "/home/dmdrown/medaka/venv/lib/python3.6/site-packages/medaka-0.8.2-py3.6-linux-x86_64.egg/medaka/medaka.py", line 363, in main
    args.func(args)
  File "/home/dmdrown/medaka/venv/lib/python3.6/site-packages/medaka-0.8.2-py3.6-linux-x86_64.egg/medaka/inference.py", line 462, in predict
    tag_name=args.tag_name, tag_value=args.tag_value, tag_keep_missing=args.tag_keep_missing
  File "/home/dmdrown/medaka/venv/lib/python3.6/site-packages/medaka-0.8.2-py3.6-linux-x86_64.egg/medaka/inference.py", line 388, in run_prediction
    class_probs = model.predict_on_batch(x_data)
  File "/home/dmdrown/medaka/venv/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 1294, in predict_on_batch
    outputs = self.predict_function(inputs)
  File "/home/dmdrown/medaka/venv/lib/python3.6/site-packages/tensorflow/python/keras/backend.py", line 3292, in __call__
    run_metadata=self.run_metadata)
  File "/home/dmdrown/medaka/venv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1458, in __call__
    run_metadata_ptr)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
  (0) Unknown: Fail to find the dnn implementation.
         [[{{node bidirectional/CudnnRNN_1}}]]
  (1) Unknown: Fail to find the dnn implementation.
         [[{{node bidirectional/CudnnRNN_1}}]]
         [[classify/truediv/_123]]
0 successful operations.
0 derived errors ignored.
Failed to run medaka consensus.

You mentioned

If you were using tensorflow-gpu==1.14.0 the logging would note the versions of libraries that it had found and loaded.

Other than stdout, where should I look for the logging?

@cjw85
Member

cjw85 commented Aug 8, 2019

To see this, run medaka consensus with debug logging:

medaka consensus <reads.bam> <output> --batch 100 --debug

On my computer this gives:

2019-08-08 20:54:53.335843: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2019-08-08 20:54:53.555728: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x57ca630 executing computations on platform CUDA. Devices:
2019-08-08 20:54:53.555786: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1
2019-08-08 20:54:53.555803: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (1): GeForce GTX 1080 Ti, Compute Capability 6.1
2019-08-08 20:54:53.560464: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3311930000 Hz
2019-08-08 20:54:53.564264: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5896930 executing computations on platform Host. Devices:
2019-08-08 20:54:53.564317: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
2019-08-08 20:54:53.567366: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:17:00.0
2019-08-08 20:54:53.568435: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 1 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:65:00.0
2019-08-08 20:54:53.568968: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-08-08 20:54:53.570914: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-08-08 20:54:53.572576: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-08-08 20:54:53.573060: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-08-08 20:54:53.575252: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-08-08 20:54:53.576947: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-08-08 20:54:53.580771: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-08-08 20:54:53.587217: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0, 1
2019-08-08 20:54:53.587264: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-08-08 20:54:53.592100: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-08-08 20:54:53.592123: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0 1
2019-08-08 20:54:53.592132: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N Y
2019-08-08 20:54:53.592139: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 1:   Y N
2019-08-08 20:54:53.598864: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10479 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:17:00.0, compute capability: 6.1)
2019-08-08 20:54:53.604716: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10420 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:65:00.0, compute capability: 6.1)

@cjw85
Member

cjw85 commented Aug 8, 2019

Looking at our code, the log message

[20:54:53 - ModelLoad] With cudnn: True

only means that a CuDNN-accelerated model is going to be built (because tensorflow has detected a CUDA GPU), not that the CuDNN libraries have been loaded successfully.
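
For context, the pattern is roughly the following. This is a sketch under the assumption of a tensorflow 1.14 Keras model, not medaka's actual code, and the layer size is made up:

    # Sketch of the "With cudnn: True" switch; illustrative only.
    import tensorflow as tf
    from tensorflow.keras import layers

    def make_gru(units, cudnn):
        # The flag only chooses which layer class is built; whether the CuDNN
        # kernels can actually be created is discovered later, at run time.
        if cudnn:
            return layers.CuDNNGRU(units, return_sequences=True)
        return layers.GRU(units, return_sequences=True)

    cudnn = tf.test.is_gpu_available(cuda_only=True)  # drives the log line
    rnn = layers.Bidirectional(make_gru(128, cudnn))

That is why the log line can read True even though creating the CuDNN handle subsequently fails.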

@cjw85
Member

cjw85 commented Aug 8, 2019

I've done a little research; are you using an RTX-series card? I do not have access to one of these right now.

This issue (which is perhaps where you came across the config.gpu_options.allow_growth = True hack) holds some insight.

When I run medaka consensus with the default parameters on a 12GB 1080Ti GPU I receive out-of-memory errors, hence my setting --batch 100 above; the default is 200. There's a suggestion in the above issue that an out-of-memory error on an RTX card can result in the error you are seeing. Can you try setting --batch 100 to see the effect?

That the default value for this parameter is inappropriate for most GPUs is a known bug. For tensorflow 1.12.0 the default value was fine; there are tensorflow GitHub issues noting a similar increase in memory use. We intend to reduce the default value.

@devindrown
Author

Yes, I've got an RTX card. I'm running on a GeForce RTX 2080.

I've tried reducing the batch size, even as low as 1, and each time it still produces the same error: Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR.

Running medaka in debug mode produces:

[12:57:53 - Predict] Setting tensorflow threads to 1.
2019-08-08 12:57:53.315779: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-08-08 12:57:53.318429: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2019-08-08 12:57:53.398185: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-08-08 12:57:53.398515: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x52b44b0 executing computations on platform CUDA. Devices:
2019-08-08 12:57:53.398527: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): GeForce RTX 2080, Compute Capability 7.5
2019-08-08 12:57:53.418105: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3600000000 Hz
2019-08-08 12:57:53.418981: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x569d780 executing computations on platform Host. Devices:
2019-08-08 12:57:53.418991: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
2019-08-08 12:57:53.419075: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-08-08 12:57:53.419349: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: GeForce RTX 2080 major: 7 minor: 5 memoryClockRate(GHz): 1.71
pciBusID: 0000:01:00.0
2019-08-08 12:57:53.419552: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-08-08 12:57:53.420211: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-08-08 12:57:53.420808: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-08-08 12:57:53.420951: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-08-08 12:57:53.421734: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-08-08 12:57:53.422338: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-08-08 12:57:53.423992: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-08-08 12:57:53.424038: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-08-08 12:57:53.424330: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-08-08 12:57:53.424577: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-08-08 12:57:53.424610: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-08-08 12:57:53.425011: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-08-08 12:57:53.425019: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0
2019-08-08 12:57:53.425022: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N
2019-08-08 12:57:53.425074: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-08-08 12:57:53.425320: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-08-08 12:57:53.425556: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7408 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080, pci bus id: 0000:01:00.0, compute capability: 7.5)

Just before the crash, I see this in the debug output:

2019-08-08 12:57:55.262773: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-08-08 12:57:55.499958: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-08-08 12:57:56.170450: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-08-08 12:57:56.170529: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at cudnn_rnn_ops.cc:1329 : Unknown: Fail to find the dnn implementation.
2019-08-08 12:57:56.172181: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-08-08 12:57:56.172213: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at cudnn_rnn_ops.cc:1329 : Unknown: Fail to find the dnn implementation.
[12:57:56 - DataStore] Skipping validation on close.
[12:57:56 - DataStore] Writing metadata.
Traceback (most recent call last):
  File "/home/dmdrown/medaka/venv/bin/medaka", line 11, in <module>
    load_entry_point('medaka==0.8.2', 'console_scripts', 'medaka')()
  File "/home/dmdrown/medaka/venv/lib/python3.6/site-packages/medaka-0.8.2-py3.6-linux-x86_64.egg/medaka/medaka.py", line 363, in main
    args.func(args)
  File "/home/dmdrown/medaka/venv/lib/python3.6/site-packages/medaka-0.8.2-py3.6-linux-x86_64.egg/medaka/inference.py", line 462, in predict
    tag_name=args.tag_name, tag_value=args.tag_value, tag_keep_missing=args.tag_keep_missing
  File "/home/dmdrown/medaka/venv/lib/python3.6/site-packages/medaka-0.8.2-py3.6-linux-x86_64.egg/medaka/inference.py", line 388, in run_prediction
    class_probs = model.predict_on_batch(x_data)
  File "/home/dmdrown/medaka/venv/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 1294, in predict_on_batch
    outputs = self.predict_function(inputs)
  File "/home/dmdrown/medaka/venv/lib/python3.6/site-packages/tensorflow/python/keras/backend.py", line 3292, in __call__
    run_metadata=self.run_metadata)
  File "/home/dmdrown/medaka/venv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1458, in __call__
    run_metadata_ptr)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
  (0) Unknown: Fail to find the dnn implementation.
         [[{{node bidirectional/CudnnRNN_1}}]]
  (1) Unknown: Fail to find the dnn implementation.
         [[{{node bidirectional/CudnnRNN_1}}]]
         [[classify/truediv/_123]]
0 successful operations.
0 derived errors ignored.

@cjw85
Member

cjw85 commented Aug 8, 2019

I believe you can enable the behaviour of gpu_options.allow_growth = True without changing the code by setting an environment variable:

export TF_FORCE_GPU_ALLOW_GROWTH=true

Have you tried this? When I tested it, I had to reduce the batch size further, to --batch 80 on my 1080Ti, to avoid out-of-memory errors.
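
For reference, the in-code equivalent of that environment variable in tensorflow 1.x looks roughly like the following. This is a sketch only, not how medaka configures its session; the environment variable above achieves the same thing without touching any code:

    # Sketch: allow_growth via the tensorflow 1.x session config (illustrative).
    import tensorflow as tf
    from tensorflow.keras import backend as K

    config = tf.ConfigProto()
    # Allocate GPU memory on demand rather than reserving nearly all of it up
    # front, which is what TF_FORCE_GPU_ALLOW_GROWTH=true also does.
    config.gpu_options.allow_growth = True
    K.set_session(tf.Session(config=config))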

An alternative would be to force medaka not to use the cuDNN library. To do this you will have to get your hands ever so slightly dirty with the code: at this line you should set cudnn=False.

I'm sorry I cannot provide better guidance at this time. I will discuss this further with some contacts and find out if there are better solutions.

@devindrown
Author

Changing the environment variable as you suggest FIXES the crashing behavior. I had to decrease the batch size to 50 to stop the OOM errors.

medaka_consensus -d draft_assm.fa -i basecalls.fa -t 8 -b 50
...
Polished assembly written to medaka/consensus.fasta, have a nice day.

You've been very helpful and certainly clarified this for setting up my next box.
