Autokeras timeseries_forecaster official Tutorial : Colab script works with CPU, but not with GPU : CudnnRNN "Fail to find the dnn implementation." #1638

Open
Metawhy opened this issue Oct 13, 2021 · 2 comments

Comments

@Metawhy

Metawhy commented Oct 13, 2021

Bug Description

Calling the ".fit()" method on the AutoKeras TimeseriesForecaster model throws:

UnknownError:    Fail to find the dnn implementation.
	 [[{{node CudnnRNN}}]]
	 [[model/bidirectional/backward_lstm/PartitionedCall]] [Op:__inference_train_function_16009]

Function call stack:
train_function -> train_function -> train_function

This is similar to issues such as tensorflow/tensorflow#36508, but none of the solutions I tested worked.

Bug Reproduction

Simply switching the Colab runtime from CPU to GPU triggers the error.

It can be reproduced with the official AutoKeras tutorial Colab notebook: https://colab.research.google.com/github/keras-team/autokeras/blob/master/docs/ipynb/timeseries_forecaster.ipynb

Code for reproducing the bug, including 4 different solutions tested independently and then together, all without success:

https://colab.research.google.com/drive/1HOpCzGvjU3t3Mg1Ptscshr2rHvWJobHX?usp=sharing

Data used by the code:

The standard data from the AutoKeras tutorial example: https://archive.ics.uci.edu/ml/machine-learning-databases/00360/AirQualityUCI.zip

Manually downloaded and uploaded when "tf.keras.utils.get_file()" does not work: AirQualityUCI.csv

Expected Behavior

The script runs without the error, as it does when the session uses a CPU.

Setup Details

Include the details about the versions of:

  • OS type and version: Colab
  • Python: 3.7
  • autokeras: 1.0.16
  • keras-tuner: 1.0.4
  • scikit-learn: 0.22.2.post1
  • numpy: 1.19.5
  • pandas: 1.1.5
  • tensorflow: 2.5.0
  • cuda : 11.1.105
  • cudnn : 7.6.5
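
For anyone cross-checking the versions above: TensorFlow can report the CUDA and cuDNN versions it was compiled against through `tf.sysconfig.get_build_info()` (available since TF 2.3, as far as I know). The sketch below is only a convenience for printing them; `summarize_build` is a hypothetical helper name, and the keys are read with `.get()` on the assumption that CPU-only builds may omit them.

```python
def summarize_build(build_info):
    """Return a one-line summary of the CUDA/cuDNN versions in a build-info mapping."""
    cuda = build_info.get("cuda_version", "n/a")
    cudnn = build_info.get("cudnn_version", "n/a")
    is_cuda = build_info.get("is_cuda_build", False)
    return f"cuda_build={is_cuda} cuda={cuda} cudnn={cudnn}"


if __name__ == "__main__":
    try:
        import tensorflow as tf
    except ImportError:
        print("TensorFlow is not installed in this environment.")
    else:
        # Versions TensorFlow was built against, not the ones installed on the machine.
        print("tensorflow", tf.__version__)
        print(summarize_build(tf.sysconfig.get_build_info()))
```

Comparing this output with the cuDNN library actually present in the runtime shows whether the two disagree.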

Additional context

I tried the following solutions (code included below), but none of them worked.

Solution 0: try to install the TensorFlow GPU version matching AutoKeras

!pip3 install tensorflow==2.5.0 --upgrade 
!pip3 install tensorflow-gpu==2.5.0 --upgrade #https://pypi.org/project/tensorflow-gpu/#history

Solution test # 1: seen many times on GitHub and Stack Overflow

From https://stackoverflow.com/questions/54473254/cudnnlstm-unknownerror-fail-to-find-the-dnn-implementation

import tensorflow as tf

# List the GPUs visible to TensorFlow through the experimental API.
gpus_exp = tf.config.experimental.list_physical_devices('GPU')
if len(gpus_exp) > 0:
    print(f'Len gpus_exp={len(gpus_exp)}, changing memory growth param')
    try:
        # From https://blog.csdn.net/ljyljyok/article/details/107619881
        # Enable memory growth on every detected GPU.
        for gpu in gpus_exp:
            tf.config.experimental.set_memory_growth(gpu, True)
    except Exception as e:
        # Invalid device or cannot modify virtual devices once initialized.
        print('Did not manage to change memory growth param. Error :')
        print(e)

print('\n')

Solution test # 1 bis

From https://blog.csdn.net/ljyljyok/article/details/107619881

# Try alternative if did not find with first method, but should be equivalent
gpus = tf.config.list_physical_devices('GPU')
if len(gpus) > 0:
    print(f'Len gpus={len(gpus)}, changing memory growth param')
    try:
        # set_memory_growth is only exposed under tf.config.experimental.
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
    except Exception as e:
        # Invalid device or cannot modify virtual devices once initialized.
        print('Did not manage to change memory growth param. Error :')
        print(e)

Solution # 2

From https://soowankim.github.io/2020-05-29/Keras-Cudnn-failed-initialize/

import os
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'
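
One thing worth ruling out with this approach: as far as I understand, TensorFlow reads this variable when it sets up the GPU, so setting it after TensorFlow has already started using the GPU may have no effect. A minimal sketch, assuming the cell runs at the very top of the notebook; the warning text is my own wording, not a TensorFlow message.

```python
import os
import sys

# Assumption: this runs before TensorFlow is imported anywhere in the session.
if "tensorflow" in sys.modules:
    print("Warning: TensorFlow is already imported; the setting may come too late.")

os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
print("TF_FORCE_GPU_ALLOW_GROWTH =", os.environ["TF_FORCE_GPU_ALLOW_GROWTH"])

# Import TensorFlow only after the variable is set, e.g.:
# import tensorflow as tf
```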

Solution # 3

From https://leimao.github.io/blog/TensorFlow-cuDNN-Failure/ | https://www.titanwolf.org/Network/q/7812eb9a-c361-44c4-ad9f-dd4a437ba164/y

# PS: Seems to be for TF 1.X
tf_config = tf.compat.v1.ConfigProto()
tf_config.gpu_options.allow_growth = True
tf_interactive_session = tf.compat.v1.InteractiveSession(config=tf_config)
tf_session = tf.compat.v1.Session(config=tf_config)

ak.session = tf_session
@AschHarwood

I'm having the exact same problem in Colab.

I used the exact code from the timeseries tutorial with my tabular data:

predict_from = 1
predict_until = 10
lookback = 3
clf = ak.TimeseriesForecaster(
    lookback=lookback,
    predict_from=predict_from,
    predict_until=predict_until,
    max_trials=1,
    objective="val_loss",
)

clf.fit(
    x=x_train,
    y=y_train,
    validation_data=(x_test, y_test),
    batch_size=32,
    epochs=10,
)

Output:

Search: Running Trial #1

Hyperparameter    |Value             |Best Value So Far 
timeseries_bloc...|True              |?                 
timeseries_bloc...|lstm              |?                 
timeseries_bloc...|2                 |?                 
regression_head...|0                 |?                 
optimizer         |adam              |?                 
learning_rate     |0.001             |?                 

Epoch 1/10
2021-10-20 11:51:51.322633: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2021-10-20 11:51:52.437468: E tensorflow/stream_executor/cuda/cuda_dnn.cc:352] Loaded runtime CuDNN library: 8.0.5 but source was compiled with: 8.1.0.  CuDNN library needs to have matching major version and equal or higher minor version. If using a binary install, upgrade your CuDNN library.  If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
2021-10-20 11:51:52.438853: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at cudnn_rnn_ops.cc:1553 : Unknown: Fail to find the dnn implementation.
2021-10-20 11:51:52.441772: E tensorflow/stream_executor/cuda/cuda_dnn.cc:352] Loaded runtime CuDNN library: 8.0.5 but source was compiled with: 8.1.0.  CuDNN library needs to have matching major version and equal or higher minor version. If using a binary install, upgrade your CuDNN library.  If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
2021-10-20 11:51:52.442871: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at cudnn_rnn_ops.cc:1553 : Unknown: Fail to find the dnn implementation.
---------------------------------------------------------------------------
UnknownError                              Traceback (most recent call last)
/tmp/ipykernel_8334/638650037.py in <module>
      4     validation_data=(x_test, y_test),
      5     batch_size=32,
----> 6     epochs=10,
      7 )

17 frames
/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/execute.py in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
     58     ctx.ensure_initialized()
     59     tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
---> 60                                         inputs, attrs, num_outputs)
     61   except core._NotOkStatusException as e:
     62     if name is not None:

UnknownError:    Fail to find the dnn implementation.
	 [[{{node CudnnRNN}}]]
	 [[model/bidirectional/backward_lstm/PartitionedCall]] [Op:__inference_train_function_477706]

Function call stack:
train_function -> train_function -> train_function
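
The cuda_dnn.cc lines in the log above state the compatibility rule themselves: the loaded cuDNN must have the same major version and an equal or higher minor version than the one TensorFlow was compiled with. As a rough sketch of that rule applied to the logged message (the function names `parse_mismatch` and `cudnn_compatible` are hypothetical, and the regular expression assumes the message format shown in this log):

```python
import re

# Pattern for the message TensorFlow logs from cuda_dnn.cc (format taken from the log above).
_MISMATCH = re.compile(
    r"Loaded runtime CuDNN library: (\d+)\.(\d+)\.(\d+) "
    r"but source was compiled with: (\d+)\.(\d+)\.(\d+)"
)


def cudnn_compatible(loaded, compiled):
    """Rule quoted in the log: same major version, loaded minor >= compiled minor."""
    return loaded[0] == compiled[0] and loaded[1] >= compiled[1]


def parse_mismatch(log_line):
    """Return (loaded, compiled) version tuples, or None if the line does not match."""
    match = _MISMATCH.search(log_line)
    if match is None:
        return None
    nums = tuple(int(group) for group in match.groups())
    return nums[:3], nums[3:]


line = (
    "Loaded runtime CuDNN library: 8.0.5 but source was compiled with: 8.1.0.  "
    "CuDNN library needs to have matching major version and equal or higher minor version."
)
loaded, compiled = parse_mismatch(line)
print(loaded, compiled, cudnn_compatible(loaded, compiled))
# prints: (8, 0, 5) (8, 1, 0) False
```

Under that rule, the 8.0.5 runtime library is too old for a build compiled against 8.1.0, which is consistent with the "Fail to find the dnn implementation" error that follows in the log.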

@younader

Same issue for me: I can run on CPU but not on GPU. Running the text classification notebook on Colab, without any code modifications, throws:

UnknownError: 2 root error(s) found.
(0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node model/conv1d/conv1d (defined at /usr/local/lib/python3.7/dist-packages/autokeras/utils/utils.py:88) ]]
[[gradient_tape/model/embedding/embedding_lookup/Reshape/_76]]
(1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node model/conv1d/conv1d (defined at /usr/local/lib/python3.7/dist-packages/autokeras/utils/utils.py:88) ]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_5467]

Function call stack:
train_function -> train_function
