ERROR: CUDA RT call cudaFuncSetAttribute. Failed with invalid device function (98). #27

Closed
wang-zerui opened this issue Jul 28, 2023 · 2 comments

Comments

@wang-zerui

I ran into this error when training the model with use_fast_fftconv enabled.

ERROR: CUDA RT call "cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size )" in line 809 of file /***********/H3/csrc/fftconv/fftconv_cuda.cu failed with invalid device function (98).

This error doesn't actually stop the training process, but the result of the conv op is wrong. I also ran the test suite, and the fftconv tests fail until the process is killed:

PYTHONPATH=$(pwd) pytest tests/
========================================================================================================================= test session starts =========================================================================================================================
platform linux -- Python 3.8.16, pytest-7.4.0, pluggy-1.2.0
rootdir: /mnt/cache/wangzerui/H3-origin/H3
plugins: anyio-3.6.2
collected 4160 items                                                                                                                                                                                                                                                  

tests/ops/test_fftconv.py FFFFFFFFFFFFFFFFFFFFssssFFFFFFFFFFFFFFFFFFFFssssFFFFFFFFFFFFFFFFFFFFssssFFFFFFFFFFFFFFFFFFFFssssFFFFFFFFFFFFFFFFFFFFssssFFFFFFFFFFFFFFFFFFFFssssFFFFFFFFFFFFFFFFFFFFssssFFFFFFFFFFFFFFFFFFFFssssFFFFFFFFFFFFFFFFFFFFssssFFFFFFFFFFFFF [  5%]
FFFFKilled

Environment:
  • torch: 2.0.0+cu117
  • GPU: A100-80G × 8
  • CUDA: 11.7

I installed the fftconv extension by running:

cd csrc/cauchy && pip install . && cd ../../ \
    && cd csrc/fftconv && pip install . && cd ../../ \
    && cd .. && rm -rf csrc
@tridao
Contributor

tridao commented Jul 28, 2023

I haven't seen this, but searching suggests it happens when there's a mismatch between the CUDA version the binary was compiled for and the CUDA version on the device. One possible fix is to uninstall the fftconv extension, then reinstall it against the right CUDA version.
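To check for the kind of mismatch described above, it can help to print what PyTorch was built against next to what the GPU reports. The following is a minimal sketch, not part of the H3 repo: the helper name `collect_cuda_build_info` is made up for illustration, and it only uses `torch.version.cuda`, `torch.cuda.is_available`, `torch.cuda.get_device_capability`, and `torch.cuda.get_arch_list`. It falls back to empty values when PyTorch or a GPU is not present.

```python
def collect_cuda_build_info():
    """Gather the CUDA toolkit version PyTorch was built with and the GPU's capability.

    Hypothetical helper for illustration; returns a plain dict so the values
    can be pasted into a bug report.
    """
    info = {
        "torch_installed": False,
        "torch_cuda": None,          # e.g. "11.7" for a +cu117 build
        "gpu_available": False,
        "compute_capability": None,  # e.g. (8, 0) on an A100
        "arch_list": [],             # architectures PyTorch itself was compiled for
    }
    try:
        import torch
    except ImportError:
        return info

    info["torch_installed"] = True
    info["torch_cuda"] = torch.version.cuda
    if torch.cuda.is_available():
        info["gpu_available"] = True
        info["compute_capability"] = torch.cuda.get_device_capability(0)
        info["arch_list"] = list(torch.cuda.get_arch_list())
    return info


if __name__ == "__main__":
    for key, value in collect_cuda_build_info().items():
        print(f"{key}: {value}")
```

Comparing `torch_cuda` with the toolkit used to build the extension (`nvcc --version`) and with the "CUDA Version" field of `nvidia-smi` should show whether the three agree.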

@wang-zerui
Author

Solved after I switched to another cluster with a newer driver.

Current output of nvidia-smi:

Every 1.0s: nvidia-smi                                                                                    Mon Aug  7 19:15:39 2023

Mon Aug  7 19:15:39 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+

Previous output:

Every 1.0s: nvidia-smi                                                                                    Mon Aug  7 19:19:25 2023

Mon Aug  7 19:19:25 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06   Driver Version: 470.129.06   CUDA Version: 11.4     |
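
The two headers differ in the "CUDA Version" field, which reports the highest CUDA version the installed driver supports: 11.4 on the old cluster versus 12.0 on the new one, while the environment used CUDA 11.7. A rough sketch of that comparison is below. The function names are hypothetical, and the check is only a heuristic: CUDA minor-version compatibility lets some 11.x binaries run on older 11.x drivers, so a `False` result flags a likely problem rather than proving one.

```python
import re

# Matches the header row printed by nvidia-smi.
HEADER_RE = re.compile(r"Driver Version:\s*([\d.]+)\s+CUDA Version:\s*([\d.]+)")


def parse_smi_header(text):
    """Return (driver_version, (cuda_major, cuda_minor)) from nvidia-smi output."""
    match = HEADER_RE.search(text)
    if match is None:
        raise ValueError("no nvidia-smi header found")
    driver, cuda = match.groups()
    return driver, tuple(int(part) for part in cuda.split("."))


def driver_covers_toolkit(smi_text, toolkit_version):
    """True if the driver's supported CUDA version is >= the toolkit version."""
    _, driver_cuda = parse_smi_header(smi_text)
    toolkit = tuple(int(part) for part in toolkit_version.split("."))
    return driver_cuda >= toolkit


# Header rows copied from the two nvidia-smi outputs in this thread.
OLD = "| NVIDIA-SMI 470.129.06   Driver Version: 470.129.06   CUDA Version: 11.4     |"
NEW = "| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    CUDA Version: 12.0     |"

print(driver_covers_toolkit(OLD, "11.7"))  # False
print(driver_covers_toolkit(NEW, "11.7"))  # True
```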
