ERROR: CUDA RT call cudaFuncSetAttribute. Failed with invalid device function (98). #27

wang-zerui · 2023-07-28T18:08:35Z

I ran into this error when training the model with use_fast_fftconv.

ERROR: CUDA RT call "cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size )" in line 809 of file /***********/H3/csrc/fftconv/fftconv_cuda.cu failed with invalid device function (98).

This error doesn't stop the training process, but the result of the conv op is wrong. I also ran the test suite:

PYTHONPATH=$(pwd) pytest tests/
========================================================================================================================= test session starts =========================================================================================================================
platform linux -- Python 3.8.16, pytest-7.4.0, pluggy-1.2.0
rootdir: /mnt/cache/wangzerui/H3-origin/H3
plugins: anyio-3.6.2
collected 4160 items                                                                                                                                                                                                                                                  

tests/ops/test_fftconv.py FFFFFFFFFFFFFFFFFFFFssssFFFFFFFFFFFFFFFFFFFFssssFFFFFFFFFFFFFFFFFFFFssssFFFFFFFFFFFFFFFFFFFFssssFFFFFFFFFFFFFFFFFFFFssssFFFFFFFFFFFFFFFFFFFFssssFFFFFFFFFFFFFFFFFFFFssssFFFFFFFFFFFFFFFFFFFFssssFFFFFFFFFFFFFFFFFFFFssssFFFFFFFFFFFFF [  5%]
FFFFKilled

Envs:
- torch: 2.0.0+cu117
- gpu: A100-80G × 8
- cuda: 11.7

I installed fftconv by running:

cd csrc/cauchy && pip install . && cd ../../ \
    && cd csrc/fftconv && pip install . && cd ../../ \
    && cd .. && rm -rf csrc


tridao · 2023-07-28T18:13:29Z

I haven't seen this, but googling suggests it's caused by a mismatch between the CUDA version the binary was compiled for and the CUDA version available on the device. Maybe the solution is to uninstall the fftconv extension, then reinstall it with the right CUDA version.
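
One way to check for the kind of mismatch described above is to compare the toolkit that would compile the extension (`nvcc --version`) with the CUDA version PyTorch was built with (`torch.version.cuda`). The sketch below is only an illustration: the helper names `parse_nvcc_release`, `parse_torch_cuda`, and `check_local_build` are made up for this example and are not part of H3 or PyTorch, and it assumes `nvcc` is on the PATH.

```python
import re
import subprocess


def parse_nvcc_release(nvcc_output):
    """Extract (major, minor) from the 'release X.Y' text of `nvcc --version`."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("no 'release X.Y' found in nvcc output")
    return int(match.group(1)), int(match.group(2))


def parse_torch_cuda(version_string):
    """Turn a string such as '11.7' (the form of torch.version.cuda) into (11, 7)."""
    major, minor = version_string.split(".")[:2]
    return int(major), int(minor)


def check_local_build():
    """Compare the local nvcc release with the CUDA version PyTorch was built with.

    Hypothetical helper: needs PyTorch with CUDA support and nvcc on the PATH.
    torch.version.cuda is None on CPU-only builds, which this sketch does not handle.
    """
    import torch  # imported lazily so the parsers above work without PyTorch

    nvcc_output = subprocess.run(
        ["nvcc", "--version"], capture_output=True, text=True, check=True
    ).stdout
    toolkit = parse_nvcc_release(nvcc_output)
    built_with = parse_torch_cuda(torch.version.cuda)
    print(f"nvcc toolkit: {toolkit}, PyTorch built with CUDA: {built_with}")
    return toolkit == built_with
```

Calling `check_local_build()` on the training machine returns `False` when the two versions differ, which would be a hint to reinstall the extension with a matching toolkit. It does not inspect the compiled fftconv binary itself.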

wang-zerui · 2023-08-07T11:20:08Z

Solved after I switched to another cluster with a newer driver.

Current output of nvidia-smi:

Every 1.0s: nvidia-smi                                                                                    Mon Aug  7 19:15:39 2023

Mon Aug  7 19:15:39 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+

Previous output:

Every 1.0s: nvidia-smi                                                                                    Mon Aug  7 19:19:25 2023

Mon Aug  7 19:19:25 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06   Driver Version: 470.129.06   CUDA Version: 11.4     |
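
The "CUDA Version" field in the nvidia-smi header is the highest CUDA version the installed driver supports, so the two headers above can be compared against the 11.7 toolkit from the environment list. The following is a minimal sketch of that comparison; `parse_smi_header` and `driver_supports` are hypothetical names used only for illustration. It is a conservative check, since CUDA's minor-version compatibility can let an older driver run some newer builds within the same major release.

```python
import re


def parse_smi_header(line):
    """Return (driver_version, (cuda_major, cuda_minor)) from an nvidia-smi header line."""
    match = re.search(
        r"Driver Version:\s*([\d.]+)\s+CUDA Version:\s*(\d+)\.(\d+)", line
    )
    if match is None:
        raise ValueError("line does not look like an nvidia-smi header")
    return match.group(1), (int(match.group(2)), int(match.group(3)))


def driver_supports(line, toolkit):
    """True if the driver's maximum CUDA version is at least the toolkit version."""
    _, max_cuda = parse_smi_header(line)
    return max_cuda >= toolkit


old = "| NVIDIA-SMI 470.129.06   Driver Version: 470.129.06   CUDA Version: 11.4     |"
new = "| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    CUDA Version: 12.0     |"

# Toolkit version taken from the environment list in the first post (cuda: 11.7).
print(driver_supports(old, (11, 7)))  # False: 11.4 is older than 11.7
print(driver_supports(new, (11, 7)))  # True: 12.0 is newer than 11.7
```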

wang-zerui closed this as completed Aug 2, 2023
