Description
My application produces wrong numbers when built with a CUDA >= 11.3 toolchain. I have checked versions up to 11.6.
Source and assembly code: badcubin.zip
psi_list_ptr[iw] = psi_local;
Both sides are std::complex. With the bad binary, the imaginary part of the left-hand side ends up with the value 0.
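To make the symptom concrete, here is a minimal sketch of the kind of check that exposes it. This is not the application's code: the function name `count_lost_imag`, the test values, and the `map` clause are assumptions made for illustration, and this reduced loop is not known to trigger the miscompile. It only shows the assignment pattern and how a zeroed imaginary part would be detected.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical reproducer sketch; names and values are illustrative only.
// Returns the number of elements whose imaginary part was not stored correctly.
std::size_t count_lost_imag(std::size_t n)
{
  std::vector<std::complex<double>> psi_list(n);
  std::complex<double>* psi_list_ptr = psi_list.data();

  // The pragma is ignored when compiling without -fopenmp,
  // in which case the loop simply runs on the host.
  #pragma omp target teams distribute parallel for map(from: psi_list_ptr[0:n])
  for (std::size_t iw = 0; iw < n; ++iw)
  {
    // Non-zero imaginary part, so a dropped store is visible afterwards.
    const std::complex<double> psi_local(1.0 + iw, 2.0 + iw);
    psi_list_ptr[iw] = psi_local;
  }

  // Host-side verification of the imaginary parts.
  std::size_t bad = 0;
  for (std::size_t iw = 0; iw < n; ++iw)
    if (psi_list[iw].imag() != 2.0 + iw)
      ++bad;
  return bad;
}
```

A correct build should satisfy `assert(count_lost_imag(16) == 0);`. In the failing configuration described here, the equivalent check in the application would report imaginary parts equal to 0.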
I compared the --save-temps assembly files from CUDA 11.2 and 11.3. They differ only by the PTX version:
diff cuda11.3/MultiSlaterDetTableMethod-openmp-nvptx64-nvidia-cuda.s cuda11.2/
5c5
< .version 7.3
---
> .version 7.2
If I compile the whole application with the CUDA 11.3 toolchain, the test fails. Since my application uses OpenMP offload, the nvptx pass invokes ptxas. If I use ptxas from CUDA 11.2 to generate the cubin for the failing file and build everything else with CUDA 11.3, my test passes.
So my guess is that the nvptx backend and the ptxas shipped with CUDA >= 11.3 (PTX version >= 7.3) have some incompatibility that produces a bad binary. I am leaving my analysis here in the hope that backend experts will have more ideas.
Q: Is there a way to force clang to generate assembly files with a different PTX version? Combined with --ptxas-path, this would let me use an alternative ptxas for the failing file while everything else stays on the primary CUDA toolkit I need to use.