
[NVPTX] bad binary since cuda-11.3 #54633

Closed
ye-luo opened this issue Mar 30, 2022 · 16 comments
@ye-luo
Contributor

ye-luo commented Mar 30, 2022

My application produces wrong numbers when using a CUDA >= 11.3 toolchain (checked up to 11.6).
Source and assembly code:
badcubin.zip

psi_list_ptr[iw] = psi_local;

Both sides are std::complex. The bad binary causes the imaginary part of the left-hand side to end up as 0.

The --save-temps assembly files from CUDA 11.2 and 11.3 differ only by

diff cuda11.3/MultiSlaterDetTableMethod-openmp-nvptx64-nvidia-cuda.s cuda11.2/
5c5
< .version 7.3
---
> .version 7.2

If I compile the whole application with the CUDA 11.3 toolchain, the test fails. Since my application uses OpenMP offload, the nvptx toolchain invokes ptxas. If I use ptxas from CUDA 11.2 to generate the cubin for the failing file while everything else uses CUDA 11.3, my test passes.
So my guess is that the NVPTX backend and ptxas (>= 7.3) have some incompatibility that produces the bad binary. I'll just leave my analysis here; hopefully backend experts will have more ideas.

Q: Is there a way to force clang to generate assembly files with a different PTX version? Combined with --ptxas-path, that would let me use an alternative ptxas while the rest of the build stays on the primary CUDA toolkit I need to use.

@Artem-B
Member

Artem-B commented Mar 30, 2022

Is there a way to force clang to generate assembly files with a different PTX version?

I do not think that's going to help you. The issue is apparently with ptxas; the PTX version itself is inconsequential.
Using a different CUDA version (and thus a different ptxas), or explicitly specifying a ptxas from a different CUDA version, are your only practical options.

NVIDIA almost never fixes bugs in released CUDA versions, so, if the bug is not fixed in CUDA-11.6.2, you will need to wait for a new CUDA release and hope that they fix it there.

If you could create a small reproducer that could be used as a test one could compile and run, that would help me file a bug with NVIDIA and get it fixed. Ideally pure CUDA source, without having to use OpenMP. A reasonably small PTX source that assembles to something obviously wrong at the SASS level would work, too.

@Artem-B Artem-B self-assigned this Mar 30, 2022
@ye-luo
Contributor Author

ye-luo commented Mar 30, 2022

@Artem-B thank you for the comment. Yes, as a workaround I'd like to explicitly specify ptxas from a different CUDA version, namely point to the 11.2 one while using 11.4 for everything else. However, right now when Clang recognizes 11.4 it outputs the assembly file with PTX version 7.4, and ptxas from 11.2 rejects it, saying 7.4 is not supported. So if I can force the PTX assembly files to be emitted with the 7.2 version label, I can implement a workaround for now: ptxas from 11.2, CUDA 11.4 for everything else.

Not much luck when I tried to create a small reproducer. Will try SASS.

@Artem-B
Member

Artem-B commented Mar 30, 2022

You could try overriding the PTX version with -Xclang -target-feature -Xclang -ptx74 -Xclang -target-feature -Xclang +ptx72.
It would need to be tweaked if you use CUDA and ptxas versions other than 11.4 and 11.2.

If you use a newer ptxas, extra options should not be needed, but the produced cubin may not be compatible with the older CUDA runtime.
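
For anyone needing the full workaround, a hypothetical invocation combining the PTX-version override with an explicit ptxas (a sketch only: the -Xclang -target-feature flags and the --ptxas-path option are the ones discussed in this thread, while the toolkit paths, the -fopenmp flags, and the file names are placeholder assumptions for an OpenMP offload build using CUDA 11.4 as the primary toolkit and ptxas from CUDA 11.2):

    clang++ -O3 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda \
        --cuda-path=/opt/cuda-11.4 \
        --ptxas-path=/opt/cuda-11.2/bin/ptxas \
        -Xclang -target-feature -Xclang -ptx74 \
        -Xclang -target-feature -Xclang +ptx72 \
        app.cpp -o app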

@Artem-B
Member

Artem-B commented Jan 28, 2023

Would you be able to reduce the reproducer? It would be great to figure out what triggers the issue and, if it's indeed a miscompile in ptxas, report it to NVIDIA so they can get it fixed.

@markdewing
Contributor

markdewing commented Oct 12, 2023

This issue affects complex number types and structures.
I'm trying to create a smaller reproducer for this issue, and so far have not been successful.
The failure mode is that the assignment of the reduction variable after the reduction fails to copy the second element (for the type with 3 elements, the third element copies okay).
What matters is the reduction: if the pragma with the reduction is removed, the code works.

A snippet of code with the issue:

#pragma omp target teams distribute is_device_ptr(psiV_list_devptr, psiMinv_temp_list_devptr) \
                              map(always, from:curRatio_list_ptr[:nw])
      for (size_t iw = 0; iw < nw; iw++)
      {
        GradType ratioGradRef_local(0);
#pragma omp parallel for reduction(+ : ratioGradRef_local)
        for (size_t i = 0; i < NumPtcls; i++)
        {
          const size_t J            = confgListOccup_ptr[i];
          psiV_temp_list_ptr[iw][i] = psiV_list_devptr[iw][J];
          ratioGradRef_local += psiMinv_temp_list_devptr[iw][i * psiMinv_cols + WorkingIndex] * dpsiV_list_ptr[iw][J];
        }
  
        // Workaround for https://github.com/QMCPACK/qmcpack/issues/4767
        // Compiler-generated assignment fails for the second component.
        // Workaround is to assign each component individually.
        // This assignment fails
        //ratioGradRef_list_ptr[iw] = ratioGradRef_local;
        // Assigning individual components works
        ratioGradRef_list_ptr[iw][0] = ratioGradRef_local[0];
        ratioGradRef_list_ptr[iw][1] = ratioGradRef_local[1];
        ratioGradRef_list_ptr[iw][2] = ratioGradRef_local[2];
        ...

GradType is basically a struct with 3 doubles.
More precisely, it is of type TinyVector<double, 3>, where TinyVector is defined as

template<class T, unsigned D>
struct TinyVector
{
  using Type_t = T;
  enum
  {
    Size = D
  };
  T X[Size];
  ...

And, as originally posted, the assignment succeeds with CUDA 11.2.2, and fails with any later version of CUDA.

The issue can be seen by adding printf's immediately after the assignment

      ratioGradRef_list_ptr[iw] = ratioGradRef_local;
      printf("iw = %lu ratioGradRef_local = %g %g %g\n", iw, ratioGradRef_local[0], ratioGradRef_local[1], ratioGradRef_local[2]);
      printf("iw = %lu ratioGradRef_list = %g %g %g\n", iw, ratioGradRef_list_ptr[iw][0], ratioGradRef_list_ptr[iw][1], ratioGradRef_list_ptr[iw][2]);

Will produce the output

iw = 0 ratioGradRef_local = 4.64154 3.48115 2.18973
iw = 1 ratioGradRef_local = 0.701702 0.801945 0.443089
iw = 0 ratioGradRef_list = 4.64154 0 2.18973
iw = 1 ratioGradRef_list = 0.701702 0 0.443089

@markdewing
Contributor

The type of the index in the outer loop over nw seems to matter. If the type is changed from size_t to uint32_t, the problem seems to go away. Still can't reproduce in a smaller, standalone test, though.

@markdewing
Contributor

The PTX looks fine, but the incorrect code is visible in the SASS.

One oddity is the load of the index variable to do the offset computation for ratioGradRef_list_ptr. With a 32 bit index variable, it does a 32 bit load. With a 64 bit index variable, it does a 128 bit load (???). In some versions of CUDA, a 64 bit load here would result in correct behavior because it wouldn't overwrite a certain register, but I think that might be accidental. (At least assuming semantics of SASS based on mnemonic name and structure)

The PTX from clang, for the 'good' case (32 bit index variable):

    .loc    1 795 29                        // qmcpack/src/QMCWaveFunctions/Fermion/MultiDiracDeterminant.2.cpp:795:29
    ld.shared.u32   %r352, [iw_shared5_$_0];
    .loc    1 795 7                         // qmcpack/src/QMCWaveFunctions/Fermion/MultiDiracDeterminant.2.cpp:795:7
    mul.wide.u32    %rd352, %r352, 24;
    add.s64     %rd353, %rd3, %rd352;
    .loc    1 795 33                        // qmcpack/src/QMCWaveFunctions/Fermion/MultiDiracDeterminant.2.cpp:795:33
    ld.shared.u64   %rd354, [ratioGradRef_local_shared+16];
    st.global.u64   [%rd353+16], %rd354;
    ld.shared.u64   %rd355, [ratioGradRef_local_shared+8];
    st.global.u64   [%rd353+8], %rd355;
    ld.shared.u64   %rd356, [ratioGradRef_local_shared];
    st.global.u64   [%rd353], %rd356;
    .loc    1 803 17                        // qmcpack/src/QMCWaveFunctions/Fermion/MultiDiracDeterminant.2.cpp:803:17

For the 'bad' case (64 bit index variable), the index variable iw_shared5_ is loaded and multiplied with 64 bit ops, as expected, and the rest is the same.

     ld.shared.u64   %rd379, [iw_shared5_$_0];
    .loc    1 795 7                         // qmcpack/src/QMCWaveFunctions/Fermion/MultiDiracDeterminant.2.cpp:795:7
    mul.lo.s64  %rd380, %rd379, 24;
    add.s64     %rd381, %rd3, %rd380;
    .loc    1 795 33                        // qmcpack/src/QMCWaveFunctions/Fermion/MultiDiracDeterminant.2.cpp:795:33
    ld.shared.u64   %rd382, [ratioGradRef_local_shared+16];
    st.global.u64   [%rd381+16], %rd382;
    ld.shared.u64   %rd383, [ratioGradRef_local_shared+8];
    st.global.u64   [%rd381+8], %rd383;
    ld.shared.u64   %rd384, [ratioGradRef_local_shared];
    st.global.u64   [%rd381], %rd384;
    .loc    1 803 17                        // qmcpack/src/QMCWaveFunctions/Fermion/MultiDiracDeterminant.2.cpp:803:17

Now, the SASS for the 'bad' case (64 bit index variable) (Using CUDA 11.6):

.L_x_646:
        /*5060*/                   LDS.128 R4, [`($__ratioGradRef_local_shared__2660)] ;
        /*5070*/                   LDS.64 R2, [`(($__ratioGradRef_local_shared__2660 + 0x10))] ;
        /*5080*/                   DADD R4, R4, R16 ;
        /*5090*/                   DADD R6, R6, R18 ;
        /*50a0*/                   DADD R2, R2, R22 ;
        /*50b0*/                   STS.128 [`($__ratioGradRef_local_shared__2660)], R4 ;
        /*50c0*/                   STS.64 [`(($__ratioGradRef_local_shared__2660 + 0x10))], R2 ;
.L_x_672:
        /*50d0*/                   BSYNC B8 ;
.L_x_635:
        /*50e0*/                   STS [`(_ZN4ompx5state9TeamStateE_$_0)], RZ ;
        /*50f0*/                   BRA `(.L_x_633) ;
.L_x_634:
        /*5100*/                   IMAD.MOV.U32 R2, RZ, RZ, 0x1 ;
        /*5110*/                   STS [`(_ZN4ompx5state9TeamStateE_$_2)], R0 ;
        /*5120*/                   STS [`(_ZN4ompx5state9TeamStateE_$_1)], R2 ;
        /*5130*/                   STS [`(_ZN4ompx5state9TeamStateE_$_0)], R2 ;
        /*5140*/                   BAR.SYNC 0x8 ;
        /*5150*/                   BAR.SYNC 0x8 ;
        /*5160*/                   STS [`(_ZN4ompx5state9TeamStateE_$_0)], RZ ;
        /*5170*/                   STS [`(_ZN4ompx5state9TeamStateE_$_1)], RZ ;
        /*5180*/                   STS [`(_ZN4ompx5state9TeamStateE_$_2)], R2 ;
.L_x_633:
        /*5190*/                   LDS.128 R4, [`($__iw_shared5_$_0__2656)] ;
        /*51a0*/                   IMAD.MOV.U32 R3, RZ, RZ, 0x18 ;
        /*51b0*/                   ISETP.NE.AND P0, PT, R24, RZ, PT ;
        /*51c0*/                   ULDC.64 UR4, c[0x0][0x118] ;
        /*51d0*/                   LDS.64 R8, [`(($__ratioGradRef_local_shared__2660 + 0x10))] ;
        /*51e0*/                   LDS.64 R10, [`($__ratioGradRef_local_shared__2660)] ;
        /*51f0*/              @!P0 STS.64 [`($__c_ratio_shared8_$_0__2670)], RZ ;
        /*5200*/                   IMAD.WIDE.U32 R2, R4, R3, c[0x0][0x1a8] ;
        /*5210*/                   IMAD R5, R5, 0x18, RZ ;
        /*5220*/                   IMAD.IADD R3, R3, 0x1, R5 ;
        /*5230*/                   STG.E.64 [R2.64+0x8], R6 ;
        /*5240*/                   STG.E.64 [R2.64+0x10], R8 ;
        /*5250*/                   STG.E.64 [R2.64], R10 ;
        /*5260*/                   WARPSYNC 0xffffffff ;

Notice at the end there are two 64-bit loads, into R8 and R10, but three 64-bit stores, from R6, R8, and R10. The load into R6 is missing.
Also note the 128-bit load of iw_shared5_ into R4. At the top, the addition from the reduction accumulates into R2, R4, and R6, so it appears that R6 would contain the desired value if it were not overwritten by the 128-bit load into R4 (which I assume also loads R6). That may be accidental, as a version compiled with CUDA 12.2 accumulates into different registers (though my guess is that the manifestation of the issue would then be that the reduction misses the last iteration).

Now the SASS for the 'good' case (32 bit index variable) (using nvdisasm from CUDA 11.6):

.L_x_644:
        /*4c50*/                   LDS.128 R4, [`($__ratioGradRef_local_shared__2660)] ;
        /*4c60*/                   LDS.64 R8, [`(($__ratioGradRef_local_shared__2660 + 0x10))] ;
        /*4c70*/                   DADD R4, R4, R16 ;
        /*4c80*/                   DADD R6, R6, R18 ;
        /*4c90*/                   DADD R8, R8, R22 ;
        /*4ca0*/                   STS.128 [`($__ratioGradRef_local_shared__2660)], R4 ;
        /*4cb0*/                   STS.64 [`(($__ratioGradRef_local_shared__2660 + 0x10))], R8 ;
.L_x_670:
        /*4cc0*/                   BSYNC B8 ;
.L_x_633:
        /*4cd0*/                   STS [`(_ZN4ompx5state9TeamStateE_$_0)], RZ ;
        /*4ce0*/                   BRA `(.L_x_631) ;
.L_x_632:
        /*4cf0*/                   IMAD.MOV.U32 R3, RZ, RZ, 0x1 ;
        /*4d00*/                   STS [`(_ZN4ompx5state9TeamStateE_$_2)], R0 ;
        /*4d10*/                   STS [`(_ZN4ompx5state9TeamStateE_$_1)], R3 ;
        /*4d20*/                   STS [`(_ZN4ompx5state9TeamStateE_$_0)], R3 ;
        /*4d30*/                   BAR.SYNC 0x8 ;
        /*4d40*/                   BAR.SYNC 0x8 ;
        /*4d50*/                   STS [`(_ZN4ompx5state9TeamStateE_$_0)], RZ ;
        /*4d60*/                   STS [`(_ZN4ompx5state9TeamStateE_$_1)], RZ ;
        /*4d70*/                   STS [`(_ZN4ompx5state9TeamStateE_$_2)], R3 ;
.L_x_631:
        /*4d80*/                   LDS R4, [`($__iw_shared5_$_0__2656)] ;
        /*4d90*/                   ISETP.NE.AND P0, PT, R30, RZ, PT ;
        /*4da0*/                   IMAD.MOV.U32 R5, RZ, RZ, 0x18 ;
        /*4db0*/                   WARPSYNC 0xffffffff ;
        /*4dc0*/                   LDS.64 R6, [`(($__ratioGradRef_local_shared__2660 + 0x10))] ;
        /*4dd0*/                   ULDC UR4, c[0x0][0x0] ;
        /*4de0*/                   UIADD3 UR4, UR4, -0x20, URZ ;
        /*4df0*/                   LDS.128 R8, [`($__ratioGradRef_local_shared__2660)] ;
        /*4e00*/                   IMAD.U32 R31, RZ, RZ, UR4 ;
        /*4e10*/              @!P0 STS.64 [`($__c_ratio_shared8_$_0__2670)], RZ ;
        /*4e20*/                   IMAD.WIDE.U32 R4, R4, R5, c[0x0][0x1a8] ;
        /*4e30*/                   STG.E.64 [R4.64+0x10], R6 ;
        /*4e40*/                   STG.E.64 [R4.64+0x8], R10 ;
        /*4e50*/                   STG.E.64 [R4.64], R8 ;
        /*4e60*/                   BAR.SYNC 0x0 ;

At the end, there is a 64-bit load into R6 and a 128-bit load into R8 and R10, and three 64-bit stores from R6, R8, and R10.

@Artem-B
Member

Artem-B commented Oct 13, 2023

This may be a ptxas bug.
I'm also concerned by the rearranged WARPSYNC and BAR.SYNC 0x0.

The SASS snippets appear to do a bit more than the PTX snippets. Can you post the complete PTX and SASS disassembly of the function for the good/bad cases on gist.github.com? Maybe the LLVM IR as well -- that makes it easier to reproduce the issue if it turns out to be on the LLVM side.

Just in case, which clang/LLVM version are you using?

@markdewing
Contributor

These code fragments used clang version 18.0.0 (https://github.com/llvm/llvm-project.git 80c01dd)

@markdewing
Contributor

PTX and SASS for the function MultiDiracDeterminant::mw_evaluateDetsAndGradsForPtclMove

https://gist.github.com/markdewing/12143bb6679c977a5191280fc909f31e

Look for the variable ratioGradRef_local

These use CUDA 12.2 and clang version 18.0.0 (https://github.com/llvm/llvm-project.git 6f5b372)

@markdewing
Contributor

Additional data point: a version of the code that uses floats instead of doubles has a similar issue. Assigning the first two values (64 bits worth of data) after the reduction works correctly, but assigning the third value in the array does not (the value ends up as zero).
The really interesting part is that a similar workaround works: just as reducing the size of the outer loop index variable from 64 to 32 bits fixes the issue with doubles, reducing it from 32 to 16 bits fixes the issue with floats.

@Artem-B
Member

Artem-B commented Oct 23, 2023

A few more things to try:

  • Nvidia has recently released cuda-12.3. It may be worth checking whether the issue is still present there.
  • Try disabling ptxas optimizations with -Xcuda-ptxas -O0 and see whether that has any effect on the problem (see the sketch after this list). If the unoptimized build works, increase the optimization level until it breaks; the default for ptxas is -O3. If the issue is affected by the ptxas optimization level, that would be another data point suggesting that ptxas is the culprit.
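
A hypothetical invocation for the second suggestion (a sketch only: the file name, output name, and other flags are placeholder assumptions for an OpenMP offload build; just the -Xcuda-ptxas -O0 part comes from the comment above):

    clang++ -O3 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda \
        -Xcuda-ptxas -O0 app.cpp -o app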

On a side note, the generated PTX appears to use local stores. With more aggressive optimization settings (specifically an increased loop-unroll threshold) it may be possible to simplify the function's control flow, which might avoid triggering the problem in ptxas.

@markdewing
Contributor

The issue appears to be fixed in CUDA 12.3.

My current guess is that it is an issue with combining loads: two consecutive 64-bit loads get combined into a single 128-bit load, but the pass picks the wrong variable or register. When the index variable is the same size as the underlying data type of the other variable, both are in the same bucket of loads being considered for combining, and the optimization pass picks the wrong one. If the index variable is not the same size, it is no longer in the same bucket, and the optimization pass can't pick the wrong one.

I did create a reproducer. The essential feature was getting the loop index variable iw demoted to shared memory (iw_shared5_ in the SASS above); using it as an index inside the reduction loop makes that happen.

repo_assign_after_reduction.tar.gz
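
For readers who can't grab the tarball, here is a minimal standalone sketch of the pattern described above (this is not the attached reproducer and has not been verified to trigger the bug; Vec3 stands in for TinyVector<double, 3>, and all names and sizes are made up). The 64-bit outer index iw is used inside the inner reduction loop, so it gets demoted to shared memory, and the whole-struct assignment after the reduction is the store that went wrong; switching iw to uint32_t is the workaround noted earlier.

// Hypothetical sketch, not the attached reproducer.
// Build (assumption): clang++ -O3 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda repro.cpp
#include <cstdio>
#include <cstdlib>

struct Vec3 // stand-in for TinyVector<double, 3>
{
  double X[3];
  Vec3() : X{0, 0, 0} {}
  Vec3& operator+=(const Vec3& rhs)
  {
    X[0] += rhs.X[0];
    X[1] += rhs.X[1];
    X[2] += rhs.X[2];
    return *this;
  }
};

#pragma omp declare reduction(+ : Vec3 : omp_out += omp_in)

int main()
{
  const size_t nw = 2, n = 64;
  Vec3* out  = (Vec3*)calloc(nw, sizeof(Vec3));
  double* in = (double*)malloc(nw * n * sizeof(double));
  for (size_t i = 0; i < nw * n; i++)
    in[i] = 0.5;

#pragma omp target teams distribute map(from : out[:nw]) map(to : in[:nw * n])
  for (size_t iw = 0; iw < nw; iw++) // 64-bit index, as in the failing code; uint32_t avoided the issue there
  {
    Vec3 acc;
#pragma omp parallel for reduction(+ : acc)
    for (size_t i = 0; i < n; i++)
    {
      Vec3 v;
      v.X[0] = in[iw * n + i]; // iw is used inside the inner loop, which
      v.X[1] = in[iw * n + i]; // forces it into team-shared storage
      v.X[2] = in[iw * n + i];
      acc += v;
    }
    out[iw] = acc; // whole-struct assignment after the reduction
  }

  // In the real code, the second component of the assigned struct came back as 0.
  for (size_t iw = 0; iw < nw; iw++)
    printf("iw=%zu out = %g %g %g\n", iw, out[iw].X[0], out[iw].X[1], out[iw].X[2]);
  return 0;
}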

@Artem-B
Member

Artem-B commented Oct 25, 2023

Thank you for figuring it out.

I wish NVIDIA would provide more public info about known issues and which ones are fixed in a particular release, so we don't stumble around in the dark debugging the issues they already know about.

@ye-luo
Contributor Author

ye-luo commented Oct 25, 2023

My original issue has been resolved. Thanks @markdewing for figuring out all the low-level stuff.

@ye-luo ye-luo closed this as completed Oct 25, 2023