Skip to content

Conversation

@dakersnar
Copy link
Contributor

Whoops, I made the same mistake as https://github.com/llvm/llvm-project/pull/169535/files again, so 5c8c7f3 is causing build issues. This should fix that, I updated the tests to use sm_90 and ptx 8.8, where the used bytes mask feature is approved.

@llvmbot
Copy link
Member

llvmbot commented Dec 8, 2025

@llvm/pr-subscribers-backend-nvptx

Author: Drew Kersnar (dakersnar)

Changes

Whoops, I made the same mistake as https://github.com/llvm/llvm-project/pull/169535/files again, so 5c8c7f3 is causing build issues. This should fix that, I updated the tests to use sm_90 and ptx 8.8, where the used bytes mask feature is approved.


Full diff: https://github.com/llvm/llvm-project/pull/171220.diff

2 Files Affected:

  • (modified) llvm/test/CodeGen/NVPTX/LoadStoreVectorizer.ll (+5-5)
  • (modified) llvm/test/CodeGen/NVPTX/param-vectorize-device.ll (+2-2)
diff --git a/llvm/test/CodeGen/NVPTX/LoadStoreVectorizer.ll b/llvm/test/CodeGen/NVPTX/LoadStoreVectorizer.ll
index a75ddd032d4c0..a0dfe15db2ba9 100644
--- a/llvm/test/CodeGen/NVPTX/LoadStoreVectorizer.ll
+++ b/llvm/test/CodeGen/NVPTX/LoadStoreVectorizer.ll
@@ -1,8 +1,8 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
-; RUN: llc < %s | FileCheck -check-prefix=ENABLED %s
-; RUN: llc -disable-nvptx-load-store-vectorizer < %s | FileCheck -check-prefix=DISABLED %s
-; RUN: %if ptxas %{ llc < %s | %ptxas-verify %}
-; RUN: %if ptxas %{ llc -disable-nvptx-load-store-vectorizer < %s | %ptxas-verify %}
+; RUN: llc -march=nvptx64 -mcpu=sm_90 -mattr=+ptx88 < %s | FileCheck -check-prefix=ENABLED %s
+; RUN: llc -march=nvptx64 -mcpu=sm_90 -mattr=+ptx88 -disable-nvptx-load-store-vectorizer < %s | FileCheck -check-prefix=DISABLED %s
+; RUN: %if ptxas-sm_90 && ptxas-isa-8.8 %{ llc -march=nvptx64 -mcpu=sm_90 -mattr=+ptx88 < %s | %ptxas-verify -arch=sm_90 %}
+; RUN: %if ptxas-sm_90 && ptxas-isa-8.8 %{ llc -march=nvptx64 -mcpu=sm_90 -mattr=+ptx88 -disable-nvptx-load-store-vectorizer < %s | %ptxas-verify -arch=sm_90 %}
 
 target triple = "nvptx64-nvidia-cuda"
 
@@ -52,7 +52,7 @@ define half @fh(ptr %p) {
 ; ENABLED-NEXT:    ld.param.b64 %rd1, [fh_param_0];
 ; ENABLED-NEXT:    .pragma "used_bytes_mask 0x3ff";
 ; ENABLED-NEXT:    ld.v4.b32 {%r1, %r2, %r3, %r4}, [%rd1];
-; ENABLED-NEXT:    { .reg .b16 tmp; mov.b32 {%rs1, tmp}, %r3; }
+; ENABLED-NEXT:    mov.b32 {%rs1, _}, %r3;
 ; ENABLED-NEXT:    mov.b32 {%rs2, %rs3}, %r2;
 ; ENABLED-NEXT:    mov.b32 {%rs4, %rs5}, %r1;
 ; ENABLED-NEXT:    cvt.f32.f16 %r5, %rs5;
diff --git a/llvm/test/CodeGen/NVPTX/param-vectorize-device.ll b/llvm/test/CodeGen/NVPTX/param-vectorize-device.ll
index 643de006f14c4..a2658ac5873a4 100644
--- a/llvm/test/CodeGen/NVPTX/param-vectorize-device.ll
+++ b/llvm/test/CodeGen/NVPTX/param-vectorize-device.ll
@@ -1,5 +1,5 @@
-; RUN: llc < %s -mtriple=nvptx64-unknown-unknown | FileCheck %s
-; RUN: %if ptxas %{ llc < %s -mtriple=nvptx64-unknown-unknown | %ptxas-verify %}
+; RUN: llc -march=nvptx64 -mcpu=sm_90 -mattr=+ptx88 < %s | FileCheck %s
+; RUN: %if ptxas-sm_90 && ptxas-isa-8.8 %{ llc -march=nvptx64 -mcpu=sm_90 -mattr=+ptx88 < %s | %ptxas-verify -arch=sm_90 %}
 ;
 ; Check that parameters of a __device__ function with private or internal
 ; linkage called from a __global__ (kernel) function get increased alignment,

Copy link
Member

@Artem-B Artem-B left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On a second thought, there's still something odd here. Why does llc with the default flags generates something that is not supported by a recent ptxas?

@dakersnar
Copy link
Contributor Author

On a second thought, there's still something odd here. Why does llc with the default flags generates something that is not supported by a recent ptxas?

Hmm, I see what you mean. I might have missed something in my original masked load lowering implementation. Perhaps there needs to be a guard for this feature based on sm and ptx version.

@github-actions
Copy link

github-actions bot commented Dec 8, 2025

🐧 Linux x64 Test Results

  • 187154 tests passed
  • 4929 tests skipped

✅ The build succeeded and all tests passed.

@github-actions
Copy link

github-actions bot commented Dec 8, 2025

🪟 Windows x64 Test Results

  • 128438 tests passed
  • 2793 tests skipped

✅ The build succeeded and all tests passed.

@boomanaiden154
Copy link
Contributor

Hmm, I see what you mean. I might have missed something in my original masked load lowering implementation. Perhaps there needs to be a guard for this feature based on sm and ptx version.

Should we revert the original patch in the mean time? Or is fixing forward the best option here?

@Artem-B
Copy link
Member

Artem-B commented Dec 8, 2025

Perhaps there needs to be a guard for this feature based on sm and ptx version.

This sounds plausible. AFAICT, all failures we see follow the same pattern, complaining about used_bytes_mask. For older PTX/SM we should not lower to masked ops.

@dakersnar
Copy link
Contributor Author

dakersnar commented Dec 8, 2025

This sounds plausible. AFAICT, all failures we see follow the same pattern, complaining about used_bytes_mask. For older PTX/SM we should not lower to masked ops.

FWIW as far as PTX is concerned the correctness of the PTX is not impacted by the existence of the pragma, the pragma is a feature that exists to be a hint for the sanitizer.

So for older versions, we should be able to create masked loads and simply not emit the used_bytes_mask pragma if the subtarget check fails.

Should we revert the original patch in the mean time? Or is fixing forward the best option here?

I can have a fix up in 5-10 mins, which I think might be easier.

@Artem-B
Copy link
Member

Artem-B commented Dec 8, 2025

Should we revert the original patch in the mean time? Or is fixing forward the best option here?

If it's just a missing predicate, fixing forward may be the easiest. If it's going to take more time, then we should revert it, as it will break everyone with the older GPU (though I doubt there are many users of cuda older than 12.3 or pre-sm_50 GPUs).

@github-actions
Copy link

github-actions bot commented Dec 8, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@dakersnar
Copy link
Contributor Author

I think this should do it. Trying to get my build built locally to confirm. I also think we should keep the changes to the llc commands in these tests so that they are testing more recent versions, rather than having them call llc with no arguments and not generate the used_bytes_mask feature. But I can confirm locally that without the llc arguments the used_bytes_mask is not generated.

@Artem-B
Copy link
Member

Artem-B commented Dec 8, 2025

also think we should keep the changes to the llc commands in these tests so that they are testing more recent versions, rather than having them call llc with no arguments and not generate the used_bytes_mask feature. But I can confirm locally that without the llc arguments the used_bytes_mask is not generated.

I would suggest adding a dedicated test that verifies that the pragma is generated or not.

@dakersnar
Copy link
Contributor Author

Shoot, @Artem-B any chance you know the fix to this off the top of your head? I was hoping I could just add this argument to printUsedBytesMaskPragma but it doesn't seem to be that simple.

image

@Artem-B
Copy link
Member

Artem-B commented Dec 8, 2025

Looks like tablegen back-end assumes a particular list of arguments passed to the operand PrintMethod.
Off the top of my head, I'm not sure how to plumb additional MCSubtargetInfo argument into it.
Instead of passing it directly, can we recover MCSubtargetInfo via MCInst ?

@dakersnar
Copy link
Contributor Author

Added new test here: 8cd79d7

This commit: e180119 technically fixes my local build, and the test appears to be working as intended. However, it feels a little hacky. It's basically caching the STI from the call to printInst. Am I correct to hesitate with this?

@dakersnar
Copy link
Contributor Author

An alternative solution is to filter these masks away in convertMLOADToLoadWithUsedBytesMask in NVPTXISelLowering, where we have access to STI, like this:

  // If our subtarget does not support the used bytes mask pragma, "drop" the
  // mask by setting it to UINT32_MAX
  if (!STI.hasUsedBytesMaskPragma())
    UsedBytesMask = UINT32_MAX;

@Artem-B
Copy link
Member

Artem-B commented Dec 9, 2025

This commit: e180119 technically fixes my local build,

That's not great. :-/

If the change is still revertable cleanly, let's revert it for now and try again later.

If reverting is not doable cleanly, we can comment out pragma generation altogether for now, as a stop-gap workaround and fix it forward later.

@Artem-B
Copy link
Member

Artem-B commented Dec 9, 2025

An alternative solution is to filter these masks away in convertMLOADToLoadWithUsedBytesMask in NVPTXISelLowering,

This may work.

@dakersnar
Copy link
Contributor Author

Let me put up the alternate solution, and if we decide it's a no go, I'm ok with reverting.

@dakersnar dakersnar force-pushed the github/dkersnar/fix-tests-lsv-gap-fill branch from 8cd79d7 to b6e1e51 Compare December 9, 2025 00:20
@dakersnar
Copy link
Contributor Author

To minimize churn, I'm setting the existing tests to run on sm_50 and ptx 8.3

@dakersnar
Copy link
Contributor Author

(setting it to sm_90 unlocks some other optimizations in LoadStoreVectorizer.ll that I don't think we should include in this fix change)

@dakersnar
Copy link
Contributor Author

Ok @Artem-B this looks good locally. Sorry again for missing this, but assuming we are ok with this solution I think we could merge this as soon as checks are green and it should solve the build break.

Comment on lines 2 to 5
; RUN: llc -march=nvptx64 -mcpu=sm_90 -mattr=+ptx88 < %s | FileCheck -check-prefix=ENABLED %s
; RUN: llc -march=nvptx64 -mcpu=sm_90 -mattr=+ptx88 -disable-nvptx-load-store-vectorizer < %s | FileCheck -check-prefix=DISABLED %s
; RUN: %if ptxas-sm_90 && ptxas-isa-8.8 %{ llc -march=nvptx64 -mcpu=sm_90 -mattr=+ptx88 < %s | %ptxas-verify -arch=sm_90 %}
; RUN: %if ptxas-sm_90 && ptxas-isa-8.8 %{ llc -march=nvptx64 -mcpu=sm_90 -mattr=+ptx88 -disable-nvptx-load-store-vectorizer < %s | %ptxas-verify -arch=sm_90 %}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should be left with the defaults.

Pragma's own tests should live in a separate test. They are coincidental here.

Copy link
Contributor Author

@dakersnar dakersnar Dec 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I've removed the version specifiers and with the addition of the new gate all the pragmas as no longer generated, as we expect.

@@ -0,0 +1,36 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
; RUN: llc < %s -march=nvptx64 -mcpu=sm_50 -mattr=+ptx82 | FileCheck %s -check-prefixes=NOMASK
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do bother to explicitly specify a GPU variant, I'd use a more recent one, so the tests continue to work with cuda-13 which no longer supports sm_50. E.g. sm_80 or sm_90 would be a choice reasonably stable long term.

I'd also add a comment statiing the actual constrint on the pragma use, and that we're using PTX to gate it, and GPU is fixed for stability.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Went with sm_90 for this test. Done.

Copy link
Member

Artem-B commented Dec 9, 2025

setting it to sm_90 unlocks some other optimizations

Up to you. sm_50 is OK to use for now.

@dakersnar dakersnar force-pushed the github/dkersnar/fix-tests-lsv-gap-fill branch from 117fe03 to d01d16c Compare December 9, 2025 00:46
@dakersnar
Copy link
Contributor Author

@boomanaiden154 this should be good to merge whenever, although I'm not sure whether you would prefer to wait for the final two in-progress checks.

@boomanaiden154
Copy link
Contributor

this should be good to merge whenever, although I'm not sure whether you would prefer to wait for the final two in-progress checks.

Should be good to merge now. AArch64 is non-blocking and the libcxx-ci checks usually never finish.

@dakersnar dakersnar merged commit cdb525d into llvm:main Dec 9, 2025
8 of 9 checks passed
@llvm-ci
Copy link
Collaborator

llvm-ci commented Dec 9, 2025

LLVM Buildbot has detected a new failure on builder sanitizer-x86_64-linux-android running on sanitizer-buildbot-android while building llvm at step 2 "annotate".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/186/builds/14563

Here is the relevant piece of the build log for the reference
Step 2 (annotate) failure: 'python ../sanitizer_buildbot/sanitizers/zorg/buildbot/builders/sanitizers/buildbot_selector.py' (failure)
...
[       OK ] AddressSanitizer.AtoiAndFriendsOOBTest (2284 ms)
[ RUN      ] AddressSanitizer.HasFeatureAddressSanitizerTest
[       OK ] AddressSanitizer.HasFeatureAddressSanitizerTest (0 ms)
[ RUN      ] AddressSanitizer.CallocReturnsZeroMem
[       OK ] AddressSanitizer.CallocReturnsZeroMem (8 ms)
[ DISABLED ] AddressSanitizer.DISABLED_TSDTest
[ RUN      ] AddressSanitizer.IgnoreTest
[       OK ] AddressSanitizer.IgnoreTest (0 ms)
[ RUN      ] AddressSanitizer.SignalTest
[       OK ] AddressSanitizer.SignalTest (190 ms)
[ RUN      ] AddressSanitizer.ReallocTest
[       OK ] AddressSanitizer.ReallocTest (21 ms)
[ RUN      ] AddressSanitizer.WrongFreeTest
[       OK ] AddressSanitizer.WrongFreeTest (107 ms)
[ RUN      ] AddressSanitizer.LongJmpTest
[       OK ] AddressSanitizer.LongJmpTest (0 ms)
[ RUN      ] AddressSanitizer.ThreadStackReuseTest
[       OK ] AddressSanitizer.ThreadStackReuseTest (11 ms)
[ DISABLED ] AddressSanitizer.DISABLED_MemIntrinsicUnalignedAccessTest
[ DISABLED ] AddressSanitizer.DISABLED_LargeFunctionSymbolizeTest
[ DISABLED ] AddressSanitizer.DISABLED_MallocFreeUnwindAndSymbolizeTest
[ RUN      ] AddressSanitizer.UseThenFreeThenUseTest
[       OK ] AddressSanitizer.UseThenFreeThenUseTest (119 ms)
[ RUN      ] AddressSanitizer.FileNameInGlobalReportTest
[       OK ] AddressSanitizer.FileNameInGlobalReportTest (131 ms)
[ DISABLED ] AddressSanitizer.DISABLED_StressStackReuseAndExceptionsTest
[ RUN      ] AddressSanitizer.MlockTest
[       OK ] AddressSanitizer.MlockTest (0 ms)
[ DISABLED ] AddressSanitizer.DISABLED_DemoThreadedTest
[ DISABLED ] AddressSanitizer.DISABLED_DemoStackTest
[ DISABLED ] AddressSanitizer.DISABLED_DemoThreadStackTest
[ DISABLED ] AddressSanitizer.DISABLED_DemoUAFLowIn
[ DISABLED ] AddressSanitizer.DISABLED_DemoUAFLowLeft
[ DISABLED ] AddressSanitizer.DISABLED_DemoUAFLowRight
[ DISABLED ] AddressSanitizer.DISABLED_DemoUAFHigh
[ DISABLED ] AddressSanitizer.DISABLED_DemoOOM
[ DISABLED ] AddressSanitizer.DISABLED_DemoDoubleFreeTest
[ DISABLED ] AddressSanitizer.DISABLED_DemoNullDerefTest
[ DISABLED ] AddressSanitizer.DISABLED_DemoFunctionStaticTest
[ DISABLED ] AddressSanitizer.DISABLED_DemoTooMuchMemoryTest
[ RUN      ] AddressSanitizer.LongDoubleNegativeTest
[       OK ] AddressSanitizer.LongDoubleNegativeTest (0 ms)
[----------] 19 tests from AddressSanitizer (28118 ms total)

[----------] Global test environment tear-down
[==========] 22 tests from 2 test suites ran. (28123 ms total)
[  PASSED  ] 22 tests.

  YOU HAVE 1 DISABLED TEST

Step 34 (run instrumented asan tests [aarch64/bluejay-userdebug/TQ3A.230805.001]) failure: run instrumented asan tests [aarch64/bluejay-userdebug/TQ3A.230805.001] (failure)
...
[ RUN      ] AddressSanitizer.HasFeatureAddressSanitizerTest
[       OK ] AddressSanitizer.HasFeatureAddressSanitizerTest (0 ms)
[ RUN      ] AddressSanitizer.CallocReturnsZeroMem
[       OK ] AddressSanitizer.CallocReturnsZeroMem (8 ms)
[ DISABLED ] AddressSanitizer.DISABLED_TSDTest
[ RUN      ] AddressSanitizer.IgnoreTest
[       OK ] AddressSanitizer.IgnoreTest (0 ms)
[ RUN      ] AddressSanitizer.SignalTest
[       OK ] AddressSanitizer.SignalTest (190 ms)
[ RUN      ] AddressSanitizer.ReallocTest
[       OK ] AddressSanitizer.ReallocTest (21 ms)
[ RUN      ] AddressSanitizer.WrongFreeTest
[       OK ] AddressSanitizer.WrongFreeTest (107 ms)
[ RUN      ] AddressSanitizer.LongJmpTest
[       OK ] AddressSanitizer.LongJmpTest (0 ms)
[ RUN      ] AddressSanitizer.ThreadStackReuseTest
[       OK ] AddressSanitizer.ThreadStackReuseTest (11 ms)
[ DISABLED ] AddressSanitizer.DISABLED_MemIntrinsicUnalignedAccessTest
[ DISABLED ] AddressSanitizer.DISABLED_LargeFunctionSymbolizeTest
[ DISABLED ] AddressSanitizer.DISABLED_MallocFreeUnwindAndSymbolizeTest
[ RUN      ] AddressSanitizer.UseThenFreeThenUseTest
[       OK ] AddressSanitizer.UseThenFreeThenUseTest (119 ms)
[ RUN      ] AddressSanitizer.FileNameInGlobalReportTest
[       OK ] AddressSanitizer.FileNameInGlobalReportTest (131 ms)
[ DISABLED ] AddressSanitizer.DISABLED_StressStackReuseAndExceptionsTest
[ RUN      ] AddressSanitizer.MlockTest
[       OK ] AddressSanitizer.MlockTest (0 ms)
[ DISABLED ] AddressSanitizer.DISABLED_DemoThreadedTest
[ DISABLED ] AddressSanitizer.DISABLED_DemoStackTest
[ DISABLED ] AddressSanitizer.DISABLED_DemoThreadStackTest
[ DISABLED ] AddressSanitizer.DISABLED_DemoUAFLowIn
[ DISABLED ] AddressSanitizer.DISABLED_DemoUAFLowLeft
[ DISABLED ] AddressSanitizer.DISABLED_DemoUAFLowRight
[ DISABLED ] AddressSanitizer.DISABLED_DemoUAFHigh
[ DISABLED ] AddressSanitizer.DISABLED_DemoOOM
[ DISABLED ] AddressSanitizer.DISABLED_DemoDoubleFreeTest
[ DISABLED ] AddressSanitizer.DISABLED_DemoNullDerefTest
[ DISABLED ] AddressSanitizer.DISABLED_DemoFunctionStaticTest
[ DISABLED ] AddressSanitizer.DISABLED_DemoTooMuchMemoryTest
[ RUN      ] AddressSanitizer.LongDoubleNegativeTest
[       OK ] AddressSanitizer.LongDoubleNegativeTest (0 ms)
[----------] 19 tests from AddressSanitizer (28118 ms total)

[----------] Global test environment tear-down
[==========] 22 tests from 2 test suites ran. (28123 ms total)
[  PASSED  ] 22 tests.

  YOU HAVE 1 DISABLED TEST
program finished with exit code 0
elapsedTime=2343.102972

@dakersnar
Copy link
Contributor Author

I don't think the above is caused by me?

honeygoyal pushed a commit to honeygoyal/llvm-project that referenced this pull request Dec 9, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants