
[NVPTX] Teach NVPTX about predicates #67468

Open · wants to merge 1 commit into base: main

Conversation
Conversation

@ldrumm ldrumm (Contributor) commented Sep 26, 2023

PTX is fully predicated[1], and Maxwell through Ampere take predicate registers at the ISA level[2]. However, we have not been utilizing this in LLVM beyond manually specifying branch predicates for the CBranch instruction. As mentioned in [1], all PTX instructions can be predicated, and there are two forms of a predicated instruction in use:

@<predicate_reg>
@!<predicate_reg>

The first form enables the instruction if <predicate_reg> is nonzero, the second if it is zero.
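As a minimal illustrative sketch of both forms in PTX (register names here are assumptions for illustration, not taken from the patch):

```asm
setp.ne.s64   %p1, %rd1, 0;    // %p1 = (%rd1 != 0)
@%p1  add.s64 %rd3, %rd1, 1;   // executes only when %p1 is true
@!%p1 mov.s64 %rd3, 0;         // executes only when %p1 is false
```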

In this part-the-first, we add such two-part predicates to the NVPTX backend: a predicate register operand, which defaults to $nopred, i.e. always-true, and a predicate inversion "switch" for inverting the condition, which defaults to zero.

e.g.

 ADDi64ri %0, 1, $noreg, 0

is unpredicated, but

%2:int1regs = IMPLICIT_DEF
 ADDi64ri %0, 1, %2, 0

is predicated on %2, e.g.

 @%p1 add.s64 %rd3, %rd1, 1;

Finally:

%2:int1regs = IMPLICIT_DEF
StoreRetvalI64 %4, 0, %2, 1

is the "inverted version" e.g.

 @!%p1 add.s64   %rd3, %rd1, 1;

In each case the last two machine operands are the predicate register and the inversion switch; in the unpredicated form they take the defaults: the "no predicate" register and the "uninverted" (zero) switch.

The changes here are logically fairly minimal and do not affect the generated code much, but they add the machinery for better optimization opportunities, such as the if-conversion I'm working on. Also missing here are some useful target hooks, which I'll add in due course:

  • getPredicationCost
  • optimizeCondBranch
  • reverseBranchCondition
  • isPredicated et al.

I also intend to add more AllowModify cases to analyzeBranch to enable better machine block placement and other generic machine optimizations.

Since the branching logic is significantly affected here, I've renamed the branch instructions to make it clear that their implementation has changed.

[1]: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#ptx-instructions
[2]: https://docs.nvidia.com/cuda/pdf/CUDA_Binary_Utilities.pdf Chapter 6

@ldrumm ldrumm added enhancement Improving things as opposed to bug fixing, e.g. new or missing feature backend:NVPTX labels Sep 26, 2023
@Artem-B Artem-B (Member) left a comment

The MI part of the NVPTX back-end is largely unfamiliar territory for me, so we'll need a reviewer with better MI familiarity than mine.

Meanwhile, I have a few general questions.

Maxwell through Ampere take predicate registers at the ISA level[2]

That document does mention that the GPUs do have predicate registers. It does not necessarily imply that using them to predicate individual instructions is free. Perhaps I'm missing something. Can you elaborate on why you think wider use of predicates will likely be beneficial?

I can see how it may help getting rid of tons of small jumps LLVM tends to generate now and then, but I'm somewhat skeptical that it will have much of a practical impact as most of those cases can be easily optimized into predicated execution by ptxas, if it deems it beneficial.

It would help if you could demonstrate on a bit of PTX how using predicates instead of branches generates better SASS for GPU X. Bonus points for examining SASS for a range of GPUs, so we may have a better idea whether predication should be applied across the board, or only on particular GPU variants.
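For concreteness, the kind of comparison being requested might look like the following hand-written sketch for a simple min-style select (register names and labels are illustrative assumptions, not output of either compiler):

```asm
// Branchy lowering, roughly what LLVM emits today for r3 = min(r1, r2):
        setp.lt.s32  %p1, %r1, %r2;
        @%p1 bra     $L_then;
        mov.s32      %r3, %r2;
        bra.uni      $L_done;
$L_then:
        mov.s32      %r3, %r1;
$L_done:
        ret;

// Predicated equivalent this patch would make expressible:
        setp.lt.s32  %p1, %r1, %r2;
        mov.s32      %r3, %r2;
        @%p1 mov.s32 %r3, %r1;
        ret;
```

Comparing the SASS that ptxas produces for each form, per target GPU, would show whether the explicit predication actually changes the final code.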

If wider use of predication gives us no measurable improvements in generated SASS, there's probably no point complicating things.

@@ -2095,6 +2168,8 @@ bool NVPTXDAGToDAGISel::tryStoreRetval(SDNode *N) {
for (unsigned i = 0; i < NumElts; ++i)
Ops.push_back(N->getOperand(i + 2));
Ops.push_back(CurDAG->getTargetConstant(OffsetVal, DL, MVT::i32));
Ops.push_back(PRED(0));
Member:

Nit: we could fold all those push_back into Ops.append({....})

Contributor (Author):

done

field bits<14> Inst;

let Namespace = "NVPTX";
dag OutOperandList = outs;
dag InOperandList = ins;
let AsmString = asmstr;
dag InOperandList = !if(!and(defaultPreds, isPredicable), !con(
Member:

Please reformat for easier reading. E.g. something like this may work:

!if(<condition>,
   <if-true>,
   <if-false>)

Contributor (Author):

done and for AsmStr

@ldrumm ldrumm (Contributor, Author) commented Oct 3, 2023

@Artem-B Sorry for the delay.

Your questions are definitely valid, and I'm investigating some cases reported to me where divergent branches are killing performance. I should be able to have some conclusions on that soon for several different generations.

Even before I'm fully confident in those data, I think it's safe to make a few observations:

  • I don't think this complicates the NVPTX backend much. There are a couple of places where we need to manually add the default predicates for custom DAG-to-DAG lowering, but for the most part the tablegen changes mean that LLVM will do the right thing with default predicates and the inversion switch.
  • We don't know what ptxas can reliably do regarding if-conversion et al, and having the ability to control this in llvm gives us more power to affect code generation.
  • I have seen ptxas if-convert very trivial cases in the wild, but LLVM has more information and can better reason about the control-flow graph.
  • In my testing I've seen that divergent control flow can still be very expensive. NVIDIA's marketing documentation for Ampere suggests the hardware can now eliminate most of this, but the issues I'm looking at indicate that this is for simple cases only. Old hardware still suffers.
  • I'm guessing PTX exposes generalized predicates for a reason. Not adding them limits what we can do in the backend.

@Artem-B Artem-B (Member) commented Oct 3, 2023

  • We don't know what ptxas can reliably do regarding if-conversion et al, and having the ability to control this in llvm gives us more power to affect code generation.

We can certainly observe what it does. You can dump SASS with cuobjdump and nvdisasm and the latter can even conveniently produce the control flow graph.
All I'm saying is that I have yet to see a practical case where the performance of LLVM-produced code was suboptimal due to predication vs. branches. I do regularly get to poke at various issues with LLVM-generated code, and predication/jumps are never the culprit, except for the ancient ptxas bug with thread mis-convergence (https://bugs.llvm.org/show_bug.cgi?id=27738).

  • I have seen ptxas if-convert very trivial cases in the wild, but llvm will have more information and can better reason about the control flow graph because it has more information.

Agreed that LLVM has more info. Do you have examples where ptxas should've used predicated execution but didn't?

  • In my testing I've seen that divergent control flow can still be very expensive.

Yes, divergent execution is expensive.

Yet, predication is not a universal win, either. For large enough branches predication will not solve the problem, and for small branches ptxas may already be doing a good enough job.

For what it's worth, NVCC appears to prefer generating jumps, but SASS ends up using predication: https://godbolt.org/z/4f5Phd4xq

I think it's fairly safe to assume that NVIDIA would be very interested in squeezing as much performance out of the GPUs as they can. The fact that NVCC is rather conspicuously not using predicates in PTX, even for such an obvious case as a ternary operator, suggests that there may be a good reason for it. I'll ask them.

NVIDIA's marketing documentation for Ampere suggests the hardware can now eliminate most of this, but the issues I'm looking at indicate that this is for simple cases only. Old hardware still suffers.

Can you point me to more details? I'm not sure I understand what you have in mind by ampere eliminating divergent branches. IIRC, Ampere allowed concurrent execution of all divergent branches (previously divergent branches ran sequentially) and thus guaranteeing progress, which was impossible on older GPUs, but I don't think it removes the concept of branch divergence.

  • I'm guessing PTX exposes generalized predicates for a reason.

That remains to be seen. Switching to predicates just because PTX syntax allows them is not a very compelling argument, by itself.

Not adding them limits what we can do in the backend.

Can you be more specific about what you need to do in the back-end that can't be done without predication?

Just to be clear -- I'm not against the patch. Being able to use predicates may potentially be useful. However, it appears to be a fundamentally invasive change (both to NVPTX back-end, and to the PTX we'll generate, with potential unforeseen consequences) and I want to have a better idea of what problems it solves, what it buys us and whether the benefits outweigh the downsides.

@Artem-B Artem-B (Member) commented Oct 6, 2023

I've asked NVIDIA's compiler folks about jumps vs predicates, and they say that jumps win:

Q: [jumps vs. predicates]
A: The short answer is that it is better to generate branches. Current PTXAS is more trained to do that. We did some work where we did control-dependence-based analysis, put masks on various PTX instructions, and emitted a linearized sequence of PTX instructions, and it didn't work very well. There were a few issues:

  1. the register allocator slowed down due to predicated instructions, and didn't do as good a job as traditional register allocation
  2. PTXAS unpredicated some cases and re-introduced control flow
  3. handling of loops with warp-divergent loop tests was problematic; predicated branches :(

@justinfargnoli justinfargnoli self-requested a review March 12, 2024 22:24