
[NVPTX] Teach NVPTX about predicates #67468

Open · wants to merge 1 commit into base: main

Conversation
Conversation

@ldrumm ldrumm (Contributor) commented Sep 26, 2023

PTX is fully predicated[1], and Maxwell through Ampere take predicate registers at the ISA level[2]. However, we have not been utilizing this in LLVM beyond manually specifying branch predicates for the CBranch instruction. As mentioned in [1], all PTX instructions can be predicated, and there are two forms of a predicated instruction in use:

@<predicate_reg>
@!<predicate_reg>

The first form enables the instruction if <predicate_reg> is nonzero, the second if it is zero.
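As a minimal illustrative sketch of both forms in PTX (register names here are assumptions for illustration, not taken from the patch):

```asm
setp.ne.s64   %p1, %rd1, 0;    // %p1 = (%rd1 != 0)
@%p1  add.s64 %rd3, %rd1, 1;   // executes only when %p1 is true
@!%p1 mov.s64 %rd3, 0;         // executes only when %p1 is false
```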

In this part-the-first, we add such two-part predicates to the NVPTX backend: a predicate register operand, which defaults to $nopred, i.e. always-true, and a predicate inversion "switch" for inverting the condition, which defaults to zero.

e.g.

 ADDi64ri %0, 1, $noreg, 0

is unpredicated, but

%2:int1regs = IMPLICIT_DEF
 ADDi64ri %0, 1, %2, 0

is predicated on %2, e.g.

 @%p1 add.s64 %rd3, %rd1, 1;

Finally:

%2:int1regs = IMPLICIT_DEF
StoreRetvalI64 %4, 0, %2, 1

is the "inverted version" e.g.

 @!%p1 add.s64   %rd3, %rd1, 1;

In each case the last two machine operands are the predicate register and the inversion switch; in the unpredicated form they take the defaults: the "no predicate" register and the "uninverted" (zero) switch.

The changes here are logically fairly minimal and do not affect the generated code much, but they add the machinery for better optimization opportunities, such as the if-conversion I'm working on. Also missing here are some useful target hooks, which I'll add in due course:

  • getPredicationCost
  • optimizeCondBranch
  • reverseBranchCondition
  • isPredicated et al.

I also intend to add more AllowModify cases to analyzeBranch to enable better machine block placement and other generic machine optimizations.

Since the branching logic is significantly affected here, I've renamed the branch instructions to make it clear that their implementation has changed.

[1]: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#ptx-instructions
[2]: https://docs.nvidia.com/cuda/pdf/CUDA_Binary_Utilities.pdf Chapter 6

@ldrumm ldrumm added enhancement Improving things as opposed to bug fixing, e.g. new or missing feature backend:NVPTX labels Sep 26, 2023
@Artem-B Artem-B (Member) left a comment

The MI part of the NVPTX back-end is largely unfamiliar territory for me, so we'll need a reviewer with better MI familiarity than mine.

Meanwhile, I have a few general questions.

Maxwell through Ampere take predicate registers at the ISA level[2]

That document does mention that the GPUs do have predicate registers. It does not necessarily imply that using them to predicate individual instructions is free. Perhaps I'm missing something. Can you elaborate on why you think wider use of predicates will likely be beneficial?

I can see how it may help getting rid of tons of small jumps LLVM tends to generate now and then, but I'm somewhat skeptical that it will have much of a practical impact as most of those cases can be easily optimized into predicated execution by ptxas, if it deems it beneficial.

It would help if you could demonstrate on a bit of PTX how using predicates instead of branches generates better SASS for GPU X. Bonus points for examining SASS for a range of GPUs, so we may have a better idea whether predication should be applied across the board, or only on particular GPU variants.
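For concreteness, the kind of comparison being requested might look like the following hand-written sketch for a simple min-style select (register names and labels are illustrative assumptions, not output of either compiler):

```asm
// Branchy lowering, roughly what LLVM emits today for r3 = min(r1, r2):
        setp.lt.s32  %p1, %r1, %r2;
        @%p1 bra     $L_then;
        mov.s32      %r3, %r2;
        bra.uni      $L_done;
$L_then:
        mov.s32      %r3, %r1;
$L_done:
        ret;

// Predicated equivalent this patch would make expressible:
        setp.lt.s32  %p1, %r1, %r2;
        mov.s32      %r3, %r2;
        @%p1 mov.s32 %r3, %r1;
        ret;
```

Comparing the SASS that ptxas produces for each form, per target GPU, would show whether the explicit predication actually changes the final code.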

If wider use of predication gives us no measurable improvements in generated SASS, there's probably no point complicating things.

@@ -2095,6 +2168,8 @@ bool NVPTXDAGToDAGISel::tryStoreRetval(SDNode *N) {
for (unsigned i = 0; i < NumElts; ++i)
Ops.push_back(N->getOperand(i + 2));
Ops.push_back(CurDAG->getTargetConstant(OffsetVal, DL, MVT::i32));
Ops.push_back(PRED(0));
Member:

Nit: we could fold all those push_back into Ops.append({....})

Contributor (Author):

done

field bits<14> Inst;

let Namespace = "NVPTX";
dag OutOperandList = outs;
dag InOperandList = ins;
let AsmString = asmstr;
dag InOperandList = !if(!and(defaultPreds, isPredicable), !con(
Member:

Please reformat for easier reading. E.g. something like this may work:

!if(<condition>,
   <if-true>,
   <if-false>)

Contributor (Author):

done and for AsmStr

@ldrumm ldrumm (Contributor, Author) commented Oct 3, 2023

@Artem-B Sorry for the delay.

Your questions are definitely valid, and I'm investigating some cases reported to me where divergent branches are killing performance. I should be able to have some conclusions on that soon for several different generations.

Even before I'm fully confident in those data, I think it's safe to make a few observations:

  • I don't think this complicates the NVPTX backend much. There are a couple of places where we need to manually add the default predicates for custom DAG-to-DAG lowering, but for the most part the tablegen changes mean that LLVM will do the right thing with default predicates and the inversion switch.
  • We don't know what ptxas can reliably do regarding if-conversion et al, and having the ability to control this in llvm gives us more power to affect code generation.
  • I have seen ptxas if-convert very trivial cases in the wild, but LLVM has more information and can better reason about the control-flow graph.
  • In my testing I've seen that divergent control flow can still be very expensive. NVIDIA's marketing documentation for Ampere suggests the hardware can now eliminate most of this, but the issues I'm looking at indicate that this is for simple cases only. Old hardware still suffers.
  • I'm guessing PTX exposes generalized predicates for a reason. Not adding them limits what we can do in the backend.

@Artem-B Artem-B (Member) commented Oct 3, 2023

  • We don't know what ptxas can reliably do regarding if-conversion et al, and having the ability to control this in llvm gives us more power to affect code generation.

We can certainly observe what it does. You can dump SASS with cuobjdump and nvdisasm and the latter can even conveniently produce the control flow graph.
All I'm saying is that I have yet to see a practical case where the performance of LLVM-produced code was suboptimal due to predication vs. branches. I do regularly get to poke at various issues with LLVM-generated code, and predication/jumps are never the culprit, except for the ancient ptxas bug with thread mis-convergence (https://bugs.llvm.org/show_bug.cgi?id=27738).

  • I have seen ptxas if-convert very trivial cases in the wild, but llvm will have more information and can better reason about the control flow graph because it has more information.

Agreed that LLVM has more info. Do you have examples where ptxas should've used predicated execution but didn't?

  • In my testing I've seen that divergent control flow can still be very expensive.

Yes, divergent execution is expensive.

Yet, predication is not a universal win, either. For large enough branches predication will not solve the problem, and for small branches ptxas may already be doing a good enough job.

For what it's worth, NVCC appears to prefer generating jumps, but SASS ends up using predication: https://godbolt.org/z/4f5Phd4xq

I think it's fairly safe to assume that NVIDIA would be very interested in squeezing as much performance out of the GPUs as they can. The fact that NVCC is rather conspicuously not using predicates in PTX, even for such an obvious case as a ternary operator, suggests that there may be a good reason for it. I'll ask them.

NVIDIA's marketing documentation for Ampere suggests the hardware can now eliminate most of this, but the issues I'm looking at indicate that this is for simple cases only. Old hardware still suffers.

Can you point me to more details? I'm not sure I understand what you have in mind by ampere eliminating divergent branches. IIRC, Ampere allowed concurrent execution of all divergent branches (previously divergent branches ran sequentially) and thus guaranteeing progress, which was impossible on older GPUs, but I don't think it removes the concept of branch divergence.

  • I'm guessing PTX exposes generalized predicates for a reason.

That remains to be seen. Switching to predicates just because PTX syntax allows them is not a very compelling argument, by itself.

Not adding them limits what we can do in the backend.

Can you be more specific about what you need to do in the back-end that can't be done without predication?

Just to be clear -- I'm not against the patch. Being able to use predicates may potentially be useful. However, it appears to be a fundamentally invasive change (both to NVPTX back-end, and to the PTX we'll generate, with potential unforeseen consequences) and I want to have a better idea of what problems it solves, what it buys us and whether the benefits outweigh the downsides.

@Artem-B Artem-B (Member) commented Oct 6, 2023

I've asked NVIDIA's compiler folks about jumps vs predicates, and they say that jumps win:

Q: [jumps vs. predicates]
A: The short answer is that it is better to generate branches. Current PTXAS is more trained to do that. We did some work where we did control-dependence-based analysis, put masks on various PTX instructions, and emitted a linearized sequence of PTX instructions, and it didn't work very well. There were a few issues:

  1. the register allocator slowed down due to predicated instructions, and didn't do as good a job as traditional register allocation
  2. PTXAS unpredicated some cases and re-introduced control flow
  3. handling of loops with warp-divergent loop tests was problematic; predicated branches :(

@justinfargnoli justinfargnoli self-requested a review March 12, 2024 22:24