[NVPTX] Teach NVPTX about predicates #67468
Conversation
MI part of NVPTX back-end is largely unfamiliar territory for me, so we'll need someone who has better familiarity with MI than myself.
Meanwhile, I have a few general questions.
Maxwell through Ampere take predicate registers at the ISA level[2]
That document does mention that the GPUs do have predicate registers. It does not necessarily imply that using them to predicate individual instructions is free. Perhaps I'm missing something. Can you elaborate on why you think wider use of predicates will likely be beneficial?
I can see how it may help getting rid of tons of small jumps LLVM tends to generate now and then, but I'm somewhat skeptical that it will have much of a practical impact as most of those cases can be easily optimized into predicated execution by ptxas, if it deems it beneficial.
It would help if you could demonstrate on a bit of PTX how using predicates instead of branches generates better SASS for GPU X. Bonus points for examining SASS for a range of GPUs, so we may have a better idea whether predication should be applied across the board, or only on particular GPU variants.
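For concreteness, the kind of side-by-side being asked for might look like the following (hypothetical, hand-written PTX for illustration only, not measured compiler output). The branchy form guards a single add with a jump; the predicated form guards it with an inverted predicate:

```asm
// branchy form: skip the add when %r1 >= 0
    setp.lt.s32  %p1, %r1, 0;
    @%p1 bra     $L_skip;
    add.s32      %r2, %r2, 1;
$L_skip:

// predicated form: same semantics, no branch
    setp.lt.s32  %p1, %r1, 0;
    @!%p1 add.s32 %r2, %r2, 1;
```

The interesting question is which of these ptxas turns into better SASS on each GPU generation.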
If wider use of predication gives us no measurable improvements in generated SASS, there's probably no point complicating things.
```diff
@@ -2095,6 +2168,8 @@ bool NVPTXDAGToDAGISel::tryStoreRetval(SDNode *N) {
   for (unsigned i = 0; i < NumElts; ++i)
     Ops.push_back(N->getOperand(i + 2));
   Ops.push_back(CurDAG->getTargetConstant(OffsetVal, DL, MVT::i32));
+  Ops.push_back(PRED(0));
```
Nit: we could fold all those `push_back` calls into `Ops.append({...})`.
done
```tablegen
field bits<14> Inst;

let Namespace = "NVPTX";
dag OutOperandList = outs;
dag InOperandList = ins;
let AsmString = asmstr;
dag InOperandList = !if(!and(defaultPreds, isPredicable), !con(
```
Please reformat for easier reading. E.g. something like this may work:

```tablegen
!if(<condition>,
    <if-true>,
    <if-false>)
```
Done, and for `AsmString` as well.
PTX is fully predicated[1], and Maxwell through Ampere take predicate registers at the ISA level[2]. However, we've not been utilizing this in LLVM, only manually specifying branch predicates for the `CBranch` instruction. As mentioned in [1], all PTX instructions can be predicated, and there are two forms of a predicated instruction in use:

```asm
@<predicate_reg>
@!<predicate_reg>
```

The first form enables the instruction if `<predicate_reg>` is nonzero, the second if it is zero.

In this part-the-first, we add such two-part predicates to the NVPTX backend: a predicate register operand which defaults to `$nopred` (i.e. always-true), and a predicate inversion "switch" for inverting the condition, which defaults to zero. E.g.

```
ADDi64ri %0, 1, $noreg, 0
```

is unpredicated, but

```
%2:int1regs = IMPLICIT_DEF
ADDi64ri %0, 1, %2, 0
```

is predicated on `%2`, e.g.

```asm
@%p1 add.s64 %rd3, %rd1, 1;
```

Finally:

```
%2:int1regs = IMPLICIT_DEF
StoreRetvalI64 %4, 0, %2, 1
```

is the "inverted" version, e.g.

```asm
@!%p1 add.s64 %rd3, %rd1, 1;
```

In each case the last two MOs are the predicate register (defaulting to "no predicate") and the inversion switch (defaulting to "uninverted").

The changes here are logically fairly minimal, not really affecting the generated code that much, but they add the machinery for better optimization opportunities, such as the if-conversion I'm working on. Also missing here are some useful target hooks which I'll add in due course:

- getPredicationCost
- optimizeCondBranch
- reverseBranchCondition
- isPredicated et al.

I also intend to add more `AllowModify` cases to `analyzeBranch` to enable better machine block placement and other generic machine optimizations. Since the branching logic is significantly affected here, I've renamed the branch instructions to make it clear their implementation has changed.

[1]: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#ptx-instructions
[2]: https://docs.nvidia.com/cuda/pdf/CUDA_Binary_Utilities.pdf, Chapter 6
Force-pushed from 478c2d0 to 9849e80.
@Artem-B Sorry for the delay. Your questions are definitely valid, and I'm investigating some cases reported to me where divergent branches are killing performance. I should be able to have some conclusions on that soon for several different generations. Before I'm confident in these data, I think it's safe to make a few observations:
We can certainly observe what it does. You can dump SASS with cuobjdump and nvdisasm and the latter can even conveniently produce the control flow graph.
Agreed that LLVM has more info. Do you have examples where ptxas should've used predicated execution but didn't?
Yes, divergent execution is expensive. Yet, predication is not a universal win, either. For large enough branches predication will not solve the problem, and for small branches ptxas may already be doing a good enough job. For what it's worth, NVCC appears to prefer generating jumps, but SASS ends up using predication: https://godbolt.org/z/4f5Phd4xq I think it's fairly safe to assume that NVIDIA would be very interested in squeezing as much performance out of the GPUs as they can. The fact that NVCC is rather conspicuously not using predicates in PTX, even for such an obvious case as a ternary operator, suggests that there may be a good reason for it. I'll ask them.
Can you point me to more details? I'm not sure I understand what you have in mind by Ampere eliminating divergent branches. IIRC, Ampere allowed concurrent execution of all divergent branches (previously divergent branches ran sequentially), thus guaranteeing forward progress, which was impossible on older GPUs, but I don't think it removes the concept of branch divergence.
That remains to be seen. Switching to predicates just because PTX syntax allows them is not a very compelling argument, by itself.
Can you be more specific about what you need to do in the back-end that can't be done without predication? Just to be clear -- I'm not against the patch. Being able to use predicates may potentially be useful. However, it appears to be a fundamentally invasive change (both to the NVPTX back-end, and to the PTX we'll generate, with potential unforeseen consequences) and I want to have a better idea of what problems it solves, what it buys us, and whether the benefits outweigh the downsides.
I've asked NVIDIA's compiler folks about jumps vs predicates, and they say that jumps win: