
[AMDGPU] Form V_MAD_U64_U32 from mul24 #72393

Merged (4 commits, Dec 11, 2023)
Conversation

@Pierre-vh commented Nov 15, 2023

Fixes SWDEV-421067

@llvmbot commented Nov 15, 2023

@llvm/pr-subscribers-llvm-ir

@llvm/pr-subscribers-backend-amdgpu

Author: Pierre van Houtryve (Pierre-vh)

Changes

CGP can transform a fine mul+add into a (mul24/mulhi24)+add, so add a pattern for that.

See SWDEV-421067
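
For reference, the IR shape in question looks roughly like this (a minimal sketch mirroring the test added in this patch; the function and value names are illustrative):

declare i32 @llvm.amdgcn.mul.u24(i32, i32)
declare i32 @llvm.amdgcn.mulhi.u24(i32, i32)

; Shape produced by amdgpu-codegenprepare-mul24 for a 24-bit multiply feeding
; a 64-bit add; the new pattern folds this back into a single v_mad_u64_u32.
define i64 @mul24_add(i32 %x, i32 %y, i64 %z) {
  %lo = call i32 @llvm.amdgcn.mul.u24(i32 %x, i32 %y)
  %hi = call i32 @llvm.amdgcn.mulhi.u24(i32 %x, i32 %y)
  %lo.zext = zext i32 %lo to i64
  %hi.zext = zext i32 %hi to i64
  %hi.shift = shl i64 %hi.zext, 32
  %mul = or i64 %hi.shift, %lo.zext
  %add = add i64 %mul, %z
  ret i64 %add
}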


Full diff: https://github.com/llvm/llvm-project/pull/72393.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/VOP3Instructions.td (+9)
  • (modified) llvm/test/CodeGen/AMDGPU/integer-mad-patterns.ll (+63)
diff --git a/llvm/lib/Target/AMDGPU/VOP3Instructions.td b/llvm/lib/Target/AMDGPU/VOP3Instructions.td
index 114d33b077866a1..06856b03a508124 100644
--- a/llvm/lib/Target/AMDGPU/VOP3Instructions.td
+++ b/llvm/lib/Target/AMDGPU/VOP3Instructions.td
@@ -676,6 +676,15 @@ multiclass IMAD32_Pats <VOP3_Pseudo inst> {
         (ThreeOpFragSDAG<mul, add> i32:$src0, i32:$src1, (i32 imm:$src2)),
         (EXTRACT_SUBREG (inst $src0, $src1, (i64 (as_i64imm $src2)), 0 /* clamp */), sub0)
         >;
+
+  // Handle cases where amdgpu-codegenprepare-mul24 made a mul24 instead of a normal mul.
+  def : GCNPat <
+      (i64 (add (bitconvert (v2i32 (build_vector
+                                     (AMDGPUmul_u24 i32:$src0, i32:$src1),
+                                     (AMDGPUmulhi_u24 i32:$src0, i32:$src1)))),
+                i64:$src2)),
+      (inst $src0, $src1, $src2, 0 /* clamp */)
+      >;
 }
 
 // exclude pre-GFX9 where it was slow
diff --git a/llvm/test/CodeGen/AMDGPU/integer-mad-patterns.ll b/llvm/test/CodeGen/AMDGPU/integer-mad-patterns.ll
index 61017e809c86365..56b6fef1b82255c 100644
--- a/llvm/test/CodeGen/AMDGPU/integer-mad-patterns.ll
+++ b/llvm/test/CodeGen/AMDGPU/integer-mad-patterns.ll
@@ -6928,6 +6928,69 @@ entry:
   ret <2 x i16> %add0
 }
 
+define i64 @mul_mulhi_u24(i32 %x, i32 %y, i64 %z) {
+; GFX67-LABEL: mul_mulhi_u24:
+; GFX67:       ; %bb.0:
+; GFX67-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX67-NEXT:    v_mul_u32_u24_e32 v4, v0, v1
+; GFX67-NEXT:    v_mul_hi_u32_u24_e32 v1, v0, v1
+; GFX67-NEXT:    v_add_i32_e32 v0, vcc, v4, v2
+; GFX67-NEXT:    v_addc_u32_e32 v1, vcc, v1, v3, vcc
+; GFX67-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX8-LABEL: mul_mulhi_u24:
+; GFX8:       ; %bb.0:
+; GFX8-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX8-NEXT:    v_mul_u32_u24_e32 v4, v0, v1
+; GFX8-NEXT:    v_mul_hi_u32_u24_e32 v1, v0, v1
+; GFX8-NEXT:    v_add_u32_e32 v0, vcc, v4, v2
+; GFX8-NEXT:    v_addc_u32_e32 v1, vcc, v1, v3, vcc
+; GFX8-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX9-SDAG-LABEL: mul_mulhi_u24:
+; GFX9-SDAG:       ; %bb.0:
+; GFX9-SDAG-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX9-SDAG-NEXT:    v_mad_u64_u32 v[0:1], s[4:5], v0, v1, v[2:3]
+; GFX9-SDAG-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX9-GISEL-LABEL: mul_mulhi_u24:
+; GFX9-GISEL:       ; %bb.0:
+; GFX9-GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX9-GISEL-NEXT:    v_mul_u32_u24_e32 v4, v0, v1
+; GFX9-GISEL-NEXT:    v_mul_hi_u32_u24_e32 v1, v0, v1
+; GFX9-GISEL-NEXT:    v_add_co_u32_e32 v0, vcc, v4, v2
+; GFX9-GISEL-NEXT:    v_addc_co_u32_e32 v1, vcc, v1, v3, vcc
+; GFX9-GISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX10-SDAG-LABEL: mul_mulhi_u24:
+; GFX10-SDAG:       ; %bb.0:
+; GFX10-SDAG-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX10-SDAG-NEXT:    v_mad_u64_u32 v[0:1], null, v0, v1, v[2:3]
+; GFX10-SDAG-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX10-GISEL-LABEL: mul_mulhi_u24:
+; GFX10-GISEL:       ; %bb.0:
+; GFX10-GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX10-GISEL-NEXT:    v_mul_u32_u24_e32 v4, v0, v1
+; GFX10-GISEL-NEXT:    v_mul_hi_u32_u24_e32 v1, v0, v1
+; GFX10-GISEL-NEXT:    v_add_co_u32 v0, vcc_lo, v4, v2
+; GFX10-GISEL-NEXT:    v_add_co_ci_u32_e32 v1, vcc_lo, v1, v3, vcc_lo
+; GFX10-GISEL-NEXT:    s_setpc_b64 s[30:31]
+  %mul = call i32 @llvm.amdgcn.mul.u24(i32 %x, i32 %y)
+  %mulhi = call i32 @llvm.amdgcn.mulhi.u24(i32 %x, i32 %y)
+  %mul.zext = zext i32 %mul to i64
+  %mulhi.zext = zext i32 %mulhi to i64
+  %mulhi.shift = shl i64 %mulhi.zext, 32
+  %mul.mulhi = or i64 %mulhi.shift, %mul.zext
+  %add = add nuw nsw i64 %mul.mulhi, %z
+  ret i64 %add
+}
+
+declare i32 @llvm.amdgcn.mul.u24(i32, i32)
+declare i32 @llvm.amdgcn.mulhi.u24(i32, i32)
+declare i32 @llvm.amdgcn.mul.i24(i32, i32)
+declare i32 @llvm.amdgcn.mulhi.i24(i32, i32)
+
 ;; NOTE: These prefixes are unused and the list is autogenerated. Do not add tests below this line:
 ; GFX6: {{.*}}
 ; GFX7: {{.*}}

@jayfoad commented Nov 15, 2023

CGP can transform a fine mul+add into a (mul24/mulhi24)+add, so add a pattern for that.

Typo "fine"? Not sure what you meant.

This would depend on the relative rate of mul_u24 vs mad_u64. On older ASICs, mad_u64 is "quarter rate" so two fast mul_u24 instructions should be faster. I see that gfx90a uses SIDPFullSpeedModel so mad_u64 is as fast as mul_u24.

In any case, would it be better to teach CGP not to do the harmful transformation in the first place, rather than work around it in isel?

@Pierre-vh (author)

CGP can transform a fine mul+add into a (mul24/mulhi24)+add, so add a pattern for that.

Typo "fine"? Not sure what you meant.

This would depend on the relative rate of mul_u24 vs mad_u64. On older ASICs, mad_u64 is "quarter rate" so two fast mul_u24 instructions should be faster. I see that gfx90a uses SIDPFullSpeedModel so mad_u64 is as fast as mul_u24.

In any case, would it be better to teach CGP not to do the harmful transformation in the first place, rather than work around it in isel?

By fine I meant "fine to transform into v_mad" but I didn't know about FullSpeed/QuarterSpeed - I edited it.
Should I add a predicate on this to only do the transform on FullSpeedModels then?

About doing this in CGP, I asked @arsenm earlier and he suggested to fix it in ISel rather than teach CGP. I tend to agree - CGP doesn't always have the full picture/full knowledge of what the DAG can do, and it may be non-obvious to fix in CGP. Having a new pattern is simpler and more stable, IMO

@jayfoad commented Nov 16, 2023

About doing this in CGP, I asked @arsenm earlier and he suggested to fix it in ISel rather than teach CGP. I tend to agree - CGP doesn't always have the full picture/full knowledge of what the DAG can do, and it may be non-obvious to fix in CGP. Having a new pattern is simpler and more stable, IMO

You might both be right (I'm not super familiar with CGP) but I don't really understand why. On a "full rate doubles" machine (i.e. SIDPFullSpeedModel or SIDPGFX940FullSpeedModel), 24-bit multiplies are no faster than 32-bit, so why would CGP need to do anything at all?

@arsenm commented Nov 17, 2023

You might both be right (I'm not super familiar with CGP) but I don't really understand why. On a "full rate doubles" machine (i.e. SIDPFullSpeedModel or SIDPGFX940FullSpeedModel), 24-bit multiplies are no faster than 32-bit, so why would CGP need to do anything at all?

I believe they also use less power. Plus, it's better not to depend on the specific speed model; the more common-looking the ISA, the better.

@jayfoad commented Nov 20, 2023

For background: my understanding is that the only reason CGP needs to help select 24-bit multiplies is that you can't do it during instruction selection because there is no way for a target to simplify a generic opcode like ISD::MUL based on demanded bits information. See https://discourse.llvm.org/t/selectiondag-target-specific-simplification-of-generic-nodes-using-demanded-bits/56747/1

With the current patch I still don't like the fact that CGP will split a 64-bit multiply and expect isel to combine it back into a single instruction later. How about this for a cleaner design (a rough IR sketch follows the list):

  • Change llvm.amdgcn.mul.i24 and llvm.amdgcn.mul.u24 to work for any scalar integer type, not just i32. So they are like a regular multiply but only look at the low 24 bits of their inputs. (Or the result is undefined if the inputs are not signed/unsigned 24-bit values - I'm not sure if that is a more useful definition?)
  • Then CGP can convert mul to llvm.amdgcn.mul.i24/u24 where possible with no need to split up 64-bit multiplies.
  • Isel patterns with suitable predicates based on the speed model can choose to implement a 64-bit i24/u24 multiply as either a pair of 24-bit multiplies, or a single 64-bit mad.
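
A minimal IR sketch of the first bullet, assuming the intrinsic becomes overloaded on its result type (the .i64 mangling, function name, and values below are hypothetical, not something this patch defines):

; Hypothetical overloaded form: works on any scalar integer type, but only the
; low 24 bits of each input are significant.
declare i64 @llvm.amdgcn.mul.u24.i64(i64, i64)

define i64 @mul24_mad64(i64 %x, i64 %y, i64 %z) {
  ; CGP could emit this directly instead of splitting the multiply into
  ; mul24/mulhi24 plus zext/shl/or, ...
  %mul = call i64 @llvm.amdgcn.mul.u24.i64(i64 %x, i64 %y)
  ; ... and isel would then select either a single v_mad_u64_u32 or a
  ; mul24/mulhi24 pair for this multiply-add, depending on the speed model.
  %mad = add i64 %mul, %z
  ret i64 %mad
}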

@Pierre-vh (author)

For background: my understanding is that the only reason CGP needs to help select 24-bit multiplies is that you can't do it during instruction selection because there is no way for a target to simplify a generic opcode like ISD::MUL based on demanded bits information. See https://discourse.llvm.org/t/selectiondag-target-specific-simplification-of-generic-nodes-using-demanded-bits/56747/1

With the current patch I still don't like the fact that CGP will split a 64-bit multiply and expect isel to combine it back into a single instruction later. How about this for a cleaner design:

  • Change llvm.amdgcn.mul.i24 and llvm.amdgcn.mul.u24 to work for any scalar integer type, not just i32. So they are like a regular multiply but only look at the low 24 bits of their inputs. (Or the result is undefined if the inputs are not signed/unsigned 24-bit values - I'm not sure if that is a more useful definition?)
  • Then CGP can convert mul to llvm.amdgcn.mul.i24/u24 where possible with no need to split up 64-bit multiplies.
  • Isel patterns with suitable predicates based on the speed model can choose to implement a 64-bit i24/u24 multiply as either a pair of 24-bit multiplies, or a single 64-bit mad.

Should the mulhi_24 intrinsics be gone or stay in that case?

If I understand correctly, we'd only end up removing the second pattern; the first one would stay, right?

Also, I still don't understand the deal with the speed model. Should it really affect ISel patterns?
Or do we just do this:

  • V_MAD pattern can select a mul24
  • mul24s not selected by any other pattern are lowered to a pair of mul24/mulhi24

@jayfoad commented Nov 20, 2023

Also I still don't understand the deal with the speed model. Should it affect ISel patterns really?

Yes, I think so. The job of isel (including pattern-based isel) is to pick the "best" instructions according to some cost model, either performance or code size, so certainly changes in the cost model should affect which isel patterns are used.

@Pierre-vh changed the title from "[AMDGPU] Form V_MAD_U64_U32 from mul24/mulhi24" to "[AMDGPU] Form V_MAD_U64_U32 from mul24" on Nov 21, 2023
@Pierre-vh (author)

@jayfoad I did the refactor you suggested; it seems to be looking better indeed.

However I still haven't done the SpeedModel part because:

  • (Apologies but) I'm still confused. Do we need it on those new patterns (we only form the V_MAD_I64_I32 if on FullSpeed), or in [AMDGPU] Don't create mulhi_24 in CGP #72983 to emit an i64 mul instead of i24 muls on non-FullSpeed models?
  • We don't have patterns like that currently so I'm wondering if there is a good reason behind it? (cc @arsenm)

@jayfoad commented Nov 21, 2023

However I still haven't done the SpeedModel part because:

  • (Apologies but) I'm still confused. Do we need it on those new patterns (we only form the V_MAD_I64_I32 if on FullSpeed), or in [AMDGPU] Don't create mulhi_24 in CGP #72983 to emit an i64 mul instead of i24 muls on non-FullSpeed models?

The new patterns should probably have a FullRate64Ops predicate, and if you want to use V_MAD_I64_I32 for a plain 32 * 32 -> 64 bit multiply (not a multiply-add), as suggested inline, then the old pattern could also have a NotFullRate64Ops predicate. It all depends on exactly which pattern you want to be used for each speed model, and making sure there is always at least one pattern that will match each DAG node so you don't get selection failures.

Also, as an alternative to putting predicates on patterns, you could try fiddling with AddedComplexity (since that's the closest thing we have to a cost model for pattern-based selection). But it probably doesn't make much difference in practice.

  • We don't have patterns like that currently so I'm wondering if there is a good reason behind it? (cc @arsenm)

I guess there aren't many cases where we want to select different instructions to do the same thing for different speed models.

@arsenm commented Nov 28, 2023

For background: my understanding is that the only reason CGP needs to help select 24-bit multiplies is that you can't do it during instruction selection because there is no way for a target to simplify a generic opcode like ISD::MUL based on demanded bits information.

I think I originally moved this because computeKnownBits in the IR is significantly better than on the DAG, in particular because of the cross-block copy issue. Some benchmark needed a loop backedge to be handled optimally.
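
As a hypothetical illustration of that cross-block issue (made-up example, not taken from the PR or the benchmark; the real case involved a loop backedge):

define void @cross_block_mul24(i32 %a, i32 %b, i1 %cond, ptr %p) {
entry:
  ; Both operands are masked to 24 bits here, in the entry block.
  %a24 = and i32 %a, 16777215
  %b24 = and i32 %b, 16777215
  br i1 %cond, label %use, label %exit

use:
  ; At the IR level, computeKnownBits still sees the masks above, so
  ; AMDGPUCodeGenPrepare can turn this into a 24-bit multiply. In the
  ; SelectionDAG, %a24/%b24 generally reach this block as opaque CopyFromReg
  ; nodes, so the same known-bits information is not available during isel.
  %mul = mul i32 %a24, %b24
  store i32 %mul, ptr %p
  br label %exit

exit:
  ret void
}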

@Pierre-vh requested a review from arsenm on December 4, 2023 at 07:43
@Pierre-vh (author)

Small ping

@Pierre-vh (author)

@jayfoad, can you answer the speed model questions? I don't know enough about it to answer those. Thanks

@Pierre-vh merged commit dd32d26 into llvm:main on Dec 11, 2023
6 of 7 checks passed
@Pierre-vh deleted the mad_mul24 branch on December 11, 2023 at 10:38