
[AMDGPU] Form V_MAD_U64_U32 from mul24 #72393

Merged (4 commits, Dec 11, 2023)
Conversation

@Pierre-vh commented Nov 15, 2023

Fixes SWDEV-421067

@llvmbot commented Nov 15, 2023

@llvm/pr-subscribers-llvm-ir

@llvm/pr-subscribers-backend-amdgpu

Author: Pierre van Houtryve (Pierre-vh)

Changes

CGP can transform a fine mul+add into a (mul24/mulhi24)+add, so add a pattern for that.

See SWDEV-421067
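
For reference, the IR shape in question looks roughly like this (a minimal sketch mirroring the test added in this patch; the function and value names are illustrative):

declare i32 @llvm.amdgcn.mul.u24(i32, i32)
declare i32 @llvm.amdgcn.mulhi.u24(i32, i32)

; Shape produced by amdgpu-codegenprepare-mul24 for a 24-bit multiply feeding
; a 64-bit add; the new pattern folds this back into a single v_mad_u64_u32.
define i64 @mul24_add(i32 %x, i32 %y, i64 %z) {
  %lo = call i32 @llvm.amdgcn.mul.u24(i32 %x, i32 %y)
  %hi = call i32 @llvm.amdgcn.mulhi.u24(i32 %x, i32 %y)
  %lo.zext = zext i32 %lo to i64
  %hi.zext = zext i32 %hi to i64
  %hi.shift = shl i64 %hi.zext, 32
  %mul = or i64 %hi.shift, %lo.zext
  %add = add i64 %mul, %z
  ret i64 %add
}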


Full diff: https://github.com/llvm/llvm-project/pull/72393.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/VOP3Instructions.td (+9)
  • (modified) llvm/test/CodeGen/AMDGPU/integer-mad-patterns.ll (+63)
diff --git a/llvm/lib/Target/AMDGPU/VOP3Instructions.td b/llvm/lib/Target/AMDGPU/VOP3Instructions.td
index 114d33b077866a1..06856b03a508124 100644
--- a/llvm/lib/Target/AMDGPU/VOP3Instructions.td
+++ b/llvm/lib/Target/AMDGPU/VOP3Instructions.td
@@ -676,6 +676,15 @@ multiclass IMAD32_Pats <VOP3_Pseudo inst> {
         (ThreeOpFragSDAG<mul, add> i32:$src0, i32:$src1, (i32 imm:$src2)),
         (EXTRACT_SUBREG (inst $src0, $src1, (i64 (as_i64imm $src2)), 0 /* clamp */), sub0)
         >;
+
+  // Handle cases where amdgpu-codegenprepare-mul24 made a mul24 instead of a normal mul.
+  def : GCNPat <
+      (i64 (add (bitconvert (v2i32 (build_vector
+                                     (AMDGPUmul_u24 i32:$src0, i32:$src1),
+                                     (AMDGPUmulhi_u24 i32:$src0, i32:$src1)))),
+                i64:$src2)),
+      (inst $src0, $src1, $src2, 0 /* clamp */)
+      >;
 }
 
 // exclude pre-GFX9 where it was slow
diff --git a/llvm/test/CodeGen/AMDGPU/integer-mad-patterns.ll b/llvm/test/CodeGen/AMDGPU/integer-mad-patterns.ll
index 61017e809c86365..56b6fef1b82255c 100644
--- a/llvm/test/CodeGen/AMDGPU/integer-mad-patterns.ll
+++ b/llvm/test/CodeGen/AMDGPU/integer-mad-patterns.ll
@@ -6928,6 +6928,69 @@ entry:
   ret <2 x i16> %add0
 }
 
+define i64 @mul_mulhi_u24(i32 %x, i32 %y, i64 %z) {
+; GFX67-LABEL: mul_mulhi_u24:
+; GFX67:       ; %bb.0:
+; GFX67-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX67-NEXT:    v_mul_u32_u24_e32 v4, v0, v1
+; GFX67-NEXT:    v_mul_hi_u32_u24_e32 v1, v0, v1
+; GFX67-NEXT:    v_add_i32_e32 v0, vcc, v4, v2
+; GFX67-NEXT:    v_addc_u32_e32 v1, vcc, v1, v3, vcc
+; GFX67-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX8-LABEL: mul_mulhi_u24:
+; GFX8:       ; %bb.0:
+; GFX8-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX8-NEXT:    v_mul_u32_u24_e32 v4, v0, v1
+; GFX8-NEXT:    v_mul_hi_u32_u24_e32 v1, v0, v1
+; GFX8-NEXT:    v_add_u32_e32 v0, vcc, v4, v2
+; GFX8-NEXT:    v_addc_u32_e32 v1, vcc, v1, v3, vcc
+; GFX8-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX9-SDAG-LABEL: mul_mulhi_u24:
+; GFX9-SDAG:       ; %bb.0:
+; GFX9-SDAG-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX9-SDAG-NEXT:    v_mad_u64_u32 v[0:1], s[4:5], v0, v1, v[2:3]
+; GFX9-SDAG-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX9-GISEL-LABEL: mul_mulhi_u24:
+; GFX9-GISEL:       ; %bb.0:
+; GFX9-GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX9-GISEL-NEXT:    v_mul_u32_u24_e32 v4, v0, v1
+; GFX9-GISEL-NEXT:    v_mul_hi_u32_u24_e32 v1, v0, v1
+; GFX9-GISEL-NEXT:    v_add_co_u32_e32 v0, vcc, v4, v2
+; GFX9-GISEL-NEXT:    v_addc_co_u32_e32 v1, vcc, v1, v3, vcc
+; GFX9-GISEL-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX10-SDAG-LABEL: mul_mulhi_u24:
+; GFX10-SDAG:       ; %bb.0:
+; GFX10-SDAG-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX10-SDAG-NEXT:    v_mad_u64_u32 v[0:1], null, v0, v1, v[2:3]
+; GFX10-SDAG-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX10-GISEL-LABEL: mul_mulhi_u24:
+; GFX10-GISEL:       ; %bb.0:
+; GFX10-GISEL-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX10-GISEL-NEXT:    v_mul_u32_u24_e32 v4, v0, v1
+; GFX10-GISEL-NEXT:    v_mul_hi_u32_u24_e32 v1, v0, v1
+; GFX10-GISEL-NEXT:    v_add_co_u32 v0, vcc_lo, v4, v2
+; GFX10-GISEL-NEXT:    v_add_co_ci_u32_e32 v1, vcc_lo, v1, v3, vcc_lo
+; GFX10-GISEL-NEXT:    s_setpc_b64 s[30:31]
+  %mul = call i32 @llvm.amdgcn.mul.u24(i32 %x, i32 %y)
+  %mulhi = call i32 @llvm.amdgcn.mulhi.u24(i32 %x, i32 %y)
+  %mul.zext = zext i32 %mul to i64
+  %mulhi.zext = zext i32 %mulhi to i64
+  %mulhi.shift = shl i64 %mulhi.zext, 32
+  %mul.mulhi = or i64 %mulhi.shift, %mul.zext
+  %add = add nuw nsw i64 %mul.mulhi, %z
+  ret i64 %add
+}
+
+declare i32 @llvm.amdgcn.mul.u24(i32, i32)
+declare i32 @llvm.amdgcn.mulhi.u24(i32, i32)
+declare i32 @llvm.amdgcn.mul.i24(i32, i32)
+declare i32 @llvm.amdgcn.mulhi.i24(i32, i32)
+
 ;; NOTE: These prefixes are unused and the list is autogenerated. Do not add tests below this line:
 ; GFX6: {{.*}}
 ; GFX7: {{.*}}

@jayfoad commented Nov 15, 2023

CGP can transform a fine mul+add into a (mul24/mulhi24)+add, so add a pattern for that.

Typo "fine"? Not sure what you meant.

This would depend on the relative rate of mul_u24 vs mad_u64. On older ASICs, mad_u64 is "quarter rate" so two fast mul_u24 instructions should be faster. I see that gfx90a uses SIDPFullSpeedModel so mad_u64 is as fast as mul_u24.

In any case, would it be better to teach CGP not to do the harmful transformation in the first place, rather than work around it in isel?

@Pierre-vh (author)

CGP can transform a fine mul+add into a (mul24/mulhi24)+add, so add a pattern for that.

Typo "fine"? Not sure what you meant.

This would depend on the relative rate of mul_u24 vs mad_u64. On older ASICs, mad_u64 is "quarter rate" so two fast mul_u24 instructions should be faster. I see that gfx90a uses SIDPFullSpeedModel so mad_u64 is as fast as mul_u24.

In any case, would it be better to teach CGP not to do the harmful transformation in the first place, rather than work around it in isel?

By fine I meant "fine to transform into v_mad" but I didn't know about FullSpeed/QuarterSpeed - I edited it.
Should I add a predicate on this to only do the transform on FullSpeedModels then?

About doing this in CGP, I asked @arsenm earlier and he suggested to fix it in ISel rather than teach CGP. I tend to agree - CGP doesn't always have the full picture/full knowledge of what the DAG can do, and it may be non-obvious to fix in CGP. Having a new pattern is simpler and more stable, IMO

@jayfoad commented Nov 16, 2023

About doing this in CGP, I asked @arsenm earlier and he suggested to fix it in ISel rather than teach CGP. I tend to agree - CGP doesn't always have the full picture/full knowledge of what the DAG can do, and it may be non-obvious to fix in CGP. Having a new pattern is simpler and more stable, IMO

You might both be right (I'm not super familiar with CGP) but I don't really understand why. On a "full rate doubles" machine (i.e. SIDPFullSpeedModel or SIDPGFX940FullSpeedModel), 24-bit multiplies are no faster than 32-bit, so why would CGP need to do anything at all?

@arsenm commented Nov 17, 2023

You might both be right (I'm not super familiar with CGP) but I don't really understand why. On a "full rate doubles" machine (i.e. SIDPFullSpeedModel or SIDPGFX940FullSpeedModel), 24-bit multiplies are no faster than 32-bit, so why would CGP need to do anything at all?

I believe they also use less power. Plus, it's better not to depend on the specific speed model; the more common-looking the ISA, the better.

@jayfoad commented Nov 20, 2023

For background: my understanding is that the only reason CGP needs to help select 24-bit multiplies is that you can't do it during instruction selection because there is no way for a target to simplify a generic opcode like ISD::MUL based on demanded bits information. See https://discourse.llvm.org/t/selectiondag-target-specific-simplification-of-generic-nodes-using-demanded-bits/56747/1

With the current patch I still don't like the fact that CGP will split a 64-bit multiply and expect isel to combine it back into a single instruction later. How about this for a cleaner design (a rough IR sketch follows the list):

  • Change llvm.amdgcn.mul.i24 and llvm.amdgcn.mul.u24 to work for any scalar integer type, not just i32. So they are like a regular multiply but only look at the low 24 bits of their inputs. (Or the result is undefined if the inputs are not signed/unsigned 24-bit values - I'm not sure if that is a more useful definition?)
  • Then CGP can convert mul to llvm.amdgcn.mul.i24/u24 where possible with no need to split up 64-bit multiplies.
  • Isel patterns with suitable predicates based on the speed model can choose to implement a 64-bit i24/u24 multiply as either a pair of 24-bit multiplies, or a single 64-bit mad.
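
A minimal IR sketch of the first bullet, assuming the intrinsic becomes overloaded on its result type (the .i64 mangling, function name, and values below are hypothetical, not something this patch defines):

; Hypothetical overloaded form: works on any scalar integer type, but only the
; low 24 bits of each input are significant.
declare i64 @llvm.amdgcn.mul.u24.i64(i64, i64)

define i64 @mul24_mad64(i64 %x, i64 %y, i64 %z) {
  ; CGP could emit this directly instead of splitting the multiply into
  ; mul24/mulhi24 plus zext/shl/or, ...
  %mul = call i64 @llvm.amdgcn.mul.u24.i64(i64 %x, i64 %y)
  ; ... and isel would then select either a single v_mad_u64_u32 or a
  ; mul24/mulhi24 pair for this multiply-add, depending on the speed model.
  %mad = add i64 %mul, %z
  ret i64 %mad
}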

@Pierre-vh (author)

For background: my understanding is that the only reason CGP needs to help select 24-bit multiplies is that you can't do it during instruction selection because there is no way for a target to simplify a generic opcode like ISD::MUL based on demanded bits information. See https://discourse.llvm.org/t/selectiondag-target-specific-simplification-of-generic-nodes-using-demanded-bits/56747/1

With the current patch I still don't like the fact that CGP will split a 64-bit multiply and expect isel to combine it back into a single instruction later. How about this for a cleaner design:

  • Change llvm.amdgcn.mul.i24 and llvm.amdgcn.mul.u24 to work for any scalar integer type, not just i32. So they are like a regular multiply but only look at the low 24 bits of their inputs. (Or the result is undefined if the inputs are not signed/unsigned 24-bit values - I'm not sure if that is a more useful definition?)
  • Then CGP can convert mul to llvm.amdgcn.mul.i24/u24 where possible with no need to split up 64-bit multiplies.
  • Isel patterns with suitable predicates based on the speed model can choose to implement a 64-bit i24/u24 multiply as either a pair of 24-bit multiplies, or a single 64-bit mad.

Should the mulhi_24 intrinsics be gone or stay in that case?

If I understand correctly, we'd only end up removing the second pattern; the first one would stay, right?

Also, I still don't understand the deal with the speed model. Should it really affect ISel patterns?
Or do we just do this:

  • V_MAD pattern can select a mul24
  • mul24s not selected by any other pattern are lowered to a pair of mul24/mulhi24

@jayfoad commented Nov 20, 2023

Also I still don't understand the deal with the speed model. Should it affect ISel patterns really?

Yes, I think so. The job of isel (including pattern-based isel) is to pick the "best" instructions according to some cost model, either performance or code size, so certainly changes in the cost model should affect which isel patterns are used.

@Pierre-vh changed the title from "[AMDGPU] Form V_MAD_U64_U32 from mul24/mulhi24" to "[AMDGPU] Form V_MAD_U64_U32 from mul24" on Nov 21, 2023
@Pierre-vh (author)

@jayfoad I did the refactor you suggested; it seems to be looking better indeed.

However I still haven't done the SpeedModel part because:

  • (Apologies but) I'm still confused. Do we need it on those new patterns (we only form the V_MAD_I64_I32 if on FullSpeed), or in [AMDGPU] Don't create mulhi_24 in CGP #72983 to emit an i64 mul instead of i24 muls on non-FullSpeed models?
  • We don't have patterns like that currently so I'm wondering if there is a good reason behind it? (cc @arsenm)

@jayfoad commented Nov 21, 2023

However I still haven't done the SpeedModel part because:

  • (Apologies but) I'm still confused. Do we need it on those new patterns (we only form the V_MAD_I64_I32 if on FullSpeed), or in [AMDGPU] Don't create mulhi_24 in CGP #72983 to emit an i64 mul instead of i24 muls on non-FullSpeed models?

The new patterns should probably have a FullRate64Ops predicate, and if you want to use V_MAD_I64_I32 for a plain 32 * 32 -> 64 bit multiply (not a multiply-add), as suggested inline, then the old pattern could also have a NotFullRate64Ops predicate. It all depends on exactly which pattern you want to be used for each speed model, and making sure there is always at least one pattern that will match each DAG node so you don't get selection failures.

Also, as an alternative to putting predicates on patterns, you could try fiddling with AddedComplexity (since that's the closest thing we have to a cost model for pattern-based selection). But it probably doesn't make much difference in practice.

  • We don't have patterns like that currently so I'm wondering if there is a good reason behind it? (cc @arsenm)

I guess there aren't many cases where we want to select different instructions to do the same thing for different speed models.

@arsenm commented Nov 28, 2023

For background: my understanding is that the only reason CGP needs to help select 24-bit multiplies is that you can't do it during instruction selection because there is no way for a target to simplify a generic opcode like ISD::MUL based on demanded bits information.

I think I originally moved this because computeKnownBits in the IR is significantly better than on the DAG, in particular because of the cross-block copy issue. Some benchmark needed a loop backedge to be handled optimally.
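
As a hypothetical illustration of that cross-block issue (made-up example, not taken from the PR or the benchmark; the real case involved a loop backedge):

define void @cross_block_mul24(i32 %a, i32 %b, i1 %cond, ptr %p) {
entry:
  ; Both operands are masked to 24 bits here, in the entry block.
  %a24 = and i32 %a, 16777215
  %b24 = and i32 %b, 16777215
  br i1 %cond, label %use, label %exit

use:
  ; At the IR level, computeKnownBits still sees the masks above, so
  ; AMDGPUCodeGenPrepare can turn this into a 24-bit multiply. In the
  ; SelectionDAG, %a24/%b24 generally reach this block as opaque CopyFromReg
  ; nodes, so the same known-bits information is not available during isel.
  %mul = mul i32 %a24, %b24
  store i32 %mul, ptr %p
  br label %exit

exit:
  ret void
}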

@Pierre-vh requested a review from arsenm on December 4, 2023 at 07:43
@Pierre-vh (author)

Small ping

@Pierre-vh (author)

@jayfoad, can you answer the speed model questions? I don't know enough about it to answer those. Thanks

@Pierre-vh merged commit dd32d26 into llvm:main on Dec 11, 2023
6 of 7 checks passed
@Pierre-vh deleted the mad_mul24 branch on December 11, 2023 at 10:38