[AMDGPU] Form V_MAD_U64_U32 from mul24 #72393
Conversation
@llvm/pr-subscribers-llvm-ir @llvm/pr-subscribers-backend-amdgpu

Author: Pierre van Houtryve (Pierre-vh)

Changes: CGP can transform a fine mul+add into a (mul24/mulhi24)+add, so add a pattern for that. See SWDEV-421067.

Full diff: https://github.com/llvm/llvm-project/pull/72393.diff

2 Files Affected:
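To make the semantics of the new pattern concrete, here is a hedged Python sketch (not part of the patch, and not LLVM code): it models `mul_u24`/`mulhi_u24` as operating on the low 24 bits of each operand, and checks that the (lo, hi) pair combined into an i64 plus a 64-bit addend equals what a single `v_mad_u64_u32` (full 32x32->64 multiply-add) computes when the inputs already fit in 24 bits.

```python
# Hedged model of the DAG shape the new GCNPat matches, assuming
# mul_u24/mulhi_u24 use only the low 24 bits of each operand.
MASK24 = (1 << 24) - 1
MASK64 = (1 << 64) - 1

def mul_u24(x, y):
    # Low 32 bits of the 48-bit product of the low 24 bits of x and y.
    return ((x & MASK24) * (y & MASK24)) & 0xFFFFFFFF

def mulhi_u24(x, y):
    # Remaining high bits (bits 32..47) of that product.
    return ((x & MASK24) * (y & MASK24)) >> 32

def mad_u64_u32(x, y, z):
    # v_mad_u64_u32: full 32x32->64 multiply plus a 64-bit addend.
    return (x * y + z) & MASK64

def mul24_pair_plus_add(x, y, z):
    # The matched shape: build_vector(lo, hi) bitcast to i64, then add.
    pair = (mulhi_u24(x, y) << 32) | mul_u24(x, y)
    return (pair + z) & MASK64

# Equivalence holds whenever x and y fit in 24 bits.
x, y, z = 0xABCDEF, 0x123456, 0xDEADBEEFCAFE
assert mul24_pair_plus_add(x, y, z) == mad_u64_u32(x, y, z)
```

This is only a semantic sketch; the actual lowering decision (one `v_mad_u64_u32` vs. two 24-bit multiplies plus a 64-bit add) is what the rest of the thread debates.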
diff --git a/llvm/lib/Target/AMDGPU/VOP3Instructions.td b/llvm/lib/Target/AMDGPU/VOP3Instructions.td
index 114d33b077866a1..06856b03a508124 100644
--- a/llvm/lib/Target/AMDGPU/VOP3Instructions.td
+++ b/llvm/lib/Target/AMDGPU/VOP3Instructions.td
@@ -676,6 +676,15 @@ multiclass IMAD32_Pats <VOP3_Pseudo inst> {
(ThreeOpFragSDAG<mul, add> i32:$src0, i32:$src1, (i32 imm:$src2)),
(EXTRACT_SUBREG (inst $src0, $src1, (i64 (as_i64imm $src2)), 0 /* clamp */), sub0)
>;
+
+ // Handle cases where amdgpu-codegenprepare-mul24 made a mul24 instead of a normal mul.
+ def : GCNPat <
+ (i64 (add (bitconvert (v2i32 (build_vector
+ (AMDGPUmul_u24 i32:$src0, i32:$src1),
+ (AMDGPUmulhi_u24 i32:$src0, i32:$src1)))),
+ i64:$src2)),
+ (inst $src0, $src1, $src2, 0 /* clamp */)
+ >;
}
// exclude pre-GFX9 where it was slow
diff --git a/llvm/test/CodeGen/AMDGPU/integer-mad-patterns.ll b/llvm/test/CodeGen/AMDGPU/integer-mad-patterns.ll
index 61017e809c86365..56b6fef1b82255c 100644
--- a/llvm/test/CodeGen/AMDGPU/integer-mad-patterns.ll
+++ b/llvm/test/CodeGen/AMDGPU/integer-mad-patterns.ll
@@ -6928,6 +6928,69 @@ entry:
ret <2 x i16> %add0
}
+define i64 @mul_mulhi_u24(i32 %x, i32 %y, i64 %z) {
+; GFX67-LABEL: mul_mulhi_u24:
+; GFX67: ; %bb.0:
+; GFX67-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX67-NEXT: v_mul_u32_u24_e32 v4, v0, v1
+; GFX67-NEXT: v_mul_hi_u32_u24_e32 v1, v0, v1
+; GFX67-NEXT: v_add_i32_e32 v0, vcc, v4, v2
+; GFX67-NEXT: v_addc_u32_e32 v1, vcc, v1, v3, vcc
+; GFX67-NEXT: s_setpc_b64 s[30:31]
+;
+; GFX8-LABEL: mul_mulhi_u24:
+; GFX8: ; %bb.0:
+; GFX8-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX8-NEXT: v_mul_u32_u24_e32 v4, v0, v1
+; GFX8-NEXT: v_mul_hi_u32_u24_e32 v1, v0, v1
+; GFX8-NEXT: v_add_u32_e32 v0, vcc, v4, v2
+; GFX8-NEXT: v_addc_u32_e32 v1, vcc, v1, v3, vcc
+; GFX8-NEXT: s_setpc_b64 s[30:31]
+;
+; GFX9-SDAG-LABEL: mul_mulhi_u24:
+; GFX9-SDAG: ; %bb.0:
+; GFX9-SDAG-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX9-SDAG-NEXT: v_mad_u64_u32 v[0:1], s[4:5], v0, v1, v[2:3]
+; GFX9-SDAG-NEXT: s_setpc_b64 s[30:31]
+;
+; GFX9-GISEL-LABEL: mul_mulhi_u24:
+; GFX9-GISEL: ; %bb.0:
+; GFX9-GISEL-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX9-GISEL-NEXT: v_mul_u32_u24_e32 v4, v0, v1
+; GFX9-GISEL-NEXT: v_mul_hi_u32_u24_e32 v1, v0, v1
+; GFX9-GISEL-NEXT: v_add_co_u32_e32 v0, vcc, v4, v2
+; GFX9-GISEL-NEXT: v_addc_co_u32_e32 v1, vcc, v1, v3, vcc
+; GFX9-GISEL-NEXT: s_setpc_b64 s[30:31]
+;
+; GFX10-SDAG-LABEL: mul_mulhi_u24:
+; GFX10-SDAG: ; %bb.0:
+; GFX10-SDAG-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX10-SDAG-NEXT: v_mad_u64_u32 v[0:1], null, v0, v1, v[2:3]
+; GFX10-SDAG-NEXT: s_setpc_b64 s[30:31]
+;
+; GFX10-GISEL-LABEL: mul_mulhi_u24:
+; GFX10-GISEL: ; %bb.0:
+; GFX10-GISEL-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX10-GISEL-NEXT: v_mul_u32_u24_e32 v4, v0, v1
+; GFX10-GISEL-NEXT: v_mul_hi_u32_u24_e32 v1, v0, v1
+; GFX10-GISEL-NEXT: v_add_co_u32 v0, vcc_lo, v4, v2
+; GFX10-GISEL-NEXT: v_add_co_ci_u32_e32 v1, vcc_lo, v1, v3, vcc_lo
+; GFX10-GISEL-NEXT: s_setpc_b64 s[30:31]
+ %mul = call i32 @llvm.amdgcn.mul.u24(i32 %x, i32 %y)
+ %mulhi = call i32 @llvm.amdgcn.mulhi.u24(i32 %x, i32 %y)
+ %mul.zext = zext i32 %mul to i64
+ %mulhi.zext = zext i32 %mulhi to i64
+ %mulhi.shift = shl i64 %mulhi.zext, 32
+ %mul.mulhi = or i64 %mulhi.shift, %mul.zext
+ %add = add nuw nsw i64 %mul.mulhi, %z
+ ret i64 %add
+}
+
+declare i32 @llvm.amdgcn.mul.u24(i32, i32)
+declare i32 @llvm.amdgcn.mulhi.u24(i32, i32)
+declare i32 @llvm.amdgcn.mul.i24(i32, i32)
+declare i32 @llvm.amdgcn.mulhi.i24(i32, i32)
+
;; NOTE: These prefixes are unused and the list is autogenerated. Do not add tests below this line:
; GFX6: {{.*}}
; GFX7: {{.*}}
Typo "fine"? Not sure what you meant. This would depend on the relative rate of mul_u24 vs mad_u64. On older ASICs, mad_u64 is "quarter rate", so two fast mul_u24 instructions should be faster. I see that gfx90a uses SIDPFullSpeedModel, so mad_u64 is as fast as mul_u24. In any case, would it be better to teach CGP not to do the harmful transformation in the first place, rather than work around it in isel?
By "fine" I meant "fine to transform into v_mad", but I didn't know about FullSpeed/QuarterSpeed - I edited it. About doing this in CGP, I asked @arsenm earlier and he suggested fixing it in ISel rather than teaching CGP. I tend to agree - CGP doesn't always have the full picture of what the DAG can do, and it may be non-obvious to fix in CGP. Having a new pattern is simpler and more stable, IMO.
You might both be right (I'm not super familiar with CGP) but I don't really understand why. On a "full rate doubles" machine (i.e. SIDPFullSpeedModel or SIDPGFX940FullSpeedModel), 24-bit multiplies are no faster than 32-bit, so why would CGP need to do anything at all?
I believe they also use less power. Plus, it's better not to depend on the specific speed model - the more common-looking the ISA, the better.
For background: my understanding is that the only reason CGP needs to help select 24-bit multiplies is that you can't do it during instruction selection because there is no way for a target to simplify a generic opcode like ISD::MUL based on demanded bits information. See https://discourse.llvm.org/t/selectiondag-target-specific-simplification-of-generic-nodes-using-demanded-bits/56747/1 With the current patch I still don't like the fact that CGP will split a 64-bit multiply and expect isel to combine it back into a single instruction later. How about this for a cleaner design:
Should the mulhi_24 intrinsics be removed or stay in that case? If I understand correctly, we'd only end up removing the second pattern; the first one would stay, right? Also, I still don't understand the deal with the speed model - should it really affect ISel patterns?
Yes, I think so. The job of isel (including pattern-based isel) is to pick the "best" instructions according to some cost model, either for performance or for code size, so changes in the cost model should certainly affect which isel patterns are used.
@jayfoad I did the refactor you suggested; it seems to be looking better indeed. However, I still haven't done the SpeedModel part because:
The new patterns should probably have a FullRate64Ops predicate, and if you want to use V_MAD_I64_I32 for a plain 32 * 32 -> 64 bit multiply (not a multiply-add), as suggested inline, then the old pattern could also have a NotFullRate64Ops predicate. It all depends on exactly which pattern you want to be used for each speed model, and making sure there is always at least one pattern that will match each DAG node so you don't get selection failures. Also, as an alternative to putting predicates on patterns, you could try fiddling with AddedComplexity (since that's the closest thing we have to a cost model for pattern-based selection). But it probably doesn't make much difference in practice.
I guess there aren't many cases where we want to select different instructions to do the same thing for different speed models.
I think I originally moved this because computeKnownBits in the IR is significantly better than on the DAG, in particular because of the cross-block copy issue. Some benchmark needed a loop backedge to be handled optimally.
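As a hedged illustration of the known-bits reasoning above (function names here are illustrative stand-ins, not LLVM APIs): when analysis proves both operands of a 64-bit multiply are below 2**24, the product fits in 48 bits, so a 24-bit low multiply plus a 24-bit high multiply reconstruct the full product exactly. That is the split CGP performs and that the patch's pattern re-fuses.

```python
# Sketch of why known-bits information justifies forming mul24.
# If x, y < 2**24, then x * y < 2**48, so the pair
# (v_mul_u32_u24, v_mul_hi_u32_u24) loses nothing.

def operands_fit_24_bits(x, y):
    # Crude stand-in for a computeKnownBits query on both operands.
    return x < (1 << 24) and y < (1 << 24)

def lower_mul_via_mul24(x, y):
    assert operands_fit_24_bits(x, y)
    prod = x * y                # fits in 48 bits by the check above
    lo = prod & 0xFFFFFFFF      # what v_mul_u32_u24 would produce
    hi = prod >> 32             # what v_mul_hi_u32_u24 would produce
    return (hi << 32) | lo      # equals the full product

assert lower_mul_via_mul24(0xFFFFFF, 0xFFFFFF) == 0xFFFFFF * 0xFFFFFF
```

Doing this query in IR, where known-bits can see across basic blocks (e.g. through a loop backedge), is the advantage the comment above refers to; the DAG sees only one block at a time.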
Small ping |
@jayfoad, can you answer the speed model questions? I don't know enough about it to answer those. Thanks |
Fixes SWDEV-421067