
VectorReduce peephole matching for Hexagon #5424

Merged 17 commits into halide:master on Dec 10, 2020

Conversation

aankit-ca (Contributor)

Generates vrmpy/vtmpy/vdmpy for VectorReduce nodes
@dsharletg PTAL
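
For context, here is a hedged sketch (not taken from this PR or its tests) of the kind of Halide reduction whose vectorized update produces the VectorReduce nodes this PR matches. The buffer names, the exact schedule, and the target string are illustrative assumptions.

#include "Halide.h"
using namespace Halide;

int main() {
    // 4-wide u8*u8 dot product accumulated into 32 bits -- the shape vrmpy accepts.
    ImageParam a(UInt(8), 1), b(UInt(8), 1);
    Var x("x");
    RDom r(0, 4);

    Func dot("dot");
    dot(x) = cast<uint32_t>(0);
    dot(x) += cast<uint32_t>(a(4 * x + r)) * cast<uint32_t>(b(4 * x + r));

    // Vectorizing the reduction variable is what introduces VectorReduce
    // nodes into the IR; this PR pattern-matches them to vrmpy/vtmpy/vdmpy.
    dot.update().atomic().vectorize(r).vectorize(x, 32);

    // Illustrative HVX target string; adjust for your setup.
    dot.compile_to_llvm_assembly("dot.ll", {a, b}, Target("hexagon-32-noos-hvx_128"));
    return 0;
}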

@@ -1003,6 +1003,18 @@ const HvxIntrinsic intrinsic_wrappers[] = {
"acc_add_2mpy.vw.vh.b",
{i32v1, i16v1, i16},
HvxIntrinsic::BroadcastScalarsToWords},
{MAKE_ID_PAIR(Intrinsic::hexagon_V6_vdmpyhsat),
Contributor

MAKE_ID_PAIR was removed when hvx_64 support was removed. This branch is a wee bit out of date :-)

Contributor Author

Thanks, Steven, for pointing this out. Fixed.

@dsharletg (Contributor) left a comment

This is fantastic work; it looks great! @abadams @vksnk, you might find this interesting. This looks like fairly robust and reliable code generation of interesting horizontal vector ops, with a lot more variety of instructions than on ARM.

I am a bit worried about the amount of shuffling that gets generated in some of the cases. Do you see these in hand-written inner loops? Or are they expected to simplify if the input code is of the right form?

IRMatcher::Wild<1> y;
auto rewrite = IRMatcher::rewriter(IRMatcher::mul(op->a, op->b), op->type);
if (rewrite(h_add(x, lanes) * broadcast(y, lanes), h_add(x * broadcast(y, arg_lanes), lanes)) ||
rewrite(h_add(y, lanes) * broadcast(x, lanes), h_add(broadcast(x, arg_lanes) * y, lanes))) {
Contributor

Are these rewrites actually different?

Member

I don't think the second one is necessary: a broadcast is always going to be on the right of a horizontal add, because of how the simplifier canonicalizes things.

Member

Also this seems risky. If you don't hit vrmpy/vtmpy/vdmpy then you just turned a narrow multiply into a much wider one.

Contributor Author

Will correct this in the next patch.

{reinterpret(ty, a1), a1.type().bytes()},
Call::PureExtern));
} else {
if (!can_prove(Shuffle::make_extract_element(b, 2) == 1)) {
Contributor

Add a comment about vtmpy's requirement on the '1' coefficient here.
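
One possible wording for that comment, sketched below. The claim about the implicit third tap is my reading of why the guard requires the third coefficient to be 1, not text from the PR.

// vtmpy is a 3-tap filter in which only the first two taps take programmable
// coefficients; the third tap is applied with an implicit coefficient of 1.
// So this pattern is only usable when the third coefficient provably equals 1.
if (!can_prove(Shuffle::make_extract_element(b, 2) == 1)) {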

Contributor Author

Done

if (const Ramp *ramp = e.as<Ramp>()) {
Type ty = t.with_lanes(ramp->type.lanes() / t.lanes());
Expr first = ramp->base;
Expr last = simplify(first + (ramp->lanes - 1) * ramp->stride);
Contributor

This seems correct, but it's a little suspicious, and calling simplify here is possibly an issue. @abadams, what do you think?

Member

The case that's dubious would be something like lossless_cast(UInt(8), ramp(u16(5), u16(32800), 3)). The first and last lanes are 5 and 69, which are both representable as UInt8s. But the middle lane is 32805, which is not.

It's correct for a type where overflow is undefined behavior, so you could only do this if it's a ramp of int32s, in case that helps.
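
A hedged sketch of the concern and of the int32-only guard suggested above; this is an illustration, not the code the PR eventually landed.

// Unsound case when lane arithmetic can wrap:
//   lossless_cast(UInt(8), ramp(u16(5), u16(32800), 3))
// has lanes {5, 32805, 69 (after wrapping)}; the endpoints 5 and 69 fit in a
// uint8, but the middle lane 32805 does not. Overflow is undefined for int32
// in Halide, so an endpoint-only check is safe for int32 ramps:
if (const Ramp *ramp = e.as<Ramp>()) {
    if (ramp->base.type().element_of() == Int(32)) {
        Expr first = ramp->base;
        Expr last = simplify(first + (ramp->lanes - 1) * ramp->stride);
        // ... checking that first and last are representable in t suffices here ...
    }
}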

{2, Int(32, 0), Int(16, 0), UInt(16, 2), Signature::ScalarB}, // Saturates
{2, Int(32, 0), Int(16, 0), Int(16, 2), Signature::ScalarB}, // Saturates
{2, Int(16, 0), UInt(8, 0), Int(8, 2), Signature::ScalarB}, // b_ty can be Int(8, 4).
// TODO: This would not match until b_ty is changed to Int(8, 4)
Contributor

It seems like this should be Int(8, 2), but the call to the intrinsic should broadcast it 2x? i.e. does the TODO belong somewhere else?

Contributor Author

If we use Int(8, 4) instead of broadcasting to 2x, we can also match instructions for patterns like:
a*v0 + b*v1, c*v2 + d*v3, a*v4 + b*v5

define weak_odr <64 x i32> @halide.hexagon.vrmpy_odd.vub.vub.w(<128 x i8> %low_v, <128 x i8> %high_v, i32 %const) nounwind uwtable readnone {
%low = bitcast <128 x i8> %low_v to <32 x i32>
%high = bitcast <128 x i8> %high_v to <32 x i32>
%dv = call <64 x i32> @llvm.hexagon.V6.vcombine.128B(<32 x i32> %high, <32 x i32> %low)
Contributor

Should the intrinsics in hvx_128.ll take 2x wider vectors, and should we do a Shuffle::make_concat in the compiler code instead of vcombine at runtime? Would that maybe result in simpler IR sometimes?
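
A rough sketch of that suggestion (the names low, high, and const_arg are placeholders, and the wrapper is assumed to be redefined to take one double-width vector):

// Concatenate the two half-vectors in the compiler so the .ll wrapper
// receives a single 2x-wide vector and no vcombine is needed at runtime.
Expr dv = Shuffle::make_concat({low, high});
Expr result = Call::make(result_type, "halide.hexagon.vrmpy_odd.vub.vub.w",
                         {dv, const_arg}, Call::PureExtern);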

Contributor Author

I wasn't sure whether Shuffle::make_concat would result in a vcombine. I'll try your suggestion.

Contributor

Note that if you do this, some of these wrappers would become trivial and could be defined in CodeGen_Hexagon.cpp instead of the runtime.

Contributor Author

The problem I was facing there was that the immediate operand cannot be passed as an argument.

if (t.can_represent(c->value.type())) {
    // We can recurse into widening casts.
-   return lossless_cast(t, c->value);
+   Expr v = lossless_cast(t, c->value);
Member

I don't think you can recurse into narrowing casts though: consider
lossless_cast(UInt16(), cast(UInt8(), cast(UInt16(), e)))
It would return e, which is incorrect.

@aankit-ca (Contributor Author), Nov 9, 2020

@abadams What is lossless_cast(UInt(16), cast(Int(8), Variable::make(UInt(16), e))) supposed to return?
The original code returns (uint16)e. Is that correct?

Member

In master, it returns uint16(uint8((uint16)e))

In my head I was thinking lossless_cast is supposed to be one-to-one, so that means:

if x1 == x2 then lossless_cast(t, x1) == lossless_cast(t, x2)

and

if lossless_cast(t, x1) == lossless_cast(t, x2) then x1 == x2

The new rule violates the first of those two conditions (consider e == 0 vs e == 256), but not the second.

I think the issue here is that this isn't written down very precisely. It just says "provably not losing information". 1-to-1 is more than you need for that condition - it just needs to be invertible, so widening and adding junk to the new high bits would qualify, because you can just strip them again. Really the spec implied by the comment is that not only is it invertible, it's inverted by a cast back to the original type:

cast(x.type(), lossless_cast(t, x)) == x

That spec would permit what you've written. I'm a little nervous about making junk high bits legal though. You'd have to know it was feeding into some instruction that ignored them.
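
For illustration, the round-trip spec can be phrased as a checkable predicate. is_provably_lossless is a hypothetical helper, not something in the PR, and as noted above it would accept results with junk high bits.

// Accept lossless_cast's result only if casting back to the original type
// provably recovers the original expression.
bool is_provably_lossless(Type t, const Expr &x) {
    Expr y = lossless_cast(t, x);
    return y.defined() && can_prove(cast(x.type(), y) == x);
}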

@aankit-ca (Contributor Author), Nov 9, 2020

@abadams I agree that the change here is incorrect for the reason you mentioned. However, I think the existing code in master also has a problem. Master returns
uint16(uint8((uint16)e)) for lossless_cast(UInt(16), cast(UInt(8), Variable::make(UInt(16), e))),
but for the signed version,
lossless_cast(UInt(16), cast(Int(8), Variable::make(UInt(16), e))), it returns (uint16)e, which I believe is wrong.

Member

Ah, you're right. That's a bug in master!

Contributor

It sounds like a change is needed here? There is a bug in master, but there is also a bug here?

Should this change (and maybe the other one below) be separated into another PR to unblock this or at least make it easier to land if it has to go first?

@aankit-ca (Contributor Author), Nov 19, 2020

Hey Dillon, I've moved the IROperator changes to a separate PR, #5459. I will remove the changes from here once the other PR is merged. However, these changes are needed for this PR too; otherwise, find_mpy_ops is not able to find some expressions that we can lossless_cast to the required type.

src/IROperator.cpp (outdated review thread, resolved)
@steven-johnson (Contributor)

(Is this ready to clone for buildbot testing?)

@steven-johnson (Contributor)

#5432 opened for testing

@aankit-ca (Contributor Author)

Thanks, Steven. The buildbot testing will be good for seeing whether there are any failures.

@steven-johnson (Contributor)

Clone looks green. Is this ready for final review?

@aankit-ca (Contributor Author)

Yeah it's ready for another round of review.

@steven-johnson (Contributor)

Ping to @dsharlet @abadams for review

@aankit-ca (Contributor Author)

@dsharlet I've removed the IROperator changes from here. Can you give this PR another round of review as well while I continue to address the comments on #5459? Please note that the correctness tests may not pass without #5459 for now.

@pranavb-ca (Contributor)

@dsharletg - PTAL (I think Dillon doesn't use @dsharlet)

Expr halide_hexagon_add_2mpy(Type result_type, const string &suffix, Expr v01, Expr c01, bool is_vdmpy) {
Expr call = Call::make(result_type, "halide.hexagon.add_2mpy" + suffix,
{std::move(v01), std::move(c01)}, Call::PureExtern);
return (is_vdmpy) ? call : native_interleave(call);
Contributor

I think it might be better if the caller added the interleaving; this conditional is potentially confusing.

If you're going to keep the logic, I think the argument should just be called interleave_result or something, not is_vdmpy.
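
A sketch of that second option with the suggested parameter name (illustrative only; the first option would drop the flag entirely and let each caller call native_interleave itself):

Expr halide_hexagon_add_2mpy(Type result_type, const string &suffix,
                             Expr v01, Expr c01, bool interleave_result) {
    Expr call = Call::make(result_type, "halide.hexagon.add_2mpy" + suffix,
                           {std::move(v01), std::move(c01)}, Call::PureExtern);
    // The caller states its intent directly instead of implying it via is_vdmpy.
    return interleave_result ? native_interleave(call) : call;
}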

// Here, x can be reinterpreted as a broadcast of uint32.
if (const Shuffle *shuff = x.as<Shuffle>()) {
int lanes = shuff->type.lanes();
int rep = ty.lanes();
Contributor

I think this would be better as a small helper function is_broadcast(int src_lanes, ...) (and maybe that should be a helper to complement https://github.com/halide/Halide/blob/master/src/IR.h#L773-L775).

// .....
// window_size != lanes
// TODO: There could be other patterns as well that we should match
int is_stencil_reduction(const Expr &op, int window_size) {
Contributor

I think this should be named is_stencil_interleave (this isn't a reduction; it's an interleave of overlapping vectors with stride 1 between them).

Also, it should be static or a free function.

}
// Reinterpret scalar b arg to get correct type.
if (sig.flags & Signature::ScalarB) {
b = simplify(reinterpret(Type(b_ty.code(), b_ty.lanes() * b_ty.bits(), 1), b));
Contributor

Is this making a 24-bit int sometimes, e.g. for vtmpy? I guess this is OK, but this wasn't supported until very recently.

Type arg_ty = op->type.with_lanes(op->type.lanes() / 2);
Expr even = halide_hexagon_add_4mpy(arg_ty, "_even" + suffix, a, b, true);
Expr odd = halide_hexagon_add_4mpy(arg_ty, "_odd" + suffix, a, b, true);
return Shuffle::make_interleave({even, odd});
Contributor

Should we just have the runtime function be add_4mpy, and handle this even + odd + interleave in the runtime function? Might be cleaner/simpler?

@aankit-ca (Contributor Author), Dec 9, 2020

I think the equivalent code looks kind of ugly because of the big shuffle vector. However, I can make the change if you think this one's better.

define weak_odr <128 x i32> @halide.hexagon.add_4mpy.vub.b.dv(<256 x i8> %dv, i32 %const) nounwind uwtable readnone {
  %dv32 = bitcast <256 x i8> %dv to <64 x i32>
  %even_deinterleaved = call <64 x i32> @llvm.hexagon.V6.vrmpybusi.128B(<64 x i32> %dv32, i32 %const, i32 0)
  %odd_deinterleaved = call <64 x i32> @llvm.hexagon.V6.vrmpybusi.128B(<64 x i32> %dv32, i32 %const, i32 1)
  %even = call <64 x i32> @halide.hexagon.interleave.vw(<64 x i32> %even_deinterleaved)
  %odd = call <64 x i32> @halide.hexagon.interleave.vw(<64 x i32> %odd_deinterleaved)
  %ee = call <32 x i32> @llvm.hexagon.V6.lo.128B(<64 x i32> %even)
  %eo = call <32 x i32> @llvm.hexagon.V6.hi.128B(<64 x i32> %even)
  %oe = call <32 x i32> @llvm.hexagon.V6.lo.128B(<64 x i32> %odd)
  %oo = call <32 x i32> @llvm.hexagon.V6.hi.128B(<64 x i32> %odd)
  %res_lo = tail call <64 x i32> @llvm.hexagon.V6.vshuffvdd.128B(<32 x i32> %ee, <32 x i32> %oe, i32 -4)
  %res_hi = tail call <64 x i32> @llvm.hexagon.V6.vshuffvdd.128B(<32 x i32> %eo, <32 x i32> %oo, i32 -4)
  %res = shufflevector <64 x i32> %res_lo, <64 x i32> %res_hi, <128 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63, i32 64, i32 65, i32 66, i32 67, i32 68, i32 69, i32 70, i32 71, i32 72, i32 73, i32 74, i32 75, i32 76, i32 77, i32 78, i32 79, i32 80, i32 81, i32 82, i32 83, i32 84, i32 85, i32 86, i32 87, i32 88, i32 89, i32 90, i32 91, i32 92, i32 93, i32 94, i32 95, i32 96, i32 97, i32 98, i32 99, i32 100, i32 101, i32 102, i32 103, i32 104, i32 105, i32 106, i32 107, i32 108, i32 109, i32 110, i32 111, i32 112, i32 113, i32 114, i32 115, i32 116, i32 117, i32 118, i32 119, i32 120, i32 121, i32 122, i32 123, i32 124, i32 125, i32 126, i32 127>
  ret <128 x i32> %res
}

Contributor

I don't think the shuffle is that bad; it's just a concat. I think we should do this for now. The interesting thing is that maybe we want to try to simplify away some of those interleaves, but the interleaving behavior of this instruction is unusual compared to the rest.

src/IR.h (outdated)
* A uint8 shuffle with 4*n lanes and indices:
* 0, 1, 2, 3, 0, 1, 2, 3, ....., 0, 1, 2, 3
* can be treated as a uint32 broadcast with n lanes.
* Return a Broadcast if possible, Expr() otherwise. */
Contributor

This should return true/false, and just accept a broadcast factor. The caller should be responsible for making a broadcast op if it wants one.
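
A hedged sketch of the boolean form being asked for (name and exact signature are assumptions):

// True if the shuffle's indices are 0, 1, ..., factor-1 repeated, which is
// the condition under which the vector can be reinterpreted as a broadcast
// of elements `factor` times as wide.
bool indices_are_repeated_prefix(const Shuffle *shuff, int factor) {
    if (factor <= 0 || shuff->indices.size() % factor != 0) {
        return false;
    }
    for (size_t i = 0; i < shuff->indices.size(); i++) {
        if (shuff->indices[i] != (int)(i % factor)) {
            return false;
        }
    }
    return true;
}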

return IRMutator::visit(op);
}

typedef struct {
Contributor

Nit: use struct Signature { ... }; also, move this to be a local declaration in the visit function.

@aankit-ca (Contributor Author)

@steven-johnson Can you pull in the changes on #5432 for testing?

@dsharletg (Contributor)

Whoops, I already started it on #5543

@aankit-ca (Contributor Author)

@dsharletg Thanks. I've pushed one more change that needs to be pulled onto the test branch. I think the current failure on the branch is unrelated to this PR; I can see the same test case failing on other PRs too (e.g. #5518).

@dsharletg dsharletg merged commit 968f6b3 into halide:master Dec 10, 2020
@alexreinking alexreinking added this to the v11.0.0 milestone Dec 11, 2020