
VectorReduce peephole matching for Hexagon #5424

Merged 17 commits into halide:master on Dec 10, 2020

Conversation

aankit-ca (Contributor)

Generates vrmpy/vtmpy/vdmpy for VectorReduce nodes
@dsharletg PTAL
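
For context, here is a hedged sketch (not taken from this PR or its tests) of the kind of Halide reduction whose vectorized update produces the VectorReduce nodes this PR matches. The buffer names, the exact schedule, and the target string are illustrative assumptions.

#include "Halide.h"
using namespace Halide;

int main() {
    // 4-wide u8*u8 dot product accumulated into 32 bits -- the shape vrmpy accepts.
    ImageParam a(UInt(8), 1), b(UInt(8), 1);
    Var x("x");
    RDom r(0, 4);

    Func dot("dot");
    dot(x) = cast<uint32_t>(0);
    dot(x) += cast<uint32_t>(a(4 * x + r)) * cast<uint32_t>(b(4 * x + r));

    // Vectorizing the reduction variable is what introduces VectorReduce
    // nodes into the IR; this PR pattern-matches them to vrmpy/vtmpy/vdmpy.
    dot.update().atomic().vectorize(r).vectorize(x, 32);

    // Illustrative HVX target string; adjust for your setup.
    dot.compile_to_llvm_assembly("dot.ll", {a, b}, Target("hexagon-32-noos-hvx_128"));
    return 0;
}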

@@ -1003,6 +1003,18 @@ const HvxIntrinsic intrinsic_wrappers[] = {
"acc_add_2mpy.vw.vh.b",
{i32v1, i16v1, i16},
HvxIntrinsic::BroadcastScalarsToWords},
{MAKE_ID_PAIR(Intrinsic::hexagon_V6_vdmpyhsat),
Contributor

MAKE_ID_PAIR was removed when hvx_64 support was removed. This branch is a wee bit out of date :-)

Contributor Author

Thanks, Steven, for pointing this out. Fixed.

@dsharletg (Contributor) left a comment

This is fantastic work; it looks great! @abadams @vksnk, you might find this interesting. This looks like fairly robust and reliable code generation of interesting horizontal vector ops, with a lot more variety of instructions than on ARM.

I am a bit worried about the amount of shuffling that gets generated in some of the cases. Do you see these in hand-written inner loops? Or are they expected to simplify if the input code is of the right form?

IRMatcher::Wild<1> y;
auto rewrite = IRMatcher::rewriter(IRMatcher::mul(op->a, op->b), op->type);
if (rewrite(h_add(x, lanes) * broadcast(y, lanes), h_add(x * broadcast(y, arg_lanes), lanes)) ||
rewrite(h_add(y, lanes) * broadcast(x, lanes), h_add(broadcast(x, arg_lanes) * y, lanes))) {
Contributor

Are these rewrites actually different?

Member

I don't think the second one is necessary: a broadcast is always going to be on the right of a horizontal add, because of how the simplifier canonicalizes things.

Member

Also this seems risky. If you don't hit vrmpy/vtmpy/vdmpy then you just turned a narrow multiply into a much wider one.

Contributor Author

Will correct this in the next patch.

{reinterpret(ty, a1), a1.type().bytes()},
Call::PureExtern));
} else {
if (!can_prove(Shuffle::make_extract_element(b, 2) == 1)) {
Contributor

Add a comment about vtmpy's requirement on the '1' coefficient here.
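
One possible wording for that comment, sketched below. The claim about the implicit third tap is my reading of why the guard requires the third coefficient to be 1, not text from the PR.

// vtmpy is a 3-tap filter in which only the first two taps take programmable
// coefficients; the third tap is applied with an implicit coefficient of 1.
// So this pattern is only usable when the third coefficient provably equals 1.
if (!can_prove(Shuffle::make_extract_element(b, 2) == 1)) {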

Contributor Author

Done

if (const Ramp *ramp = e.as<Ramp>()) {
Type ty = t.with_lanes(ramp->type.lanes() / t.lanes());
Expr first = ramp->base;
Expr last = simplify(first + (ramp->lanes - 1) * ramp->stride);
Contributor

This seems correct, but it's a little suspicious, and calling simplify here is possibly an issue. @abadams, what do you think?

Member

The case that's dubious would be something like lossless_cast(UInt(8), ramp(u16(5), u16(32800), 3)). The first and last lanes are 5 and 69, which are both representable as UInt8s. But the middle lane is 32805, which is not.

It's correct for a type where overflow is undefined behavior, so you could only do this if it's a ramp of int32s, in case that helps.
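
A hedged sketch of the concern and of the int32-only guard suggested above; this is an illustration, not the code the PR eventually landed.

// Unsound case when lane arithmetic can wrap:
//   lossless_cast(UInt(8), ramp(u16(5), u16(32800), 3))
// has lanes {5, 32805, 69 (after wrapping)}; the endpoints 5 and 69 fit in a
// uint8, but the middle lane 32805 does not. Overflow is undefined for int32
// in Halide, so an endpoint-only check is safe for int32 ramps:
if (const Ramp *ramp = e.as<Ramp>()) {
    if (ramp->base.type().element_of() == Int(32)) {
        Expr first = ramp->base;
        Expr last = simplify(first + (ramp->lanes - 1) * ramp->stride);
        // ... checking that first and last are representable in t suffices here ...
    }
}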

{2, Int(32, 0), Int(16, 0), UInt(16, 2), Signature::ScalarB}, // Saturates
{2, Int(32, 0), Int(16, 0), Int(16, 2), Signature::ScalarB}, // Saturates
{2, Int(16, 0), UInt(8, 0), Int(8, 2), Signature::ScalarB}, // b_ty can be Int(8, 4).
// TODO: This would not match until b_ty is changed to Int(8, 4)
Contributor

It seems like this should be Int(8, 2), but the call to the intrinsic should broadcast it 2x? i.e. does the TODO belong somewhere else?

Contributor Author

If we use Int(8, 4) instead of broadcasting to 2x, we can also match instructions for patterns like:
a*v0 + b*v1, c*v2 + d*v3, a*v4 + b*v5

define weak_odr <64 x i32> @halide.hexagon.vrmpy_odd.vub.vub.w(<128 x i8> %low_v, <128 x i8> %high_v, i32 %const) nounwind uwtable readnone {
%low = bitcast <128 x i8> %low_v to <32 x i32>
%high = bitcast <128 x i8> %high_v to <32 x i32>
%dv = call <64 x i32> @llvm.hexagon.V6.vcombine.128B(<32 x i32> %high, <32 x i32> %low)
Contributor

Should the intrinsics in hvx_128.ll take 2x wider vectors, and should we do a Shuffle::make_concat in the compiler code instead of vcombine at runtime? Would that maybe result in simpler IR sometimes?
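
A rough sketch of that suggestion (the names low, high, and const_arg are placeholders, and the wrapper is assumed to be redefined to take one double-width vector):

// Concatenate the two half-vectors in the compiler so the .ll wrapper
// receives a single 2x-wide vector and no vcombine is needed at runtime.
Expr dv = Shuffle::make_concat({low, high});
Expr result = Call::make(result_type, "halide.hexagon.vrmpy_odd.vub.vub.w",
                         {dv, const_arg}, Call::PureExtern);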

Contributor Author

I wasn't sure whether Shuffle::make_concat would result in a vcombine. I'll try your suggestion.

Contributor

Note that if you do this, some of these wrappers would become trivial and could be defined in CodeGen_Hexagon.cpp instead of the runtime.

Contributor Author

The problem I was facing there was that the immediate operand cannot be passed as an argument.

if (t.can_represent(c->value.type())) {
    // We can recurse into widening casts.
-   return lossless_cast(t, c->value);
+   Expr v = lossless_cast(t, c->value);
Member

I don't think you can recurse into narrowing casts though: consider
lossless_cast(UInt16(), cast(UInt8(), cast(UInt16(), e)))
It would return e, which is incorrect.

@aankit-ca (Contributor Author), Nov 9, 2020

@abadams What is lossless_cast(UInt(16), cast(Int(8), Variable::make(UInt(16), e))) supposed to return?
The original code returns (uint16)e. Is that correct?

Member

In master, it returns uint16(uint8((uint16)e))

In my head I was thinking lossless_cast is supposed to be one-to-one, so that means:

if x1 == x2 then lossless_cast(t, x1) == lossless_cast(t, x2)

and

if lossless_cast(t, x1) == lossless_cast(t, x2) then x1 == x2

The new rule violates the first of those two conditions (consider e == 0 vs e == 256), but not the second.

I think the issue here is that this isn't written down very precisely. It just says "provably not losing information". 1-to-1 is more than you need for that condition - it just needs to be invertible, so widening and adding junk to the new high bits would qualify, because you can just strip them again. Really the spec implied by the comment is that not only is it invertible, it's inverted by a cast back to the original type:

cast(x.type(), lossless_cast(t, x)) == x

That spec would permit what you've written. I'm a little nervous about making junk high bits legal though. You'd have to know it was feeding into some instruction that ignored them.
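
For illustration, the round-trip spec can be phrased as a checkable predicate. is_provably_lossless is a hypothetical helper, not something in the PR, and as noted above it would accept results with junk high bits.

// Accept lossless_cast's result only if casting back to the original type
// provably recovers the original expression.
bool is_provably_lossless(Type t, const Expr &x) {
    Expr y = lossless_cast(t, x);
    return y.defined() && can_prove(cast(x.type(), y) == x);
}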

@aankit-ca (Contributor Author), Nov 9, 2020

@abadams I agree that the change here is incorrect for the reason you mentioned. However, I think the existing code in master also has a problem. Master returns
uint16(uint8((uint16)e)) for lossless_cast(UInt(16), cast(UInt(8), Variable::make(UInt(16), e))),
but for the signed version,
lossless_cast(UInt(16), cast(Int(8), Variable::make(UInt(16), e))), it returns (uint16)e, which I believe is wrong.

Member

Ah, you're right. That's a bug in master!

Contributor

It sounds like a change is needed here? There is a bug in master, but there is also a bug here?

Should this change (and maybe the other one below) be separated into another PR to unblock this or at least make it easier to land if it has to go first?

@aankit-ca (Contributor Author), Nov 19, 2020

Hey Dillon, I've moved the IROperator changes to a separate PR, #5459. I will remove the changes from here once the other PR is merged. However, these changes are needed for this PR too; otherwise, find_mpy_ops is not able to find some expressions that we can lossless_cast to the required type.

src/IROperator.cpp (outdated review thread, resolved)
@steven-johnson (Contributor)

(Is this ready to clone for buildbot testing?)

@steven-johnson (Contributor)

#5432 opened for testing

@aankit-ca (Contributor Author)

Thanks, Steven. The buildbot testing will be good for seeing whether there are any failures.

@steven-johnson (Contributor)

Clone looks green. Is this ready for final review?

@aankit-ca (Contributor Author)

Yeah it's ready for another round of review.

@steven-johnson (Contributor)

Ping to @dsharlet @abadams for review

@aankit-ca (Contributor Author)

@dsharlet I've removed the IROperator changes from here. Can you give this PR another round of review as well while I continue to address the comments on #5459? Please note that the correctness tests may not pass without #5459 for now.

@pranavb-ca (Contributor)

@dsharletg - PTAL (I think Dillon doesn't use @dsharlet)

Expr halide_hexagon_add_2mpy(Type result_type, const string &suffix, Expr v01, Expr c01, bool is_vdmpy) {
Expr call = Call::make(result_type, "halide.hexagon.add_2mpy" + suffix,
{std::move(v01), std::move(c01)}, Call::PureExtern);
return (is_vdmpy) ? call : native_interleave(call);
Contributor

I think it might be better if the caller added the interleaving; this conditional is potentially confusing.

If you're going to keep the logic, I think the argument should just be called interleave_result or something, not is_vdmpy.
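
A sketch of that second option with the suggested parameter name (illustrative only; the first option would drop the flag entirely and let each caller call native_interleave itself):

Expr halide_hexagon_add_2mpy(Type result_type, const string &suffix,
                             Expr v01, Expr c01, bool interleave_result) {
    Expr call = Call::make(result_type, "halide.hexagon.add_2mpy" + suffix,
                           {std::move(v01), std::move(c01)}, Call::PureExtern);
    // The caller states its intent directly instead of implying it via is_vdmpy.
    return interleave_result ? native_interleave(call) : call;
}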

// Here, x can be reinterpreted as a broadcast of uint32.
if (const Shuffle *shuff = x.as<Shuffle>()) {
int lanes = shuff->type.lanes();
int rep = ty.lanes();
Contributor

I think this would be better as a small helper function is_broadcast(int src_lanes, ...) (and maybe that should be a helper to complement https://github.com/halide/Halide/blob/master/src/IR.h#L773-L775).

// .....
// window_size != lanes
// TODO: There could be other patterns as well that we should match
int is_stencil_reduction(const Expr &op, int window_size) {
Contributor

I think this should be named is_stencil_interleave (this isn't a reduction; it's an interleave of overlapping vectors with stride 1 between them).

Also, it should be static or a free function.

}
// Reinterpret scalar b arg to get correct type.
if (sig.flags & Signature::ScalarB) {
b = simplify(reinterpret(Type(b_ty.code(), b_ty.lanes() * b_ty.bits(), 1), b));
Contributor

Is this making a 24-bit int sometimes, e.g. for vtmpy? I guess this is OK, but this wasn't supported until very recently.

Type arg_ty = op->type.with_lanes(op->type.lanes() / 2);
Expr even = halide_hexagon_add_4mpy(arg_ty, "_even" + suffix, a, b, true);
Expr odd = halide_hexagon_add_4mpy(arg_ty, "_odd" + suffix, a, b, true);
return Shuffle::make_interleave({even, odd});
Contributor

Should we just have the runtime function be add_4mpy, and handle this even + odd + interleave in the runtime function? Might be cleaner/simpler?

@aankit-ca (Contributor Author), Dec 9, 2020

I think the equivalent code looks kind of ugly because of the big shuffle vector. However, I can make the change if you think this one's better.

define weak_odr <128 x i32> @halide.hexagon.add_4mpy.vub.b.dv(<256 x i8> %dv, i32 %const) nounwind uwtable readnone {
  %dv32 = bitcast <256 x i8> %dv to <64 x i32>
  %even_deinterleaved = call <64 x i32> @llvm.hexagon.V6.vrmpybusi.128B(<64 x i32> %dv32, i32 %const, i32 0)
  %odd_deinterleaved = call <64 x i32> @llvm.hexagon.V6.vrmpybusi.128B(<64 x i32> %dv32, i32 %const, i32 1)
  %even = call <64 x i32> @halide.hexagon.interleave.vw(<64 x i32> %even_deinterleaved)
  %odd = call <64 x i32> @halide.hexagon.interleave.vw(<64 x i32> %odd_deinterleaved)
  %ee = call <32 x i32> @llvm.hexagon.V6.lo.128B(<64 x i32> %even)
  %eo = call <32 x i32> @llvm.hexagon.V6.hi.128B(<64 x i32> %even)
  %oe = call <32 x i32> @llvm.hexagon.V6.lo.128B(<64 x i32> %odd)
  %oo = call <32 x i32> @llvm.hexagon.V6.hi.128B(<64 x i32> %odd)
  %res_lo = tail call <64 x i32> @llvm.hexagon.V6.vshuffvdd.128B(<32 x i32> %ee, <32 x i32> %oe, i32 -4)
  %res_hi = tail call <64 x i32> @llvm.hexagon.V6.vshuffvdd.128B(<32 x i32> %eo, <32 x i32> %oo, i32 -4)
  %res = shufflevector <64 x i32> %res_lo, <64 x i32> %res_hi, <128 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63, i32 64, i32 65, i32 66, i32 67, i32 68, i32 69, i32 70, i32 71, i32 72, i32 73, i32 74, i32 75, i32 76, i32 77, i32 78, i32 79, i32 80, i32 81, i32 82, i32 83, i32 84, i32 85, i32 86, i32 87, i32 88, i32 89, i32 90, i32 91, i32 92, i32 93, i32 94, i32 95, i32 96, i32 97, i32 98, i32 99, i32 100, i32 101, i32 102, i32 103, i32 104, i32 105, i32 106, i32 107, i32 108, i32 109, i32 110, i32 111, i32 112, i32 113, i32 114, i32 115, i32 116, i32 117, i32 118, i32 119, i32 120, i32 121, i32 122, i32 123, i32 124, i32 125, i32 126, i32 127>
  ret <128 x i32> %res
}

Contributor

I don't think the shuffle is that bad; it's just a concat. I think we should do this for now. The interesting thing is that maybe we want to try to simplify away some of those interleaves, but the interleaving behavior of this instruction is unusual compared to the rest.

src/IR.h (outdated)
* A uint8 shuffle with 4*n lanes and indices:
* 0, 1, 2, 3, 0, 1, 2, 3, ....., 0, 1, 2, 3
* can be treated as a uint32 broadcast with n lanes.
* Return a Broadcast if possible, Expr() otherwise. */
Contributor

This should return true/false, and just accept a broadcast factor. The caller should be responsible for making a broadcast op if it wants one.
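
A hedged sketch of the boolean form being asked for (name and exact signature are assumptions):

// True if the shuffle's indices are 0, 1, ..., factor-1 repeated, which is
// the condition under which the vector can be reinterpreted as a broadcast
// of elements `factor` times as wide.
bool indices_are_repeated_prefix(const Shuffle *shuff, int factor) {
    if (factor <= 0 || shuff->indices.size() % factor != 0) {
        return false;
    }
    for (size_t i = 0; i < shuff->indices.size(); i++) {
        if (shuff->indices[i] != (int)(i % factor)) {
            return false;
        }
    }
    return true;
}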

return IRMutator::visit(op);
}

typedef struct {
Contributor

Nit: use struct Signature { ... }; also, move this to be a local declaration in the visit function.

@aankit-ca (Contributor Author)

@steven-johnson Can you pull in the changes on #5432 for testing?

@dsharletg (Contributor)

Whoops, I already started it on #5543

@aankit-ca (Contributor Author)

@dsharletg Thanks. I've pushed one more change that needs to be pulled onto the test branch. I think the current failure on the branch is unrelated to this PR; I can see the same test case failing on other PRs too (e.g. #5518).

@dsharletg dsharletg merged commit 968f6b3 into halide:master Dec 10, 2020
@alexreinking alexreinking added this to the v11.0.0 milestone Dec 11, 2020