Add support for AVX-512 VNNI saturating dot products #5807
Conversation
@@ -1093,6 +1093,12 @@ class VectorSubs : public IRMutator {
                reduce_op = VectorReduce::Or;
            }
        }
    } else if (const Call *call_op = store->value.as<Call>()) {
        if (call_op->is_intrinsic(Call::saturating_add)) {
We should probably run find_intrinsics before vectorize_loops in lowering, otherwise this will not work if people write saturating add as a pattern they expect to match rather than using the (currently Halide::Internal) intrinsic.
This is a fairly big change to make, because the simplifier and other lowering passes don't understand saturating_add (yet? @rootjalex @abadams).
That sounds reasonable. I will try to move find_intrinsics earlier in the pipeline and run the tests.
@dsharletg I placed the find_intrinsics pass right before vectorize_loops, thanks for the suggestion (b880a61). Tests are passing locally; I will keep an eye on the buildbots.
One of the failures was a compilation failure in Bounds.cpp, on handling the bounds of this new vector op (because this case is not covered). A quick fix would be just adding: … (essentially, give up on the bounds of this type).
Looks like Monotonic.cpp has a switch statement on the …
@mcleary The switch statement in Simplify_Exprs.cpp that is missing the …
Thanks for the feedback. I will make sure to address this issue today. Could you give me any hints on what should be done, if anything, with the second? https://github.com/halide/Halide/blob/master/src/Simplify_Exprs.cpp#L116
Thanks for pointing out this one. I will add the …
The second switch statement has a … I am surprised that we don't have any …
@mcleary I spoke to @abadams about this yesterday; he agreed that this switch statement (Line 1576 in 5aa1a65) should simply give up for now.
That sounds good; to avoid repeating code I just left the …
I moved the …
What you did works fine, no need to change it.
Hmm, I would probably not inline it, but I don't know where best to put it. @abadams or @dsharletg might have opinions on that?
Yeah, this seems to be unrelated. I'm trying to track down the cause, but I think you can ignore this specific failure for now.
src/Lower.cpp
Outdated
@@ -332,6 +332,11 @@ Module lower(const vector<Function> &output_funcs,
    debug(2) << "Lowering after unrolling:\n"
             << s << "\n\n";

    debug(1) << "Finding intrinsics...\n";
Nit: I think this should be above unrolling, just because unrolling and vectorizing are kind of similar, and finding intrinsics should be irrelevant for unrolling (if anything it might make the IR smaller before unrolling, which is kind of nice I guess).
Thanks for the suggestion. I noticed a failure in the simd_op_check when I ran it with the Sapphire Rapids target enabled after moving the find_intrinsics pass. I will investigate this today and report back.
When I ran the tests in #5807 (comment) I forgot to enable the correct target to trigger the generation of the new instructions. After some experimentation I found this.
The first block is supposed to be Lower.cpp without modifications, so we have unrolling, vectorize, and find_intrinsics. As you can see, find_intrinsics is able to replace the multiplication with a widening_mul, and this is the pattern expected in CodeGen_X86.
Find Intrinsics After Vectorize Loops
// Unrolling
f[(f.s1.x.v1*16) + f.s1.x.v2] =
let t7.s = int16((((((f.s1.x.v1*16) + f.min.0) + f.s1.x.v2)*2) + f.s1.rdom$x)) in
saturating_add(f[(f.s1.x.v1*16) + f.s1.x.v2], int32(t7.s)*int32(t7.s))
// Vectorize
f[ramp(f.s1.x.v1*16, 1, 16) aligned(16, 0)] = (int32x16)saturating_add(f[ramp(f.s1.x.v1*16, 1, 16) aligned(16, 0)], (int32x16)vector_reduce(SaturatingAdd, (int32x32((int16x32)t7.s.widened.f.s1.rdom$x)*int32x32((int16x32)t7.s.widened.f.s1.rdom$x))))
// Find Intrinsics
f[ramp(f.s1.x.v1*16, 1, 16) aligned(16, 0)] = (int32x16)saturating_add(f[ramp(f.s1.x.v1*16, 1, 16) aligned(16, 0)], (int32x16)vector_reduce(SaturatingAdd, (int32x32)widening_mul((int16x32)t7.s.widened.f.s1.rdom$x, (int16x32)t7.s.widened.f.s1.rdom$x)))
When I placed find_intrinsics before vectorize/unrolling I don't get the widening_mul in the expression, which won't match the expected pattern; thus, simd_op_check will fail for the expression
check("vpdpwssds*zmm", 16, saturating_sum(i32(0), i32(in_i16(2 * x + r)) * in_i16(2 * x + r + 32)));
Find Intrinsics Before Vectorize Loops
// Unrolling
f[(f.s1.x.v1*16) + f.s1.x.v2] =
let t7.s = int16((((((f.s1.x.v1*16) + f.min.0) + f.s1.x.v2)*2) + f.s1.rdom$x)) in
saturating_add(f[(f.s1.x.v1*16) + f.s1.x.v2], int32(t7.s)*int32(t7.s))
// Find Intrinsics
f[(f.s1.x.v1*16) + f.s1.x.v2] =
let t7.s = int16((((((f.s1.x.v1*16) + f.min.0) + f.s1.x.v2)*2) + f.s1.rdom$x)) in
saturating_add(f[(f.s1.x.v1*16) + f.s1.x.v2], int32(t7.s)*int32(t7.s))
// Vectorize
let t7.s.widened.f.s1.rdom$x = int16x32(ramp(((f.s1.x.v1*16) + f.min.0)*2, 1, 32))
f[ramp(f.s1.x.v1*16, 1, 16) aligned(16, 0)] = (int32x16)saturating_add(f[ramp(f.s1.x.v1*16, 1, 16) aligned(16, 0)], (int32x16)vector_reduce(SaturatingAdd, (int32x32((int16x32)t7.s.widened.f.s1.rdom$x)*int32x32((int16x32)t7.s.widened.f.s1.rdom$x))))
A quick solution is to repeat the find_intrinsics pass after vectorize_loops, but I was wondering if you could give me any hints on what to look at to fix this and keep find_intrinsics before unrolling.
Another possible solution is to add more overloads to the list of patterns, something like
{VectorReduce::SaturatingAdd, 2, i32(wild_i16x_) * i32(wild_i16x_), "saturating_dot_product", {}, Pattern::CombineInit},
{VectorReduce::SaturatingAdd, 2, i32(widening_mul(wild_i16x_, wild_i16x_)), "saturating_dot_product", {}, Pattern::CombineInit},
to cover the cases where a widening_mul is not present, or even replace the widening_mul with the explicit multiplication.
As I suspected, just moving the find_intrinsics pass will break the tests, not only for the newly added saturating dot product, but for other intrinsics as well.
https://buildbot.halide-lang.org/master/#/builders/53/builds/115/steps/12/logs/stdio
I guess both of my proposed solutions work, but I would like to know your thoughts as well @dsharletg.
It looks like VectorizeLoops.cpp looks for particular expressions in quite a few places that would also need to look for intrinsics, for example here: https://github.com/halide/Halide/blob/master/src/VectorizeLoops.cpp#L1067
I don't think we should add new overloads to the list of patterns. That's basically just replicating the logic in find_intrinsics, which is non-trivial.
@dsharletg I rebased and reverted the last two commits to avoid the failure in the tests. I understand your reasoning for moving the find_intrinsics pass earlier in the pipeline, but that doesn't seem to work well.
(force-pushed from ddd00ad to 872c5a2)
I can't reply to the existing thread about where we do find_intrinsics, but I'm fine with leaving this as-is for now, despite missing patterns for saturating add for vector reductions. I just have a few nits about the new saturating_sum.
src/InlineReductions.h
Outdated
@@ -36,6 +36,7 @@ namespace Halide {
 */
//@{
Expr sum(Expr, const std::string &s = "sum");
Expr saturating_sum(const Expr &init_val, Expr e, const std::string &s = "saturating_sum");
Does this need an init_val in some way that the other inline reductions don't? I think it would be better if this was consistent with the other inline reductions.
Thinking about it, you are right: init_val is not required, I will revert it. When I first wrote this function I had the expression saturating_sum(u8(f(x)) * i8(f(x))) in mind. When using groups of 4 8-bit integers and a 32-bit accumulator you can never saturate the result unless init_val is something close to INT_MAX, but I didn't consider the general case where this won't map directly to the VNNI instructions. I guess this ties in with my question #5825, since inline reductions are basically a shortcut for some common expression patterns.
init_val removed in a16d661
test/correctness/simd_op_check.h
Outdated
@@ -208,7 +210,7 @@ class SimdOpCheckTest {
    g.compute_at(f, x)
        .update()
        .split(x, xo, xi, vector_width)
-       .atomic()
+       .atomic(has_inline_reduction.override_associativity_test)
I would just make this unconditionally true, rather than check for saturating_sum.
Thinking about this more, this does indicate that saturating dot products are kind of a problem... are they actually useful? It seems like only an unsigned saturating sum would be useful, a signed saturating sum is going to do weird things when there are mixes of positive and negative terms in the summation. (But this instruction set has only signed saturating sums!)
The result could saturate at the max while adding a bunch of positive terms, and then after adding a bunch of negative terms the result will be something far from the max; but that's incorrect if the positive terms are bigger than the negative terms.
My understanding was that it's only unsigned saturating dot products that are really useful. signed ones aren't even associative!
From my understanding of the operation, at the instruction level the saturation occurs at the end, after the element products are summed, so the associativity of the terms within one instruction is not really a problem. A signed saturating sum is indeed not associative, but the ability to override the associativity test makes it explicit that you know what you are doing.
I didn't want to set it to always be true to avoid interfering with existing tests.
My understanding was that it's only unsigned saturating dot products that are really useful. signed ones aren't even associative!
I don't know much about the field, but saturating dot products, both signed and unsigned, are used in VoIP applications. You are right though, they are not associative.
atomic(true) won't break anything that atomic() didn't. Basically setting that to always true reduces coverage of atomic()'s associativity test, but IMO it's better to just accept that (very slight) loss in coverage to avoid the extra logic/possibly brittle test behavior.
Regarding associativity: I agree if this instruction were only a 4-way dot product, the saturation is uninteresting. But this is an accumulating dot product instruction, so it can be used in an arbitrarily large loop, and presumably it would be in ML applications (I assume the reason this instruction is being added). In that case, the saturation is hard to reason about.
atomic(true) won't break anything that atomic() didn't. Basically setting that to always true reduces coverage of atomic()'s associativity test, but IMO it's better to just accept that (very slight) loss in coverage to avoid the extra logic/possibly brittle test behavior.
That sounds reasonable, I've set it to true in a1d2685
Are there any other thoughts on this? I'm happy to change it if needed.
The OSX failure is unrelated (will be fixed by #5841), should be good to land.
That's excellent. I will merge master here to make sure everything is OK.
src/Simplify_Internal.h
Outdated
@@ -28,6 +28,20 @@
namespace Halide {
namespace Internal {

namespace {
I think this should not be in an anonymous namespace. This will cause the linker to put multiple copies of this function in the compiled library (normally, functions with inline linkage are not duplicated, except when they are inlined of course).
Do you prefer to have the function defined in some translation unit? Because the way the Simplify pass is split I'm not sure where to put it.
I think inline is fine (if it had to go somewhere, I guess Simplify.cpp is the right place).
BTW, I've added a use case of this in the interpret_nn branch: https://github.com/halide/Halide/blob/interpret_nn/apps/hannk/halide/convolution_generator.cpp#L78. I haven't tested it so I have no idea if it works or is fast... but it would be an interesting thing to try :) edit: I guess this PR is specifically for saturating dot products, which that doesn't use; it's using the previous VNNI dot product support.
I will give it a try today. Thanks for pointing it out.
This might be a little bit out of scope, but how do I run this app? I built everything using Make but I can't see an executable. When I tried the scripts, they asked for adb. Does it only work on Android?
It should work on x86 too. Here's how I run it:
Where that .tflite file is just the mobilenet v2 downloaded from https://www.tensorflow.org/lite/guide/hosted_models. To try using the VNNI dot products, it should just require the right target flag added to HL_TARGET (and a machine that can run it). The branch is in a rough WIP state, so the documentation and functionality is spotty :) |
This commit adds support for Intel VNNI saturating dot product
instructions vpdpbusds and vpdpwssds
This was accomplished by adding a new VectorReduce operation
to perform the saturating_add and exposing a new inline reduction,
saturating_sum. Users can then write
RDom r(0, 4);
f(x) = saturating_sum(i32(0), i16(i8(g(x + r)) * u8(h(x + r))))
bool override_associativity_test = true;
int vector_width = 4;
Var xo, xi;
f.update()
.split(x, xo, xi, vector_width)
.atomic(override_associativity_test)
.vectorize(r)
.vectorize(xi);
to lower the expression into a call to vpdpbusds.
Note that override_associativity_test is set to true, or Halide will fail
to prove the associativity of the saturating_add operation.
Add support for VectorReduce::SaturatingAdd in CodeGen_LLVM
Code is correctly generated when no intrinsic is available to perform
a saturating dot product.
Add vpdpbusds, vpdpwssds tests to simd_op_check
Test if the saturating dot product instructions are being generated
for AVX512_SapphireRapids targets