
Add support for AVX-512 VNNI saturating dot products #5807

Merged
merged 14 commits from the saturating_dot_products branch into halide:master
Mar 30, 2021

Conversation

mcleary
Contributor

@mcleary mcleary commented Mar 15, 2021

This commit adds support for the Intel VNNI saturating dot product
instructions vpdpbusds and vpdpwssds.

This was accomplished by adding a new VectorReduce operation
to perform the saturating_add and exposing a new inline reduction,
saturating_sum. Users can then write

RDom r(0, 4);
f(x) = saturating_sum(i32(0), i16(i8(g(x + r)) * u8(h(x + r))))

bool override_associativity_test = true;
int vector_width = 4;
Var xo, xi;
f.update()
.split(x, xo, xi, vector_width)
.atomic(override_associativity_test)
.vectorize(r)
.vectorize(xi);

to lower the expression into a call to vpdpbusds.

Note that override_associativity_test must be set to true, otherwise Halide will fail
to prove the associativity of the saturating_add operation.

Add support for VectorReduce::SaturatingAdd in CodeGen_LLVM

Code is correctly generated when no intrinsic is available to perform
a saturating dot product.

Add vpdpbusds and vpdpwssds tests to simd_op_check

Test if the saturating dot product instructions are being generated
for AVX512_SapphireRapids targets

src/CodeGen_X86.cpp (outdated review thread)
@@ -1093,6 +1093,12 @@ class VectorSubs : public IRMutator {
reduce_op = VectorReduce::Or;
}
}
} else if (const Call *call_op = store->value.as<Call>()) {
if (call_op->is_intrinsic(Call::saturating_add)) {
Contributor

We should probably run find_intrinsics before vectorize_loops in lowering, otherwise this will not work if people write saturating add as a pattern they expect to match rather than using the (currently Halide::Internal) intrinsic.

This is a fairly big change to make, because the simplifier and other lowering passes don't understand saturating_add (yet? @rootjalex @abadams).

Contributor Author

That sounds reasonable, I will try to move find_intrinsics earlier in the pipeline and I will rerun the tests.

Contributor Author

@mcleary mcleary Mar 17, 2021

@dsharletg I placed the find_intrinsics pass right before vectorize_loops, thanks for the suggestion (b880a61)

Tests are passing locally, I will keep an eye on the buildbots

@rootjalex
Member

rootjalex commented Mar 15, 2021

One of the failures was a compilation failure in Bounds.cpp, on handling the bounds of this new vector op (because this case is not covered). A quick fix would be just adding:

case VectorReduce::SaturatingAdd:
    bounds_of_type(op->value.type());
    break;

(essentially, give up on the bounds of this type).
If we had a saturating_mul, we could probably do something smart along the lines of the VectorReduce::Add's case, but I don't think Halide has that intrinsic.
Alternatively, a can_prove(interval.max < op->type.max() / factor) is probably sufficient for reusing VectorReduce::Add's casework for max (and a similar can_prove() guard for min).

@rootjalex
Member

rootjalex commented Mar 15, 2021

Looks like Monotonic.cpp has a switch statement on the VectorReduce enum that also needs to be updated. The SaturatingAdd case can be the same as the VectorReduce::Add case, as it is a monotonic reduction as well. (right @dsharletg ?)

@rootjalex
Member

@mcleary The switch statement in Simplify_Exprs.cpp that is missing the SaturatingAdd case could use the saturating_mul implementation from Simplify_Mul.cpp for the bounds->min *= factor; and bounds->max *= factor; expressions that it will need.

@mcleary
Contributor Author

mcleary commented Mar 16, 2021

@mcleary The switch statement in Simplify_Exprs.cpp that is missing the SaturatingAdd case could use the saturating_mul implementation from Simplify_Mul.cpp for the bounds->min *= factor; and bounds->max *= factor; expressions that it will need.

Thanks for the feedback. I will make sure to address this issue today.

Could you give me any hints on what should be done, if anything, with the second switch (op->op) in Simplify_Exprs.cpp? I thought compilation would also fail for that switch, since it doesn't handle VectorReduce::Mul, for example.

https://github.com/halide/Halide/blob/master/src/Simplify_Exprs.cpp#L116

@mcleary
Contributor Author

mcleary commented Mar 16, 2021

Looks like Monotonic.cpp has a switch statement on the VectorReduce enum that also needs to be updated. The SaturatingAdd case can be the same as the VectorReduce::Add case, as it is a monotonic reduction as well. (right @dsharletg ?)

Thanks for pointing out this one. I will add the SaturatingAdd case to be the same as Add since they are indeed the same.

@rootjalex
Member

Could you give me any hints on what should be done, if anything, with the second switch (op->op) in Simplify_Exprs.cpp? I thought compilation would also fail for that switch, since it doesn't handle VectorReduce::Mul, for example.

https://github.com/halide/Halide/blob/master/src/Simplify_Exprs.cpp#L116

The second switch statement has a default case that covers both VectorReduce::Mul and now VectorReduce::SaturatingAdd; it doesn't attempt to apply any simplifier rules, so it can probably be left as is until simplifier rules for this operation are either thought up or generated.

I am surprised that we don't have any VectorReduce::Mul rules though.

@rootjalex
Member

One of the failures was a compilation failure in Bounds.cpp, on handling the bounds of this new vector op (because this case is not covered). A quick fix would be just adding:

case VectorReduce::SaturatingAdd:
    bounds_of_type(op->value.type());
    break;

(essentially, give up on the bounds of this type).
If we had a saturating_mul, we could probably do something smart along the lines of the VectorReduce::Add's case, but I don't think Halide has that intrinsic.
Alternatively, a can_prove(interval.max < op->type.max() / factor) is probably sufficient for reusing VectorReduce::Add's casework for max (and a similar can_prove() guard for min).

@mcleary I spoke to @abadams about this yesterday - he agreed that this switch statement:

switch (op->op) {

Should simply give up for now:

case VectorReduce::SaturatingAdd:
    bounds_of_type(op->value.type());
    break;

@mcleary
Contributor Author

mcleary commented Mar 16, 2021

One of the failures was a compilation failure in Bounds.cpp, on handling the bounds of this new vector op (because this case is not covered). A quick fix would be just adding:

case VectorReduce::SaturatingAdd:
    bounds_of_type(op->value.type());
    break;

(essentially, give up on the bounds of this type).
If we had a saturating_mul, we could probably do something smart along the lines of the VectorReduce::Add's case, but I don't think Halide has that intrinsic.
Alternatively, a can_prove(interval.max < op->type.max() / factor) is probably sufficient for reusing VectorReduce::Add's casework for max (and a similar can_prove() guard for min).

@mcleary I spoke to @abadams about this yesterday - he agreed that this switch statement:

switch (op->op) {

Should simply give up for now:

case VectorReduce::SaturatingAdd:
    bounds_of_type(op->value.type());
    break;

That sounds good; to avoid repeating code I just let the SaturatingAdd case fall through to VectorReduce::Mul, since they are going to do the same thing (4a0074a). If you think that making the SaturatingAdd case explicit is better I can change it.

@mcleary
Contributor Author

mcleary commented Mar 16, 2021

@mcleary The switch statement in Simplify_Exprs.cpp that is missing the SaturatingAdd case could use the saturating_mul implementation from Simplify_Mul.cpp for the bounds->min *= factor; and bounds->max *= factor; expressions that it will need.

I moved the saturating_mul function from Simplify_Mul.cpp to Simplify_Internal.h to make it visible to Simplify_Expr.cpp (d220cca). I just inlined the function, but let me know if you want it to be declared in the header or placed somewhere else.

@rootjalex
Member

That sounds good; to avoid repeating code I just let the SaturatingAdd case fall through to VectorReduce::Mul, since they are going to do the same thing (4a0074a). If you think that making the SaturatingAdd case explicit is better I can change it.

What you did works fine, no need to change it

@rootjalex
Member

@mcleary The switch statement in Simplify_Exprs.cpp that is missing the SaturatingAdd case could use the saturating_mul implementation from Simplify_Mul.cpp for the bounds->min *= factor; and bounds->max *= factor; expressions that it will need.

I moved the saturating_mul function from Simplify_Mul.cpp to Simplify_Internal.h to make it visible to Simplify_Expr.cpp (d220cca). I just inlined the function, but let me know if you want it to be declared in the header or placed somewhere else.

Hmm, I would probably not inline it, but I don't know where best to put it. @abadams or @dsharletg might have opinions on that?

@mcleary
Contributor Author

mcleary commented Mar 16, 2021

The buildbot/halide-testbranch-main-llvm11-x86-64-linux-cmake build is failing the correctness_async_device_copy test, which I don't think is related to my code changes, since the test passes with the host-opencl target for me locally. Could this be a random failure?

@steven-johnson
Contributor

The buildbot/halide-testbranch-main-llvm11-x86-64-linux-cmake build is failing the correctness_async_device_copy test, which I don't think is related to my code changes, since the test passes with the host-opencl target for me locally. Could this be a random failure?

Yeah, this seems to be unrelated, I'm trying to track down the cause, but I think you can ignore this specific failure for now.

src/Lower.cpp (outdated review thread)
@@ -332,6 +332,11 @@ Module lower(const vector<Function> &output_funcs,
debug(2) << "Lowering after unrolling:\n"
<< s << "\n\n";

debug(1) << "Finding intrinsics...\n";
Contributor

Nit: I think this should be above unrolling, just because unrolling and vectorizing are kind of similar, and finding intrinsics should be irrelevant for unrolling (if anything it might make the IR smaller before unrolling, which is kind of nice I guess).

Contributor Author

Thanks for the suggestion. I noticed a failure in the simd_op_check when I ran it with the Sapphire Rapids target enabled after moving the find_intrinsics pass. I will investigate this today and report back.

Contributor Author

@mcleary mcleary Mar 18, 2021

When I ran the tests in #5807 (comment) I forgot to enable the correct target to trigger the generation of the new instructions. After some experimentation I found this.

This first block shows Lower.cpp without modifications, so we have unrolling, vectorize and find_intrinsics in that order. As you can see, find_intrinsics is able to replace the multiplication with a widening_mul, and this is the pattern expected in CodeGen_X86.

Find Intrinsics After Vectorize Loops

// Unrolling
f[(f.s1.x.v1*16) + f.s1.x.v2] =
         let t7.s = int16((((((f.s1.x.v1*16) + f.min.0) + f.s1.x.v2)*2) + f.s1.rdom$x)) in
         saturating_add(f[(f.s1.x.v1*16) + f.s1.x.v2], int32(t7.s)*int32(t7.s))

// Vectorize
f[ramp(f.s1.x.v1*16, 1, 16) aligned(16, 0)] = (int32x16)saturating_add(f[ramp(f.s1.x.v1*16, 1, 16) aligned(16, 0)], (int32x16)vector_reduce(SaturatingAdd, (int32x32((int16x32)t7.s.widened.f.s1.rdom$x)*int32x32((int16x32)t7.s.widened.f.s1.rdom$x))))

// Find Intrinsics
f[ramp(f.s1.x.v1*16, 1, 16) aligned(16, 0)] = (int32x16)saturating_add(f[ramp(f.s1.x.v1*16, 1, 16) aligned(16, 0)], (int32x16)vector_reduce(SaturatingAdd, (int32x32)widening_mul((int16x32)t7.s.widened.f.s1.rdom$x, (int16x32)t7.s.widened.f.s1.rdom$x)))

When I placed find_intrinsics before vectorize/unrolling I don't get the widening_mul in the expression, which won't match the expected pattern; thus, simd_op_check will fail for the expression

check("vpdpwssds*zmm", 16, saturating_sum(i32(0), i32(in_i16(2 * x + r)) * in_i16(2 * x + r + 32)));

Find Intrinsics Before Vectorize Loops

// Unrolling
f[(f.s1.x.v1*16) + f.s1.x.v2] =
         let t7.s = int16((((((f.s1.x.v1*16) + f.min.0) + f.s1.x.v2)*2) + f.s1.rdom$x)) in
         saturating_add(f[(f.s1.x.v1*16) + f.s1.x.v2], int32(t7.s)*int32(t7.s))

// Find Intrinsics
f[(f.s1.x.v1*16) + f.s1.x.v2] =
         let t7.s = int16((((((f.s1.x.v1*16) + f.min.0) + f.s1.x.v2)*2) + f.s1.rdom$x)) in
         saturating_add(f[(f.s1.x.v1*16) + f.s1.x.v2], int32(t7.s)*int32(t7.s))

// Vectorize
let t7.s.widened.f.s1.rdom$x = int16x32(ramp(((f.s1.x.v1*16) + f.min.0)*2, 1, 32))
f[ramp(f.s1.x.v1*16, 1, 16) aligned(16, 0)] = (int32x16)saturating_add(f[ramp(f.s1.x.v1*16, 1, 16) aligned(16, 0)], (int32x16)vector_reduce(SaturatingAdd, (int32x32((int16x32)t7.s.widened.f.s1.rdom$x)*int32x32((int16x32)t7.s.widened.f.s1.rdom$x))))

A quick solution is to repeat the find_intrinsics pass after vectorize_loops, but I was wondering if you could give me any hints on what to look at to fix this and keep find_intrinsics before unrolling.

Another possible solution is to add more overloads to the list of patterns, something like

{VectorReduce::SaturatingAdd, 2, i32(wild_i16x_) * i32(wild_i16x_), "saturating_dot_product", {}, Pattern::CombineInit},
{VectorReduce::SaturatingAdd, 2, i32(widening_mul(wild_i16x_, wild_i16x_)), "saturating_dot_product", {}, Pattern::CombineInit},

to cover the cases where a widening_mul is not present, or even replace the widening_mul by the explicit multiplication.

Contributor Author

@mcleary mcleary Mar 18, 2021

As I suspected, just moving the find_intrinsics pass will break the tests, not only for the newly added saturating dot product, but for other intrinsics as well.

https://buildbot.halide-lang.org/master/#/builders/53/builds/115/steps/12/logs/stdio

I guess both of my proposed solutions work but I would like to know your thoughts as well @dsharletg

Contributor

It looks like VectorizeLoops.cpp looks for particular expressions in quite a few places that would also need to look for intrinsics, for example here: https://github.com/halide/Halide/blob/master/src/VectorizeLoops.cpp#L1067

I don't think we should add new overloads to the list of patterns. That's basically just replicating the logic in find_intrinsics, which is non-trivial.

Contributor Author

@dsharletg I rebased and reverted the last two commits to avoid the failure in the tests. I understand your reasoning for moving the find_intrinsics pass earlier in the pipeline, but that doesn't seem to work well.

Thales Sabino added 8 commits March 22, 2021 09:35
Contributor

@dsharletg dsharletg left a comment


I can't reply to the existing thread about where we do find_intrinsics, but I'm fine with leaving this as-is for now, despite missing patterns for saturating add for vector reductions. I just have a few nits about the new saturating_sum.

@@ -36,6 +36,7 @@ namespace Halide {
*/
//@{
Expr sum(Expr, const std::string &s = "sum");
Expr saturating_sum(const Expr &init_val, Expr e, const std::string &s = "saturating_sum");
Contributor

Does this need an init_val in some way that the other inline reductions don't? I think it would be better if this was consistent with the other inline reductions.

Contributor Author

@mcleary mcleary Mar 23, 2021

Thinking about it, you are right. init_val is not required, I will revert it. When I first wrote this function I had the expression saturating_sum(u8(f(x)) * i8(f(x))) in mind. When using groups of four 8-bit integers and a 32-bit accumulator you can never saturate the result unless init_val is something close to INT_MAX, but I didn't consider the general case where this won't map directly to the VNNI instructions. I guess this ties in with my question #5825, since inline reductions are basically a shortcut for some common expression patterns.

init_val removed in a16d661

@@ -208,7 +210,7 @@ class SimdOpCheckTest {
g.compute_at(f, x)
.update()
.split(x, xo, xi, vector_width)
.atomic()
.atomic(has_inline_reduction.override_associativity_test)
Contributor

I would just make this unconditionally true, rather than check for saturating_sum.

Contributor

Thinking about this more, this does indicate that saturating dot products are kind of a problem... are they actually useful? It seems like only an unsigned saturating sum would be useful, a signed saturating sum is going to do weird things when there are mixes of positive and negative terms in the summation. (But this instruction set has only signed saturating sums!)

The result could saturate at the max after adding a bunch of positive terms; then, after a bunch of negative terms, the result will be something far from the max, which is incorrect if the positive terms outweigh the negative terms.

Member

@abadams abadams Mar 22, 2021

My understanding was that it's only unsigned saturating dot products that are really useful. signed ones aren't even associative!

Contributor Author

From my understanding of the operation, at the instruction level the saturation occurs at the end, after all the products are accumulated, so the associativity of the summation terms within one instruction is not really a problem. A signed saturating sum is indeed not associative, but the ability to override the associativity test makes it explicit that you know what you are doing.

I didn't want to set it to always be true to avoid interfering with existing tests.

My understanding was that it's only unsigned saturating dot products that are really useful. signed ones aren't even associative!

I don't know much about the field, but saturating dot products, both signed and unsigned, are used in VoIP applications. You are right though, they are not associative.

Contributor

atomic(true) won't break anything that atomic() didn't. Basically setting that to always true reduces coverage of atomic()'s associativity test, but IMO it's better to just accept that (very slight) loss in coverage to avoid the extra logic/possibly brittle test behavior.

Regarding associativity: I agree if this instruction were only a 4-way dot product, the saturation is uninteresting. But this is an accumulating dot product instruction, so it can be used in an arbitrarily large loop, and presumably it would be in ML applications (I assume the reason this instruction is being added). In that case, the saturation is hard to reason about.

Contributor Author

atomic(true) won't break anything that atomic() didn't. Basically setting that to always true reduces coverage of atomic()'s associativity test, but IMO it's better to just accept that (very slight) loss in coverage to avoid the extra logic/possibly brittle test behavior.

That sounds reasonable, I've set it to true in a1d2685

@mcleary
Contributor Author

mcleary commented Mar 23, 2021

@mcleary The switch statement in Simplify_Exprs.cpp that is missing the SaturatingAdd case could use the saturating_mul implementation from Simplify_Mul.cpp for the bounds->min *= factor; and bounds->max *= factor; expressions that it will need.

I moved the saturating_mul function from Simplify_Mul.cpp to Simplify_Internal.h to make it visible to Simplify_Expr.cpp (d220cca). I just inlined the function, but let me know if you want it to be declared in the header or placed somewhere else.

Hmm, I would probably not inline it, but I don't know where best to put it. @abadams or @dsharletg might have opinions on that?

Are there any other thoughts on this? I'm happy to change it if needed

@steven-johnson
Contributor

The OSX failure is unrelated (will be fixed by #5841), should be good to land

@mcleary
Contributor Author

mcleary commented Mar 24, 2021

The OSX failure is unrelated (will be fixed by #5841), should be good to land

That's excellent. I will merge master here to make sure everything is ok.

@@ -28,6 +28,20 @@
namespace Halide {
namespace Internal {

namespace {
Contributor

I think this should not be in an anonymous namespace. This will cause the linker to put multiple copies of this function in the compiled library (normally, functions with inline linkage are not duplicated, except when they are inlined of course).

Contributor Author

Do you prefer to have the function defined in some translation unit? Because of the way the Simplify pass is split, I'm not sure where to put it.

Contributor

I think inline is fine (if it had to go somewhere, I guess Simplify.cpp is the right place).

@dsharletg
Contributor

dsharletg commented Mar 29, 2021

BTW, I've added a use case of this in the interpret_nn branch: https://github.com/halide/Halide/blob/interpret_nn/apps/hannk/halide/convolution_generator.cpp#L78. I haven't tested it so I have no idea if it works or is fast... but it would be an interesting thing to try :)

edit: I guess this PR is specifically for saturating dot products, which that doesn't use, it's using the previous VNNI dot product support.

@mcleary
Contributor Author

mcleary commented Mar 30, 2021

BTW, I've added a use case of this in the interpret_nn branch: https://github.com/halide/Halide/blob/interpret_nn/apps/hannk/halide/convolution_generator.cpp#L78. I haven't tested it so I have no idea if it works or is fast... but it would be an interesting thing to try :)

edit: I guess this PR is specifically for saturating dot products, which that doesn't use, it's using the previous VNNI dot product support.

I will give it a try today. Thanks for pointing it out.

@mcleary
Contributor Author

mcleary commented Mar 30, 2021

BTW, I've added a use case of this in the interpret_nn branch: https://github.com/halide/Halide/blob/interpret_nn/apps/hannk/halide/convolution_generator.cpp#L78. I haven't tested it so I have no idea if it works or is fast... but it would be an interesting thing to try :)

edit: I guess this PR is specifically for saturating dot products, which that doesn't use, it's using the previous VNNI dot product support.

This might be a little bit out of scope, but how do I run this app? I built everything using Make but I can't see an executable. When I tried the scripts, they asked for adb. Does it only work on Android?

@dsharletg
Contributor

It should work on x86 too. Here's how I run it:

cd apps/hannk
HL_TARGET=host make bin/host/benchmark -j8
bin/host/benchmark ~/mobilenet_v2_1.0_224_quant.tflite

Where that .tflite file is just the mobilenet v2 downloaded from https://www.tensorflow.org/lite/guide/hosted_models.

To try using the VNNI dot products, it should just require the right target flag added to HL_TARGET (and a machine that can run it).

The branch is in a rough WIP state, so the documentation and functionality are spotty :)

@dsharletg dsharletg merged commit 2dd7a6b into halide:master Mar 30, 2021
@mcleary mcleary deleted the saturating_dot_products branch April 7, 2021 09:27
@abadams abadams added the release_notes For changes that may warrant a note in README for official releases. label Apr 20, 2021
@alexreinking alexreinking added this to the v12.0.0 milestone May 19, 2021
@alexreinking alexreinking removed the release_notes For changes that may warrant a note in README for official releases. label Apr 6, 2022