
Prototype of multiple scattering update definitions #5553

Merged: 19 commits into master on Jan 9, 2021

Conversation

@abadams (Member) commented Dec 12, 2020

This is a prototype of being able to scatter to multiple coordinates in an update definition. You bundle the args and the values in a special intrinsic, which gets unrolled during lowering on a one-to-one basis for all the bundles that appear on the LHS and RHS (they must all have the same size and can't be nested).

The advantage of this feature is that it lets you write an update definition that requires loading multiple values, doing something to them, and then storing all of them in one step. A good example is in the test - rotating a square image in place can be done by repeatedly loading four values, rotating them, and then storing them again to the same locations. I don't think we could currently express that algorithm.
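
For concreteness, here is a rough sketch of that in-place rotation written in the scatter/gather spelling this thread eventually settles on (the image size, buffer setup, and names below are illustrative, not the PR's actual test):

    #include "Halide.h"
    using namespace Halide;

    int main() {
        const int N = 8;  // hypothetical square image size
        Buffer<int> im(N, N);
        im.for_each_element([&](int x, int y) { im(x, y) = x + N * y; });

        Var x("x"), y("y");
        Func rot("rot");
        rot(x, y) = im(x, y);

        // Each point in the top-left quadrant identifies a 4-cycle of pixels.
        RDom r(0, N / 2, 0, N / 2);
        Expr xs[] = {r.x, N - 1 - r.y, N - 1 - r.x, r.y};
        Expr ys[] = {r.y, r.x, N - 1 - r.y, N - 1 - r.x};

        // Load all four pixels of the cycle, then store each one to the next
        // position in the cycle: a 90-degree clockwise rotation in place.
        rot(scatter(xs[1], xs[2], xs[3], xs[0]),
            scatter(ys[1], ys[2], ys[3], ys[0])) =
            gather(rot(xs[0], ys[0]), rot(xs[1], ys[1]),
                   rot(xs[2], ys[2]), rot(xs[3], ys[3]));

        Buffer<int> out = rot.realize({N, N});
        return 0;
    }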

I'm opening this for discussion as a minimum viable version of the feature.

I don't think the name Call::tuple is the right thing, and there's no sugar for this yet. Tuples are already a thing, and a Tuple has mixed types, which can't work here. Any other ideas for the name or for sugar for the feature are very welcome. "pack" maybe? "bundle"?

@abadams abadams marked this pull request as draft December 12, 2020 20:16
@alexreinking (Member)

I will review soon, but we should hold off on merging experimental features until after Halide 11 is released.

@zvookin (Member) commented Dec 12, 2020

Are multiple index tuples allowed on the LHS? What kinds of Exprs are allowed inside the LHS tuples? Specifically is an RVar allowed?

First thought is that having to assert the indices are in range using unsafe_promise_clamped is problematic, especially in this instance where the range of the indices is fixed at compile time.

Second thought is we're getting a fair bit of complexity in the language embedded in argument lists for update stages. (Implicit loops on both sides.) In this case it is further obscured by using an Expr to hold "dst" so one cannot see the correspondence on that line alone. This makes it feel like unintelligible magic. Improvements to syntax will help this. E.g. something like:
sorted(x, scatter_to(min_idx, max_idx)) = scatter_from(...)
Another possibility would be to allow concatenating modifier expressions as we do in scheduling:
f(x, y).scatter(y, {min_idx, max_idx}) = scatter(y, {...});
Which would potentially allow multiple levels of scatter.

@abadams (Member, Author) commented Dec 12, 2020

Multiple LHS index tuples are allowed, but they must be the same size and not nested within each other. I'm currently also not requiring them to be outermost in the expression. E.g. the following is valid:

f(make_expr_tuple({r.x, r.x + 5}) + make_expr_tuple({7, 3}), y, make_expr_tuple({r.y, r.y + r.x})) = make_expr_tuple({a, b});

and it means that you're writing a to f(r.x + 7, y, r.y) and b to f(r.x + 5 + 3, y, r.y + r.x).

The coordinates on the LHS can be arbitrary Exprs. The rules are the same as the usual rules for things that are allowed on the LHS of an update definition. So e.g. if there's a pure var, it must also appear naked (so not in a tuple) as a LHS coordinate in the same position as in the pure def.

The unsafe_promise_clamped is only because I'm pulling things out of a Buffer where I happen to know that the buffer is all in [0, 7], but it's an input buffer, not part of the compilation, so there's no way for the compiler to know that.

Edit: Unless we're saying that the compiler should assume that embedded Buffers can never change, so it can iterate over them to get bounds. That's tricky in the JIT case where they're a reference to existing memory that the user might mutate. In the AOT case maybe we could do that if they go in ro_data. In either case that's orthogonal to this PR. The bounds of make_expr_tuple are just the bounds of its args.
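
For readers following along, a minimal sketch of the pattern being described, with hypothetical names (idx is an input buffer of indices that the author, but not the compiler, knows to lie in [0, 7]):

    ImageParam idx(Int(32), 1, "idx");
    RDom r(0, idx.dim(0).extent());
    // Promise the loaded index into [0, 7] so bounds inference gets a finite range.
    Expr i = unsafe_promise_clamped(idx(r), 0, 7);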

I think scatter_to / scatter_from are equivalent to my make_expr_tuple except for the name, assuming you only mean those two values and not the entire range in between them. The second version doesn't quite work because you could be scattering in every axis, not just one.

"scatter_to" "scatter_from" are better names than make_expr_tuple for sure.

@abadams (Member, Author) commented Dec 12, 2020

"scatter_to" and "gather_from", or just "scatter" and "gather", are other options.

@zvookin (Member) commented Dec 12, 2020

"The second version doesn't quite work because you could be scattering in every axis, not just one."

I'm not sure I get that; wouldn't that just be something like f(x, y, z).scatter(y, ...).scatter(z, ...)?

@dsharletg (Contributor)

Does it work if we just treat these the way GPUs treat short vectors? I.e. scalars broadcast across the vector, and vectors have to have the same number of elements or it's an error. This has the possibly unusual added ability of using a "vector" as an index in a load/store, but that doesn't seem like too extreme of a jump from what the GPU languages do.

That's how I thought about this at first without thinking too hard about it.

@zvookin (Member) commented Dec 12, 2020

Noting that the point of investigating expanded syntax is to spell out what is going on. To make it clear something special is happening. Especially if we think the language might do more of this. E.g. keyed map/reduce or compact/expand type stuff.

@abadams (Member, Author) commented Dec 12, 2020

Yeah these are very much like short vectors on GPU.

@abadams (Member, Author) commented Dec 12, 2020

Another piece of syntax I was thinking of was just listing the different LHS and RHS values as separate definitions somehow:

f.scatter({
  f(min_idx) = min(f(min_idx), f(max_idx));
  f(max_idx) = max(f(min_idx), f(max_idx));
});

But that looks like the two things happen serially instead of together, so it's actively misleading. But maybe there's something like that where multiple scatters naturally look like multiple assignments.

@abadams (Member, Author) commented Dec 12, 2020

What if a Tuple of FuncRef was a thing, and it had an operator= that accepted a Tuple of Exprs?

Tuple(f(min_idx), f(max_idx)) = Tuple(min(f(min_idx), f(max_idx)), max(f(min_idx), f(max_idx)));

ugh, I think I take it back.

@abadams (Member, Author) commented Dec 13, 2020

I changed the name to "gather" on the RHS and "scatter" on the LHS. It now reads as:

    sorted1(x, scatter(min_idx, max_idx)) =
        gather(min(sorted1(x, min_idx), sorted1(x, max_idx)),
               max(sorted1(x, min_idx), sorted1(x, max_idx)));

The idea is that gathers on the RHS pair with scatters on the LHS.
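
To see the renamed syntax in a fuller context, here is a hypothetical sketch (the pairs buffer, num_swaps, and input are illustrative, not the PR's actual test). For every x it sorts the 8 elements along the second dimension of input using a fixed sorting network whose compare pairs live in a small Buffer, hence the unsafe_promise_clamped idiom discussed earlier:

    const int num_swaps = 19;         // e.g. an optimal sorting network for 8 elements
    Buffer<int> pairs(2, num_swaps);  // pairs(0, i) < pairs(1, i), filled elsewhere

    ImageParam input(Int(32), 2, "input");
    Var x("x"), y("y");
    Func sorted1("sorted1");
    sorted1(x, y) = input(x, y);

    RDom r(0, num_swaps);
    Expr min_idx = unsafe_promise_clamped(pairs(0, r), 0, 7);
    Expr max_idx = unsafe_promise_clamped(pairs(1, r), 0, 7);

    // One compare-and-swap per step of the network: load both elements,
    // then store the min to the lower slot and the max to the upper slot.
    sorted1(x, scatter(min_idx, max_idx)) =
        gather(min(sorted1(x, min_idx), sorted1(x, max_idx)),
               max(sorted1(x, min_idx), sorted1(x, max_idx)));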

@steven-johnson (Contributor)

"scatter_to" and "gather_from"

That would be my first vote, with plain "scatter/gather" in second place

@abadams (Member, Author) commented Dec 14, 2020

I thought it might be best to keep it short, so you can write this sort of thing on one line:

f(x, scatter(y0, y1)) = gather(a, b) + c;

When it was longer I found myself writing things like:

Expr dst_y = scatter_to(y0, y1);
Expr ab = gather_from(a, b);
f(x, dst_y) = ab + c;

Which I think obscures what's going on. Better to have the scatter / gather present on the actual definition line.

We could force that behavior with more explicit syntax, e.g:

f.scatter({x, y0}, {x, y1}) = {a + c, b + c};

Not sure I like that specifically because it forces repetition of the +c on the RHS and x on the LHS, which is one of the things that makes Tuples annoying to use.

@alexreinking (Member)

Why not mix the two as in:

f.scatter({x, y0}, {x, y1}) = gather(a, b) + c;

@abadams (Member, Author) commented Dec 14, 2020

It's an option. I find the asymmetry a little disturbing.

Another thing I don't like about Func::scatter is that it would be nice if, in the syntax, a regular single update definition were just the degenerate case of a multi-update definition. I.e. a multiple-scattering update definition that happens to only scatter to one coord should use the same syntax as our current update definitions.

@steven-johnson (Contributor)

The multiple_scatter test is failing under WebAssembly with out-of-memory. How much memory is it allocating?

@abadams (Member, Author) commented Dec 15, 2020

Very little, so that's very odd. I'll see if I can repro.

@BachiLi (Contributor) commented Dec 17, 2020

Probably need to handle this in autodiff as well.

@steven-johnson (Contributor)

Looks like we're getting segfaults in apps/fft: https://buildbot.halide-lang.org/master/#/builders/49/builds/19

@abadams abadams marked this pull request as ready for review December 28, 2020 18:20
@abadams (Member, Author) commented Dec 28, 2020

I believe this is now ready for review

@dsharletg (Contributor) left a review comment

The implementation and idea look good to me except for the ongoing debate above re: the names. I think the syntax should just be something like:

f(tuple(...)) = g(tuple(...));

tuple is maybe not the right name for this... vector? concat?

scatter/gather just doesn't seem quite right to me. The scatter/gather-ness is already determined by being on the LHS or RHS.

However, I don't feel that strongly about this. I'd be fine with merging this as-is. This is something we could easily think of a better solution for later, and deprecate the current names. I think the core of this (short vectors that broadcast including loads/stores) is a good change.

@zvookin (Member) commented Dec 29, 2020

The scatter/gather syntax explains what is going on. I fear that the degree of magic in the language is high and increasing. If the reading of scatter and gather is helpful and not misleading, it is worth having even if it is not strictly necessary for folks who are very familiar with the language. Using tuple seems problematic to me as it overloads with the concept of tuple valued things, as in storing a group of values to a struct-like location. Not to mention the existing meaning in Halide. I'd expect just having it be a vector valued thing would be the way to go, though this introduces vectors in the user visible language, which has not been done. (And likely causes issues with the vectorization pass in lowering.)

I also like that the scatter/gather words are not too long and draw attention to the specific thing on both sides of the expression. This makes it easier to pair them up. If it is decided this is a bad idea, it is far easier to make the separate names synonyms for one common name and use the LHS vs. RHS property to distinguish them later than it is to do the opposite.

The only downside I can think of is if one wants to introduce a temporary Expr that is used on both sides. E.g.

    Expr indices = vec(y0, y1);
    out(x, indices) = in(indices + 42);

However I can't think of a use case for this.

@alexreinking (Member)

To be clear -- for a k-wide gather/scatter update, it first gets a k-wide scratch buffer and then iterates over all of the gathers jointly, recomputing the expression for each coordinate in those gathers, and storing them into the scratch buffer. Then, it does the same thing for the scatters, iterating jointly and copying the scratch to those locations in the func.

So if you have a k-wide gather/scatter, it updates k points in the func.

Now, what happens if someone writes this?

f(scatter(0, 0)) = g(gather(0, 1)); 

I'm not sure which value (g(1) or g(0)) will "win" in this case. Is it UB, or do we go left-to-right so g(1) wins? Also, how will this interact with tuples? There might be more edge cases I'm missing.

Still, I think this will be safe/sound for the core scheduling directives and a union is all that's needed for BI. Will rfactor work with this?

@abadams (Member, Author) commented Dec 29, 2020

Yes, your understanding is correct. The stores happen left to right, so in the case of collisions the rightmost value wins.
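
To make the confirmed semantics concrete, here is a small self-contained C++ simulation of how the collision example above resolves (plain C++ mimicking the described lowering, not Halide code):

    #include <array>
    #include <cstdio>

    int main() {
        // Simulate f(scatter(0, 0)) = g(gather(0, 1)) with g = {10, 20}:
        // all gathered values are computed into a scratch buffer first,
        // then stored left to right, so the rightmost write wins.
        std::array<int, 2> g = {10, 20};
        std::array<int, 2> f = {0, 0};
        const int scatter_coords[2] = {0, 0};
        const int gather_coords[2] = {0, 1};

        int scratch[2];
        for (int i = 0; i < 2; i++) scratch[i] = g[gather_coords[i]];   // gather phase
        for (int i = 0; i < 2; i++) f[scatter_coords[i]] = scratch[i];  // scatter phase

        printf("f(0) = %d\n", f[0]);  // prints 20: g(1) wins the collision
        return 0;
    }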

It's orthogonal to Tuples. gather/scatter are Exprs, so they can exist inside Tuples. All gather values / scatter indices x tuple lanes are first computed, and then all are stored.

rfactor won't recognize any of this as associative because that occurs before gather/scatter are unpacked. Any gather/scatter patterns that are in fact associative/commutative updates could be added to the table of associative ops. I'm having a hard time thinking of one right now that wouldn't be better expressed in some other way.

@abadams (Member, Author) commented Dec 29, 2020

Regarding the joint iteration thing - one wart of this approach is that if you want to store a different number of values depending on a condition:

f(select(p, scatter(0, 1), scatter(0, 1, 2))) = select(p, gather(a, b), gather(a, b, c));
...
f.specialize(p);

That's illegal because you can't jointly iterate. You have to write this instead:

f(select(p, scatter(0, 1, 1), scatter(0, 1, 2))) = select(p, gather(a, b, b), gather(a, b, c));

and then rely on dead-store elimination to cull the middle f[1] = b in the case where p is true.

If we wanted to, we could implicitly duplicate the last element of a short gather/scatter to match the longest one, but that sounds like it could hide some gnarly bugs. (e.g. if you accidentally negate the condition on the LHS vs RHS).

@alexreinking (Member)

Yes, your understanding is correct. The stores happen left to right, so in the case of collisions the rightmost value wins.

Okay, great, then I think we're definitely safe.

If we wanted to, we could implicitly duplicate the last element of a short gather/scatter to match the longest one, but that sounds like it could hide some gnarly bugs. (e.g. if you accidentally negate the condition on the LHS vs RHS).

Yeah, let's not do this. It does sound likely to hide genuine programming errors.

You have to write this instead: [...] and then rely on dead-store elimination to cull the middle [assignment] in the case where p is true.

One idea would be to let update stages in the algorithm select between several innermost assignments. Something like:

f.update({
  { p,    _(scatter(0, 1)) = gather(a, b) },
  { true, _(scatter(0, 1, 2)) = gather(a, b, c) }
});
f.specialize(p);

Where _ is a placeholder that's filled in with f by update and every update mentions the same set of RDoms. I won't say this is pretty either, but it's a bit better in that it makes the differing updates explicit, keeps the gather/scatter semantics simple, and is just as schedulable as before. I'm not sure what effect this would have on BI. After a specialize, the union operation could get moved higher up versus the select in the original.

@abadams (Member, Author) commented Dec 29, 2020

We could do that. Effectively it adds an optional switch statement to update definitions, along with the optional for loop (the RDom), and the optional inner if statement (RDom::where). But I think (after this PR!) we should stop adding more control flow features to update definitions one by one, and instead consider a holistically-designed imperative extension.

@alexreinking (Member)

But I think (after this PR!) we should stop adding more control flow features to update definitions one by one, and instead consider a holistically-designed imperative extension.

👍 to this generally.

@alexreinking (Member) left a review comment

Looks good to me, with a couple small improvements I'd like made:

  • the "rightmost write wins" semantics should be documented both in the Doxygen doc string and as an explicit test so that the behavior isn't inadvertently changed.
  • I'm pretty sure f(0) = gather(1,2) is nonsense, but f(scatter(0,1)) = 2 looks like a broadcast. If the second example is allowed, it should be tested and documented. If it is not, it should become an error test (same with the no-scatter first example).

Inline review comments (resolved) on src/Simplify_Stmts.cpp, test/correctness/multiple_scatter.cpp, and src/IROperator.h.
@alexreinking (Member) commented Dec 31, 2020

Since the f(/* no scatter */) = gather(e1, ..., en) case is allowed, should we try to guarantee that gather(e1, ..., en) is replaced by en before BI sees it, so we don't get unnecessary widening / math?

@abadams (Member, Author) commented Dec 31, 2020

Good question... I think it's in the same category as things like 0 * f(x), or select(provably_false_expr, f(x), g(x)). I'm not too worried about bounds inference being tight on silly code. Normally I only write things like 0 * f(x) in order to add false dependence while debugging something.

@alexreinking (Member)

I'm not too worried about bounds inference being tight on silly code.

Yeah, I asked mostly because the test imagined it as a degenerate case of some generic (metaprogramming, I assume) code. That might justify treating this case favorably. OTOH that work could be done in a separate PR.

@alexreinking alexreinking added this to the v12.0.0 milestone Jan 1, 2021
@alexreinking alexreinking added the "enhancement" label (New user-visible features or improvements to existing features) Jan 1, 2021
@steven-johnson (Contributor)

Post-holiday check-in: where does this PR stand? Buildbots are clean. Is there more work to be done?

@abadams (Member, Author) commented Jan 5, 2021

It's done and ready to merge. Alex wanted to cut the branch for release 11 before merging, so it's just waiting on that.

@abadams abadams merged commit cc03c9c into master Jan 9, 2021