
Prototype of multiple scattering update definitions #5553

Merged: 19 commits into master on Jan 9, 2021

Conversation

@abadams (Member) commented Dec 12, 2020

This is a prototype of being able to scatter to multiple coordinates in an update definition. You bundle the args and the values in a special intrinsic, which gets unrolled during lowering on a one-to-one basis for all the bundles that appear on the LHS and RHS (they must all have the same size and can't be nested).

The advantage of this feature is that it lets you write an update definition that requires loading multiple values, doing something to them, and then storing all of them in one step. A good example is in the test - rotating a square image in place can be done by repeatedly loading four values, rotating them, and then storing them again to the same locations. I don't think we could currently express that algorithm.
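
For concreteness, here is a rough sketch of that in-place rotation written in the scatter/gather spelling this thread eventually settles on (the image size, buffer setup, and names below are illustrative, not the PR's actual test):

    #include "Halide.h"
    using namespace Halide;

    int main() {
        const int N = 8;  // hypothetical square image size
        Buffer<int> im(N, N);
        im.for_each_element([&](int x, int y) { im(x, y) = x + N * y; });

        Var x("x"), y("y");
        Func rot("rot");
        rot(x, y) = im(x, y);

        // Each point in the top-left quadrant identifies a 4-cycle of pixels.
        RDom r(0, N / 2, 0, N / 2);
        Expr xs[] = {r.x, N - 1 - r.y, N - 1 - r.x, r.y};
        Expr ys[] = {r.y, r.x, N - 1 - r.y, N - 1 - r.x};

        // Load all four pixels of the cycle, then store each one to the next
        // position in the cycle: a 90-degree clockwise rotation in place.
        rot(scatter(xs[1], xs[2], xs[3], xs[0]),
            scatter(ys[1], ys[2], ys[3], ys[0])) =
            gather(rot(xs[0], ys[0]), rot(xs[1], ys[1]),
                   rot(xs[2], ys[2]), rot(xs[3], ys[3]));

        Buffer<int> out = rot.realize({N, N});
        return 0;
    }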

I'm opening this for discussion as a minimum viable version of the feature.

I don't think the name Call::tuple is the right thing, and there's no sugar for this yet. Tuples are already a thing, and a Tuple has mixed types, which can't work here. Any other ideas for the name or for sugar for the feature are very welcome. "pack" maybe? "bundle"?

@abadams abadams marked this pull request as draft December 12, 2020 20:16
@alexreinking (Member)

I will review soon, but we should hold off on merging experimental features until after Halide 11 is released.

@zvookin (Member) commented Dec 12, 2020

Are multiple index tuples allowed on the LHS? What kinds of Exprs are allowed inside the LHS tuples? Specifically is an RVar allowed?

First thought is that having to assert the indices are in range using unsafe_promise_clamped is problematic, especially in this instance where the range of the indices is fixed at compile time.

Second thought is we're getting a fair bit of complexity in the language embedded in argument lists for update stages. (Implicit loops on both sides.) In this case it is further obscured by using an Expr to hold "dst" so one cannot see the correspondence on that line alone. This makes it feel like unintelligible magic. Improvements to syntax will help this. E.g. something like:
sorted(x, scatter_to(min_idx, max_idx)) = scatter_from(...)
Another possibility would be to allow concatenating modifier expressions as we do in scheduling:
f(x, y).scatter(y, {min_idx, max_idx}) = scatter(y, {...});
Which would potentially allow multiple levels of scatter.

@abadams (Member, Author) commented Dec 12, 2020

Multiple LHS index tuples are allowed, but they must be the same size and not nested within each other. I'm currently also not requiring them to be outermost in the expression. E.g. the following is valid:

f(make_expr_tuple({r.x, r.x + 5}) + make_expr_tuple({7, 3}), y, make_expr_tuple({r.y, r.y + r.x})) = make_expr_tuple({a, b});

and it means that you're writing a to f(r.x + 7, y, r.y) and b to f(r.x + 5 + 3, y, r.y + r.x).

The coordinates on the LHS can be arbitrary Exprs. The rules are the same as the usual rules for things that are allowed on the LHS of an update definition. So e.g. if there's a pure var, it must also appear naked (so not in a tuple) as a LHS coordinate in the same position as in the pure def.

The unsafe_promise_clamped is only because I'm pulling things out of a Buffer where I happen to know that the buffer is all in [0, 7], but it's an input buffer, not part of the compilation, so there's no way for the compiler to know that.

Edit: Unless we're saying that the compiler should assume that embedded Buffers can never change, so it can iterate over them to get bounds. That's tricky in the JIT case where they're a reference to existing memory that the user might mutate. In the AOT case maybe we could do that if they go in ro_data. In either case that's orthogonal to this PR. The bounds of make_expr_tuple are just the bounds of its args.
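
For readers following along, a minimal sketch of the pattern being described, with hypothetical names (idx is an input buffer of indices that the author, but not the compiler, knows to lie in [0, 7]):

    ImageParam idx(Int(32), 1, "idx");
    RDom r(0, idx.dim(0).extent());
    // Promise the loaded index into [0, 7] so bounds inference gets a finite range.
    Expr i = unsafe_promise_clamped(idx(r), 0, 7);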

I think scatter_to / scatter_from are equivalent to my make_expr_tuple except for the name, assuming you only mean those two values and not the entire range in between them. The second version doesn't quite work because you could be scattering in every axis, not just one.

"scatter_to" "scatter_from" are better names than make_expr_tuple for sure.

@abadams (Member, Author) commented Dec 12, 2020

"scatter_to" and "gather_from", or just "scatter" and "gather", are other options.

@zvookin (Member) commented Dec 12, 2020

"The second version doesn't quite work because you could be scattering in every axis, not just one."

I'm not sure I get that; wouldn't that just be something like f(x, y, z).scatter(y, ...).scatter(z, ...)?

@dsharletg (Contributor)

Does it work if we just treat these the way GPUs treat short vectors? I.e. scalars broadcast across the vector, and vectors have to have the same number of elements or it's an error. This has the possibly unusual added ability of using a "vector" as an index in a load/store, but that doesn't seem like too extreme of a jump from what the GPU languages do.

That's how I thought about this at first without thinking too hard about it.

@zvookin (Member) commented Dec 12, 2020

Noting that the point of investigating expanded syntax is to spell out what is going on. To make it clear something special is happening. Especially if we think the language might do more of this. E.g. keyed map/reduce or compact/expand type stuff.

@abadams (Member, Author) commented Dec 12, 2020

Yeah these are very much like short vectors on GPU.

@abadams (Member, Author) commented Dec 12, 2020

Another piece of syntax I was thinking of was just listing the different LHS and RHS values as separate definitions somehow:

f.scatter({
  f(min_idx) = min(f(min_idx), f(max_idx));
  f(max_idx) = max(f(min_idx), f(max_idx));
});

But that looks like the two things happen serially instead of together, so it's actively misleading. But maybe there's something like that where multiple scatters naturally look like multiple assignments.

@abadams (Member, Author) commented Dec 12, 2020

What if a Tuple of FuncRef was a thing, and it had an operator= that accepted a Tuple of Exprs?

Tuple(f(min_idx), f(max_idx)) = Tuple(min(f(min_idx), f(max_idx)), max(f(min_idx), f(max_idx)));

ugh, I think I take it back.

@abadams (Member, Author) commented Dec 13, 2020

I changed the name to "gather" on the RHS and "scatter" on the LHS. It now reads as:

    sorted1(x, scatter(min_idx, max_idx)) =
        gather(min(sorted1(x, min_idx), sorted1(x, max_idx)),
               max(sorted1(x, min_idx), sorted1(x, max_idx)));

The idea is that gathers on the RHS pair with scatters on the LHS.
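
To see the renamed syntax in a fuller context, here is a hypothetical sketch (the pairs buffer, num_swaps, and input are illustrative, not the PR's actual test). For every x it sorts the 8 elements along the second dimension of input using a fixed sorting network whose compare pairs live in a small Buffer, hence the unsafe_promise_clamped idiom discussed earlier:

    const int num_swaps = 19;         // e.g. an optimal sorting network for 8 elements
    Buffer<int> pairs(2, num_swaps);  // pairs(0, i) < pairs(1, i), filled elsewhere

    ImageParam input(Int(32), 2, "input");
    Var x("x"), y("y");
    Func sorted1("sorted1");
    sorted1(x, y) = input(x, y);

    RDom r(0, num_swaps);
    Expr min_idx = unsafe_promise_clamped(pairs(0, r), 0, 7);
    Expr max_idx = unsafe_promise_clamped(pairs(1, r), 0, 7);

    // One compare-and-swap per step of the network: load both elements,
    // then store the min to the lower slot and the max to the upper slot.
    sorted1(x, scatter(min_idx, max_idx)) =
        gather(min(sorted1(x, min_idx), sorted1(x, max_idx)),
               max(sorted1(x, min_idx), sorted1(x, max_idx)));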

@steven-johnson (Contributor)

"scatter_to" and "gather_from"

That would be my first vote, with plain "scatter/gather" in second place

@abadams (Member, Author) commented Dec 14, 2020

I thought it might be best to keep it short, so you can write this sort of thing on one line:

f(x, scatter(y0, y1)) = gather(a, b) + c;

When it was longer I found myself writing things like:

Expr dst_y = scatter_to(y0, y1);
Expr ab = gather_from(a, b);
f(x, dst_y) = ab + c;

Which I think obscures what's going on. Better to have the scatter / gather present on the actual definition line.

We could force that behavior with more explicit syntax, e.g:

f.scatter({x, y0}, {x, y1}) = {a + c, b + c};

Not sure I like that specifically because it forces repetition of the +c on the RHS and x on the LHS, which is one of the things that makes Tuples annoying to use.

@alexreinking (Member)

Why not mix the two as in:

f.scatter({x, y0}, {x, y1}) = gather(a, b) + c;

@abadams (Member, Author) commented Dec 14, 2020

It's an option. I find the asymmetry a little disturbing.

Another thing I don't like about Func::scatter is that it would be nice if, in the syntax, a regular single update definition were just the degenerate case of a multi-update definition. I.e. a multiple-scattering update definition that happens to only scatter to one coord should use the same syntax as our current update definitions.

@steven-johnson (Contributor)

The multiple_scatter test is failing under WebAssembly with out-of-memory. How much memory is it allocating?

@abadams (Member, Author) commented Dec 15, 2020

Very little, so that's very odd. I'll see if I can repro.

@BachiLi (Contributor) commented Dec 17, 2020

Probably need to handle this in autodiff as well.

@steven-johnson (Contributor)

Looks like we're getting segfaults in apps/fft: https://buildbot.halide-lang.org/master/#/builders/49/builds/19

@abadams abadams marked this pull request as ready for review December 28, 2020 18:20
@abadams (Member, Author) commented Dec 28, 2020

I believe this is now ready for review

@dsharletg (Contributor) left a review comment

The implementation and idea look good to me except for the ongoing debate above re: the names. I think the syntax should just be something like:

f(tuple(...)) = g(tuple(...));

tuple is maybe not the right name for this... vector? concat?

scatter/gather just doesn't seem quite right to me. The scatter/gather-ness is already determined by being on the LHS or RHS.

However, I don't feel that strongly about this. I'd be fine with merging this as-is. This is something we could easily think of a better solution for later, and deprecate the current names. I think the core of this (short vectors that broadcast including loads/stores) is a good change.

@zvookin (Member) commented Dec 29, 2020

The scatter/gather syntax explains what is going on. I fear that the degree of magic in the language is high and increasing. If the reading of scatter and gather is helpful and not misleading, it is worth having even if it is not strictly necessary for folks who are very familiar with the language. Using tuple seems problematic to me as it overloads with the concept of tuple valued things, as in storing a group of values to a struct-like location. Not to mention the existing meaning in Halide. I'd expect just having it be a vector valued thing would be the way to go, though this introduces vectors in the user visible language, which has not been done. (And likely causes issues with the vectorization pass in lowering.)

I also like that the scatter/gather words are not too long and draw attention to the specific thing on both sides of the expression. This makes it easier to pair them up. If it is decided this is a bad idea, it is far easier to make the separate names synonyms for one common name and use the LHS vs. RHS property to distinguish them later than it is to do the opposite.

The only downside I can think of is if one wants to introduce a temporary Expr that is used on both sides. E.g.

    Expr indices = vec(y0, y1);
    out(x, indices) = in(indices + 42);

However I can't think of a use case for this.

@alexreinking (Member)

To be clear -- for a k-wide gather/scatter update, it first gets a k-wide scratch buffer and then iterates over all of the gathers jointly, recomputing the expression for each coordinate in those gathers, and storing them into the scratch buffer. Then, it does the same thing for the scatters, iterating jointly and copying the scratch to those locations in the func.

So if you have a k-wide gather/scatter, it updates k points in the func.

Now, what happens if someone writes this?

f(scatter(0, 0)) = g(gather(0, 1)); 

I'm not sure which value (g(1) or g(0)) will "win" in this case. Is it UB, or do we go left-to-right so g(1) wins? Also, how will this interact with tuples? There might be more edge cases I'm missing.

Still, I think this will be safe/sound for the core scheduling directives and a union is all that's needed for BI. Will rfactor work with this?

@abadams (Member, Author) commented Dec 29, 2020

Yes, your understanding is correct. The stores happen left to right, so in the case of collisions the rightmost value wins.
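
To make the confirmed semantics concrete, here is a small self-contained C++ simulation of how the collision example above resolves (plain C++ mimicking the described lowering, not Halide code):

    #include <array>
    #include <cstdio>

    int main() {
        // Simulate f(scatter(0, 0)) = g(gather(0, 1)) with g = {10, 20}:
        // all gathered values are computed into a scratch buffer first,
        // then stored left to right, so the rightmost write wins.
        std::array<int, 2> g = {10, 20};
        std::array<int, 2> f = {0, 0};
        const int scatter_coords[2] = {0, 0};
        const int gather_coords[2] = {0, 1};

        int scratch[2];
        for (int i = 0; i < 2; i++) scratch[i] = g[gather_coords[i]];   // gather phase
        for (int i = 0; i < 2; i++) f[scatter_coords[i]] = scratch[i];  // scatter phase

        printf("f(0) = %d\n", f[0]);  // prints 20: g(1) wins the collision
        return 0;
    }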

It's orthogonal to Tuples. gather/scatter are Exprs, so they can exist inside Tuples. All gather values / scatter indices x tuple lanes are first computed, and then all are stored.

rfactor won't recognize any of this as associative because that occurs before gather/scatter are unpacked. Any gather/scatter patterns that are in fact associative/commutative updates could be added to the table of associative ops. I'm having a hard time thinking of one right now that wouldn't be better expressed in some other way.

@abadams (Member, Author) commented Dec 29, 2020

Regarding the joint iteration thing - one wart of this approach is that if you want to store a different number of values depending on a condition:

f(select(p, scatter(0, 1), scatter(0, 1, 2))) = select(p, gather(a, b), gather(a, b, c));
...
f.specialize(p);

That's illegal because you can't jointly iterate. You have to write this instead:

f(select(p, scatter(0, 1, 1), scatter(0, 1, 2))) = select(p, gather(a, b, b), gather(a, b, c));

and then rely on dead-store elimination to cull the middle f[1] = b in the case where p is true.

If we wanted to, we could implicitly duplicate the last element of a short gather/scatter to match the longest one, but that sounds like it could hide some gnarly bugs. (e.g. if you accidentally negate the condition on the LHS vs RHS).

@alexreinking (Member)

Yes, your understanding is correct. The stores happen left to right, so in the case of collisions the rightmost value wins.

Okay, great, then I think we're definitely safe.

If we wanted to, we could implicitly duplicate the last element of a short gather/scatter to match the longest one, but that sounds like it could hide some gnarly bugs. (e.g. if you accidentally negate the condition on the LHS vs RHS).

Yeah, let's not do this. It does sound likely to hide genuine programming errors.

You have to write this instead: [...] and then rely on dead-store elimination to cull the middle [assignment] in the case where p is true.

One idea would be to let update stages in the algorithm select between several innermost assignments. Something like:

f.update({
  { p,    _(scatter(0, 1)) = gather(a, b) },
  { true, _(scatter(0, 1, 2)) = gather(a, b, c) }
});
f.specialize(p);

Where _ is a placeholder that's filled in with f by update and every update mentions the same set of RDoms. I won't say this is pretty either, but it's a bit better in that it makes the differing updates explicit, keeps the gather/scatter semantics simple, and is just as schedulable as before. I'm not sure what effect this would have on BI. After a specialize, the union operation could get moved higher up versus the select in the original.

@abadams (Member, Author) commented Dec 29, 2020

We could do that. Effectively it adds an optional switch statement to update definitions, along with the optional for loop (the RDom), and the optional inner if statement (RDom::where). But I think (after this PR!) we should stop adding more control flow features to update definitions one by one, and instead consider a holistically-designed imperative extension.

@alexreinking (Member)

But I think (after this PR!) we should stop adding more control flow features to update definitions one by one, and instead consider a holistically-designed imperative extension.

👍 to this generally.

@alexreinking (Member) left a review comment

Looks good to me, with a couple small improvements I'd like made:

  • the "rightmost write wins" semantics should be documented both in the Doxygen doc string and as an explicit test so that the behavior isn't inadvertently changed.
  • I'm pretty sure f(0) = gather(1,2) is nonsense, but f(scatter(0,1)) = 2 looks like a broadcast. If the second example is allowed, it should be tested and documented. If it is not, it should become an error test (same with the no-scatter first example).

Inline review comments (resolved) on src/Simplify_Stmts.cpp, test/correctness/multiple_scatter.cpp, and src/IROperator.h.
@alexreinking (Member) commented Dec 31, 2020

Since the f(/* no scatter */) = gather(e1, ..., en) case is allowed, should we try to guarantee that gather(e1, ..., en) is replaced by en before BI sees it, so we don't get unnecessary widening / math?

@abadams (Member, Author) commented Dec 31, 2020

Good question... I think it's in the same category as things like 0 * f(x), or select(provably_false_expr, f(x), g(x)). I'm not too worried about bounds inference being tight on silly code. Normally I only write things like 0 * f(x) in order to add false dependence while debugging something.

@alexreinking (Member)

I'm not too worried about bounds inference being tight on silly code.

Yeah, I asked mostly because the test imagined it as a degenerate case of some generic (metaprogramming, I assume) code. That might justify treating this case favorably. OTOH that work could be done in a separate PR.

@alexreinking alexreinking added this to the v12.0.0 milestone Jan 1, 2021
@alexreinking alexreinking added the "enhancement" label (New user-visible features or improvements to existing features) Jan 1, 2021
@steven-johnson (Contributor)

Post-holiday check-in: where does this PR stand? Buildbots are clean. Is there more work to be done?

@abadams (Member, Author) commented Jan 5, 2021

It's done and ready to merge. Alex wanted to cut the branch for release 11 before merging, so it's just waiting on that.

@abadams abadams merged commit cc03c9c into master Jan 9, 2021