[query] force aggregate_cols to be local #13405
Looks like you need the globals too. Two concerns:
see comment
I can't find the conversation we all had about this, but I strongly disagree with (2). It has to be the case that this always evaluates to true. This would be profoundly confusing if not.
I hate it but I'm willing to accept that
Returns them out of order, even though I don't like that. EDIT: hit enter too fast
Heh. We can't annotate_globals.
OK, alright. I think you're right. I'm just reliving how frustrated I am by this situation. Can you modify aggregate_cols to include the same warning from
I can't find good tests either, can you add some tests in the spirit of my shared ipython session?
Had to chase down a latent bug in
Can you update your PR message with a CHANGELOG line indicating that we fixed the performance regression?
mt = mt.checkpoint(path)
assert(mt.aggregate_cols(hl.agg.collect(mt.col_idx)) == [0, 1, 2])
mt = mt.key_cols_by()
assert(mt.aggregate_cols(hl.agg.collect(mt.col_idx)) == [2, 1, 0])
oops, sorry, assert in Python doesn't use parentheses. My bad
CHANGELOG: MatrixTable.aggregate_cols no longer forces a distributed computation. This should be what you want in the majority of cases. In case you know the aggregation is very slow and should be parallelized, use mt.cols().aggregate instead.
Most of the time, aggregate_cols will be much faster performing the aggregation locally. Currently, we generate a TableAggregate over a TableParallelize of the columns. We shouldn't try to optimize that to a local computation during compilation; TableParallelize should express the intent that the computation is expensive and really should be parallelized. This should be considered part of the semantics the compiler must preserve.

This PR changes aggregate_cols to explicitly generate a local computation using StreamAgg (which was only exposed in Python relatively recently, which is why we haven't made this change sooner). Longer term, aggregating columns should probably get its own IR node, especially once we start partitioning along columns.
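
For readers less familiar with the distinction, here is a toy, pure-Python sketch of the two evaluation strategies described above. It is not Hail's implementation: the function names aggregate_local and aggregate_partitioned, and the n_partitions parameter, are hypothetical and exist only for illustration. The local version folds over the column values in one pass (roughly the shape of a StreamAgg), while the partitioned version aggregates each chunk separately and then combines the partial results (roughly the shape of a TableAggregate over a TableParallelize).

```python
from functools import reduce


def aggregate_local(cols, seq_op, zero):
    """Fold over an in-memory list of column values in a single pass."""
    acc = zero
    for value in cols:
        acc = seq_op(acc, value)
    return acc


def aggregate_partitioned(cols, seq_op, comb_op, zero, n_partitions):
    """Split the values into chunks, aggregate each chunk, then combine.

    In a real distributed system each chunk would be shipped to a worker,
    which is where the fixed scheduling overhead comes from.
    """
    size = max(1, -(-len(cols) // n_partitions))  # ceiling division
    partitions = [cols[i:i + size] for i in range(0, len(cols), size)]
    partials = [aggregate_local(p, seq_op, zero) for p in partitions]
    return reduce(comb_op, partials, zero)


col_idx = [0, 1, 2]
local_sum = aggregate_local(col_idx, lambda acc, x: acc + x, 0)
partitioned_sum = aggregate_partitioned(
    col_idx,
    lambda acc, x: acc + x,   # per-element step
    lambda a, b: a + b,       # merge two partial results
    0,
    n_partitions=2,
)
```

Both strategies agree for an associative aggregation such as a sum. For the handful of columns in a typical matrix table, the partitioned path mostly adds overhead, which is the motivation for making the local path the default.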