Fix show problem by changing TableOrderBy to accept keyed tables. #5172
Can you explain what the problem was?
It wasn't scanning the full dataset anymore, but table.head().flatten() was generating a TableOrderBy(TableKeyBy(TableHead)). There was no way to remove this node, even if the table was already keyed by the sort fields, so we ended up doing an extra scan and possibly a shuffle. This change simplifies the whole thing and emits the correct IR from the beginning.
@@ -429,7 +428,9 @@ object Simplify {
TableMapGlobals(TableHead(child, n), newGlobals)

case TableHead(TableOrderBy(child, sortFields), n)
if sortFields.forall(_.sortOrder == Ascending) && n < 256 && canRepartition =>
if sortFields.forall(_.sortOrder == Ascending)
&& child.typ.key != sortFields.map(_.field)
This is too strict. It should match the condition in table order by: that the sort fields are a prefix of the key.
Maybe you should break that check out as a separate function on object TableOrderBy and call it in both places.
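To make the difference between the equality check in the diff and the prefix condition concrete, here is a minimal sketch of such a shared helper. The types SortField, SortOrder and the enclosing object OrderingCheck are hypothetical stand-ins rather than the project's real definitions, and folding the "all ascending" requirement into the helper is an assumption.

```scala
// Hypothetical stand-ins for the real IR types; not the project's definitions.
sealed trait SortOrder
case object Ascending extends SortOrder
case object Descending extends SortOrder

final case class SortField(field: String, sortOrder: SortOrder)

object OrderingCheck {
  // True when the requested sort is already implied by the table's key:
  // every sort field is ascending (assumed requirement) and the sort field
  // names form a prefix of the key.
  def isAlreadyOrdered(sortFields: IndexedSeq[SortField], key: IndexedSeq[String]): Boolean =
    sortFields.forall(_.sortOrder == Ascending) &&
      key.startsWith(sortFields.map(_.field))
}
```

With a table keyed by (a, b), sorting by a alone is already satisfied under the prefix condition, whereas a check of the form key != sortFields.map(_.field) would treat it as unsorted.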
3bc87a5 to 4807127
@@ -319,8 +319,7 @@ object Simplify {
TableFilter(t,
ApplySpecial("&&", Array(p1, p2)))

case TableOrderBy(child, sortFields) if sortFields.isEmpty =>
child
case TableOrderBy(TableKeyBy(child, _, _), sortFields) => TableOrderBy(child, sortFields)
I don't think this is general enough. What about adding:
case TableOrderBy(child, sortFields)
if TableOrderBy.isAlreadyOrdered(sortFields, child.rvdType.key) =>
TableKeyBy(child, Array(), false)
case TableKeyBy(TableKeyBy(child, sortFields, false), IndexedSeq(), _) =>
TableOrderBy(child, sortFields)
The spec in the Google Doc wouldn't allow for either of these rewrites. Since we can rewrite a TableKeyBy(TableKeyBy(child, _), newKey) as TableKeyBy(child, newKey), the first would lead to optimization totally blowing away the order. We can't remove TableOrderBy nodes, even if a KeyBy substitution in-place may have the same semantics.
The latter is also a deoptimization - keying by an empty key doesn't guarantee a stable sort, so we don't actually have to do the inner keyBy at all.
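The first objection can be traced on a toy IR. This is a sketch only: Read, KeyBy, OrderBy and the two rule functions are hypothetical and far simpler than the real Simplify rules, and the ordering check is reduced to a plain prefix test on field names.

```scala
// Toy IR, hypothetical and much simpler than the real one.
sealed trait TableIR
final case class Read(name: String) extends TableIR
final case class KeyBy(child: TableIR, keys: Seq[String]) extends TableIR
final case class OrderBy(child: TableIR, fields: Seq[String]) extends TableIR

object ToySimplify {
  // Existing rule from the comment: an inner KeyBy under an outer KeyBy is redundant.
  def collapseKeyBy(ir: TableIR): TableIR = ir match {
    case KeyBy(KeyBy(child, _), newKey) => KeyBy(child, newKey)
    case other => other
  }

  // Simplified form of the proposed rule: an OrderBy whose child is already
  // keyed by a matching prefix becomes a KeyBy with an empty key.
  def orderByToKeyBy(ir: TableIR): TableIR = ir match {
    case OrderBy(child @ KeyBy(_, keys), fields) if keys.startsWith(fields) =>
      KeyBy(child, Seq.empty)
    case other => other
  }
}

object ToyDemo {
  val start: TableIR = OrderBy(KeyBy(Read("t"), Seq("a")), Seq("a"))
  // After the proposed rule: KeyBy(KeyBy(Read("t"), Seq("a")), Seq())
  val afterProposed: TableIR = ToySimplify.orderByToKeyBy(start)
  // After the existing rule: KeyBy(Read("t"), Seq()), with no trace of the sort on "a"
  val afterCollapse: TableIR = ToySimplify.collapseKeyBy(afterProposed)
}
```

Once the OrderBy has been replaced by a KeyBy, the existing collapse rule is free to discard the inner KeyBy, and nothing left in the tree requires the rows to be ordered by a.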
Yup, you're right. This change makes me a little uncomfortable, because the interaction between TableOrderBy and TableKeyBy is now more complicated. I was trying to find a normalizing set of rewrite rules to handle that interaction (the second rule was only there to make it confluent). I'll approve for now, but I'll keep thinking about it.
Fix deoptimization in Simplify.