New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[query] allow inteval filters to work with indirect row key #13333
Conversation
82d7d70
to
3af2dc5
Compare
@@ -267,10 +310,10 @@ object ExtractIntervalFilters { | |||
case _ => None | |||
} | |||
} | |||
case Let(name, value, body) if name != es.rowRef.name => | |||
case Let(name, value, body) => | |||
// TODO: thread key identity through values, since this will break when CSE arrives | |||
// TODO: thread predicates in `value` through `body` as a ref |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are these TODOs about? What would it mean to do them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I forgot to update the todos.
The problem with the old ExtractIntervalFilters
was that, like many of our analyses and optimizations, we couldn't see through Let nodes. This manifested in two ways, corresponding to the two comments:
- We only recognized a key field when it was literaly
GetField(Ref("row"), keyField)
. If it wasLet(row, Ref("row"), ... GetField(row, keyField) ...)
, we didn't recognize it as a key field, and so wouldn't extract filters on the field to intervals of the key type. - If
value
is a predicate, we didn't try to extract intervals from it. For example, if the filter predicate isLet(p, keyField < 10, ... p ... p)
, we didn't even try to lift thekeyField < 10
part. Ifp
were inlined, however, we would analyze it correctly.
This PR addresses the first, but not the second. Over the weekend, I realized the right way to fix this, which addresses both restrictions, and I think fixes some other blind spots as well, and is a simpler design. I think the better change isn't much harder than this one was. I'm going to take a stab at it this morning, and make a PR if it doesn't take too long. But we can still merge this in the meantime since it fixes a blocking issue.
.zipWithIndex | ||
.forall { case (fd, idx) => idx < rowKeyType.size && fd == GetField(rowRef, rowKeyType.fieldNames(idx)) } | ||
case SelectFields(`rowRef`, fields) => keyFields.startsWith(fields) | ||
case MakeStruct(fields) => fields.view.zipWithIndex.forall { case ((_, fd), idx) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the view
do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It makes a lazy collection wrapping fields
, so the zipWithIndex
doesn't materialize. See https://docs.scala-lang.org/overviews/collections/views.html
case SelectFields(old, fields) => fields.view.zipWithIndex.forall { case (fd, idx) => | ||
idx < rowKeyType.size && fieldIsKeyField(old, fd, rowKeyType.fieldNames(idx)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we not eliminate nested SelectFields
nodes? This case analysis and that in fieldIsKeyField
seems to indicate as much.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what nesting you're referring to. A common case here would be SelectFields(Ref("row"), ["locus", "alleles"])
. We want to recognize this as being the key (or possibly a prefix of the key).
case SelectFields(old, fields) => fields.view.zipWithIndex.forall { case (fd, idx) => | ||
idx < rowKeyType.size && fieldIsKeyField(old, fd, rowKeyType.fieldNames(idx)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you not lift the idx < keySize
comparison out of the forall
(ie fields.length < rowKeyType.size
or something)?
The same for the above pattern for MakeStruct
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice. I look forward to your subsequent PR that removes this recursive pattern matching
E.g. when the row key, and/or the row itself, in a predicate is hidden behind Refs, MakeStructs, SelectFields. For example, before matching the first key field "kf" only worked for exactly
GetField(Ref("row", ...), "kf")
. Now the following all work (meaning the analysis recognizes these all as being the first key field):Edit:
A note on how serious the issue was in practice: Any filter that uses more than one field (or the same field twice) was broken, where broken means it does no interval filter and reads all the data. E.g.
ht.filter((ht.locus >= hl.locus('20', 1)) & (ht.locus < hl.locus('20', 10200000)))
. This is because the repeated field is pulled out into a let by CSE. If it was two different fields, the underlyingRef("row")
is pulled out.Other restrictions this PR doesn't address:
If