Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[query] allow inteval filters to work with indirect row key #13333

Merged
merged 7 commits into from Aug 3, 2023

Conversation

patrick-schultz
Copy link
Collaborator

@patrick-schultz patrick-schultz commented Jul 28, 2023

E.g. when the row key, and/or the row itself, in a predicate is hidden behind Refs, MakeStructs, SelectFields. For example, before matching the first key field "kf" only worked for exactly GetField(Ref("row", ...), "kf"). Now the following all work (meaning the analysis recognizes these all as being the first key field):

Let("foo",
  SelectFields(Ref("row"), [..., "kf", ...]),
  GetField(Ref("foo"), "kf"))

Let("foo",
  Ref("row"),
  Let("bar",
    MakeStruct("baz" -> GetField(Ref("foo"), "kf"), ...)
    GetField(Ref("bar"), "baz")))

Edit:
A note on how serious the issue was in practice: Any filter that uses more than one field (or the same field twice) was broken, where broken means it does no interval filter and reads all the data. E.g. ht.filter((ht.locus >= hl.locus('20', 1)) & (ht.locus < hl.locus('20', 10200000))). This is because the repeated field is pulled out into a let by CSE. If it was two different fields, the underlying Ref("row") is pulled out.

Other restrictions this PR doesn't address:

  • The predicate can't have a repeated sub-predicate, e.g.
    subcond = ht.locus.contig == '20'
    ht.filter((subcond & (ht.locus.position < 10)) | (subcond & (ht.locus.position > 10200000)))
    
  • The predicate can't use negation or If
  • The key type being filtered can only be Locus or numeric (or structs of those), doesn't work for e.g. gene Strings.

@@ -267,10 +310,10 @@ object ExtractIntervalFilters {
case _ => None
}
}
case Let(name, value, body) if name != es.rowRef.name =>
case Let(name, value, body) =>
// TODO: thread key identity through values, since this will break when CSE arrives
// TODO: thread predicates in `value` through `body` as a ref
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are these TODOs about? What would it mean to do them?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I forgot to update the todos.

The problem with the old ExtractIntervalFilters was that, like many of our analyses and optimizations, we couldn't see through Let nodes. This manifested in two ways, corresponding to the two comments:

  1. We only recognized a key field when it was literaly GetField(Ref("row"), keyField). If it was Let(row, Ref("row"), ... GetField(row, keyField) ...), we didn't recognize it as a key field, and so wouldn't extract filters on the field to intervals of the key type.
  2. If value is a predicate, we didn't try to extract intervals from it. For example, if the filter predicate is Let(p, keyField < 10, ... p ... p), we didn't even try to lift the keyField < 10 part. If p were inlined, however, we would analyze it correctly.

This PR addresses the first, but not the second. Over the weekend, I realized the right way to fix this, which addresses both restrictions, and I think fixes some other blind spots as well, and is a simpler design. I think the better change isn't much harder than this one was. I'm going to take a stab at it this morning, and make a PR if it doesn't take too long. But we can still merge this in the meantime since it fixes a blocking issue.

.zipWithIndex
.forall { case (fd, idx) => idx < rowKeyType.size && fd == GetField(rowRef, rowKeyType.fieldNames(idx)) }
case SelectFields(`rowRef`, fields) => keyFields.startsWith(fields)
case MakeStruct(fields) => fields.view.zipWithIndex.forall { case ((_, fd), idx) =>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the view do?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes a lazy collection wrapping fields, so the zipWithIndex doesn't materialize. See https://docs.scala-lang.org/overviews/collections/views.html

Comment on lines 71 to 72
case SelectFields(old, fields) => fields.view.zipWithIndex.forall { case (fd, idx) =>
idx < rowKeyType.size && fieldIsKeyField(old, fd, rowKeyType.fieldNames(idx))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we not eliminate nested SelectFields nodes? This case analysis and that in fieldIsKeyField seems to indicate as much.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what nesting you're referring to. A common case here would be SelectFields(Ref("row"), ["locus", "alleles"]). We want to recognize this as being the key (or possibly a prefix of the key).

Comment on lines 71 to 72
case SelectFields(old, fields) => fields.view.zipWithIndex.forall { case (fd, idx) =>
idx < rowKeyType.size && fieldIsKeyField(old, fd, rowKeyType.fieldNames(idx))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you not lift the idx < keySize comparison out of the forall (ie fields.length < rowKeyType.size or something)?
The same for the above pattern for MakeStruct.

Copy link
Collaborator

@ehigham ehigham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. I look forward to your subsequent PR that removes this recursive pattern matching

@danking danking merged commit 3388359 into hail-is:main Aug 3, 2023
8 checks passed
@patrick-schultz patrick-schultz deleted the interval-filter-fix branch October 30, 2023 13:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants