[query] allow interval filters to work with indirect row key #13333

Merged

danking merged 7 commits into hail-is:main from patrick-schultz:interval-filter-fix

Aug 3, 2023

Collaborator

patrick-schultz commented Jul 28, 2023

Allows interval filters to be extracted when the row key, and/or the row itself, is hidden behind Refs, MakeStructs, or SelectFields in a predicate. Previously, the first key field "kf" was recognized only when it appeared exactly as GetField(Ref("row", ...), "kf"). Now the following all work (meaning the analysis recognizes each of them as the first key field):

Let("foo",
  SelectFields(Ref("row"), [..., "kf", ...]),
  GetField(Ref("foo"), "kf"))

Let("foo",
  Ref("row"),
  Let("bar",
    MakeStruct("baz" -> GetField(Ref("foo"), "kf"), ...),
    GetField(Ref("bar"), "baz")))
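
To make the two cases above concrete, here is a minimal Python sketch of what "seeing through" Let, Ref, SelectFields and MakeStruct could look like on a toy IR. It is not Hail's Scala implementation: the dataclasses only mirror the node names used above, `resolve_row_field` and `field_of` are hypothetical helpers, and the sketch assumes every Let binds a unique name that does not shadow "row".

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple, Union


@dataclass(frozen=True)
class Ref:
    name: str


@dataclass(frozen=True)
class GetField:
    struct: "IR"
    field: str


@dataclass(frozen=True)
class SelectFields:
    struct: "IR"
    fields: Tuple[str, ...]


@dataclass(frozen=True)
class MakeStruct:
    fields: Tuple[Tuple[str, "IR"], ...]


@dataclass(frozen=True)
class Let:
    name: str
    value: "IR"
    body: "IR"


IR = Union[Ref, GetField, SelectFields, MakeStruct, Let]


def resolve_row_field(ir: IR, env: Dict[str, IR], row: str = "row") -> Optional[str]:
    """Return the row field that `ir` ultimately reads, or None if unknown."""
    if isinstance(ir, Let):
        # Remember the binding (assumed unique), then analyze the body.
        return resolve_row_field(ir.body, {**env, ir.name: ir.value}, row)
    if isinstance(ir, Ref) and ir.name in env:
        return resolve_row_field(env[ir.name], env, row)
    if isinstance(ir, GetField):
        return field_of(ir.struct, ir.field, env, row)
    return None


def field_of(struct: IR, field: str, env: Dict[str, IR], row: str) -> Optional[str]:
    """Which row field does `struct.field` read, looking through wrappers?"""
    if isinstance(struct, Ref):
        if struct.name in env:
            return field_of(env[struct.name], field, env, row)
        return field if struct.name == row else None
    if isinstance(struct, SelectFields):
        # SelectFields keeps field names, so keep looking in the source struct.
        return field_of(struct.struct, field, env, row) if field in struct.fields else None
    if isinstance(struct, MakeStruct):
        for name, value in struct.fields:
            if name == field:
                return resolve_row_field(value, env, row)
    return None


# The two examples from the description, with "..." replaced by made-up fields.
ex1 = Let("foo", SelectFields(Ref("row"), ("a", "kf", "b")), GetField(Ref("foo"), "kf"))
ex2 = Let(
    "foo",
    Ref("row"),
    Let(
        "bar",
        MakeStruct((("baz", GetField(Ref("foo"), "kf")),)),
        GetField(Ref("bar"), "baz"),
    ),
)

print(resolve_row_field(ex1, {}))  # kf
print(resolve_row_field(ex2, {}))  # kf
```

Comparing the result against the first key field name then gives the "is this the first key field" answer for both shapes.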

Edit:
A note on how serious the issue was in practice: any filter that uses more than one field (or the same field twice) was broken, where broken means no interval filtering was performed and all the data was read. E.g. ht.filter((ht.locus >= hl.locus('20', 1)) & (ht.locus < hl.locus('20', 10200000))). This is because the repeated field is pulled out into a let by CSE. If the filter uses two different fields, the underlying Ref("row") is pulled out instead.
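
A small Python sketch of why the let binding defeats an exact-match check. This is a toy model, not Hail code: the node classes, the `recognized_comparisons` helper, the binding name `cse_tmp`, and the string stand-ins for locus literals are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Tuple


@dataclass(frozen=True)
class Ref:
    name: str


@dataclass(frozen=True)
class GetField:
    struct: Any
    field: str


@dataclass(frozen=True)
class Apply:
    op: str  # "and", "or", or a comparison such as ">=" or "<"
    args: Tuple[Any, ...]


@dataclass(frozen=True)
class Let:
    name: str
    value: Any
    body: Any


def recognized_comparisons(ir: Any, key_field: str) -> int:
    """Count comparisons whose left operand is literally row.key_field."""
    if isinstance(ir, Let):
        # The old-style check ignores what the Let binds and only walks the body.
        return recognized_comparisons(ir.body, key_field)
    if isinstance(ir, Apply):
        if ir.op in ("and", "or"):
            return sum(recognized_comparisons(a, key_field) for a in ir.args)
        return int(ir.args[0] == GetField(Ref("row"), key_field))
    return 0


locus = GetField(Ref("row"), "locus")

# The predicate as the user wrote it: the key field appears twice.
before_cse = Apply(
    "and",
    (Apply(">=", (locus, "20:1")), Apply("<", (locus, "20:10200000"))),
)

# The same predicate once the repeated field is bound a single time by a Let.
after_cse = Let(
    "cse_tmp",  # hypothetical binding name
    locus,
    Apply(
        "and",
        (Apply(">=", (Ref("cse_tmp"), "20:1")), Apply("<", (Ref("cse_tmp"), "20:10200000"))),
    ),
)

print(recognized_comparisons(before_cse, "locus"))  # 2
print(recognized_comparisons(after_cse, "locus"))   # 0
```

With zero recognized comparisons there is nothing to turn into intervals, which matches the "reads all the data" behavior described above.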

Other restrictions this PR doesn't address:

The predicate can't have a repeated sub-predicate, e.g.

subcond = ht.locus.contig == '20'
ht.filter((subcond & (ht.locus.position < 10)) | (subcond & (ht.locus.position > 10200000)))

The predicate can't use negation or If
The key type being filtered can only be Locus or numeric (or structs of those); it doesn't work for, e.g., gene Strings.

patrick-schultz added 2 commits

July 28, 2023 14:54

add test (8630eff)

teach ExtractIntervalFilters to see through refs (3685e65)

patrick-schultz assigned ehigham

patrick-schultz added 3 commits

July 28, 2023 15:53

another test (8d58ccf)

fix ordering (54d87e7)

add GRCh38 test (3af2dc5)

patrick-schultz force-pushed the interval-filter-fix branch from 82d7d70 to 3af2dc5

July 28, 2023 20:29

danking reviewed

hail/src/main/scala/is/hail/expr/ir/ExtractIntervalFilters.scala Outdated

@@ -267,10 +310,10 @@ object ExtractIntervalFilters {
                           case _ => None
                         }
                       }
-                    case Let(name, value, body) if name != es.rowRef.name =>
+                    case Let(name, value, body) =>
                       // TODO: thread key identity through values, since this will break when CSE arrives
                       // TODO: thread predicates in `value` through `body` as a ref

Collaborator

danking Jul 29, 2023

What are these TODOs about? What would it mean to do them?

Collaborator Author

patrick-schultz Jul 31, 2023

Thanks, I forgot to update the todos.

The problem with the old ExtractIntervalFilters was that, like many of our analyses and optimizations, it couldn't see through Let nodes. This manifested in two ways, corresponding to the two comments:

We only recognized a key field when it was literally GetField(Ref("row"), keyField). If it was Let(row, Ref("row"), ... GetField(row, keyField) ...), we didn't recognize it as a key field, and so wouldn't extract filters on the field to intervals of the key type.
If value is a predicate, we didn't try to extract intervals from it. For example, if the filter predicate is Let(p, keyField < 10, ... p ... p), we didn't even try to lift the keyField < 10 part. If p were inlined, however, we would analyze it correctly.

This PR addresses the first, but not the second. Over the weekend, I realized the right way to fix this: a simpler design that addresses both restrictions and, I think, fixes some other blind spots as well. I think the better change isn't much harder than this one was. I'm going to take a stab at it this morning, and make a PR if it doesn't take too long. But we can still merge this in the meantime since it fixes a blocking issue.
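
To illustrate the second point (a let-bound predicate p would be analyzable if it were inlined), here is a minimal Python sketch of let inlining on a toy IR. It only demonstrates that observation; it is not the follow-up design mentioned above, and the node classes and the `inline_lets` helper are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Tuple


@dataclass(frozen=True)
class Ref:
    name: str


@dataclass(frozen=True)
class GetField:
    struct: Any
    field: str


@dataclass(frozen=True)
class Apply:
    op: str
    args: Tuple[Any, ...]


@dataclass(frozen=True)
class Let:
    name: str
    value: Any
    body: Any


def inline_lets(ir: Any, env: Optional[Dict[str, Any]] = None) -> Any:
    """Replace every let-bound Ref by its (already inlined) value."""
    env = {} if env is None else env
    if isinstance(ir, Let):
        # Inline the value in the outer scope before binding it for the body.
        value = inline_lets(ir.value, env)
        return inline_lets(ir.body, {**env, ir.name: value})
    if isinstance(ir, Ref):
        return env.get(ir.name, ir)  # free refs such as "row" stay as they are
    if isinstance(ir, GetField):
        return GetField(inline_lets(ir.struct, env), ir.field)
    if isinstance(ir, Apply):
        return Apply(ir.op, tuple(inline_lets(a, env) for a in ir.args))
    return ir  # literals such as 10


key_lt_10 = Apply("<", (GetField(Ref("row"), "kf"), 10))
a = Apply("==", (GetField(Ref("row"), "x"), 1))
b = Apply("==", (GetField(Ref("row"), "x"), 2))

# Let(p, keyField < 10, ... p ... p)
pred = Let(
    "p",
    key_lt_10,
    Apply("or", (Apply("and", (Ref("p"), a)), Apply("and", (Ref("p"), b)))),
)

inlined = inline_lets(pred)
print(inlined == Apply("or", (Apply("and", (key_lt_10, a)), Apply("and", (key_lt_10, b)))))  # True
```

After inlining, both occurrences of p are ordinary comparisons on the key field again. Inlining duplicates the bound expression, so a real compiler would more likely carry the analysis result for p through the body than rewrite the tree.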

hail/src/main/scala/is/hail/expr/ir/ExtractIntervalFilters.scala Outdated

-                      .zipWithIndex
-                      .forall { case (fd, idx) => idx < rowKeyType.size && fd == GetField(rowRef, rowKeyType.fieldNames(idx)) }
-                    case SelectFields(`rowRef`, fields) => keyFields.startsWith(fields)
+                    case MakeStruct(fields) => fields.view.zipWithIndex.forall { case ((_, fd), idx) =>

Collaborator

danking Jul 29, 2023

what's the view do?

Collaborator Author

patrick-schultz Jul 31, 2023

It makes a lazy collection wrapping fields, so the zipWithIndex doesn't materialize. See https://docs.scala-lang.org/overviews/collections/views.html
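
A rough Python analogue of the difference, for readers less familiar with Scala views. It is not the Scala code: `enumerate` over an iterator stands in for `.view.zipWithIndex`, `list(enumerate(...))` stands in for the strict `zipWithIndex`, and `counting` and `matches_key_prefix` are hypothetical helpers.

```python
from typing import Iterable, Iterator, List, Tuple


def counting(items: Iterable[str], pulled: List[int]) -> Iterator[str]:
    """Yield items unchanged, recording how many were actually consumed."""
    for item in items:
        pulled[0] += 1
        yield item


def matches_key_prefix(pairs: Iterable[Tuple[int, str]], key_names: List[str]) -> bool:
    # all() stops at the first False, much like forall in Scala.
    return all(idx < len(key_names) and fd == key_names[idx] for idx, fd in pairs)


fields = ["locus", "x", "alleles", "y"]
key_names = ["locus", "alleles"]

# Lazy: index/field pairs are produced one at a time, only as all() asks for them.
lazy_pulled = [0]
lazy_result = matches_key_prefix(enumerate(counting(fields, lazy_pulled)), key_names)

# Strict: every pair is built up front, before any check runs.
strict_pulled = [0]
strict_pairs = list(enumerate(counting(fields, strict_pulled)))
strict_result = matches_key_prefix(strict_pairs, key_names)

print(lazy_result, lazy_pulled[0])      # False 2
print(strict_result, strict_pulled[0])  # False 4
```

Both versions give the same answer; the lazy one simply never builds the intermediate collection of pairs.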


fix comments (86ffc6b)

ehigham previously requested changes

hail/src/main/scala/is/hail/expr/ir/ExtractIntervalFilters.scala Outdated

Comment on lines 71 to 72

		case SelectFields(old, fields) => fields.view.zipWithIndex.forall { case (fd, idx) =>
		idx < rowKeyType.size && fieldIsKeyField(old, fd, rowKeyType.fieldNames(idx))

Collaborator

ehigham Jul 31, 2023

Do we not eliminate nested SelectFields nodes? This case analysis and that in fieldIsKeyField seem to indicate as much.

Collaborator Author

patrick-schultz Jul 31, 2023

Not sure what nesting you're referring to. A common case here would be SelectFields(Ref("row"), ["locus", "alleles"]). We want to recognize this as being the key (or possibly a prefix of the key).

hail/src/main/scala/is/hail/expr/ir/ExtractIntervalFilters.scala Outdated

Comment on lines 71 to 72

		case SelectFields(old, fields) => fields.view.zipWithIndex.forall { case (fd, idx) =>
		idx < rowKeyType.size && fieldIsKeyField(old, fd, rowKeyType.fieldNames(idx))

Collaborator

ehigham Jul 31, 2023

Can you not lift the idx < keySize comparison out of the forall (i.e. fields.length < rowKeyType.size or something)?
The same for the above pattern for MakeStruct.
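
A Python sketch of the suggested hoisting, under simplifying assumptions: fields and key names are plain strings, and string equality stands in for the `fieldIsKeyField` check. Both function names are hypothetical. The sketch uses `<=` for the length check so that selecting the whole key still counts as a match; that is my reading of the per-element condition, not something taken from the merged code.

```python
from typing import List


def is_key_prefix_per_element(fields: List[str], key_names: List[str]) -> bool:
    """Bounds check repeated for every element, as in the reviewed snippet."""
    return all(
        idx < len(key_names) and fd == key_names[idx]
        for idx, fd in enumerate(fields)
    )


def is_key_prefix_hoisted(fields: List[str], key_names: List[str]) -> bool:
    """Bounds check done once, before comparing the fields pairwise."""
    return len(fields) <= len(key_names) and all(
        fd == key for fd, key in zip(fields, key_names)
    )


key_names = ["locus", "alleles"]
print(is_key_prefix_hoisted(["locus"], key_names))                  # True
print(is_key_prefix_hoisted(["locus", "alleles"], key_names))       # True
print(is_key_prefix_hoisted(["locus", "alleles", "x"], key_names))  # False
print(is_key_prefix_hoisted(["alleles"], key_names))                # False
```

The two forms agree because the per-element check `idx < len(key_names)` holds for every index exactly when `len(fields) <= len(key_names)`.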


address comments (bd466d6)

patrick-schultz requested a review from ehigham

July 31, 2023 19:29

patrick-schultz dismissed ehigham’s stale review

July 31, 2023 19:29

done

ehigham approved these changes

Collaborator

ehigham left a comment

Nice. I look forward to your subsequent PR that removes this recursive pattern matching

danking merged commit 3388359 into hail-is:main

8 checks passed

patrick-schultz deleted the interval-filter-fix branch

October 30, 2023 13:41
