[compiler] rewrite ExtractIntervalFilters to be more robust #13355

Merged: 46 commits merged into hail-is:main on Oct 13, 2023

patrick-schultz
Collaborator

@patrick-schultz patrick-schultz commented Aug 1, 2023

CHANGELOG: make hail's optimization rewriting filters to interval-filters smarter and more robust

Completely rewrites ExtractIntervalFilters. Instead of matching against very specific patterns, and failing completely for things that don't quite match (e.g. an input is let bound, or the fold implementing "locus is contained in a set of intervals" is written slightly differently), this uses a standard abstract interpretation framework, which is almost completely insensitive to the form of the IR, only depending on the semantics. It also correctly handles missing key fields, where the previous implementation often produced an unsound transformation of the IR.
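
To illustrate why an abstract interpretation is insensitive to let bindings, here is a minimal sketch over a toy IR. Everything in it (the node types `Key`, `Const`, `Let`, `Lt`, `GtEq`, `And`, the abstract values, and `analyze`) is hypothetical and far simpler than the code in this PR; it assumes a single `Int` key that is never missing and tracks only a half-open bound on where a boolean can be true.

```scala
object LetInsensitiveSketch {
  // Toy IR, for illustration only; these are not Hail's IR nodes.
  sealed trait Expr
  case object Key extends Expr
  final case class Const(v: Int) extends Expr
  final case class Ref(name: String) extends Expr
  final case class Let(name: String, value: Expr, body: Expr) extends Expr
  final case class Lt(l: Expr, r: Expr) extends Expr
  final case class GtEq(l: Expr, r: Expr) extends Expr
  final case class And(l: Expr, r: Expr) extends Expr

  // Abstract values: "is the key", "is a known constant", or
  // "a boolean that can only be true when the key lies in [lo, hi)".
  sealed trait AbsValue
  case object IsKey extends AbsValue
  final case class IsConst(v: Int) extends AbsValue
  final case class TrueBound(lo: Long, hi: Long) extends AbsValue

  // No information: the boolean may be true for any Int key.
  val Top: TrueBound = TrueBound(Long.MinValue, Long.MaxValue)

  private def asBound(v: AbsValue): TrueBound = v match {
    case b: TrueBound => b
    case _ => Top
  }

  def analyze(e: Expr, env: Map[String, AbsValue] = Map.empty): AbsValue = e match {
    case Key => IsKey
    case Const(v) => IsConst(v)
    case Ref(n) => env.getOrElse(n, Top)
    // A let binding just extends the abstract environment.
    case Let(n, v, b) => analyze(b, env + (n -> analyze(v, env)))
    case Lt(l, r) => (analyze(l, env), analyze(r, env)) match {
      case (IsKey, IsConst(c)) => TrueBound(Long.MinValue, c.toLong)
      case _ => Top
    }
    case GtEq(l, r) => (analyze(l, env), analyze(r, env)) match {
      case (IsKey, IsConst(c)) => TrueBound(c.toLong, Long.MaxValue)
      case _ => Top
    }
    // Both sides must be true, so the bounds intersect.
    case And(l, r) =>
      val a = asBound(analyze(l, env))
      val b = asBound(analyze(r, env))
      TrueBound(math.max(a.lo, b.lo), math.min(a.hi, b.hi))
  }

  // The same predicate written directly and with let-bound inputs.
  val direct: Expr = And(GtEq(Key, Const(10)), Lt(Key, Const(20)))
  val letBound: Expr =
    Let("k", Key, Let("lo", Const(10),
      And(GtEq(Ref("k"), Ref("lo")), Lt(Ref("k"), Const(20)))))
}
```

Both `direct` and `letBound` analyze to the same bound, because the analysis follows what each subexpression means rather than how it is spelled.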

Also adds a much more thorough test suite than we had before.

At the top level, the analysis takes a boolean typed IR cond in an environment where there is a reference to some key, and produces a set of intervals, such that cond is equivalent to cond & intervals.contains(key) (in other words cond implies intervals.contains(key), or intervals contains all rows where cond is true). This means for instance it is safe to replace TableFilter(t, cond) with TableFilter(TableFilterIntervals(t, intervals), cond).

Then in a second pass it rewrites cond to cond2, such that cond & intervals.contains(key) is equivalent to cond2 & intervals.contains(key) (in other words cond implies cond2, and cond2 & intervals.contains(key) implies cond). This means it is safe to replace TableFilter(t, cond) with TableFilter(TableFilterIntervals(t, intervals), cond2). A common example is when cond can be completely captured by the interval filter, i.e. cond is equivalent to intervals.contains(key), in which case we can take cond2 = True, and the TableFilter can be optimized away.
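
The two equivalences above can be checked by brute force on a small model. This is only a sketch: it assumes `Int` keys and half-open intervals, and `Interval`, `containsKey`, `cond`, and `cond2` here are hypothetical stand-ins for the IR-level objects, chosen by hand rather than produced by the analysis.

```scala
object FilterRewriteSketch {
  // Hypothetical key interval: half-open [start, end) over Int keys.
  final case class Interval(start: Int, end: Int) {
    def contains(k: Int): Boolean = start <= k && k < end
  }

  def containsKey(intervals: Seq[Interval], k: Int): Boolean =
    intervals.exists(_.contains(k))

  // Original predicate: a key range test plus a residual test.
  val cond: Int => Boolean = k => k >= 10 && k < 20 && k % 3 == 0
  // What the first pass would be expected to extract for this cond.
  val intervals: Seq[Interval] = Seq(Interval(10, 20))
  // What the second pass would be expected to leave behind.
  val cond2: Int => Boolean = k => k % 3 == 0

  val keys: Seq[Int] = 0 until 40

  // First pass: cond is equivalent to cond & intervals.contains(key).
  val firstPassSound: Boolean =
    keys.forall(k => cond(k) == (cond(k) && containsKey(intervals, k)))

  // Second pass: cond & contains is equivalent to cond2 & contains.
  val secondPassSound: Boolean = keys.forall { k =>
    (cond(k) && containsKey(intervals, k)) == (cond2(k) && containsKey(intervals, k))
  }

  // Filtering directly vs. filtering by intervals first, then by cond2.
  val filteredOriginal: Seq[Int] = keys.filter(cond)
  val filteredRewritten: Seq[Int] =
    keys.filter(k => containsKey(intervals, k)).filter(cond2)
}
```

Note that `cond2` alone is weaker than `cond` (it accepts key 3, for example); it only needs to agree with `cond` on keys that survive the interval filter.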

This all happens in the function

  def extractPartitionFilters(ctx: ExecuteContext, cond: IR, ref: Ref, key: IndexedSeq[String]): Option[(IR, IndexedSeq[Interval])] = {
    if (key.isEmpty) None
    else {
      val extract = new ExtractIntervalFilters(ctx, ref.typ.asInstanceOf[TStruct].typeAfterSelectNames(key))
      val trueSet = extract.analyze(cond, ref.name)
      if (trueSet == extract.KeySetLattice.top)
        None
      else {
        val rw = extract.Rewrites(mutable.Set.empty, mutable.Set.empty)
        extract.analyze(cond, ref.name, Some(rw), trueSet)
        Some((extract.rewrite(cond, rw), trueSet))
      }
    }
  }

trueSet is the set of intervals which contains all rows where cond is true. This set is passed back into analyze in a second pass, which asks it to rewrite cond to something equivalent, under the assumption that all keys are contained in trueSet.

The abstraction of runtime values tracks two types of information:

  • Is this value a reference to / copy of one of the key fields of this row? We need to know this to be able to recognize comparisons with key values, which we want to extract to interval filters.
  • For boolean values (including, ultimately, the filter predicate itself), we track three sets of intervals of the key type: overapproximations of when the bool is true, false, and missing. Overapproximation here means, for example, if the boolean evaluates to true in some row with key k, then k must be contained in the "true" set of intervals. But it's completely fine if the set of intervals contains keys of rows where the bool is not true. In particular, a boolean about which we know nothing (e.g. it's just some non-key boolean field in the dataset) is represented by an abstract boolean value where all three sets are the set of all keys.
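
As a rough sketch of the second kind of information, an abstract boolean can be modelled as three overapproximating key sets. The field names `trueBound`, `falseBound`, and `naBound` follow the `BoolValue` constructor call that appears in this PR's diff; everything else is a simplifying assumption: key sets are plain `Set[Int]` over a ten-key universe instead of interval sets, keys are never missing, and `keyLessThan`, `and`, and `not` are hypothetical helpers.

```scala
object AbstractBoolSketch {
  // Assumption: a tiny finite key universe, so a key set is just a Set[Int].
  type KeySet = Set[Int]
  val allKeys: KeySet = (0 until 10).toSet

  // Each bound overapproximates the keys of rows where the boolean
  // is true, false, or missing, respectively.
  final case class BoolValue(trueBound: KeySet, falseBound: KeySet, naBound: KeySet)

  // A boolean we know nothing about, e.g. a non-key boolean field.
  val unknown: BoolValue = BoolValue(allKeys, allKeys, allKeys)

  // Abstract value of `key < c`, assuming the key is never missing.
  def keyLessThan(c: Int): BoolValue =
    BoolValue(allKeys.filter(_ < c), allKeys.filter(_ >= c), Set.empty)

  // l && r is true only where both are true, and can be false (or missing)
  // only where at least one side is false (or missing). Taking unions for
  // the last two bounds stays an overapproximation.
  def and(l: BoolValue, r: BoolValue): BoolValue = BoolValue(
    l.trueBound intersect r.trueBound,
    l.falseBound union r.falseBound,
    l.naBound union r.naBound
  )

  // Negation swaps the true and false bounds.
  def not(x: BoolValue): BoolValue =
    BoolValue(x.falseBound, x.trueBound, x.naBound)

  // `key < 5 && someUnknownField`: the true bound is still usable as a filter.
  val combined: BoolValue = and(keyLessThan(5), unknown)
}
```

Even though one operand of `combined` is completely unknown, its `trueBound` is still restricted to keys below 5, which is the part that can be pushed into an interval filter.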

def meet(l: Value, r: Value): Value
}

object AbstactLattice {
Contributor

missing an r: AbstractLattice

# Conflicts:
#	hail/python/test/hail/extract_intervals/test_key_prefix.py
#	hail/src/main/scala/is/hail/expr/ir/ExtractIntervalFilters.scala
#	hail/testng.xml
# Conflicts:
#	hail/src/main/scala/is/hail/expr/ir/lowering/LowerDistributedSort.scala
#	hail/src/test/scala/is/hail/expr/ir/BlockMatrixIRSuite.scala
#	hail/src/test/scala/is/hail/expr/ir/ExtractIntervalFiltersSuite.scala
#	hail/src/test/scala/is/hail/expr/ir/IRSuite.scala
Member

@ehigham ehigham left a comment

This is really cool and I like this change a lot. Sorry it's taken a while to post a review.
Most of my comments are on implementation details. I have no real gripes with the design.

KeySetLattice.meet(x.naBound, acc)
}
BoolValue(trueBound, falseBound, naBound)
aVals.asInstanceOf[Seq[BoolValue]].reduce(BoolValue.coalesce)
Member

beautiful

@danking danking merged commit bd6e397 into hail-is:main Oct 13, 2023
@patrick-schultz patrick-schultz deleted the interval-filter-fix2 branch October 30, 2023 13:42
3 participants