-
Notifications
You must be signed in to change notification settings - Fork 256
[hail] Introduce ArrayFold2 to permit more deforesting
#6981
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Ported Benchmark: @benchmark
def table_range_means():
ht = hl.utils.range_table(10_000_000, 16)
ht = ht.annotate(m = hl.mean(hl.range(0, ht.idx % 1111)))
ht._force_count()Master: PR: |
Problem 1: functions like `min`/`max` need to do a length check, preventing us from implementing them in a single primitive fold. Problem 2: functions like `mean` use a struct as the accumulator, leading to allocation (!) every element. Solution to both: make it possible to have multiple primitive accumulators. This is `ArrayFold2`. The node is different from `ArrayFold` in that it: * has as sequence of accumulators, not just one * has a sequence of seq ops, one for each accumulator. Each of these sequence ops can see all the accumulators, and will see the updated value from sequence operations with a smaller index. * has a result op, which is a function from accumulators to result. By changing `min`/`max` to use ArrayFold2 and inlining these functions, we can get a reasonable speedup on `split_multi_hts`: hail-is#6980 (this PR's parent): ``` 2019-09-03 07:11:16,374: INFO: burn in: 42.33s 2019-09-03 07:11:56,085: INFO: run 1: 39.71s 2019-09-03 07:12:34,916: INFO: run 2: 38.83s 2019-09-03 07:13:14,087: INFO: run 3: 39.17s ``` PR: ``` 2019-09-03 07:32:10,416: INFO: burn in: 38.03s 2019-09-03 07:32:39,237: INFO: run 1: 28.82s 2019-09-03 07:33:07,778: INFO: run 2: 28.50s 2019-09-03 07:33:35,997: INFO: run 3: 28.21s ```
Benchmark: ```python @benchmark def table_range_means(): ht = hl.utils.range_table(10_000_000, 16) ht = ht.annotate(m = hl.mean(hl.range(0, ht.idx % 1111))) ht._force_count() ``` Master: ``` 2019-09-03 09:39:05,777: INFO: [1/1] Running table_range_means... 2019-09-03 09:40:52,557: INFO: burn in: 106.78s 2019-09-03 09:42:34,333: INFO: run 1: 101.78s 2019-09-03 09:44:14,982: INFO: run 2: 100.65s 2019-09-03 09:45:53,590: INFO: run 3: 98.61s ``` PR: ``` 2019-09-03 09:47:26,110: INFO: [1/1] Running table_range_means... 2019-09-03 09:47:29,465: INFO: burn in: 3.35s 2019-09-03 09:47:32,615: INFO: run 1: 3.15s 2019-09-03 09:47:35,703: INFO: run 2: 3.09s 2019-09-03 09:47:38,840: INFO: run 3: 3.14s ```
4c32ecc to
1a79591
Compare
|
parent PR is in, ready for review. Note that the next step is replacing ArrayFold with ArrayFold2. |
| case ArrayFilter(a, name, _) => if (i == 1) Array(name -> -coerce[TStreamable](a.typ).elementType) else empty | ||
| case ArrayFold(a, zero, accumName, valueName, _) => if (i == 2) Array(accumName -> zero.typ, valueName -> -coerce[TStreamable](a.typ).elementType) else empty | ||
| case ArrayFold2(a, accum, valueName, seq, result) => | ||
| val l = 2 + 2 * accum.length |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
l is unused
| If(Ref(first, TBoolean()), Ref(value, t), invoke(op, t, Ref(acc, t), Ref(value, t))), | ||
| False() | ||
| ), | ||
| Ref(acc, t)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keeping the boolean state seems unnecessary. What about only doing a fold when the array is non-empty, like:
val t = -coerce[TArray](a.typ).elementType
val accum = genUID()
val value = genUID()
val aUID = genUID()
val aRef = Ref(aUID, a.typ)
val zVal = ArrayRef(a, I32(0))
val body = invoke(op, t, Ref(value, t), Ref(accum, t))
Let(aUID, a,
If(ApplyComparisonOp(EQ(TInt32()), ArrayLen(aRef), I32(0)),
NA(t),
ArrayFold(aRef, zVal, accum, value, body)))There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(as discussed -- wrapping in a Let prevents deforesting)
| final case class ArrayFold(a: IR, zero: IR, accumName: String, valueName: String, body: IR) extends IR | ||
|
|
||
| object ArrayFold2 { | ||
| def apply(a: ArrayFold): ArrayFold2 = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to make ArrayFold.apply call this and get rid of ArrayFold completely?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is part 2, but it's a larger refactor and wanted to isolate the new behavior in a single PR.
Stacked on #6980
Problem 1: functions like
min/maxneed to do a length check,preventing us from implementing them in a single primitive fold.
Problem 2: functions like
meanuse a struct as the accumulator,leading to allocation (!) every element.
Solution: make it possible to have multiple primitive
accumulators. This is
ArrayFold2.The node is different from
ArrayFoldin that it:sequence ops can see all the accumulators, and will see the updated
value from sequence operations with a smaller index.
By changing
min/maxto use ArrayFold2 and inlining these functions,we can get a reasonable speedup on
split_multi_hts:#6980 (this PR's parent):
PR: