Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[hail] Introduce `ArrayFold2` to permit more deforesting #6981

Merged
merged 9 commits into from Sep 16, 2019

Conversation

@tpoterba
Copy link
Collaborator

commented Sep 3, 2019

Stacked on #6980

Problem 1: functions like min/max need to do a length check,
preventing us from implementing them in a single primitive fold.

Problem 2: functions like mean use a struct as the accumulator,
leading to allocation (!) every element.

Solution: make it possible to have multiple primitive
accumulators. This is ArrayFold2.

The node is different from ArrayFold in that it:

  • has as sequence of accumulators, not just one
  • has a sequence of seq ops, one for each accumulator. Each of these
    sequence ops can see all the accumulators, and will see the updated
    value from sequence operations with a smaller index.
  • has a result op, which is a function from accumulators to result.

By changing min/max to use ArrayFold2 and inlining these functions,
we can get a reasonable speedup on split_multi_hts:

#6980 (this PR's parent):

2019-09-03 07:11:16,374: INFO:     burn in: 42.33s
2019-09-03 07:11:56,085: INFO:     run 1: 39.71s
2019-09-03 07:12:34,916: INFO:     run 2: 38.83s
2019-09-03 07:13:14,087: INFO:     run 3: 39.17s

PR:

2019-09-03 07:32:10,416: INFO:     burn in: 38.03s
2019-09-03 07:32:39,237: INFO:     run 1: 28.82s
2019-09-03 07:33:07,778: INFO:     run 2: 28.50s
2019-09-03 07:33:35,997: INFO:     run 3: 28.21s
@tpoterba tpoterba added the stacked PR label Sep 3, 2019
@tpoterba

This comment has been minimized.

Copy link
Collaborator Author

commented Sep 3, 2019

Ported ArrayFunctions.mean to ArrayFold2.

Benchmark:

@benchmark
def table_range_means():
    ht = hl.utils.range_table(10_000_000, 16)
    ht = ht.annotate(m = hl.mean(hl.range(0, ht.idx % 1111)))
    ht._force_count()

Master:

2019-09-03 09:39:05,777: INFO: [1/1] Running table_range_means...
2019-09-03 09:40:52,557: INFO:     burn in: 106.78s
2019-09-03 09:42:34,333: INFO:     run 1: 101.78s
2019-09-03 09:44:14,982: INFO:     run 2: 100.65s
2019-09-03 09:45:53,590: INFO:     run 3: 98.61s

PR:

2019-09-03 09:47:26,110: INFO: [1/1] Running table_range_means...
2019-09-03 09:47:29,465: INFO:     burn in: 3.35s
2019-09-03 09:47:32,615: INFO:     run 1: 3.15s
2019-09-03 09:47:35,703: INFO:     run 2: 3.09s
2019-09-03 09:47:38,840: INFO:     run 3: 3.14s
tpoterba added 7 commits Sep 3, 2019
Problem 1: functions like `min`/`max` need to do a length check,
preventing us from implementing them in a single primitive fold.

Problem 2: functions like `mean` use  a struct as the accumulator,
leading to allocation (!) every element.

Solution to both: make it possible to have multiple primitive
accumulators. This is `ArrayFold2`.

The node is different from `ArrayFold` in that it:
 * has as sequence of accumulators, not just one
 * has a sequence of seq ops, one for each accumulator. Each of these
   sequence ops can see all the accumulators, and will see the updated
   value from sequence operations with a smaller index.
 * has a result op, which is a function from accumulators to result.

By changing `min`/`max` to use ArrayFold2 and inlining these functions,
we can get a reasonable speedup on `split_multi_hts`:

#6980 (this PR's parent):

```
2019-09-03 07:11:16,374: INFO:     burn in: 42.33s
2019-09-03 07:11:56,085: INFO:     run 1: 39.71s
2019-09-03 07:12:34,916: INFO:     run 2: 38.83s
2019-09-03 07:13:14,087: INFO:     run 3: 39.17s
```

PR:
```
2019-09-03 07:32:10,416: INFO:     burn in: 38.03s
2019-09-03 07:32:39,237: INFO:     run 1: 28.82s
2019-09-03 07:33:07,778: INFO:     run 2: 28.50s
2019-09-03 07:33:35,997: INFO:     run 3: 28.21s
```
Benchmark:
```python
@benchmark
def table_range_means():
    ht = hl.utils.range_table(10_000_000, 16)
    ht = ht.annotate(m = hl.mean(hl.range(0, ht.idx % 1111)))
    ht._force_count()
```

Master:
```
2019-09-03 09:39:05,777: INFO: [1/1] Running table_range_means...
2019-09-03 09:40:52,557: INFO:     burn in: 106.78s
2019-09-03 09:42:34,333: INFO:     run 1: 101.78s
2019-09-03 09:44:14,982: INFO:     run 2: 100.65s
2019-09-03 09:45:53,590: INFO:     run 3: 98.61s
```

PR:
```
2019-09-03 09:47:26,110: INFO: [1/1] Running table_range_means...
2019-09-03 09:47:29,465: INFO:     burn in: 3.35s
2019-09-03 09:47:32,615: INFO:     run 1: 3.15s
2019-09-03 09:47:35,703: INFO:     run 2: 3.09s
2019-09-03 09:47:38,840: INFO:     run 3: 3.14s
```
fix
@tpoterba tpoterba removed the stacked PR label Sep 3, 2019
@tpoterba tpoterba force-pushed the tpoterba:array-fold-2 branch from 4c32ecc to 1a79591 Sep 3, 2019
@tpoterba

This comment has been minimized.

Copy link
Collaborator Author

commented Sep 3, 2019

parent PR is in, ready for review.

Note that the next step is replacing ArrayFold with ArrayFold2.

@@ -18,6 +18,14 @@ object Bindings {
case ArrayFlatMap(a, name, _) => if (i == 1) Array(name -> -coerce[TStreamable](a.typ).elementType) else empty
case ArrayFilter(a, name, _) => if (i == 1) Array(name -> -coerce[TStreamable](a.typ).elementType) else empty
case ArrayFold(a, zero, accumName, valueName, _) => if (i == 2) Array(accumName -> zero.typ, valueName -> -coerce[TStreamable](a.typ).elementType) else empty
case ArrayFold2(a, accum, valueName, seq, result) =>
val l = 2 + 2 * accum.length

This comment has been minimized.

Copy link
@patrick-schultz

patrick-schultz Sep 5, 2019

Collaborator

l is unused

If(Ref(first, TBoolean()), Ref(value, t), invoke(op, t, Ref(acc, t), Ref(value, t))),
False()
),
Ref(acc, t))

This comment has been minimized.

Copy link
@patrick-schultz

patrick-schultz Sep 5, 2019

Collaborator

Keeping the boolean state seems unnecessary. What about only doing a fold when the array is non-empty, like:

val t = -coerce[TArray](a.typ).elementType
val accum = genUID()
val value = genUID()
val aUID = genUID()
val aRef = Ref(aUID, a.typ)
val zVal = ArrayRef(a, I32(0))
val body = invoke(op, t, Ref(value, t), Ref(accum, t))

Let(aUID, a,
  If(ApplyComparisonOp(EQ(TInt32()), ArrayLen(aRef), I32(0)),
    NA(t),
    ArrayFold(aRef, zVal, accum, value, body)))

This comment has been minimized.

Copy link
@tpoterba

tpoterba Sep 12, 2019

Author Collaborator

(as discussed -- wrapping in a Let prevents deforesting)

@@ -235,6 +235,16 @@ final case class ArrayFlatMap(a: IR, name: String, body: IR) extends IR {
}
final case class ArrayFold(a: IR, zero: IR, accumName: String, valueName: String, body: IR) extends IR

object ArrayFold2 {
def apply(a: ArrayFold): ArrayFold2 = {

This comment has been minimized.

Copy link
@patrick-schultz

patrick-schultz Sep 5, 2019

Collaborator

Would it be possible to make ArrayFold.apply call this and get rid of ArrayFold completely?

This comment has been minimized.

Copy link
@tpoterba

tpoterba Sep 12, 2019

Author Collaborator

this is part 2, but it's a larger refactor and wanted to isolate the new behavior in a single PR.

@danking danking merged commit e067ad0 into hail-is:master Sep 16, 2019
1 check passed
1 check passed
ci-test success
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.