NDArray Sum Aggregator #9209


Merged · 30 commits · Aug 5, 2020

Conversation

johnc1231 (Contributor)

The ability to sum over many ndarrays with an aggregator is necessary for several distributed matmul-like operations. We'll probably want something more general in the future, but for now this should suffice. This PR:

  • Adds hl.agg.ndarray_sum, a Python aggregator that sums over ndarrays of the same shape (a usage sketch follows below)
  • Adds the corresponding Scala classes, mainly NDArraySumAggregator
  • Adds PNDArrayValue.sameShape, a helper function that checks whether two PNDArrays have the same shape

Future work:

  • Add an optional initOp that specifies a shape, for cases where we want the sum of no elements to be an ndarray of zeros.
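
For reference, a minimal usage sketch of the new aggregator; the table construction with hl.utils.range_table and hl.nd.array is illustrative only and not code from this PR:

import hail as hl

ht = hl.utils.range_table(10)
ht = ht.annotate(nd=hl.nd.array([[1.0, 2.0], [3.0, 4.0]]))   # one 2 x 2 ndarray per row

# Sums the per-row ndarrays elementwise; every row must have the same shape.
ht.aggregate(hl.agg.ndarray_sum(ht.nd))   # array([[10., 20.], [30., 40.]])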

danking (Contributor) previously requested changes, Aug 3, 2020:

Can you add a benchmark that sums a table of ndarrays?

johnc1231 (Contributor, Author) commented Aug 3, 2020

Sure. Any size preference? I could sum 100 two-dimensional 4096 x 4096 ndarrays, for block matrix / dndarray similarity.

danking (Contributor) commented Aug 3, 2020

Just big enough to get consistent timing numbers (probs easiest to check this on your laptop for now), imo. Also very curious to hear how master compares to this PR.

johnc1231 (Contributor, Author)

Doing 100 4096 x 4096s on my laptop took 1 min 20 s.

In [3]: %time ht.aggregate(hl.agg.ndarray_sum(ht.nd))
[Stage 1:====================================================>    (11 + 1) / 12]
CPU times: user 2.74 s, sys: 554 ms, total: 3.29 s
Wall time: 1min 20s
Out[3]:
array([[100., 100., 100., ..., 100., 100., 100.],
       [100., 100., 100., ..., 100., 100., 100.],
       [100., 100., 100., ..., 100., 100., 100.],
       ...,
       [100., 100., 100., ..., 100., 100., 100.],
       [100., 100., 100., ..., 100., 100., 100.],
       [100., 100., 100., ..., 100., 100., 100.]])
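
For reference, a plausible reconstruction of the setup behind that timing (not the exact benchmark code added in this PR; hl.utils.range_table and hl.nd.ones are assumed here):

import hail as hl

ht = hl.utils.range_table(100, n_partitions=12)   # 12 partitions, matching the stage output above
ht = ht.annotate(nd=hl.nd.ones((4096, 4096)))
ht.aggregate(hl.agg.ndarray_sum(ht.nd))           # expect a 4096 x 4096 array of 100.0s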

johnc1231 dismissed danking's stale review, August 3, 2020 21:18: Added a benchmark

johnc1231 (Contributor, Author)

Your hacky version took 1 min 41 s on my laptop. Not as big a difference as I'd hoped.

johnc1231 (Contributor, Author)

Ah, I bet I'm hitting the same problem I hit with the original NDArrayEmitter: my nested loops aren't getting JITted. I'll see if putting them in a separate method improves the situation.

johnc1231 (Contributor, Author)

OK, so with Cotton's new changes, emitting separate methods by hand is not something we do anymore. But there are two factors hurting the benchmark.

One is that the benchmark is hiding the fact that we're spending ~25 seconds serializing and deserializing JSON for this ndarray. So the real comparison is more like 55 seconds vs. 75 seconds, which is roughly a 25% speed improvement.

The other is that hl.nd.ones is just an alias for hl.nd.array(hl.range(shape_product)).map(lambda x: 1).reshape((n_rows, n_cols)). This creates a bunch of row-major data, copies it to column-major in a pretty cache-inefficient way during the reshape, and then does the additions, so that's eating some of the time too. We should probably have a way for the constant constructors to avoid going through a regular array.

Anyway, a 25% improvement plus a better interface is a win for now; we can revisit ways to make this faster in the future.

tpoterba (Contributor) commented Aug 4, 2020

You can definitely still generate methods by hand!

johnc1231 (Contributor, Author)

I talked to Cotton about it and he said not to. It's not yet clear how much of a difference that makes anyway. I think this version is pretty good and an improvement, plus it adds a benchmark we can work on optimizing.

patrick-schultz (Collaborator) left a comment

Looks good! Just a few high-level comments.

@@ -54,7 +56,11 @@ abstract class PNDArray extends PType {
abstract class PNDArrayValue extends PValue {
def apply(indices: IndexedSeq[Value[Long]], mb: EmitMethodBuilder[_]): Value[_]

override def pt: PNDArray = ???

patrick-schultz (Collaborator):

Why not leave abstract?

johnc1231 (Contributor, Author):

This lets me set the return type for things that override it so they don't have to be cast.

akotlar (Contributor), Aug 6, 2020:

Tim and I previously handled this using

private def ptND: PNDArray
override def pt: PType = ptND


override def resultType: PType = ndTyp

val stateType = PCanonicalTuple(true, PBooleanRequired, ndTyp)

patrick-schultz (Collaborator):

Because the seqOp skips missing values, the ndarray field can never be missing if the state is initialized, right? Is there a reason not to just let a missing ndarray be the uninitialized state? Then the result is exactly the state.

If not, up to you if you want to change it. I think it's a bit cleaner, but not a big difference.
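
As a toy Python model of the semantics being discussed (illustrative only, not the Scala implementation): the state is either missing (uninitialized) or the running sum, the seqOp skips missing inputs, and with a missing-as-uninitialized state the result is exactly the state:

import numpy as np

def init_op():
    return None                          # uninitialized state: no ndarray yet

def seq_op(state, nd):
    if nd is None:                       # seqOp skips missing values
        return state
    if state is None:                    # first non-missing ndarray becomes the state
        return nd.copy()
    assert state.shape == nd.shape       # cf. PNDArrayValue.sameShape
    return state + nd

def comb_op(left, right):
    if left is None:
        return right
    if right is None:
        return left
    return left + right

def result_op(state):
    return state                         # result is just the (possibly missing) state

total = comb_op(
    seq_op(seq_op(init_op(), np.ones((2, 2))), None),   # missing input is skipped
    seq_op(init_op(), np.ones((2, 2))))
print(result_op(total))                  # sums to a 2 x 2 array of 2.0s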

@@ -126,6 +126,10 @@ final case class PCanonicalNDArray(elementType: PType, nDims: Int, required: Boo
))
}

def mutateElement(indices: IndexedSeq[Value[Long]], ndAddress: Value[Long], newElement: Code[_], mb: EmitMethodBuilder[_]): Code[Unit] = {

patrick-schultz (Collaborator):

I think I'd expect this to be called setElement. What do you think?

johnc1231 (Contributor, Author):

Done

johnc1231 dismissed patrick-schultz's stale review, August 4, 2020 21:04: Addressed all comments

johnc1231 (Contributor, Author)

Addressed all comments, should be good to go.

One question I did have: I have to specify a region size in this aggregator, and I picked TINY. Does it matter? TINY seems small compared to the kinds of operations we'll do on dndarray and block matrix.

johnc1231 (Contributor, Author)

Added a new test that checks summing a mix of regular and transposed data, to make sure that future changes respect striding.
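
Roughly the shape of such a test, as a sketch (the actual test in this PR may differ; hl.if_else and the ndarray .T transpose are assumed here): summing an ndarray together with its transposed, differently strided view, so a future change that ignores strides would produce a wrong answer.

import hail as hl
import numpy as np

ht = hl.utils.range_table(10)
base = hl.nd.array([[0.0, 1.0], [2.0, 3.0]])
# Even rows use the row-major ndarray, odd rows use its transposed (strided) view.
ht = ht.annotate(nd=hl.if_else(ht.idx % 2 == 0, base, base.T))
result = ht.aggregate(hl.agg.ndarray_sum(ht.nd))
expected = 5 * np.array([[0.0, 1.0], [2.0, 3.0]]) + 5 * np.array([[0.0, 2.0], [1.0, 3.0]])
assert np.array_equal(result, expected)   # [[0., 15.], [15., 30.]]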

patrick-schultz (Collaborator)

> Addressed all comments, should be good to go.
>
> One question I did have: I have to specify a region size in this aggregator, and I picked TINY. Does it matter? TINY seems small compared to the kinds of operations we'll do on dndarray and block matrix.

That specifies the granularity of allocations. Each time the region needs more space than it has allocated, it allocates a new block of size determined by the size parameter. Unless it's trying to make a single allocation larger than the block size, in which case it allocates a block of exactly the desired size. The ideal thing here would be to make a single block with exactly the size needed for the state, but the current Region interface doesn't support that. The closest we can get is using the smallest block size, which is TINIER. Then the missing bit and pointer to the data would go in the initial tiny block, and the data would go there too if it's small enough, otherwise it would go in an allocation of exactly the right size.
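
A toy model of the block-allocation policy described above (illustrative only; this is not Hail's actual Region class):

class ToyRegion:
    def __init__(self, block_size):
        self.block_size = block_size
        self.blocks = []                                 # [capacity, used] per block

    def allocate(self, n_bytes):
        if n_bytes > self.block_size:
            # Oversized request: a block of exactly the requested size.
            self.blocks.append([n_bytes, n_bytes])
        elif not self.blocks or self.blocks[-1][0] - self.blocks[-1][1] < n_bytes:
            # Doesn't fit in the current block: a fresh block of block_size.
            self.blocks.append([self.block_size, n_bytes])
        else:
            self.blocks[-1][1] += n_bytes

region = ToyRegion(block_size=64)          # a small, "TINIER"-style block size
region.allocate(16)                        # missing bit + pointer fit in the first small block
region.allocate(4096 * 4096 * 8)           # large ndarray data gets its own exact-size block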

Comment on lines 27 to 29
def isInitialized(state: State): Code[Boolean] = {
stateType.isFieldDefined(state.off, ndarrayFieldNumber)
}

patrick-schultz (Collaborator):

Now that you're just using the missing bit, you shouldn't need this or ndArrayPointer. Instead of

cb.ifx(isInitialized(state), {
  val currentNDPValue = PCode(ndTyp, ndArrayPointer(state)).asNDArray.memoize(cb, "ndarray_sum_seqop_current")
  ...
}, {
  // uninitialized case
})

you can simplify to

val statePV = new PCanonicalBaseStructSettable(stateType, state.off)
...
statePV.loadField(ndarrayFieldNumber).consume(cb, {
  // uninitialized case
}, { currentNDPCode =>
  val currentNDPValue = currentNDPCode.asNDArray.memoize(cb, "ndarray_sum_seqop_current")
  ...
})

A bit simpler and more idiomatic.

johnc1231 (Contributor, Author)

Thanks for pushing me to clean that up with the consume infrastructure; it's much neater and shorter.

However, I can't seem to write a working _result function in that style. I tried several different ways, and all of them led to segfaults; I'm not sure why. I ended up just inlining calls to isFieldDefined and PBaseStruct.loadField for that one, as I didn't want to spend more time digging into it.

patrick-schultz (Collaborator)

Weird. Did you do something like this?

val statePV = new PCanonicalBaseStructSettable(stateType, state.off)
statePV.loadField(ndarrayFieldNumber)
       .consume(cb, srvb.setMissing(), srvb.addIRIntermediate(_, deepCopy = true))

Anyways, happy to approve if you don't want to mess with it.

johnc1231 (Contributor, Author)

Yup, I did that, and I also tried a version where I used state.get to get the emit code, consumed its toI, and used the result to get the pvalue whose field I loaded; that didn't work either. I'm going to choose not to mess with it.

danking merged commit 6bd1cee into hail-is:main on Aug 5, 2020.