[hail] new aggregator path for TableKeyByAndAggregate #7194

Merged
merged 6 commits into from
Oct 8, 2019

catoverdrive
Contributor

This pretty closely mimics the existing execution pattern.

I changed BufferedArrayIterator so that it can store its internal buffer values in non-serializable form. I wanted to keep the intermediate aggregator states as region values for as long as possible, and only serialize them when we have to return something for the iterator we're passing to Spark's aggregateByKey function.

I'm also allocating one aggRegion per key. That way, when we send an aggregator state into the Spark aggregation, we can clean up the region that we were using for aggregation. I'm managing this region manually, though; I'm not sure how else to do it.

Contributor Author

catoverdrive commented Oct 3, 2019

Added a new benchmark:

wm2b0-b9b:hail wang$ hail-bench compare old.json new.json --metric median
               Name      Ratio    Time 1    Time 2
               ----      -----    ------    ------
group_by_take_rekey      23.5%    23.902     5.612
----------------------
Geometric mean: 23.5%
Simple mean: 23.5%
Median:  23.5%
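
For reference, the figures in the table are consistent with the ratio being Time 2 divided by Time 1. The short sketch below reproduces that arithmetic; `summarize` is a hypothetical helper written for this illustration, not part of hail-bench.

```python
from statistics import geometric_mean


def summarize(time_old: float, time_new: float) -> dict:
    """Compute the new/old time ratio and the equivalent speedup factor."""
    ratio = time_new / time_old
    return {
        "ratio": ratio,
        "ratio_pct": f"{ratio:.1%}",
        "speedup": time_old / time_new,
        # With a single benchmark, every summary statistic equals the one ratio.
        "geometric_mean": geometric_mean([ratio]),
    }


result = summarize(23.902, 5.612)
print(result["ratio_pct"])            # 23.5%
print(f"{result['speedup']:.2f}x")    # 4.26x
```

In other words, the new path runs this benchmark in a bit under a quarter of the old time, or roughly 4.3 times faster.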



@benchmark
def group_by_take_rekey():
Contributor

Can we add matrix table group_rows_by benchmarks that are the same as these two?

Contributor Author

Yeah, do you want me to do that here? I think they should lower into the same thing, right? So that would mostly serve to check that we're not messing up the lowering?

Contributor

Don't need to do it here, but it shouldn't be hard. The matrix one will be an array agg on the entries, rather than the whole row.

Contributor

tpoterba left a comment

awesome.

rvd = RVD.coerce(RVDType(newRowType, keyType.fieldNames), crdd))
} catch {
case e: agg.UnsupportedExtraction =>
log.info(s"couldn't lower TableAggregate: $e")
Contributor

fix

Contributor

Just need to fix the copy/paste error here and we can merge. The benchmarks can go in separately.
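
The snippet under review catches an unsupported-extraction error and logs that the node could not be lowered. A minimal Python sketch of that shape follows, assuming (the snippet does not show it) that execution then falls back to the existing path. `execute_with_fallback`, `new_path`, and `old_path` are hypothetical names, and `UnsupportedExtraction` here is a local stand-in for the Scala exception. Passing the node name as a parameter is one way to keep a copied log message from naming the wrong node.

```python
import logging

log = logging.getLogger("lowering_sketch")


class UnsupportedExtraction(Exception):
    """Local stand-in for the exception raised when aggregators can't be extracted."""


def execute_with_fallback(node_name, new_path, old_path):
    """Try the new aggregator path; on UnsupportedExtraction, log and fall back."""
    try:
        return new_path()
    except UnsupportedExtraction as e:
        # The node name is a parameter, so the message always matches the caller.
        log.info("couldn't lower %s: %s", node_name, e)
        return old_path()
```

For example, `execute_with_fallback("TableKeyByAndAggregate", run_new, run_old)` would return the result of `run_new()` unless it raises `UnsupportedExtraction`, in which case it logs the failure and returns `run_old()`; both callables are placeholders.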

danking merged commit 53229eb into hail-is:master on Oct 8, 2019
jigold pushed a commit to jigold/hail that referenced this pull request Oct 9, 2019
* [hail] new aggregator path for TableKeyByAndAggregate

* allow one agg region per key

* fix

* fix test

* add benchmark to exercise TableKeyByAndAggregate

* fix message