Skip to content

[query] aggregation of split_multi collapses to one partition #13407

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
patrick-schultz opened this issue Aug 10, 2023 · 0 comments · Fixed by #13414
Closed

[query] aggregation of split_multi collapses to one partition #13407

patrick-schultz opened this issue Aug 10, 2023 · 0 comments · Fixed by #13414
Labels

Comments

@patrick-schultz
Copy link
Collaborator

For example,

mt = hl.split_multi_hts(mt, permit_shuffle=True)
mt = mt.annotate_cols(foo=hl.agg.sum(...))

performs the aggregation in one giant partition.

This is a known issue, and we are working on it. A workaround is to write the result of split_multi to disk before using it (which is a good idea regardless).

The cause appears to the interaction of several things:

  • split_multi computes two separate pieces, then merges them together with a TableUnion
  • The aggregation doesn't care how the mt is keyed, and we propagate that upstream
  • Because of that, the TableUnion ends up unioning two unkeyed tables
  • TableUnion, as currently implemented, can only produce a result with a strict partitioner. In the case of no key fields, that means one partition.
danking pushed a commit that referenced this issue Aug 16, 2023
fixes #13407

CHANGELOG: Resolves #13407 in which uses of `union_rows` could reduce
parallelism to one partition resulting in severely degraded performance.

TableUnion was always collapsing to a single partition when the key was
empty. This adds a special case handling, which just concatenates
partitions.

The body of the resulting TableStage is a little hacky: it does a
StreamMultiMerge, but where exactly one input stream is non-empty. I
think that should have fine performance, and I didn’t see any simpler
ways to do it.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant