Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[feature] add _localize flag to locus_windows to prevent serialization in ld_matrix #5690

Merged
merged 13 commits into from
Apr 4, 2019

Conversation

catoverdrive
Copy link
Contributor

@catoverdrive catoverdrive commented Mar 25, 2019

leaving unassigned right now, since there's still some behavior change questions to be figured out.

cc @konradjk

Fixes #5407.

@konradjk
Copy link
Collaborator

Tested and appears to run on a medium sized dataset (1700 samples), definitely saves a lot of time (no more ~45 minute break between stages while things serialize)

@catoverdrive
Copy link
Contributor Author

I fixed the part where the behavior of locus_windows was changed, and now the behavior should be consistent with the previous version. (Some of the error types were changed, but I don't really see that as a breaking interface change).

@catoverdrive catoverdrive changed the title add _localize flag to locus_windows to prevent serialization in ld_matrix [feature] add _localize flag to locus_windows to prevent serialization in ld_matrix Mar 27, 2019
@tpoterba
Copy link
Contributor

Is this the version that just caused konrad to OOM?

@@ -166,6 +166,16 @@ def locus_windows(locus_expr, radius, coord_expr=None, _localize=True):
if radius < 0:
raise ValueError(f"locus_windows: 'radius' must be non-negative, found {radius}")
check_row_indexed('locus_windows', locus_expr)

# check loci are in sorted order
loci = locus_expr.global_position().collect()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, is this going from hail to Python back to hail? Isn't just asserting that it's keyed by locus_expr enough?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, crap. That should be a hl.agg.collect, probably. It's not quite asserting that it's keyed by locus_expr--as long as it's not out-of-order. Probably the same in the vast majority of cases.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I don't think I mind that constraint, but it's up to you. Alternatively, there's no KeyBy(die_if_unsorted) is there?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, there isn't :(. I'd prefer to keep it this way so there's no worry about accidentally suddenly breaking someone's pipeline.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I folded it into one aggregation pass, though, so hopefully not too much overhead.

@konradjk
Copy link
Collaborator

@tpoterba with NegativeArraySize you mean? No, that was different (well, I used aspects of this PR, but I don't think that's what made the difference)

.when(last_pos >= 0, contig_group_expr)
.or_error("locus_windows: 'locus_expr' has length 0"))

src = locus_expr._indices.source
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this bit and below can be:

checked_contig_groups._aggregation_method(checked_contig_groups, _localize=False)

.or_error("locus_windows: 'locus_expr' global position must be in ascending order.")),
-1,
hl.agg.collect(hl.case()
.when(hl.is_defined(locus_expr), locus_expr.global_position())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs to be in a bind

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

around the whole thing probably

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

locus_expr? This is hard/impossible to do with the current aggregation interface, right? I think best I can do is to annotate it as a field if it isn't already?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, I mean around the hl.case(). locus_expr could be extremely expensive to compute, we don't want to do it twice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'd be computing it again in contig_group_expr, which is hard to wrap. I pulled it out above as a field so we'd only compute it once, but I can just do a bind around the case expression if we're ok with computing it twice and don't want that much code there

@catoverdrive catoverdrive dismissed tpoterba’s stale review April 1, 2019 15:50

addressed comments

Copy link
Contributor

@tpoterba tpoterba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one last change

.or_error("locus_windows: 'locus_expr' global position must be in ascending order.")),
-1,
hl.agg.collect(hl.case()
.when(hl.is_defined(locus_expr), locus_expr.global_position())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, I mean around the hl.case(). locus_expr could be extremely expensive to compute, we don't want to do it twice.

@catoverdrive
Copy link
Contributor Author

Also, this fixes #5407.

coord_expr = locus_expr.position

rg = locus_expr.dtype.reference_genome
contig_group_expr = hl.agg.group_by(hl.locus(locus_expr.contig, 1, reference_genome=rg), hl.agg.collect(coord_expr))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry to keep going back and forth on this PR, but this stuff is a bit complicated.

This is a locus instead of a contig because we want it to sort in contig order below, right? At some point we can add a hl.contig_index(contig, rg) function which would let us just take the contig, but don't need to do that here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, for sure! That would be a good addition. (I considered doing a bit of a hack to do essentially that here, but decided against it. Adding a function on the reference genome would be a good way to do it.)

Also you meant to approve, yeah? Because it just merged.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, this was a non-blocking discussion point.

@danking danking merged commit 7307820 into hail-is:master Apr 4, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants