-
Notifications
You must be signed in to change notification settings - Fork 242
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[feature] add _localize flag to locus_windows to prevent serialization in ld_matrix #5690
Conversation
Tested and appears to run on a medium sized dataset (1700 samples), definitely saves a lot of time (no more ~45 minute break between stages while things serialize) |
34a56f5
to
4673249
Compare
I fixed the part where the behavior of locus_windows was changed, and now the behavior should be consistent with the previous version. (Some of the error types were changed, but I don't really see that as a breaking interface change). |
Is this the version that just caused konrad to OOM? |
@@ -166,6 +166,16 @@ def locus_windows(locus_expr, radius, coord_expr=None, _localize=True): | |||
if radius < 0: | |||
raise ValueError(f"locus_windows: 'radius' must be non-negative, found {radius}") | |||
check_row_indexed('locus_windows', locus_expr) | |||
|
|||
# check loci are in sorted order | |||
loci = locus_expr.global_position().collect() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait, is this going from hail to Python back to hail? Isn't just asserting that it's keyed by locus_expr enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, crap. That should be a hl.agg.collect, probably. It's not quite asserting that it's keyed by locus_expr--as long as it's not out-of-order. Probably the same in the vast majority of cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I don't think I mind that constraint, but it's up to you. Alternatively, there's no KeyBy(die_if_unsorted)
is there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, there isn't :(. I'd prefer to keep it this way so there's no worry about accidentally suddenly breaking someone's pipeline.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I folded it into one aggregation pass, though, so hopefully not too much overhead.
@tpoterba with NegativeArraySize you mean? No, that was different (well, I used aspects of this PR, but I don't think that's what made the difference) |
.when(last_pos >= 0, contig_group_expr) | ||
.or_error("locus_windows: 'locus_expr' has length 0")) | ||
|
||
src = locus_expr._indices.source |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this bit and below can be:
checked_contig_groups._aggregation_method(checked_contig_groups, _localize=False)
.or_error("locus_windows: 'locus_expr' global position must be in ascending order.")), | ||
-1, | ||
hl.agg.collect(hl.case() | ||
.when(hl.is_defined(locus_expr), locus_expr.global_position()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
needs to be in a bind
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
around the whole thing probably
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
locus_expr
? This is hard/impossible to do with the current aggregation interface, right? I think best I can do is to annotate it as a field if it isn't already?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, I mean around the hl.case()
. locus_expr
could be extremely expensive to compute, we don't want to do it twice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'd be computing it again in contig_group_expr, which is hard to wrap. I pulled it out above as a field so we'd only compute it once, but I can just do a bind around the case expression if we're ok with computing it twice and don't want that much code there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one last change
.or_error("locus_windows: 'locus_expr' global position must be in ascending order.")), | ||
-1, | ||
hl.agg.collect(hl.case() | ||
.when(hl.is_defined(locus_expr), locus_expr.global_position()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, I mean around the hl.case()
. locus_expr
could be extremely expensive to compute, we don't want to do it twice.
Also, this fixes #5407. |
coord_expr = locus_expr.position | ||
|
||
rg = locus_expr.dtype.reference_genome | ||
contig_group_expr = hl.agg.group_by(hl.locus(locus_expr.contig, 1, reference_genome=rg), hl.agg.collect(coord_expr)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry to keep going back and forth on this PR, but this stuff is a bit complicated.
This is a locus instead of a contig because we want it to sort in contig order below, right? At some point we can add a hl.contig_index(contig, rg)
function which would let us just take the contig, but don't need to do that here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, for sure! That would be a good addition. (I considered doing a bit of a hack to do essentially that here, but decided against it. Adding a function on the reference genome would be a good way to do it.)
Also you meant to approve, yeah? Because it just merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, this was a non-blocking discussion point.
leaving unassigned right now, since there's still some behavior change questions to be figured out.
cc @konradjk
Fixes #5407.