[feature] add _localize flag to locus_windows to prevent serialization in ld_matrix#5690
Conversation
|
Tested and appears to run on a medium sized dataset (1700 samples), definitely saves a lot of time (no more ~45 minute break between stages while things serialize) |
34a56f5 to
4673249
Compare
|
I fixed the part where the behavior of locus_windows was changed, and now the behavior should be consistent with the previous version. (Some of the error types were changed, but I don't really see that as a breaking interface change). |
|
Is this the version that just caused konrad to OOM? |
| check_row_indexed('locus_windows', locus_expr) | ||
|
|
||
| # check loci are in sorted order | ||
| loci = locus_expr.global_position().collect() |
There was a problem hiding this comment.
Wait, is this going from hail to Python back to hail? Isn't just asserting that it's keyed by locus_expr enough?
There was a problem hiding this comment.
ah, crap. That should be a hl.agg.collect, probably. It's not quite asserting that it's keyed by locus_expr--as long as it's not out-of-order. Probably the same in the vast majority of cases.
There was a problem hiding this comment.
Yeah, I don't think I mind that constraint, but it's up to you. Alternatively, there's no KeyBy(die_if_unsorted) is there?
There was a problem hiding this comment.
No, there isn't :(. I'd prefer to keep it this way so there's no worry about accidentally suddenly breaking someone's pipeline.
There was a problem hiding this comment.
I folded it into one aggregation pass, though, so hopefully not too much overhead.
|
@tpoterba with NegativeArraySize you mean? No, that was different (well, I used aspects of this PR, but I don't think that's what made the difference) |
| .when(last_pos >= 0, contig_group_expr) | ||
| .or_error("locus_windows: 'locus_expr' has length 0")) | ||
|
|
||
| src = locus_expr._indices.source |
There was a problem hiding this comment.
this bit and below can be:
checked_contig_groups._aggregation_method(checked_contig_groups, _localize=False)
| .or_error("locus_windows: 'locus_expr' global position must be in ascending order.")), | ||
| -1, | ||
| hl.agg.collect(hl.case() | ||
| .when(hl.is_defined(locus_expr), locus_expr.global_position()) |
There was a problem hiding this comment.
around the whole thing probably
There was a problem hiding this comment.
locus_expr? This is hard/impossible to do with the current aggregation interface, right? I think best I can do is to annotate it as a field if it isn't already?
There was a problem hiding this comment.
sorry, I mean around the hl.case(). locus_expr could be extremely expensive to compute, we don't want to do it twice.
There was a problem hiding this comment.
We'd be computing it again in contig_group_expr, which is hard to wrap. I pulled it out above as a field so we'd only compute it once, but I can just do a bind around the case expression if we're ok with computing it twice and don't want that much code there
| .or_error("locus_windows: 'locus_expr' global position must be in ascending order.")), | ||
| -1, | ||
| hl.agg.collect(hl.case() | ||
| .when(hl.is_defined(locus_expr), locus_expr.global_position()) |
There was a problem hiding this comment.
sorry, I mean around the hl.case(). locus_expr could be extremely expensive to compute, we don't want to do it twice.
|
Also, this fixes #5407. |
| coord_expr = locus_expr.position | ||
|
|
||
| rg = locus_expr.dtype.reference_genome | ||
| contig_group_expr = hl.agg.group_by(hl.locus(locus_expr.contig, 1, reference_genome=rg), hl.agg.collect(coord_expr)) |
There was a problem hiding this comment.
sorry to keep going back and forth on this PR, but this stuff is a bit complicated.
This is a locus instead of a contig because we want it to sort in contig order below, right? At some point we can add a hl.contig_index(contig, rg) function which would let us just take the contig, but don't need to do that here.
There was a problem hiding this comment.
yeah, for sure! That would be a good addition. (I considered doing a bit of a hack to do essentially that here, but decided against it. Adding a function on the reference genome would be a good way to do it.)
Also you meant to approve, yeah? Because it just merged.
There was a problem hiding this comment.
yeah, this was a non-blocking discussion point.
leaving unassigned right now, since there's still some behavior change questions to be figured out.
cc @konradjk
Fixes #5407.