Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[feature] add _localize flag to locus_windows to prevent serialization in ld_matrix #5690

Merged
merged 13 commits into from Apr 4, 2019

Conversation

Projects
None yet
4 participants
@catoverdrive
Copy link
Collaborator

commented Mar 25, 2019

leaving unassigned right now, since there's still some behavior change questions to be figured out.

cc @konradjk

Fixes #5407.

catoverdrive and others added some commits Mar 25, 2019

@konradjk

This comment has been minimized.

Copy link
Collaborator

commented Mar 27, 2019

Tested and appears to run on a medium sized dataset (1700 samples), definitely saves a lot of time (no more ~45 minute break between stages while things serialize)

@catoverdrive catoverdrive force-pushed the catoverdrive:locus-windows-function branch from 34a56f5 to 4673249 Mar 27, 2019

@catoverdrive

This comment has been minimized.

Copy link
Collaborator Author

commented Mar 27, 2019

I fixed the part where the behavior of locus_windows was changed, and now the behavior should be consistent with the previous version. (Some of the error types were changed, but I don't really see that as a breaking interface change).

@catoverdrive catoverdrive changed the title add _localize flag to locus_windows to prevent serialization in ld_matrix [feature] add _localize flag to locus_windows to prevent serialization in ld_matrix Mar 27, 2019

@tpoterba

This comment has been minimized.

Copy link
Collaborator

commented Mar 27, 2019

Is this the version that just caused konrad to OOM?

@@ -166,6 +166,16 @@ def locus_windows(locus_expr, radius, coord_expr=None, _localize=True):
if radius < 0:
raise ValueError(f"locus_windows: 'radius' must be non-negative, found {radius}")
check_row_indexed('locus_windows', locus_expr)

# check loci are in sorted order
loci = locus_expr.global_position().collect()

This comment has been minimized.

Copy link
@konradjk

konradjk Mar 27, 2019

Collaborator

Wait, is this going from hail to Python back to hail? Isn't just asserting that it's keyed by locus_expr enough?

This comment has been minimized.

Copy link
@catoverdrive

catoverdrive Mar 27, 2019

Author Collaborator

ah, crap. That should be a hl.agg.collect, probably. It's not quite asserting that it's keyed by locus_expr--as long as it's not out-of-order. Probably the same in the vast majority of cases.

This comment has been minimized.

Copy link
@konradjk

konradjk Mar 27, 2019

Collaborator

Yeah, I don't think I mind that constraint, but it's up to you. Alternatively, there's no KeyBy(die_if_unsorted) is there?

This comment has been minimized.

Copy link
@catoverdrive

catoverdrive Mar 27, 2019

Author Collaborator

No, there isn't :(. I'd prefer to keep it this way so there's no worry about accidentally suddenly breaking someone's pipeline.

This comment has been minimized.

Copy link
@catoverdrive

catoverdrive Mar 27, 2019

Author Collaborator

I folded it into one aggregation pass, though, so hopefully not too much overhead.

@konradjk

This comment has been minimized.

Copy link
Collaborator

commented Mar 27, 2019

@tpoterba with NegativeArraySize you mean? No, that was different (well, I used aspects of this PR, but I don't think that's what made the difference)

catoverdrive added some commits Mar 27, 2019

fix
.when(last_pos >= 0, contig_group_expr)
.or_error("locus_windows: 'locus_expr' has length 0"))

src = locus_expr._indices.source

This comment has been minimized.

Copy link
@tpoterba

tpoterba Mar 28, 2019

Collaborator

this bit and below can be:

checked_contig_groups._aggregation_method(checked_contig_groups, _localize=False)
.or_error("locus_windows: 'locus_expr' global position must be in ascending order.")),
-1,
hl.agg.collect(hl.case()
.when(hl.is_defined(locus_expr), locus_expr.global_position())

This comment has been minimized.

Copy link
@tpoterba

tpoterba Mar 28, 2019

Collaborator

needs to be in a bind

This comment has been minimized.

Copy link
@tpoterba

tpoterba Mar 28, 2019

Collaborator

around the whole thing probably

This comment has been minimized.

Copy link
@catoverdrive

catoverdrive Mar 28, 2019

Author Collaborator

locus_expr? This is hard/impossible to do with the current aggregation interface, right? I think best I can do is to annotate it as a field if it isn't already?

This comment has been minimized.

Copy link
@tpoterba

tpoterba Apr 2, 2019

Collaborator

sorry, I mean around the hl.case(). locus_expr could be extremely expensive to compute, we don't want to do it twice.

This comment has been minimized.

Copy link
@catoverdrive

catoverdrive Apr 2, 2019

Author Collaborator

We'd be computing it again in contig_group_expr, which is hard to wrap. I pulled it out above as a field so we'd only compute it once, but I can just do a bind around the case expression if we're ok with computing it twice and don't want that much code there

addressed comments

@tpoterba
Copy link
Collaborator

left a comment

one last change

.or_error("locus_windows: 'locus_expr' global position must be in ascending order.")),
-1,
hl.agg.collect(hl.case()
.when(hl.is_defined(locus_expr), locus_expr.global_position())

This comment has been minimized.

Copy link
@tpoterba

tpoterba Apr 2, 2019

Collaborator

sorry, I mean around the hl.case(). locus_expr could be extremely expensive to compute, we don't want to do it twice.

fix

addressed

@catoverdrive

This comment has been minimized.

Copy link
Collaborator Author

commented Apr 3, 2019

Also, this fixes #5407.

coord_expr = locus_expr.position

rg = locus_expr.dtype.reference_genome
contig_group_expr = hl.agg.group_by(hl.locus(locus_expr.contig, 1, reference_genome=rg), hl.agg.collect(coord_expr))

This comment has been minimized.

Copy link
@tpoterba

tpoterba Apr 4, 2019

Collaborator

sorry to keep going back and forth on this PR, but this stuff is a bit complicated.

This is a locus instead of a contig because we want it to sort in contig order below, right? At some point we can add a hl.contig_index(contig, rg) function which would let us just take the contig, but don't need to do that here.

This comment has been minimized.

Copy link
@catoverdrive

catoverdrive Apr 4, 2019

Author Collaborator

yeah, for sure! That would be a good addition. (I considered doing a bit of a hack to do essentially that here, but decided against it. Adding a function on the reference genome would be a good way to do it.)

Also you meant to approve, yeah? Because it just merged.

This comment has been minimized.

Copy link
@tpoterba

tpoterba Apr 4, 2019

Collaborator

yeah, this was a non-blocking discussion point.

@danking danking merged commit 7307820 into hail-is:master Apr 4, 2019

1 check passed

hail-ci-0-1 successful build
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.