Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[hail] Add `permit_shuffle` option to `split_multi` #7071

Merged
merged 5 commits into from Sep 25, 2019

Conversation

@tpoterba
Copy link
Collaborator

commented Sep 17, 2019

Makes it possible to proceed past the duplicate-multiallelic-loci error.

Makes it possible to proceed past the duplicate-multiallelic-loci error.
def split_multi(ds, keep_star=False, left_aligned=False):
left_aligned=bool,
permit_shuffle=bool)
def split_multi(ds, keep_star=False, left_aligned=False, *, permit_shuffle=False):

This comment has been minimized.

Copy link
@catoverdrive

catoverdrive Sep 18, 2019

Collaborator

What's the * doing in this case? does it just enforce that permit_shuffle has to be defined as a kwarg?

This comment has been minimized.

Copy link
@tpoterba

tpoterba Sep 18, 2019

Author Collaborator

yep. I think we should do this more going forward, since it makes back-compatibility easier (argument order doesn't mater)

Copy link
Collaborator

left a comment

I'm not entirely sure this addresses the problem, as I understand it---moved is a table containing only alleles for which the minrepped locus is different from the original. I think the case that you want to account for is e.g. a dataset that contains the following rows:

1:2587 A T ....
1:2587 AT TT,A,TTTTT ....

which I don't think would fall into that category? We should add a test for that sort of case with the permit_shuffle flag on. (I'd expect to see that split into something like the following:

1:2587 A T ...
1:2587 A T ...
1:2587 AT A ...
1:2587 A TTTT ...

)

@tpoterba tpoterba dismissed catoverdrive’s stale review Sep 19, 2019

fixed and added test.

def split_multi_hts(ds, keep_star=False, left_aligned=False, vep_root='vep'):
vep_root=str,
permit_shuffle=bool)
def split_multi_hts(ds, keep_star=False, left_aligned=False, vep_root='vep', *, permit_shuffle=False):

This comment has been minimized.

Copy link
@catoverdrive

catoverdrive Sep 20, 2019

Collaborator

need to pass the flag through to the actual split_multi call here

@@ -2013,7 +2018,7 @@ def error_on_moved(v):
.or_error("Found non-left-aligned variant in split_multi"))
return hl.bind(error_on_moved,
hl.min_rep(old_row.locus, [old_row.alleles[0], old_row.alleles[i]]))
return split_rows(hl.sorted(kept_alleles.map(make_struct)), False)
return split_rows(hl.sorted(kept_alleles.map(make_struct)), permit_shuffle)

This comment has been minimized.

Copy link
@catoverdrive

catoverdrive Sep 20, 2019

Collaborator

I'm not too sure it makes sense to allow both permit_shuffle and left_aligned here, since the left_aligned flag is kind of a performance thing to prevent the more expensive shuffle/join, but I also don't have strong feelings about keeping it like this.

def test_split_multi_shuffle(self):
ht = hl.utils.range_table(1)
ht = ht.annotate(keys=[hl.struct(locus=hl.locus('1', 1180), alleles=['A', 'T']),
hl.struct(locus=hl.locus('1', 1180), alleles=['A', 'C', 'G'])])

This comment has been minimized.

Copy link
@catoverdrive

catoverdrive Sep 20, 2019

Collaborator

I don't think this will do the thing you want it to do; the exploded, re-keyed table you're passing in to split_multi will look like

1:1180 A,C,G
1:1180 A,T

which is already sorted. Maybe something like

1:1180 A,C,T
1:1180 A,G

would work?
(and maybe collect the unsplit alleles after rekeying and assert that they're going to split into the out-of-order thing you expect?)

This comment has been minimized.

Copy link
@tpoterba

tpoterba Sep 20, 2019

Author Collaborator

huh, it totally looks like you should be right, but this was failing before I fixed the split multi flag passing. I'll investigate.

@tpoterba

This comment has been minimized.

Copy link
Collaborator Author

commented Sep 23, 2019

aha! Looks like this caught a separate optimizer bug related to keying :(

Will fix this test and then fix that bug afterwars.

@tpoterba tpoterba dismissed catoverdrive’s stale review Sep 23, 2019

fixed test

Copy link
Collaborator

left a comment

still need to pass the permit_shuffle flag through to the actual split_multi call in split_multi_hts.

@tpoterba

This comment has been minimized.

Copy link
Collaborator Author

commented Sep 24, 2019

womp

@tpoterba tpoterba dismissed catoverdrive’s stale review Sep 24, 2019

fixed and added test

@danking danking merged commit 4e8fc80 into hail-is:master Sep 25, 2019
1 check passed
1 check passed
ci-test success
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.