Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[hail] Add permit_shuffle option to split_multi #7071

Merged
merged 5 commits into from Sep 25, 2019

Conversation

@tpoterba
Copy link
Collaborator

@tpoterba tpoterba commented Sep 17, 2019

Makes it possible to proceed past the duplicate-multiallelic-loci error.

Makes it possible to proceed past the duplicate-multiallelic-loci error.
def split_multi(ds, keep_star=False, left_aligned=False):
left_aligned=bool,
permit_shuffle=bool)
def split_multi(ds, keep_star=False, left_aligned=False, *, permit_shuffle=False):
Copy link
Contributor

@catoverdrive catoverdrive Sep 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the * doing in this case? does it just enforce that permit_shuffle has to be defined as a kwarg?

Copy link
Collaborator Author

@tpoterba tpoterba Sep 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep. I think we should do this more going forward, since it makes back-compatibility easier (argument order doesn't mater)

Copy link
Contributor

@catoverdrive catoverdrive left a comment

I'm not entirely sure this addresses the problem, as I understand it---moved is a table containing only alleles for which the minrepped locus is different from the original. I think the case that you want to account for is e.g. a dataset that contains the following rows:

1:2587 A T ....
1:2587 AT TT,A,TTTTT ....

which I don't think would fall into that category? We should add a test for that sort of case with the permit_shuffle flag on. (I'd expect to see that split into something like the following:

1:2587 A T ...
1:2587 A T ...
1:2587 AT A ...
1:2587 A TTTT ...

)

@tpoterba tpoterba dismissed catoverdrive’s stale review Sep 19, 2019

fixed and added test.

def split_multi_hts(ds, keep_star=False, left_aligned=False, vep_root='vep'):
vep_root=str,
permit_shuffle=bool)
def split_multi_hts(ds, keep_star=False, left_aligned=False, vep_root='vep', *, permit_shuffle=False):
Copy link
Contributor

@catoverdrive catoverdrive Sep 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to pass the flag through to the actual split_multi call here

@@ -2013,7 +2018,7 @@ def error_on_moved(v):
.or_error("Found non-left-aligned variant in split_multi"))
return hl.bind(error_on_moved,
hl.min_rep(old_row.locus, [old_row.alleles[0], old_row.alleles[i]]))
return split_rows(hl.sorted(kept_alleles.map(make_struct)), False)
return split_rows(hl.sorted(kept_alleles.map(make_struct)), permit_shuffle)
Copy link
Contributor

@catoverdrive catoverdrive Sep 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not too sure it makes sense to allow both permit_shuffle and left_aligned here, since the left_aligned flag is kind of a performance thing to prevent the more expensive shuffle/join, but I also don't have strong feelings about keeping it like this.

def test_split_multi_shuffle(self):
ht = hl.utils.range_table(1)
ht = ht.annotate(keys=[hl.struct(locus=hl.locus('1', 1180), alleles=['A', 'T']),
hl.struct(locus=hl.locus('1', 1180), alleles=['A', 'C', 'G'])])
Copy link
Contributor

@catoverdrive catoverdrive Sep 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this will do the thing you want it to do; the exploded, re-keyed table you're passing in to split_multi will look like

1:1180 A,C,G
1:1180 A,T

which is already sorted. Maybe something like

1:1180 A,C,T
1:1180 A,G

would work?
(and maybe collect the unsplit alleles after rekeying and assert that they're going to split into the out-of-order thing you expect?)

Copy link
Collaborator Author

@tpoterba tpoterba Sep 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

huh, it totally looks like you should be right, but this was failing before I fixed the split multi flag passing. I'll investigate.

@tpoterba
Copy link
Collaborator Author

@tpoterba tpoterba commented Sep 23, 2019

aha! Looks like this caught a separate optimizer bug related to keying :(

Will fix this test and then fix that bug afterwars.

Copy link
Contributor

@catoverdrive catoverdrive left a comment

still need to pass the permit_shuffle flag through to the actual split_multi call in split_multi_hts.

@tpoterba
Copy link
Collaborator Author

@tpoterba tpoterba commented Sep 24, 2019

womp

@tpoterba tpoterba dismissed catoverdrive’s stale review Sep 24, 2019

fixed and added test

@danking danking merged commit 4e8fc80 into hail-is:master Sep 25, 2019
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

3 participants