Split Multiallelic Variants in main pipeline #586

Open
MattWellie opened this issue Jan 8, 2024 · 0 comments · May be fixed by #602
Labels
shared The change impacts both the LC and RD pipeline.

MattWellie commented Jan 8, 2024

Our main pipeline calls variants in a multiallelic representation and retains that representation through to the end of the workflow, instead of splitting into atomic variants early:

  • Joint Called VCF is multiallelic
  • VQSR runs on this multiallelic representation and includes splitting variants as part of the standard workflow. When VQSR results are annotated onto the original data, this is done after splitting the callset.
  • VEP runs on the multiallelic representation
  • We read the multiallelic VCF into a MatrixTable, then join VEP annotations using locus as the only key (instead of the more typical compound key of locus and alt). Only then is the data split into atomic alt alleles, and VQSR annotations are added
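
To illustrate the difference between the two keying schemes in the list above, here is a minimal pure-Python sketch. It is not the pipeline's Hail code: the `Variant` type, the helper names, and the annotation values are all hypothetical, and the split shown here omits the allele trimming and genotype recoding that a real splitting tool performs.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Variant(NamedTuple):
    """Hypothetical stand-in for a VCF row (not a Hail type)."""
    chrom: str
    pos: int
    ref: str
    alts: Tuple[str, ...]


def split_multiallelic(variant: Variant) -> List[Variant]:
    """Return one record per ALT allele.

    Simplification: real tools also trim shared bases and recode genotypes.
    """
    return [
        Variant(variant.chrom, variant.pos, variant.ref, (alt,))
        for alt in variant.alts
    ]


def atomic_key(variant: Variant) -> Tuple[str, int, str, str]:
    """Compound key: locus plus the single ALT allele."""
    if len(variant.alts) != 1:
        raise ValueError("atomic_key expects a split (single-ALT) variant")
    return (variant.chrom, variant.pos, variant.ref, variant.alts[0])


# Hypothetical per-allele annotations, keyed on the compound key.
annotations: Dict[Tuple[str, int, str, str], str] = {
    ("chr1", 1000, "A", "C"): "missense_variant",
    ("chr1", 1000, "A", "T"): "synonymous_variant",
}

multiallelic = Variant("chr1", 1000, "A", ("C", "T"))
atomic = split_multiallelic(multiallelic)

# Each atomic record picks up exactly the annotation for its own allele.
annotated: Dict[Tuple[str, int, str, str], Optional[str]] = {
    atomic_key(v): annotations.get(atomic_key(v)) for v in atomic
}
```

With a locus-only key, both alleles at `chr1:1000` would share a single lookup entry, so the join could only work while the VCF rows and the annotation rows are both multiallelic.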

This is not working well in VEP: we see a recurrent failure to annotate a given fragment of the VCF. The same failure occurred on the Broad dataset, which prompted an upgrade to VEP 110; that solved the issue for a while, but we're seeing it again now.

This annotation problem goes away if the VCF is split into atomic variant representations. I have yet to raise a bug ticket with Ensembl for the failing VCF(s); when I attempted that previously they had no solution and only asked us to update our annotation sources, which are already up to date.

Q?

Is there a reason not to split our joint-calling VCF immediately after calling, and proceed with atomic variants throughout the pipeline?

TODO

  • Matt to gather the code sections where this splitting would occur
  • Matt to gather the code sections where we currently rely on multiallelic variant representations when joining data from the 3 sources (VCF, VEP, VQSR)
  • discuss with team: is there a benefit to our current process that I'm not seeing? Is there a reason not to change?

see https://batch.hail.populationgenomics.org.au/batches/431187/jobs/1

Codey bits

Needs to be a compound (locus, alt) key here

Join here on full key, not just locus

Split here needs to be removed, unsure if locus_old and alleles_old are relevant if splitting doesn't occur...

Looks like VQSR table is already keyed correctly, and annotation is applied post-splitting here. My earlier criticism of VQSR is invalid...

An additional stage should follow JointGenotyping, or a splitting job should be included in the JointGenotyping stage here
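
As a rough sketch of what such a splitting job might run, the snippet below builds a `bcftools norm` command line. This is an assumption: the issue does not say which tool would do the splitting, and the function name and file paths are hypothetical. The `-m -any` option asks bcftools to split multiallelic sites of any variant type into biallelic records.

```python
import shlex
from typing import List


def build_split_command(input_vcf: str, output_vcf: str) -> List[str]:
    """Build a bcftools command that splits multiallelic sites.

    Assumption: bcftools is the splitting tool; the issue does not specify one.
    """
    return [
        "bcftools", "norm",
        "-m", "-any",      # split multiallelic sites into biallelic records
        "-O", "z",         # write bgzip-compressed VCF
        "-o", output_vcf,  # output path
        input_vcf,
    ]


# Hypothetical file names, for illustration only.
cmd = build_split_command("joint_called.vcf.gz", "joint_called.split.vcf.gz")
cmd_string = shlex.join(cmd)  # shell-safe string, e.g. for a batch job script
```

Returning the command as a list keeps it usable with `subprocess.run` without shell quoting concerns, while `shlex.join` gives a string form for job scripts.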

MattWellie linked a pull request on Jan 19, 2024 that will close this issue
vivbak added the "shared" label (The change impacts both the LC and RD pipeline.) on Apr 17, 2024