Split Multiallelic Variants in main pipeline #586

MattWellie · 2024-01-08T23:46:07Z

Our main pipeline calls variants in a multiallelic representation and retains that representation through to the end of the workflow, instead of splitting into atomic variants. The current flow:

Joint Called VCF is multiallelic
VQSR runs on this multiallelic representation and includes splitting variants as part of the standard workflow. When VQSR results are annotated onto the original data, this is done after splitting the callset.
VEP runs on the multiallelic representation
We read the multiallelic VCF into a MatrixTable, then join VEP annotations using locus as the only key (instead of (locus, alt), which would be typical). Only then is the data split into atomic alt alleles and the VQSR annotations added
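To make the keying concern concrete, here is a minimal pure-Python sketch (not the pipeline's Hail code) of why a locus-only join is ambiguous at a multiallelic site, while a (locus, alt) join is not. The records, consequence strings and function names are made up for illustration.

```python
# Hypothetical multiallelic row: (chrom, pos, ref, alts)
multiallelic_row = ("chr1", 1000, "A", ["C", "T"])

# Hypothetical per-allele annotations, keyed by (chrom, pos, ref, alt)
annotations = {
    ("chr1", 1000, "A", "C"): "missense_variant",
    ("chr1", 1000, "A", "T"): "synonymous_variant",
}


def join_on_locus(row, annots):
    """Locus-only key: every annotation at the position matches the row."""
    chrom, pos, _ref, _alts = row
    return sorted(
        csq for (c, p, _r, _a), csq in annots.items() if (c, p) == (chrom, pos)
    )


def join_on_locus_and_alt(row, annots):
    """Compound key: each alt allele picks up exactly its own annotation."""
    chrom, pos, ref, alts = row
    return {alt: annots.get((chrom, pos, ref, alt)) for alt in alts}


# Locus-only: two candidates for one row, with no way to tell them apart
print(join_on_locus(multiallelic_row, annotations))
# Compound key: one annotation per alt allele
print(join_on_locus_and_alt(multiallelic_row, annotations))
```

With the locus-only key, the allele-to-annotation mapping has to be recovered later; with the compound key it is explicit from the start.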

This is not working well in VEP: we have a recurrent failure to annotate a given fragment of the VCF. The same failure occurred on the Broad dataset, which prompted an upgrade to VEP 110 (that solved the issue for a while, but we're seeing it again now).

This annotation problem goes away if the VCF is split into atomic variant representations. I've yet to file a bug ticket with Ensembl for the failing VCF(s); I attempted that previously and they didn't have a solution. They just asked us to update our annotation sources, and we're already up to date.

Q?

Is there a reason not to split our joint-calling VCF immediately after calling, and proceed with atomic variants throughout the pipeline?
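For reference, a rough sketch of what splitting into atomic variants means at the record level. This is illustrative pure Python, not the pipeline's implementation: the function name and tuple layout are hypothetical, genotypes are assumed unphased, and alleles other than the one of interest are downcoded to REF (one common convention). It does no left-alignment or trimming of the split alleles. In practice an existing tool such as Hail's `hl.split_multi_hts` or `bcftools norm -m -any` would do this step.

```python
def split_multiallelic(chrom, pos, ref, alts, genotypes):
    """Return one biallelic record per ALT allele.

    genotypes: list of tuples of allele indices (0 = REF, i = i-th ALT).
    In each output record the allele of interest is recoded to 1 and
    every other allele is downcoded to 0.
    """
    records = []
    for a_index, alt in enumerate(alts, start=1):
        recoded = [
            # sort so unphased calls read as (0, 1) rather than (1, 0)
            tuple(sorted(1 if allele == a_index else 0 for allele in gt))
            for gt in genotypes
        ]
        records.append((chrom, pos, ref, alt, recoded))
    return records


# One site with two ALT alleles and three samples: 0/1, 1/2, 2/2
split = split_multiallelic("chr1", 1000, "A", ["C", "T"], [(0, 1), (1, 2), (2, 2)])
for record in split:
    print(record)
```

After this step every row has exactly one alt allele, so (locus, alt) is a unique key for all downstream joins.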

TODO

~~matt to gather various code sections for where this splitting would occur~~
~~matt to gather code sections where we currently rely on multiallelic variant representations when joining data from the 3 sources (VCF, VEP, VQSR)~~
discuss with team - is there a benefit to our current process I'm not seeing? Is there a reason not to change?

see https://batch.hail.populationgenomics.org.au/batches/431187/jobs/1

Codey bits

Needs to be a compound (locus, alt) key here

Join here on full key, not just locus

The split here needs to be removed. Unsure whether locus_old and alleles_old are still relevant if splitting doesn't occur at this point...

Looks like VQSR table is already keyed correctly, and annotation is applied post-splitting here. My earlier criticism of VQSR is invalid...

An additional stage should follow JointGenotyping, or a splitting job should be included in the JointGenotyping stage here

MattWellie assigned jmarshall, MattWellie, EddieLF and vivbak Jan 8, 2024

MattWellie linked a pull request Jan 19, 2024 that will close this issue

586 Multiallelic splitting #602

Open

vivbak added the "shared" label (The change impacts both the LC and RD pipeline.) Apr 17, 2024
