Our main pipeline calls variants as multiallelic records, then retains that multiallelic representation through to the end of the workflow instead of splitting early:
Joint Called VCF is multiallelic
VQSR runs on this multiallelic representation and includes splitting variants as part of the standard workflow. When VQSR results are annotated onto the original data, this is done after splitting the callset.
VEP runs on the multiallelic representation
We read the multiallelic VCF into a MatrixTable, then join VEP annotations using locus as the only key (instead of the more typical locus, alt). Only then is the data split into atomic alt alleles, and VQSR annotations are added.
This does not work well in VEP: we have a recurrent failure to annotate a given fragment of the VCF. The same failure occurred on the Broad dataset, which prompted an upgrade to VEP 110. That solved the issue for a while, but we're seeing it again now.
This annotation problem goes away if the VCF is split into atomic variant representations. I have yet to raise a bug ticket with Ensembl for the failing VCF(s). I attempted that previously and they didn't have a solution: they just asked us to update our annotation sources, and we're already up to date.
Q?
Is there a reason not to split our joint-calling VCF immediately after calling, and proceed with atomic variants throughout the pipeline?
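To make the proposal concrete, here is a minimal pure-Python sketch of what "splitting into atomic variants" means for a single record. The `Variant` type and `split_multiallelic` helper are hypothetical names for illustration only, not pipeline code. In the real workflow this would presumably be done by an existing tool such as Hail's `hl.split_multi_hts`, which also downcodes genotypes; this sketch only splits the allele list and does no trimming to minimal representation.

```python
from typing import List, NamedTuple, Tuple


class Variant(NamedTuple):
    """Toy variant record (hypothetical, for illustration only)."""
    contig: str
    position: int
    alleles: Tuple[str, ...]  # ref first, then one or more alts


def split_multiallelic(variant: Variant) -> List[Variant]:
    """Return one biallelic Variant per alt allele, preserving alt order."""
    ref, *alts = variant.alleles
    return [Variant(variant.contig, variant.position, (ref, alt)) for alt in alts]


# Made-up example site with two alt alleles
multi = Variant("chr1", 12345, ("A", "C", "AT"))
for row in split_multiallelic(multi):
    print(row)
```

After splitting, every row has exactly one alt, so (locus, alleles) identifies a row uniquely and can be used as the join key downstream.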
TODO
matt to gather various code sections for where this splitting would occur
matt to gather code sections where we currently rely on multiallelic variant representations when joining data from the 3 sources (VCF, VEP, VQSR)
discuss with team - is there a benefit to our current process I'm not seeing? Is there a reason not to change?
See https://batch.hail.populationgenomics.org.au/batches/431187/jobs/1
Codey bits
Needs to be a compound (locus, alt) key here
Join here on full key, not just locus
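As a rough illustration of the two points above, the sketch below joins per-allele annotations onto already-split rows using the compound (locus, alleles) key. All names and data here are hypothetical (plain dicts stand in for the MatrixTable and the VEP table, and the consequence values are made up). In Hail the equivalent would presumably be indexing the VEP table with both `locus` and `alleles` rather than `locus` alone.

```python
# Rows as they would look after splitting: one alt allele per row.
split_rows = [
    {"locus": ("chr1", 12345), "alleles": ("A", "C")},
    {"locus": ("chr1", 12345), "alleles": ("A", "AT")},
]

# Hypothetical per-allele annotations, keyed on (locus, alleles).
vep_by_key = {
    (("chr1", 12345), ("A", "C")): "missense_variant",
    (("chr1", 12345), ("A", "AT")): "frameshift_variant",
}


def annotate(rows, annotations):
    """Attach the annotation matching each row's full (locus, alleles) key.

    Rows with no matching annotation get None instead of raising.
    """
    return [
        {**row, "consequence": annotations.get((row["locus"], row["alleles"]))}
        for row in rows
    ]


annotated = annotate(split_rows, vep_by_key)
for row in annotated:
    print(row["locus"], row["alleles"], row["consequence"])
```

With a locus-only key, both rows at chr1:12345 would match the same annotation entry, which is why the current approach has to join before splitting.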
Split here needs to be removed, unsure if locus_old and alleles_old are relevant if splitting doesn't occur...
Looks like VQSR table is already keyed correctly, and annotation is applied post-splitting here. My earlier criticism of VQSR is invalid...
An additional stage should follow JointGenotyping, or a splitting job should be included in the JointGenotyping stage here