gCNV bugfix - make variant IDs unique #589

Merged: 7 commits, Jan 11, 2024

Contributor

@MattWellie MattWellie commented Jan 10, 2024

Test batch: https://batch.hail.populationgenomics.org.au/batches/431755

Weird unforeseen issue here - when passing the VCF through SVAnnotate in GATK, all the DEL variants were being removed. It seems that GATK-SV includes some semi-manual VCF parsing/translation to BED which retains only unique variant IDs.

  • gCNV calls all variants in the VCF as CHR POS ID DEL,DUP ..., i.e. each span is called as both DUP and DEL on the same row, sharing a single ID
  • These rows were previously split, as GATK-SV's AnnotateVcf can't handle multiallelics
  • Upon splitting, each row was broken into a DEL row and a DUP row, with the corresponding genotypes but the same ID
  • Passing this through AnnotateVcf created annotations, but only DUPs were retained
  • Working assumption is that, buried in the deep dark heart of GATK-SV, there's a plaintext manipulation of the variants that makes them unique on ID, so we only retain the variant type that comes second in the VCF for each span
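
The suspected failure mode is easy to reproduce in isolation. The following is only a sketch of the working assumption, not GATK-SV's actual code: `dedupe_on_id` is a hypothetical stand-in for whatever ID-keyed step exists internally, and the variant ID is made up.

```python
# Hypothetical illustration only: this is NOT code from GATK-SV.
def dedupe_on_id(rows):
    """Keep one row per variant ID; later rows overwrite earlier ones."""
    by_id = {}
    for chrom, pos, var_id, alt in rows:
        by_id[var_id] = (chrom, pos, var_id, alt)
    return list(by_id.values())


# Two rows produced by splitting one multiallelic gCNV record.
# Both rows share the same (made-up) ID.
split_rows = [
    ('chr1', 1000, 'CNV_chr1_1000_2000', '<DEL>'),
    ('chr1', 1000, 'CNV_chr1_1000_2000', '<DUP>'),
]

retained = dedupe_on_id(split_rows)
print(retained)
```

Under this assumption the DEL row is silently overwritten by the DUP row that follows it, which matches the observed behaviour.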

This change introduces an additional amendment to the VCF: the ID is edited to be ID_ALT. These values are unique across the whole dataset, and annotation now completes with both variant types retained.
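
A minimal sketch of the ID_ALT amendment on a single VCF data line is below. The function name `make_id_unique` is hypothetical, and stripping the angle brackets from a symbolic ALT such as `<DEL>` is an assumption; the exact form used in this PR may differ.

```python
def make_id_unique(vcf_line: str) -> str:
    """Return the line with its ID column rewritten as ID_ALT.

    Header lines (starting with '#') are returned unchanged.
    Assumes tab-separated columns: CHROM, POS, ID, REF, ALT, ...
    """
    if vcf_line.startswith('#'):
        return vcf_line
    fields = vcf_line.rstrip('\n').split('\t')
    # Assumption: drop the angle brackets of a symbolic allele, <DEL> -> DEL
    alt = fields[4].strip('<>')
    fields[2] = f'{fields[2]}_{alt}'
    return '\t'.join(fields) + '\n'


# Made-up example record, after the multiallelic split
line = 'chr1\t1000\tCNV_1\tN\t<DEL>\t.\t.\tSVTYPE=DEL\n'
print(make_id_unique(line), end='')
```

Because the DEL and DUP rows of a span differ in ALT, the rewritten IDs no longer collide.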

@MattWellie MattWellie merged commit e29f660 into main Jan 11, 2024
3 checks passed
@MattWellie MattWellie deleted the gcnv_bugfix_unique_id branch January 11, 2024 00:39
@jmarshall
Contributor

Not new in this PR, but

    headers.extend([
        '##INFO=<ID=SVTYPE,Number=1,Type=String,Description="SV Type">\n',
        '##INFO=<ID=SVLEN,Number=1,Type=Integer,Description="SV Length">\n'])

SVLEN should be Number=. or Number=A, which could make a difference to downstream picard/gatk validation.
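
For illustration, the suggested header might look like the sketch below. The empty `headers` list is an assumption so that the snippet runs on its own, and `Number=.` is just one of the two values suggested above; `Number=A` would be the alternative.

```python
# Assumption: headers is a list of VCF header lines built elsewhere.
headers = []

headers.extend([
    '##INFO=<ID=SVTYPE,Number=1,Type=String,Description="SV Type">\n',
    # Number=. declares an unknown/variable count instead of exactly one value
    '##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="SV Length">\n',
])

print(''.join(headers), end='')
```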

@MattWellie
Contributor Author

> SVLEN should be Number=. or Number=A, which could make a difference to downstream picard/gatk validation.

cheers, I've prepped that into the next PR
