Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[vcf] Replace write_VCF_translation with treetime's write_vcf #1356

Open
jameshadfield opened this issue Dec 12, 2023 · 3 comments
Open

[vcf] Replace write_VCF_translation with treetime's write_vcf #1356

jameshadfield opened this issue Dec 12, 2023 · 3 comments

Comments

@jameshadfield
Copy link
Member

jameshadfield commented Dec 12, 2023

augur.io.vcf includes write_VCF_translation which is used by augur translate. This functionality is better implemented (and now tested) in treetime so we should use that function instead.

@jameshadfield jameshadfield self-assigned this Dec 13, 2023
@jameshadfield jameshadfield changed the title [vcf] use treetime read/write functions [vcf] use treetime write_vcf function Dec 13, 2023
@jameshadfield
Copy link
Member Author

jameshadfield commented Dec 13, 2023

A few observations after looking into this today:

Minor errors in VCF genotype syntax
The write_VCF_translation produces VCFs which look like this

##fileformat=VCFv4.2
##source=NextStrain_Protein_Translation
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	node_root	node_AB	sample_A	sample_B	sample_C
gene1	2	.	P	L	.	PASS	.	GT	.	1/1	1/1	1/1	.
gene2	2	.	V	D	.	PASS	.	GT	.	1/1	1/1	1/1	.

The genotype calls should be haploid (i.e. 1 not 1/1) and the intention of genotype . here is "no change from REF" but . actually means "a call cannot be made at the given locus".

VCF files can't represent amino acid diversity

The VCF spec doesn't seem to say that it's for nucleotide sequences only, but the 4.3 spec defines genotype ALTs as matching ^([ACGTNacgtn]+|\*|\.)$ so the intention is certainly nuc-only. I don't see why we couldn't make a VCF-like file for amino acids, and it'd mostly work, but I don't know how we'd represent stop codons because ALT=* is reserved for "allele missing due to overlapping deletion".

write_vcf can't write these data currently

It only writes one chromosome, and here we use different chromosomes for different genes. It wouldn't be hard to refactor it to accept multi-chromsome inputs. It's currently only used in 1 place in Augur and 1 place in TreeTime. Alternatively we could keep things so that write_vcf writes single chromosome VCFs and add a write_multi_chrom_vcf. If we do this we should probably extend read_vcf to read in multiple chromosomes.

But before embarking on this we need to decide whether we should be producing AA-VCFs. I know augur sequence-traits uses them.

@emmahodcroft
Copy link
Member

VCF files can't represent amino acid diversity

I may be completely and totally misremembering but I have some kind of vague memory of writing code that did output AAs in VCF format. This would very likely not at all have been official, but was my effort to produce a large number of gene translations in some kind of compressed format.
As I remember, I also used genes in place of chromosomes as you reference in the 3rd part above.

If that's gone from the code now though, I can't really when where or when it disappeared...

@jameshadfield
Copy link
Member Author

I may be completely and totally misremembering but I have some kind of vague memory of writing code that did output AAs in VCF format. This would very likely not at all have been official, but was my effort to produce a large number of gene translations in some kind of compressed format. As I remember, I also used genes in place of chromosomes as you reference in the 3rd part above.

Yup! That's exactly the behaviour of write_VCF_translation (now located within augur.io.vcf). And I think it's ok to use VCF for this, but because it's not part of the spec we can't use TreeTime's write_VCF and so we have to maintain parallel VCF-serialisation code, and parallel VCF-parsing code for the same reasons. The AA-VCF file is also only used by augur sequence-traits as far as I know, so this is a minor concern.

@jameshadfield jameshadfield changed the title [vcf] use treetime write_vcf function [vcf] Replace write_VCF_translation with treetime's write_vcf Jan 25, 2024
@jameshadfield jameshadfield removed their assignment Jan 25, 2024
@jameshadfield jameshadfield removed the bug label Jan 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants