VCF files without FORMAT/genotype fields #95

Closed
freeseek opened this Issue Feb 25, 2013 · 3 comments


@freeseek

VCF files are allowed to omit the FORMAT field and the genotype fields entirely. However, there is currently no way to generate files like this. There are two issues:

  1. The function vcf.Writer(stream, template) writes the FORMAT field regardless of whether it is present in the template. (minor problem)
  2. The write_record(record) function appends a "\t" at the end of the line even if the FORMAT field is empty. (major problem)

For now, I have used the trick of piping the output through "cut -f1-8" to remove the unwanted empty FORMAT field, so that I can use the output with GATK. But an option to produce VCF files without genotype records would be greatly appreciated.

Examples of such VCF files can be found here:

    ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/phase1/analysis_results/integrated_call_sets/ALL.wgs.integrated_phase1_v3.20101123.snps_indels_sv.sites.vcf.gz
    http://evs.gs.washington.edu/evs_bulk_data/ESP6500SI.snps_indels.vcf.tar.gz
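
For anyone who would rather stay in Python than pipe through cut, the same workaround can be sketched roughly as below. This is only a minimal sketch operating on plain text lines; `strip_to_sites_only` is a made-up helper name, not part of PyVCF, and unlike cut it deliberately passes "##" meta lines through untouched.

```python
def strip_to_sites_only(lines, n_columns=8):
    """Yield VCF lines truncated to the first n_columns tab-separated fields.

    Hypothetical helper (not part of PyVCF), roughly equivalent to
    piping the writer's output through `cut -f1-8`.
    """
    for line in lines:
        if line.startswith("##"):
            # Meta-information lines are passed through unchanged.
            yield line
            continue
        text = line.rstrip("\r\n")
        # Keep CHROM..INFO only; this drops an empty FORMAT field
        # (i.e. the trailing tab) as well as any genotype columns.
        yield "\t".join(text.split("\t")[:n_columns]) + "\n"
```

It could be used as `out.writelines(strip_to_sites_only(open("with_trailing_tab.vcf")))`, where the file name is just a placeholder.
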
@martijnvermaat
Collaborator

Actually, I find that you cannot even write records read from your example files directly, because their samples attribute is None. This causes an exception in the writer.

Your points are also valid of course.

@freeseek

I almost forgot about that. I used "record.samples=[]" to get around it. But, of course, the FORMAT field is still written even though it is empty, which leaves a trailing "\t" at the end of each line after the INFO field, and that confuses GATK.
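
To illustrate the behaviour being asked for, here is a rough sketch of a line formatter that only emits FORMAT and genotype columns when there is something to emit. It is not PyVCF's writer code; `format_record_line` and its parameters are hypothetical names used purely for illustration.

```python
def format_record_line(fixed_fields, fmt=None, sample_fields=()):
    """Join the fixed columns (CHROM..INFO) into one VCF data line.

    Hypothetical sketch, not PyVCF's implementation: FORMAT and the
    per-sample columns are appended only when a FORMAT string is given,
    so a sites-only record never ends with a trailing tab.
    """
    columns = list(fixed_fields)
    if fmt:
        columns.append(fmt)
        columns.extend(sample_fields)
    return "\t".join(columns) + "\n"
```

With no FORMAT the line stops after INFO; with `fmt="GT"` and `sample_fields=["0/1"]` the usual genotype columns are appended.
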

@martijnvermaat
Collaborator

I see two ways of dealing with files where no samples are defined:

  1. Always set attributes such as samples to the empty list.
  2. If FORMAT is defined, do as in 1; otherwise set these attributes to None.

At first I was inclined to prefer the second option, since without a FORMAT column there are no samples by definition. It would differentiate between that case and the case where there is a FORMAT column but no samples are defined (None versus empty list).

However, this would make for a mess in _Record and in any code that deals with samples (you'd always have to check for None before iterating). So I implemented the first approach (#97).
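
A small sketch of why the empty-list approach keeps calling code simpler. `SitesOnlyRecord` and `sample_names` are stand-ins invented for this example and are not PyVCF's _Record or part of its API; samples are modelled as plain (name, genotype) tuples.

```python
class SitesOnlyRecord:
    """Hypothetical stand-in for a record; not PyVCF's _Record."""

    def __init__(self, chrom, pos, samples=None):
        self.CHROM = chrom
        self.POS = pos
        # First approach: "no samples" is always normalised to an empty
        # list, whether or not the file has a FORMAT column.
        self.samples = list(samples) if samples is not None else []


def sample_names(record):
    # No None check needed: iterating an empty list yields nothing.
    return [name for name, _genotype in record.samples]
```

With the second approach, every function like `sample_names` would need an `if record.samples is None` guard before the loop.
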

Here I assume we want to treat the samples attribute on the Reader and on _Record the same way; otherwise more confusion will arise.

@jamescasbon closed this Mar 4, 2013