Changed the rule to split records into columns #77

Closed
wants to merge 1 commit into from

3 participants

@marcofalcioni

According to the specification, the columns must be tab-separated. I encountered a VCF file from NCBI that has spaces in the INFO column, which caused PyVCF to fail.
http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41

@marcofalcioni marcofalcioni Changed the rule to split records into columns
df5c1e8
@jamescasbon
Owner

Thanks for picking this up!

That should be '\t' and not '\t+', right? It has to be only one tab.

I'm tempted to have a permissive mode, as I'm sure I've seen space-separated files somewhere. Did you try running the test suite? There might be some in there.
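
The difference between '\t' and '\t+' shows up when a field is empty, so that two tabs sit next to each other. A minimal sketch, using a made-up record line (a well-formed VCF would use '.' for a missing ID, so this is for illustration only):

```python
import re

# Hypothetical record with an empty ID field (two adjacent tabs).
line = "1\t100\t\tA\tG\t50\tPASS\tDP=10"

single_tab = re.split('\t', line)
tab_run = re.split('\t+', line)

# '\t' keeps the empty field, so REF stays at index 3.
print(len(single_tab), single_tab[3])
# '\t+' swallows the empty field, so every later column shifts left.
print(len(tab_run), tab_run[3])
```

With a single-tab separator the column positions stay fixed even when a value is missing, which is what the positional lookups such as row[0] rely on.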

@martijnvermaat
Collaborator

At this point you might just as well make it

row = line.split('\t')

which is probably faster.
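
For a fixed single-character separator the two calls should return the same list, which can be checked directly. The sample line and the timing loop below are illustrative only; actual timings depend on the interpreter and the data, so no result is claimed here.

```python
import re
import timeit

# Hypothetical tab-separated record; the space inside INFO must survive.
line = "1\t100\trs1\tA\tG\t50\tPASS\tNOTE=two words"

by_str = line.split('\t')
by_re = re.split('\t', line)
same = by_str == by_re
print(same)

# Rough timing comparison of the two approaches.
t_str = timeit.timeit(lambda: line.split('\t'), number=10000)
t_re = timeit.timeit(lambda: re.split('\t', line), number=10000)
print(t_str, t_re)
```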

I would be interested in seeing an example of a space separated file. I'd say @marcofalcioni's use case is a valid one (especially if he can point us to such a file), although spaces in the INFO column are not allowed by the specification.

@marcofalcioni

Here is an example - ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf/clinvar_00-latest.vcf.gz

@marcofalcioni

Indeed - issue 16 and issue 49 both fail, as they have spaces instead of tabs. I feel unclean.

@jamescasbon
Owner

Added 'strict_whitespace' version to 0.6.1 release
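
A strict opt-in alongside a permissive mode could look roughly like the sketch below. The function name `split_record` is hypothetical, the permissive default is an assumption, and this is not presented as the code that shipped in 0.6.1; only the option name `strict_whitespace` and the two patterns come from this thread.

```python
import re

# Patterns discussed in the thread: tabs only, or tabs and runs of spaces.
STRICT = re.compile('\t')
PERMISSIVE = re.compile('\t| +')

def split_record(line, strict_whitespace=False):
    """Split one record line into columns (hypothetical helper)."""
    pattern = STRICT if strict_whitespace else PERMISSIVE
    return pattern.split(line.rstrip('\n'))

# Made-up records for illustration.
spaced = "1 100 rs1 A G 50 PASS DP=10"
tabbed = "1\t100\trs1\tA\tG\t50\tPASS\tNOTE=two words"

print(len(split_record(spaced)))                           # permissive: 8 columns
print(len(split_record(tabbed)))                           # permissive: INFO splits in two
print(len(split_record(tabbed, strict_whitespace=True)))   # strict: 8 columns
```

The permissive pattern handles space-separated files like those in the test suite, while the strict pattern keeps spaces inside a column intact.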

Commits on Nov 14, 2012
  1. @marcofalcioni

    Changed the rule to split records into columns

    marcofalcioni authored
Showing with 1 addition and 1 deletion.
  1. +1 −1  vcf/parser.py
2  vcf/parser.py
@@ -437,7 +437,7 @@ def _parse_alt(self, str):
     def next(self):
         '''Return the next record in the file.'''
         line = self.reader.next()
-        row = re.split('\t| +', line)
+        row = re.split('\t+', line)
         chrom = row[0]
         if self._prepend_chr:
             chrom = 'chr' + chrom