Changed the rule to split records into columns #77

Closed


According to the specification, the columns must be tab separated. I encountered a VCF file from NCBI that has spaces in the INFO column, which caused PyVCF to fail.
http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41

Owner

jamescasbon commented Nov 15, 2012

Thanks for picking this up!

That should be '\t' and not '\t+', right? It has to be only one tab.

I'm tempted to have a permissive mode, as I'm sure I've seen space-separated files somewhere. Did you try running the test suite? There might be some in there.

Collaborator

martijnvermaat commented Nov 15, 2012

At this point you might just as well make it

row = line.split('\t')

which is probably faster.
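
To illustrate the difference between the patterns discussed above, here is a minimal sketch. The record is made up for illustration and is not taken from a real VCF file (a well-formed VCF marks a missing value with `.` rather than leaving a field empty); the variable names are hypothetical as well.

```python
import re

# Made-up record with an empty third field (two consecutive tabs).
line = "20\t14370\t\tG\tA"

# Splitting on exactly one tab keeps the empty field in place.
one_tab = re.split("\t", line)

# Splitting on one-or-more tabs silently collapses the empty field,
# so every later column shifts one position to the left.
many_tabs = re.split("\t+", line)

# str.split with an explicit separator gives the same result as the
# single-tab regex, without going through the regex engine.
plain = line.split("\t")

print(one_tab)    # 5 fields, third one empty
print(many_tabs)  # 4 fields
print(plain == one_tab)
```

With `'\t+'` a malformed line would be parsed with shifted columns instead of being noticed, which is one reason to prefer the single-tab split.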

I would be interested in seeing an example of a space-separated file. I'd say @marcofalcioni's use case is a valid one (especially if he can point us to such a file), although spaces in the INFO column are not allowed by the specification.

Here is an example - ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf/clinvar_00-latest.vcf.gz

Indeed - issue 16 and issue 49 both fail, as they have spaces instead of tabs. I feel unclean.

Owner

jamescasbon commented Nov 27, 2012

Added a 'strict_whitespace' option to the 0.6.1 release.
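
A rough sketch of how a strict/permissive switch of this kind could behave is shown below. The `split_record` helper, its default value, and the sample record are hypothetical and are not PyVCF's actual code; only the name `strict_whitespace` comes from the comment above.

```python
def split_record(line, strict_whitespace=False):
    """Hypothetical helper: split one VCF data line into columns.

    strict_whitespace=True  -> split on single tabs only, as the spec requires.
    strict_whitespace=False -> permissive: split on any run of whitespace.
    """
    line = line.rstrip("\r\n")
    if strict_whitespace:
        return line.split("\t")
    return line.split()


# Made-up record with a space inside the INFO column.
record = "1\t12345\trs1\tA\tG\t.\t.\tCLNDBN=Some disease;CLNSIG=5\n"

strict = split_record(record, strict_whitespace=True)
permissive = split_record(record)

print(len(strict), strict[7])   # INFO stays in one piece
print(len(permissive))          # INFO is broken at the space
```

Under these assumptions, the strict mode handles files with spaces in the INFO column, while the permissive mode keeps accepting space-separated files such as the ones in the test suite.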
