0/. has numeric representation (gt_type) 1 #93

Open
jasperlinthorst opened this Issue Feb 6, 2013 · 4 comments

@jasperlinthorst

Hello James,
First of all, thank you for creating PyVCF. Second, I'm working with a manually filtered VCF file, so the error may be due to an inconsistency between that file and the VCF standard...
That said, I don't think it is correct to give the numeric representation 1 to variants for which one allele is the reference and the other allele is unknown (0/.). I would represent them as either None or 0, with None being my preference.

I encountered this when using PyVCF version 0.6.0.

A response with your views on this would be greatly appreciated!

Thanks,
Jasper

@jamescasbon
Owner

Yes, I think you are right.

Do you have a patch?

@martijnvermaat
Collaborator

Slightly off-topic (I agree and vote for None), but does anyone know which variant callers actually generate GT values with the correct use of . (no call) as opposed to just using 0 (reference) for everything? Not just for heterozygous variants, but also for what should be ./. (actually, I'm not even sure 0/. is allowed at all).

@jasperlinthorst

No, I don't have a patch. I now have a workaround in which I check for the occurrence of a '.' in data.GT. If one is present, I interpret the call as missing. I would say that something similar could be implemented in the gt_type method...
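
A minimal sketch of that kind of workaround, assuming the genotype is available as a plain GT string such as the value read from data.GT. The helper name `has_missing_allele` is hypothetical and is not part of PyVCF:

```python
import re


def has_missing_allele(gt):
    """Return True if any allele in a GT string such as '0/.' is '.'.

    Hypothetical helper (not part of PyVCF). A GT of None is also
    treated as missing.
    """
    if gt is None:
        return True
    # Alleles are separated by '/' (unphased) or '|' (phased).
    return "." in re.split(r"[/|]", gt)


# Treat partially missing genotypes as missing calls.
for gt in ["0/.", "./.", "0/1", "1|1"]:
    status = "missing" if has_missing_allele(gt) else "called"
    print(gt, status)
```

Splitting on the separator rather than testing `'.' in gt` directly keeps the check tied to individual alleles instead of to any character in the string.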

@martijnvermaat
Collaborator

I think this needs some more thinking. If we make a choice for 0/., what about 1/.? What does it even mean?

Reading the spec, I think . can only be used if no call could be made for the sample, so for a diploid sample this would only allow ./. (both alleles missing).

So I'm inclined to set _Call.called to False for all of these cases (0/., 1/., ./.). This would directly fix Jasper's issue and would also correct other properties such as is_variant and is_het.
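
To make the proposal concrete, here is a standalone sketch of the intended behaviour, operating on plain GT strings rather than on PyVCF's _Call objects. The function names `is_called` and `gt_type` below are illustrative stand-ins, not the library's actual implementation; the return values assume the gt_type convention of 0 for homozygous reference, 1 for heterozygous, 2 for homozygous alternate, and None for uncalled:

```python
import re

HOM_REF, HET, HOM_ALT = 0, 1, 2


def split_alleles(gt):
    """Split a GT string on '/' (unphased) or '|' (phased)."""
    return re.split(r"[/|]", gt)


def is_called(gt):
    """A genotype counts as called only if no allele is '.'.

    Under this rule 0/., 1/. and ./. are all uncalled.
    """
    return gt is not None and "." not in split_alleles(gt)


def gt_type(gt):
    """Return 0, 1, 2, or None for an uncalled genotype."""
    if not is_called(gt):
        return None
    alleles = split_alleles(gt)
    if all(a == alleles[0] for a in alleles):
        return HOM_REF if alleles[0] == "0" else HOM_ALT
    return HET


for gt in ["0/.", "1/.", "./.", "0/0", "0/1", "1/1"]:
    print(gt, gt_type(gt))
```

Because heterozygosity is only evaluated after the called check, a genotype like 0/. can no longer be classified as 1 merely because its two allele fields differ.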
