In [1]:
from ness_vcf import ness_vcf
import vcf

This VCF filtering strategy uses the structure and language of PyVCF to create a flexible filter that is not specific to the idiosyncrasies of the program that produced a particular VCF.

It relies on naming the particular attributes and methods of `record` objects (a single genome site) and `call` objects (a single individual's genotype call).

It allows filtering of a whole record using the `test_site()` function, or filtering of individual genotype calls using the `test_genotype()` function.

## Test whole genome sites 

Sites in the genome are represented as individual rows in a VCF. Each row has predefined columns for various pieces of information. In PyVCF each row is converted to a `record` object, and each column of the VCF is accessible through this object.
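To make the row-to-object mapping concrete, here is a minimal sketch in plain Python that splits one VCF data line into its fixed columns. It does not use PyVCF, which does this parsing for you and also converts values to the types declared in the header. The names `VCF_COLUMNS` and `parse_vcf_line` and the example line are made up for illustration, and INFO values are left as strings.

```python
# The eight fixed VCF columns, in file order.
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]

def parse_vcf_line(line):
    """Split one tab-delimited VCF data line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    site = dict(zip(VCF_COLUMNS, fields[:8]))
    site["POS"] = int(site["POS"])
    # INFO is a semicolon-separated list of KEY=VALUE pairs or bare flags.
    info = {}
    for entry in site["INFO"].split(";"):
        if entry in ("", "."):
            continue
        key, sep, value = entry.partition("=")
        info[key] = value if sep else True  # a bare flag is stored as True
    site["INFO"] = info
    return site

# Made-up example line, not taken from the file used in this notebook.
example = "chr_demo\t101\t.\tA\tT\t50.0\t.\tDP=30;AF=0.5"
site = parse_vcf_line(example)
```

Here `site["REF"]` plays the role of `record.REF`, and `site["INFO"]["DP"]` the role of `record.INFO['DP']`.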

In [2]:
vr = vcf.Reader(filename='/scratch/research/projects/chlamydomonas/quebec/all/haplotypeCaller/new_assembly/all_quebec.HC.vcf.gz')

In [3]:
record = vr.fetch('chromosome_1', 1500000, 1500010).next()

The information in each column is available as attributes of the record object:

In [4]:
print record.CHROM
print record.POS
print record.REF
print record.ALT


chromosome_1
1500001
C
[None]


There are also convenience properties available on the record:

In [5]:
print record.num_called
print record.num_het
print record.is_snp

24
0
False


Any of these pieces of information can be built into a filter. For example, you may want to say:

If the record is a SNP and the reference base is an A, i.e.:

       if record.is_snp == True and record.REF == 'A':

You can instead put each test into a dictionary, which I call `recordAttribute_filters`: each key names a record attribute and each value is the test to apply to it.

In [6]:
recordAttribute_filters = {"is_snp": "==True", "REF": "=='A'"}

# Note that the A in "=='A'" is in quotes: the test string is passed on to Python as a command, so an unquoted A would be read as a variable name rather than a base.
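The following is a rough sketch of how a dictionary of attribute names and test strings could be applied to an object. It is not the `ness_vcf` implementation: instead of handing the string to Python directly, it splits each test into a comparison operator and a literal and uses `ast.literal_eval` and the `operator` module. The names `parse_test`, `passes_attribute_filters` and `DemoRecord` are hypothetical.

```python
import ast
import operator
import re
from collections import namedtuple

# Comparison operators a test string may start with.
OPERATORS = {
    "==": operator.eq, "!=": operator.ne,
    ">=": operator.ge, "<=": operator.le,
    ">": operator.gt, "<": operator.lt,
}

def parse_test(test):
    """Split a test string such as "=='A'" into (comparison function, literal value)."""
    match = re.match(r"\s*(==|!=|>=|<=|>|<)\s*(.+?)\s*$", test)
    if match is None:
        raise ValueError("cannot parse test string: %r" % test)
    symbol, literal = match.groups()
    # literal_eval accepts only Python literals, so a base must be quoted ('A').
    return OPERATORS[symbol], ast.literal_eval(literal)

def passes_attribute_filters(obj, attribute_filters):
    """Return True only if every named attribute of obj passes its test."""
    for name, test in attribute_filters.items():
        compare, expected = parse_test(test)
        if not compare(getattr(obj, name), expected):
            return False
    return True

# Stand-in for a PyVCF record, carrying only the two attributes tested here.
DemoRecord = namedtuple("DemoRecord", ["REF", "is_snp"])

filters = {"is_snp": "==True", "REF": "=='A'"}
snp_a = passes_attribute_filters(DemoRecord(REF="A", is_snp=True), filters)
snp_c = passes_attribute_filters(DemoRecord(REF="C", is_snp=True), filters)
```

With these filters `snp_a` is True and `snp_c` is False. An unquoted test such as `"==A"` raises a `ValueError` in this sketch, which is the same quoting issue described in the comment above.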

You can access the 'nested' information in the INFO column using PyVCF in a slightly different way:

In [7]:
print record.INFO['DP']

989


You can therefore test the INFO fields with another set of filters called `recordINFOfilters`:

In [8]:
recordINFOfilters = {"DP": ">250"}

Combining these filters, you can test sites:

In [9]:
for record in vr.fetch('chromosome_1', 1500000, 1500200):
    if ness_vcf.test_site(record, recordAttribute_filters,recordINFOfilters ):
        print record.CHROM, record.POS, record.REF, record.INFO['DP']
    

chromosome_1 1500050 A 1019
chromosome_1 1500110 A 979
chromosome_1 1500197 A 1008


There is an additional option called `lenient`. When it is set to True, a filter on a field that does not exist at a site is ignored for that site. This is relevant when you want to filter different kinds of sites differently, especially when certain metrics only exist in special cases, e.g. allele frequency (`AF`) is only defined for variable sites.

If `lenient` is set to False, the site fails whenever a filtered field does not exist. This may lead to strange problems where you expect passing sites but get none. As a result, I have created another option to go with `lenient`, called `verbose`, that prints the missing filter to the screen every time this happens. It is really only meant for debugging.
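The behaviour of `lenient` and `verbose` described above can be sketched as follows. This is an illustration under stated assumptions, not the code in `ness_vcf`: the function name `passes_info_filters` is hypothetical, the tests are written as (comparison function, threshold) pairs rather than strings to keep the sketch short, and the message is assumed to be printed in both lenient and strict mode.

```python
import operator

def passes_info_filters(info, info_filters, site_label, lenient=True, verbose=False):
    """Hypothetical stand-in for the INFO part of a site test (not ness_vcf code)."""
    for key, (compare, threshold) in info_filters.items():
        if key not in info:
            if verbose:
                # Assumed here: the message is printed in both lenient and strict mode.
                print("%s does not exist in the INFO field of site %s" % (key, site_label))
            if lenient:
                continue  # ignore this filter at this site
            return False  # strict mode: a missing field fails the site
        if not compare(info[key], threshold):
            return False
    return True

info_filters = {"DP": (operator.gt, 250), "AF": (operator.gt, 0.25)}

# Illustrative INFO dictionaries; the AF value is made up.
invariant_site = {"DP": 914}            # no AF, as at an invariant site
variant_site = {"DP": 1019, "AF": 0.5}

lenient_result = passes_info_filters(invariant_site, info_filters, "demo:1", lenient=True)
strict_result = passes_info_filters(invariant_site, info_filters, "demo:1",
                                    lenient=False, verbose=True)
variant_result = passes_info_filters(variant_site, info_filters, "demo:2", lenient=False)
```

The invariant site passes in lenient mode because only `DP` is tested, fails in strict mode because `AF` is missing, and the variant site passes either way.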

In [19]:
recordAttribute_filters = {"REF": "=='A'"}
recordINFOfilters = {"DP": ">250", "AF": ">0.25"}

In [20]:
for record in vr.fetch('chromosome_1', 1500000, 1500100):
    if ness_vcf.test_site(record, recordAttribute_filters,recordINFOfilters, lenient = True ):
        print record.CHROM, record.POS, record.REF, record.INFO['DP']
    

chromosome_1 1500010 A 914
chromosome_1 1500019 A 981
chromosome_1 1500022 A 997
chromosome_1 1500024 A 997
chromosome_1 1500027 A 1002
chromosome_1 1500032 A 995
chromosome_1 1500040 A 984
chromosome_1 1500044 A 1010
chromosome_1 1500047 A 994
chromosome_1 1500048 A 1009
chromosome_1 1500049 A 1006
chromosome_1 1500050 A 1019
chromosome_1 1500054 A 983
chromosome_1 1500065 A 976
chromosome_1 1500071 A 981
chromosome_1 1500085 A 946
chromosome_1 1500096 A 961


In [21]:
for record in vr.fetch('chromosome_1', 1500000, 1500100):
    if ness_vcf.test_site(record, recordAttribute_filters,recordINFOfilters, lenient = False, verbose=True ):
        print record.CHROM, record.POS, record.REF, record.INFO['DP']
        

chromosome_1 1500050 A 1019


AF does not exist in the INFO field of site chromosome_1:1500010 
AF does not exist in the INFO field of site chromosome_1:1500019 
AF does not exist in the INFO field of site chromosome_1:1500022 
AF does not exist in the INFO field of site chromosome_1:1500024 
AF does not exist in the INFO field of site chromosome_1:1500027 
AF does not exist in the INFO field of site chromosome_1:1500032 
AF does not exist in the INFO field of site chromosome_1:1500040 
AF does not exist in the INFO field of site chromosome_1:1500044 
AF does not exist in the INFO field of site chromosome_1:1500047 
AF does not exist in the INFO field of site chromosome_1:1500048 
AF does not exist in the INFO field of site chromosome_1:1500049 
AF does not exist in the INFO field of site chromosome_1:1500054 
AF does not exist in the INFO field of site chromosome_1:1500065 
AF does not exist in the INFO field of site chromosome_1:1500071 
AF does not exist in the INFO field of site chromosome_1:1500085 
AF does no

## Testing individual genotype calls
In addition to the properties of whole sites, PyVCF also breaks down the information in the genotype calls (the FORMAT field information).

PyVCF provides access to the core attributes of a genotype call:

In [13]:
sample_id= vr.samples[0]

record = vr.fetch('chromosome_1', 1500049,1500050).next() 

call = record.genotype(sample_id)
print call['GT']
print call['GQ']
print call['DP']
print call['AD']

1
99
53
[0, 53]


It also provides a number of convenience properties on call objects:

In [14]:
print call.is_het
print call.gt_alleles
print call.gt_bases

False
['1']
T


In a way similar to the above, you can create filters for individual genotype calls: use `callFormat_filters` for information accessed like `call['FIELD']` and `callAttribute_filters` for information accessed like `call.ATTRIBUTE`.
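The two kinds of access can be illustrated with a small stand-in object. The sketch below is hypothetical and does not use PyVCF or `ness_vcf`: `DemoCall` only mimics the two access styles (`call['FIELD']` and `call.ATTRIBUTE`), `passes_call_filters` is an invented name, the sample names and values are made up, and tests are again written as (comparison function, value) pairs instead of strings.

```python
import operator

class DemoCall(object):
    """Hypothetical stand-in for a call: FORMAT fields via call['X'], attributes via call.x."""
    def __init__(self, sample, called, is_variant, **format_fields):
        self.sample = sample
        self.called = called
        self.is_variant = is_variant
        self._format = format_fields

    def __getitem__(self, key):
        return self._format[key]

def passes_call_filters(call, format_filters, attribute_filters):
    """Return True only if the call passes every FORMAT test and every attribute test."""
    for key, (compare, value) in format_filters.items():
        if not compare(call[key], value):           # item-style access
            return False
    for name, (compare, value) in attribute_filters.items():
        if not compare(getattr(call, name), value):  # attribute-style access
            return False
    return True

format_filters = {"DP": (operator.gt, 5), "GQ": (operator.gt, 30)}
attribute_filters = {"called": (operator.eq, True), "is_variant": (operator.eq, True)}

calls = [
    DemoCall("sample_a", True, True, DP=53, GQ=99),
    DemoCall("sample_b", True, True, DP=3, GQ=99),    # fails DP > 5
    DemoCall("sample_c", True, False, DP=40, GQ=99),  # not a variant call
]
kept = [c.sample for c in calls if passes_call_filters(c, format_filters, attribute_filters)]
```

Only `sample_a` is kept: it is the one call that passes both dictionaries of tests.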

In [22]:
callFormat_filters={"DP":">5", "GQ" : ">30", "RGQ" : ">0"}
callAttribute_filters={'called': "==True", "is_variant": "== True"}


In [23]:
for record in vr.fetch('chromosome_1', 1500000, 1500100):
    if ness_vcf.test_site(record, recordAttribute_filters,recordINFOfilters, lenient = True):
        print record.CHROM, record.POS, record.REF, record.INFO['DP']
        for call in record.samples:
            if ness_vcf.test_genotype(call,callFormat_filters, callAttribute_filters ):
                print "\t\t", call.sample, call.gt_bases
                

chromosome_1 1500010 A 914
chromosome_1 1500019 A 981
chromosome_1 1500022 A 997
chromosome_1 1500024 A 997
chromosome_1 1500027 A 1002
chromosome_1 1500032 A 995
chromosome_1 1500040 A 984
chromosome_1 1500044 A 1010
chromosome_1 1500047 A 994
chromosome_1 1500048 A 1009
chromosome_1 1500049 A 1006
chromosome_1 1500050 A 1019
		CC2935 T
		CC2936 T
		CC2937 T
		CC3059 T
		CC3060 T
		CC3061 T
		CC3062 T
		CC3063 T
		CC3064 T
		CC3065 T
		CC3068 T
		CC3069 T
		CC3071 T
		CC3073 T
		CC3076 T
		CC3082 T
		CC3083 T
		CC3084 T
		CC3086 T
chromosome_1 1500054 A 983
chromosome_1 1500065 A 976
chromosome_1 1500071 A 981
chromosome_1 1500085 A 946
chromosome_1 1500096 A 961


The `lenient` and `verbose` options are also available for the `test_genotype()` function

In [24]:
callFormat_filters={"DP":">5", "GQ" : ">30", "RGQ" : ">0"}
callAttribute_filters={'called': "==True", "is_variant": "== True"}

for record in vr.fetch('chromosome_1', 1500000, 1500100):
    if ness_vcf.test_site(record, recordAttribute_filters,recordINFOfilters, lenient = True):
        print record.CHROM, record.POS, record.REF, record.INFO['DP']
        for call in record.samples:
            if ness_vcf.test_genotype(call,callFormat_filters, callAttribute_filters, lenient=False, verbose=True ):
                print "\t\t", call.sample, call.gt_bases
                

chromosome_1 1500010 A 914
chromosome_1 1500019 A 981
chromosome_1 1500022 A 997
chromosome_1 1500024 A 997
chromosome_1 1500027 A 1002
chromosome_1 1500032 A 995
chromosome_1 1500040 A 984
chromosome_1 1500044 A 1010
chromosome_1 1500047 A 994
chromosome_1 1500048 A 1009
chromosome_1 1500049 A 1006
chromosome_1 1500050 A 1019
chromosome_1 1500054 A 983
chromosome_1 1500065 A 976
chromosome_1 1500071 A 981
chromosome_1 1500085 A 946
chromosome_1 1500096 A 961


GQ does not exist in genotype CC2935 at site chromosome_1:1500010 
GQ does not exist in genotype CC2936 at site chromosome_1:1500010 
GQ does not exist in genotype CC2937 at site chromosome_1:1500010 
GQ does not exist in genotype CC2938 at site chromosome_1:1500010 
GQ does not exist in genotype CC3059 at site chromosome_1:1500010 
GQ does not exist in genotype CC3060 at site chromosome_1:1500010 
GQ does not exist in genotype CC3061 at site chromosome_1:1500010 
GQ does not exist in genotype CC3062 at site chromosome_1:1500010 
GQ does not exist in genotype CC3063 at site chromosome_1:1500010 
GQ does not exist in genotype CC3064 at site chromosome_1:1500010 
GQ does not exist in genotype CC3065 at site chromosome_1:1500010 
GQ does not exist in genotype CC3068 at site chromosome_1:1500010 
GQ does not exist in genotype CC3069 at site chromosome_1:1500010 
GQ does not exist in genotype CC3071 at site chromosome_1:1500010 
GQ does not exist in genotype CC3072 at site chromosome_1:1500