## Filtering TrioDenovo Calls

De novo mutations (DNMs) were called by Yale using the TrioDenovo program and by HMS as previously described. Both call sets were filtered using the same hard filters, which have been shown to yield a specificity of 96.3%. The filters are:
(1) an in-cohort MAF ≤ 4×10⁻⁴;
(2) in the proband, a minimum of 10 total reads and 5 alternate allele reads, with an alternate allele ratio of at least 20% if there are ≥ 10 alternate allele reads, or at least 28% if there are < 10;
(3) in the parents, a minimum depth of 10 reference reads and an alternate allele ratio < 3.5%; and
(4) exonic or canonical splice-site variants only.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5675000/
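
The read-depth and allele-ratio criteria in filters (2) and (3) can be sketched as two small predicates. This is only an illustration of the thresholds quoted above, not the code used by the original authors. The function names are hypothetical, and it assumes biallelic sites where total depth is the sum of reference and alternate reads.

```python
def passes_proband_filter(ref_reads, alt_reads):
    """Filter (2): depth and alternate allele ratio in the proband.

    Assumes total depth = ref_reads + alt_reads (biallelic site).
    """
    total = ref_reads + alt_reads
    if total < 10 or alt_reads < 5:
        return False
    ratio = alt_reads / total
    # A stricter ratio is required when there are fewer than 10 alternate reads
    min_ratio = 0.20 if alt_reads >= 10 else 0.28
    return ratio >= min_ratio


def passes_parent_filter(ref_reads, alt_reads):
    """Filter (3): at least 10 reference reads and alternate ratio < 3.5%."""
    if ref_reads < 10:
        return False
    return alt_reads / (ref_reads + alt_reads) < 0.035
```

Both predicates need per-allele read counts (the AD field), which is why the cells below can only apply the total-depth part of the filter when AD is not available.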

The output genotypes for indels are incorrect in this version of TrioDenovo, but the de novo evidence is calculated correctly. For indels, the reported genotypes are placeholders: "A/A" denotes the homozygous reference genotype, "A/C" the heterozygous genotype, and "C/C" the homozygous alternative genotype, where the actual reference and alternative alleles are given in the REF and ALT columns.

https://genome.sph.umich.edu/wiki/Triodenovo#Output
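
One way to read the indel genotypes described above is to translate the placeholder letters back to the REF and ALT alleles. The helper below is a hypothetical sketch, not part of TrioDenovo. It assumes a biallelic site and treats any record where REF and ALT differ in length as an indel.

```python
def decode_indel_genotype(gt, ref, alt):
    """Translate TrioDenovo's placeholder indel genotype into real alleles.

    For indels, "A" stands for the REF allele and "C" for the ALT allele.
    SNV genotypes (REF and ALT of equal length) are returned unchanged.
    """
    if len(ref) == len(alt):
        return gt
    placeholder = {"A": ref, "C": alt}
    return "/".join(placeholder.get(allele, allele) for allele in gt.split("/"))


# Example with the deletion seen later in this notebook
print(decode_indel_genotype("A/C", "CGGGGAGGGGGG", "C"))
```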

In [4]:

with open('../intermediate/triodenovo/pcgc11_variants.vcf','w') as outf:
    with open('/pollard/home/mpittman/oligo/intermediate/triodenovo/pcgc11_denovo.vcf') as f:
        for line in f:

            if line.startswith('##'):
                outf.write(line)
                continue

            if line.startswith('#'):
                header = line.strip().split('\t')
                outf.write(line)
                continue

            line_array = line.strip().split('\t')

            form = line_array[8].split(':')
            dp_idx = form.index('DP')

            # Find the carrier family
            dps = []
            for i,value in enumerate(line_array[9:], start=9):
                if not value.startswith('./.'):

                    ## Filter by minimum of total reads - Proband
                    if not header[i].endswith(('-01','-02')):
                        proband_dp = value.split(':')[dp_idx]
                        dps.append(proband_dp)

                        ## Filter by minimum of alternate allele ratio
                        # Can't be done without AD info 

                    ## Filter minimum parental reads
                    else:
                        parent_dp = value.split(':')[dp_idx]
                        dps.append(parent_dp)
                        # Allele read ratio - can't be done without AD info
                        
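            # TODO: remove variants carried by too many people in the cohort (MAF filter, not implemented yet).
            # For now, drop the variant if any genotyped individual has missing DP or DP < 10.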
                        
            if any(i=='.' for i in dps):
                continue
            if any(int(i) < 10 for i in dps):
                continue
                        
            outf.write(line)



In [25]:
weird_list = []
pid_list = []

with open('../intermediate/triodenovo/pcgc1_variants.vcf','w') as outf:
    with open('/pollard/home/mpittman/oligo/intermediate/triodenovo/pcgc1_denovo.vcf') as f:
        for line in f:

            if line.startswith('##'):
                outf.write(line)
                continue

            if line.startswith('#'):
                header = line.strip().split('\t')
                outf.write(line)
                continue

            line_array = line.strip().split('\t')

            form = line_array[8].split(':')
            dp_idx = form.index('DP')

            # Find the carrier family
            dps = []
            for i,value in enumerate(line_array[9:], start=9):
                if not value.startswith('./.'):

                    ## Filter by minimum of total reads - Proband
                    if not header[i].endswith(('-01','-02')):
                        proband = header[i]
                        proband_dp = value.split(':')[dp_idx]
                        dps.append(proband_dp)

                        ## Filter by minimum of alternate allele ratio
                        # Can't be done without AD info 

                    ## Filter minimum parental reads
                    else:
                        parent_dp = value.split(':')[dp_idx]
                        dps.append(parent_dp)
                        # Allele read ratio - can't be done without AD info
            
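            #######################
            # TODO: remove variants carried by too many individuals in the cohort (MAF filter, not implemented yet)

            # This is very odd: DP should be a single integer, so comma-separated values look like AD.
            # Record these lines, then sum the per-allele counts to get a total depth.
            # Values without a comma (including '.') are left as-is for the checks below.
            if any(',' in d for d in dps):
                pid_list.append(proband)
                weird_list.append(line)
                dps = [str(sum(int(j) for j in d.split(','))) if ',' in d else d for d in dps]
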
                
            if any(j=='.' for j in dps):
                continue
            if any(int(j) < 10 for j in dps):
                continue
                        
            outf.write(line)




In [None]:
# Check out weird lines
print(len(weird_list))
print(weird_list[0])
print(pid_list[0])

In [None]:
print(header)

In [21]:
rsid = "rs757111315"

with open('/pollard/data/genetics/PCGC/WES/Yale/rare_PCGC_072020/split_files/xaa') as f:
    for line in f:
        if line.startswith('#'):
            continue
        line_array = line.strip().split('\t')
        if line_array[2] == rsid:
            print(line_array[0:9])
            for i in line_array[9:]:
                if not i.startswith(('./.','0/0')):
                    print(i)
        else:
            continue

['1', '899927', 'rs757111315', 'C', 'T', '4529.2', 'PASS', '.', 'GT:AB:AD:DP:GQ:PGT:PID:PL']
0/1:.:8,1:9:18:0|1:899924_C_A:18,0,434
0/1:.:13,2:15:4:.:.:4,0,418
0/1:0.89:31,4:35:75:0|1:899924_C_A:75,0,2186
0/1:0.92:22,2:24:3:0|1:899924_C_A:3,0,1091
0/1:.:8,2:10:40:0|1:899924_C_A:40,0,330
0/1:.:0,1:1:29:0|1:899924_C_A:39,0,29
0/1:0.75:3,1:4:65:0|1:899924_C_A:65,0,120
1/1:.:0,1:1:3:.:.:33,3,0
0/1:0.89:31,4:35:59:0|1:899924_C_A:59,0,1468
['1', '899927', 'rs757111315', 'CGGGGAGGGGGG', 'C', '4529.2', 'PASS', '.', 'GT:AB:AD:DP:GQ:PGT:PID:PL']
1/1:.:0,8:8:36:.:.:327,36,0
0/1:.:4,2:6:25:.:.:25,0,1021
1/1:.:0,9:9:39:.:.:361,39,0
0/1:.:8,1:9:38:.:.:38,0,930
0/1:.:2,4:6:86:.:.:86,0,436
1/1:.:0,7:7:32:.:.:325,32,0
1/1:.:0,5:5:22:.:.:194,22,0
0/1:.:3,1:4:43:.:.:43,0,397
1/1:.:0,2:2:11:.:.:100,11,0
1/1:.:0,4:4:17:.:.:176,17,0
1/1:.:0,3:3:14:.:.:114,14,0
1/1:.:0,4:4:19:.:.:183,19,0
0/1:.:1,2:3:75:.:.:75,0,131
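
Unlike the TrioDenovo output, the records printed above carry an AD field, so the alternate allele ratio can be computed from them. A minimal sketch, assuming the FORMAT string names the sample sub-fields in order and that AD lists the reference count first. The function name is hypothetical.

```python
def alt_allele_ratio(format_field, sample_field):
    """Return alt reads / total reads from AD, or None if AD is unusable."""
    fields = dict(zip(format_field.split(":"), sample_field.split(":")))
    ad = fields.get("AD", ".").split(",")
    if "." in ad:
        return None  # AD missing for this sample
    counts = [int(x) for x in ad]
    total = sum(counts)
    if total == 0:
        return None
    # The first AD entry is the reference allele; the rest are alternate alleles
    return sum(counts[1:]) / total


fmt = "GT:AB:AD:DP:GQ:PGT:PID:PL"
print(alt_allele_ratio(fmt, "0/1:0.89:31,4:35:75:0|1:899924_C_A:75,0,2186"))
```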



In [7]:
# Run ANNOVAR on these variants
import os

annovar_path = '/pollard/home/mpittman/PCGC/code/annovar'
out_path = '../intermediate/triodenovo/'
fname = 'pcgc11_variants.vcf'

cmd1 = "perl {}/table_annovar.pl {}/{} {}/humandb/ -buildver hg19 -out {}/{}".format(annovar_path,out_path,fname,
                                                                                     annovar_path,out_path,fname)
cmd2 = " -remove -protocol refGene,dbnsfp30a,exac03 -operation g,f,f -nastring . -vcfinput"

cmd = cmd1+cmd2
os.system(cmd)

In [8]:
# Filter out non-exonic/non-canonical splice sites
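
A possible starting point for this step is sketched below. It assumes that ANNOVAR, when run with `-vcfinput`, writes the refGene annotation into the INFO column as a `Func.refGene=` entry, and that compound values such as exonic plus splicing are joined with an escaped semicolon (`\x3b`). Both points should be checked against the actual `*_multianno.vcf` output before relying on this. The function name is hypothetical.

```python
def is_exonic_or_splicing(info):
    """Return True if the INFO string is annotated as exonic or splicing.

    Assumes an entry of the form Func.refGene=<value> in the INFO column.
    Values like ncRNA_exonic or intronic do not pass.
    """
    for entry in info.split(";"):
        if entry.startswith("Func.refGene="):
            value = entry.split("=", 1)[1]
            # Compound annotations are assumed to be joined by an escaped ';'
            funcs = value.replace("\\x3b", ";").split(";")
            return any(f in ("exonic", "splicing") for f in funcs)
    return False  # no refGene annotation found


print(is_exonic_or_splicing("Func.refGene=exonic;Gene.refGene=SAMD11"))
```

Applied per record, this would be called on column 8 (`line_array[7]`) of each non-header line, mirroring the loops in the cells above.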

In [None]:
# Get into desired format