Let's now count how many deleterious and rare de novo variants each trio has. We'll start with GATK hiConf: 

In [1]:
import pandas as pd
from scipy import stats
import numpy as np

# GATK high conf

First, create the annotations. We can do all the grepping later:

In [29]:
%%bash


module load annovar

cd /data/NCR_SBRB/simplex/gatk_refine
suffix=hiConfDeNovo

while read trio; do
    convert2annovar.pl -format vcf4old ${trio}_${suffix}.vcf > ${trio}_${suffix}.avinput;
    table_annovar.pl ${trio}_${suffix}.avinput $ANNOVAR_DATA/hg19 \
        -protocol refGene,dbnsfp30a,popfreq_max_20150413 -operation g,f,f \
        -build hg19 -nastring .
done < /data/NCR_SBRB/simplex/trio_ids.txt

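Now we have one big annotation table per trio that we can subset in whatever way we want. For now, let's count the variants in each trio that are both deleterious (at least one dbNSFP '_pred' column is 'D') and rare (PopFreqMax below .01, or not in PopFreqMax at all):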

In [137]:
fid = open('/data/NCR_SBRB/simplex/trio_ids.txt', 'r')
trios = [t.rstrip() for t in fid]
fid.close()

var_count = {}
for trio in trios:
    df = pd.read_table('/data/NCR_SBRB/simplex/gatk_refine/%s_hiConfDeNovo.avinput.hg19_multianno.txt' % trio)
    pred_cols = [col for col in df.columns if col.find('_pred') > 0]

    mask = df[pred_cols] == 'D'
    del_idx = mask.any(axis=1)

    df.loc[df['PopFreqMax']=='.', 'PopFreqMax'] = np.nan
    rare_idx = (df['PopFreqMax'].astype(float) < .01) | pd.isnull(df['PopFreqMax'])

    keep_me = del_idx & rare_idx
    var_count[trio] = np.sum(keep_me)
    print '%s: %d de novo SNVs, ' % (trio, df.shape[0]) + \
        '%d rare (MAF<.01), ' % np.sum((df['PopFreqMax'].astype(float) < .01)) + \
        '%d not in PopFreqMax, ' % np.sum(pd.isnull(df['PopFreqMax'])) + \
        '%d deleterious, ' % np.sum(del_idx) + \
        '%d not in dbNSFP' % np.sum(np.all(df[pred_cols] == '.', axis=1))

10033_trio1: 69 de novo SNVs, 9 rare (MAF<.01), 50 not in PopFreqMax, 1 deleterious, 68 not in dbNSFP
10033_trio2: 99 de novo SNVs, 18 rare (MAF<.01), 65 not in PopFreqMax, 1 deleterious, 96 not in dbNSFP
10042_trio1: 81 de novo SNVs, 17 rare (MAF<.01), 47 not in PopFreqMax, 5 deleterious, 74 not in dbNSFP
10090_trio1: 89 de novo SNVs, 16 rare (MAF<.01), 43 not in PopFreqMax, 3 deleterious, 86 not in dbNSFP
10090_trio2: 63 de novo SNVs, 11 rare (MAF<.01), 33 not in PopFreqMax, 0 deleterious, 63 not in dbNSFP
10094_trio1: 80 de novo SNVs, 18 rare (MAF<.01), 41 not in PopFreqMax, 2 deleterious, 76 not in dbNSFP
10094_trio2: 55 de novo SNVs, 13 rare (MAF<.01), 27 not in PopFreqMax, 2 deleterious, 51 not in dbNSFP
10128_trio1: 54 de novo SNVs, 11 rare (MAF<.01), 33 not in PopFreqMax, 2 deleterious, 49 not in dbNSFP
10128_trio2: 52 de novo SNVs, 7 rare (MAF<.01), 35 not in PopFreqMax, 1 deleterious, 51 not in dbNSFP
10131_trio1: 48 de novo SNVs, 9 rare (MAF<.01), 23 not in PopFreqMax, 2 del
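
To make the deleterious/rare filter easier to check by hand, here is a minimal sketch of the same logic on a toy table. It is written for Python 3 and selects the prediction columns by name. The toy rows, the two `_pred` column names and the helper `count_deleterious_rare` are assumptions for illustration only; they are not taken from the real multianno files.

```python
import io

import pandas as pd

# Toy stand-in for a *_multianno.txt file; every value here is made up.
toy = io.StringIO(
    "Chr\tStart\tSIFT_pred\tPolyphen2_HDIV_pred\tPopFreqMax\n"
    "chr1\t100\tD\tB\t0.002\n"   # deleterious and rare
    "chr1\t200\tT\tD\t.\n"       # deleterious, absent from PopFreqMax
    "chr2\t300\tD\tD\t0.25\n"    # deleterious but common
    "chr2\t400\t.\t.\t0.001\n"   # rare, but no dbNSFP prediction
)
df = pd.read_csv(toy, sep="\t", dtype=str)


def count_deleterious_rare(df, maf=0.01):
    """Count rows with any '_pred' column equal to 'D' and a rare or missing frequency."""
    pred_cols = [c for c in df.columns if "_pred" in c]
    is_del = (df[pred_cols] == "D").any(axis=1)
    # errors="coerce" turns the '.' placeholder (and any other non-number) into NaN
    freq = pd.to_numeric(df["PopFreqMax"], errors="coerce")
    is_rare = (freq < maf) | freq.isna()
    return int((is_del & is_rare).sum())


n_keep = count_deleterious_rare(df)
print(n_keep)
```

Only the first two toy rows pass both filters. Note that treating a missing PopFreqMax as rare, as the cell above does, is a choice: it lets through variants that simply were not covered by the frequency database.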

Let's summarize the results above per family, where the first one is always the affected trio:

In [117]:
fid = open('/data/NCR_SBRB/simplex/famids.txt', 'r')
fams = [t.rstrip() for t in fid]
fid.close()

for fam in fams:
    keys = [k for k in var_count.iterkeys() if k.find(fam)==0]
    keys.sort()
    f_str = [fam] + [str(var_count[k]) for k in keys]
    print f_str

['10033', '1', '1']
['10042', '5']
['10090', '2', '0']
['10094', '2', '0']
['10128', '2', '0']
['10131', '2', '0', '0', '6']
['10153', '0', '1', '1']
['10164', '0', '3']
['10173', '1', '6']
['10178', '5', '2']
['10182', '0', '1', '1']
['10197', '1', '0']
['10215', '5', '4', '1', '2']
['10369', '36', '31']
['10406', '2', '1', '1']
['10448', '0', '0']
['1892', '1', '2']
['1893', '0', '2']
['1895', '3', '0']
['1976', '1', '2', '2']
['855', '4', '2']


8 families where an unaffected trio has more (counting a family here if any of its unaffected trios exceeds the affected one), 2 ties, 10 where the affected trio has more...
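
A hand tally like this depends on how families with several unaffected trios are scored, so here is a small sketch that does the counting explicitly. The counts are copied from the printed list above (affected trio first); the function `tally` and its two rule names are hypothetical, introduced only for this illustration.

```python
# Deleterious + rare counts per family, copied from the output above
# (first value = affected trio, remaining values = unaffected trios).
hiconf_counts = {
    '10033': [1, 1], '10042': [5], '10090': [2, 0], '10094': [2, 0],
    '10128': [2, 0], '10131': [2, 0, 0, 6], '10153': [0, 1, 1],
    '10164': [0, 3], '10173': [1, 6], '10178': [5, 2], '10182': [0, 1, 1],
    '10197': [1, 0], '10215': [5, 4, 1, 2], '10369': [36, 31],
    '10406': [2, 1, 1], '10448': [0, 0], '1892': [1, 2], '1893': [0, 2],
    '1895': [3, 0], '1976': [1, 2, 2], '855': [4, 2],
}


def tally(counts, rule="any"):
    """Return (unaffected has more, ties, affected has more).

    rule="any":   compare the affected trio with the largest unaffected count.
    rule="first": compare the affected trio with the first unaffected trio only.
    """
    more_unaff = ties = more_aff = 0
    for vals in counts.values():
        affected, unaffected = vals[0], vals[1:]
        if not unaffected:
            continue  # family with a single trio, nothing to compare
        ref = max(unaffected) if rule == "any" else unaffected[0]
        if ref > affected:
            more_unaff += 1
        elif ref == affected:
            ties += 1
        else:
            more_aff += 1
    return more_unaff, ties, more_aff


print(tally(hiconf_counts, "any"))
print(tally(hiconf_counts, "first"))
```

With these numbers the "any" rule gives the 8 / 2 / 10 split quoted above, while the "first" rule (the pairing used in the "Only one pair per family" test) moves family 10131 to the affected side.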

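Let's run some stats. First we use only one pair per family (the affected trio against the first unaffected trio), then all affected/unaffected pairs, treating pairs from the same family as if they were independent. In both cases we run a non-parametric test (Wilcoxon signed-rank) and a parametric one (paired t-test):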

In [122]:
print 'Only one pair per family:'
x, y = [], []
for fam in fams:
    if fam != '10042':
        keys = [k for k in var_count.iterkeys() if k.find(fam)==0]
        keys.sort()
        x.append(var_count[keys[0]])
        y.append(var_count[keys[1]])
stat, pval = stats.wilcoxon(x, y)
print 'Wilcoxon p = %.2f' % pval
stat, pval = stats.ttest_rel(x, y)
print 'Paired t-test p = %.2f' % pval

print 'All pairs:'
x, y = [], []
for fam in fams:
    if fam != '10042':
        keys = [k for k in var_count.iterkeys() if k.find(fam)==0]
        keys.sort()
        for k in range(1, len(keys)):
            x.append(var_count[keys[0]])
            y.append(var_count[keys[k]])
stat, pval = stats.wilcoxon(x, y)
print 'Wilcoxon p = %.2f' % pval
stat, pval = stats.ttest_rel(x, y)
print 'Paired t-test p = %.2f' % pval

Only one pair per family:
Wilcoxon p = 0.24
Paired t-test p = 0.35
All pairs:
Wilcoxon p = 0.21
Paired t-test p = 0.31
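
The two pairing schemes above can also be built by one helper, which avoids repeating the loop and the hard-coded exclusion of the single-trio family. This is only a sketch under stated assumptions: trio IDs are taken to look like `<family>_trio<N>` with a single-digit N and `trio1` as the affected trio, and `family_of` and `build_pairs` are hypothetical names. The counts in the toy dictionary come from the list printed earlier; the `trio2` to `trio4` suffixes for family 10131 are assumed.

```python
def family_of(trio_id):
    # '10131_trio2' -> '10131'; exact family ID rather than a prefix match
    return trio_id.split('_')[0]


def build_pairs(var_count, all_pairs=False):
    """Return (x, y): affected counts and the matching unaffected counts."""
    by_family = {}
    for trio in sorted(var_count):  # trio1 (affected) sorts first
        by_family.setdefault(family_of(trio), []).append(var_count[trio])
    x, y = [], []
    for fam in sorted(by_family):
        vals = by_family[fam]
        # vals[1:2] is empty for single-trio families, so they drop out
        unaffected = vals[1:] if all_pairs else vals[1:2]
        for u in unaffected:
            x.append(vals[0])
            y.append(u)
    return x, y


toy_counts = {
    '10033_trio1': 1, '10033_trio2': 1,
    '10042_trio1': 5,
    '10131_trio1': 2, '10131_trio2': 0, '10131_trio3': 0, '10131_trio4': 6,
}
one_pair = build_pairs(toy_counts)
every_pair = build_pairs(toy_counts, all_pairs=True)
print(one_pair)
print(every_pair)
```

The resulting `x` and `y` lists can be passed to `stats.wilcoxon` and `stats.ttest_rel` as in the cell above. In the all-pairs version the affected count of a family is repeated once per unaffected trio, so those pairs are not independent and the p-values are likely optimistic.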


# GATK all

In [None]:
%%bash


module load annovar

cd /data/NCR_SBRB/simplex/gatk_refine
suffix=allDeNovo

while read trio; do
    convert2annovar.pl -format vcf4old ${trio}_${suffix}.vcf > ${trio}_${suffix}.avinput;
    table_annovar.pl ${trio}_${suffix}.avinput $ANNOVAR_DATA/hg19 \
        -protocol refGene,dbnsfp30a,popfreq_max_20150413 -operation g,f,f \
        -build hg19 -nastring .
done < /data/NCR_SBRB/simplex/trio_ids.txt

In [123]:
fid = open('/data/NCR_SBRB/simplex/trio_ids.txt', 'r')
trios = [t.rstrip() for t in fid]
fid.close()

var_count = {}
for trio in trios:
    df = pd.read_table('/data/NCR_SBRB/simplex/gatk_refine/%s_allDeNovo.avinput.hg19_multianno.txt' % trio)
    pred_cols = [col for col in df.columns if col.find('_pred') > 0]

    mask = df[pred_cols] == 'D'
    del_idx = mask.any(axis=1)

    df.loc[df['PopFreqMax']=='.', 'PopFreqMax'] = np.nan
    rare_idx = (df['PopFreqMax'].astype(float) < .01) | pd.isnull(df['PopFreqMax'])

    keep_me = del_idx & rare_idx
    var_count[trio] = np.sum(keep_me)

fid = open('/data/NCR_SBRB/simplex/famids.txt', 'r')
fams = [t.rstrip() for t in fid]
fid.close()

for fam in fams:
    keys = [k for k in var_count.iterkeys() if k.find(fam)==0]
    keys.sort()
    f_str = [fam] + [str(var_count[k]) for k in keys]
    print f_str

['10033', '3', '5']
['10042', '16']
['10090', '8', '1']
['10094', '4', '6']
['10128', '3', '0']
['10131', '2', '2', '5', '10']
['10153', '5', '5', '6']
['10164', '10', '9']
['10173', '6', '12']
['10178', '11', '2']
['10182', '10', '4', '4']
['10197', '6', '5']
['10215', '12', '10', '7', '6']
['10369', '87', '83']
['10406', '7', '4', '5']
['10448', '3', '7']
['1892', '5', '11']
['1893', '1', '7']
['1895', '7', '4']
['1976', '7', '3', '4']
['855', '6', '3']


8 families where an unaffected trio has more (counting a family here if any of its unaffected trios exceeds the affected one), 12 where the affected trio has more...
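
One way to put a number on a win/loss tally like this is an exact two-sided sign test. It is not one of the tests used in this notebook; it is sketched here only as a quick check that needs nothing beyond the standard library (Python 3.8+ for `math.comb`). The function name `sign_test_p` is hypothetical, and ties are assumed to have been dropped before calling it.

```python
from math import comb


def sign_test_p(n_pos, n_neg):
    """Exact two-sided sign test p-value under H0: P(pos) = P(neg) = 0.5."""
    n = n_pos + n_neg
    k = min(n_pos, n_neg)
    # Probability of a split at least as lopsided as k vs n - k, one tail
    tail = sum(comb(n, i) for i in range(k + 1)) / 2.0 ** n
    return min(1.0, 2.0 * tail)


# 12 families where the affected trio has more, 8 where an unaffected trio does
p = sign_test_p(12, 8)
print('Sign test p = %.2f' % p)
```

The sign test ignores the size of each difference, so it is less sensitive than the Wilcoxon signed-rank test, but it is also unaffected by outliers such as family 10369.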

In [124]:
print 'Only one pair per family:'
x, y = [], []
for fam in fams:
    if fam != '10042':
        keys = [k for k in var_count.iterkeys() if k.find(fam)==0]
        keys.sort()
        x.append(var_count[keys[0]])
        y.append(var_count[keys[1]])
stat, pval = stats.wilcoxon(x, y)
print 'Wilcoxon p = %.2f' % pval
stat, pval = stats.ttest_rel(x, y)
print 'Paired t-test p = %.2f' % pval

print 'All pairs:'
x, y = [], []
for fam in fams:
    if fam != '10042':
        keys = [k for k in var_count.iterkeys() if k.find(fam)==0]
        keys.sort()
        for k in range(1, len(keys)):
            x.append(var_count[keys[0]])
            y.append(var_count[keys[k]])
stat, pval = stats.wilcoxon(x, y)
print 'Wilcoxon p = %.2f' % pval
stat, pval = stats.ttest_rel(x, y)
print 'Paired t-test p = %.2f' % pval

Only one pair per family:
Wilcoxon p = 0.31
Paired t-test p = 0.31
All pairs:
Wilcoxon p = 0.20
Paired t-test p = 0.21


# Triodenovo

In [None]:
%%bash


module load annovar

cd /data/NCR_SBRB/simplex/triodenovo
suffix=denovo_v2

while read trio; do
    convert2annovar.pl -format vcf4old ${trio}_${suffix}.vcf > ${trio}_${suffix}.avinput;
    table_annovar.pl ${trio}_${suffix}.avinput $ANNOVAR_DATA/hg19 \
        -protocol refGene,dbnsfp30a,popfreq_max_20150413 -operation g,f,f \
        -build hg19 -nastring .
done < /data/NCR_SBRB/simplex/trio_ids.txt

In [125]:
fid = open('/data/NCR_SBRB/simplex/trio_ids.txt', 'r')
trios = [t.rstrip() for t in fid]
fid.close()

var_count = {}
for trio in trios:
    df = pd.read_table('/data/NCR_SBRB/simplex/triodenovo/%s_denovo_v2.avinput.hg19_multianno.txt' % trio)
    pred_cols = [col for col in df.columns if col.find('_pred') > 0]

    mask = df[pred_cols] == 'D'
    del_idx = mask.any(axis=1)

    df.loc[df['PopFreqMax']=='.', 'PopFreqMax'] = np.nan
    rare_idx = (df['PopFreqMax'].astype(float) < .01) | pd.isnull(df['PopFreqMax'])

    keep_me = del_idx & rare_idx
    var_count[trio] = np.sum(keep_me)

fid = open('/data/NCR_SBRB/simplex/famids.txt', 'r')
fams = [t.rstrip() for t in fid]
fid.close()

for fam in fams:
    keys = [k for k in var_count.iterkeys() if k.find(fam)==0]
    keys.sort()
    f_str = [fam] + [str(var_count[k]) for k in keys]
    print f_str

['10033', '16', '33']
['10042', '20']
['10090', '38', '24']
['10094', '8', '10']
['10128', '17', '20']
['10131', '18', '13', '7', '32']
['10153', '21', '13', '15']
['10164', '25', '24']
['10173', '18', '18']
['10178', '26', '23']
['10182', '18', '17', '19']
['10197', '7', '13']
['10215', '19', '24', '9', '11']
['10369', '17', '9']
['10406', '16', '23', '23']
['10448', '12', '13']
['1892', '10', '17']
['1893', '33', '35']
['1895', '33', '20']
['1976', '29', '24', '10']
['855', '24', '14']


Comparing the affected trio against the first unaffected trio only: 9 families where the unaffected trio has more, 1 tie, 10 where the affected trio has more...

In [126]:
print 'Only one pair per family:'
x, y = [], []
for fam in fams:
    if fam != '10042':
        keys = [k for k in var_count.iterkeys() if k.find(fam)==0]
        keys.sort()
        x.append(var_count[keys[0]])
        y.append(var_count[keys[1]])
stat, pval = stats.wilcoxon(x, y)
print 'Wilcoxon p = %.2f' % pval
stat, pval = stats.ttest_rel(x, y)
print 'Paired t-test p = %.2f' % pval

print 'All pairs:'
x, y = [], []
for fam in fams:
    if fam != '10042':
        keys = [k for k in var_count.iterkeys() if k.find(fam)==0]
        keys.sort()
        for k in range(1, len(keys)):
            x.append(var_count[keys[0]])
            y.append(var_count[keys[k]])
stat, pval = stats.wilcoxon(x, y)
print 'Wilcoxon p = %.2f' % pval
stat, pval = stats.ttest_rel(x, y)
print 'Paired t-test p = %.2f' % pval

Only one pair per family:
Wilcoxon p = 0.59
Paired t-test p = 0.60
All pairs:
Wilcoxon p = 0.25
Paired t-test p = 0.28


# Whole genome databases

I'm getting a lot of misses (variants with no entry in the annotation databases). Could that be because I'm only looking at exome databases? That choice makes sense, as I have WES data, but what happens if I look at WGS databases as well?

In [None]:
%%bash

table_annovar.pl tmp1.avinput $ANNOVAR_DATA/hg19 \
    -protocol refGene,gerp++,cadd,dann,fathmm,eigen,gwava,dbscsnv11,spidex,clinvar_20160302,avsnp142 \
    -operation g,f,f,f,f,f,f,f,f,f,f \
    -build hg19 -nastring .

In [143]:
df = pd.read_table('/data/NCR_SBRB/simplex/gatk_refine/tmp1.avinput.hg19_multianno.txt')
df.head()

Unnamed: 0,Chr,Start,End,Ref,Alt,Func.refGene,Gene.refGene,GeneDetail.refGene,ExonicFunc.refGene,AAChange.refGene,...,dbscSNV_ADA_SCORE,dbscSNV_RF_SCORE,dpsi_max_tissue,dpsi_zscore,CLINSIG,CLNDBN,CLNACC,CLNDSDB,CLNDSDBID,avsnp142
0,chr1,16954434,16954434,C,A,ncRNA_intronic,CROCCP2,.,.,.,...,.,.,.,.,.,.,.,.,.,rs186864069
1,chr1,92065492,92065492,T,C,intergenic,CDC7;TGFBR3,dist=74171;dist=80408,.,.,...,.,.,.,.,.,.,.,.,.,rs184420530
2,chr1,142803199,142803199,G,C,intergenic,ANKRD20A12P;LOC102723769,dist=89594;dist=331404,.,.,...,.,.,.,.,.,.,.,.,.,.
3,chr1,142803606,142803606,-,ATTAATTAATTAATTAAT,intergenic,ANKRD20A12P;LOC102723769,dist=90001;dist=330997,.,.,...,.,.,.,.,.,.,.,.,.,.
4,chr1,142810277,142810277,G,C,intergenic,ANKRD20A12P;LOC102723769,dist=96672;dist=324326,.,.,...,.,.,.,.,.,.,.,.,.,.


Now, the issue is that most of these annotations don't provide predictions, so I'd need to aggregate each of these variables per trio (one summary value per variable) and then run the same kind of paired test on those summaries. This is similar to what we did before, except that instead of counting Ds we would compare the average/median value of each annotation... to be continued.
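
As a starting point for that aggregation, here is a minimal sketch of a per-trio median for score-type annotations (Python 3). The two score column names are taken from the table header above, but the values, the `famA_*` trio IDs and the `trio` column are made up: the real multianno files hold one trio each, so a column like this would have to be added when concatenating them.

```python
import io

import pandas as pd

# Toy stand-in for several multianno tables stacked together; values are made up.
toy = io.StringIO(
    "trio\tdbscSNV_ADA_SCORE\tdpsi_zscore\n"
    "famA_trio1\t0.90\t-1.5\n"
    "famA_trio1\t.\t-0.5\n"
    "famA_trio1\t0.10\t.\n"
    "famA_trio2\t0.20\t0.3\n"
    "famA_trio2\t.\t.\n"
)
df = pd.read_csv(toy, sep="\t", dtype=str)

score_cols = ["dbscSNV_ADA_SCORE", "dpsi_zscore"]
# '.' (the -nastring placeholder) becomes NaN and is skipped by median()
scores = df[score_cols].apply(pd.to_numeric, errors="coerce")
per_trio = scores.groupby(df["trio"]).median()
print(per_trio)
```

Each row of `per_trio` is one trio and each column one annotation, so a column can be split into affected and unaffected values and fed to the same paired tests as before. Whether the median (or the mean, or a maximum) is the right summary for a given score is an open choice that this sketch does not settle.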