Following the analysis from 10/30, let's check if using the low confidence denovo calls from GATK helps with the consensus results. Also, let's ignore DenovoGear from now. According to the triodenovo paper, in very few (if any) cases, the calls from DNG are better than triodenovo's.

# Within tools stats

## GATK

In [5]:
%%bash
suffix=allDeNovo.vcf 
cd /data/NCR_SBRB/simplex/gatk_refine
rm interesting_snvs_${suffix}.txt
# figure out all family IDs
ls -1 *_trio1_${suffix} > famids.txt
sed -i -e "s/_trio1_${suffix}//g" famids.txt

# for each family ID
while read fam; do
  # figure out how many trios we have
   ntrios=`ls -1 ${fam}_trio?_${suffix} | wc -l`;
   ntrios=$(($ntrios))
   # if we have more than one (assuming the first one is affected)
   if [ $ntrios -gt 1 ]; then
      # get all SNVs in the affected trio in the family
      cut -f 1,2 ${fam}_trio1_${suffix} | grep -v '#' - | sort | uniq > ${fam}_possible_snvs_${suffix}.txt;
      # combine the vcf files of all unnafected trios
      cat ${fam}_trio[2..$ntrios]_${suffix} | grep -v '#' - | cut -f 1,2 - > ${fam}_control_snvs_${suffix}.txt;
      # for each possible SNV in affected trio, mark it as interesting if it's not
      # in the unnafected trios
      while read snv; do
         if ! grep -q "$snv" ${fam}_control_snvs_${suffix}.txt; then
            echo $snv >> interesting_snvs_${suffix}.txt;
         fi;
      done < ${fam}_possible_snvs_${suffix}.txt;
   fi;
done < famids.txt

In [6]:
import pandas as pd
from scipy import stats
import numpy as np


import numpy as np
def do_perms(snps, noften, nboot=10000, npicks=20):
    success = 0
    for i in range(nboot):
        picks = np.random.choice(snps, npicks, replace = True)
        counts = stats.itemfreq(picks)
        nmax = np.max(counts[:,1])
        if (nmax >= noften):
            success += 1
    return(success/float(nboot))

In [7]:
snps_gatk = pd.read_table('/data/NCR_SBRB/simplex/gatk_refine/interesting_snvs_allDeNovo.vcf.txt',
                               header=None, names=['snp'])
counts_gatk = stats.itemfreq(snps_gatk['snp'])
my_max = np.max(counts_gatk[:, 1])
print 'Maximum frequency: %d in %d unique snps' % (my_max, len(counts_gatk))
print counts_gatk[counts_gatk[:, 1] == my_max]

Maximum frequency: 3 in 139467 unique snps
[['chr1 120598739' 3]
 ['chr1 142897718' 3]
 ['chr1 143251223' 3]
 ['chr1 143420050' 3]
 ['chr1 143449316' 3]
 ['chr1 245928122' 3]
 ['chr10 133135993' 3]
 ['chr10 19026307' 3]
 ['chr11 51394527' 3]
 ['chr11 55046877' 3]
 ['chr12 118406089' 3]
 ['chr12 121471931' 3]
 ['chr12 9015996' 3]
 ['chr13 19447335' 3]
 ['chr13 52801661' 3]
 ['chr13 70469364' 3]
 ['chr13 73287599' 3]
 ['chr14 104630801' 3]
 ['chr14 20004233' 3]
 ['chr14 35476924' 3]
 ['chr14 66368478' 3]
 ['chr15 22049121' 3]
 ['chr15 78762313' 3]
 ['chr15 79220567' 3]
 ['chr16 30199933' 3]
 ['chr16 32141408' 3]
 ['chr16 33040759' 3]
 ['chr16 33865368' 3]
 ['chr16 33865369' 3]
 ['chr16 33865375' 3]
 ['chr16 33865377' 3]
 ['chr16 33937413' 3]
 ['chr16 55016333' 3]
 ['chr16 87747960' 3]
 ['chr17 21244764' 3]
 ['chr17 21248698' 3]
 ['chr17 25296420' 3]
 ['chr17 78795555' 3]
 ['chr17 80441236' 3]
 ['chr18 15172029' 3]
 ['chr18 34803023' 3]
 ['chr18 9880504' 3]
 ['chr19 53184007' 3]
 ['chr19 

In [8]:
do_perms(counts_gatk[:, 0], 3)

0.0

So, now we have several SNPs with at least 3 families calling it interesting... that's better, even though not all of them are high confidence according to GATK. Still, there are a bit too many for further investigation. Let's see if any of them agree with Triodenovo's calls.

## Triodenovo

In [10]:
%%bash
suffix=denovo_v2.vcf 
cd /data/NCR_SBRB/simplex/triodenovo
rm interesting_snvs_${suffix}.txt
# figure out all family IDs
ls -1 *_trio1_${suffix} > famids.txt
sed -i -e "s/_trio1_${suffix}//g" famids.txt

# for each family ID
while read fam; do
  # figure out how many trios we have
   ntrios=`ls -1 ${fam}_trio?_${suffix} | wc -l`;
   ntrios=$(($ntrios))
   # if we have more than one (assuming the first one is affected)
   if [ $ntrios -gt 1 ]; then
      # get all SNVs in the affected trio in the family
      cut -f 1,2 ${fam}_trio1_${suffix} | grep -v '#' - | sort | uniq > ${fam}_possible_snvs_${suffix}.txt;
      # combine the vcf files of all unnafected trios
      cat ${fam}_trio[2..$ntrios]_${suffix} > ${fam}_control_snvs_${suffix}.txt;
      # for each possible SNV in affected trio, mark it as interesting if it's not
      # in the unnafected trios
      while read snv; do
         if ! grep -q "$snv" ${fam}_control_snvs_${suffix}.txt; then
            echo $snv >> interesting_snvs_${suffix}.txt;
         fi;
      done < ${fam}_possible_snvs_${suffix}.txt;
   fi;
done < famids.txt

rm: cannot remove `interesting_snvs_denovo_v2.vcf.txt': No such file or directory


In [11]:
snps_tdn = pd.read_table('/data/NCR_SBRB/simplex/triodenovo/interesting_snvs_denovo_v2.vcf.txt',
                               header=None, names=['snp'])
counts_tdn = stats.itemfreq(snps_tdn['snp'])
my_max = np.max(counts_tdn[:, 1])
print 'Maximum frequency: %d in %d unique snps' % (my_max, len(counts_tdn))
print counts_tdn[counts_tdn[:, 1] == my_max]

Maximum frequency: 5 in 18802 unique snps
[['chr1 16914580' 5]
 ['chr10 49319323' 5]
 ['chr12 69667893' 5]
 ['chrX 102973509' 5]
 ['chrX 118751076' 5]
 ['chrX 13397236' 5]
 ['chrX 135307049' 5]
 ['chrX 14599572' 5]
 ['chrX 152610294' 5]
 ['chrX 153008911' 5]
 ['chrX 153880181' 5]
 ['chrX 153904473' 5]
 ['chrX 154774663' 5]
 ['chrX 16876980' 5]
 ['chrX 38262808' 5]
 ['chrX 43601142' 5]
 ['chrX 96502650' 5]]


OK, do any of these SNPs with 5 families also show up in the GATK calls?

In [18]:
best_tdn = counts_tdn[counts_tdn[:, 1] == 5, 0]
best_gatk = counts_gatk[counts_gatk[:, 1] == 3, 0]
joint = [s for s in best_tdn if s in best_gatk]
print joint

[]


Whomp whomp...

# Within group stats

The approach here is to calculate the best stats in all ADHD samples, and see what's the best we can do for a specific variable in non-affected siblings. For comparison, we can do it the other way around as well.

## Triodenovo

In [20]:
%%bash
cd /data/NCR_SBRB/simplex/triodenovo/

# concatenate all affected trios and extract the SNPs
rm affected_snvs.txt unaffected_snvs.txt;
for f in `ls *_trio1_denovo_v2.vcf`; do
    cat $f | grep -v '#' - | awk 'BEGIN {FS="\t"; OFS=":"}; {print $1, $2}' - | sort | uniq >> affected_snvs.txt;
done
# need to be careful not to double count unaffected SNPs
while read fam; do
    if [ -e ${fam}_trio2_denovo_v2.vcf ]; then
        cat ${fam}_trio[2..4]_denovo_v2.vcf | grep -v '#' - | awk 'BEGIN {FS="\t"; OFS=":"}; {print $1, $2}' - | sort | uniq >> unaffected_snvs.txt;
    fi;
done < famids.txt

In [21]:
naff = 21
nunaff = 29
aff_tdn = pd.read_table('/data/NCR_SBRB/simplex/triodenovo/affected_snvs.txt',
                               header=None, names=['snp'])
counts_aff_tdn = stats.itemfreq(aff_tdn['snp'])
my_max = np.max(counts_aff_tdn[:, 1])
print 'Maximum frequency: %d in %d affected families' % (my_max, naff)
print counts_aff_tdn[counts_aff_tdn[:, 1] == my_max]

unaff_tdn = pd.read_table('/data/NCR_SBRB/simplex/triodenovo/unaffected_snvs.txt',
                               header=None, names=['snp'])
counts_unaff_tdn = stats.itemfreq(unaff_tdn['snp'])
my_max = np.max(counts_unaff_tdn[:, 1])
print 'Maximum frequency: %d in %d unaffected families' % (my_max, nunaff)
print counts_unaff_tdn[counts_unaff_tdn[:, 1] == my_max]

Maximum frequency: 14 in 21 affected families
[['chrX:41093413' 14]]
Maximum frequency: 11 in 29 unaffected families
[['chrX:100098355' 11]
 ['chrX:150573743' 11]
 ['chrX:150575385' 11]
 ['chrX:27479339' 11]
 ['chrX:9686187' 11]]


So, we found a mutation in 14 out of the 21 affected families, but we can also find mutations happening in 11 out of 29 unaffected families. So, 14 is not terribly impressive. But let's see how often this one mutation happens in unaffected famlies?

In [22]:
counts_unaff_tdn[counts_unaff_tdn[:, 0] == 'chrX:41093413', 1]

array([9], dtype=object)

We can potentially assign some significance to that... but what if we strict ourselves to autossomal chromosomes?

In [28]:
aff_tdn = pd.read_table('/data/NCR_SBRB/simplex/triodenovo/affected_snvs.txt',
                               header=None, names=['snp'])
counts_aff_tdn = stats.itemfreq(aff_tdn['snp'])
unaff_tdn = pd.read_table('/data/NCR_SBRB/simplex/triodenovo/unaffected_snvs.txt',
                               header=None, names=['snp'])
counts_unaff_tdn = stats.itemfreq(unaff_tdn['snp'])

keep_me = [s for s, snp in enumerate(counts_aff_tdn[:, 0]) if snp.find('chrX') < 0]
counts_aff_tdn = counts_aff_tdn[keep_me, ]
keep_me = [s for s, snp in enumerate(counts_unaff_tdn[:, 0]) if snp.find('chrX') < 0]
counts_unaff_tdn = counts_unaff_tdn[keep_me, ]

my_max = np.max(counts_aff_tdn[:, 1])
max_str = counts_aff_tdn[counts_aff_tdn[:, 1] == my_max, 0]
print 'Maximum frequency: %d in %d affected families' % (my_max, naff)
print counts_aff_tdn[counts_aff_tdn[:, 1] == my_max]
my_max = np.max(counts_unaff_tdn[:, 1])
print 'Maximum frequency: %d in %d unaffected families' % (my_max, nunaff)
print counts_unaff_tdn[counts_unaff_tdn[:, 1] == my_max]
for s in max_str:
    print 'Counts %s shows up in unaffected: %d' % (s, counts_unaff_tdn[counts_unaff_tdn[:, 0] == s, 1])

Maximum frequency: 9 in 21 affected families
[['chr13:42894052' 9]
 ['chr3:39556476' 9]
 ['chrM:3012' 9]]
Maximum frequency: 8 in 29 unaffected families
[['chr1:7889972' 8]
 ['chrM:3012' 8]]
Counts chr13:42894052 shows up in unaffected: 4
Counts chr3:39556476 shows up in unaffected: 7
Counts chrM:3012 shows up in unaffected: 8


## GATK

In [37]:
%%bash
cd /data/NCR_SBRB/simplex/gatk_refine

# concatenate all affected trios and extract the SNPs
rm affected_snvs.txt unaffected_snvs.txt;
for f in `ls *_trio1_allDeNovo.vcf`; do
    cat $f | grep -v '#' - | awk 'BEGIN {FS="\t"; OFS=":"}; {print $1, $2}' - | sort | uniq >> affected_snvs.txt;
done
for f in `ls *_trio[2..4]_allDeNovo.vcf`; do
    cat $f | grep -v '#' - | awk 'BEGIN {FS="\t"; OFS=":"}; {print $1, $2}' - | sort | uniq >> unaffected_snvs.txt;
done

In [38]:
naff = 21
nunaff = 29
aff_gatk = pd.read_table('/data/NCR_SBRB/simplex/gatk_refine/affected_snvs.txt',
                               header=None, names=['snp'])
counts_aff_gatk = stats.itemfreq(aff_gatk['snp'])
my_max = np.max(counts_aff_gatk[:, 1])
max_str = counts_aff_gatk[counts_aff_gatk[:, 1] == my_max, 0]
print 'Maximum frequency: %d in %d affected families' % (my_max, naff)
unaff_gatk = pd.read_table('/data/NCR_SBRB/simplex/gatk_refine/unaffected_snvs.txt',
                               header=None, names=['snp'])
counts_unaff_gatk = stats.itemfreq(unaff_gatk['snp'])
my_max = np.max(counts_unaff_gatk[:, 1])
print 'Maximum frequency: %d in %d unaffected families' % (my_max, nunaff)
for s in max_str:
    if s in counts_unaff_gatk[:, 0]:
        mycount = counts_unaff_gatk[counts_unaff_gatk[:, 0] == s, 1]
    else:
        mycount = 0
    print 'Counts %s shows up in unaffected: %d' % (s, mycount)

Maximum frequency: 3 in 21 affected families
Maximum frequency: 3 in 29 unaffected families
Counts chr10:133135993 shows up in unaffected: 0
Counts chr10:19026307 shows up in unaffected: 0
Counts chr11:51394527 shows up in unaffected: 0
Counts chr11:55046877 shows up in unaffected: 0
Counts chr12:118406089 shows up in unaffected: 0
Counts chr12:121471931 shows up in unaffected: 0
Counts chr12:9015996 shows up in unaffected: 0
Counts chr13:19447335 shows up in unaffected: 0
Counts chr13:52801661 shows up in unaffected: 0
Counts chr13:70469364 shows up in unaffected: 0
Counts chr13:73287599 shows up in unaffected: 0
Counts chr14:104630801 shows up in unaffected: 0
Counts chr14:20004233 shows up in unaffected: 0
Counts chr14:35476924 shows up in unaffected: 0
Counts chr14:66368478 shows up in unaffected: 0
Counts chr14:93118237 shows up in unaffected: 0
Counts chr15:22049121 shows up in unaffected: 0
Counts chr15:78762313 shows up in unaffected: 0
Counts chr15:79220567 shows up in unaffec

Also not good...

# Some numbers

First, how many DNVs do we get in each trio, per tool?

## TrioDenovo

In [41]:
%%bash

cd /data/NCR_SBRB/simplex/triodenovo
for f in `ls *_trio?_denovo_v2.vcf`; do
    echo $f `grep -v '#' ${f} | sort | uniq | wc -l`;
done

10033_trio1_denovo_v2.vcf 1188
10033_trio2_denovo_v2.vcf 2353
10042_trio1_denovo_v2.vcf 2399
10090_trio1_denovo_v2.vcf 1884
10090_trio2_denovo_v2.vcf 1712
10094_trio1_denovo_v2.vcf 1895
10094_trio2_denovo_v2.vcf 1846
10128_trio1_denovo_v2.vcf 2201
10128_trio2_denovo_v2.vcf 2157
10131_trio1_denovo_v2.vcf 2033
10131_trio2_denovo_v2.vcf 1125
10131_trio3_denovo_v2.vcf 1164
10131_trio4_denovo_v2.vcf 1333
10153_trio1_denovo_v2.vcf 1119
10153_trio2_denovo_v2.vcf 1912
10153_trio3_denovo_v2.vcf 1007
10164_trio1_denovo_v2.vcf 2141
10164_trio2_denovo_v2.vcf 1869
10173_trio1_denovo_v2.vcf 1897
10173_trio2_denovo_v2.vcf 1916
10178_trio1_denovo_v2.vcf 2577
10178_trio2_denovo_v2.vcf 2423
10182_trio1_denovo_v2.vcf 2347
10182_trio2_denovo_v2.vcf 1333
10182_trio3_denovo_v2.vcf 1830
10197_trio1_denovo_v2.vcf 1281
10197_trio2_denovo_v2.vcf 1179
10215_trio1_denovo_v2.vcf 2311
10215_trio2_denovo_v2.vcf 2270
10215_trio3_denovo_v2.vcf 1122
10215_trio4_denovo_v2.vcf 1177
10369_trio1_denovo_v2.vcf 600
10369_tri

## GATK

In [42]:
%%bash

cd /data/NCR_SBRB/simplex/gatk_refine
for f in `ls *_trio?_allDeNovo.vcf`; do
    echo $f `grep -v '#' ${f} | sort | uniq | wc -l`;
done

10033_trio1_allDeNovo.vcf 3702
10033_trio2_allDeNovo.vcf 6858
10042_trio1_allDeNovo.vcf 4431
10090_trio1_allDeNovo.vcf 4607
10090_trio2_allDeNovo.vcf 3896
10094_trio1_allDeNovo.vcf 4511
10094_trio2_allDeNovo.vcf 3632
10128_trio1_allDeNovo.vcf 3707
10128_trio2_allDeNovo.vcf 3098
10131_trio1_allDeNovo.vcf 3571
10131_trio2_allDeNovo.vcf 3924
10131_trio3_allDeNovo.vcf 3078
10131_trio4_allDeNovo.vcf 4552
10153_trio1_allDeNovo.vcf 4184
10153_trio2_allDeNovo.vcf 4603
10153_trio3_allDeNovo.vcf 3912
10164_trio1_allDeNovo.vcf 5132
10164_trio2_allDeNovo.vcf 3549
10173_trio1_allDeNovo.vcf 4318
10173_trio2_allDeNovo.vcf 3454
10178_trio1_allDeNovo.vcf 9729
10178_trio2_allDeNovo.vcf 6656
10182_trio1_allDeNovo.vcf 5257
10182_trio2_allDeNovo.vcf 4141
10182_trio3_allDeNovo.vcf 3044
10197_trio1_allDeNovo.vcf 7505
10197_trio2_allDeNovo.vcf 5022
10215_trio1_allDeNovo.vcf 6976
10215_trio2_allDeNovo.vcf 4085
10215_trio3_allDeNovo.vcf 3993
10215_trio4_allDeNovo.vcf 5384
10369_trio1_allDeNovo.vcf 58406
10369_t

How many of those are only show up in the affected trio?

## TrioDenovo

In [43]:
%%bash
suffix=denovo_v2.vcf 
cd /data/NCR_SBRB/simplex/triodenovo
rm interesting_snvs_${suffix}.txt
# figure out all family IDs
ls -1 *_trio1_${suffix} > famids.txt
sed -i -e "s/_trio1_${suffix}//g" famids.txt

# for each family ID
while read fam; do
  # figure out how many trios we have
   ntrios=`ls -1 ${fam}_trio?_${suffix} | wc -l`;
   ntrios=$(($ntrios))
   # if we have more than one (assuming the first one is affected)
   if [ $ntrios -gt 1 ]; then
      # get all SNVs in the affected trio in the family
      cut -f 1,2 ${fam}_trio1_${suffix} | grep -v '#' - | sort | uniq > ${fam}_possible_snvs_${suffix}.txt;
      # combine the vcf files of all unnafected trios
      cat ${fam}_trio[2..$ntrios]_${suffix} > ${fam}_control_snvs_${suffix}.txt;
      # for each possible SNV in affected trio, mark it as interesting if it's not
      # in the unnafected trios
      while read snv; do
         if ! grep -q "$snv" ${fam}_control_snvs_${suffix}.txt; then
            echo $snv >> interesting_snvs_${suffix}.txt;
         fi;
      done < ${fam}_possible_snvs_${suffix}.txt;
      echo $fam `cat interesting_snvs_${suffix}.txt | wc -l`;
      rm interesting_snvs_${suffix}.txt
   fi;
done < famids.txt

10033 956
10090 1238
10094 1052
10128 1080
10131 1707
10153 760
10164 1515
10173 1105
10178 1720
10182 1105
10197 1078
10215 960
10369 402
10406 826
10448 1164
1892 823
1893 1625
1895 1209
1976 1150
855 1075


## GATK

In [44]:
%%bash
suffix=allDeNovo.vcf 
cd /data/NCR_SBRB/simplex/gatk_refine
rm interesting_snvs_${suffix}.txt
# figure out all family IDs
ls -1 *_trio1_${suffix} > famids.txt
sed -i -e "s/_trio1_${suffix}//g" famids.txt

# for each family ID
while read fam; do
  # figure out how many trios we have
   ntrios=`ls -1 ${fam}_trio?_${suffix} | wc -l`;
   ntrios=$(($ntrios))
   # if we have more than one (assuming the first one is affected)
   if [ $ntrios -gt 1 ]; then
      # get all SNVs in the affected trio in the family
      cut -f 1,2 ${fam}_trio1_${suffix} | grep -v '#' - | sort | uniq > ${fam}_possible_snvs_${suffix}.txt;
      # combine the vcf files of all unnafected trios
      cat ${fam}_trio[2..$ntrios]_${suffix} | grep -v '#' - | cut -f 1,2 -> ${fam}_control_snvs_${suffix}.txt;
      # for each possible SNV in affected trio, mark it as interesting if it's not
      # in the unnafected trios
      while read snv; do
         if ! grep -q "$snv" ${fam}_control_snvs_${suffix}.txt; then
            echo $snv >> interesting_snvs_${suffix}.txt;
         fi;
      done < ${fam}_possible_snvs_${suffix}.txt;
      echo $fam `cat interesting_snvs_${suffix}.txt | wc -l`;
      rm interesting_snvs_${suffix}.txt
   fi;
done < famids.txt

10033 3491
10090 4436
10094 4340
10128 3590
10131 3282
10153 3862
10164 4864
10173 4119
10178 8915
10182 4784
10197 6984
10215 6096
10369 52928
10406 3401
10448 5365
1892 2997
1893 5727
1895 3755
1976 5628
855 5088


Let's now find a few intersections. First, for all called SNPs:

Now, intersections for the SNPs in affected sibs only:

* GET NUMBER OF POSSIBLE DE NOVO PER FAMILY
* MAKE VENN DIAGRAMS?