Following the analysis from 10/27, let's fix the errors that might be causing the duplicate variables. Also note that some of these results might be a bit different because today I found an error in the pedigree for trio 10215_trio3, so that's fixed now. I re-ran all SNV tools for that particular trio, and redid the ensembles.

# Within tools stats

## GATK

In [1]:
%%bash
suffix=hiConfDeNovo.vcf 
cd /data/NCR_SBRB/simplex/gatk_refine
rm interesting_snvs_${suffix}.txt
# figure out all family IDs
ls -1 *_trio1_${suffix} > famids.txt
sed -i -e "s/_trio1_${suffix}//g" famids.txt

# for each family ID
while read fam; do
  # figure out how many trios we have
   ntrios=`ls -1 ${fam}_trio?_${suffix} | wc -l`;
   ntrios=$(($ntrios))
   # if we have more than one (assuming the first one is affected)
   if [ $ntrios -gt 1 ]; then
      # get all SNVs in the affected trio in the family
      cut -f 1,2 ${fam}_trio1_${suffix} | grep -v '#' - | sort | uniq > ${fam}_possible_snvs_${suffix}.txt;
      # combine the vcf files of all unnafected trios
      cat ${fam}_trio[2..$ntrios]_${suffix} > ${fam}_control_snvs_${suffix}.txt;
      # for each possible SNV in affected trio, mark it as interesting if it's not
      # in the unnafected trios
      while read snv; do
         if ! grep -q "$snv" ${fam}_control_snvs_${suffix}.txt; then
            echo $snv >> interesting_snvs_${suffix}.txt;
         fi;
      done < ${fam}_possible_snvs_${suffix}.txt;
   fi;
done < famids.txt

In [5]:
import pandas as pd
from scipy import stats
import numpy as np


import numpy as np
def do_perms(snps, noften, nboot=10000, npicks=20):
    success = 0
    for i in range(nboot):
        picks = np.random.choice(snps, npicks, replace = True)
        counts = stats.itemfreq(picks)
        nmax = np.max(counts[:,1])
        if (nmax >= noften):
            success += 1
    return(success/float(nboot))

In [6]:
snps_gatk = pd.read_table('/data/NCR_SBRB/simplex/gatk_refine/interesting_snvs_hiConfDeNovo.vcf.txt',
                               header=None, names=['snp'])
counts_gatk = stats.itemfreq(snps_gatk['snp'])
my_max = np.max(counts_gatk[:, 1])
print 'Maximum frequency: %d in %d unique snps' % (my_max, len(counts_gatk))
print counts_gatk[counts_gatk[:, 1] == my_max]

Maximum frequency: 2 in 3388 unique snps
[['chr1 40229504' 2]
 ['chr17 72762902' 2]
 ['chr18 15271172' 2]
 ['chr2 89105006' 2]
 ['chr20 29637674' 2]
 ['chr20 29637691' 2]
 ['chr6 29968761' 2]
 ['chr6 32548712' 2]
 ['chr6 32557647' 2]
 ['chr7 154002420' 2]]


In [7]:
do_perms(counts_gatk[:, 0], 2)

0.0524

The issue we had was not in GATK, but it's good that the changes we made did't affect the results. They're still not great, but maybe close?

## Triodenovo

In [68]:
%%bash
suffix=denovo_v2.vcf 
cd /data/NCR_SBRB/simplex/triodenovo
rm interesting_snvs_${suffix}.txt
# figure out all family IDs
ls -1 *_trio1_${suffix} > famids.txt
sed -i -e "s/_trio1_${suffix}//g" famids.txt

# for each family ID
while read fam; do
  # figure out how many trios we have
   ntrios=`ls -1 ${fam}_trio?_${suffix} | wc -l`;
   ntrios=$(($ntrios))
   # if we have more than one (assuming the first one is affected)
   if [ $ntrios -gt 1 ]; then
      # get all SNVs in the affected trio in the family
      cut -f 1,2 ${fam}_trio1_${suffix} | grep -v '#' - | sort | uniq > ${fam}_possible_snvs_${suffix}.txt;
      # combine the vcf files of all unnafected trios
      cat ${fam}_trio[2..$ntrios]_${suffix} > ${fam}_control_snvs_${suffix}.txt;
      # for each possible SNV in affected trio, mark it as interesting if it's not
      # in the unnafected trios
      while read snv; do
         if ! grep -q "$snv" ${fam}_control_snvs_${suffix}.txt; then
            echo $snv >> interesting_snvs_${suffix}.txt;
         fi;
      done < ${fam}_possible_snvs_${suffix}.txt;
   fi;
done < famids.txt

In [69]:
snps_tdn = pd.read_table('/data/NCR_SBRB/simplex/triodenovo/interesting_snvs_denovo_v2.vcf.txt',
                               header=None, names=['snp'])
counts_tdn = stats.itemfreq(snps_tdn['snp'])
my_max = np.max(counts_tdn[:, 1])
print 'Maximum frequency: %d in %d unique snps' % (my_max, len(counts_tdn))
print counts_tdn[counts_tdn[:, 1] == my_max]

Maximum frequency: 5 in 18802 unique snps
[['chr1 16914580' 5]
 ['chr10 49319323' 5]
 ['chr12 69667893' 5]
 ['chrX 102973509' 5]
 ['chrX 118751076' 5]
 ['chrX 13397236' 5]
 ['chrX 135307049' 5]
 ['chrX 14599572' 5]
 ['chrX 152610294' 5]
 ['chrX 153008911' 5]
 ['chrX 153880181' 5]
 ['chrX 153904473' 5]
 ['chrX 154774663' 5]
 ['chrX 16876980' 5]
 ['chrX 38262808' 5]
 ['chrX 43601142' 5]
 ['chrX 96502650' 5]]


The results here didn't change either... Do we really have 5 trios with each of these SNPs?

In [70]:
%%bash
grep -l "16914580" /data/NCR_SBRB/simplex/triodenovo/*trio1_denovo_v2.vcf
grep -l "49319323" /data/NCR_SBRB/simplex/triodenovo/*trio1_denovo_v2.vcf
grep -l "69667893" /data/NCR_SBRB/simplex/triodenovo/*trio1_denovo_v2.vcf

/data/NCR_SBRB/simplex/triodenovo/10094_trio1_denovo_v2.vcf
/data/NCR_SBRB/simplex/triodenovo/10128_trio1_denovo_v2.vcf
/data/NCR_SBRB/simplex/triodenovo/10153_trio1_denovo_v2.vcf
/data/NCR_SBRB/simplex/triodenovo/10448_trio1_denovo_v2.vcf
/data/NCR_SBRB/simplex/triodenovo/855_trio1_denovo_v2.vcf
/data/NCR_SBRB/simplex/triodenovo/10153_trio1_denovo_v2.vcf
/data/NCR_SBRB/simplex/triodenovo/10406_trio1_denovo_v2.vcf
/data/NCR_SBRB/simplex/triodenovo/1892_trio1_denovo_v2.vcf
/data/NCR_SBRB/simplex/triodenovo/1893_trio1_denovo_v2.vcf
/data/NCR_SBRB/simplex/triodenovo/1976_trio1_denovo_v2.vcf
/data/NCR_SBRB/simplex/triodenovo/10033_trio1_denovo_v2.vcf
/data/NCR_SBRB/simplex/triodenovo/10153_trio1_denovo_v2.vcf
/data/NCR_SBRB/simplex/triodenovo/10164_trio1_denovo_v2.vcf
/data/NCR_SBRB/simplex/triodenovo/10178_trio1_denovo_v2.vcf
/data/NCR_SBRB/simplex/triodenovo/1976_trio1_denovo_v2.vcf


Apparently yes. Not always the same families, but always 5 different ones.

## DenovoGear

In [71]:
%%bash
suffix=dnm.vcf 
cd /data/NCR_SBRB/simplex/dng
rm interesting_snvs_${suffix}.txt
# figure out all family IDs
ls -1 *_trio1_${suffix} > famids.txt
sed -i -e "s/_trio1_${suffix}//g" famids.txt

# for each family ID
while read fam; do
  # figure out how many trios we have
   ntrios=`ls -1 ${fam}_trio?_${suffix} | wc -l`;
   ntrios=$(($ntrios))
   # if we have more than one (assuming the first one is affected)
   if [ $ntrios -gt 1 ]; then
      # get all SNVs in the affected trio in the family
      cut -f 1,2 ${fam}_trio1_${suffix} | grep -v '#' - | sort | uniq > ${fam}_possible_snvs_${suffix}.txt;
      # combine the vcf files of all unnafected trios
      cat ${fam}_trio[2..$ntrios]_${suffix} > ${fam}_control_snvs_${suffix}.txt;
      # for each possible SNV in affected trio, mark it as interesting if it's not
      # in the unnafected trios
      while read snv; do
         if ! grep -q "$snv" ${fam}_control_snvs_${suffix}.txt; then
            echo $snv >> interesting_snvs_${suffix}.txt;
         fi;
      done < ${fam}_possible_snvs_${suffix}.txt;
   fi;
done < famids.txt

In [72]:
snps_dng = pd.read_table('/data/NCR_SBRB/simplex/dng/interesting_snvs_dnm.vcf.txt',
                               header=None, names=['snp'])
counts_dng = stats.itemfreq(snps_dng['snp'])
my_max = np.max(counts_dng[:, 1])
print 'Maximum frequency: %d in %d unique snps' % (my_max, len(counts_dng))
print counts_dng[counts_dng[:, 1] == my_max]

Maximum frequency: 6 in 20901 unique snps
[['chr2 89072915' 6]
 ['chr8 97156476' 6]]


Well, 6 families is better than 5 (or 2). But the other tools didn't pick that one up. Still worth analyzing it...

In [73]:
%%bash
grep -l "89072915" /data/NCR_SBRB/simplex/dng/*trio1_dnm.vcf
grep -l "97156476" /data/NCR_SBRB/simplex/dng/*trio1_dnm.vcf

/data/NCR_SBRB/simplex/dng/10128_trio1_dnm.vcf
/data/NCR_SBRB/simplex/dng/10131_trio1_dnm.vcf
/data/NCR_SBRB/simplex/dng/10164_trio1_dnm.vcf
/data/NCR_SBRB/simplex/dng/10173_trio1_dnm.vcf
/data/NCR_SBRB/simplex/dng/10182_trio1_dnm.vcf
/data/NCR_SBRB/simplex/dng/1893_trio1_dnm.vcf
/data/NCR_SBRB/simplex/dng/10164_trio1_dnm.vcf
/data/NCR_SBRB/simplex/dng/10178_trio1_dnm.vcf
/data/NCR_SBRB/simplex/dng/10197_trio1_dnm.vcf
/data/NCR_SBRB/simplex/dng/10406_trio1_dnm.vcf
/data/NCR_SBRB/simplex/dng/1892_trio1_dnm.vcf
/data/NCR_SBRB/simplex/dng/1976_trio1_dnm.vcf
/data/NCR_SBRB/simplex/dng/855_trio1_dnm.vcf


Again, all 6 different families. There's also a 7th family, but likely there the variable didn't get selected because it was also in the unaffected sib.

# Within group stats

The approach here is to calculate the best stats in all ADHD samples, and see what's the best we can do for a specific variable in non-affected siblings. For comparison, we can do it the other way around as well.

## Triodenovo

In [74]:
%%bash
cd /data/NCR_SBRB/simplex/triodenovo/

# concatenate all affected trios and extract the SNPs
rm affected_snvs.txt unaffected_snvs.txt;
for f in `ls *_trio1_denovo_v2.vcf`; do
    cat $f | grep -v '#' - | awk 'BEGIN {FS="\t"; OFS=":"}; {print $1, $2}' - | sort | uniq >> affected_snvs.txt;
done
# need to be careful not to double count unaffected SNPs
while read fam; do
    if [ -e ${fam}_trio2_denovo_v2.vcf ]; then
        cat ${fam}_trio[2..4]_denovo_v2.vcf | grep -v '#' - | awk 'BEGIN {FS="\t"; OFS=":"}; {print $1, $2}' - | sort | uniq >> unaffected_snvs.txt;
    fi;
done < famids.txt

In [75]:
naff = 21
nunaff = 29
aff_tdn = pd.read_table('/data/NCR_SBRB/simplex/triodenovo/affected_snvs.txt',
                               header=None, names=['snp'])
counts_aff_tdn = stats.itemfreq(aff_tdn['snp'])
my_max = np.max(counts_aff_tdn[:, 1])
print 'Maximum frequency: %d in %d affected families' % (my_max, naff)
print counts_aff_tdn[counts_aff_tdn[:, 1] == my_max]

unaff_tdn = pd.read_table('/data/NCR_SBRB/simplex/triodenovo/unaffected_snvs.txt',
                               header=None, names=['snp'])
counts_unaff_tdn = stats.itemfreq(unaff_tdn['snp'])
my_max = np.max(counts_unaff_tdn[:, 1])
print 'Maximum frequency: %d in %d unaffected families' % (my_max, nunaff)
print counts_unaff_tdn[counts_unaff_tdn[:, 1] == my_max]

Maximum frequency: 14 in 21 affected families
[['chrX:41093413' 14]]
Maximum frequency: 11 in 29 unaffected families
[['chrX:100098355' 11]
 ['chrX:150573743' 11]
 ['chrX:150575385' 11]
 ['chrX:27479339' 11]
 ['chrX:9686187' 11]]


So, we found a mutation in 14 out of the 21 affected families, but we can also find mutations happening in 11 out of 29 unaffected families. So, 14 is not terribly impressive. But let's see how often this one mutation happens in unaffected famlies?

In [76]:
counts_unaff_tdn[counts_unaff_tdn[:, 0] == 'chrX:41093413', 1]

array([9], dtype=object)

We can potentially assign some significance to that... but what if we strict ourselves to autossomal chromosomes?

In [77]:
aff_tdn = pd.read_table('/data/NCR_SBRB/simplex/triodenovo/affected_snvs.txt',
                               header=None, names=['snp'])
counts_aff_tdn = stats.itemfreq(aff_tdn['snp'])
unaff_tdn = pd.read_table('/data/NCR_SBRB/simplex/triodenovo/unaffected_snvs.txt',
                               header=None, names=['snp'])
counts_unaff_tdn = stats.itemfreq(unaff_tdn['snp'])

keep_me = [s for s, snp in enumerate(counts_aff_tdn[:, 0]) if snp.find('chrX') < 0]
counts_aff_tdn = counts_aff_tdn[keep_me, ]
keep_me = [s for s, snp in enumerate(counts_unaff_tdn[:, 0]) if snp.find('chrX') < 0]
counts_unaff_tdn = counts_unaff_tdn[keep_me, ]

my_max = np.max(counts_aff_tdn[:, 1])
max_str = counts_aff_tdn[counts_aff_tdn[:, 1] == my_max, 0][0]
print 'Maximum frequency: %d in %d affected families' % (my_max, naff)
print counts_aff_tdn[counts_aff_tdn[:, 1] == my_max]
my_max = np.max(counts_unaff_tdn[:, 1])
print 'Maximum frequency: %d in %d unaffected families' % (my_max, nunaff)
print counts_unaff_tdn[counts_unaff_tdn[:, 1] == my_max]
print 'Counts %s shows up in unaffected: %d' % (max_str, counts_unaff_tdn[counts_unaff_tdn[:, 0] == max_str, 1])

Maximum frequency: 9 in 21 affected families
[['chr13:42894052' 9]
 ['chr3:39556476' 9]
 ['chrM:3012' 9]]
Maximum frequency: 8 in 29 unaffected families
[['chr1:7889972' 8]
 ['chrM:3012' 8]]
Counts chr13:42894052 shows up in unaffected: 4


## DenovoGear

In [78]:
%%bash
cd /data/NCR_SBRB/simplex/dng/

# concatenate all affected trios and extract the SNPs
rm affected_snvs.txt unaffected_snvs.txt;
for f in `ls *_trio1_dnm.vcf`; do
    cat $f | grep -v '#' - | awk 'BEGIN {FS="\t"; OFS=":"}; {print $1, $2}' - | sort | uniq >> affected_snvs.txt;
done
for f in `ls *_trio[2..4]_dnm.vcf`; do
    cat $f | grep -v '#' - | awk 'BEGIN {FS="\t"; OFS=":"}; {print $1, $2}' - | sort | uniq >> unaffected_snvs.txt;
done

In [79]:
naff = 21
nunaff = 29
aff_dng = pd.read_table('/data/NCR_SBRB/simplex/dng/affected_snvs.txt',
                               header=None, names=['snp'])
counts_aff_dng = stats.itemfreq(aff_dng['snp'])
my_max = np.max(counts_aff_dng[:, 1])
print 'Maximum frequency: %d in %d affected families' % (my_max, naff)
print counts_aff_dng[counts_aff_dng[:, 1] == my_max]

unaff_dng = pd.read_table('/data/NCR_SBRB/simplex/dng/unaffected_snvs.txt',
                               header=None, names=['snp'])
counts_unaff_dng = stats.itemfreq(unaff_dng['snp'])
my_max = np.max(counts_unaff_dng[:, 1])
print 'Maximum frequency: %d in %d unaffected families' % (my_max, nunaff)
print counts_unaff_dng[counts_unaff_dng[:, 1] == my_max]

Maximum frequency: 8 in 21 affected families
[['chr17:21319860' 8]
 ['chrX:8432783' 8]]
Maximum frequency: 7 in 29 unaffected families
[['chr4:113190251' 7]
 ['chrX:13779124' 7]
 ['chrX:152772473' 7]
 ['chrX:8432783' 7]
 ['chrY:9967496' 7]]


In [80]:
aff_dng = pd.read_table('/data/NCR_SBRB/simplex/dng/affected_snvs.txt',
                               header=None, names=['snp'])
counts_aff_dng = stats.itemfreq(aff_dng['snp'])
unaff_dng = pd.read_table('/data/NCR_SBRB/simplex/dng/unaffected_snvs.txt',
                               header=None, names=['snp'])
counts_unaff_dng = stats.itemfreq(unaff_dng['snp'])

keep_me = [s for s, snp in enumerate(counts_aff_dng[:, 0]) if snp.find('chrX') < 0]
counts_aff_dng = counts_aff_dng[keep_me, ]
keep_me = [s for s, snp in enumerate(counts_unaff_dng[:, 0]) if snp.find('chrX') < 0]
counts_unaff_dng = counts_unaff_dng[keep_me, ]

my_max = np.max(counts_aff_dng[:, 1])
max_str = counts_aff_dng[counts_aff_dng[:, 1] == my_max, 0][0]
print 'Maximum frequency: %d in %d affected families' % (my_max, naff)
print counts_aff_dng[counts_aff_dng[:, 1] == my_max]
my_max = np.max(counts_unaff_dng[:, 1])
print 'Maximum frequency: %d in %d unaffected families' % (my_max, nunaff)
print counts_unaff_dng[counts_unaff_dng[:, 1] == my_max]
print 'Counts %s shows up in unaffected: %d' % (max_str, counts_unaff_dng[counts_unaff_dng[:, 0] == max_str, 1])

Maximum frequency: 8 in 21 affected families
[['chr17:21319860' 8]]
Maximum frequency: 7 in 29 unaffected families
[['chr4:113190251' 7]
 ['chrY:9967496' 7]]
Counts chr17:21319860 shows up in unaffected: 6


## GATK

In [32]:
%%bash
cd /data/NCR_SBRB/simplex/gatk_refine

# concatenate all affected trios and extract the SNPs
rm affected_snvs.txt unaffected_snvs.txt;
for f in `ls *_trio1_hiConfDeNovo.vcf`; do
    cat $f | grep -v '#' - | awk 'BEGIN {FS="\t"; OFS=":"}; {print $1, $2}' - | sort | uniq >> affected_snvs.txt;
done
for f in `ls *_trio[2..4]_hiConfDeNovo.vcf`; do
    cat $f | grep -v '#' - | awk 'BEGIN {FS="\t"; OFS=":"}; {print $1, $2}' - | sort | uniq >> unaffected_snvs.txt;
done

In [33]:
naff = 21
nunaff = 29
aff_gatk = pd.read_table('/data/NCR_SBRB/simplex/gatk_refine/affected_snvs.txt',
                               header=None, names=['snp'])
counts_aff_gatk = stats.itemfreq(aff_gatk['snp'])
my_max = np.max(counts_aff_gatk[:, 1])
print 'Maximum frequency: %d in %d affected families' % (my_max, naff)
print counts_aff_gatk[counts_aff_gatk[:, 1] == my_max]

unaff_gatk = pd.read_table('/data/NCR_SBRB/simplex/gatk_refine/unaffected_snvs.txt',
                               header=None, names=['snp'])
counts_unaff_gatk = stats.itemfreq(unaff_gatk['snp'])
my_max = np.max(counts_unaff_gatk[:, 1])
print 'Maximum frequency: %d in %d unaffected families' % (my_max, nunaff)
print counts_unaff_gatk[counts_unaff_gatk[:, 1] == my_max]

Maximum frequency: 2 in 21 affected families
[['chr17:72762902' 2]
 ['chr18:15271172' 2]
 ['chr1:40229504' 2]
 ['chr20:29637674' 2]
 ['chr20:29637691' 2]
 ['chr2:89105006' 2]
 ['chr6:29968761' 2]
 ['chr6:32548712' 2]
 ['chr6:32557647' 2]
 ['chr6:32725367' 2]
 ['chr7:154002420' 2]
 ['chr7:154467985' 2]]
Maximum frequency: 2 in 29 unaffected families
[['chr12:11214231' 2]
 ['chr12:11214232' 2]
 ['chr12:50745851' 2]
 ['chr12:50745893' 2]
 ['chr12:50745894' 2]
 ['chr12:99139159' 2]
 ['chr16:33940111' 2]
 ['chr16:33940122' 2]
 ['chr1:142813302' 2]
 ['chr1:148902738' 2]
 ['chr1:206566826' 2]
 ['chr20:29638202' 2]
 ['chr2:9546136' 2]
 ['chr6:32083111' 2]
 ['chrX:116025284' 2]]


# Some numbers

First, how many DNVs do we get in each trio, per tool?

## TrioDenovo

In [81]:
%%bash

cd /data/NCR_SBRB/simplex/triodenovo
for f in `ls *_trio?_denovo_v2.vcf`; do
    echo $f `grep -v '#' ${f} | sort | uniq | wc -l`;
done

10033_trio1_denovo_v2.vcf 1188
10033_trio2_denovo_v2.vcf 2353
10042_trio1_denovo_v2.vcf 2399
10090_trio1_denovo_v2.vcf 1884
10090_trio2_denovo_v2.vcf 1712
10094_trio1_denovo_v2.vcf 1895
10094_trio2_denovo_v2.vcf 1846
10128_trio1_denovo_v2.vcf 2201
10128_trio2_denovo_v2.vcf 2157
10131_trio1_denovo_v2.vcf 2033
10131_trio2_denovo_v2.vcf 1125
10131_trio3_denovo_v2.vcf 1164
10131_trio4_denovo_v2.vcf 1333
10153_trio1_denovo_v2.vcf 1119
10153_trio2_denovo_v2.vcf 1912
10153_trio3_denovo_v2.vcf 1007
10164_trio1_denovo_v2.vcf 2141
10164_trio2_denovo_v2.vcf 1869
10173_trio1_denovo_v2.vcf 1897
10173_trio2_denovo_v2.vcf 1916
10178_trio1_denovo_v2.vcf 2577
10178_trio2_denovo_v2.vcf 2423
10182_trio1_denovo_v2.vcf 2347
10182_trio2_denovo_v2.vcf 1333
10182_trio3_denovo_v2.vcf 1830
10197_trio1_denovo_v2.vcf 1281
10197_trio2_denovo_v2.vcf 1179
10215_trio1_denovo_v2.vcf 2311
10215_trio2_denovo_v2.vcf 2270
10215_trio3_denovo_v2.vcf 1122
10215_trio4_denovo_v2.vcf 1177
10369_trio1_denovo_v2.vcf 600
10369_tri

## DenovoGear

In [82]:
%%bash

cd /data/NCR_SBRB/simplex/dng
for f in `ls *_trio?_dnm.vcf`; do
    echo $f `grep -v '#' ${f} | sort | uniq | wc -l`;
done

10033_trio1_dnm.vcf 1417
10033_trio2_dnm.vcf 2305
10042_trio1_dnm.vcf 2234
10090_trio1_dnm.vcf 1896
10090_trio2_dnm.vcf 1810
10094_trio1_dnm.vcf 1962
10094_trio2_dnm.vcf 1845
10128_trio1_dnm.vcf 1574
10128_trio2_dnm.vcf 1896
10131_trio1_dnm.vcf 2078
10131_trio2_dnm.vcf 1538
10131_trio3_dnm.vcf 1398
10131_trio4_dnm.vcf 1663
10153_trio1_dnm.vcf 1399
10153_trio2_dnm.vcf 1710
10153_trio3_dnm.vcf 1284
10164_trio1_dnm.vcf 1909
10164_trio2_dnm.vcf 1724
10173_trio1_dnm.vcf 1825
10173_trio2_dnm.vcf 1836
10178_trio1_dnm.vcf 2310
10178_trio2_dnm.vcf 2343
10182_trio1_dnm.vcf 2157
10182_trio2_dnm.vcf 1620
10182_trio3_dnm.vcf 1775
10197_trio1_dnm.vcf 1634
10197_trio2_dnm.vcf 1414
10215_trio1_dnm.vcf 1912
10215_trio2_dnm.vcf 1900
10215_trio3_dnm.vcf 1311
10215_trio4_dnm.vcf 1262
10369_trio1_dnm.vcf 636
10369_trio2_dnm.vcf 635
10406_trio1_dnm.vcf 1674
10406_trio2_dnm.vcf 1187
10406_trio3_dnm.vcf 1870
10448_trio1_dnm.vcf 1824
10448_trio2_dnm.vcf 1681
10459_trio2_dnm.vcf 1824
1892_trio1_dnm.vcf 1347
189

## GATK

In [64]:
%%bash

cd /data/NCR_SBRB/simplex/gatk_refine
for f in `ls *_trio?_hiConfDeNovo.vcf`; do
    echo $f `grep -v '#' ${f} | sort | uniq | wc -l`;
done

10033_trio1_hiConfDeNovo.vcf 69
10033_trio2_hiConfDeNovo.vcf 99
10042_trio1_hiConfDeNovo.vcf 81
10090_trio1_hiConfDeNovo.vcf 89
10090_trio2_hiConfDeNovo.vcf 63
10094_trio1_hiConfDeNovo.vcf 80
10094_trio2_hiConfDeNovo.vcf 55
10128_trio1_hiConfDeNovo.vcf 54
10128_trio2_hiConfDeNovo.vcf 52
10131_trio1_hiConfDeNovo.vcf 48
10131_trio2_hiConfDeNovo.vcf 65
10131_trio3_hiConfDeNovo.vcf 39
10131_trio4_hiConfDeNovo.vcf 86
10153_trio1_hiConfDeNovo.vcf 70
10153_trio2_hiConfDeNovo.vcf 75
10153_trio3_hiConfDeNovo.vcf 69
10164_trio1_hiConfDeNovo.vcf 122
10164_trio2_hiConfDeNovo.vcf 103
10173_trio1_hiConfDeNovo.vcf 101
10173_trio2_hiConfDeNovo.vcf 112
10178_trio1_hiConfDeNovo.vcf 519
10178_trio2_hiConfDeNovo.vcf 354
10182_trio1_hiConfDeNovo.vcf 90
10182_trio2_hiConfDeNovo.vcf 77
10182_trio3_hiConfDeNovo.vcf 72
10197_trio1_hiConfDeNovo.vcf 416
10197_trio2_hiConfDeNovo.vcf 244
10215_trio1_hiConfDeNovo.vcf 201
10215_trio2_hiConfDeNovo.vcf 128
10215_trio3_hiConfDeNovo.vcf 2023
10215_trio4_hiConfDeNovo.vcf

How many of those are only show up in the affected trio?

## TrioDenovo

In [83]:
%%bash
suffix=denovo_v2.vcf 
cd /data/NCR_SBRB/simplex/triodenovo
rm interesting_snvs_${suffix}.txt
# figure out all family IDs
ls -1 *_trio1_${suffix} > famids.txt
sed -i -e "s/_trio1_${suffix}//g" famids.txt

# for each family ID
while read fam; do
  # figure out how many trios we have
   ntrios=`ls -1 ${fam}_trio?_${suffix} | wc -l`;
   ntrios=$(($ntrios))
   # if we have more than one (assuming the first one is affected)
   if [ $ntrios -gt 1 ]; then
      # get all SNVs in the affected trio in the family
      cut -f 1,2 ${fam}_trio1_${suffix} | grep -v '#' - | sort | uniq > ${fam}_possible_snvs_${suffix}.txt;
      # combine the vcf files of all unnafected trios
      cat ${fam}_trio[2..$ntrios]_${suffix} > ${fam}_control_snvs_${suffix}.txt;
      # for each possible SNV in affected trio, mark it as interesting if it's not
      # in the unnafected trios
      while read snv; do
         if ! grep -q "$snv" ${fam}_control_snvs_${suffix}.txt; then
            echo $snv >> interesting_snvs_${suffix}.txt;
         fi;
      done < ${fam}_possible_snvs_${suffix}.txt;
      echo $fam `cat interesting_snvs_${suffix}.txt | wc -l`;
      rm interesting_snvs_${suffix}.txt
   fi;
done < famids.txt

10033 956
10090 1238
10094 1052
10128 1080
10131 1707
10153 760
10164 1515
10173 1105
10178 1720
10182 1105
10197 1078
10215 960
10369 402
10406 826
10448 1164
1892 823
1893 1625
1895 1209
1976 1150
855 1075


## DenovoGear

In [84]:
%%bash
suffix=dnm.vcf 
cd /data/NCR_SBRB/simplex/dng
rm interesting_snvs_${suffix}.txt
# figure out all family IDs
ls -1 *_trio1_${suffix} > famids.txt
sed -i -e "s/_trio1_${suffix}//g" famids.txt

# for each family ID
while read fam; do
  # figure out how many trios we have
   ntrios=`ls -1 ${fam}_trio?_${suffix} | wc -l`;
   ntrios=$(($ntrios))
   # if we have more than one (assuming the first one is affected)
   if [ $ntrios -gt 1 ]; then
      # get all SNVs in the affected trio in the family
      cut -f 1,2 ${fam}_trio1_${suffix} | grep -v '#' - | sort | uniq > ${fam}_possible_snvs_${suffix}.txt;
      # combine the vcf files of all unnafected trios
      cat ${fam}_trio[2..$ntrios]_${suffix} > ${fam}_control_snvs_${suffix}.txt;
      # for each possible SNV in affected trio, mark it as interesting if it's not
      # in the unnafected trios
      while read snv; do
         if ! grep -q "$snv" ${fam}_control_snvs_${suffix}.txt; then
            echo $snv >> interesting_snvs_${suffix}.txt;
         fi;
      done < ${fam}_possible_snvs_${suffix}.txt;
      echo $fam `cat interesting_snvs_${suffix}.txt | wc -l`;
      rm interesting_snvs_${suffix}.txt
   fi;
done < famids.txt

10033 1128
10090 1403
10094 1349
10128 1171
10131 1614
10153 922
10164 1539
10173 1226
10178 1866
10182 1165
10197 1349
10215 1042
10369 431
10406 947
10448 1355
1892 1074
1893 1743
1895 1499
1976 1298
855 1348


## GATK

In [85]:
%%bash
suffix=hiConfDeNovo.vcf 
cd /data/NCR_SBRB/simplex/gatk_refine
rm interesting_snvs_${suffix}.txt
# figure out all family IDs
ls -1 *_trio1_${suffix} > famids.txt
sed -i -e "s/_trio1_${suffix}//g" famids.txt

# for each family ID
while read fam; do
  # figure out how many trios we have
   ntrios=`ls -1 ${fam}_trio?_${suffix} | wc -l`;
   ntrios=$(($ntrios))
   # if we have more than one (assuming the first one is affected)
   if [ $ntrios -gt 1 ]; then
      # get all SNVs in the affected trio in the family
      cut -f 1,2 ${fam}_trio1_${suffix} | grep -v '#' - | sort | uniq > ${fam}_possible_snvs_${suffix}.txt;
      # combine the vcf files of all unnafected trios
      cat ${fam}_trio[2..$ntrios]_${suffix} > ${fam}_control_snvs_${suffix}.txt;
      # for each possible SNV in affected trio, mark it as interesting if it's not
      # in the unnafected trios
      while read snv; do
         if ! grep -q "$snv" ${fam}_control_snvs_${suffix}.txt; then
            echo $snv >> interesting_snvs_${suffix}.txt;
         fi;
      done < ${fam}_possible_snvs_${suffix}.txt;
      echo $fam `cat interesting_snvs_${suffix}.txt | wc -l`;
      rm interesting_snvs_${suffix}.txt
   fi;
done < famids.txt

10033 68
10090 87
10094 79
10128 52
10131 45
10153 66
10164 119
10173 101
10178 516
10182 85
10197 415
10215 200
10369 946
10406 118
10448 69
1892 91
1893 95
1895 80
1976 89
855 77


Let's now find a few intersections. First, for all called SNPs:

Now, intersections for the SNPs in affected sibs only:

* RUN GATK WITHOUT FILTER? USE BOTH HICONF AND LOCONF DENOVOS!
* GET NUMBER OF POSSIBLE DE NOVO PER FAMILY
* MAKE VENN DIAGRAMS?