Let's get some numbers for Philip to show in the big presentation. Let's do it for GATK high conf only, then all, then triodenovo:

In [1]:
import pandas as pd
from scipy import stats
import numpy as np

In [13]:
# need to make this less clunky in the future!
%env SLURM_JOB_ID=53272018
%env SLURM_CPUS_PER_TASK=32

env: SLURM_JOB_ID=53272018
env: SLURM_CPUS_PER_TASK=32


# GATK high conf

In [None]:
%%bash

module load annovar

suffix=hiConfDeNovo
cd /data/NCR_SBRB/simplex/gatk_refine

# for each family ID
echo "famid,denovo,popmax,popmax_MAFltp01,kaviar,kaviar_MAFltp01,dbSNP,ClinVar" > rare_variants_${suffix}.csv;
while read fam; do
  # get all SNVs in the affected trio in the family
  cut -f 1,2 ${fam}_trio1_${suffix}.vcf | grep -v '#' - | sort | uniq > ${fam}_possible_snvs_${suffix}.txt;
  ndenovo=`cat ${fam}_possible_snvs_${suffix}.txt | wc -l`;
  # convert the file to ANNOVAR format
  convert2annovar.pl -format vcf4old ${fam}_trio1_${suffix}.vcf > ${fam}_trio1_${suffix}.avinput;
  # assign population statistics to the file accoring to different databases
  annotate_variation.pl -filter -dbtype popfreq_max_20150413 ${fam}_trio1_${suffix}.avinput $ANNOVAR_DATA/hg19 -build hg19;
  npopmax=`cat ${fam}_trio1_${suffix}.avinput.hg19_popfreq_max_20150413_dropped | wc -l`;
  # filter based on rare variants
  nMAFpopmax=`awk '$2 < .01' ${fam}_trio1_${suffix}.avinput.hg19_popfreq_max_20150413_dropped | wc -l`
  annotate_variation.pl -filter -dbtype kaviar_20150923 ${fam}_trio1_${suffix}.avinput $ANNOVAR_DATA/hg19 -build hg19;
  nkaviar=`cat ${fam}_trio1_${suffix}.avinput.hg19_kaviar_20150923_dropped | wc -l`;
  nMAFkaviar=`awk '$2 < .01' ${fam}_trio1_${suffix}.avinput.hg19_kaviar_20150923_dropped | wc -l`
  annotate_variation.pl -filter -dbtype avsnp142 ${fam}_trio1_${suffix}.avinput $ANNOVAR_DATA/hg19 -build hg19;
  ndbsnp=`cat ${fam}_trio1_${suffix}.avinput.hg19_avsnp142_dropped | wc -l`;
  annotate_variation.pl -filter -dbtype clinvar_20170130 ${fam}_trio1_${suffix}.avinput $ANNOVAR_DATA/hg19 -build hg19;
  nclinvar=`cat ${fam}_trio1_${suffix}.avinput.hg19_clinvar_20170130_dropped | wc -l`;
  echo $fam,$ndenovo,$npopmax,$nMAFpopmax,$nkaviar,$nMAFkaviar,$ndbsnp,$nclinvar >> rare_variants_${suffix}.csv
done < ../famids.txt

Had to run the code above outside the notebook because I wasn't getting an output to the screen until the whole thing was over, which is hard to monitor.

I'm using Kaviar above because it was the method that recognized the most variants, and then maxpop because it gives the maximum population frequency across many databases, althouh it doesn't include Kaviar. Now, the next step is to add how many of those rare variants also show up in the unaffected twin, and finally do a gene-based annotation in the variant. Let's focus on the popmax rare variants first, but it should be straight forward to apply this to any other sets:

In [29]:
%%bash

cd /data/NCR_SBRB/simplex/gatk_refine
suffix=hiConfDeNovo

echo 'famid,popmax_MAFltp01,affectedOnly' > rare_variants_affectedOnly_${suffix}.csv

while read fam; do
    awk '$2 < .01 {OFS=":"; print $3, $4}' ${fam}_trio1_${suffix}.avinput.hg19_popfreq_max_20150413_dropped > ${fam}_${suffix}_popmax_MAFltp01.txt;
    if [ -e ${fam}_trio2_${suffix}.vcf ]; then
        rm interesting_snvs.txt unaffected_snvs.txt
        cat ${fam}_trio[2..4]_${suffix}.vcf | grep -v '#' - | awk 'BEGIN {FS="\t"; OFS=":"}; {print $1, $2}' - | sort | uniq >> unaffected_snvs.txt;
        # for each possible SNV in affected trio, mark it as interesting if it's not
        # in the unnafected trios
        while read snv; do
            if ! grep -q "$snv" unaffected_snvs.txt; then
                echo $snv >> interesting_snvs.txt;
            fi;
        done < ${fam}_${suffix}_popmax_MAFltp01.txt;
        naffected=`cat interesting_snvs.txt | wc -l`
    else
        naffected='NA'
    fi;
    nMAFpopmax=`awk '$2 < .01' ${fam}_trio1_${suffix}.avinput.hg19_popfreq_max_20150413_dropped | wc -l`
    echo $fam,$nMAFpopmax,$naffected >> rare_variants_affectedOnly_${suffix}.csv;
done < ../famids.txt

In [16]:
%%bash

module load annovar

cd /data/NCR_SBRB/simplex/gatk_refine
while read fam; do
    table_annovar.pl ${fam}_trio1_hiConfDeNovo.vcf $ANNOVAR_DATA/hg19 \
        --tempdir /lscratch/${SLURM_JOB_ID} --thread ${SLURM_CPUS_PER_TASK} \
        --buildver hg19 --out ${fam}_trio1_hiConfDeNovo --remove \
        --protocol refGene,exac03,avsnp147,clinvar_20170130,esp6500siv2_all \
        -operation g,f,f,f,f --nastring . --vcfinput
done < famids.txt

[+] Loading ANNOVAR 2017-07-16 ...
NOTICE: temporary files will be written to /lscratch/53272018/RB3He7X-

NOTICE: Running with system command <convert2annovar.pl -includeinfo -allsample -withfreq -format vcf4 10033_trio1_hiConfDeNovo.vcf > /lscratch/53272018/RB3He7X-/temp.avinput>
NOTICE: Finished reading 155 lines from VCF file
NOTICE: A total of 69 locus in VCF file passed QC threshold, representing 56 SNPs (36 transitions and 20 transversions) and 13 indels/substitutions
NOTICE: Finished writing allele frequencies based on 5544 SNP genotypes (3564 transitions and 1980 transversions) and 1287 indels/substitutions for 99 samples

NOTICE: Running with system command </usr/local/apps/ANNOVAR/2017-07-16/table_annovar.pl /lscratch/53272018/RB3He7X-/temp.avinput /fdb/annovar/2017-07-16/hg19 --tempdir /lscratch/53272018 --thread 32 --buildver hg19 -outfile /lscratch/53272018/RB3He7X-/temp --remove --protocol refGene,exac03,avsnp147,clinvar_20170130,esp6500siv2_all -operation g,f,f,f,f --na

# GATK all

In [None]:
%%bash

module load annovar

suffix=allDeNovo
cd /data/NCR_SBRB/simplex/gatk_refine

# for each family ID
echo "famid,denovo,popmax,popmax_MAFltp01,kaviar,kaviar_MAFltp01,dbSNP,ClinVar" > rare_variants_${suffix}.csv;
while read fam; do
  # get all SNVs in the affected trio in the family
  cut -f 1,2 ${fam}_trio1_${suffix}.vcf | grep -v '#' - | sort | uniq > ${fam}_possible_snvs_${suffix}.txt;
  ndenovo=`cat ${fam}_possible_snvs_${suffix}.txt | wc -l`;
  # convert the file to ANNOVAR format
  convert2annovar.pl -format vcf4old ${fam}_trio1_${suffix}.vcf > ${fam}_trio1_${suffix}.avinput;
  # assign population statistics to the file accoring to different databases
  annotate_variation.pl -filter -dbtype popfreq_max_20150413 ${fam}_trio1_${suffix}.avinput $ANNOVAR_DATA/hg19 -build hg19;
  npopmax=`cat ${fam}_trio1_${suffix}.avinput.hg19_popfreq_max_20150413_dropped | wc -l`;
  # filter based on rare variants
  nMAFpopmax=`awk '$2 < .01' ${fam}_trio1_${suffix}.avinput.hg19_popfreq_max_20150413_dropped | wc -l`
  annotate_variation.pl -filter -dbtype kaviar_20150923 ${fam}_trio1_${suffix}.avinput $ANNOVAR_DATA/hg19 -build hg19;
  nkaviar=`cat ${fam}_trio1_${suffix}.avinput.hg19_kaviar_20150923_dropped | wc -l`;
  nMAFkaviar=`awk '$2 < .01' ${fam}_trio1_${suffix}.avinput.hg19_kaviar_20150923_dropped | wc -l`
  annotate_variation.pl -filter -dbtype avsnp142 ${fam}_trio1_${suffix}.avinput $ANNOVAR_DATA/hg19 -build hg19;
  ndbsnp=`cat ${fam}_trio1_${suffix}.avinput.hg19_avsnp142_dropped | wc -l`;
  annotate_variation.pl -filter -dbtype clinvar_20170130 ${fam}_trio1_${suffix}.avinput $ANNOVAR_DATA/hg19 -build hg19;
  nclinvar=`cat ${fam}_trio1_${suffix}.avinput.hg19_clinvar_20170130_dropped | wc -l`;
  echo $fam,$ndenovo,$npopmax,$nMAFpopmax,$nkaviar,$nMAFkaviar,$ndbsnp,$nclinvar >> rare_variants_${suffix}.csv
done < ../famids.txt

In [30]:
%%bash

cd /data/NCR_SBRB/simplex/gatk_refine
suffix=allDeNovo

echo 'famid,popmax_MAFltp01,affectedOnly' > rare_variants_affectedOnly_${suffix}.csv

while read fam; do
    awk '$2 < .01 {OFS=":"; print $3, $4}' ${fam}_trio1_${suffix}.avinput.hg19_popfreq_max_20150413_dropped > ${fam}_${suffix}_popmax_MAFltp01.txt;
    if [ -e ${fam}_trio2_${suffix}.vcf ]; then
        rm interesting_snvs.txt unaffected_snvs.txt
        cat ${fam}_trio[2..4]_${suffix}.vcf | grep -v '#' - | awk 'BEGIN {FS="\t"; OFS=":"}; {print $1, $2}' - | sort | uniq >> unaffected_snvs.txt;
        # for each possible SNV in affected trio, mark it as interesting if it's not
        # in the unnafected trios
        while read snv; do
            if ! grep -q "$snv" unaffected_snvs.txt; then
                echo $snv >> interesting_snvs.txt;
            fi;
        done < ${fam}_${suffix}_popmax_MAFltp01.txt;
        naffected=`cat interesting_snvs.txt | wc -l`
    else
        naffected='NA'
    fi;
    nMAFpopmax=`awk '$2 < .01' ${fam}_trio1_${suffix}.avinput.hg19_popfreq_max_20150413_dropped | wc -l`
    echo $fam,$nMAFpopmax,$naffected >> rare_variants_affectedOnly_${suffix}.csv;
done < ../famids.txt

rm: cannot remove `interesting_snvs.txt': No such file or directory
rm: cannot remove `unaffected_snvs.txt': No such file or directory


# Triodenovo

In [None]:
%%bash

module load annovar

suffix=denovo_v2
cd /data/NCR_SBRB/simplex/triodenovo

# for each family ID
echo "famid,denovo,popmax,popmax_MAFltp01,kaviar,kaviar_MAFltp01,dbSNP,ClinVar" > rare_variants_${suffix}.csv;
while read fam; do
  # get all SNVs in the affected trio in the family
  cut -f 1,2 ${fam}_trio1_${suffix}.vcf | grep -v '#' - | sort | uniq > ${fam}_possible_snvs_${suffix}.txt;
  ndenovo=`cat ${fam}_possible_snvs_${suffix}.txt | wc -l`;
  # convert the file to ANNOVAR format
  convert2annovar.pl -format vcf4old ${fam}_trio1_${suffix}.vcf > ${fam}_trio1_${suffix}.avinput;
  # assign population statistics to the file accoring to different databases
  annotate_variation.pl -filter -dbtype popfreq_max_20150413 ${fam}_trio1_${suffix}.avinput $ANNOVAR_DATA/hg19 -build hg19;
  npopmax=`cat ${fam}_trio1_${suffix}.avinput.hg19_popfreq_max_20150413_dropped | wc -l`;
  # filter based on rare variants
  nMAFpopmax=`awk '$2 < .01' ${fam}_trio1_${suffix}.avinput.hg19_popfreq_max_20150413_dropped | wc -l`
  annotate_variation.pl -filter -dbtype kaviar_20150923 ${fam}_trio1_${suffix}.avinput $ANNOVAR_DATA/hg19 -build hg19;
  nkaviar=`cat ${fam}_trio1_${suffix}.avinput.hg19_kaviar_20150923_dropped | wc -l`;
  nMAFkaviar=`awk '$2 < .01' ${fam}_trio1_${suffix}.avinput.hg19_kaviar_20150923_dropped | wc -l`
  annotate_variation.pl -filter -dbtype avsnp142 ${fam}_trio1_${suffix}.avinput $ANNOVAR_DATA/hg19 -build hg19;
  ndbsnp=`cat ${fam}_trio1_${suffix}.avinput.hg19_avsnp142_dropped | wc -l`;
  annotate_variation.pl -filter -dbtype clinvar_20170130 ${fam}_trio1_${suffix}.avinput $ANNOVAR_DATA/hg19 -build hg19;
  nclinvar=`cat ${fam}_trio1_${suffix}.avinput.hg19_clinvar_20170130_dropped | wc -l`;
  echo $fam,$ndenovo,$npopmax,$nMAFpopmax,$nkaviar,$nMAFkaviar,$ndbsnp,$nclinvar >> rare_variants_${suffix}.csv
done < ../famids.txt

In [31]:
%%bash

cd /data/NCR_SBRB/simplex/triodenovo
suffix=denovo_v2

echo 'famid,popmax_MAFltp01,affectedOnly' > rare_variants_affectedOnly_${suffix}.csv

while read fam; do
    awk '$2 < .01 {OFS=":"; print $3, $4}' ${fam}_trio1_${suffix}.avinput.hg19_popfreq_max_20150413_dropped > ${fam}_${suffix}_popmax_MAFltp01.txt;
    if [ -e ${fam}_trio2_${suffix}.vcf ]; then
        rm interesting_snvs.txt unaffected_snvs.txt
        cat ${fam}_trio[2..4]_${suffix}.vcf | grep -v '#' - | awk 'BEGIN {FS="\t"; OFS=":"}; {print $1, $2}' - | sort | uniq >> unaffected_snvs.txt;
        # for each possible SNV in affected trio, mark it as interesting if it's not
        # in the unnafected trios
        while read snv; do
            if ! grep -q "$snv" unaffected_snvs.txt; then
                echo $snv >> interesting_snvs.txt;
            fi;
        done < ${fam}_${suffix}_popmax_MAFltp01.txt;
        naffected=`cat interesting_snvs.txt | wc -l`
    else
        naffected='NA'
    fi;
    nMAFpopmax=`awk '$2 < .01' ${fam}_trio1_${suffix}.avinput.hg19_popfreq_max_20150413_dropped | wc -l`
    echo $fam,$nMAFpopmax,$naffected >> rare_variants_affectedOnly_${suffix}.csv;
done < ../famids.txt

rm: cannot remove `interesting_snvs.txt': No such file or directory


# GATK hiConf and triodenovo overlap

I don't think the de novo overlap will be big to begin with, so we can just try calculating that first, before checking how rare they are, etc.

In [33]:
%%bash


cd ~/data/tmp/

# for each family ID
while read fam; do
  # get all GATK SNVs in the affected trio in the family
  suffix=hiConfDeNovo
  cut -f 1,2 --output-delimiter=":" /data/NCR_SBRB/simplex/gatk_refine/${fam}_trio1_${suffix}.vcf | grep -v '#' - | sort | uniq > ${fam}_possible_snvs_${suffix}.txt;
  suffix=denovo_v2
  cut -f 1,2 --output-delimiter=":" /data/NCR_SBRB/simplex/triodenovo/${fam}_trio1_${suffix}.vcf | grep -v '#' - | sort | uniq > ${fam}_possible_snvs_${suffix}.txt;
  grep -Fx -f ${fam}_possible_snvs_hiConfDeNovo.txt ${fam}_possible_snvs_denovo_v2.txt > intersect.txt;
  echo $fam `cat intersect.txt | wc -l`
done < /data/NCR_SBRB/simplex/famids.txt

10033 38
10042 32
10090 33
10094 27
10128 29
10131 26
10153 35
10164 38
10173 29
10178 61
10182 25
10197 31
10215 34
10369 21
10406 32
10448 30
1892 26
1893 42
1895 34
1976 45
855 38


This turned out to be a bit higher than what I expected... let's calculate the full numbers for them:

In [34]:
%%bash


cd ~/data/tmp/
suffix=hiConfANDtriodenovo

echo "famid,denovo,popmax,popmax_MAFltp01,kaviar,kaviar_MAFltp01,dbSNP,ClinVar" > rare_variants_${suffix}.csv;
# for each family ID
while read fam; do
    # get all GATK SNVs in the affected trio in the family
    cut -f 1,2 --output-delimiter=":" /data/NCR_SBRB/simplex/gatk_refine/${fam}_trio1_hiConfDeNovo.vcf | grep -v '#' - | sort | uniq > ${fam}_possible_snvs_hiConfDeNovo.txt;
    cut -f 1,2 --output-delimiter=":" /data/NCR_SBRB/simplex/triodenovo/${fam}_trio1_denovo_v2.vcf | grep -v '#' - | sort | uniq > ${fam}_possible_snvs_denovo_v2.txt;
    grep -Fx -f ${fam}_possible_snvs_hiConfDeNovo.txt ${fam}_possible_snvs_denovo_v2.txt > intersect.txt;

    # the VCF from triodenovo is a bit cleaner, so let's grep from there
    grep "#" /data/NCR_SBRB/simplex/triodenovo/${fam}_trio1_denovo_v2.vcf > interesting_snvs.vcf;
    cut -f 2 -d ":" intersect.txt > snp_pos.txt;
    while read snv; do
        grep ${snv} /data/NCR_SBRB/simplex/triodenovo/${fam}_trio1_denovo_v2.vcf >> interesting_snvs.vcf;
    done < snp_pos.txt;
    
    # get all SNVs in the affected trio in the family
    cut -f 1,2 interesting_snvs.vcf | grep -v '#' - | sort | uniq > ${fam}_possible_snvs_${suffix}.txt;
    ndenovo=`cat ${fam}_possible_snvs_${suffix}.txt | wc -l`;
    
    # convert the file to ANNOVAR format
    convert2annovar.pl -format vcf4old interesting_snvs.vcf > ${fam}_trio1_${suffix}.avinput;
    # assign population statistics to the file accoring to different databases
    annotate_variation.pl -filter -dbtype popfreq_max_20150413 ${fam}_trio1_${suffix}.avinput $ANNOVAR_DATA/hg19 -build hg19;
    npopmax=`cat ${fam}_trio1_${suffix}.avinput.hg19_popfreq_max_20150413_dropped | wc -l`;
    # filter based on rare variants
    nMAFpopmax=`awk '$2 < .01' ${fam}_trio1_${suffix}.avinput.hg19_popfreq_max_20150413_dropped | wc -l`
    annotate_variation.pl -filter -dbtype kaviar_20150923 ${fam}_trio1_${suffix}.avinput $ANNOVAR_DATA/hg19 -build hg19;
    nkaviar=`cat ${fam}_trio1_${suffix}.avinput.hg19_kaviar_20150923_dropped | wc -l`;
    nMAFkaviar=`awk '$2 < .01' ${fam}_trio1_${suffix}.avinput.hg19_kaviar_20150923_dropped | wc -l`
    annotate_variation.pl -filter -dbtype avsnp142 ${fam}_trio1_${suffix}.avinput $ANNOVAR_DATA/hg19 -build hg19;
    ndbsnp=`cat ${fam}_trio1_${suffix}.avinput.hg19_avsnp142_dropped | wc -l`;
    annotate_variation.pl -filter -dbtype clinvar_20170130 ${fam}_trio1_${suffix}.avinput $ANNOVAR_DATA/hg19 -build hg19;
    nclinvar=`cat ${fam}_trio1_${suffix}.avinput.hg19_clinvar_20170130_dropped | wc -l`;
    echo $fam,$ndenovo,$npopmax,$nMAFpopmax,$nkaviar,$nMAFkaviar,$ndbsnp,$nclinvar >> rare_variants_${suffix}.csv
done < /data/NCR_SBRB/simplex/famids.txt

bash: line 40: snp_pos: No such file or directory
bash: line 26: convert2annovar.pl: command not found
bash: line 28: annotate_variation.pl: command not found
cat: 10033_trio1_hiConfANDtriodenovo.avinput.hg19_popfreq_max_20150413_dropped: No such file or directory
awk: cmd. line:1: fatal: cannot open file `10033_trio1_hiConfANDtriodenovo.avinput.hg19_popfreq_max_20150413_dropped' for reading (No such file or directory)
bash: line 32: annotate_variation.pl: command not found
cat: 10033_trio1_hiConfANDtriodenovo.avinput.hg19_kaviar_20150923_dropped: No such file or directory
awk: cmd. line:1: fatal: cannot open file `10033_trio1_hiConfANDtriodenovo.avinput.hg19_kaviar_20150923_dropped' for reading (No such file or directory)
bash: line 35: annotate_variation.pl: command not found
cat: 10033_trio1_hiConfANDtriodenovo.avinput.hg19_avsnp142_dropped: No such file or directory
bash: line 37: annotate_variation.pl: command not found
cat: 10033_trio1_hiConfANDtriodenovo.avinput.hg19_clinvar_201

# TODO
* calculate overlaps between GATK and triodenovo
* annotate based on where it is in genome
* why are some variants not annotated by frequency?
* would it go faster to use vcf format and then table_annovar?
* try snpEFF