Following the results from 11/06, let's look a bit more into the calls common to GATK and triodenovo:

In [1]:
import pandas as pd
from scipy import stats
import numpy as np

In [2]:
# need to make this less clunky in the future!
%env SLURM_JOB_ID=53272018
%env SLURM_CPUS_PER_TASK=32

env: SLURM_JOB_ID=53272018
env: SLURM_CPUS_PER_TASK=32


# GATK high conf and triodenovo

In [29]:
%%bash

cd ~/data/tmp/
suffix=hiConfANDtriodenovo

echo 'famid,popmax_MAFltp01,affectedOnly' > rare_variants_affectedOnly_${suffix}.csv

while read fam; do
    awk '$2 < .01 {OFS=":"; print $3, $4}' ${fam}_trio1_${suffix}.avinput.hg19_popfreq_max_20150413_dropped > ${fam}_${suffix}_popmax_MAFltp01.txt;
    if [ -e /data/NCR_SBRB/simplex/triodenovo/${fam}_trio2_denovo_v2.vcf ]; then
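        # -f avoids errors on the first pass; touch keeps wc happy if nothing survives
        rm -f possible_snvs.txt unaffected_snvs.txt; touch possible_snvs.txt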
        # let's be conservative here and disregard the SNV if it was picked up
        # by hiConf OR triodenovo in any unaffected trio
        # (the glob [2-4] covers trios 2 through 4; [2..4] would skip trio 3)
        cat /data/NCR_SBRB/simplex/gatk_refine/${fam}_trio[2-4]_hiConfDeNovo.vcf | grep -v '#' - | awk 'BEGIN {FS="\t"; OFS=":"}; {print $1, $2}' - | sort | uniq >> unaffected_snvs.txt;
        cat /data/NCR_SBRB/simplex/triodenovo/${fam}_trio[2-4]_denovo_v2.vcf | grep -v '#' - | awk 'BEGIN {FS="\t"; OFS=":"}; {print $1, $2}' - | sort | uniq >> unaffected_snvs.txt;
        # for each possible SNV in the affected trio, mark it as interesting if it's not
        # in the unaffected trios
        while read snv; do
            if ! grep -qFx "$snv" unaffected_snvs.txt; then
                echo $snv >> possible_snvs.txt;
            fi;
        done < ${fam}_${suffix}_popmax_MAFltp01.txt;
        naffected=`cat possible_snvs.txt | wc -l`
    else
        naffected='NA'
    fi;
    nMAFpopmax=`awk '$2 < .01' ${fam}_trio1_${suffix}.avinput.hg19_popfreq_max_20150413_dropped | wc -l`
    echo $fam,$nMAFpopmax,$naffected >> rare_variants_affectedOnly_${suffix}.csv;
done < /data/NCR_SBRB/simplex/famids.txt
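
A note on the glob used for the unaffected trios: in shell globbing a bracket expression lists single characters, so `[2-4]` is the range 2 through 4, while `[2..4]` matches only `2`, `.` or `4`. The sketch below checks this with Python's `fnmatch`, which follows the same bracket rules for this case. The file names are hypothetical and only mimic the naming scheme used here.

```python
from fnmatch import fnmatchcase

# Hypothetical file names following the naming scheme used in this notebook
files = [f"10033_trio{i}_hiConfDeNovo.vcf" for i in range(1, 5)]


def matching(pattern, names):
    """Return the names that match a shell-style glob pattern."""
    return [n for n in names if fnmatchcase(n, pattern)]


# [2-4] is a character range, [2..4] is just the set {'2', '.', '4'}
range_hits = matching("10033_trio[2-4]_hiConfDeNovo.vcf", files)
dots_hits = matching("10033_trio[2..4]_hiConfDeNovo.vcf", files)

print(range_hits)
print(dots_hits)
```

With the `..` form, trio 3 is silently left out of the unaffected set, which would make the affected-only counts look larger than they should be.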

In [4]:
res = pd.read_csv('/home/sudregp/data/tmp/rare_variants_affectedOnly_hiConfANDtriodenovo.csv')
res

Unnamed: 0,famid,popmax_MAFltp01,affectedOnly
0,10033,3,3.0
1,10042,6,
2,10090,3,3.0
3,10094,3,3.0
4,10128,7,7.0
5,10131,4,1.0
6,10153,4,3.0
7,10164,2,2.0
8,10173,5,5.0
9,10178,11,10.0


So, we have a handful of candidates here... where are they in the genome?

In [5]:
%%bash

cd ~/data/tmp/
suffix=hiConfANDtriodenovo

while read fam; do
    awk '$2 < .01 {OFS=":"; print $3, $4}' ${fam}_trio1_${suffix}.avinput.hg19_popfreq_max_20150413_dropped > ${fam}_${suffix}_popmax_MAFltp01.txt;
    if [ -e /data/NCR_SBRB/simplex/triodenovo/${fam}_trio2_denovo_v2.vcf ]; then
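        # also clear the per-family output, since it is appended to below
        rm -f unaffected_snvs.txt possible_snvs_${fam}_${suffix}.txt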
        # let's be conservative here and disregard the SNV if it was picked up
        # by hiConf OR triodenovo in any unaffected trio
        # (the glob [2-4] covers trios 2 through 4; [2..4] would skip trio 3)
        cat /data/NCR_SBRB/simplex/gatk_refine/${fam}_trio[2-4]_hiConfDeNovo.vcf | grep -v '#' - | awk 'BEGIN {FS="\t"; OFS=":"}; {print $1, $2}' - | sort | uniq >> unaffected_snvs.txt;
        cat /data/NCR_SBRB/simplex/triodenovo/${fam}_trio[2-4]_denovo_v2.vcf | grep -v '#' - | awk 'BEGIN {FS="\t"; OFS=":"}; {print $1, $2}' - | sort | uniq >> unaffected_snvs.txt;
        # for each possible SNV in the affected trio, mark it as interesting if it's not
        # in the unaffected trios
        while read snv; do
            if ! grep -qFx "$snv" unaffected_snvs.txt; then
                echo $snv >> possible_snvs_${fam}_${suffix}.txt;
            fi;
        done < ${fam}_${suffix}_popmax_MAFltp01.txt;
    else
        cp ${fam}_${suffix}_popmax_MAFltp01.txt possible_snvs_${fam}_${suffix}.txt
    fi;
done < /data/NCR_SBRB/simplex/famids.txt
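
When filtering the affected-trio SNVs against the unaffected ones, the comparison should be on the whole `chrom:pos` key. A plain substring match (what `grep -q` does without `-F -x`) treats `1:1234` as present if the file contains `1:12345`. A minimal sketch of the difference, using made-up keys rather than real calls:

```python
# Hypothetical chrom:pos keys, made up to illustrate the matching problem
affected = ["1:1234", "2:500", "11:1234"]
unaffected = ["1:12345", "2:500"]

# Substring test: roughly what `grep -q "$snv" file` does
substring_kept = [
    snv for snv in affected
    if not any(snv in line for line in unaffected)
]

# Whole-line test: equivalent to `grep -qFx "$snv" file`
unaffected_set = set(unaffected)
exact_kept = [snv for snv in affected if snv not in unaffected_set]

print(substring_kept)  # loses 1:1234 because it is a prefix of 1:12345
print(exact_kept)      # keeps every key that is not an exact match
```

The set lookup also avoids rescanning the unaffected file once per SNV, though at these sizes that hardly matters.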

In [None]:
%%bash


module load annovar
cd ~/data/tmp/
suffix=hiConfANDtriodenovo

echo 'famid,affectedOnly,exonic,intronic,splicing,intergenic,utr3,downstream' > rare_variants_affectedOnly_refGene_${suffix}.csv

while read fam; do
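    rm -f ${fam}_trio1_possibleOnly_${suffix}.avinput;
    # match on both chromosome and start position, not on any occurrence of the position
    while IFS=":" read chr pos; do
        awk -v c="$chr" -v p="$pos" '$1 == c && $2 == p' ${fam}_trio1_${suffix}.avinput >> ${fam}_trio1_possibleOnly_${suffix}.avinput;
    done < possible_snvs_${fam}_${suffix}.txt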
    annotate_variation.pl -buildver hg19 ${fam}_trio1_possibleOnly_${suffix}.avinput $ANNOVAR_DATA/hg19;
    naffected=`cat possible_snvs_${fam}_${suffix}.txt | wc -l`;
    nexon=`grep exonic ${fam}_trio1_possibleOnly_${suffix}.avinput.variant_function | wc -l`;
    nintron=`grep intronic ${fam}_trio1_possibleOnly_${suffix}.avinput.variant_function | wc -l`;
    nsplice=`grep splicing ${fam}_trio1_possibleOnly_${suffix}.avinput.variant_function | wc -l`;
    ninter=`grep intergenic ${fam}_trio1_possibleOnly_${suffix}.avinput.variant_function | wc -l`;
    nutr=`grep UTR3 ${fam}_trio1_possibleOnly_${suffix}.avinput.variant_function | wc -l`;
    ndown=`grep downstream ${fam}_trio1_possibleOnly_${suffix}.avinput.variant_function | wc -l`;
    echo $fam,$naffected,$nexon,$nintron,$nsplice,$ninter,$nutr,$ndown >> rare_variants_affectedOnly_refGene_${suffix}.csv;
done < /data/NCR_SBRB/simplex/famids.txt
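
The per-category counts can also be tallied from the first column of the `.variant_function` file, which holds the region annotation (the gene name is in the second column, followed by the original avinput fields). This is only a sketch on made-up lines, with hypothetical gene names, but it shows one thing to watch for: a whole-line `grep exonic` also counts `ncRNA_exonic` records.

```python
import io
from collections import Counter

# Made-up lines in the variant_function layout: function, gene, then avinput fields
sample = (
    "exonic\tGENE1\t1\t1000\t1000\tA\tG\n"
    "ncRNA_exonic\tGENE2\t2\t2000\t2000\tC\tT\n"
    "intronic\tGENE3\t3\t3000\t3000\tG\tA\n"
    "UTR3\tGENE4\t4\t4000\t4000\tT\tC\n"
    "intergenic\tGENE5(dist=100),GENE6(dist=200)\t5\t5000\t5000\tA\tC\n"
)


def count_functions(handle):
    """Count records per annotation, using only the first tab-separated column."""
    return Counter(line.split("\t", 1)[0] for line in handle if line.strip())


counts = count_functions(io.StringIO(sample))

# What a whole-line `grep exonic | wc -l` would report on the same lines
grep_like_exonic = sum("exonic" in line for line in sample.splitlines())

print(counts)
print(grep_like_exonic)
```

In real use, `io.StringIO(sample)` would be replaced by an open handle on the family's `.variant_function` file.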

# GATK all and triodenovo overlap

While we run the intersection between hiConf GATK and triodenovo, let's also check all denovo calls from GATK:

In [3]:
%%bash


cd ~/data/tmp/

# for each family ID
while read fam; do
  # get all GATK SNVs in the affected trio in the family
  suffix=allDeNovo
  cut -f 1,2 --output-delimiter=":" /data/NCR_SBRB/simplex/gatk_refine/${fam}_trio1_${suffix}.vcf | grep -v '#' - | sort | uniq > ${fam}_possible_snvs_${suffix}.txt;
  suffix=denovo_v2
  cut -f 1,2 --output-delimiter=":" /data/NCR_SBRB/simplex/triodenovo/${fam}_trio1_${suffix}.vcf | grep -v '#' - | sort | uniq > ${fam}_possible_snvs_${suffix}.txt;
  grep -Fx -f ${fam}_possible_snvs_allDeNovo.txt ${fam}_possible_snvs_denovo_v2.txt > intersect.txt;
  echo $fam `cat intersect.txt | wc -l`
done < /data/NCR_SBRB/simplex/famids.txt

10033 96
10042 76
10090 80
10094 69
10128 71
10131 57
10153 73
10164 92
10173 68
10178 142
10182 67
10197 103
10215 88
10369 49
10406 68
10448 75
1892 51
1893 89
1895 77
1976 128
855 80
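
For reference, the same overlap can be sketched in Python as a set intersection on `chrom:pos` keys. The two VCF snippets below are made up for illustration; real use would read the `allDeNovo` and `denovo_v2` files for each family.

```python
import io


def snv_keys(handle):
    """Collect chrom:pos keys from VCF lines, skipping header and blank lines."""
    keys = set()
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos = line.rstrip("\n").split("\t")[:2]
        keys.add(f"{chrom}:{pos}")
    return keys


# Made-up minimal VCF content, not real calls
gatk_vcf = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\n"
    "1\t100\t.\tA\tG\n"
    "1\t200\t.\tC\tT\n"
    "2\t300\t.\tG\tA\n"
)
triodenovo_vcf = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\n"
    "1\t200\t.\tC\tT\n"
    "2\t300\t.\tG\tA\n"
    "3\t400\t.\tT\tC\n"
)

shared = snv_keys(io.StringIO(gatk_vcf)) & snv_keys(io.StringIO(triodenovo_vcf))
print(sorted(shared), len(shared))
```

Like the `grep -Fx -f` call, this compares position only, so two callers reporting different alleles at the same site would still count as overlapping.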


As expected, this is much higher than the numbers using hiConf only. Let's calculate the full numbers for them:

In [34]:
%%bash

module load annovar
cd ~/data/tmp/
suffix=GATKallANDtriodenovo

echo "famid,denovo,popmax,popmax_MAFltp01,kaviar,kaviar_MAFltp01,dbSNP,ClinVar" > rare_variants_${suffix}.csv;
# for each family ID
while read fam; do
    # get all GATK SNVs in the affected trio in the family
    cut -f 1,2 --output-delimiter=":" /data/NCR_SBRB/simplex/gatk_refine/${fam}_trio1_allDeNovo.vcf | grep -v '#' - | sort | uniq > ${fam}_possible_snvs_allDeNovo.txt;
    cut -f 1,2 --output-delimiter=":" /data/NCR_SBRB/simplex/triodenovo/${fam}_trio1_denovo_v2.vcf | grep -v '#' - | sort | uniq > ${fam}_possible_snvs_denovo_v2.txt;
    grep -Fx -f ${fam}_possible_snvs_allDeNovo.txt ${fam}_possible_snvs_denovo_v2.txt > intersect.txt;

    # the VCF from triodenovo is a bit cleaner, so let's grep from there
    grep "#" /data/NCR_SBRB/simplex/triodenovo/${fam}_trio1_denovo_v2.vcf > interesting_snvs.vcf;
    # keep only records whose CHROM and POS both match an intersecting SNV
    while IFS=":" read chr pos; do
        awk -v c="$chr" -v p="$pos" 'BEGIN {FS="\t"} $1 == c && $2 == p' /data/NCR_SBRB/simplex/triodenovo/${fam}_trio1_denovo_v2.vcf >> interesting_snvs.vcf;
    done < intersect.txt;
    
    # get all SNVs in the affected trio in the family
    cut -f 1,2 interesting_snvs.vcf | grep -v '#' - | sort | uniq > ${fam}_possible_snvs_${suffix}.txt;
    ndenovo=`cat ${fam}_possible_snvs_${suffix}.txt | wc -l`;
    
    # convert the file to ANNOVAR format
    convert2annovar.pl -format vcf4old interesting_snvs.vcf > ${fam}_trio1_${suffix}.avinput;
    # assign population statistics to the file according to different databases
    annotate_variation.pl -filter -dbtype popfreq_max_20150413 ${fam}_trio1_${suffix}.avinput $ANNOVAR_DATA/hg19 -build hg19;
    npopmax=`cat ${fam}_trio1_${suffix}.avinput.hg19_popfreq_max_20150413_dropped | wc -l`;
    # filter based on rare variants
    nMAFpopmax=`awk '$2 < .01' ${fam}_trio1_${suffix}.avinput.hg19_popfreq_max_20150413_dropped | wc -l`
    annotate_variation.pl -filter -dbtype kaviar_20150923 ${fam}_trio1_${suffix}.avinput $ANNOVAR_DATA/hg19 -build hg19;
    nkaviar=`cat ${fam}_trio1_${suffix}.avinput.hg19_kaviar_20150923_dropped | wc -l`;
    nMAFkaviar=`awk '$2 < .01' ${fam}_trio1_${suffix}.avinput.hg19_kaviar_20150923_dropped | wc -l`
    annotate_variation.pl -filter -dbtype avsnp142 ${fam}_trio1_${suffix}.avinput $ANNOVAR_DATA/hg19 -build hg19;
    ndbsnp=`cat ${fam}_trio1_${suffix}.avinput.hg19_avsnp142_dropped | wc -l`;
    annotate_variation.pl -filter -dbtype clinvar_20170130 ${fam}_trio1_${suffix}.avinput $ANNOVAR_DATA/hg19 -build hg19;
    nclinvar=`cat ${fam}_trio1_${suffix}.avinput.hg19_clinvar_20170130_dropped | wc -l`;
    echo $fam,$ndenovo,$npopmax,$nMAFpopmax,$nkaviar,$nMAFkaviar,$ndbsnp,$nclinvar >> rare_variants_${suffix}.csv
done < /data/NCR_SBRB/simplex/famids.txt

bash: line 40: snp_pos: No such file or directory
bash: line 26: convert2annovar.pl: command not found
bash: line 28: annotate_variation.pl: command not found
cat: 10033_trio1_hiConfANDtriodenovo.avinput.hg19_popfreq_max_20150413_dropped: No such file or directory
awk: cmd. line:1: fatal: cannot open file `10033_trio1_hiConfANDtriodenovo.avinput.hg19_popfreq_max_20150413_dropped' for reading (No such file or directory)
bash: line 32: annotate_variation.pl: command not found
cat: 10033_trio1_hiConfANDtriodenovo.avinput.hg19_kaviar_20150923_dropped: No such file or directory
awk: cmd. line:1: fatal: cannot open file `10033_trio1_hiConfANDtriodenovo.avinput.hg19_kaviar_20150923_dropped' for reading (No such file or directory)
bash: line 35: annotate_variation.pl: command not found
cat: 10033_trio1_hiConfANDtriodenovo.avinput.hg19_avsnp142_dropped: No such file or directory
bash: line 37: annotate_variation.pl: command not found
cat: 10033_trio1_hiConfANDtriodenovo.avinput.hg19_clinvar_201

# Why aren't some variants being picked up by ANNOVAR?

For example, look at this:

In [11]:
res = pd.read_csv('/data/NCR_SBRB/simplex/gatk_refine/rare_variants_hiConfDeNovo.csv')
res

Unnamed: 0,famid,denovo,popmax,popmax_MAFltp01,kaviar,kaviar_MAFltp01,dbSNP,ClinVar
0,10033,69,19,9,38,33,21,0
1,10042,81,34,17,49,43,37,0
2,10090,89,46,16,64,56,52,0
3,10094,80,39,18,53,45,45,1
4,10128,54,21,11,34,29,22,0
5,10131,48,25,9,30,26,25,0
6,10153,70,23,10,36,27,25,0
7,10164,122,61,18,80,70,68,0
8,10173,101,57,22,73,59,61,0
9,10178,519,413,59,445,361,414,0


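So, in several of these families over half of the de novo variants are not being picked up by popmax (only 19 of 69 in family 10033, for example). Kaviar misses fewer (38 of 69 annotated in the same family), but still a sizable share. Is there anything in common between those variants? I'll also try running snpEff in a separate notebook to see if it gives me more info.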
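
To quantify how many variants go unannotated per family, the fraction of de novo calls without a database hit can be computed from the table above. This is a minimal sketch with the relevant columns copied inline instead of read from the CSV. It assumes the `popmax` and `kaviar` columns count lines in ANNOVAR's `_dropped` files, which in filter mode hold the variants that were found in the database.

```python
import csv
import io

# denovo, popmax and kaviar columns copied from the table above
table = """famid,denovo,popmax,kaviar
10033,69,19,38
10042,81,34,49
10090,89,46,64
10094,80,39,53
10128,54,21,34
10131,48,25,30
10153,70,23,36
10164,122,61,80
10173,101,57,73
10178,519,413,445
"""


def missing_fractions(text):
    """Per family, the fraction of de novo calls with no hit in each database."""
    out = {}
    for row in csv.DictReader(io.StringIO(text)):
        total = int(row["denovo"])
        out[row["famid"]] = {
            db: 1 - int(row[db]) / total for db in ("popmax", "kaviar")
        }
    return out


missing = missing_fractions(table)
for fam, frac in missing.items():
    print(fam, f"popmax {frac['popmax']:.2f}", f"kaviar {frac['kaviar']:.2f}")

# Families where more than half of the calls have no popmax entry
over_half = [fam for fam, frac in missing.items() if frac["popmax"] > 0.5]
print(over_half)
```

Family 10178 stands apart here: it has far more calls than the others, and most of them do have a popmax entry.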

snpEff had the same issue: some variants were not annotated. I asked a question on Biostars about it.