We continue with the plan laid out in the 12/07 note on going deeper into the simplex CNV analysis. In this notebook, let's focus on the XHMM analysis while the joint calling analysis runs in parallel.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

%matplotlib inline

In [3]:
# figure out who is who in each trio
import glob
data_dir = '/data/sudregp/cnv/penncnv/'
ped_file = '/data/sudregp/cnv/simplex.ped'
# str.startswith() takes a tuple and handles prefixes of any length ('WPS' has only 3 letters)
wes_prefix = ('CLIA', 'CCGO', 'WPS')
trios = []
affected = []
controls = []
with open(ped_file, 'r') as fid:
    for line in fid:
        famid, sid, fa, mo, sex, aff = line.rstrip().split('\t')
        # keep only trios in which child, father and mother all have IDs starting with a WES prefix
        if all(s.startswith(wes_prefix) for s in (sid, fa, mo)):
            fam = {}
            fam['child'] = sid
            if aff == '1':
                affected.append(sid)
            else:
                controls.append(sid)
            fam['father'] = fa
            fam['mother'] = mo
            fam['famid'] = famid
            trios.append(fam)

# CCGO_800976 (a father) was removed during the XHMM sample filtering step,
# so drop his family (10369); also drop 10042, as it only has an affected kid
trios = [t for t in trios if t['famid'] not in ['10042', '10369']]

For each family, let's take the affected kid and compute the difference in CNV counts between them and their unaffected siblings. We can do that for de novo CNVs only, or for all CNVs.
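A minimal sketch of that within-family comparison, assuming the per-child CNV counts have already been collected into a table. The `counts` table, its column names, and the IDs and numbers in it are made up for illustration; they are not output from this analysis.

```python
import pandas as pd

# Hypothetical per-child CNV counts (IDs and numbers are invented).
counts = pd.DataFrame({
    'famid': ['F1', 'F1', 'F1', 'F2', 'F2'],
    'child': ['kid_a', 'kid_b', 'kid_c', 'kid_d', 'kid_e'],
    'status': ['affected', 'unaffected', 'unaffected', 'affected', 'unaffected'],
    'denovo': [5, 1, 3, 0, 2],
    'total': [40, 30, 34, 20, 25],
})


def within_family_diff(df, column):
    """Affected child's count minus the mean count of the unaffected siblings."""
    means = df.groupby(['famid', 'status'])[column].mean().unstack('status')
    return means['affected'] - means['unaffected']


# One row per family, one column per CNV type.
diff = pd.DataFrame({col: within_family_diff(counts, col)
                     for col in ['denovo', 'total']})
print(diff)
```

Families with only an affected kid (or only unaffected kids) would come out as NaN here, which is one way to spot families like 10042 that cannot contribute to the comparison.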

In [4]:
%%bash
module load penncnv

cd /data/sudregp/cnv/penncnv/results
cat CCGO*.log CLIA*.log > all_simplex.log
cat CCGO*.jointcnv CLIA*.jointcnv > all_simplex.jointcnv
filter_cnv.pl all_simplex.jointcnv -qclogfile all_simplex.log \
    -qcpassout all_simplex.qcpass \
    -qcsumout all_simplex.qcsum -out all_simplex

NOTICE: the --qclrrsd argument is set as 0.3 by default
NOTICE: the --qcbafdrift argument is set as 0.01 by default
NOTICE: the --qcwf argument is set as 0.05 by default
NOTICE: Writting 0 file names that pass QC to qcpass file all_simplex.qcpass
NOTICE: Writting 0 records of QC summary to qcsum file all_simplex.qcsum


So, it turns out that the metrics used in QC are not written to the log file when using the joint call, which is why zero samples and zero QC records were written above. Does that mean that joint calls are not as affected by those metrics? Let's go on with the analysis without doing QC, and then check whether there are any outliers. If there are, we can run the usual rawcnv call just to get those QC metrics, and apply them to the joint calls.
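Here is a rough sketch of how QC metrics from a rawcnv run could be applied to a list of samples. The thresholds are the defaults reported in the NOTICE lines above; the column names (`LRR_SD`, `BAF_drift`, `WF`), the use of the absolute value of WF, and the example rows are assumptions for illustration, so check them against the actual qcsum file before relying on this.

```python
import pandas as pd

# Defaults reported by filter_cnv.pl in the NOTICE lines above.
QC_LRR_SD = 0.3
QC_BAF_DRIFT = 0.01
QC_WF = 0.05


def flag_qc_failures(qcsum):
    """Return the rows of a QC summary table that exceed any threshold.

    Column names are assumed, not taken from a real qcsum file.
    """
    fails = (
        (qcsum['LRR_SD'] > QC_LRR_SD)
        | (qcsum['BAF_drift'] > QC_BAF_DRIFT)
        | (qcsum['WF'].abs() > QC_WF)  # assumption: WF is judged by magnitude
    )
    return qcsum[fails]


# Invented example rows, not real samples.
qcsum = pd.DataFrame({
    'File': ['sample_a', 'sample_b', 'sample_c'],
    'LRR_SD': [0.15, 0.42, 0.20],
    'BAF_drift': [0.001, 0.002, 0.03],
    'WF': [0.01, -0.02, 0.00],
})
failed = flag_qc_failures(qcsum)
print(failed['File'].tolist())
```

Any trio containing a flagged sample could then be dropped from the joint calls.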

So, before we try to import these results into pseq, let's do a quick check of the numbers:

In [3]:
%%bash
cd /data/sudregp/cnv/penncnv/results

for triocnv in C*.jointcnv; do
    rm -f denovo.txt inherited.txt
    # collect the CNV calls made in either parent
    grep mother "${triocnv}" > mom_snps
    grep father "${triocnv}" > dad_snps
    cat mom_snps dad_snps > parent_snps
    # an offspring CNV counts as inherited if the exact same region is called in a parent
    for snp in $(grep offspring "${triocnv}" | cut -d' ' -f 1); do
        # -F: fixed string, -w: whole word, so a region cannot match as part of a longer one
        if ! grep -qwF -- "$snp" parent_snps; then
            echo "$snp" >> denovo.txt
        else
            echo "$snp" >> inherited.txt
        fi
    done
    echo ${triocnv}: $(cat denovo.txt 2>/dev/null | wc -l) denovo, $(cat inherited.txt 2>/dev/null | wc -l) inherited
    rm -f ./*_snps
done

CCGO_800979.jointcnv: 1 denovo, 28 inherited
CCGO_800980.jointcnv: 3 denovo, 43 inherited
CLIA_400121.jointcnv: 0 denovo, 110 inherited
CLIA_400122.jointcnv: 3 denovo, 110 inherited
CLIA_400123.jointcnv: 3 denovo, 53 inherited
CLIA_400125.jointcnv: 283 denovo, 71 inherited
CLIA_400126.jointcnv: 382 denovo, 457 inherited
CLIA_400127.jointcnv: 10 denovo, 71 inherited
CLIA_400128.jointcnv: 1 denovo, 84 inherited
CLIA_400129.jointcnv: 1 denovo, 94 inherited
CLIA_400130.jointcnv: 0 denovo, 215 inherited
CLIA_400131.jointcnv: 4 denovo, 161 inherited
CLIA_400132.jointcnv: 77 denovo, 263 inherited
CLIA_400133.jointcnv: 1 denovo, 188 inherited
CLIA_400134.jointcnv: 3 denovo, 148 inherited
CLIA_400135.jointcnv: 8 denovo, 325 inherited
CLIA_400138.jointcnv: 16 denovo, 120 inherited
CLIA_400139.jointcnv: 49 denovo, 269 inherited
CLIA_400140.jointcnv: 2 denovo, 196 inherited
CLIA_400142.jointcnv: 29 denovo, 284 inherited
CLIA_400144.jointcnv: 0 denovo, 143 inherited
CLIA_400148.jointcnv: 5 denovo, 

Well, a few trios (e.g. CLIA_400125 and CLIA_400126, with 283 and 382 de novo calls) make you wonder about their data quality. Let me create a QC file and see whether these are samples that would be excluded (run in the terminal).
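To make "a few trios" concrete, the summary lines can be parsed and the trios with many de novo calls listed. This is only a sketch: the five lines are copied from the output above, but the cutoff of 50 is an arbitrary value chosen for illustration, not a validated QC threshold.

```python
import re

# A few of the summary lines printed above.
summary = """\
CCGO_800979.jointcnv: 1 denovo, 28 inherited
CLIA_400125.jointcnv: 283 denovo, 71 inherited
CLIA_400126.jointcnv: 382 denovo, 457 inherited
CLIA_400127.jointcnv: 10 denovo, 71 inherited
CLIA_400132.jointcnv: 77 denovo, 263 inherited
"""

LINE_RE = re.compile(r'^(\S+)\.jointcnv: (\d+) denovo, (\d+) inherited$')


def parse_summary(text):
    """Map each child ID to its (denovo, inherited) counts."""
    counts = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:  # truncated or malformed lines are skipped
            counts[match.group(1)] = (int(match.group(2)), int(match.group(3)))
    return counts


def flag_high_denovo(counts, cutoff):
    """Return the child IDs with more than `cutoff` de novo calls, sorted."""
    return sorted(sid for sid, (denovo, _) in counts.items() if denovo > cutoff)


counts = parse_summary(summary)
print(flag_high_denovo(counts, cutoff=50))  # cutoff is a hypothetical value
```

The flagged IDs can then be compared against the samples that fail the rawcnv QC.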

In [None]:
%%bash
module load penncnv

cd /data/sudregp/cnv/penncnv

pfb_file=InfiniumExome.pfb
gc_file=InfiniumExome.hg19.gcmodel
detect_cnv.pl -test -hmm ~/autodenovo/penncnv_example.hmm -pfb $pfb_file -log results/all_simplex_but1.log InfiniumExome/* -out results/all_simplex_but1.rawcnv;

pfb_file=HumanExome.pfb
gc_file=HumanExome.hg19.gcmodel
detect_cnv.pl -test -hmm ~/autodenovo/penncnv_example.hmm -pfb $pfb_file -log results/fam_10369.log HumanExome/* -out results/fam_10369.rawcnv;

In [None]:
%%bash

module load penncnv

cd /data/sudregp/cnv/penncnv/results
cat all_simplex_but1.log fam_10369.log > all_simplex_rawcnv.log
cat all_simplex_but1.rawcnv fam_10369.rawcnv > all_simplex.rawcnv
filter_cnv.pl all_simplex.rawcnv -qclogfile all_simplex_rawcnv.log \
    -qcpassout all_simplex.qcpass \
    -qcsumout all_simplex.qcsum -out all_simplex

That takes a while, so let's just swarm it. To do that, we first need to figure out who is who in our pedigree. Note that I'm dropping my earlier convention that the first trio in a family is the affected one: we'll eventually work with extended families, where that convention breaks down.
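One way to avoid depending on trio order is to label each child explicitly and group by family ID. The sketch below assumes trios are dicts with `famid`, `child`, `father` and `mother` keys plus a separate list of affected IDs, as used in this notebook; the function name and the example IDs are made up.

```python
from collections import defaultdict


def group_by_family(trios, affected):
    """Map each family ID to its affected and unaffected children."""
    families = defaultdict(lambda: {'affected': [], 'unaffected': []})
    for trio in trios:
        status = 'affected' if trio['child'] in affected else 'unaffected'
        families[trio['famid']][status].append(trio['child'])
    return dict(families)


# Invented IDs; the unaffected sibling is listed first on purpose.
example_trios = [
    {'famid': 'F1', 'child': 'kid_b', 'father': 'dad_1', 'mother': 'mom_1'},
    {'famid': 'F1', 'child': 'kid_a', 'father': 'dad_1', 'mother': 'mom_1'},
    {'famid': 'F2', 'child': 'kid_c', 'father': 'dad_2', 'mother': 'mom_2'},
]
example_affected = ['kid_a', 'kid_c']

families = group_by_family(example_trios, example_affected)
print(families['F1'])
```

Because the status comes from the pedigree rather than from position, the same structure should carry over to extended families with several affected or unaffected children.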

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

%matplotlib inline

In [8]:
# figure out who is who in each trio
import glob
data_dir = '/data/sudregp/cnv/penncnv/'
ped_file = '/data/sudregp/cnv/simplex.ped'
# str.startswith() takes a tuple and handles prefixes of any length ('WPS' has only 3 letters)
wes_prefix = ('CLIA', 'CCGO', 'WPS')
trios = []
affected = []
controls = []
with open(ped_file, 'r') as fid:
    for line in fid:
        famid, sid, fa, mo, sex, aff = line.rstrip().split('\t')
        # keep only trios in which child, father and mother all have IDs starting with a WES prefix
        if all(s.startswith(wes_prefix) for s in (sid, fa, mo)):
            fam = {}
            fam['child'] = sid
            if aff == '1':
                affected.append(sid)
            else:
                controls.append(sid)
            fam['father'] = fa
            fam['mother'] = mo
            fam['famid'] = famid
            trios.append(fam)

In [18]:
# write one joint-calling command per trio to the swarm file
str1 = 'cd /data/sudregp/cnv/penncnv/; '
str2 = 'detect_cnv.pl -joint -hmm penncnv_example.hmm -pfb %s.pfb %s/%s %s/%s %s/%s -out results/%s.jointcnv --log results/%s.log'

box = 'InfiniumExome'
with open('/data/sudregp/cnv/penncnv/swarm.joint_call', 'w') as fid:
    for trio in trios:
        fid.write(str1)
        fid.write(str2 % (box, box, trio['father'], box, trio['mother'], box,
                          trio['child'], trio['child'], trio['child']) + '\n')

In [20]:
%%bash
cd /data/sudregp/cnv/penncnv/
swarm -g 4 --job-name penn_joint --time 2:00:00 -f swarm.joint_call \
    --module penncnv --partition quick --logdir trash

55729461


And I'll run the trio that's in a different box by itself:

In [19]:
diff_box = [fam for fam in trios if fam['famid'] == '10369']
box = 'HumanExome'
for trio in diff_box:
    # parenthesized print works in both Python 2 and Python 3
    print(str1 + str2 % (box, box, trio['father'], box, trio['mother'], box,
                         trio['child'], trio['child'], trio['child']))

cd /data/sudregp/cnv/penncnv/; detect_cnv.pl -joint -hmm penncnv_example.hmm -pfb HumanExome.pfb HumanExome/CCGO_800976 HumanExome/CCGO_800977 HumanExome/CCGO_800980 -out results/CCGO_800980.jointcnv --log results/CCGO_800980.log
cd /data/sudregp/cnv/penncnv/; detect_cnv.pl -joint -hmm penncnv_example.hmm -pfb HumanExome.pfb HumanExome/CCGO_800976 HumanExome/CCGO_800977 HumanExome/CCGO_800979 -out results/CCGO_800979.jointcnv --log results/CCGO_800979.log


# TODO

* Look into sex chromosomes? There may be something to the idea that ADHD is more prevalent in boys...
* Play with the minimum size of the CNV
* Play with the HMM parameters
* Check (and plot) within family differences; the ones with big differences are interesting.
* Include parent burden in the analysis
* Filter based on neural vs nonneural CNVs (i.e. expressed in the brain)
* Look at literature-only CNVs (J Chia and A Thapaer for CNVs in ADHD)
* Match with file of ranked simplex by Wendy (maybe blindly)?
* Do all of the above with XHMM and array data. Maybe start with array because it'll be simpler?