Even though CNV burden did not differ much between ADHD and NV, maybe it still correlates with symptom counts? Or maybe even with some brain metric? It would also be cool to come up with some sort of network metric. So, let's dig deeper into CNVs and simplex cases. After a quick chat with Philip, here are a few things to prioritize:

* Play with the minimum size of the CNV
* Play with the HMM parameters
* Check (and plot) within family differences; the ones with big differences are interesting.
* Include parent burden in the analysis
* Filter based on neural vs nonneural CNVs (i.e. expressed in the brain)
* Look at literature-only CNVs (J Chia and A Thapaer for CNVs in ADHD)
* Match with file of ranked simplex by Wendy (maybe blindly)?
* Do all of the above with XHMM and array data. Maybe start with array because it'll be simpler?
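
For the first item in the list (minimum CNV size), here is a minimal sketch of how calls could be filtered once we have them. It assumes the usual PennCNV call-line layout (`chr:start-end numsnp=N length=N state,cn=N sample ...`); the function names and thresholds are placeholders I made up, not part of PennCNV.

```python
import re

# Fields of interest in a PennCNV call line, e.g.
# "chr1:100-200 numsnp=6 length=4,040 state2,cn=1 sample.txt startsnp=... endsnp=..."
# (layout assumed from the PennCNV output format; adjust if ours differs)
CALL_RE = re.compile(
    r'^(?:chr)?(?P<chrom>\w+):(?P<start>\d+)-(?P<end>\d+)\s+'
    r'numsnp=(?P<numsnp>\d+)\s+length=(?P<length>[\d,]+)\s+'
    r'state\d+,cn=(?P<cn>\d+)\s+(?P<sample>\S+)')


def parse_call(line):
    """Return a dict describing one CNV call, or None if the line does not match."""
    m = CALL_RE.match(line.strip())
    if m is None:
        return None
    return {'chrom': m.group('chrom'),
            'start': int(m.group('start')),
            'end': int(m.group('end')),
            'numsnp': int(m.group('numsnp')),
            # PennCNV writes lengths with thousands separators
            'length': int(m.group('length').replace(',', '')),
            'cn': int(m.group('cn')),
            'sample': m.group('sample')}


def filter_calls(lines, min_length=0, min_numsnp=0):
    """Keep calls that are at least min_length bp long and span min_numsnp markers."""
    calls = [parse_call(line) for line in lines]
    return [c for c in calls
            if c is not None
            and c['length'] >= min_length
            and c['numsnp'] >= min_numsnp]
```

Sweeping `min_length` over a few values and recomputing burden each time would be one way to "play with" the minimum size.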

So, let's get cracking.

# PennCNV

Going through their documentation, apparently the joint calling algorithm is better than the trio/quartet algorithm (http://penncnv.openbioinformatics.org/en/latest/user-guide/joint/). It's slower, but we don't care about that (to a certain point...). So, let's try to re-run it using the joint calling algorithm (terminal):

In [None]:
%%bash
cd /data/NCR_SBRB/simplex/penncnv/
cp -r HumanExome HumanExome.pfb HumanExome.hg19.gcmodel InfiniumExome InfiniumExome.hg19.gcmodel InfiniumExome.pfb ~/data/cnv/penncnv/
cd ..
cp simplex.ped ~/data/cnv/
cp ~/autodenovo/penncnv_example.hmm ~/data/cnv/penncnv/

It does take a while, so let's just swarm it. But to do that, we first need to figure out who is who in our pedigree. Note that I'm going to drop the convention I had before that the first trio in a family is the affected one, because we'll eventually play with extended families, and that convention will break down.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

%matplotlib inline

In [8]:
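# figure out who is who in each trio
import glob
data_dir = '/data/sudregp/cnv/penncnv/'
ped_file = '/data/sudregp/cnv/simplex.ped'
# sample ID prefixes that identify WES samples
wes_prefix = ('CLIA', 'CCGO', 'WPS')
trios = []
affected = []
controls = []
with open(ped_file, 'r') as fid:
    for line in fid:
        famid, sid, fa, mo, sex, aff = line.rstrip().split('\t')
        # keep only complete trios; startswith() is used because a [:4] slice
        # can never equal the 3-character 'WPS' prefix for longer IDs
        if (fa.startswith(wes_prefix) and mo.startswith(wes_prefix)
                and sid.startswith(wes_prefix)):
            fam = {}
            fam['child'] = sid
            # assumes simplex.ped codes affected as '1' (the standard PED
            # convention is 2 = affected, 1 = unaffected)
            if aff == '1':
                affected.append(sid)
            else:
                controls.append(sid)
            fam['father'] = fa
            fam['mother'] = mo
            fam['famid'] = famid
            trios.append(fam)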
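
Since the "first trio is the affected one" convention is gone, a quick way to see which families can support within-family comparisons is to group children by family. This is only a sketch: the helper names are mine, and the toy IDs below are made up, standing in for the `trios` and `affected` lists built from the pedigree.

```python
from collections import defaultdict


def children_by_family(trios, affected):
    """Map famid -> {'affected': [...], 'unaffected': [...]} child IDs."""
    affected = set(affected)
    fams = defaultdict(lambda: {'affected': [], 'unaffected': []})
    for trio in trios:
        key = 'affected' if trio['child'] in affected else 'unaffected'
        fams[trio['famid']][key].append(trio['child'])
    return dict(fams)


def discordant_families(trios, affected):
    """Families with at least one affected and one unaffected child."""
    fams = children_by_family(trios, affected)
    return sorted(famid for famid, kids in fams.items()
                  if kids['affected'] and kids['unaffected'])


# toy data with hypothetical IDs, in the same shape as the real lists
toy_trios = [
    {'famid': 'F1', 'child': 'X_001', 'father': 'X_010', 'mother': 'X_011'},
    {'famid': 'F1', 'child': 'X_002', 'father': 'X_010', 'mother': 'X_011'},
    {'famid': 'F2', 'child': 'X_003', 'father': 'X_020', 'mother': 'X_021'},
]
toy_affected = ['X_001', 'X_003']
toy_discordant = discordant_families(toy_trios, toy_affected)
```

Only families returned by `discordant_families` have both an affected and an unaffected child to compare.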

In [18]:
str1 = 'cd /data/sudregp/cnv/penncnv/; '
# arguments after -pfb are the father, mother and child signal files, in that order
str2 = 'detect_cnv.pl -joint -hmm penncnv_example.hmm -pfb %s.pfb %s/%s %s/%s %s/%s -out results/%s.jointcnv --log results/%s.log'

box = 'InfiniumExome'
# one swarm line per trio; outputs are named after the child
with open('/data/sudregp/cnv/penncnv/swarm.joint_call', 'w') as fid:
    for trio in trios:
        fid.write(str1)
        fid.write(str2 % (box, box, trio['father'], box, trio['mother'], box,
                          trio['child'], trio['child'], trio['child']) + '\n')
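
Before submitting, it might be worth checking that every signal file named in the swarm actually exists in the box directory, since a trio sitting in a different box would otherwise only show up as a failed job. A minimal sketch, assuming signal files are named exactly after the sample IDs (as in the command template above); `missing_signal_files` is a hypothetical helper, not part of PennCNV.

```python
import os


def missing_signal_files(trios, box_dir):
    """Return the sorted, de-duplicated paths of signal files not found in box_dir."""
    missing = set()
    for trio in trios:
        for role in ('father', 'mother', 'child'):
            path = os.path.join(box_dir, trio[role])
            if not os.path.isfile(path):
                missing.add(path)
    return sorted(missing)
```

Calling it as `missing_signal_files(trios, '/data/sudregp/cnv/penncnv/InfiniumExome')` and getting back an empty list would mean every trio is safe to run with that box.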

In [20]:
%%bash
cd /data/sudregp/cnv/penncnv/
swarm -g 4 --job-name penn_joint --time 2:00:00 -f swarm.joint_call \
    --module penncnv --partition quick --logdir trash

55729461


And I'll run the family that's in a different box (HumanExome) by itself:

In [19]:
diff_box = [fam for fam in trios if fam['famid'] == '10369']
box = 'HumanExome'
for trio in diff_box:
    # print() with a single argument works in both Python 2 and 3
    print(str1 + str2 % (box, box, trio['father'], box, trio['mother'], box,
                         trio['child'], trio['child'], trio['child']))

cd /data/sudregp/cnv/penncnv/; detect_cnv.pl -joint -hmm penncnv_example.hmm -pfb HumanExome.pfb HumanExome/CCGO_800976 HumanExome/CCGO_800977 HumanExome/CCGO_800980 -out results/CCGO_800980.jointcnv --log results/CCGO_800980.log
cd /data/sudregp/cnv/penncnv/; detect_cnv.pl -joint -hmm penncnv_example.hmm -pfb HumanExome.pfb HumanExome/CCGO_800976 HumanExome/CCGO_800977 HumanExome/CCGO_800979 -out results/CCGO_800979.jointcnv --log results/CCGO_800979.log


# TODO

* Check the list at the top of the document
* Try PennCNV steps with adjusted pipeline to see if we get anything different
* Look into sex chromosomes? There may be something to the idea that ADHD is more prevalent in boys...