I investigated ParseCNV further, and it seems quite buggy and poorly documented. I couldn't run it on the cluster because of library incompatibilities, and even the example scripts wouldn't run on my laptop because of a problem running R scripts.

Meanwhile, I've been looking further into PennCNV, and apparently we can do everything we need with it. So, let's explore it a bit more, in terms of annotating the calls and doing some extra filtering. We'll pick up from the 12/08 analysis:

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

%matplotlib inline

In [2]:
# figure out who is who in each trio
import glob
data_dir = '/data/sudregp/cnv/penncnv/'
ped_file = '/data/sudregp/cnv/simplex.ped'
# str.startswith needs a tuple; it also handles the 3-letter 'WPS' prefix,
# which a fixed 4-character slice (e.g. sid[:4]) could never match
wes_prefix = ('CLIA', 'CCGO', 'WPS')
trios = []
affected = []
controls = []
samples = []
with open(ped_file, 'r') as fid:
    for line in fid:
        famid, sid, fa, mo, sex, aff = line.rstrip().split('\t')
        # keep only trios where the child and both parents carry one of the prefixes
        if all(s.startswith(wes_prefix) for s in (sid, fa, mo)):
            fam = {'child': sid, 'father': fa, 'mother': mo, 'famid': famid}
            if aff == '1':
                affected.append(sid)
            else:
                controls.append(sid)
            trios.append(fam)
            samples += [sid, fa, mo]
samples = set(samples)

In [3]:
df = pd.read_table('/data/sudregp/cnv/penncnv/results/all_simplex.qcsum')
# we ran for all samples, but let's look at only stats for samples in our simplex study
keep_me = [i for i in range(df.shape[0]) if df.File[i].split('/')[-1] in samples]
df = df.iloc[keep_me]

In [4]:
df20 = df[df.NumCNV <= 20]
df10 = df[df.NumCNV <= 10]

In [5]:
%%bash
cd /data/sudregp/cnv/penncnv/results

while read sample; do
    triocnv="${sample}.jointcnv"
    rm -f denovo.txt inherited.txt
    grep mother "${triocnv}" > mom_snps
    grep father "${triocnv}" > dad_snps
    cat mom_snps dad_snps > parent_snps
    for snp in $(grep offspring "${triocnv}" | cut -d' ' -f 1); do
        if ! grep -q "$snp" parent_snps; then
            echo "$snp" >> denovo.txt
        else
            echo "$snp" >> inherited.txt
        fi
    done
    echo ${triocnv}: $(cat denovo.txt 2>/dev/null | wc -l) denovo, $(cat inherited.txt 2>/dev/null | wc -l) inherited
    rm *_snps
done < ../good_kids_numCNVse20.txt

CLIA_400140.jointcnv: 2 denovo, 196 inherited
CLIA_400138.jointcnv: 16 denovo, 120 inherited
CLIA_400121.jointcnv: 0 denovo, 110 inherited
CLIA_400191.jointcnv: 0 denovo, 160 inherited
CLIA_400148.jointcnv: 5 denovo, 117 inherited
CLIA_400153.jointcnv: 7 denovo, 89 inherited
CLIA_400216.jointcnv: 6 denovo, 79 inherited
CLIA_400189.jointcnv: 0 denovo, 155 inherited
CLIA_400180.jointcnv: 0 denovo, 105 inherited
CLIA_400166.jointcnv: 11 denovo, 193 inherited
CLIA_400122.jointcnv: 3 denovo, 110 inherited
CLIA_400204.jointcnv: 0 denovo, 190 inherited
CCGO_800979.jointcnv: 1 denovo, 28 inherited
CLIA_400162.jointcnv: 95 denovo, 226 inherited
CLIA_400158.jointcnv: 45 denovo, 116 inherited
CLIA_400170.jointcnv: 0 denovo, 172 inherited
CLIA_400123.jointcnv: 3 denovo, 53 inherited
CLIA_400209.jointcnv: 3 denovo, 280 inherited
CLIA_400134.jointcnv: 3 denovo, 148 inherited
CLIA_400131.jointcnv: 4 denovo, 161 inherited
CLIA_400195.jointcnv: 4 denovo, 344 inherited
CLIA_400135.jointcnv: 8 denovo, 32
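
The shell loop above can also be expressed in Python, which makes the de novo vs. inherited logic easier to reuse later. This is only a minimal sketch: the function name `classify_offspring_calls` is hypothetical, and it assumes (as the `grep`/`cut` calls do) that each line of a `.jointcnv` file starts with the CNV region and contains the word `father`, `mother`, or `offspring`. Unlike `grep -q`, it matches regions exactly rather than as substrings, so counts could differ slightly from the ones printed above.

```python
def classify_offspring_calls(lines):
    """Split offspring CNV regions into (denovo, inherited) lists.

    Assumes the first whitespace-separated field of each line is the region,
    and that the line names the family member it belongs to.
    """
    parent_regions = set()
    offspring_regions = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        region = fields[0]
        if 'father' in line or 'mother' in line:
            parent_regions.add(region)
        elif 'offspring' in line:
            offspring_regions.append(region)
    # a call is de novo if neither parent has a call at the same region
    denovo = [r for r in offspring_regions if r not in parent_regions]
    inherited = [r for r in offspring_regions if r in parent_regions]
    return denovo, inherited


# made-up lines, for illustration only (not real calls)
example = [
    'chr1:100-200 numsnp=5 length=101 state2,cn=1 kid.txt offspring',
    'chr1:100-200 numsnp=5 length=101 state2,cn=1 dad.txt father',
    'chr2:300-400 numsnp=4 length=101 state5,cn=3 kid.txt offspring',
]
denovo, inherited = classify_offspring_calls(example)
print(len(denovo), 'denovo,', len(inherited), 'inherited')
```

For a real trio, the lines would come from reading the corresponding `.jointcnv` file, e.g. `open(sample + '.jointcnv').readlines()`.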

In [23]:
%%bash

module load penncnv
cd /data/sudregp/cnv/penncnv/results

while read sample; do
    cnv=${sample}'.jointcnv'
    gene=/fdb/annovar/current/hg19/hg19_refGene.txt
    link=/fdb/annovar/current/hg19/hg19_refLink.txt

    # there are some differences in the files...
    scan_region.pl ${cnv} $gene -refexon -reflink $link > ${cnv}_refexon
    scan_region.pl ${cnv} $gene -refgene -reflink $link > ${cnv}_refgene

    gene=/fdb/annovar/current/hg19/hg19_knownGene.txt
    link=/fdb/annovar/current/hg19/hg19_kgXref.txt
    scan_region.pl ${cnv} $gene -knowngene -kgxref $link > ${cnv}_knowngene
done < ../good_kids_numCNVse20.txt



A quick note: I tried adapting other tracks from the UCSC browser to use as annotation databases, but I haven't gotten that to work yet. It may be worth investing more time in it later, but the same result can be obtained by annotating the calls with the default files (either knownGene or refSeq) and then keeping only the annotations that intersect with the Allen Gene List (downloaded from the UCSC website).
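
The intersection with a gene list could look something like the sketch below. It rests on an assumption that should be checked against the actual `_refgene` / `_knowngene` files: that `scan_region.pl` appends two columns to each call, the comma-separated gene name(s) followed by a distance. The function name `keep_calls_in_gene_list` and the example genes are hypothetical.

```python
def keep_calls_in_gene_list(annotated_lines, gene_list):
    """Keep annotated calls whose gene column overlaps gene_list.

    Assumes the second-to-last whitespace-separated field holds the
    comma-separated gene names (an assumption about the annotated output).
    """
    genes_of_interest = set(gene_list)
    kept = []
    for line in annotated_lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        genes = set(fields[-2].split(','))
        if genes & genes_of_interest:
            kept.append(line)
    return kept


# made-up annotated calls and gene list, for illustration only
annotated = [
    'chr1:100-200 numsnp=5 length=101 state2,cn=1 kid.txt offspring GENEA,GENEB 0',
    'chr2:300-400 numsnp=4 length=101 state5,cn=3 kid.txt offspring GENEC 0',
]
brain_genes = ['GENEB', 'GENEZ']
kept = keep_calls_in_gene_list(annotated, brain_genes)
print(len(kept), 'of', len(annotated), 'calls intersect the gene list')
```

Recounting de novo and inherited calls on only the kept lines would then cover the first TODO item.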

# TODO

* Recount only the intersecting CNVs
* Filter based on neural vs nonneural CNVs (i.e. expressed in the brain)
* Look at literature-only CNVs (J Chia and A Thapar for CNVs in ADHD)
* Understand what we're QCing on!
* Try removing calls in immunoglobulin, telomere and centromere regions (see PennCNV annotation page)
* Worth calculating p-values? For that one trio it was always 0!
* Try PennCNV steps with adjusted pipeline to see if we get anything different
* Look into sex chromosomes? There may be something to the idea that ADHD is more prevalent in boys...
* Play with the minimum size of the CNV
* Play with the HMM parameters
* Check (and plot) within family differences; the ones with big differences are interesting.
* Include parent burden in the analysis
* Match with file of ranked simplex by Wendy (maybe blindly)?
* Do all of the above with XHMM and array data. Maybe start with array because it'll be simpler?
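
As a starting point for the minimum-size item, here is a rough sketch of a length filter. It assumes each call line carries a `length=` field with comma-grouped digits (e.g. `length=1,500`); the function name `filter_by_min_length` and the threshold are placeholders, not values we have settled on.

```python
import re

# matches e.g. "length=1,500" and captures "1,500"
LENGTH_RE = re.compile(r'length=([\d,]+)')


def filter_by_min_length(lines, min_length):
    """Keep call lines whose length= field is at least min_length bases."""
    kept = []
    for line in lines:
        match = LENGTH_RE.search(line)
        if match is None:
            continue  # skip lines without a length field
        length = int(match.group(1).replace(',', ''))
        if length >= min_length:
            kept.append(line)
    return kept


# made-up lines, for illustration only
calls = [
    'chr1:100-1599 numsnp=5 length=1,500 state2,cn=1 kid.txt offspring',
    'chr2:300-1199 numsnp=4 length=900 state5,cn=3 kid.txt offspring',
]
big_calls = filter_by_min_length(calls, 1000)
print(len(big_calls), 'calls at or above the threshold')
```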

# Useful links

