Alright, let's dive a bit deeper into some of the results we saw on 12/15. Is there anything there?

In [4]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

init_notebook_mode(connected=True)
%matplotlib inline

In [5]:
# figure out who is who in each trio
import glob
data_dir = '/data/sudregp/cnv/penncnv/'
ped_file = '/data/sudregp/cnv/simplex.ped'
wes_prefix = ['CLIA', 'CCGO', 'WPS']
trios = []
affected = []
controls = []
samples = []
famids = []
with open(ped_file, 'r') as fid:
    for line in fid:
        famid, sid, fa, mo, sex, aff = line.rstrip().split('\t')
        # keep only trios where the child and both parents have WES data;
        # startswith() is needed because the prefixes differ in length
        # ('WPS' has 3 characters, so comparing the first 4 never matches it)
        if all(s.startswith(tuple(wes_prefix)) for s in (sid, fa, mo)):
            fam = {'child': sid, 'father': fa, 'mother': mo, 'famid': famid}
            if aff == '1':
                affected.append(sid)
            else:
                controls.append(sid)
            trios.append(fam)
            samples += [sid, fa, mo]
            famids.append(famid)
samples = set(samples)
famids = set(famids)

In [6]:
def plot_cnvs(fname, t_str):
    import plotly.graph_objs as go
    from plotly import tools

    df = pd.read_csv(fname)
    df['total'] = df.denovo + df.inherited

    fig = tools.make_subplots(rows=3, cols=1, subplot_titles=('De Novo CNVs',
                                                              'Inherited CNVs',
                                                              'All CNVs'))

    for c, cnv in enumerate(['denovo', 'inherited', 'total']):
        x_red, x_green, red, green, xticks = [], [], [], [], []
        red_text, green_text = [], []

        # loop through families
        f = 0
        for fam in famids:
            fam_kids = [t['child'] for t in trios if t['famid'] == fam]
            found = False
            for kid in fam_kids:
                if kid in list(df.kid):
                    found = True
                    if kid in affected:
                        red.append(int(df[df.kid == kid][cnv]))
                        x_red.append(f)
                        red_text.append(kid)
                    else:
                        green.append(int(df[df.kid == kid][cnv]))
                        x_green.append(f)
                        green_text.append(kid)
            # only increase counter if we added a kid
            if found:
                xticks.append(fam)
                f += 1


        trace0 = go.Scatter(
            x = x_red,
            y = red,
            mode = 'markers',
            name = 'affected',
            marker = dict(size = 10, color = 'red'),
            text = red_text,
            hoverinfo='text+y',
            showlegend = False
        )
        trace1 = go.Scatter(
            x = x_green,
            y = green,
            mode = 'markers',
            name = 'unaffected',
            marker = dict(size = 10, color = 'green'),
            text = green_text,
            hoverinfo='text+y',
            showlegend = False
        )
        fig.append_trace(trace0, c + 1, 1)
        fig.append_trace(trace1, c + 1, 1)

    fig['layout'].update(height=900, width=800, title=t_str,
                         xaxis1=dict(tickvals=range(len(xticks)),
                                    ticktext=xticks,
                                    zeroline = False),
                         xaxis2=dict(tickvals=range(len(xticks)),
                                    ticktext=xticks,
                                    zeroline = False),
                         xaxis3=dict(tickvals=range(len(xticks)),
                                    ticktext=xticks,
                                    zeroline = False),
                         hovermode='closest')
    fig['data'][0]['showlegend'] = True
    fig['data'][1]['showlegend'] = True
    iplot(fig)

One result we kept seeing was a few ADHD trios with quite a high number of CNVs, like this:

In [7]:
i = 'knowngene'
x = 100000
plot_cnvs('/data/sudregp/cnv/penncnv/results/' + 
          'summary_%s_cnvlenBT%d_qc20171215.csv' % (i, x),
          'In %s DB (CNV length > %d, qc20171215)' % (i, x))

This is the format of your plot grid:
[ (1,1) x1,y1 ]
[ (2,1) x2,y2 ]
[ (3,1) x3,y3 ]



What's going on with 1893, 1895, 10215, and even 855? Anything in common?

In [14]:
%%bash
cd /data/sudregp/cnv/penncnv/results

filter=knowngene;
x=100000;
for sample in CLIA_400125 CLIA_400132 CLIA_400163 CLIA_400216; do
    triocnv=${sample}'_adjusted.jointcnv_'${filter}_qc20171215
    rm denovo.txt inherited.txt 2>/dev/null
    grep mother ${triocnv} > mom_snps;
    grep father ${triocnv} > dad_snps;
    cat mom_snps dad_snps | grep -v NOT_FOUND | awk -v x=$x '{split($3,arr,"="); sub(/,/,"",arr[2]); if (arr[2] > x) print $1}' > parent_snps;
    for snp in `grep offspring ${triocnv} | grep -v NOT_FOUND | awk -v x=$x '{split($3,arr,"="); sub(/,/,"",arr[2]); if (arr[2] > x) print $1}'`; do
        # exact fixed-string match, so one region cannot match as a substring of another
        if ! grep -qxF "$snp" parent_snps; then
            echo "$snp" >> denovo.txt
        else
            echo "$snp" >> inherited.txt
        fi;
    done
    echo $sample
    while read snp; do
        grep $snp $triocnv | awk '{print $1,$2,$3,$10}';
    done < denovo.txt;
    rm *_snps;
done

CLIA_400125
chr1:62953070-63069736 numsnp=24 length=116,667 ANGPTL3,DOCK7
chr1:91739270-91850744 numsnp=17 length=111,475 HFM1
chr1:93651937-94000386 numsnp=39 length=348,450 CCDC18,DR1,FNBP1L,LOC100131564,TRNA_Cys
chr3:25650812-25771579 numsnp=14 length=120,768 MIR4442,NGLY1,TOP2B,TRNA
chr5:38952527-39119027 numsnp=15 length=166,501 FYB,RICTOR
chr7:2472903-2578947 numsnp=27 length=106,045 BC034268,BRAT1,CHST12,LFNG,MIR4648
chr9:79829278-79936140 numsnp=22 length=106,863 VPS13A
chr13:48807588-48985925 numsnp=13 length=178,338 ITM2B,LINC00441,LPAR6,RB1
chr19:1108401-1226555 numsnp=17 length=118,155 SBNO2,STK11
chr1:205904942-206331193 numsnp=40 length=426,252 AVPR1B,C1orf186,CTSE,FAM72A,SLC26A9
chr11:51391587-51515912 numsnp=16 length=124,326 OR4A5,OR4C46
CLIA_400132
chr1:226924594-227069693 numsnp=8 length=145,100 ITPKB,PSEN2
chr1:228294567-228404377 numsnp=30 length=109,811 C1orf145,GJC2,GUK1,IBA57,MRPL55,OBSCN
chr7:2472903-2578857 numsnp=25 length=105,955 BC034268,BRAT1,CHST12,LFNG,M

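There are a couple of interesting hits, like chr1 and chr7; for example, a nearly identical chr7 region (BRAT1, CHST12, LFNG) comes up as de novo in both CLIA_400125 and CLIA_400132. But we can certainly automate this. There are also a few things I want to do here before we go further: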

* finalize QC according to papers
* remove troublesome regions
* do gene-based analysis
* separate deletions and duplications

Let's rock and roll.
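
As a first step towards automating the "anything in common?" check, here is a minimal sketch that looks for de novo regions overlapping across samples. The function names and the 50% overlap threshold are assumptions made for illustration, not something the pipeline already has; the example regions are copied from the output above.

```python
def parse_region(region):
    """Split 'chr7:2472903-2578947' into ('chr7', 2472903, 2578947)."""
    chrom, span = region.split(':')
    start, end = span.split('-')
    return chrom, int(start), int(end)


def overlap_fraction(a, b):
    """Fraction of the shorter region covered by the overlap (0 if no overlap)."""
    (ca, sa, ea), (cb, sb, eb) = parse_region(a), parse_region(b)
    if ca != cb:
        return 0.0
    shared = min(ea, eb) - max(sa, sb) + 1
    shortest = min(ea - sa, eb - sb) + 1
    return max(shared, 0) / float(shortest)


def shared_regions(calls, min_frac=0.5):
    """calls maps sample -> list of region strings.

    Returns (sample_a, region_a, sample_b, region_b) for every pair of
    regions from different samples that overlap by at least min_frac.
    """
    hits = []
    names = sorted(calls)
    for i, sample_a in enumerate(names):
        for sample_b in names[i + 1:]:
            for ra in calls[sample_a]:
                for rb in calls[sample_b]:
                    if overlap_fraction(ra, rb) >= min_frac:
                        hits.append((sample_a, ra, sample_b, rb))
    return hits


# a few of the de novo regions printed above
denovo = {
    'CLIA_400125': ['chr7:2472903-2578947', 'chr9:79829278-79936140'],
    'CLIA_400132': ['chr7:2472903-2578857', 'chr1:226924594-227069693'],
}
print(shared_regions(denovo))
```

With these inputs only the chr7 pair is reported. The all-pairs loop is quadratic, which should be fine for a handful of trios but would need an interval index for larger call sets.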

# Troublesome regions

PennCNV reports that telomere and centromere regions can be problematic. So, let's go ahead and remove those. 

It's true that they might not alter a comparison between siblings in a trio, but removing them might reduce the overall number of CNVs we're getting across the board. The first step was to convert the coordinates PennCNV suggests for immunoglobulin regions to hg19 using the Remap online tool.
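
For reference, this kind of region filter can be sketched in plain Python (this is not the PennCNV tooling). The exclusion interval below is a made-up placeholder, not a real centromere, telomere or immunoglobulin coordinate, and the function names are hypothetical; the two calls are taken from the output above.

```python
def overlaps(call, region):
    """True if two (chrom, start, end) tuples share at least one base."""
    return (call[0] == region[0] and
            call[1] <= region[2] and
            region[1] <= call[2])


def drop_troublesome(calls, excluded):
    """Keep only the calls that do not touch any excluded region."""
    return [c for c in calls
            if not any(overlaps(c, r) for r in excluded)]


# placeholder interval for illustration only (NOT real coordinates)
excluded = [('chr11', 51000000, 55000000)]
calls = [('chr11', 51391587, 51515912),
         ('chr9', 79829278, 79936140)]
kept = drop_troublesome(calls, excluded)
print(kept)
```

Dropping any call that touches an excluded region is the strictest choice; a fractional-overlap threshold would be a softer alternative.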

But then I realized that much of what needs to be done requires the scan_regions tool, which seems a bit buggy. Why don't we convert everything to PLINK format and do the filtering there? I'm not sure it would be easy to split out the de novo calls afterwards, so let's create 3 files (denovo, inherited, and all); that way it's easy to re-apply all filters later. I'm using a new notebook for this because it's easier to do all these shenanigans in bash.
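
To make the conversion concrete, here is a rough sketch of turning one PennCNV call into a row of a PLINK `.cnv` file (columns FID, IID, CHR, BP1, BP2, TYPE, SCORE, SITES). It assumes each call line carries `numsnp=` and a `cn=` copy-number field, as in standard PennCNV output; the helper name, the `FAM1` family ID, the `cn=1` value and the SNP names in the example line are all made up for illustration, and SCORE is set to a placeholder 0.

```python
import re


def penncnv_to_plink(line, famid, sid):
    """Build one PLINK .cnv row from a PennCNV call line.

    Assumes the line looks like:
    chr1:100-200 numsnp=24 length=101 state2,cn=1 ...
    """
    chrom, start, end = re.match(r'chr(\w+):(\d+)-(\d+)', line).groups()
    numsnp = re.search(r'numsnp=(\d+)', line).group(1)
    cn = re.search(r'cn=(\d+)', line).group(1)
    # TYPE holds the copy number; SCORE is a placeholder 0
    return ' '.join([famid, sid, chrom, start, end, cn, '0', numsnp])


# hypothetical call line: coordinates from the output above, cn=1 invented
example = ('chr1:62953070-63069736 numsnp=24 length=116,667 '
           'state2,cn=1 CLIA_400125 startsnp=snpA endsnp=snpB')
row = penncnv_to_plink(example, 'FAM1', 'CLIA_400125')
print(row)
```

Running this once per call, into separate files for the denovo, inherited and all sets, would give the three files mentioned above.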

# TODO

* How about a gene-level burden test, where we count genes with a CNV rather than CNVs?
* Separate into deletions and duplications!
* Look into protective genes?
* Try further QC. For example, from Thapar 2012:

```
First, PLINK {Purcell, 2007 1026 /id} was used to identify potential errors in the CNV calls by excluding CNVs that had more than 50% of their length spanning an interSNP-gap of at least 100 kb. Then, to account for the possibility that longer CNVs were mistakenly split by the Hidden Markov Model (HMM) into multiple shorter CNVs, we merged all adjacent CNVs that occurred in a single individual, where the total length of all gaps was less than 50% of the entire length of the newly combined CNV.
In this study, we focused only on rare CNVs, which were defined by excluding any that had more than 50% of their length spanning any of the following three tracks; a) CNVs occurring more than 28 times (i.e. with >1% frequency) in our entire sample, b) known segmental duplications present in the human genome build 18 (hg18), c) known common CNVs defined by the Genome Structural Variation Consortium that are available for download at http://projects.tcag.ca/variation/ng42m_cnv.php. This procedure generated a list of 1562 rare CNVs larger than 100 kb. These CNVs were taken forward for association analysis which, in accordance with previous findings {International Schizophrenia Consortium, 2008 1436 /id}, we performed after stratifying by CNV size (>100 kb, >500 kb). We considered PennCNV copy number calls of 0 and 1 as deletions and 3, 4 or 5 as duplications.
```

* Try using quality score from https://www.ncbi.nlm.nih.gov/pubmed/27402902
* Try removing calls in immunoglobulin, telomere and centromere regions (see PennCNV annotation page)
* Try merging adjacent CNV calls? (PennCNV can do it)
* How about transforming XHMM calls to PennCNV format to handle everything with similar scripts? The other approach is to create PLINK files for de novo and inherited CNVs, and do all the filtering in PLINK
* Worth calculating p-values? For that one trio it was always 0!
* Look into sex chromosomes? Something to the idea that ADHD is more prevalent in boys...
* Play with the HMM parameters
* Compile a PFB file for this specific population?
* Include parent burden in the analysis
* Match with file of ranked simplex by Wendy (maybe blindly)?
* Do all of the above with XHMM and array data. Maybe start with array because it'll be simpler?
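
Two of the rules quoted from Thapar 2012 are simple enough to sketch directly: the deletion/duplication split by copy number, and merging adjacent calls when the gaps add up to less than 50% of the combined CNV. The version below is one greedy, left-to-right reading of the merging rule, with hypothetical function names and toy coordinates; it is not PennCNV's or PLINK's own merging.

```python
def cnv_type(cn):
    """Copy number 0 or 1 is a deletion; 3, 4 or 5 is a duplication."""
    if cn in (0, 1):
        return 'del'
    if cn in (3, 4, 5):
        return 'dup'
    return 'normal'


def merge_adjacent(calls, max_gap_frac=0.5):
    """Merge (start, end) calls from one individual and one chromosome.

    A call is merged into the previous one when the total gap length
    stays below max_gap_frac of the combined CNV length.
    """
    merged = []  # entries are (start, end, total_gap_length)
    for start, end in sorted(calls):
        if merged:
            m_start, m_end, gaps = merged[-1]
            new_gaps = gaps + max(start - m_end - 1, 0)
            new_end = max(end, m_end)
            if new_gaps < max_gap_frac * (new_end - m_start + 1):
                merged[-1] = (m_start, new_end, new_gaps)
                continue
        merged.append((start, end, 0))
    return [(s, e) for s, e, _ in merged]


# toy coordinates: the first two calls are close, the third is far away
toy = [(100, 200), (221, 300), (1000, 1100)]
print(merge_adjacent(toy))
print(cnv_type(1), cnv_type(3))
```

In practice the merge would be applied separately to deletions and duplications, so that a deletion is never joined to a neighboring duplication.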