It occurred to me that all the analysis we did before only dealt with denovo CNVs. LEt's see if there is anything to the total number of CNVs.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

init_notebook_mode(connected=True)
%matplotlib inline

In [2]:
# figure out who is who in each trio
import glob
data_dir = '/data/sudregp/cnv/penncnv/'
ped_file = '/data/sudregp/cnv/simplex.ped'
wes_prefix = ['CLIA', 'CCGO', 'WPS']
trios = []
affected = []
controls = []
samples = []
famids = []
fid = open(ped_file, 'r')
for line in fid:
    famid, sid, fa, mo, sex, aff = line.rstrip().split('\t')
    if fa[:4] in wes_prefix and mo[:4] in wes_prefix and sid[:4] in wes_prefix:
        fam = {}
        fam['child'] = sid
        if aff == '1':
            affected.append(sid)
        else:
            controls.append(sid)
        fam['father'] = fa
        fam['mother'] = mo
        fam['famid'] = famid
        trios.append(fam)
        samples += [sid, fa, mo]
        famids.append(famid)
fid.close()
samples = set(samples)
famids = set(famids)

fid = open(data_dir + 'good_kids_joint_qc20171215.txt', 'r')
good_kids = [line.rstrip() for line in fid]
fid.close()

In [3]:
def plot_plink_cnvs(fname, t_str, verbose=False):
    
    import plotly.graph_objs as go
    from plotly import tools

    df = pd.read_table(fname, delimiter='\s+', index_col=1)
    df.head()

    x_red, x_green, red, green, xticks = [], [], [], [], []
    red_text, green_text = [], []

    # loop through families
    f = 0
    for fam in famids:
        fam_kids = [t['child'] for t in trios if t['famid'] == fam]
        found = False
        for kid in fam_kids:
            if kid in good_kids:
                found = True
                idx = [i for i in range(df.shape[0]) if df.index[i].find(kid) > 0][0]
                if kid in affected:
                    red.append(int(df.iloc[idx]['NSEG']))
                    x_red.append(f)
                    red_text.append(kid)
                else:
                    green.append(int(df.iloc[idx]['NSEG']))
                    x_green.append(f)
                    green_text.append(kid)
        # only increase counter if we added a kid
        if found:
            xticks.append(fam)
            f += 1

    fig = go.Figure()
                
    trace0 = go.Scatter(
        x = x_red,
        y = red,
        mode = 'markers',
        name = 'affected',
        marker = dict(size = 10, color = 'red'),
        text = red_text,
        hoverinfo='text+y',
        showlegend = True
    )
    trace1 = go.Scatter(
        x = x_green,
        y = green,
        mode = 'markers',
        name = 'unaffected',
        marker = dict(size = 10, color = 'green'),
        hovertext = green_text,
        hoverinfo='text+y',
        showlegend = True
    )
    fig['data'] = [trace0, trace1]
    fig['layout'].update(height=400, width=800, title=t_str,
                             xaxis1=dict(tickvals=range(len(xticks)),
                                        ticktext=xticks,
                                        zeroline = False),
                             hovermode='closest')
    iplot(fig)
    
    # print family order
    if verbose:
        score = []
        for x in range(max(x_green + x_red)):
            idx = [i for i, val in enumerate(x_red) if val == x]
            if len(idx) == 0:
                aff = 0
            else:
                aff = red[idx[0]]
            idx = [i for i, val in enumerate(x_green) if val == x]
            if len(idx) == 0:
                unaff = 0
            else:
                unaff = max([green[i] for i in idx])
            score.append(aff - unaff)
        order = np.argsort(score)[::-1]
        fam_names = fig.layout.xaxis1['ticktext']
        disrupted = [fam_names[v] for v in order if score[v] > 0]
        print 'Best disrupted:', ', '.join(disrupted)
        protected = [fam_names[v] for v in order[::-1] if score[v] < 0]
        print 'Best protected:', ', '.join(protected)
            
            

Let's focus on the clean versions, as we have noticed that they produce the best results anyways.

In [5]:
files = glob.glob(data_dir + '/all*clean_lenBT0*cnv.indiv')
for f in files:
    t_str = '.'.join(f.split('/')[-1].split('.')[:-2])
#     print t_str
    plot_plink_cnvs(f, t_str, verbose=True)

Best disrupted: 10215, 1893, 1895, 855, 10197, 10369
Best protected: 10164, 1892, 10406, 10041, 10448, 10131, 10033, 10090, 10094, 10128


Best disrupted: 10215, 1893, 1895, 855, 10197, 10369
Best protected: 10164, 1892, 10406, 10041, 10448, 10131, 10033, 10090, 10094, 10128


Best disrupted: 10215, 10153
Best protected: 


Best disrupted: 10215, 1893, 1895, 855, 10164, 10128, 10369
Best protected: 1892, 10406, 10041, 10131, 10448, 10094


Best disrupted: 10215, 1893, 1895, 855, 10164, 10128, 10369
Best protected: 1892, 10406, 10041, 10131, 10448, 10094


Best disrupted: 10153
Best protected: 


Best disrupted: 10197, 10215, 1892
Best protected: 10164, 1895, 10131, 10033, 10128, 10090, 10448, 1893, 855


Best disrupted: 10197, 10215, 1892
Best protected: 10164, 1895, 10131, 10033, 10128, 10090, 10448, 1893, 855


Best disrupted: 10215
Best protected: 


Best disrupted: 10215, 1893, 1895, 855, 10197, 10369
Best protected: 10164, 1892, 10406, 10041, 10448, 10131, 10033, 10090, 10094, 10128


Best disrupted: 10215, 1893, 1895, 855, 10197, 10369
Best protected: 10164, 1892, 10406, 10041, 10448, 10131, 10033, 10090, 10094, 10128


Best disrupted: 10215, 10153
Best protected: 


Best disrupted: 10215, 1893, 1895, 855, 10164, 10128, 10369
Best protected: 1892, 10406, 10041, 10131, 10448, 10094


Best disrupted: 10215, 1893, 1895, 855, 10164, 10128, 10369
Best protected: 1892, 10406, 10041, 10131, 10448, 10094


Best disrupted: 10153
Best protected: 


Best disrupted: 10197, 10215, 1892
Best protected: 10164, 1895, 10131, 10033, 10128, 10090, 10448, 1893, 855


Best disrupted: 10197, 10215, 1892
Best protected: 10164, 1895, 10131, 10033, 10128, 10090, 10448, 1893, 855


Best disrupted: 10215
Best protected: 


Best disrupted: 10215, 1893, 1895, 855, 10197, 10369
Best protected: 1892, 10164, 10406, 10041, 10448, 10131, 10033, 10090, 10094, 10128


Best disrupted: 10215, 1893, 1895, 855, 10197, 10369
Best protected: 1892, 10164, 10406, 10041, 10448, 10131, 10033, 10090, 10094, 10128


Best disrupted: 10215, 10153
Best protected: 1892, 10164


Best disrupted: 10215, 1895, 1893, 855, 10164, 10128, 10369
Best protected: 1892, 10406, 10041, 10131, 10448, 10094


Best disrupted: 10215, 1895, 1893, 855, 10164, 10128, 10369
Best protected: 1892, 10406, 10041, 10131, 10448, 10094


Best disrupted: 10153
Best protected: 1892


Best disrupted: 10197, 10215, 1892
Best protected: 10164, 1895, 10131, 10033, 10128, 10090, 10448, 1893, 855


Best disrupted: 10197, 10215, 1892
Best protected: 10164, 1895, 10131, 10033, 10128, 10090, 10448, 1893, 855


Best disrupted: 10215
Best protected: 10164


The results are largely the same, especially when taking into consideratin which families have affected with more CNVs than unaffected.

In [6]:
plot_plink_cnvs(data_dir + '/all_comb.2_clean_lenBT50_genes.cnv.indiv', 
                'denovo_comb.2_clean_lenBT50_genes', verbose=True)
plot_plink_cnvs(data_dir + '/all_comb.2_clean_lenBT0_genes.cnv.indiv', 
                'denovo_comb.2_clean_lenBT0_genes', verbose=True)

Best disrupted: 1893, 10215, 1895
Best protected: 10164, 1892, 10131


Best disrupted: 10215, 1893, 1895, 855, 10197, 10369
Best protected: 10164, 1892, 10406, 10041, 10448, 10131, 10033, 10090, 10094, 10128


For example, let's see what are the genes that those 3 kids have disrupted:

In [7]:
%%bash
cd /home/sudregp/data/cnv/penncnv/results/
for kid in CLIA_400163 CLIA_400125 CLIA_400132; do
    echo $kid
    grep -B 3 $kid all_comb.2_clean_lenBT50_genes.reg | grep RANGE;
done

CLIA_400163
RANGE (+/- 0kb )  [ 2 120517206 120742474 PTPN4 ]
RANGE (+/- 0kb )  [ 2 202352143 202483905 ALS2CR11 ]
RANGE (+/- 0kb )  [ 6 64429875 66417118 EYS ]
RANGE (+/- 0kb )  [ 10 32735009 33171792 CCDC7 ]
RANGE (+/- 0kb )  [ 13 41885340 41951166 NAA16 ]
CLIA_400125
RANGE (+/- 0kb )  [ 1 93646272 93744287 CCDC18 ]
RANGE (+/- 0kb )  [ 1 93775665 93811368 LOC100131564 ]
RANGE (+/- 0kb )  [ 1 93811477 93828148 DR1 ]
RANGE (+/- 0kb )  [ 1 93913687 94020218 FNBP1L ]
RANGE (+/- 0kb )  [ 1 205882176 205912588 SLC26A9 ]
RANGE (+/- 0kb )  [ 1 206138439 206155074 FAM72C ]
RANGE (+/- 0kb )  [ 1 206138910 206155074 FAM72A ]
RANGE (+/- 0kb )  [ 1 206224282 206231482 AVPR1B ]
RANGE (+/- 0kb )  [ 1 206238871 206288647 C1orf186 ]
RANGE (+/- 0kb )  [ 1 206317458 206332104 CTSE ]
RANGE (+/- 0kb )  [ 2 39208689 39347604 SOS1 ]
RANGE (+/- 0kb )  [ 13 48807273 48836232 ITM2B ]
RANGE (+/- 0kb )  [ 13 48870648 48877797 LINC00441 ]
RANGE (+/- 0kb )  [ 13 48877882 49056026 RB1 ]
RANGE (+/- 0kb )  [ 13 4898

Unfortunately it doesn't look like we have any similar genes. Another cool analysis would be to check our the brain-expressed genes, and then see if their brain is also different:

In [8]:
files = glob.glob(data_dir + '/all*clean_lenBT*_brainGenes.cnv.indiv')
for f in files:
    t_str = '.'.join(f.split('/')[-1].split('.')[:-2])
#     print t_str
    plot_plink_cnvs(f, t_str, verbose=True)

Best disrupted: 10215, 10153
Best protected: 


Best disrupted: 10215, 10153
Best protected: 


Best disrupted: 10215, 10153
Best protected: 


Best disrupted: 10153
Best protected: 


Best disrupted: 10215, 10153
Best protected: 


Best disrupted: 10215, 10153
Best protected: 


Best disrupted: 10215, 10153
Best protected: 


Best disrupted: 10153
Best protected: 


Best disrupted: 10215, 10153
Best protected: 1892, 10164


Best disrupted: 10215, 10153
Best protected: 1892, 10164


Best disrupted: 10215, 10153
Best protected: 1892, 10164


Best disrupted: 10153
Best protected: 1892


Well, the best we can do here is one gene, so I'm not sure how valid this analysis would be...

In [10]:
%%bash
cd /home/sudregp/data/cnv/penncnv/results/
for kid in CLIA_400128 CLIA_400125; do
    echo $kid
    grep -B 3 $kid all_clean_lenBT0_brainGenes.reg | grep RANGE;
done

CLIA_400128
RANGE (+/- 0kb )  [ 15 32322685 32462384 CHRNA7 ]
CLIA_400125
RANGE (+/- 0kb )  [ 1 206224282 206231482 AVPR1B ]


Not the same gene either...

# TODO

* Try using quality score from https://www.ncbi.nlm.nih.gov/pubmed/27402902
* Worth calculating p-values? For that one trio it was always 0!
* Look into sex chromosomes? Something to the idea that adhd is more present in boys...
* Play with the HMM parameters
* compile a PFB file for this specific population?
* Include parent burden in the analysis
* Match with file of ranked simplex by Wendy (maybe blindly)?
* Try XHMM again