Let's try to go back to that gene-based analysis I started yesterday.

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

init_notebook_mode(connected=True)
%matplotlib inline

In [3]:
# figure out who is who in each trio
import glob
ped_file = '/data/sudregp/cnv/simplex.ped'
wes_prefix = ['CLIA', 'CCGO', 'WPS']
trios = []
affected = []
controls = []
samples = []
famids = []
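with open(ped_file, 'r') as fid:
    for line in fid:
        famid, sid, fa, mo, sex, aff = line.rstrip().split('\t')
        # keep only trios where child, father and mother all have WES data;
        # startswith() handles prefixes of any length ('WPS' has 3 characters,
        # so a [:4] slice could never match it)
        if all(s.startswith(tuple(wes_prefix)) for s in (sid, fa, mo)):
            fam = {}
            fam['child'] = sid
            # NOTE: this assumes the file codes affected kids as '1'; standard
            # PLINK coding is 1 = unaffected, 2 = affected
            if aff == '1':
                affected.append(sid)
            else:
                controls.append(sid)
            fam['father'] = fa
            fam['mother'] = mo
            fam['famid'] = famid
            trios.append(fam)
            samples += [sid, fa, mo]
            famids.append(famid)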
samples = set(samples)
famids = set(famids)

fid = open('/data/sudregp/cnv/kid_samples.txt', 'r')
good_kids = [line.rstrip() for line in fid]
fid.close()

In [4]:
fid = open('/home/sudregp/data/cnv/xhmm/denovo_lenBT0_genes.reg')
genes = {}
for line in fid:
    if line.find('RANGE') >= 0:
        gene = line.split(' ')[-2]
        genes[gene] = 0
    elif line.find('DUP') > 0 or line.find('DEL') > 0:
        genes[gene] += 1
fid.close()
df = pd.DataFrame.from_dict(genes, orient='index')
df.columns = ['count']
df.sort_values(by='count', ascending=False).head(10)

Unnamed: 0,count
NBPF20,8
NBPF9,8
GOLGA6L10,7
GOLGA6L9,6
GOLGA6L17P,6
CROCC,5
PI4KAP1,5
GSTT1,5
MIR548N,5
PPP1R10,5


So these genes are each hit by 5 to 8 denovo CNVs, out of the 49 trios we have. Since these are all denovo results, it'd be nice if the hits in any of these genes all belonged to affected kids. Or maybe even a combination would work? How about some sort of "deletion is bad, duplication is good" scheme?

In [24]:
dfs = df.sort_values(by='count', ascending=False).head(10)
for gene, row in dfs.iterrows():
    nlines = int(row['count'])
    with open('/home/sudregp/data/cnv/xhmm/denovo_lenBT0_genes.reg') as fid:
        for line in fid:
            # match the gene name exactly on its RANGE line, not as a substring
            if line.find('RANGE') >= 0 and line.split(' ')[-2] == gene:
                print(line.rstrip())
                # column header plus one line per CNV
                for i in range(nlines + 1):
                    line = next(fid)
                    print(line.rstrip())

RANGE (+/- 0kb )  [ 1 144146810 146467744 NBPF20 ]
 FID           IID      PHE  CHR          BP1          BP2   TYPE       KB     OLAP   OLAP_U   OLAP_R
   1   CLIA_400170        1    1    144674717    144684957    DEL    10.24        1 0.004412 0.004412
   1   CLIA_400123        2    1    144816539    144855868    DEL    39.33        1  0.01695  0.01695
   1   CLIA_400181        1    1    144935052    144952344    DEL    17.29        1 0.007451 0.007451
   1   CLIA_400179        1    1    144955188    145075113    DUP    119.9        1  0.05167  0.05167
   1   CLIA_400170        1    1    144955188    145075903    DEL    120.7        1  0.05201  0.05201
   1   CLIA_400132        2    1    145248734    145290569    DUP    41.84        1  0.01803  0.01803
   1   CLIA_400166        1    1    145273133    145290569    DUP    17.44        1 0.007513 0.007513
   1   CLIA_400142        1    1    145281327    145295626    DEL     14.3        1 0.006161 0.006161
RANGE (+/- 0kb )  [ 1 144614958

In PLINK notation, 1 is unaffected, 2 is affected.
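
To check the "deletion is bad, duplication is good" idea without reading the listing by eye, here is a minimal sketch that tallies CNVs per gene by type and phenotype. It assumes the layout seen in the output above (a `RANGE` line, a column header, then one row per CNV with `IID`, `PHE` and `TYPE` in the 2nd, 3rd and 7th columns). The names `parse_reg` and `tally` are made up for illustration, and the sample text is just four of the NBPF20 rows.

```python
from collections import defaultdict

def parse_reg(lines):
    """Yield (gene, iid, phe, cnv_type) for each CNV row of a .reg listing."""
    gene = None
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == 'RANGE':
            # e.g. RANGE (+/- 0kb )  [ 1 144146810 146467744 NBPF20 ]
            gene = fields[-2] if fields[-1] == ']' else None
        elif fields[0] == 'FID':
            continue  # column header
        elif gene is not None and len(fields) >= 7 and fields[6] in ('DEL', 'DUP'):
            yield gene, fields[1], fields[2], fields[6]

def tally(records):
    """Count CNVs per gene, keyed by (type, status); PHE 2 = affected, 1 = unaffected."""
    table = defaultdict(lambda: defaultdict(int))
    for gene, iid, phe, cnv_type in records:
        status = 'affected' if phe == '2' else 'unaffected'
        table[gene][(cnv_type, status)] += 1
    return table

# four rows copied from the NBPF20 listing, for illustration only
reg_text = """\
RANGE (+/- 0kb )  [ 1 144146810 146467744 NBPF20 ]
 FID           IID      PHE  CHR          BP1          BP2   TYPE       KB     OLAP   OLAP_U   OLAP_R
   1   CLIA_400170        1    1    144674717    144684957    DEL    10.24        1 0.004412 0.004412
   1   CLIA_400123        2    1    144816539    144855868    DEL    39.33        1  0.01695  0.01695
   1   CLIA_400179        1    1    144955188    145075113    DUP    119.9        1  0.05167  0.05167
   1   CLIA_400132        2    1    145248734    145290569    DUP    41.84        1  0.01803  0.01803
"""

table = tally(parse_reg(reg_text.splitlines()))
for gene in sorted(table):
    for key in sorted(table[gene]):
        print('%s %s %s: %d' % (gene, key[0], key[1], table[gene][key]))
```

On the real file, the same functions could be fed the open file handle instead of `reg_text.splitlines()`. A gene would fit the scheme if its DEL counts sat mostly in the affected column and its DUP counts mostly in the unaffected one.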

Not really... CROCC could be protective, but that's pretty much it. What if we use the clean denovo set only?

In [27]:
fid = open('/home/sudregp/data/cnv/xhmm/denovo_clean_lenBT0_genes.reg')
genes = {}
for line in fid:
    if line.find('RANGE') >= 0:
        gene = line.split(' ')[-2]
        genes[gene] = 0
    elif line.find('DUP') > 0 or line.find('DEL') > 0:
        genes[gene] += 1
fid.close()
df = pd.DataFrame.from_dict(genes, orient='index')
df.columns = ['count']
df.sort_values(by='count', ascending=False).head(10)

Unnamed: 0,count
ABHD13,4
LCORL,3
CHD9,2
NCAPG,2
SP3,2
CREBRF,2
ETNK1,2
DOCK7,1
MSH6,1
CCDC39,1


Let's narrow it down to genes with at least 2 hits (the top 7), even though that's already pushing it:

In [28]:
dfs = df.sort_values(by='count', ascending=False).head(7)
for gene, row in dfs.iterrows():
    nlines = int(row['count'])
    with open('/home/sudregp/data/cnv/xhmm/denovo_clean_lenBT0_genes.reg') as fid:
        for line in fid:
            # match the gene name exactly on its RANGE line, not as a substring
            if line.find('RANGE') >= 0 and line.split(' ')[-2] == gene:
                print(line.rstrip())
                # column header plus one line per CNV
                for i in range(nlines + 1):
                    line = next(fid)
                    print(line.rstrip())

RANGE (+/- 0kb )  [ 13 108870762 108886603 ABHD13 ]
 FID           IID      PHE  CHR          BP1          BP2   TYPE       KB     OLAP   OLAP_U   OLAP_R
   1   CLIA_400122        2   13    108882417    108884732    DUP    2.315        1   0.1462   0.1462
   1   CLIA_400158        1   13    108882417    108886344    DEL    3.927        1   0.2479   0.2479
   1   CLIA_400123        2   13    108882665    108885500    DUP    2.835        1    0.179    0.179
   1   CLIA_400178        2   13    108882844    108886540    DEL    3.696        1   0.2334   0.2334
RANGE (+/- 0kb )  [ 4 17844838 18023483 LCORL ]
 FID           IID      PHE  CHR          BP1          BP2   TYPE       KB     OLAP   OLAP_U   OLAP_R
   1   CLIA_400123        2    4     17845860     17879030    DUP    33.17        1   0.1857   0.1857
   1   CLIA_400178        2    4     17845860     17879761    DEL     33.9        1   0.1898   0.1898
   1   CLIA_400204        2    4     17878089     17878822    DEL    0.733        1 

Now this becomes a bit more interesting, especially if we talk about combinations of genes. Say, you're screwed if two (or more) of the genes in this list are disrupted. In that case, 400122 has 3, 400123 has 4, and 400178 has 7. Does it vary with symptoms?
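
The per-kid gene count can be sketched as below. This is only an illustration: the `hits` list holds the ABHD13 and LCORL rows visible in the output above (the other five genes are cut off in the listing), so the counts it gives are lower than the ones quoted in the text. The function names are hypothetical.

```python
from collections import defaultdict

# (gene, IID, PHE) triples copied from the ABHD13 and LCORL listings;
# the remaining genes are not shown in full, so they are left out here
hits = [
    ('ABHD13', 'CLIA_400122', '2'),
    ('ABHD13', 'CLIA_400158', '1'),
    ('ABHD13', 'CLIA_400123', '2'),
    ('ABHD13', 'CLIA_400178', '2'),
    ('LCORL', 'CLIA_400123', '2'),
    ('LCORL', 'CLIA_400178', '2'),
    ('LCORL', 'CLIA_400204', '2'),
]

def genes_per_sample(hits):
    """Map each sample ID to the set of distinct genes hit by one of its CNVs."""
    genes = defaultdict(set)
    for gene, iid, _phe in hits:
        genes[iid].add(gene)
    return genes

def multi_hit_samples(hits, min_genes=2):
    """Return the samples with at least min_genes distinct genes disrupted."""
    per_sample = genes_per_sample(hits)
    return sorted(iid for iid, g in per_sample.items() if len(g) >= min_genes)

print(multi_hit_samples(hits))
```

Using a set per sample means two CNVs in the same gene count once, which is what "number of disrupted genes" calls for. The resulting per-kid counts could then be compared against symptom scores.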

What if we do a similar analysis using all CNVs (not only denovo)?

# all CNVs

In [25]:
fid = open('/home/sudregp/data/cnv/xhmm/all_lenBT0_genes.reg')
genes = {}
for line in fid:
    if line.find('RANGE') >= 0:
        gene = line.split(' ')[-2]
        genes[gene] = 0
    elif line.find('DUP') > 0 or line.find('DEL') > 0:
        genes[gene] += 1
fid.close()
df = pd.DataFrame.from_dict(genes, orient='index')
df.columns = ['count']
df.sort_values(by='count', ascending=False).head(10)

Unnamed: 0,count
NBPF20,80
NBPF9,63
FCGBP,54
MIR548N,53
AHNAK2,45
GOLGA6L22,45
PCDHA6,43
PCDHA2,43
PCDHA3,43
PCDHA1,43


Now the counts can go up to 98, because we're looking at all individuals. Still, some of them look quite high. Let's use the clean set instead:

In [30]:
fid = open('/home/sudregp/data/cnv/xhmm/all_clean_lenBT0_genes.reg')
genes = {}
for line in fid:
    if line.find('RANGE') >= 0:
        gene = line.split(' ')[-2]
        genes[gene] = 0
    elif line.find('DUP') > 0 or line.find('DEL') > 0:
        genes[gene] += 1
fid.close()
df = pd.DataFrame.from_dict(genes, orient='index')
df.columns = ['count']
df.sort_values(by='count', ascending=False).head(10)

Unnamed: 0,count
NXF2,9
ABHD13,8
SP3,6
CCDC168,5
THAP5,5
MAGEA6,5
PNPLA8,5
LYSMD3,4
MIER1,4
LCORL,4


This is better, but it's still out of 98. Does anything come out as interesting? (Let's print only the kids.)

In [39]:
dfs = df.sort_values(by='count', ascending=False).head(7)
kids = set(affected + controls)
for gene, row in dfs.iterrows():
    nlines = int(row['count'])
    with open('/home/sudregp/data/cnv/xhmm/all_clean_lenBT0_genes.reg') as fid:
        for line in fid:
            # match the gene name exactly on its RANGE line, not as a substring
            if line.find('RANGE') >= 0 and line.split(' ')[-2] == gene:
                print(line.rstrip())
                for i in range(nlines + 1):
                    line = next(fid)
                    # always print the column header; print CNV rows only for kids
                    # (the sample ID is the second whitespace-separated field)
                    if line.find('FID') >= 0 or line.split()[1] in kids:
                        print(line.rstrip())

RANGE (+/- 0kb )  [ X 101615315 101694929 NXF2 ]
 FID           IID      PHE  CHR          BP1          BP2   TYPE       KB     OLAP   OLAP_U   OLAP_R
   1   CLIA_400138        1   23    101615646    101620272    DUP    4.626        1  0.05812  0.05812
   1   CLIA_400128        2   23    101615646    101620272    DEL    4.626        1  0.05812  0.05812
   1   CLIA_400122        2   23    101615646    101620272    DUP    4.626        1  0.05812  0.05812
   1   CLIA_400149        2   23    101615646    101620272    DUP    4.626        1  0.05812  0.05812
   1   CLIA_400129        2   23    101615646    101620272    DEL    4.626        1  0.05812  0.05812
   1   CCGO_800980        2   23    101615646    101620272    DEL    4.626        1  0.05812  0.05812
   1   CCGO_800979        1   23    101615646    101620272    DUP    4.626        1  0.05812  0.05812
   1   CLIA_400142        1   23    101619868    101620272    DUP    0.404        1 0.005087 0.005087
   1   CLIA_400131        2   23 

CCDC168 could be something. Or we could look into the gene combinations idea as well.

# TODO

* Look at combinations of genes?
* Come up with a permutation test to get p-values here.
* Apply CNV length cleaning.
* Play with the quality knob.
* Follow QC rules from the XHMM papers.
* Look into sex chromosomes? Something to the idea that ADHD is more present in boys...
* Play with the HMM parameters.
* Do population frequency filtering first?
* Include parent burden in the analysis.
* Match with file of ranked simplex by Wendy (maybe blindly)?
* Try other WES CNV callers?
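
For the permutation test item, one possible shape is sketched below. It assumes a per-kid number (say, how many genes from a list are disrupted) and a PLINK-style label per kid (2 = affected, 1 = unaffected), shuffles the labels across kids, and asks how often the shuffled difference in means is at least as large as the observed one. The toy values are made up, not taken from the data, and the function names are hypothetical.

```python
import random

def mean_diff(values, labels):
    """Mean of values in affected kids (label 2) minus mean in unaffected kids (label 1)."""
    aff = [v for v, l in zip(values, labels) if l == 2]
    una = [v for v, l in zip(values, labels) if l == 1]
    return sum(aff) / float(len(aff)) - sum(una) / float(len(una))

def permutation_pvalue(values, labels, n_perm=2000, seed=0):
    """One-sided p-value for 'affected kids have higher values' by label shuffling."""
    rng = random.Random(seed)
    observed = mean_diff(values, labels)
    shuffled = list(labels)
    n_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if mean_diff(values, shuffled) >= observed:
            n_extreme += 1
    # add 1 to numerator and denominator so the p-value is never exactly 0
    return observed, (n_extreme + 1.0) / (n_perm + 1.0)

# made-up toy data: number of disrupted genes per kid, and phenotype labels
toy_values = [1, 1, 1, 1, 0, 0, 0, 0]
toy_labels = [2, 2, 2, 2, 1, 1, 1, 1]
observed, pval = permutation_pvalue(toy_values, toy_labels)
print('observed difference = %.2f, p = %.4f' % (observed, pval))
```

Shuffling labels across kids keeps the number of affected and unaffected kids fixed, so the null distribution reflects the actual group sizes. Testing many genes this way would still need some correction for multiple comparisons.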