
## Diabetes chr6_32604 example -- Which component is important?

- The specific context of this variant and disease is described in this Google Doc:
  - https://docs.google.com/document/d/16GuSasXWX-5qwvKAX5-4VxtrbmsIu9UgrP311_viqQc/edit?usp=sharing
- This notebook shows how to:
  1. Given the SNP, identify which genomic bin contains it
  1. Use the genomic bin squared cosine score to find the top 3 components for that genomic bin
  1. Investigate each top component for the genomic bin
    - Use assay contribution scores to see which assays are important for the component
    - Use genomic bin contribution scores to see which other genomic bins are important for the component
    - Explore the results of the enrichment analysis


In [1]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib, collections, itertools, os, re, textwrap, logging, sys
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import matplotlib.patches as mpatches
from functools import reduce

from logging.config import dictConfig
from logging import getLogger

dictConfig(dict(
    version = 1,
    formatters = {'f': {'format': '%(asctime)s %(name)-12s %(levelname)-8s %(message)s'}},
    handlers = {
        'h': {'class': 'logging.StreamHandler','formatter': 'f',
              'level': logging.DEBUG}},
    root = {'handlers': ['h'], 'level': logging.DEBUG,},
))

matplotlib.rc('font',**{'size':16, 'family':'sans-serif','sans-serif':['HelveticaNeue', 'Helvetica']})

logger = getLogger('notebook')


In [2]:
repo_dir=os.path.realpath(
    os.path.dirname(os.path.dirname(os.getcwd()))
)


In [3]:
data_dir=os.path.realpath(
    os.path.join(os.path.dirname(os.getcwd()), 'private_data')
)

In [4]:
enrichment_data_dir=os.path.join(repo_dir, 'enrichment', 'private_data')


In [5]:
sys.path.append(os.path.join(repo_dir, 'enrichment', 'src'))
from great import read_great_res_wrapper


In [6]:
metadata_dir=os.path.join(
    repo_dir, 'metadata'
)
metadata = pd.read_table(
    os.path.join(metadata_dir, 'sample_antibody_map_v2_with_metadata.tsv'),
    sep='\t',
)

### Step 1: SNP to genomic bin


In [7]:
genomic_bin_df=pd.read_csv(
    os.path.join(repo_dir, 'enrichment', 'private_data', 'loci_def.bed'),
    names=['chr', 'chromStart', 'chromEnd', 'name'],
    sep='\t'
)

In [8]:
genomic_bin_df[genomic_bin_df['name'] == 'chr6_32604']

Unnamed: 0,chr,chromStart,chromEnd,name
143278,chr6,32604000,32605000,chr6_32604


The row label in the first column shows that the index of our genomic bin of interest is 143278.

In [9]:
genomic_bin_idx=143278
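
The lookup above starts from the bin name. When only a SNP coordinate is available, the bin can be found by an interval lookup instead. A minimal sketch, assuming that bins are 1 kb wide and named `<chr>_<chromStart // 1000>` (as the `chr6_32604` row suggests), that the BED intervals are 0-based and half-open, and using a hypothetical SNP position and a one-row stand-in for `loci_def.bed`:

```python
import pandas as pd

def snp_to_bin_name(chrom, pos, bin_size=1000):
    """Name of the bin containing a 0-based position (assumes fixed-width bins)."""
    return '{}_{}'.format(chrom, pos // bin_size)

def find_bin_index(bins, chrom, pos):
    """Row labels of bins whose half-open interval [chromStart, chromEnd) contains pos."""
    hit = bins[(bins['chr'] == chrom)
               & (bins['chromStart'] <= pos)
               & (pos < bins['chromEnd'])]
    return hit.index.tolist()

# one-row stand-in for loci_def.bed, keeping the row label shown above
toy_bins = pd.DataFrame(
    {'chr': ['chr6'], 'chromStart': [32604000], 'chromEnd': [32605000],
     'name': ['chr6_32604']},
    index=[143278],
)

snp_pos = 32604372  # hypothetical position, for illustration only
bin_name = snp_to_bin_name('chr6', snp_pos)
bin_index = find_bin_index(toy_bins, 'chr6', snp_pos)
print(bin_name, bin_index)
```

The interval lookup does not depend on the naming convention, so it is the safer of the two when the bin definition file is available.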

### Step 2: Which component is important for a given genomic bin -- genomic bin squared cosine score
- Let's write our decomposition as X = UDV', where X is the input feature matrix, D is the diagonal matrix of singular values, U is the left singular vector matrix (assay space), V is the right singular vector matrix (genomic bin space), and `'` denotes matrix transposition.
- The genomic bin squared cosine score is computed from the matrix product VD: each row (one genomic bin) is L2-normalized to Euclidean norm 1 and its entries are then squared, so the scores of a given genomic bin sum to 1 across components.
- The score represents the relative importance of each component for a given genomic bin.
- More formal definition:
  - https://docs.google.com/document/d/1YRuaIvHvjb_6SJwlml1dQDegiGlGbdfz_zN-5bneroE/edit?usp=sharing
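
As a small numerical illustration of this definition, the sketch below builds a toy 4 x 6 matrix (4 assays, 6 genomic bins; random values, not real data), decomposes it with `numpy.linalg.svd`, and checks that the squared cosine scores of each bin sum to 1. It also checks the identity X'U = VD, which follows from X = UDV' and U'U = I: row i of VD holds the coordinates of genomic bin i on the components.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(4, 6)                       # toy input: 4 assays x 6 genomic bins

U, d, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T                                 # 6 bins x 4 components

VD = V * d                               # same as V.dot(np.diag(d))
cos2 = VD ** 2 / np.sum(VD ** 2, axis=1, keepdims=True)

print(np.allclose(VD, X.T.dot(U)))       # checks VD against X'U
print(np.allclose(cos2.sum(axis=1), 1))  # checks that each bin's scores sum to 1
```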
 

#### read the decomposed matrices

In [10]:
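def read_decomposed_matrix(filename, compression=None):
    """Read a CSV whose first column is an index and whose other columns form the matrix."""
    # infer gzip compression from the file extension when it is not given explicitly
    if compression is None and filename.endswith('.gz'):
        compression = 'gzip'
    df = pd.read_csv(
        os.path.join(data_dir, filename),  # an absolute filename is used as-is
        compression=compression
    )
    # DataFrame.as_matrix() was removed in pandas 1.0; .values works on old and new versions
    mat = df.iloc[:, 1:].values
    idx = df.iloc[:, 0].values
    return mat, idx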

In [11]:
d_mat_temp, d_idx = read_decomposed_matrix(os.path.join(data_dir, 'diagonalScore.csv.gz'))
d_vec = d_mat_temp[:, 0]


In [12]:
u_mat, u_idx = read_decomposed_matrix(os.path.join(data_dir, 'uScore.csv.gz'))


In [14]:
v_mat, v_idx = read_decomposed_matrix(os.path.join(data_dir, 'vScore.csv.gz'))


In [15]:
d_vec.shape, u_mat.shape, v_mat.shape, d_idx.shape, u_idx.shape, v_idx.shape

((652,), (652, 652), (379541, 652), (652,), (652,), (379541,))

#### compute matrix products, UD and VD

In [16]:
u_dot_d = np.dot(u_mat, np.diag(d_vec))


In [17]:
v_dot_d = np.dot(v_mat, np.diag(d_vec))


In [18]:
u_dot_d.shape, v_dot_d.shape

((652, 652), (379541, 652))
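
Because D is diagonal, multiplying by `np.diag(d_vec)` only rescales each column by the matching singular value, so NumPy broadcasting gives the same product without building the dense diagonal matrix. A small check on toy arrays (`v_toy` and `d_toy` are illustrative stand-ins, not the notebook's data):

```python
import numpy as np

rng = np.random.RandomState(1)
v_toy = rng.rand(5, 3)             # stand-in for v_mat (bins x components)
d_toy = np.array([3.0, 2.0, 0.5])  # stand-in for d_vec (singular values)

via_diag = np.dot(v_toy, np.diag(d_toy))  # the form used in the cells above
via_broadcast = v_toy * d_toy             # column j scaled by d_toy[j]

print(np.allclose(via_diag, via_broadcast))
```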

#### compute normalized matrices
- v_dot_d_find_pcs: genomic bin --> which PC? genomic bin squared cosine score (each row sums to 1).
- u_dot_d_find_pcs: assay       --> which PC? assay squared cosine score (each row sums to 1).
- v_dot_d_find_loci: PC --> which genomic bins? genomic bin contribution score (each column sums to 1).
- u_dot_d_find_assay: PC --> which assay? assay contribution score (each column sums to 1).

In [19]:
v_dot_d_find_pcs = (v_dot_d ** 2 ) / (np.sum(v_dot_d ** 2, axis = 1)[:,np.newaxis])


In [20]:
u_dot_d_find_pcs = (u_dot_d ** 2 ) / (np.sum(u_dot_d ** 2, axis = 1)[:,np.newaxis])


In [21]:
v_dot_d_find_loci = (v_dot_d ** 2 ) / (np.sum(v_dot_d ** 2, axis = 0)[np.newaxis, :])


In [22]:
u_dot_d_find_assay = (u_dot_d ** 2 ) / (np.sum(u_dot_d ** 2, axis = 0)[np.newaxis, :])
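
The four cells above apply two normalizations to the same squared matrix: dividing by row sums gives squared cosine scores (which component matters for a row), and dividing by column sums gives contribution scores (which row matters for a component). A minimal sketch with hypothetical helper names `squared_cosine` and `contribution`, checked on a toy matrix:

```python
import numpy as np

def squared_cosine(mat):
    """Squared entries divided by their row sums: each row sums to 1."""
    sq = mat ** 2
    return sq / sq.sum(axis=1, keepdims=True)

def contribution(mat):
    """Squared entries divided by their column sums: each column sums to 1."""
    sq = mat ** 2
    return sq / sq.sum(axis=0, keepdims=True)

toy = np.array([[3.0, 4.0],
                [1.0, 0.0],
                [0.0, 2.0]])

cos2 = squared_cosine(toy)
contrib = contribution(toy)
print(cos2[0])        # row 0: 9/25 and 16/25
print(contrib[:, 0])  # column 0: 9/10, 1/10 and 0
```

With the notebook's arrays, `squared_cosine(v_dot_d)` would correspond to `v_dot_d_find_pcs` and `contribution(v_dot_d)` to `v_dot_d_find_loci`.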


#### let's identify the top 3 important components for the genomic bin chr6_32604 (index: 143278)

In [23]:
np.argsort(-v_dot_d_find_pcs[genomic_bin_idx, :])[:5]

array([39,  6, 10, 88,  5])

In [24]:
v_dot_d_find_pcs[genomic_bin_idx, np.argsort(-v_dot_d_find_pcs[genomic_bin_idx, :])[:5]]

array([0.06118546, 0.05984606, 0.05604134, 0.03388895, 0.03354926])

This means PC39 (0-based index) is the most important component for this bin, with 6.1% of the squared cosine score; PC6 is the second most important with 6.0%, PC10 the third with 5.6%, and so on.
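
The two cells above sort the same row twice. A small helper (hypothetical name `top_k`) can return the indices and their scores together from a single sort; it is shown here on a toy score vector:

```python
import numpy as np

def top_k(scores, k=5):
    """Return (indices, values) of the k largest entries, largest first."""
    order = np.argsort(-scores)[:k]
    return order, scores[order]

toy_scores = np.array([0.05, 0.30, 0.10, 0.40, 0.15])
idx, vals = top_k(toy_scores, k=3)
print(idx, vals)
```

In the notebook this would be called as `top_k(v_dot_d_find_pcs[genomic_bin_idx, :])`.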

### Step 3: investigation of the components

#### PC39 (the top component)

We will investigate 

1. What assays are driving this component?
1. What genomic loci are driving this component?
1. What are the top hits in the enrichment analysis?

In [25]:
component_idx=39

#### what assays are driving this component?

In [26]:
np.argsort(-u_dot_d_find_assay[:, component_idx])[:5]

array([ 47, 413, 415, 416, 225])

In [27]:
u_dot_d_find_assay[np.argsort(-u_dot_d_find_assay[:, component_idx])[:5], component_idx]

array([0.20334155, 0.05140923, 0.03698147, 0.026487  , 0.02527061])

In [28]:
metadata.iloc[np.argsort(-u_dot_d_find_assay[:, component_idx])[:5], :]

Unnamed: 0,sample_number,antibody,Genome assembly,Antigen class,Antigen,Cell type class,Cell type,Cell type description,Processing logs,Title,...,age,treatment,genotype,lab,age.1,health state,cell_type,tissue_type,provider,sex
47,ERX329658,,hg19,Unclassified,Unclassified,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"252527744,90.3,66.9,58525",Illumina HiSeq 2000 sequencing; Coordinated ef...,...,,,,,,,NA19238,,,female
413,SRX627259,h3k27ac,hg19,Histone,H3K27ac,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"272510128,93.1,45.9,72497",GSM1420885: GM19138 H3K27ac; Homo sapiens; ChI...,...,,,,,,,GM19138,,Coriell Cell Repositories http://ccr.coriell.o...,
415,SRX627261,h3k27ac,hg19,Histone,H3K27ac,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"262509163,94.1,43.9,71289",GSM1420887: GM19201 H3K27ac; Homo sapiens; ChI...,...,,,,,,,GM19201,,Coriell Cell Repositories http://ccr.coriell.o...,
416,SRX627262,h3k27ac,hg19,Histone,H3K27ac,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"262571507,93.4,35.9,76151",GSM1420888: GM19119 H3K27ac; Homo sapiens; ChI...,...,,,,,,,GM19119,,Coriell Cell Repositories http://ccr.coriell.o...,
225,SRX2778371,,hg19,Histone,H3K27ac,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"63939145,87.2,10.0,32448",GSM2597211: H3K27Ac ChIPseq in HEP14 0079 LCLs...,...,,,,,,,,,,
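
The full metadata table is wide, so it can help to keep only a few columns and attach the contribution score to each row. A sketch on toy data: the column names are taken from the table above, while `summarize_top_assays` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def summarize_top_assays(metadata, scores, k=5,
                         columns=('sample_number', 'Antigen', 'Cell type')):
    """Top-k rows of metadata by score, with the score attached as a column."""
    order = np.argsort(-scores)[:k]
    out = metadata.iloc[order][list(columns)].copy()
    out['contribution'] = scores[order]
    return out

# toy stand-ins for the metadata table and one column of u_dot_d_find_assay
toy_metadata = pd.DataFrame({
    'sample_number': ['S0', 'S1', 'S2'],
    'Antigen': ['H3K27ac', 'Unclassified', 'H3K27ac'],
    'Cell type': ['Lymphoblastoid cell line'] * 3,
})
toy_scores = np.array([0.2, 0.7, 0.1])

summary = summarize_top_assays(toy_metadata, toy_scores, k=2)
print(summary)
```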


#### what genomic bins are driving this component?

In [29]:
np.argsort(-v_dot_d_find_loci[:, component_idx])[:5]

array([ 16462,  16461,   2563, 132423, 303976])

In [30]:
v_dot_d_find_loci[np.argsort(-v_dot_d_find_loci[:, component_idx])[:5], component_idx]

array([0.00017904, 0.00016561, 0.00014659, 0.00013263, 0.00012933])

These genomic bins are important for PC39. Note that the genomic bin contribution scores are very small compared to the assay contribution scores. This is expected because the scores for a component sum to 1 and are spread over 379,541 genomic bins in the whole-genome analysis, compared with 652 assays.

#### where is our locus of interest, chr6_32604 (index: 143278), in this ranking?

In [31]:
np.sum(v_dot_d_find_loci[:, component_idx] >= v_dot_d_find_loci[
    genomic_bin_idx, component_idx])

4558

In [32]:
np.sum(v_dot_d_find_loci[:, component_idx] >= v_dot_d_find_loci[
    genomic_bin_idx, component_idx]) / v_dot_d_find_loci.shape[0]

0.012009242743208243

It is roughly in the top 1.2% of genomic bins for this component.
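
The rank is the number of genomic bins whose contribution score is at least that of the bin of interest (ties included), and the fraction divides that count by the total number of bins, here 4558 / 379541. A minimal sketch with a hypothetical helper `rank_and_fraction` on a toy score vector:

```python
import numpy as np

def rank_and_fraction(scores, idx):
    """Count entries >= scores[idx] and express the count as a fraction of all entries."""
    rank = int(np.sum(scores >= scores[idx]))
    return rank, rank / float(len(scores))

toy_scores = np.array([0.1, 0.4, 0.2, 0.3])
rank, frac = rank_and_fraction(toy_scores, 2)
print(rank, frac)

# the arithmetic behind the fraction reported above
reported = 4558 / 379541.0
print(reported)
```

In the notebook the call would be `rank_and_fraction(v_dot_d_find_loci[:, component_idx], genomic_bin_idx)`.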


#### results of the enrichment analysis

In [33]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'HumanPhenotypeOntology').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
4,HP:0004332,Abnormality of lymphocytes,1.508808e-10,2.019461
5,HP:0002846,Abnormality of B cells,3.167175e-10,2.186231
6,HP:0010701,Abnormal immunoglobulin level,4.113557e-10,2.198288
7,HP:0005372,Abnormality of B cell physiology,5.608107e-10,2.182674
9,HP:0002960,Autoimmunity,6.655998e-09,2.465546
11,HP:0002850,IgM deficiency,1.992062e-08,4.19633
13,HP:0002621,Atherosclerosis,5.459813e-08,2.228612
15,HP:0002634,Arteriosclerosis,8.772e-08,2.195991
16,HP:0004313,Hypogammaglobulinemia,8.781575e-08,2.101667
19,HP:0003049,Ulnar deviation of the wrist,2.243515e-07,7.139804


In [34]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'GOBiologicalProcess').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
10,GO:0048872,homeostasis of number of cells,3.10768e-19,2.057292
19,GO:0002429,immune response-activating cell surface recept...,3.968115e-17,2.012773
35,GO:0030595,leukocyte chemotaxis,2.791925e-14,2.291774
47,GO:0006909,phagocytosis,3.099261e-13,2.005439
50,GO:0038096,Fc-gamma receptor signaling pathway involved i...,4.648642e-13,2.390534
51,GO:0002431,Fc receptor mediated stimulatory signaling pat...,4.84922e-13,2.388601
52,GO:0038094,Fc-gamma receptor signaling pathway,4.950468e-13,2.387656
55,GO:0002262,myeloid cell homeostasis,1.087754e-12,2.120652
73,GO:0046637,regulation of alpha-beta T cell differentiation,6.950873e-12,2.548473
74,GO:0097193,intrinsic apoptotic signaling pathway,7.276055e-12,2.023523


In [35]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'MGIPhenotype').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
33,MP:0002498,abnormal acute inflammation,2.0416020000000003e-23,2.019778
45,MP:0005153,abnormal B cell proliferation,1.340759e-20,2.148382
46,MP:0005087,decreased acute inflammation,1.585143e-20,2.276109
50,MP:0008217,abnormal B cell activation,7.056544e-20,2.083458
53,MP:0005068,abnormal NK cell morphology,1.945316e-19,2.282522
56,MP:0008043,abnormal NK cell number,2.886092e-19,2.349993
61,MP:0000702,enlarged lymph nodes,6.721281999999999e-19,2.045753
62,MP:0001876,decreased inflammatory response,6.916684999999999e-19,2.01875
68,MP:0008781,abnormal B cell apoptosis,4.51998e-18,2.612855
70,MP:0008210,increased mature B cell number,1.4939090000000003e-17,2.155396


In [36]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'MGIPhenoSingleKO').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
43,MP:0005153,abnormal B cell proliferation,6.658585999999999e-19,2.29431
46,MP:0008217,abnormal B cell activation,1.834708e-18,2.237208
50,MP:0011762,renal/urinary system inflammation,6.732876e-18,2.115589
51,MP:0005068,abnormal NK cell morphology,9.651842000000001e-18,2.359271
53,MP:0008043,abnormal NK cell number,9.046579000000001e-17,2.426455
61,MP:0002148,abnormal hypersensitivity reaction,7.926478e-16,2.213092
64,MP:0005095,decreased T cell proliferation,1.308548e-15,2.034887
69,MP:0005087,decreased acute inflammation,3.49431e-15,2.159936
71,MP:0004149,increased bone strength,4.202318e-15,5.57754
73,MP:0008781,abnormal B cell apoptosis,6.751364e-15,3.027338


In [37]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'GOCellularComponent').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
0,GO:0005925,focal adhesion,9.764139e-16,2.154349
1,GO:0005924,cell-substrate adherens junction,4.952786e-15,2.107207
2,GO:0030055,cell-substrate junction,9.267956e-15,2.016888
20,GO:0031228,intrinsic to Golgi membrane,1.150887e-07,2.084988
22,GO:0030173,integral to Golgi membrane,7.428579e-07,2.035796
27,GO:0002102,podosome,1.708912e-06,2.902913
28,GO:0001673,male germ cell nucleus,2.70234e-06,3.481693
29,GO:0043073,germ cell nucleus,3.139255e-06,3.026767
32,GO:0031526,brush border membrane,4.623854e-06,2.315726
36,GO:0005826,actomyosin contractile ring,9.448244e-06,5.419206


In [38]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'GOMolecularFunction').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
21,GO:0042379,chemokine receptor binding,3.518052e-08,2.959699
31,GO:0071889,14-3-3 protein binding,5.683577e-07,3.162299
32,GO:0008009,chemokine activity,7.881759e-07,2.829485
33,GO:0004826,phenylalanine-tRNA ligase activity,8.912586e-07,6.249892
35,GO:0017112,Rab guanyl-nucleotide exchange factor activity,2.07332e-06,2.748608
39,GO:0035035,histone acetyltransferase binding,3.48109e-06,2.674145
40,GO:0042975,peroxisome proliferator activated receptor bin...,3.565991e-06,4.099304
46,GO:0005138,interleukin-6 receptor binding,1.492873e-05,4.078269
50,GO:0008432,JUN kinase binding,1.744005e-05,5.622004
54,GO:0005160,transforming growth factor beta receptor binding,3.408384e-05,2.281372


#### PC6 (the second component)


In [39]:
component_idx=6

#### what assays are driving this component?

In [40]:
np.argsort(-u_dot_d_find_assay[:, component_idx])[:5]

array([ 10, 102,  43,  35,  55])

In [41]:
u_dot_d_find_assay[np.argsort(-u_dot_d_find_assay[:, component_idx])[:5], component_idx]

array([0.18663573, 0.086168  , 0.07523161, 0.05274594, 0.05010345])

In [42]:
metadata.iloc[np.argsort(-u_dot_d_find_assay[:, component_idx])[:5], :]

Unnamed: 0,sample_number,antibody,Genome assembly,Antigen class,Antigen,Cell type class,Cell type,Cell type description,Processing logs,Title,...,age,treatment,genotype,lab,age.1,health state,cell_type,tissue_type,provider,sex
10,ERX329621,,hg19,Unclassified,Unclassified,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"219047965,92.5,75.8,70997",Illumina HiSeq 2000 sequencing; Coordinated ef...,...,,,,,,,NA12891,,,male
102,ERX329713,,hg19,Unclassified,Unclassified,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"235003947,93.0,82.0,55388",Illumina HiSeq 2000 sequencing; Coordinated ef...,...,,,,,,,NA19239,,,male
43,ERX329654,,hg19,Unclassified,Unclassified,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"74449496,92.0,31.2,25320",Illumina HiSeq 2000 sequencing; Coordinated ef...,...,,,,,,,NA11831,,,male
35,ERX329646,,hg19,Unclassified,Unclassified,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"62384167,93.2,16.4,27131",Illumina HiSeq 2000 sequencing; Coordinated ef...,...,,,,,,,NA11881,,,male
55,ERX329666,,hg19,Unclassified,Unclassified,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"76235942,92.7,15.1,28768",Illumina HiSeq 2000 sequencing; Coordinated ef...,...,,,,,,,NA11840,,,female


#### what genomic bins are driving this component?

In [43]:
np.argsort(-v_dot_d_find_loci[:, component_idx])[:5]

array([211314, 347349, 167403,  12806,  12805])

In [44]:
v_dot_d_find_loci[np.argsort(-v_dot_d_find_loci[:, component_idx])[:5], component_idx]

array([9.74653295e-05, 9.42709371e-05, 9.39630245e-05, 9.37192353e-05,
       9.21795390e-05])

#### where is our locus of interest, chr6_32604 (index: 143278), in this ranking?

In [45]:
np.sum(v_dot_d_find_loci[:, component_idx] >= v_dot_d_find_loci[genomic_bin_idx, component_idx])

30140

In [46]:
np.sum(v_dot_d_find_loci[:, component_idx] >= v_dot_d_find_loci[genomic_bin_idx, component_idx]) / v_dot_d_find_loci.shape[0]

0.07941171046079343

It is roughly in the top 8% of genomic bins for this component.


#### results of the enrichment analysis

In [47]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'HumanPhenotypeOntology').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
5,HP:0002633,Vasculitis,2.124781e-09,3.254925
7,HP:0010622,Neoplasm of the skeletal system,3.157741e-09,2.848685
8,HP:0003180,Flat acetabular roof,4.132922e-09,5.580264
12,HP:0002960,Autoimmunity,1.393192e-08,2.433056
15,HP:0002665,Lymphoma,2.574896e-08,2.209538
16,HP:0003170,Abnormality of the acetabulum,2.993326e-08,2.92599
19,HP:0001658,Myocardial infarction,5.354662e-08,4.502097
20,HP:0000132,Menorrhagia,5.8806e-08,4.965326
21,HP:0002135,Basal ganglia calcification,8.458347e-08,3.61186
22,HP:0002843,Abnormality of T cells,1.587267e-07,2.604574


In [48]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'GOBiologicalProcess').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
1,GO:0002757,immune response-activating signal transduction,8.572361e-41,2.295897
2,GO:0050778,positive regulation of immune response,4.826825e-38,2.04654
3,GO:0002253,activation of immune response,2.004579e-37,2.166998
9,GO:0002429,immune response-activating cell surface recept...,5.641257000000001e-28,2.37349
15,GO:0045088,regulation of innate immune response,3.3659649999999996e-24,2.10956
16,GO:0019221,cytokine-mediated signaling pathway,5.15692e-24,2.048872
19,GO:0097190,apoptotic signaling pathway,1.714413e-22,2.009059
23,GO:0050851,antigen receptor-mediated signaling pathway,7.368084999999999e-20,2.371907
26,GO:0002221,pattern recognition receptor signaling pathway,3.2936719999999995e-19,2.240385
27,GO:0043122,regulation of I-kappaB kinase/NF-kappaB cascade,4.854493999999999e-19,2.085819


In [49]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'MGIPhenotype').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
18,MP:0001828,abnormal T cell activation,1.6293429999999998e-36,2.054944
22,MP:0001844,autoimmune response,4.427369e-35,2.06649
23,MP:0005000,abnormal immune tolerance,5.041626e-35,2.052396
24,MP:0005005,abnormal self tolerance,2.4716549999999998e-34,2.044852
27,MP:0002425,altered susceptibility to autoimmune disorder,1.888043e-33,2.213413
28,MP:0005094,abnormal T cell proliferation,4.6412440000000006e-33,2.065962
31,MP:0008217,abnormal B cell activation,1.51646e-31,2.441083
33,MP:0002357,abnormal spleen white pulp morphology,6.913712e-31,2.08421
45,MP:0000702,enlarged lymph nodes,2.24223e-28,2.345243
46,MP:0005153,abnormal B cell proliferation,2.963331e-28,2.398298


In [50]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'MGIPhenoSingleKO').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
6,MP:0002459,abnormal B cell physiology,3.216056e-49,2.078873
22,MP:0001844,autoimmune response,3.257563e-33,2.247717
24,MP:0005000,abnormal immune tolerance,1.5271820000000002e-32,2.213924
26,MP:0005005,abnormal self tolerance,2.477071e-32,2.214608
29,MP:0008217,abnormal B cell activation,7.836361e-31,2.706302
30,MP:0002425,altered susceptibility to autoimmune disorder,5.4160479999999996e-30,2.337811
34,MP:0008171,abnormal mature B cell morphology,6.807466000000001e-28,2.18201
35,MP:0000691,enlarged spleen,7.524606e-28,2.054668
37,MP:0001828,abnormal T cell activation,2.567321e-27,2.086158
41,MP:0008713,abnormal cytokine level,3.5508109999999996e-26,2.088451


In [51]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'GOCellularComponent').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
9,GO:0031094,platelet dense tubular network,2.394674e-10,5.336451
15,GO:0071682,endocytic vesicle lumen,2.593774e-09,5.427595
20,GO:0031228,intrinsic to Golgi membrane,8.814481e-09,2.196942
22,GO:0042611,MHC protein complex,1.732951e-08,4.831563
23,GO:0097208,alveolar lamellar body,1.778326e-08,8.062619
26,GO:0030173,integral to Golgi membrane,6.166199e-08,2.153514
31,GO:0031095,platelet dense tubular network membrane,1.737179e-07,4.607586
33,GO:0042599,lamellar body,2.045241e-07,5.94683
34,GO:0097342,ripoptosome,3.640009e-07,8.765341
46,GO:0001520,outer dense fiber,3.061652e-06,6.119236


In [52]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'GOMolecularFunction').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
1,GO:0005126,cytokine receptor binding,2.201581e-17,2.035422
6,GO:0002020,protease binding,1.2714e-10,2.55885
10,GO:0016835,carbon-oxygen lyase activity,2.25188e-09,2.474315
11,GO:0051861,glycolipid binding,4.785386e-09,4.06044
16,GO:0032393,MHC class I receptor activity,6.452181e-08,7.20313
17,GO:0005086,ARF guanyl-nucleotide exchange factor activity,6.820259e-08,3.11655
19,GO:0016836,hydro-lyase activity,1.03472e-07,2.542034
20,GO:0045236,CXCR chemokine receptor binding,1.167359e-07,4.494398
25,GO:0003755,peptidyl-prolyl cis-trans isomerase activity,2.664467e-07,2.421004
27,GO:0046625,sphingolipid binding,3.306067e-07,3.698747


#### PC10 (the third component)


In [53]:
component_idx=10

#### what assays are driving this component?

In [54]:
np.argsort(-u_dot_d_find_assay[:, component_idx])[:5]

array([10, 43, 35, 55, 83])

In [55]:
u_dot_d_find_assay[np.argsort(-u_dot_d_find_assay[:, component_idx])[:5], component_idx]

array([0.63088412, 0.04906086, 0.03482029, 0.03227257, 0.01871496])

In [56]:
metadata.iloc[np.argsort(-u_dot_d_find_assay[:, component_idx])[:5], :]

Unnamed: 0,sample_number,antibody,Genome assembly,Antigen class,Antigen,Cell type class,Cell type,Cell type description,Processing logs,Title,...,age,treatment,genotype,lab,age.1,health state,cell_type,tissue_type,provider,sex
10,ERX329621,,hg19,Unclassified,Unclassified,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"219047965,92.5,75.8,70997",Illumina HiSeq 2000 sequencing; Coordinated ef...,...,,,,,,,NA12891,,,male
43,ERX329654,,hg19,Unclassified,Unclassified,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"74449496,92.0,31.2,25320",Illumina HiSeq 2000 sequencing; Coordinated ef...,...,,,,,,,NA11831,,,male
35,ERX329646,,hg19,Unclassified,Unclassified,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"62384167,93.2,16.4,27131",Illumina HiSeq 2000 sequencing; Coordinated ef...,...,,,,,,,NA11881,,,male
55,ERX329666,,hg19,Unclassified,Unclassified,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"76235942,92.7,15.1,28768",Illumina HiSeq 2000 sequencing; Coordinated ef...,...,,,,,,,NA11840,,,female
83,ERX329694,,hg19,Unclassified,Unclassified,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"235656031,90.4,63.9,65741",Illumina HiSeq 2000 sequencing; Coordinated ef...,...,,,,,,,NA19240,,,female


#### what genomic bins are driving this component?

In [57]:
np.argsort(-v_dot_d_find_loci[:, component_idx])[:5]

array([369257, 284510, 101812, 331080, 331081])

In [58]:
v_dot_d_find_loci[np.argsort(-v_dot_d_find_loci[:, component_idx])[:5], component_idx]

array([0.00013737, 0.00013737, 0.00013737, 0.00013737, 0.00013737])

#### where is our locus of interest, chr6_32604 (index: 143278), in this ranking?

In [59]:
np.sum(v_dot_d_find_loci[:, component_idx] >= v_dot_d_find_loci[genomic_bin_idx, component_idx])

10201

In [60]:
np.sum(v_dot_d_find_loci[:, component_idx] >= v_dot_d_find_loci[genomic_bin_idx, component_idx]) / v_dot_d_find_loci.shape[0]

0.026877201672546577

It is roughly in the top 3% (2.7%) of genomic bins for this component.


#### results of the enrichment analysis

In [61]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'HumanPhenotypeOntology').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
0,HP:0001230,Broad metacarpals,8.115565e-10,7.099386
1,HP:0000040,Enlarged penis,5.145653e-08,5.691914
5,HP:0000926,Platyspondyly,5.663768e-07,2.054329
6,HP:0002718,Recurrent bacterial infections,1.265508e-06,2.109088
7,HP:0000444,Convex nasal ridge,1.281831e-06,2.18483
8,HP:0000706,Unerupted tooth,1.469134e-06,5.434531
9,HP:0000244,Brachyturricephaly,3.353481e-06,4.120014
10,HP:0008213,Gonadotropin deficiency,4.043268e-06,3.852303
11,HP:0001783,Broad metatarsal,4.282452e-06,3.659682
14,HP:0002841,Recurrent fungal infections,8.237882e-06,2.782215


In [62]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'GOBiologicalProcess').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
27,GO:0050864,regulation of B cell activation,1.246799e-12,2.069687
33,GO:0031348,negative regulation of defense response,7.671825e-12,2.075798
39,GO:0050871,positive regulation of B cell activation,3.014162e-11,2.276154
48,GO:0030888,regulation of B cell proliferation,1.547751e-10,2.209583
57,GO:0050728,negative regulation of inflammatory response,2.299149e-10,2.028347
66,GO:0090080,positive regulation of MAPKKK cascade by fibro...,1.411751e-09,6.367797
69,GO:0050821,protein stabilization,1.897199e-09,2.122147
70,GO:0001776,leukocyte homeostasis,2.14147e-09,2.12787
71,GO:0070664,negative regulation of leukocyte proliferation,2.21743e-09,2.126101
73,GO:0072661,protein targeting to plasma membrane,2.90639e-09,3.350675


In [63]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'MGIPhenotype').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
36,MP:0003304,large intestinal inflammation,3.252003e-18,2.237431
48,MP:0001858,intestinal inflammation,6.562998000000001e-17,2.056753
50,MP:0002816,colitis,1.94407e-16,2.233376
58,MP:0008537,increased susceptibility to induced colitis,2.001701e-15,2.645459
65,MP:0005061,abnormal eosinophil morphology,5.328799e-15,2.355283
80,MP:0002602,abnormal eosinophil cell number,1.361115e-13,2.307031
81,MP:0005011,increased eosinophil cell number,1.735878e-13,2.819397
96,MP:0008734,decreased susceptibility to endotoxin shock,2.345921e-12,2.515747
108,MP:0001348,abnormal lacrimal gland physiology,7.265134e-12,3.695595
114,MP:0008781,abnormal B cell apoptosis,9.860526e-12,2.165142


In [64]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'MGIPhenoSingleKO').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
36,MP:0000701,abnormal lymph node size,7.716374e-18,2.075912
37,MP:0001858,intestinal inflammation,1.9839160000000003e-17,2.37551
40,MP:0000702,enlarged lymph nodes,4.6663250000000003e-17,2.319838
55,MP:0005153,abnormal B cell proliferation,8.763034e-15,2.060162
57,MP:0003304,large intestinal inflammation,1.297915e-14,2.311044
58,MP:0002816,colitis,1.467412e-14,2.36144
60,MP:0008217,abnormal B cell activation,2.290063e-14,2.010439
71,MP:0005061,abnormal eosinophil morphology,1.115737e-12,2.422097
73,MP:0001243,abnormal dermal layer morphology,3.154795e-12,2.371622
74,MP:0004007,abnormal lung vasculature morphology,4.364921e-12,2.034649


In [65]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'GOCellularComponent').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
4,GO:0060205,cytoplasmic membrane-bounded vesicle lumen,8.257653e-10,2.117215
5,GO:0034774,secretory granule lumen,2.27962e-08,2.075585
6,GO:0031093,platelet alpha granule lumen,4.920995e-08,2.168526
7,GO:0031091,platelet alpha granule,1.962686e-07,2.014432
10,GO:0005883,neurofilament,5.862962e-06,2.772038
19,GO:0060053,neurofilament cytoskeleton,3.662182e-05,2.347531
21,GO:0031526,brush border membrane,5.465379e-05,2.080998
22,GO:0005577,fibrinogen complex,5.520804e-05,3.004317
23,GO:0005890,sodium:potassium-exchanging ATPase complex,8.452345e-05,3.462752
35,GO:0005801,cis-Golgi network,0.0005340074,2.083606


In [66]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'GOMolecularFunction').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
10,GO:0001968,fibronectin binding,7.941593e-09,2.83918
17,GO:0017134,fibroblast growth factor binding,8.81579e-07,2.494686
19,GO:0005007,fibroblast growth factor-activated receptor ac...,3.67121e-06,4.336281
21,GO:0071813,lipoprotein particle binding,3.882586e-06,2.198986
22,GO:0002020,protease binding,4.184221e-06,2.003044
24,GO:0043184,vascular endothelial growth factor receptor 2 ...,7.350196e-06,4.351079
27,GO:0030169,low-density lipoprotein particle binding,1.136205e-05,2.249463
28,GO:0015101,organic cation transmembrane transporter activity,1.276221e-05,3.022884
29,GO:0043236,laminin binding,1.417658e-05,2.286159
31,GO:0015651,quaternary ammonium group transmembrane transp...,2.777014e-05,3.458736
