
## Asthma rs4795397 example -- Which component is important?

- The specific context of this variant and disease is described in this google doc:
  - https://docs.google.com/document/d/16GuSasXWX-5qwvKAX5-4VxtrbmsIu9UgrP311_viqQc/edit?usp=sharing
- This notebook shows how to:
  1. Given the SNP, identify which genomic bin contains the SNP
  1. Use the genomic bin squared cosine score to find the top 3 important components for the genomic bin
  1. Investigate the top components for the genomic bin
    - Use assay contribution scores to see which assays are important for the component
    - Use genomic bin contribution scores to see which other genomic bins are important for the component
    - Explore the results of the enrichment analysis


In [1]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib, collections, itertools, os, re, textwrap, logging, sys
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import matplotlib.patches as mpatches
from functools import reduce

from logging.config import dictConfig
from logging import getLogger

dictConfig(dict(
    version = 1,
    formatters = {'f': {'format': '%(asctime)s %(name)-12s %(levelname)-8s %(message)s'}},
    handlers = {
        'h': {'class': 'logging.StreamHandler','formatter': 'f',
              'level': logging.DEBUG}},
    root = {'handlers': ['h'], 'level': logging.DEBUG,},
))

matplotlib.rc('font',**{'size':16, 'family':'sans-serif','sans-serif':['HelveticaNeue', 'Helvetica']})

logger = getLogger('notebook')


In [2]:
repo_dir=os.path.realpath(
    os.path.dirname(os.path.dirname(os.getcwd()))
)


In [3]:
data_dir=os.path.realpath(
    os.path.join(os.path.dirname(os.getcwd()), 'private_data')
)

In [56]:
enrichment_data_dir=os.path.join(repo_dir, 'enrichment', 'private_data')


In [4]:
sys.path.append(os.path.join(repo_dir, 'enrichement', 'src'))
from great import read_great_res_wrapper


### Step 1: SNP to genomic bin
- `rs4795397` is on chr17:38023745 (hg19)
  - https://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=4795397
- The genomic bins are 1 kb wide, so the corresponding bin is `chr17_38023` (chr17:38023000-38024000)
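The mapping from a position to a bin name can be sketched as follows. This is a minimal illustration, assuming fixed 1 kb bins named `<chr>_<start // 1000>`, which is inferred from the `loci_def.bed` row for this bin; `snp_to_bin_name` is a hypothetical helper, not part of the repository.

```python
def snp_to_bin_name(chrom, pos, bin_size=1000):
    """Return the bin name for a position.

    Assumes fixed-width bins named '<chrom>_<start // bin_size>'
    (an inference from the loci_def.bed row, not a documented rule).
    """
    return '{}_{}'.format(chrom, pos // bin_size)


print(snp_to_bin_name('chr17', 38023745))  # chr17_38023
```

dbSNP positions are 1-based while BED intervals are 0-based and half-open, so a position that is an exact multiple of the bin size would need an off-by-one check. That does not affect this SNP.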

In [34]:
genomic_bin_df=pd.read_csv(
    os.path.join(repo_dir, 'enrichment', 'private_data', 'loci_def.bed'),
    names=['chr', 'chromStart', 'chromEnd', 'name'],
    sep='\t'
)

In [36]:
genomic_bin_df[genomic_bin_df['name'] == 'chr17_38023']

Unnamed: 0,chr,chromStart,chromEnd,name
328517,chr17,38023000,38024000,chr17_38023


This means the row index of our genomic bin of interest is 328517.
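To avoid hard-coding the index, the lookup could be wrapped in a small function. The sketch below uses a toy table in place of `loci_def.bed`; `bin_row_index` is a hypothetical helper. Using the result to index rows of the V matrix assumes that the rows of `loci_def.bed` are in the same order as the rows of `vScore.csv.gz`, which this notebook takes for granted.

```python
import pandas as pd

# Toy stand-in for loci_def.bed; the real table has one row per genomic bin.
toy_bins = pd.DataFrame({
    'chr': ['chr17', 'chr17', 'chr17'],
    'chromStart': [38022000, 38023000, 38024000],
    'chromEnd': [38023000, 38024000, 38025000],
    'name': ['chr17_38022', 'chr17_38023', 'chr17_38024'],
})


def bin_row_index(bin_df, bin_name):
    """Return the positional row index of a bin name, requiring a unique match."""
    matches = (bin_df['name'] == bin_name).values.nonzero()[0]
    if len(matches) != 1:
        raise ValueError(
            'expected exactly one row for {}, found {}'.format(bin_name, len(matches))
        )
    return int(matches[0])


bin_idx = bin_row_index(toy_bins, 'chr17_38023')
```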

### Step 2: Which component is important for a given genomic bin -- genomic bin squared cosine score
- Let's write our decomposition as X = UDV', where X is the input feature matrix, D is the diagonal matrix of singular values, U is the matrix of left singular vectors (assay space), V is the matrix of right singular vectors (genomic bin space), and `'` denotes matrix transposition.
- The genomic bin squared cosine score is the element-wise square of the matrix product VD, normalized so that the scores of any given genomic bin sum to 1. Equivalently, each row of VD is scaled to a Euclidean norm of 1 before squaring.
- The score represents the relative importance of each component for a given genomic bin.
- More formal definition:
  - https://docs.google.com/document/d/1YRuaIvHvjb_6SJwlml1dQDegiGlGbdfz_zN-5bneroE/edit?usp=sharing
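The following toy example may help explain the name of the score. It is a sketch on a random 4 x 6 matrix, not on the real data. It checks that the row-normalized square of VD equals the squared cosine of the angle between a bin's assay profile (a column of X) and a component (a column of U). This follows from VD = X'U when U has orthonormal columns.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(4, 6)  # toy input: 4 assays x 6 genomic bins

# Thin SVD: X = U D V', with U (4 x 4), d (4,), Vt (4 x 6)
U, d, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T
assert np.allclose(X, U.dot(np.diag(d)).dot(V.T))

# Squared cosine score: square VD, then divide each row by its row sum
VD = V * d
cos2 = VD ** 2 / np.sum(VD ** 2, axis=1)[:, np.newaxis]

# Same quantity from the geometry: cos^2 of the angle between
# bin i's assay profile X[:, i] and component U[:, k]
i, k = 2, 1
cos2_geometric = X[:, i].dot(U[:, k]) ** 2 / X[:, i].dot(X[:, i])
```

Here `cos2[i, k]` and `cos2_geometric` agree up to floating point error, and every row of `cos2` sums to 1.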
 

#### read the decomposed matrices

In [14]:
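def read_decomposed_matrix(filename, compression=None):
    # Infer gzip compression from the file extension unless given explicitly
    if compression is None and filename.endswith('.gz'):
        compression = 'gzip'
    df = pd.read_csv(
        os.path.join(data_dir, filename),
        compression=compression
    )
    # First column holds the row labels; the remaining columns hold the matrix.
    # `.values` replaces `.as_matrix()`, which was removed in pandas 1.0.
    mat = df.iloc[:, 1:].values
    idx = df.iloc[:, 0].values
    return mat, idx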

In [28]:
d_mat_temp, d_idx = read_decomposed_matrix(os.path.join(data_dir, 'diagonalScore.csv.gz'))
d_vec = d_mat_temp[:, 0]


In [15]:
u_mat, u_idx = read_decomposed_matrix(os.path.join(data_dir, 'uScore.csv.gz'))


In [29]:
v_mat, v_idx = read_decomposed_matrix(os.path.join(data_dir, 'vScore.csv.gz'))


In [30]:
d_vec.shape, u_mat.shape, v_mat.shape, d_idx.shape, u_idx.shape, v_idx.shape

((652,), (652, 652), (379541, 652), (652,), (652,), (379541,))

#### compute matrix products, UD and VD

In [38]:
u_dot_d = np.dot(u_mat, np.diag(d_vec))


In [39]:
v_dot_d = np.dot(v_mat, np.diag(d_vec))


In [40]:
u_dot_d.shape, v_dot_d.shape

((652, 652), (379541, 652))
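Multiplying by `np.diag(d_vec)` scales column k by the k-th singular value, so broadcasting gives the same result without building the dense diagonal matrix. The sketch below checks this on toy arrays; `v_toy` and `d_toy` are stand-ins for `v_mat` and `d_vec`.

```python
import numpy as np

rng = np.random.RandomState(1)
v_toy = rng.randn(5, 3)             # toy stand-in for v_mat
d_toy = np.array([3.0, 2.0, 1.0])   # toy stand-in for d_vec

via_diag = np.dot(v_toy, np.diag(d_toy))  # form used in this notebook
via_broadcast = v_toy * d_toy             # scales column k by d_toy[k]
```

For the 379541 x 652 matrix the broadcast form avoids a full matrix multiplication, which may matter for runtime.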

#### compute normalized matrices
- v_dot_d_find_pcs: genomic bin --> which PC? genomic bin squared cosine score.
- u_dot_d_find_pcs: assay       --> which PC? assay squared cosine score.
- v_dot_d_find_loci: PC --> which genomic bins? genomic bin contribution score.
- u_dot_d_find_assay: PC --> which assays? assay contribution score.
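The four matrices differ only in the matrix used (UD or VD) and the axis of normalization. A minimal sketch with a hypothetical helper, `squared_share`, on a toy matrix:

```python
import numpy as np


def squared_share(mat, axis):
    """Square every entry and divide by the sum of squares along `axis`."""
    sq = mat ** 2
    return sq / np.sum(sq, axis=axis, keepdims=True)


rng = np.random.RandomState(2)
vd_toy = rng.randn(6, 3)  # toy stand-in for v_dot_d: 6 bins x 3 components

find_pcs = squared_share(vd_toy, axis=1)   # per bin, shares over components
find_loci = squared_share(vd_toy, axis=0)  # per component, shares over bins
```

Rows of `find_pcs` sum to 1 (squared cosine score), while columns of `find_loci` sum to 1 (contribution score).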

In [44]:
v_dot_d_find_pcs = (v_dot_d ** 2 ) / (np.sum(v_dot_d ** 2, axis = 1)[:,np.newaxis])


In [45]:
u_dot_d_find_pcs = (u_dot_d ** 2 ) / (np.sum(u_dot_d ** 2, axis = 1)[:,np.newaxis])


In [53]:
v_dot_d_find_loci = (v_dot_d ** 2 ) / (np.sum(v_dot_d ** 2, axis = 0)[np.newaxis, :])


In [54]:
u_dot_d_find_assay = (u_dot_d ** 2 ) / (np.sum(u_dot_d ** 2, axis = 0)[np.newaxis, :])


#### let's identify the top 3 important components for the genomic bin chr17_38023 (index: 328517)

In [91]:
np.argsort(-v_dot_d_find_pcs[328517, :])[:5]

array([ 7, 39,  5, 41,  1])

In [92]:
v_dot_d_find_pcs[328517, np.argsort(-v_dot_d_find_pcs[328517, :])[:5]]

array([0.07565113, 0.06052896, 0.05604721, 0.04363581, 0.03172747])

This means PC7 (0-based index) is the most important component for this bin, with a squared cosine score of 7.6%; PC39 is the second most important with 6.1%, and PC5 is the third with 5.6%.
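The two cells above repeat the same `argsort` call. A small helper can return indices and scores together. This is a sketch on toy scores, and `top_k` is a hypothetical name.

```python
import numpy as np


def top_k(scores, k=5):
    """Return (indices, values) of the k largest entries, largest first."""
    order = np.argsort(-scores)[:k]
    return order, scores[order]


toy_scores = np.array([0.10, 0.40, 0.05, 0.25, 0.20])
idx, vals = top_k(toy_scores, k=3)
print(list(zip(idx.tolist(), vals.tolist())))  # [(1, 0.4), (3, 0.25), (4, 0.2)]
```

In this notebook the call would be `top_k(v_dot_d_find_pcs[328517, :])`.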

### Step 3: investigation of the components

#### PC7 (the top component)

We will investigate 

1. What assays are driving this component?
1. What genomic loci are driving this component?
1. What are the top hits in the enrichment analysis?

#### what assays are driving this component?

In [57]:
np.argsort(-u_dot_d_find_assay[:, 7])[:5]

array([188, 441, 433, 434, 439])

In [58]:
u_dot_d_find_assay[np.argsort(-u_dot_d_find_assay[:, 7])[:5], 7]

array([0.09254222, 0.08990549, 0.08470579, 0.06732091, 0.04859752])

The assays with these indices (where is the correspondence table?) are important for this component, with *assay contribution scores* of 9.3%, 9.0%, etc.
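One possible source for the correspondence table is `u_idx`: `read_decomposed_matrix` returns the first column of `uScore.csv.gz` as its second value. Whether that column actually stores assay identifiers is an assumption that has not been checked here. The sketch below shows the lookup with hypothetical labels.

```python
import numpy as np

# Hypothetical labels; in the notebook this role would be played by u_idx,
# if the first column of uScore.csv.gz stores assay identifiers.
toy_u_idx = np.array(['assay_A', 'assay_B', 'assay_C', 'assay_D'])
# One column of assay contribution scores (toy values)
toy_scores = np.array([0.10, 0.55, 0.05, 0.30])

top = np.argsort(-toy_scores)[:2]
labelled = list(zip(toy_u_idx[top].tolist(), toy_scores[top].tolist()))
```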

#### what genomic bins are driving this component?

In [60]:
np.argsort(-v_dot_d_find_loci[:, 7])[:5]

array([249953, 224313, 330510, 273660, 294763])

In [61]:
v_dot_d_find_loci[np.argsort(-v_dot_d_find_loci[:, 7])[:5], 7]

array([6.58942556e-05, 6.22829820e-05, 6.12494759e-05, 6.06528861e-05,
       5.92396118e-05])

These genomic bins are important for PC7. Note that the genomic bin contribution scores are very small compared to the assay contribution scores. This is expected because of the large number of genomic bins in the whole-genome analysis.
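One way to put the two kinds of score on a comparable scale is to divide each by the share it would have if every bin (or assay) contributed equally. A rough calculation using the matrix shapes and the top PC7 scores reported above:

```python
n_bins, n_assays = 379541, 652  # from the shapes of v_mat and u_mat

uniform_bin_share = 1.0 / n_bins      # share if every bin contributed equally
uniform_assay_share = 1.0 / n_assays  # share if every assay contributed equally

top_bin_score = 6.58942556e-05  # top genomic bin contribution score for PC7
top_assay_score = 0.09254222    # top assay contribution score for PC7

bin_fold = top_bin_score / uniform_bin_share
assay_fold = top_assay_score / uniform_assay_share
print(round(bin_fold, 1), round(assay_fold, 1))
```

The top genomic bin contributes roughly 25 times the uniform share and the top assay roughly 60 times, so the raw difference in magnitude mostly reflects the number of rows.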

#### where is our locus of interest, chr17_38023 (index: 328517), in this ranking?

In [64]:
np.sum(v_dot_d_find_loci[:, 7] >= v_dot_d_find_loci[328517, 7])

19302

In [65]:
np.sum(v_dot_d_find_loci[:, 7] >= v_dot_d_find_loci[328517, 7]) / v_dot_d_find_loci.shape[0]

0.050856165737035

It is roughly in the top 5% of genomic bins for this component.
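The rank and its fraction are computed in two separate cells here and again for each later component. A small helper, sketched below on toy scores, returns both at once; `rank_and_fraction` is a hypothetical name.

```python
import numpy as np


def rank_and_fraction(scores, target_idx):
    """Count entries scoring at least as high as the target (rank 1 = top)
    and return that count together with its fraction of all entries."""
    rank = int(np.sum(scores >= scores[target_idx]))
    return rank, rank / float(len(scores))


toy_scores = np.array([0.1, 0.4, 0.2, 0.3])
rank, frac = rank_and_fraction(toy_scores, target_idx=2)
```

In this notebook the call would be `rank_and_fraction(v_dot_d_find_loci[:, 7], 328517)`.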


#### results of the enrichment analysis

In [72]:
read_great_res_wrapper(enrichment_data_dir, 7, 'HumanPhenotypeOntology').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
1,HP:0010537,Wide cranial sutures,0.000546,2.153747
2,HP:0007648,Punctate cataract,0.000557,2.641658
3,HP:0004492,Widely patent fontanelles and sutures,0.000622,2.173684
5,HP:0000894,Short clavicles,0.001024,2.020724
7,HP:0100720,Hypoplasia of the ear cartilage,0.001549,2.317628
8,HP:0000064,Hypoplastic labia minora,0.001655,2.302681
13,HP:0000851,Congenital hypothyroidism,0.00223,2.302182
15,HP:0001194,Abnormalities of placenta and umbilical cord,0.003565,2.080636
16,HP:0003724,Shoulder girdle muscle atrophy,0.004261,2.945603
17,HP:0012056,Cutaneous melanoma,0.004444,2.586142


In [68]:
read_great_res_wrapper(enrichment_data_dir, 7, 'GOBiologicalProcess').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
2,GO:0032211,negative regulation of telomere maintenance vi...,3.1e-05,3.811605
5,GO:1900746,regulation of vascular endothelial growth fact...,5.4e-05,3.840211
6,GO:0007004,telomere maintenance via telomerase,8.3e-05,3.018872
7,GO:0051974,negative regulation of telomerase activity,8.7e-05,3.13106
8,GO:0032210,regulation of telomere maintenance via telomerase,0.000168,2.840701
9,GO:0045663,positive regulation of myoblast differentiation,0.000267,2.001179
11,GO:0032205,negative regulation of telomere maintenance,0.000299,2.700491
12,GO:0016233,telomere capping,0.000395,3.819518
13,GO:0006278,RNA-dependent DNA replication,0.000426,2.460045
14,GO:0032202,telomere assembly,0.000491,3.711348


In [69]:
read_great_res_wrapper(enrichment_data_dir, 7, 'MGIPhenotype').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
0,MP:0008453,decreased retinal rod cell number,0.00012,2.574294
5,MP:0009189,abnormal pancreatic epsilon cell morphology,0.000734,2.289573
7,MP:0006290,proboscis,0.000759,2.140978
9,MP:0005229,abnormal intervertebral disk development,0.000777,2.097987
31,MP:0009014,prolonged proestrus,0.002114,2.480344
52,MP:0001238,thin epidermis stratum spinosum,0.004226,2.377468
53,MP:0011016,increased core body temperature,0.004271,2.213799
60,MP:0010939,abnormal mandibular prominence morphology,0.004729,2.188475
63,MP:0006197,ocular hypotelorism,0.004913,2.117124
64,MP:0003701,elevated level of mitotic sister chromatid exc...,0.004945,2.54815


In [71]:
read_great_res_wrapper(enrichment_data_dir, 7, 'MGIPhenoSingleKO').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
1,MP:0004122,abnormal sinus arrhythmia,0.000177,2.439509
2,MP:0009189,abnormal pancreatic epsilon cell morphology,0.000203,2.699333
12,MP:0003155,abnormal telomere length,0.001356,2.42164
14,MP:0009175,abnormal pancreatic beta cell differentiation,0.001372,2.222328
18,MP:0000869,abnormal cerebellum posterior vermis morphology,0.001503,2.15352
25,MP:0012055,abnormal phrenic nerve innervation pattern to ...,0.00186,3.345969
27,MP:0008727,enlarged heart right atrium,0.002134,2.854915
28,MP:0009957,abnormal cerebellum vermis lobule morphology,0.002209,2.127916
32,MP:0002704,tubular nephritis,0.002621,2.263936
36,MP:0010207,abnormal telomere morphology,0.003135,2.160905


In [73]:
read_great_res_wrapper(enrichment_data_dir, 7, 'GOCellularComponent').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
0,GO:0030126,COPI vesicle coat,0.000686,2.68802
1,GO:0000783,nuclear telomere cap complex,0.000691,2.800919
2,GO:0005838,proteasome regulatory particle,0.000739,3.075104
3,GO:0030663,COPI-coated vesicle membrane,0.000841,2.633313
5,GO:0005869,dynactin complex,0.00134,5.11206
8,GO:0030137,COPI-coated vesicle,0.002135,2.312546
10,GO:0022624,proteasome accessory complex,0.004053,2.301945
11,GO:0000145,exocyst,0.005402,2.042295
13,GO:0000780,"condensed nuclear chromosome, centromeric region",0.007627,2.070883
22,GO:0005638,lamin filament,0.020245,3.257748


In [74]:
read_great_res_wrapper(enrichment_data_dir, 7, 'GOMolecularFunction').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
2,GO:0004716,receptor signaling protein tyrosine kinase act...,0.004002,2.011743
3,GO:0004364,glutathione transferase activity,0.00441,2.466547
4,GO:0004185,serine-type carboxypeptidase activity,0.005957,2.792532
5,GO:0030957,Tat protein binding,0.008178,2.18717
9,GO:0005172,vascular endothelial growth factor receptor bi...,0.011852,2.157702
11,GO:0004602,glutathione peroxidase activity,0.013293,2.208935
13,GO:0004952,dopamine neurotransmitter receptor activity,0.014252,2.415118
14,GO:0043175,RNA polymerase core enzyme binding,0.014268,2.10153
15,GO:0043546,molybdopterin cofactor binding,0.014603,2.782694
23,GO:0004784,superoxide dismutase activity,0.020976,2.581173
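The six enrichment cells differ only in the ontology name, so they could be collected in a loop. The sketch below takes the reader as an argument so that it stays self-contained. It assumes only the call signature used above, `read_great_res_wrapper(data_dir, component_idx, ontology)`, and that the result has a `.head()` method; `top_enrichment` is a hypothetical helper.

```python
ONTOLOGIES = [
    'HumanPhenotypeOntology', 'GOBiologicalProcess', 'MGIPhenotype',
    'MGIPhenoSingleKO', 'GOCellularComponent', 'GOMolecularFunction',
]


def top_enrichment(reader, data_dir, component_idx, ontologies=ONTOLOGIES, n=10):
    """Call reader(data_dir, component_idx, ontology) for each ontology
    and keep the first n rows of each result."""
    return {
        ontology: reader(data_dir, component_idx, ontology).head(n)
        for ontology in ontologies
    }

# In the notebook:
# tables = top_enrichment(read_great_res_wrapper, enrichment_data_dir, 7)
```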


#### PC39 (the second component)


In [78]:
component_idx=39

#### what assays are driving this component?

In [79]:
np.argsort(-u_dot_d_find_assay[:, component_idx])[:5]

array([ 47, 413, 415, 416, 225])

In [80]:
u_dot_d_find_assay[np.argsort(-u_dot_d_find_assay[:, component_idx])[:5], component_idx]

array([0.20334155, 0.05140923, 0.03698147, 0.026487  , 0.02527061])

The assays with these indices (where is the correspondence table?) are important for this component, with *assay contribution scores* of 20%, 5.1%, etc.

I wonder what assay #47 is.

#### what genomic bins are driving this component?

In [81]:
np.argsort(-v_dot_d_find_loci[:, component_idx])[:5]

array([ 16462,  16461,   2563, 132423, 303976])

In [82]:
v_dot_d_find_loci[np.argsort(-v_dot_d_find_loci[:, component_idx])[:5], component_idx]

array([0.00017904, 0.00016561, 0.00014659, 0.00013263, 0.00012933])

#### where is our locus of interest, chr17_38023 (index: 328517), in this ranking?

In [83]:
np.sum(v_dot_d_find_loci[:, component_idx] >= v_dot_d_find_loci[328517, component_idx])

4631

In [84]:
np.sum(v_dot_d_find_loci[:, component_idx] >= v_dot_d_find_loci[328517, component_idx]) / v_dot_d_find_loci.shape[0]

0.01220158032992483

It is roughly in the top 1% of genomic bins for this component.


#### results of the enrichment analysis

In [85]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'HumanPhenotypeOntology').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
4,HP:0004332,Abnormality of lymphocytes,1.508808e-10,2.019461
5,HP:0002846,Abnormality of B cells,3.167175e-10,2.186231
6,HP:0010701,Abnormal immunoglobulin level,4.113557e-10,2.198288
7,HP:0005372,Abnormality of B cell physiology,5.608107e-10,2.182674
9,HP:0002960,Autoimmunity,6.655998e-09,2.465546
11,HP:0002850,IgM deficiency,1.992062e-08,4.19633
13,HP:0002621,Atherosclerosis,5.459813e-08,2.228612
15,HP:0002634,Arteriosclerosis,8.772e-08,2.195991
16,HP:0004313,Hypogammaglobulinemia,8.781575e-08,2.101667
19,HP:0003049,Ulnar deviation of the wrist,2.243515e-07,7.139804


In [86]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'GOBiologicalProcess').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
10,GO:0048872,homeostasis of number of cells,3.10768e-19,2.057292
19,GO:0002429,immune response-activating cell surface recept...,3.968115e-17,2.012773
35,GO:0030595,leukocyte chemotaxis,2.791925e-14,2.291774
47,GO:0006909,phagocytosis,3.099261e-13,2.005439
50,GO:0038096,Fc-gamma receptor signaling pathway involved i...,4.648642e-13,2.390534
51,GO:0002431,Fc receptor mediated stimulatory signaling pat...,4.84922e-13,2.388601
52,GO:0038094,Fc-gamma receptor signaling pathway,4.950468e-13,2.387656
55,GO:0002262,myeloid cell homeostasis,1.087754e-12,2.120652
73,GO:0046637,regulation of alpha-beta T cell differentiation,6.950873e-12,2.548473
74,GO:0097193,intrinsic apoptotic signaling pathway,7.276055e-12,2.023523


In [87]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'MGIPhenotype').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
33,MP:0002498,abnormal acute inflammation,2.041602e-23,2.019778
45,MP:0005153,abnormal B cell proliferation,1.340759e-20,2.148382
46,MP:0005087,decreased acute inflammation,1.585143e-20,2.276109
50,MP:0008217,abnormal B cell activation,7.056544e-20,2.083458
53,MP:0005068,abnormal NK cell morphology,1.945316e-19,2.282522
56,MP:0008043,abnormal NK cell number,2.886092e-19,2.349993
61,MP:0000702,enlarged lymph nodes,6.721282e-19,2.045753
62,MP:0001876,decreased inflammatory response,6.916685e-19,2.01875
68,MP:0008781,abnormal B cell apoptosis,4.51998e-18,2.612855
70,MP:0008210,increased mature B cell number,1.493909e-17,2.155396


In [88]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'MGIPhenoSingleKO').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
43,MP:0005153,abnormal B cell proliferation,6.658586e-19,2.29431
46,MP:0008217,abnormal B cell activation,1.834708e-18,2.237208
50,MP:0011762,renal/urinary system inflammation,6.732876e-18,2.115589
51,MP:0005068,abnormal NK cell morphology,9.651842e-18,2.359271
53,MP:0008043,abnormal NK cell number,9.046579e-17,2.426455
61,MP:0002148,abnormal hypersensitivity reaction,7.926478e-16,2.213092
64,MP:0005095,decreased T cell proliferation,1.308548e-15,2.034887
69,MP:0005087,decreased acute inflammation,3.49431e-15,2.159936
71,MP:0004149,increased bone strength,4.202318e-15,5.57754
73,MP:0008781,abnormal B cell apoptosis,6.751364e-15,3.027338


In [89]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'GOCellularComponent').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
0,GO:0005925,focal adhesion,9.764139e-16,2.154349
1,GO:0005924,cell-substrate adherens junction,4.952786e-15,2.107207
2,GO:0030055,cell-substrate junction,9.267956e-15,2.016888
20,GO:0031228,intrinsic to Golgi membrane,1.150887e-07,2.084988
22,GO:0030173,integral to Golgi membrane,7.428579e-07,2.035796
27,GO:0002102,podosome,1.708912e-06,2.902913
28,GO:0001673,male germ cell nucleus,2.70234e-06,3.481693
29,GO:0043073,germ cell nucleus,3.139255e-06,3.026767
32,GO:0031526,brush border membrane,4.623854e-06,2.315726
36,GO:0005826,actomyosin contractile ring,9.448244e-06,5.419206


In [90]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'GOMolecularFunction').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
21,GO:0042379,chemokine receptor binding,3.518052e-08,2.959699
31,GO:0071889,14-3-3 protein binding,5.683577e-07,3.162299
32,GO:0008009,chemokine activity,7.881759e-07,2.829485
33,GO:0004826,phenylalanine-tRNA ligase activity,8.912586e-07,6.249892
35,GO:0017112,Rab guanyl-nucleotide exchange factor activity,2.07332e-06,2.748608
39,GO:0035035,histone acetyltransferase binding,3.48109e-06,2.674145
40,GO:0042975,peroxisome proliferator activated receptor bin...,3.565991e-06,4.099304
46,GO:0005138,interleukin-6 receptor binding,1.492873e-05,4.078269
50,GO:0008432,JUN kinase binding,1.744005e-05,5.622004
54,GO:0005160,transforming growth factor beta receptor binding,3.408384e-05,2.281372


#### PC5 (the third component)


In [93]:
component_idx=5

#### what assays are driving this component?

In [94]:
np.argsort(-u_dot_d_find_assay[:, component_idx])[:5]

array([188, 433,  76, 434, 431])

In [95]:
u_dot_d_find_assay[np.argsort(-u_dot_d_find_assay[:, component_idx])[:5], component_idx]

array([0.1195496 , 0.08228496, 0.05328201, 0.03843322, 0.03278715])

#### what genomic bins are driving this component?

In [96]:
np.argsort(-v_dot_d_find_loci[:, component_idx])[:5]

array([282969, 130825, 187463, 337732, 233484])

In [97]:
v_dot_d_find_loci[np.argsort(-v_dot_d_find_loci[:, component_idx])[:5], component_idx]

array([8.04405926e-05, 6.46164381e-05, 6.45142013e-05, 6.39899652e-05,
       6.37951179e-05])

#### where is our locus of interest, chr17_38023 (index: 328517), in this ranking?

In [98]:
np.sum(v_dot_d_find_loci[:, component_idx] >= v_dot_d_find_loci[328517, component_idx])

39281

In [99]:
np.sum(v_dot_d_find_loci[:, component_idx] >= v_dot_d_find_loci[328517, component_idx]) / v_dot_d_find_loci.shape[0]

0.1034960649837567

It is roughly in the top 10% of genomic bins for this component.


#### results of the enrichment analysis

In [100]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'HumanPhenotypeOntology').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
0,HP:0002697,Parietal foramina,4.367679e-08,2.372433
1,HP:0004425,Flat forehead,1.568965e-06,2.73591
3,HP:0004442,Sagittal craniosynostosis,2.359962e-06,2.733096
4,HP:0002365,Hypoplasia of the brainstem,2.396735e-06,2.031767
5,HP:0010054,Abnormality of the first metatarsal,2.506458e-06,2.574793
7,HP:0006191,Deep palmar crease,4.189624e-06,2.97718
8,HP:0000557,Buphthalmos,4.706791e-06,2.287392
12,HP:0009836,Broad distal phalanx of finger,1.739019e-05,2.601329
18,HP:0003741,Congenital muscular dystrophy,3.098136e-05,2.094681
21,HP:0003535,3-Methylglutaconic aciduria,3.86184e-05,2.428799


In [101]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'GOBiologicalProcess').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
62,GO:0035089,establishment of apical/basal cell polarity,8e-06,2.6585
76,GO:0075733,intracellular transport of virus,1.2e-05,2.073909
100,GO:0051775,response to redox state,2.1e-05,2.764178
108,GO:0061162,establishment of monopolar cell polarity,2.5e-05,2.320837
113,GO:0070127,tRNA aminoacylation for mitochondrial protein ...,2.7e-05,4.11596
116,GO:0071624,positive regulation of granulocyte chemotaxis,2.9e-05,2.016287
126,GO:0090023,positive regulation of neutrophil chemotaxis,3.4e-05,2.0201
148,GO:0018401,peptidyl-proline hydroxylation to 4-hydroxy-L-...,5.6e-05,2.373342
161,GO:0001672,regulation of chromatin assembly or disassembly,7.2e-05,3.185283
164,GO:0006999,nuclear pore organization,7.7e-05,2.528835


In [102]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'MGIPhenotype').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
12,MP:0001771,abnormal circulating magnesium level,1.832349e-07,2.327122
19,MP:0003954,abnormal Reichert's membrane morphology,8.388949e-07,2.004437
53,MP:0002348,abnormal lymph node medulla morphology,9.682417e-06,2.903855
54,MP:0010092,increased circulating magnesium level,1.018919e-05,2.628006
59,MP:0009545,abnormal dermis papillary layer morphology,1.604188e-05,2.274394
60,MP:0010743,delayed suture closure,1.69713e-05,2.208579
65,MP:0001669,abnormal glucose absorption,2.245532e-05,2.618271
69,MP:0006210,abnormal orbit size,2.804297e-05,2.18163
77,MP:0002050,pheochromocytoma,3.830129e-05,2.589431
79,MP:0000666,decreased prostate gland duct number,4.3192e-05,2.700373


In [103]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'MGIPhenoSingleKO').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
5,MP:0012129,failure of blastocyst formation,2.360537e-09,2.005173
8,MP:0002663,failure to form blastocele,3.603307e-09,2.004262
23,MP:0003954,abnormal Reichert's membrane morphology,8.388949e-07,2.004437
24,MP:0001771,abnormal circulating magnesium level,1.21186e-06,2.284786
33,MP:0010743,delayed suture closure,3.083532e-06,2.694599
34,MP:0002050,pheochromocytoma,3.48313e-06,3.01081
44,MP:0002348,abnormal lymph node medulla morphology,9.682417e-06,2.903855
48,MP:0009545,abnormal dermis papillary layer morphology,1.604188e-05,2.274394
60,MP:0000275,heart hyperplasia,4.592075e-05,2.012848
69,MP:0002031,increased adrenal gland tumor incidence,6.052274e-05,2.408676


In [104]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'GOCellularComponent').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
5,GO:0005606,laminin-1 complex,1e-06,2.325389
7,GO:0043256,laminin complex,5e-06,2.148635
21,GO:0019031,viral envelope,0.00061,2.839336
48,GO:0031080,nuclear pore outer ring,0.002727,2.255517
54,GO:0005666,DNA-directed RNA polymerase III complex,0.004041,2.228519
58,GO:0019908,nuclear cyclin-dependent protein kinase holoen...,0.004746,2.188418
72,GO:0016461,unconventional myosin complex,0.007764,2.067299
81,GO:0005677,chromatin silencing complex,0.010734,2.11116
85,GO:0005736,DNA-directed RNA polymerase I complex,0.011807,2.494752
93,GO:0005851,eukaryotic translation initiation factor 2B co...,0.014343,2.793472


In [105]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'GOMolecularFunction').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
3,GO:0043022,ribosome binding,7.584495e-08,2.351977
11,GO:0031545,peptidyl-proline 4-dioxygenase activity,8.13874e-06,2.4667
14,GO:0043208,glycosphingolipid binding,2.462545e-05,2.49534
30,GO:0001056,RNA polymerase III activity,0.0003860479,3.131341
34,GO:0005007,fibroblast growth factor-activated receptor ac...,0.000521453,2.250279
48,GO:0032407,MutSalpha complex binding,0.001512241,2.575431
60,GO:0015377,cation:chloride symporter activity,0.002143232,2.088132
65,GO:0047499,calcium-independent phospholipase A2 activity,0.002303207,3.545742
69,GO:0032404,mismatch repair complex binding,0.002498749,2.155177
74,GO:0016421,CoA carboxylase activity,0.002932712,2.029746
