
## VDR chr10_6390 example -- Which component is important?

- The specific context of this variant and disease is described in this Google Doc:
  - https://docs.google.com/document/d/16GuSasXWX-5qwvKAX5-4VxtrbmsIu9UgrP311_viqQc/edit?usp=sharing
- This notebook shows how to:
  1. Given the SNP, identify which genomic bin contains the SNP
  1. Use genomic bin squared cosine score to find the top 3 important components for the genomic bin
  1. Investigate the top component for the genomic bins
    - Use assay contribution scores to see what assays are important for the component
    - Use genomic bin contribution scores to see what other genomic bins are important for the component
    - Explore the results of the enrichment analysis


In [1]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib, collections, itertools, os, re, textwrap, logging, sys
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import matplotlib.patches as mpatches
from functools import reduce

from logging.config import dictConfig
from logging import getLogger

dictConfig(dict(
    version = 1,
    formatters = {'f': {'format': '%(asctime)s %(name)-12s %(levelname)-8s %(message)s'}},
    handlers = {
        'h': {'class': 'logging.StreamHandler','formatter': 'f',
              'level': logging.DEBUG}},
    root = {'handlers': ['h'], 'level': logging.DEBUG,},
))

matplotlib.rc('font',**{'size':16, 'family':'sans-serif','sans-serif':['HelveticaNeue', 'Helvetica']})

logger = getLogger('notebook')


In [2]:
repo_dir=os.path.realpath(
    os.path.dirname(os.path.dirname(os.getcwd()))
)


In [3]:
data_dir=os.path.realpath(
    os.path.join(os.path.dirname(os.getcwd()), 'private_data')
)

In [4]:
enrichment_data_dir=os.path.join(repo_dir, 'enrichment', 'private_data')


In [5]:
sys.path.append(os.path.join(repo_dir, 'enrichment', 'src'))
from great import read_great_res_wrapper


In [6]:
metadata_dir=os.path.join(
    repo_dir, 'metadata'
)
metadata = pd.read_table(
    os.path.join(metadata_dir, 'sample_antibody_map_v2_with_metadata.tsv'),
    sep='\t',
)

### Step 1: SNP to genomic bin


In [7]:
genomic_bin_df=pd.read_csv(
    os.path.join(repo_dir, 'enrichment', 'private_data', 'loci_def.bed'),
    names=['chr', 'chromStart', 'chromEnd', 'name'],
    sep='\t'
)

In [9]:
genomic_bin_df[genomic_bin_df['name'] == 'chr10_6390']

Unnamed: 0,chr,chromStart,chromEnd,name
216060,chr10,6390000,6391000,chr10_6390


The genomic bin of interest therefore has index 216060.

In [10]:
genomic_bin_idx=216060
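
The lookup above matches on the bin name. When only a SNP coordinate is known, the containing bin can be found by interval containment. Here is a minimal sketch, assuming the BED intervals are half-open `[chromStart, chromEnd)` with 0-based positions (the usual BED convention). The helper name `find_bin_index`, the toy table, and the position 6390500 are hypothetical and only serve to illustrate the idea.

```python
import pandas as pd

def find_bin_index(bins, chrom, pos):
    """Return the row labels of bins whose [chromStart, chromEnd) interval contains pos."""
    mask = (
        (bins['chr'] == chrom)
        & (bins['chromStart'] <= pos)
        & (pos < bins['chromEnd'])
    )
    return bins.index[mask].tolist()

# Toy table in the same layout as loci_def.bed (not the real data).
toy_bins = pd.DataFrame({
    'chr': ['chr10', 'chr10', 'chr17'],
    'chromStart': [6389000, 6390000, 38023000],
    'chromEnd': [6390000, 6391000, 38024000],
    'name': ['chr10_6389', 'chr10_6390', 'chr17_38023'],
})

# Hypothetical 0-based position, used only to exercise the helper.
hits = find_bin_index(toy_bins, 'chr10', 6390500)
print(toy_bins.loc[hits, 'name'].tolist())
```

On the real table the same call would be made on `genomic_bin_df`, and the returned label would play the role of `genomic_bin_idx`.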

### Step 2: Which component is important for a given genomic bin -- genomic bin squared cosine score
- Let's write our decomposition as X = UDV', where X is the input feature matrix, D is the diagonal matrix of singular values, U is the matrix of left singular vectors (on assay space), V is the matrix of right singular vectors (on genomic bin space), and `'` denotes matrix transposition.
- The genomic bin squared cosine score is obtained by L2-normalizing each row of the matrix product VD, so that the slice for any given genomic bin has a Euclidean norm of 1, and then squaring each element. The scores for a given genomic bin therefore sum to 1.
- The score represents the relative importance of each component for a given genomic bin.
- More formal definition:
  - https://docs.google.com/document/d/1YRuaIvHvjb_6SJwlml1dQDegiGlGbdfz_zN-5bneroE/edit?usp=sharing
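
To make the definition concrete, here is a minimal sketch on a small random matrix. It is not the notebook's data: the shape (4 assays by 6 genomic bins) and the variable names are assumptions chosen for illustration. It decomposes X with `np.linalg.svd`, forms VD, and checks that the squared cosine scores of each genomic bin sum to 1.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6))           # toy X: 4 assays x 6 genomic bins

# NumPy returns X = U @ diag(d) @ Vt, so V is the transpose of Vt.
u, d, vt = np.linalg.svd(x, full_matrices=False)
v = vt.T                              # 6 genomic bins x 4 components

vd = v * d                            # same product as v @ np.diag(d)
# Square each entry and divide by the squared Euclidean norm of its row.
cos2 = vd ** 2 / np.sum(vd ** 2, axis=1, keepdims=True)

print(cos2.sum(axis=1))               # one total per genomic bin
```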
 

#### read the decomposed matrices

In [11]:
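def read_decomposed_matrix(filename, compression=None):
    """Read a score CSV and return (values from columns 1.., labels from column 0)."""
    # Infer gzip compression from the file extension when it is not given.
    if compression is None and filename.endswith('.gz'):
        compression = 'gzip'
    df = pd.read_csv(
        os.path.join(data_dir, filename),
        compression=compression
    )
    # `.values` replaces `DataFrame.as_matrix()`, which was removed in pandas 1.0.
    mat = df.iloc[:, 1:].values
    idx = df.iloc[:, 0].values
    return mat, idx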

In [12]:
d_mat_temp, d_idx = read_decomposed_matrix(os.path.join(data_dir, 'diagonalScore.csv.gz'))
d_vec = d_mat_temp[:, 0]


In [13]:
u_mat, u_idx = read_decomposed_matrix(os.path.join(data_dir, 'uScore.csv.gz'))


In [14]:
v_mat, v_idx = read_decomposed_matrix(os.path.join(data_dir, 'vScore.csv.gz'))


In [15]:
d_vec.shape, u_mat.shape, v_mat.shape, d_idx.shape, u_idx.shape, v_idx.shape

((652,), (652, 652), (379541, 652), (652,), (652,), (379541,))

#### compute matrix products, UD and VD

In [16]:
u_dot_d = np.dot(u_mat, np.diag(d_vec))


In [17]:
v_dot_d = np.dot(v_mat, np.diag(d_vec))


In [18]:
u_dot_d.shape, v_dot_d.shape

((652, 652), (379541, 652))
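
Multiplying by `np.diag(d_vec)` builds a dense square matrix only to scale each column. The sketch below, on toy arrays with made-up values, shows that NumPy broadcasting gives the same product. It is offered as an equivalent alternative, not as a correction to the cells above.

```python
import numpy as np

rng = np.random.default_rng(1)
toy_v = rng.normal(size=(5, 3))      # toy stand-in for v_mat
toy_d = np.array([3.0, 2.0, 1.0])    # toy stand-in for d_vec

via_diag = np.dot(toy_v, np.diag(toy_d))
via_broadcast = toy_v * toy_d        # scales column j by toy_d[j]

print(np.allclose(via_diag, via_broadcast))
```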

#### compute normalized matrices
- v_dot_d_find_pcs: genomic bin --> which PC? genomic bin squared cosine score.
- u_dot_d_find_pcs: assay       --> which PC? assay squared cosine score.
- v_dot_d_find_loci: PC --> which genomic bins? genomic bin contribution score.
- u_dot_d_find_assay: PC --> which assay? assay contribution score.

In [19]:
v_dot_d_find_pcs = (v_dot_d ** 2 ) / (np.sum(v_dot_d ** 2, axis = 1)[:,np.newaxis])


In [20]:
u_dot_d_find_pcs = (u_dot_d ** 2 ) / (np.sum(u_dot_d ** 2, axis = 1)[:,np.newaxis])


In [21]:
v_dot_d_find_loci = (v_dot_d ** 2 ) / (np.sum(v_dot_d ** 2, axis = 0)[np.newaxis, :])


In [22]:
u_dot_d_find_assay = (u_dot_d ** 2 ) / (np.sum(u_dot_d ** 2, axis = 0)[np.newaxis, :])
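
The two normalizations differ only in the axis that is summed. The following toy sketch (random data, not the notebook's matrices) checks that squared cosine scores sum to 1 across components and contribution scores sum to 1 across genomic bins. It also shows that, assuming the columns of V are orthonormal as in an exact SVD, the contribution score reduces to the squared entries of V. Whether that holds exactly for the stored `vScore` matrix is not verified here.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=(4, 6))                  # toy X: 4 assays x 6 genomic bins
u, d, vt = np.linalg.svd(x, full_matrices=False)
v = vt.T

vd = v * d
cos2 = vd ** 2 / np.sum(vd ** 2, axis=1)[:, np.newaxis]      # rows sum to 1
contrib = vd ** 2 / np.sum(vd ** 2, axis=0)[np.newaxis, :]   # columns sum to 1

# Each column of V has unit norm, so the column sums of (VD)**2 equal d**2
# and the singular values cancel out of the contribution score.
print(np.allclose(contrib, v ** 2))
```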


#### let's identify the top 3 important components for the genomic bin chr10_6390 (index: 216060)

In [23]:
np.argsort(-v_dot_d_find_pcs[genomic_bin_idx, :])[:5]

array([ 5,  1, 73, 14,  2])

In [24]:
v_dot_d_find_pcs[genomic_bin_idx, np.argsort(-v_dot_d_find_pcs[genomic_bin_idx, :])[:5]]

array([0.10332161, 0.04327748, 0.03535886, 0.02969716, 0.02020912])

This means PC5 (0-based index) is the most important component for this bin, accounting for 10.3% of its squared cosine score; PC1 is the second most important with 4.3%, and so on.
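
The two cells above can be folded into one helper that returns each top component together with its score. This is a minimal sketch: the function name `top_components` and the toy score vector are hypothetical and are not part of the notebook's code.

```python
import numpy as np

def top_components(scores, k=3):
    """Return (component index, score) pairs for the k largest scores."""
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]

# Toy squared cosine scores for one genomic bin (they sum to 1).
toy_scores = np.array([0.05, 0.30, 0.10, 0.55])
for idx, score in top_components(toy_scores, k=2):
    print('PC{}: {:.1%}'.format(idx, score))
```

With the real data the call would be `top_components(v_dot_d_find_pcs[genomic_bin_idx, :])`.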

### Step 3: investigation of the components

#### PC5 (the top component)

We will investigate 

1. What assays are driving this component?
1. What genomic loci are driving this component?
1. What are the top hits in the enrichment analysis?
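
The first two questions are answered the same way for every component: rank one column of a contribution matrix. The helper below is a hypothetical sketch of that step (`top_contributors` and the toy matrix are illustrative names and values, not part of the notebook), which could replace the repeated `np.argsort` cells.

```python
import numpy as np

def top_contributors(contrib, component, k=5):
    """Return row indices and scores of the k largest entries in one column."""
    column = contrib[:, component]
    order = np.argsort(-column)[:k]
    return order, column[order]

# Toy contribution matrix: 4 rows (assays or genomic bins) x 2 components,
# with each column summing to 1.
toy_contrib = np.array([
    [0.1, 0.7],
    [0.6, 0.1],
    [0.2, 0.1],
    [0.1, 0.1],
])
rows, scores = top_contributors(toy_contrib, component=0, k=2)
print(rows, scores)
```

Applied to `u_dot_d_find_assay` it would give the driving assays, and applied to `v_dot_d_find_loci` the driving genomic bins.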

In [25]:
component_idx=5

#### what assays are driving this component?

In [26]:
np.argsort(-u_dot_d_find_assay[:, component_idx])[:5]

array([188, 433,  76, 434, 431])

In [27]:
u_dot_d_find_assay[np.argsort(-u_dot_d_find_assay[:, component_idx])[:5], component_idx]

array([0.1195496 , 0.08228496, 0.05328201, 0.03843322, 0.03278715])

In [28]:
metadata.iloc[np.argsort(-u_dot_d_find_assay[:, component_idx])[:5], :]

Unnamed: 0,sample_number,antibody,Genome assembly,Antigen class,Antigen,Cell type class,Cell type,Cell type description,Processing logs,Title,...,age,treatment,genotype,lab,age.1,health state,cell_type,tissue_type,provider,sex
188,SRX1027619,,hg19,No description,,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"804451442,97.3,44.1,118575",ChIP-seq of Homo sapiens: H3K4me3,...,,,,,,,LCLs,blood,Coriell,pooled male and female
433,SRX651491,,hg19,Histone,H3K4me3,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"166047212,93.9,6.4,63296",GSM1435515: LCL19238 H3K4me3; Homo sapiens; Ch...,...,,,,,,,,,,
76,ERX329687,,hg19,Unclassified,Unclassified,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"238094924,94.2,33.9,144042",Illumina HiSeq 2000 sequencing; Coordinated ef...,...,,,,,,,NA12878,,,female
434,SRX651492,,hg19,Histone,H3K4me1,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"112405175,94.2,1.6,121976",GSM1435516: LCL19239 H3K4me1; Homo sapiens; Ch...,...,,,,,,,,,,
431,SRX651470,,hg19,Histone,H3K4me3,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"117212731,95.4,4.0,47422",GSM1435517: LCL19239 H3K4me3; Homo sapiens; Ch...,...,,,,,,,,,,


#### what genomic bins are driving this component?

In [29]:
np.argsort(-v_dot_d_find_loci[:, component_idx])[:5]

array([282969, 130825, 187463, 337732, 233484])

In [30]:
v_dot_d_find_loci[np.argsort(-v_dot_d_find_loci[:, component_idx])[:5], component_idx]

array([8.04405926e-05, 6.46164381e-05, 6.45142013e-05, 6.39899652e-05,
       6.37951179e-05])

These genomic bins are the most important for the PC. Note that the genomic bin contribution scores are very small compared with the assay contribution scores. This is expected because of the large number of genomic bins in the whole-genome analysis.
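
One way to compare the two kinds of score is against a uniform baseline: contribution scores sum to 1 over all rows for a given component, so equal contributions would each be 1 / (number of rows). The sketch below uses only the matrix shapes and the top PC5 scores printed above; the idea of a "fold over uniform" is an illustrative device, not part of the original analysis.

```python
n_assays = 652        # rows of u_mat, from the shapes printed above
n_bins = 379541       # rows of v_mat, from the shapes printed above

# If every row contributed equally, each score would be 1 / (number of rows).
uniform_assay = 1.0 / n_assays
uniform_bin = 1.0 / n_bins

top_assay_score = 0.1195496      # top assay contribution score for PC5
top_bin_score = 8.04405926e-05   # top genomic bin contribution score for PC5

assay_fold = top_assay_score / uniform_assay
bin_fold = top_bin_score / uniform_bin
print(assay_fold, bin_fold)      # fold over the uniform baseline
```

On this scale the top genomic bin is still well above its baseline, even though its raw score looks tiny.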

#### where is our locus of interest in this ranking?

In [31]:
np.sum(v_dot_d_find_loci[:, component_idx] >= v_dot_d_find_loci[
    genomic_bin_idx, component_idx])

18420

In [32]:
np.sum(v_dot_d_find_loci[:, component_idx] >= v_dot_d_find_loci[
    genomic_bin_idx, component_idx]) / v_dot_d_find_loci.shape[0]

0.04853230612766473

It is roughly in the top 5%.
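
The rank and its fraction can be computed together. In this minimal sketch, the helper name `rank_and_fraction` and the toy vector are hypothetical. As in the cells above, ties count toward the rank because the comparison uses `>=`.

```python
import numpy as np

def rank_and_fraction(scores, idx):
    """Count entries scoring at least scores[idx]; also return that count as a fraction."""
    rank = int(np.sum(scores >= scores[idx]))
    return rank, rank / scores.shape[0]

toy = np.array([0.4, 0.1, 0.3, 0.2])
print(rank_and_fraction(toy, 2))
```

With the real data the call would be `rank_and_fraction(v_dot_d_find_loci[:, component_idx], genomic_bin_idx)`.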


#### results of the enrichment analysis

In [33]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'HumanPhenotypeOntology').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
0,HP:0002697,Parietal foramina,4.367679e-08,2.372433
1,HP:0004425,Flat forehead,1.568965e-06,2.73591
3,HP:0004442,Sagittal craniosynostosis,2.359962e-06,2.733096
4,HP:0002365,Hypoplasia of the brainstem,2.396735e-06,2.031767
5,HP:0010054,Abnormality of the first metatarsal,2.506458e-06,2.574793
7,HP:0006191,Deep palmar crease,4.189624e-06,2.97718
8,HP:0000557,Buphthalmos,4.706791e-06,2.287392
12,HP:0009836,Broad distal phalanx of finger,1.739019e-05,2.601329
18,HP:0003741,Congenital muscular dystrophy,3.098136e-05,2.094681
21,HP:0003535,3-Methylglutaconic aciduria,3.86184e-05,2.428799


In [34]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'GOBiologicalProcess').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
62,GO:0035089,establishment of apical/basal cell polarity,8e-06,2.6585
76,GO:0075733,intracellular transport of virus,1.2e-05,2.073909
100,GO:0051775,response to redox state,2.1e-05,2.764178
108,GO:0061162,establishment of monopolar cell polarity,2.5e-05,2.320837
113,GO:0070127,tRNA aminoacylation for mitochondrial protein ...,2.7e-05,4.11596
116,GO:0071624,positive regulation of granulocyte chemotaxis,2.9e-05,2.016287
126,GO:0090023,positive regulation of neutrophil chemotaxis,3.4e-05,2.0201
148,GO:0018401,peptidyl-proline hydroxylation to 4-hydroxy-L-...,5.6e-05,2.373342
161,GO:0001672,regulation of chromatin assembly or disassembly,7.2e-05,3.185283
164,GO:0006999,nuclear pore organization,7.7e-05,2.528835


In [35]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'MGIPhenotype').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
12,MP:0001771,abnormal circulating magnesium level,1.832349e-07,2.327122
19,MP:0003954,abnormal Reichert's membrane morphology,8.388949e-07,2.004437
53,MP:0002348,abnormal lymph node medulla morphology,9.682417e-06,2.903855
54,MP:0010092,increased circulating magnesium level,1.018919e-05,2.628006
59,MP:0009545,abnormal dermis papillary layer morphology,1.604188e-05,2.274394
60,MP:0010743,delayed suture closure,1.69713e-05,2.208579
65,MP:0001669,abnormal glucose absorption,2.245532e-05,2.618271
69,MP:0006210,abnormal orbit size,2.804297e-05,2.18163
77,MP:0002050,pheochromocytoma,3.830129e-05,2.589431
79,MP:0000666,decreased prostate gland duct number,4.3192e-05,2.700373


In [36]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'MGIPhenoSingleKO').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
5,MP:0012129,failure of blastocyst formation,2.360537e-09,2.005173
8,MP:0002663,failure to form blastocele,3.603307e-09,2.004262
23,MP:0003954,abnormal Reichert's membrane morphology,8.388949e-07,2.004437
24,MP:0001771,abnormal circulating magnesium level,1.21186e-06,2.284786
33,MP:0010743,delayed suture closure,3.083532e-06,2.694599
34,MP:0002050,pheochromocytoma,3.48313e-06,3.01081
44,MP:0002348,abnormal lymph node medulla morphology,9.682417e-06,2.903855
48,MP:0009545,abnormal dermis papillary layer morphology,1.604188e-05,2.274394
60,MP:0000275,heart hyperplasia,4.592075e-05,2.012848
69,MP:0002031,increased adrenal gland tumor incidence,6.052274e-05,2.408676


In [37]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'GOCellularComponent').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
5,GO:0005606,laminin-1 complex,1e-06,2.325389
7,GO:0043256,laminin complex,5e-06,2.148635
21,GO:0019031,viral envelope,0.00061,2.839336
48,GO:0031080,nuclear pore outer ring,0.002727,2.255517
54,GO:0005666,DNA-directed RNA polymerase III complex,0.004041,2.228519
58,GO:0019908,nuclear cyclin-dependent protein kinase holoen...,0.004746,2.188418
72,GO:0016461,unconventional myosin complex,0.007764,2.067299
81,GO:0005677,chromatin silencing complex,0.010734,2.11116
85,GO:0005736,DNA-directed RNA polymerase I complex,0.011807,2.494752
93,GO:0005851,eukaryotic translation initiation factor 2B co...,0.014343,2.793472


In [38]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'GOMolecularFunction').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
3,GO:0043022,ribosome binding,7.584495e-08,2.351977
11,GO:0031545,peptidyl-proline 4-dioxygenase activity,8.13874e-06,2.4667
14,GO:0043208,glycosphingolipid binding,2.462545e-05,2.49534
30,GO:0001056,RNA polymerase III activity,0.0003860479,3.131341
34,GO:0005007,fibroblast growth factor-activated receptor ac...,0.000521453,2.250279
48,GO:0032407,MutSalpha complex binding,0.001512241,2.575431
60,GO:0015377,cation:chloride symporter activity,0.002143232,2.088132
65,GO:0047499,calcium-independent phospholipase A2 activity,0.002303207,3.545742
69,GO:0032404,mismatch repair complex binding,0.002498749,2.155177
74,GO:0016421,CoA carboxylase activity,0.002932712,2.029746


#### PC1 (the second component)


In [39]:
component_idx=1

#### what assays are driving this component?

In [40]:
np.argsort(-u_dot_d_find_assay[:, component_idx])[:5]

array([  5, 434,  39, 181, 180])

In [41]:
u_dot_d_find_assay[np.argsort(-u_dot_d_find_assay[:, component_idx])[:5], component_idx]

array([0.25162818, 0.06753813, 0.06040728, 0.04239344, 0.04134171])

In [42]:
metadata.iloc[np.argsort(-u_dot_d_find_assay[:, component_idx])[:5], :]

Unnamed: 0,sample_number,antibody,Genome assembly,Antigen class,Antigen,Cell type class,Cell type,Cell type description,Processing logs,Title,...,age,treatment,genotype,lab,age.1,health state,cell_type,tissue_type,provider,sex
5,ERX329616,,hg19,Unclassified,Unclassified,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"176973882,81.5,56.2,35831",Illumina HiSeq 2000 sequencing; Coordinated ef...,...,,,,,,,NA19238,,,female
434,SRX651492,,hg19,Histone,H3K4me1,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"112405175,94.2,1.6,121976",GSM1435516: LCL19239 H3K4me1; Homo sapiens; Ch...,...,,,,,,,,,,
39,ERX329650,,hg19,Unclassified,Unclassified,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"159910142,78.7,57.6,7876",Illumina HiSeq 2000 sequencing; Coordinated ef...,...,,,,,,,NA19239,,,male
181,SRX1027612,,hg19,No description,,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"148497099,94.2,23.5,47476",ChIP-seq of Homo sapiens: NFkB-replicate2,...,,,,,,,LCLs,blood,Coriell,pooled male and female
180,SRX1027611,,hg19,No description,,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"132887926,94.3,21.5,45014",ChIP-seq of Homo sapiens: NFkB-replicate1,...,,,,,,,LCLs,blood,Coriell,pooled male and female


#### what genomic bins are driving this component?

In [43]:
np.argsort(-v_dot_d_find_loci[:, component_idx])[:5]

array([ 85,  36,  40,  86, 546])

In [44]:
v_dot_d_find_loci[np.argsort(-v_dot_d_find_loci[:, component_idx])[:5], component_idx]

array([3.17653819e-05, 3.17653819e-05, 3.17653819e-05, 3.17653819e-05,
       3.17653819e-05])

#### where is our locus of interest, chr10_6390 (index: 216060), in this ranking?

In [45]:
np.sum(v_dot_d_find_loci[:, component_idx] >= v_dot_d_find_loci[genomic_bin_idx, component_idx])

98102

In [46]:
np.sum(v_dot_d_find_loci[:, component_idx] >= v_dot_d_find_loci[genomic_bin_idx, component_idx]) / v_dot_d_find_loci.shape[0]

0.2584753689324737

It is roughly in the top 26%.


#### results of the enrichment analysis

In [47]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'HumanPhenotypeOntology').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
0,HP:0004395,Malnutrition,5e-06,5.321346
1,HP:0001718,Mitral stenosis,6.3e-05,3.783898
2,HP:0001413,Micronodular cirrhosis,0.00012,2.821075
5,HP:0004333,Bone-marrow foam cells,0.000345,6.657703
6,HP:0003548,Subsarcolemmal accumulations of abnormally sha...,0.000527,6.137533
8,HP:0001414,Microvesicular hepatic steatosis,0.000632,3.131675
12,HP:0004975,Erlenmeyer flask deformity of the femurs,0.001488,3.197454
13,HP:0002725,Systemic lupus erythematosus,0.001533,3.788696
17,HP:0005938,Abnormal respiratory motile cilium morphology,0.002178,3.267208
21,HP:0001618,Dysphonia,0.003317,2.208404


In [48]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'GOBiologicalProcess').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
5,GO:0043901,negative regulation of multi-organism process,1.31894e-07,2.168513
8,GO:0050830,defense response to Gram-positive bacterium,5.854487e-07,2.436218
9,GO:0043374,"CD8-positive, alpha-beta T cell differentiation",7.974458e-07,2.954704
14,GO:0006572,tyrosine catabolic process,1.401399e-06,7.532955
17,GO:0002286,T cell activation involved in immune response,3.720987e-06,2.339229
19,GO:0044130,negative regulation of growth of symbiont in host,7.317285e-06,2.802184
28,GO:0060706,cell differentiation involved in embryonic pla...,3.461585e-05,2.245746
30,GO:0032814,regulation of natural killer cell activation,3.942309e-05,2.472078
31,GO:0035404,histone-serine phosphorylation,4.352375e-05,3.327948
32,GO:0051851,modification by host of symbiont morphology or...,4.485152e-05,2.40471


In [49]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'MGIPhenotype').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
5,MP:0008552,abnormal circulating tumor necrosis factor level,5.025154e-11,2.13741
25,MP:0006309,decreased retinal ganglion cell number,1.466071e-07,2.196054
38,MP:0008553,increased circulating tumor necrosis factor level,9.253086e-07,2.065763
40,MP:0008392,decreased primordial germ cell number,1.07923e-06,2.054944
54,MP:0008554,decreased circulating tumor necrosis factor level,2.946674e-06,2.397094
92,MP:0010819,primary atelectasis,2.198725e-05,2.336993
98,MP:0001499,abnormal kindling response,2.828631e-05,2.472546
101,MP:0000798,abnormal frontal lobe morphology,3.656873e-05,2.80789
111,MP:0008589,abnormal circulating interleukin-1 level,5.936768e-05,2.18055
116,MP:0004725,decreased platelet serotonin level,6.7036e-05,2.304622


In [50]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'MGIPhenoSingleKO').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
16,MP:0008552,abnormal circulating tumor necrosis factor level,6.418123e-08,2.067143
47,MP:0009788,increased susceptibility to bacterial infectio...,3.363712e-06,2.019205
55,MP:0008554,decreased circulating tumor necrosis factor level,7.000738e-06,2.36779
61,MP:0009321,increased histiocytic sarcoma incidence,1.319572e-05,2.534863
75,MP:0006309,decreased retinal ganglion cell number,3.783763e-05,2.174031
83,MP:0009615,abnormal zinc homeostasis,5.456382e-05,4.900227
86,MP:0004721,abnormal platelet dense granule morphology,7.445109e-05,2.12458
88,MP:0008784,craniorachischisis,8.473798e-05,3.4619
89,MP:0004725,decreased platelet serotonin level,8.569051e-05,2.310633
90,MP:0003452,abnormal parotid gland morphology,9.398743e-05,4.18873


In [51]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'GOCellularComponent').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
1,GO:0030125,clathrin vesicle coat,6.7e-05,2.55238
2,GO:0030934,anchoring collagen,0.000224,2.456392
5,GO:0005861,troponin complex,0.000632,7.499313
10,GO:0002199,zona pellucida receptor complex,0.00114,4.497283
12,GO:0031932,TORC2 complex,0.001389,3.227893
13,GO:0030122,AP-2 adaptor complex,0.001467,2.582954
15,GO:0005865,striated muscle thin filament,0.001538,2.682054
20,GO:0030014,CCR4-NOT complex,0.002663,2.416343
22,GO:0030130,clathrin coat of trans-Golgi network vesicle,0.004041,2.774748
35,GO:0036379,myofilament,0.009111,2.027585


In [52]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'GOMolecularFunction').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
4,GO:0016857,"racemase and epimerase activity, acting on car...",1.1e-05,4.864881
9,GO:0003810,protein-glutamine gamma-glutamyltransferase ac...,0.000161,3.933715
12,GO:0005143,interleukin-12 receptor binding,0.00057,3.378259
13,GO:0016854,racemase and epimerase activity,0.000614,3.142031
14,GO:0005149,interleukin-1 receptor binding,0.00065,3.576795
15,GO:0005024,transforming growth factor beta-activated rece...,0.000709,2.011484
16,GO:0004340,glucokinase activity,0.000847,4.737863
17,GO:0005536,glucose binding,0.000861,4.154994
23,GO:0051010,microtubule plus-end binding,0.001684,2.450133
24,GO:0017153,sodium:dicarboxylate symporter activity,0.002042,2.719939


#### PC73 (the third component)


In [53]:
component_idx=73

#### what assays are driving this component?

In [54]:
np.argsort(-u_dot_d_find_assay[:, component_idx])[:5]

array([433,  64, 443, 316, 317])

In [55]:
u_dot_d_find_assay[np.argsort(-u_dot_d_find_assay[:, component_idx])[:5], component_idx]

array([0.10426188, 0.09234196, 0.07100191, 0.06702174, 0.04706068])

In [56]:
metadata.iloc[np.argsort(-u_dot_d_find_assay[:, component_idx])[:5], :]

Unnamed: 0,sample_number,antibody,Genome assembly,Antigen class,Antigen,Cell type class,Cell type,Cell type description,Processing logs,Title,...,age,treatment,genotype,lab,age.1,health state,cell_type,tissue_type,provider,sex
433,SRX651491,,hg19,Histone,H3K4me3,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"166047212,93.9,6.4,63296",GSM1435515: LCL19238 H3K4me3; Homo sapiens; Ch...,...,,,,,,,,,,
64,ERX329675,,hg19,Unclassified,Unclassified,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"241109952,88.2,48.3,80858",Illumina HiSeq 2000 sequencing; Coordinated ef...,...,,,,,,,NA19238,,,female
443,SRX651501,,hg19,Histone,H3K4me1,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"64570748,96.0,3.0,52838",GSM1435526: LCL18507 H3K4me1; Homo sapiens; Ch...,...,,,,,,,,,,
316,SRX356721,h3k4me1,hg19,Histone,H3K4me1,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"34201660,98.3,4.5,70303",GSM1234153: GM2255 H3K4me1 1; Homo sapiens; Ch...,...,,,,,,,2255,,,
317,SRX356722,h3k4me1,hg19,Histone,H3K4me1,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"31838094,98.5,5.1,66804",GSM1234154: GM2255 H3K4me1 2; Homo sapiens; Ch...,...,,,,,,,2255,,,


#### what genomic bins are driving this component?

In [57]:
np.argsort(-v_dot_d_find_loci[:, component_idx])[:5]

array([217546, 236653, 236652, 279512,  88007])

In [58]:
v_dot_d_find_loci[np.argsort(-v_dot_d_find_loci[:, component_idx])[:5], component_idx]

array([0.00019616, 0.00019205, 0.00018801, 0.00018011, 0.00017978])

#### where is our locus of interest, chr10_6390 (index: 216060), in this ranking?

In [59]:
np.sum(v_dot_d_find_loci[:, component_idx] >= v_dot_d_find_loci[genomic_bin_idx, component_idx])

4269

In [60]:
np.sum(v_dot_d_find_loci[:, component_idx] >= v_dot_d_find_loci[genomic_bin_idx, component_idx]) / v_dot_d_find_loci.shape[0]

0.01124779668072751

It is roughly in the top 1%.


#### results of the enrichment analysis

In [61]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'HumanPhenotypeOntology').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
0,HP:0001878,Hemolytic anemia,1.126939e-07,2.501702
1,HP:0006530,Interstitial pulmonary disease,4.19688e-07,6.723312
2,HP:0002583,Colitis,1.441889e-06,4.406802
3,HP:0004369,Decreased purine levels,8.433215e-06,4.296864
4,HP:0001271,Polyneuropathy,1.031858e-05,2.744127
5,HP:0002173,Hypoglycemic seizures,1.22448e-05,3.705253
6,HP:0011145,Symptomatic seizures,1.431779e-05,3.486968
7,HP:0003565,Elevated erythrocyte sedimentation rate,1.969673e-05,8.817717
8,HP:0001241,Capitate-hamate fusion,2.311056e-05,7.126692
10,HP:0004429,Recurrent viral infections,3.450729e-05,2.740296


In [62]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'GOBiologicalProcess').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
3,GO:0031348,negative regulation of defense response,6.349153e-12,2.133784
8,GO:0050688,regulation of defense response to virus,5.176126e-11,2.644539
13,GO:0050728,negative regulation of inflammatory response,2.599648e-09,2.003923
15,GO:0042098,T cell proliferation,5.820725e-09,2.501011
23,GO:0042992,negative regulation of transcription factor im...,6.555921e-08,2.659193
36,GO:0051081,nuclear envelope disassembly,4.33601e-07,2.499376
44,GO:0042347,negative regulation of NF-kappaB import into n...,8.39536e-07,3.095003
45,GO:0015758,glucose transport,1.050236e-06,2.12281
46,GO:0015749,monosaccharide transport,1.07404e-06,2.103554
47,GO:0008645,hexose transport,1.207382e-06,2.112212


In [63]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'MGIPhenotype').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
35,MP:0008553,increased circulating tumor necrosis factor level,3.151759e-13,2.780833
52,MP:0004800,decreased susceptibility to experimental autoi...,5.01743e-11,2.319404
56,MP:0008180,abnormal marginal zone B cell morphology,1.07069e-10,2.032751
64,MP:0008125,abnormal dendritic cell number,1.146185e-09,2.171483
68,MP:0008539,decreased susceptibility to induced colitis,1.608991e-09,3.687543
70,MP:0008552,abnormal circulating tumor necrosis factor level,2.006196e-09,2.074254
76,MP:0005070,impaired natural killer cell mediated cytotoxi...,4.927415e-09,2.231583
83,MP:0008172,abnormal follicular B cell morphology,1.126701e-08,2.043186
84,MP:0008182,decreased marginal zone B cell number,1.572002e-08,2.037115
91,MP:0011719,abnormal natural killer cell mediated cytotoxi...,2.592665e-08,2.105791


In [64]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'MGIPhenoSingleKO').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
41,MP:0010766,abnormal NK cell physiology,1.620638e-11,2.071812
47,MP:0008180,abnormal marginal zone B cell morphology,8.623327e-11,2.402066
58,MP:0010878,increased trabecular bone volume,6.890306e-10,3.473976
63,MP:0011518,abnormal cell chemotaxis,1.659544e-09,2.00568
65,MP:0008125,abnormal dendritic cell number,2.184636e-09,2.500405
68,MP:0005461,abnormal dendritic cell morphology,3.749675e-09,2.134901
70,MP:0008182,decreased marginal zone B cell number,5.953468e-09,2.581715
72,MP:0008539,decreased susceptibility to induced colitis,7.460242e-09,3.970646
77,MP:0008553,increased circulating tumor necrosis factor level,1.175295e-08,2.614738
81,MP:0002362,abnormal spleen marginal zone morphology,1.743614e-08,2.042762


In [65]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'GOCellularComponent').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
7,GO:0031232,extrinsic to external side of plasma membrane,4.1e-05,4.603166
15,GO:0042765,GPI-anchor transamidase complex,0.000176,6.187982
20,GO:0031931,TORC1 complex,0.000637,4.977601
21,GO:0001533,cornified envelope,0.000724,2.416596
22,GO:0042101,T cell receptor complex,0.000914,2.846063
25,GO:0002102,podosome,0.001429,2.162778
26,GO:0005763,mitochondrial small ribosomal subunit,0.001595,2.803861
33,GO:0005827,polar microtubule,0.003306,3.711101
39,GO:0000307,cyclin-dependent protein kinase holoenzyme com...,0.006348,2.007279
40,GO:0005680,anaphase-promoting complex,0.006797,2.436506


In [66]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'GOMolecularFunction').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
3,GO:0005086,ARF guanyl-nucleotide exchange factor activity,6.820259e-08,3.11655
6,GO:0042608,T cell receptor binding,2.245745e-07,6.447217
10,GO:0004198,calcium-dependent cysteine-type endopeptidase ...,8.887818e-06,3.326388
12,GO:0042834,peptidoglycan binding,2.477201e-05,6.08497
13,GO:0017112,Rab guanyl-nucleotide exchange factor activity,3.772063e-05,2.478562
18,GO:0015038,glutathione disulfide oxidoreductase activity,0.0001636602,6.261373
19,GO:0003923,GPI-anchor transamidase activity,0.0001757277,6.187982
21,GO:0050699,WW domain binding,0.0002788431,2.260392
24,GO:0004623,phospholipase A2 activity,0.0004727523,2.509656
26,GO:0005072,"transforming growth factor beta receptor, cyto...",0.0005129675,2.571396
