
## VDR chr2_33701 example -- Which component is important?

- The specific context of this variant and disease is described in this Google Doc:
  - https://docs.google.com/document/d/16GuSasXWX-5qwvKAX5-4VxtrbmsIu9UgrP311_viqQc/edit?usp=sharing
- This notebook shows how to:
  1. Given the SNP, identify which genomic bin contains the SNP
  1. Use the genomic bin squared cosine score to find the top 3 important components for the genomic bin
  1. Investigate each top component for the genomic bin
    - Use assay contribution scores to see which assays are important for the component
    - Use genomic bin contribution scores to see which other genomic bins are important for the component
    - Explore the results of the enrichment analysis


In [1]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib, collections, itertools, os, re, textwrap, logging, sys
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import matplotlib.patches as mpatches
from functools import reduce

from logging.config import dictConfig
from logging import getLogger

dictConfig(dict(
    version = 1,
    formatters = {'f': {'format': '%(asctime)s %(name)-12s %(levelname)-8s %(message)s'}},
    handlers = {
        'h': {'class': 'logging.StreamHandler','formatter': 'f',
              'level': logging.DEBUG}},
    root = {'handlers': ['h'], 'level': logging.DEBUG,},
))

matplotlib.rc('font',**{'size':16, 'family':'sans-serif','sans-serif':['HelveticaNeue', 'Helvetica']})

logger = getLogger('notebook')


In [2]:
repo_dir=os.path.realpath(
    os.path.dirname(os.path.dirname(os.getcwd()))
)


In [3]:
data_dir=os.path.realpath(
    os.path.join(os.path.dirname(os.getcwd()), 'private_data')
)

In [4]:
enrichment_data_dir=os.path.join(repo_dir, 'enrichment', 'private_data')


In [5]:
sys.path.append(os.path.join(repo_dir, 'enrichment', 'src'))
from great import read_great_res_wrapper


In [6]:
metadata_dir=os.path.join(
    repo_dir, 'metadata'
)
metadata = pd.read_table(
    os.path.join(metadata_dir, 'sample_antibody_map_v2_with_metadata.tsv'),
    sep='\t',
)

### Step 1: SNP to genomic bin


In [7]:
genomic_bin_df=pd.read_csv(
    os.path.join(repo_dir, 'enrichment', 'private_data', 'loci_def.bed'),
    names=['chr', 'chromStart', 'chromEnd', 'name'],
    sep='\t'
)

In [8]:
genomic_bin_df[genomic_bin_df['name'] == 'chr2_33701']

Unnamed: 0,chr,chromStart,chromEnd,name
40973,chr2,33701000,33702000,chr2_33701


This means the index of our genomic bin of interest is 40973.

In [9]:
genomic_bin_idx=40973
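
The lookup above starts from a bin name that is already known. A minimal sketch of going from a SNP coordinate to that name, assuming the bins in `loci_def.bed` are fixed 1 kb windows named `<chr>_<chromStart // 1000>` (as the `chr2_33701` row suggests) and that the position is 0-based like BED starts. The function `snp_to_bin_name`, the toy DataFrame, and the example position are hypothetical and only illustrate the idea.

```python
import pandas as pd

def snp_to_bin_name(chrom, pos, bin_size=1000):
    """Return the name of the fixed-width bin containing a 0-based position.

    Assumes bins are named '<chrom>_<chromStart // bin_size>'.
    """
    return '{}_{}'.format(chrom, pos // bin_size)

# Toy stand-in for genomic_bin_df (hypothetical rows, not the real file)
toy_bins = pd.DataFrame({
    'chr': ['chr2', 'chr2'],
    'chromStart': [33700000, 33701000],
    'chromEnd': [33701000, 33702000],
    'name': ['chr2_33700', 'chr2_33701'],
})

bin_name = snp_to_bin_name('chr2', 33701320)   # hypothetical SNP position
hit = toy_bins[toy_bins['name'] == bin_name]
toy_bin_idx = int(hit.index[0])                # row index of the matching bin
```

With the real `genomic_bin_df`, the same filter would return the row whose index is used as `genomic_bin_idx`.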

### Step 2: Which component is important for a given genomic bin -- genomic bin squared cosine score
- Let's write our decomposition as X = UDV' where X is the input feature matrix, D is the diagonal matrix of singular values, U is the matrix of left singular vectors (assay space), V is the matrix of right singular vectors (genomic bin space), and `'` denotes matrix transposition.
- The genomic bin squared cosine score is defined from the matrix product VD: each row (one genomic bin) is L2-normalized to Euclidean norm 1 and its entries are then squared, so the scores of a given genomic bin sum to 1 across components.
- The score represents the relative importance of each component for a given genomic bin.
- More formal definition:
  - https://docs.google.com/document/d/1YRuaIvHvjb_6SJwlml1dQDegiGlGbdfz_zN-5bneroE/edit?usp=sharing
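
The definition above can be illustrated on a toy matrix. This is only a sketch under assumptions: the matrix is random, the sizes (4 assays, 6 genomic bins) are made up, and `np.linalg.svd` stands in for whatever decomposition produced the score files read below.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(4, 6)                      # toy matrix: 4 assays x 6 genomic bins

# Thin SVD: X = U @ diag(d) @ Vt, so V = Vt.T has one row per genomic bin
U, d, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T

VD = V * d                              # same as V @ diag(d)

# Since X' U = V D, row j of VD holds the coordinates of bin j on each component
cos2 = VD ** 2 / np.sum(VD ** 2, axis=1, keepdims=True)

row_sums = cos2.sum(axis=1)             # one value per genomic bin
top_pc_per_bin = np.argmax(cos2, axis=1)
```

Each row of `cos2` sums to 1, so an entry can be read as the share of that bin's signal captured by one component.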
 

#### read the decomposed matrices

In [10]:
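def read_decomposed_matrix(filename, compression=None):
    # Infer gzip compression from the file extension unless given explicitly
    if (compression is None) and filename.endswith('.gz'):
        compression = 'gzip'
    # If filename is already an absolute path, os.path.join returns it unchanged
    df = pd.read_csv(
        os.path.join(data_dir, filename),
        compression=compression
    )
    # First column holds the row labels; the remaining columns hold the matrix.
    # `.values` replaces `as_matrix()`, which was removed in pandas 1.0.
    mat = df.iloc[:, 1:].values
    idx = df.iloc[:, 0].values
    return mat, idx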

In [11]:
d_mat_temp, d_idx = read_decomposed_matrix(os.path.join(data_dir, 'diagonalScore.csv.gz'))
d_vec = d_mat_temp[:, 0]


In [12]:
u_mat, u_idx = read_decomposed_matrix(os.path.join(data_dir, 'uScore.csv.gz'))


In [13]:
v_mat, v_idx = read_decomposed_matrix(os.path.join(data_dir, 'vScore.csv.gz'))


In [14]:
d_vec.shape, u_mat.shape, v_mat.shape, d_idx.shape, u_idx.shape, v_idx.shape

((652,), (652, 652), (379541, 652), (652,), (652,), (379541,))

#### compute matrix products, UD and VD

In [15]:
u_dot_d = np.dot(u_mat, np.diag(d_vec))


In [16]:
v_dot_d = np.dot(v_mat, np.diag(d_vec))


In [17]:
u_dot_d.shape, v_dot_d.shape

((652, 652), (379541, 652))
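
As a side note, multiplying by `np.diag(d_vec)` can also be written with broadcasting, which avoids building the dense diagonal matrix. A small sketch on toy arrays (the names `toy_v` and `toy_d` and their sizes are made up for illustration):

```python
import numpy as np

rng = np.random.RandomState(1)
toy_v = rng.rand(5, 3)                     # 5 bins x 3 components (toy sizes)
toy_d = np.array([3.0, 2.0, 1.0])          # toy singular values

via_diag = np.dot(toy_v, np.diag(toy_d))   # builds a 3 x 3 diagonal matrix
via_broadcast = toy_v * toy_d              # scales column k by toy_d[k]
```

Both forms give the same result; for the matrices above the broadcast form would be `v_mat * d_vec`.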

#### compute normalized matrices
- v_dot_d_find_pcs: genomic bin --> which PC? genomic bin squared cosine score.
- u_dot_d_find_pcs: assay       --> which PC? assay squared cosine score.
- v_dot_d_find_loci: PC --> which genomic bins? genomic bin contribution score.
- u_dot_d_find_assay: PC --> which assay? assay contribution score.

In [18]:
v_dot_d_find_pcs = (v_dot_d ** 2 ) / (np.sum(v_dot_d ** 2, axis = 1)[:,np.newaxis])


In [19]:
u_dot_d_find_pcs = (u_dot_d ** 2 ) / (np.sum(u_dot_d ** 2, axis = 1)[:,np.newaxis])


In [20]:
v_dot_d_find_loci = (v_dot_d ** 2 ) / (np.sum(v_dot_d ** 2, axis = 0)[np.newaxis, :])


In [21]:
u_dot_d_find_assay = (u_dot_d ** 2 ) / (np.sum(u_dot_d ** 2, axis = 0)[np.newaxis, :])


#### let's identify the top components for the genomic bin (top 5 shown; the top 3 are investigated below)

In [23]:
np.argsort(-v_dot_d_find_pcs[genomic_bin_idx, :])[:5]

array([ 0,  5,  2,  8, 39])

In [24]:
v_dot_d_find_pcs[genomic_bin_idx, np.argsort(-v_dot_d_find_pcs[genomic_bin_idx, :])[:5]]

array([0.17388874, 0.14292424, 0.07116638, 0.05649573, 0.02586899])

This means PC0 (0-based index) is the most important component for this bin, with 17.4% of the squared cosine score; PC5 is the second most important with 14.3%, followed by PC2 with 7.1%, and so on.
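
The index/value pairs above come from two separate `argsort` calls. One way to get both at once is a small helper; this is a sketch, and the name `top_k` and the toy scores are hypothetical.

```python
import numpy as np
import pandas as pd

def top_k(scores, k=5):
    """Return the k largest scores together with their 0-based indices."""
    scores = np.asarray(scores)
    order = np.argsort(-scores)[:k]        # indices sorted by descending score
    return pd.DataFrame({'index': order, 'score': scores[order]})

toy_scores = np.array([0.05, 0.40, 0.10, 0.30, 0.15])
top3 = top_k(toy_scores, k=3)
```

In this notebook it could be called as `top_k(v_dot_d_find_pcs[genomic_bin_idx, :])`.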

### Step 3: investigation of the components

#### PC0 (the top component)

We will investigate 

1. What assays are driving this component?
1. What genomic loci are driving this component?
1. What are the top hits in the enrichment analysis?

In [25]:
component_idx=0

#### what assays are driving this component?

In [26]:
np.argsort(-u_dot_d_find_assay[:, component_idx])[:5]

array([193, 192,  93, 194,  28])

In [27]:
u_dot_d_find_assay[np.argsort(-u_dot_d_find_assay[:, component_idx])[:5], component_idx]

array([0.49684131, 0.31340371, 0.05643114, 0.01414899, 0.00910366])

In [28]:
metadata.iloc[np.argsort(-u_dot_d_find_assay[:, component_idx])[:5], :]

Unnamed: 0,sample_number,antibody,Genome assembly,Antigen class,Antigen,Cell type class,Cell type,Cell type description,Processing logs,Title,...,age,treatment,genotype,lab,age.1,health state,cell_type,tissue_type,provider,sex
193,SRX106085,,hg19,Histone,H3K27me3,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"50863687,47.0,38.0,44910",GSM835575: WBS H3K27me3; Homo sapiens; ChIP-Seq,...,,,,,,,,,,
192,SRX106084,,hg19,Histone,H3K27me3,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"45470156,35.1,60.9,34413",GSM835574: SMS H3K27me3; Homo sapiens; ChIP-Seq,...,,,,,,,,,,
93,ERX329704,,hg19,Unclassified,Unclassified,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"53432856,37.9,67.7,21958",Illumina HiSeq 2000 sequencing; Coordinated ef...,...,,,,,,,NA11830,,,female
194,SRX106087,,hg19,Histone,H3K27me3,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"46755455,53.5,22.7,10811",GSM835577: Ctrl H3K27me3; Homo sapiens; ChIP-Seq,...,,,,,,,,,,
28,ERX329639,,hg19,Unclassified,Unclassified,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"48478749,83.7,14.8,38430",Illumina Genome Analyzer IIx sequencing; Coord...,...,,,,,,,NA12892,,,female


#### what genomic bins are driving this component?

In [29]:
np.argsort(-v_dot_d_find_loci[:, component_idx])[:5]

array([310943, 291154,  64617, 213360, 217066])

In [30]:
v_dot_d_find_loci[np.argsort(-v_dot_d_find_loci[:, component_idx])[:5], component_idx]

array([3.14733015e-05, 3.14360980e-05, 3.14233214e-05, 3.14025649e-05,
       3.14000381e-05])

These genomic bins are the most important for PC0. Note that the genomic bin contribution scores are very small compared to the assay contribution scores. This is expected because of the large number of genomic bins (379,541) in the whole-genome analysis.
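
To put the small values in perspective: the contribution scores of one component sum to 1 over all genomic bins, so a component that drew equally on every bin would give each bin a score of 1 / (number of bins). A short worked calculation using the numbers reported above (the variable names are illustrative):

```python
n_bins = 379541                  # number of rows of v_mat reported above
uniform_score = 1.0 / n_bins     # score per bin if all bins contributed equally

top_score = 3.14733015e-05       # largest PC0 contribution score reported above
fold_over_uniform = top_score / uniform_score
```

The top bin therefore scores roughly 12 times the uniform baseline, even though its absolute score looks tiny.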

#### where is our locus of interest in this ranking?

In [31]:
np.sum(v_dot_d_find_loci[:, component_idx] >= v_dot_d_find_loci[
    genomic_bin_idx, component_idx])

52379

In [32]:
np.sum(v_dot_d_find_loci[:, component_idx] >= v_dot_d_find_loci[
    genomic_bin_idx, component_idx]) / v_dot_d_find_loci.shape[0]

0.1380061706113437

It's roughly in the top 14 percent.
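
The same rank computation is repeated for each component below. A possible helper, shown here on toy data (the name `rank_from_top` is hypothetical):

```python
import numpy as np

def rank_from_top(scores, idx):
    """Count entries scoring >= scores[idx] (the entry itself included).

    Returns the count and the count divided by the number of entries.
    """
    scores = np.asarray(scores)
    rank = int(np.sum(scores >= scores[idx]))
    return rank, rank / float(scores.shape[0])

toy_scores = np.array([0.1, 0.4, 0.2, 0.3])
rank, frac = rank_from_top(toy_scores, 2)
```

For this notebook the call would be `rank_from_top(v_dot_d_find_loci[:, component_idx], genomic_bin_idx)`.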


#### results of the enrichment analysis

In [33]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'HumanPhenotypeOntology').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
3,HP:0012140,Abnormality of cells of the lymphoid lineage,3.637427e-07,2.315639
5,HP:0001888,Lymphopenia,5.56784e-07,2.382784
11,HP:0001878,Hemolytic anemia,1.152956e-06,2.008804
13,HP:0002917,Hypomagnesemia,3.079205e-06,3.452539
20,HP:0004921,Abnormality of magnesium homeostasis,5.371737e-05,2.814905
22,HP:0000121,Nephrocalcinosis,8.832524e-05,2.448755
24,HP:0200114,Metabolic alkalosis,0.0001073928,4.508408
25,HP:0002643,Neonatal respiratory distress,0.0001307868,3.743395
27,HP:0000360,Tinnitus,0.0001513635,2.679365
28,HP:0001281,Tetany,0.0001785555,2.937869


In [34]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'GOBiologicalProcess').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
19,GO:0080111,DNA demethylation,3.780717e-08,3.904828
20,GO:0046133,pyrimidine ribonucleoside catabolic process,4.080519e-08,6.249585
26,GO:0006703,estrogen biosynthetic process,9.438592e-08,3.843655
29,GO:0045916,negative regulation of complement activation,1.792544e-07,4.370029
35,GO:0033081,regulation of T cell differentiation in thymus,5.234933e-07,2.308571
40,GO:0002921,negative regulation of humoral immune response,7.788734e-07,3.955009
47,GO:0006369,termination of RNA polymerase II transcription,1.373182e-06,2.414571
48,GO:0008334,histone mRNA metabolic process,1.448484e-06,2.745745
66,GO:0006749,glutathione metabolic process,4.592398e-06,2.208914
71,GO:0010955,negative regulation of protein processing,5.448677e-06,2.330956


In [35]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'MGIPhenotype').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
26,MP:0001856,myocarditis,3e-06,2.899189
34,MP:0001870,salivary gland inflammation,7e-06,2.334535
36,MP:0004041,increased susceptibility to kidney reperfusion...,8e-06,4.074445
44,MP:0002389,abnormal Peyer's patch follicle morphology,1.8e-05,2.146357
60,MP:0002392,abnormal Peyer's patch T cell area morphology,6.7e-05,3.54165
62,MP:0008862,asymmetric snout,7e-05,2.759692
68,MP:0003452,abnormal parotid gland morphology,8.7e-05,3.455479
69,MP:0003407,abnormal central nervous system regeneration,8.8e-05,2.634864
72,MP:0009488,abnormal pancreatic islet cell apoptosis,9.5e-05,3.107736
88,MP:0002565,delayed circadian phase,0.000161,2.458035


In [36]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'MGIPhenoSingleKO').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
24,MP:0001856,myocarditis,3e-06,3.121352
29,MP:0004041,increased susceptibility to kidney reperfusion...,8e-06,4.074445
35,MP:0008126,increased dendritic cell number,1.7e-05,2.8822
59,MP:0008862,asymmetric snout,7e-05,2.759692
63,MP:0003452,abnormal parotid gland morphology,8.7e-05,3.455479
64,MP:0001870,salivary gland inflammation,9.2e-05,2.225581
65,MP:0003031,acidosis,9.5e-05,2.154495
81,MP:0002565,delayed circadian phase,0.000161,2.458035
84,MP:0000786,abnormal embryonic neuroepithelial layer diffe...,0.000176,2.571348
86,MP:0002389,abnormal Peyer's patch follicle morphology,0.000194,2.095309


In [37]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'GOCellularComponent').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
4,GO:0042611,MHC protein complex,1.528314e-09,4.026781
17,GO:0071556,integral to lumenal side of endoplasmic reticu...,5.172804e-06,3.113123
20,GO:0042612,MHC class I protein complex,9.299293e-06,3.61047
21,GO:0042613,MHC class II protein complex,2.079056e-05,3.733819
26,GO:0005761,mitochondrial ribosome,4.195153e-05,2.000761
28,GO:0012507,ER to Golgi transport vesicle membrane,5.810453e-05,2.100427
32,GO:0005763,mitochondrial small ribosomal subunit,0.0001794133,2.642842
33,GO:0005689,U12-type spliceosomal complex,0.0002281263,2.342155
36,GO:0005683,U7 snRNP,0.0003130327,3.936596
43,GO:0000421,autophagic vacuole membrane,0.0006123051,2.136283


In [38]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'GOMolecularFunction').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
0,GO:0004303,estradiol 17-beta-dehydrogenase activity,5.50848e-09,4.707856
1,GO:0003823,antigen binding,2.894243e-08,2.479716
2,GO:0001848,complement binding,6.37365e-08,5.244508
3,GO:0033764,"steroid dehydrogenase activity, acting on the ...",3.355615e-07,2.687094
5,GO:0016755,"transferase activity, transferring amino-acyl ...",1.664452e-06,2.462163
6,GO:0032395,MHC class II receptor activity,3.954467e-06,4.956089
8,GO:0042605,peptide antigen binding,5.772647e-06,3.000111
10,GO:0003746,translation elongation factor activity,8.282372e-06,2.717842
11,GO:0016229,steroid dehydrogenase activity,1.135737e-05,2.281044
14,GO:0017091,AU-rich element binding,1.549784e-05,2.895299


#### PC5 (the second component)


In [39]:
component_idx=5

#### what assays are driving this component?

In [40]:
np.argsort(-u_dot_d_find_assay[:, component_idx])[:5]

array([188, 433,  76, 434, 431])

In [41]:
u_dot_d_find_assay[np.argsort(-u_dot_d_find_assay[:, component_idx])[:5], component_idx]

array([0.1195496 , 0.08228496, 0.05328201, 0.03843322, 0.03278715])

In [42]:
metadata.iloc[np.argsort(-u_dot_d_find_assay[:, component_idx])[:5], :]

Unnamed: 0,sample_number,antibody,Genome assembly,Antigen class,Antigen,Cell type class,Cell type,Cell type description,Processing logs,Title,...,age,treatment,genotype,lab,age.1,health state,cell_type,tissue_type,provider,sex
188,SRX1027619,,hg19,No description,,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"804451442,97.3,44.1,118575",ChIP-seq of Homo sapiens: H3K4me3,...,,,,,,,LCLs,blood,Coriell,pooled male and female
433,SRX651491,,hg19,Histone,H3K4me3,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"166047212,93.9,6.4,63296",GSM1435515: LCL19238 H3K4me3; Homo sapiens; Ch...,...,,,,,,,,,,
76,ERX329687,,hg19,Unclassified,Unclassified,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"238094924,94.2,33.9,144042",Illumina HiSeq 2000 sequencing; Coordinated ef...,...,,,,,,,NA12878,,,female
434,SRX651492,,hg19,Histone,H3K4me1,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"112405175,94.2,1.6,121976",GSM1435516: LCL19239 H3K4me1; Homo sapiens; Ch...,...,,,,,,,,,,
431,SRX651470,,hg19,Histone,H3K4me3,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"117212731,95.4,4.0,47422",GSM1435517: LCL19239 H3K4me3; Homo sapiens; Ch...,...,,,,,,,,,,


#### what genomic bins are driving this component?

In [43]:
np.argsort(-v_dot_d_find_loci[:, component_idx])[:5]

array([282969, 130825, 187463, 337732, 233484])

In [44]:
v_dot_d_find_loci[np.argsort(-v_dot_d_find_loci[:, component_idx])[:5], component_idx]

array([8.04405926e-05, 6.46164381e-05, 6.45142013e-05, 6.39899652e-05,
       6.37951179e-05])

#### where is our locus of interest in this ranking?

In [45]:
np.sum(v_dot_d_find_loci[:, component_idx] >= v_dot_d_find_loci[genomic_bin_idx, component_idx])

10884

In [46]:
np.sum(v_dot_d_find_loci[:, component_idx] >= v_dot_d_find_loci[genomic_bin_idx, component_idx]) / v_dot_d_find_loci.shape[0]

0.028676743751004503

It's roughly in the top 3 percent.


#### results of the enrichment analysis

In [47]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'HumanPhenotypeOntology').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
0,HP:0002697,Parietal foramina,4.367679e-08,2.372433
1,HP:0004425,Flat forehead,1.568965e-06,2.73591
3,HP:0004442,Sagittal craniosynostosis,2.359962e-06,2.733096
4,HP:0002365,Hypoplasia of the brainstem,2.396735e-06,2.031767
5,HP:0010054,Abnormality of the first metatarsal,2.506458e-06,2.574793
7,HP:0006191,Deep palmar crease,4.189624e-06,2.97718
8,HP:0000557,Buphthalmos,4.706791e-06,2.287392
12,HP:0009836,Broad distal phalanx of finger,1.739019e-05,2.601329
18,HP:0003741,Congenital muscular dystrophy,3.098136e-05,2.094681
21,HP:0003535,3-Methylglutaconic aciduria,3.86184e-05,2.428799


In [48]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'GOBiologicalProcess').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
62,GO:0035089,establishment of apical/basal cell polarity,8e-06,2.6585
76,GO:0075733,intracellular transport of virus,1.2e-05,2.073909
100,GO:0051775,response to redox state,2.1e-05,2.764178
108,GO:0061162,establishment of monopolar cell polarity,2.5e-05,2.320837
113,GO:0070127,tRNA aminoacylation for mitochondrial protein ...,2.7e-05,4.11596
116,GO:0071624,positive regulation of granulocyte chemotaxis,2.9e-05,2.016287
126,GO:0090023,positive regulation of neutrophil chemotaxis,3.4e-05,2.0201
148,GO:0018401,peptidyl-proline hydroxylation to 4-hydroxy-L-...,5.6e-05,2.373342
161,GO:0001672,regulation of chromatin assembly or disassembly,7.2e-05,3.185283
164,GO:0006999,nuclear pore organization,7.7e-05,2.528835


In [49]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'MGIPhenotype').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
12,MP:0001771,abnormal circulating magnesium level,1.832349e-07,2.327122
19,MP:0003954,abnormal Reichert's membrane morphology,8.388949e-07,2.004437
53,MP:0002348,abnormal lymph node medulla morphology,9.682417e-06,2.903855
54,MP:0010092,increased circulating magnesium level,1.018919e-05,2.628006
59,MP:0009545,abnormal dermis papillary layer morphology,1.604188e-05,2.274394
60,MP:0010743,delayed suture closure,1.69713e-05,2.208579
65,MP:0001669,abnormal glucose absorption,2.245532e-05,2.618271
69,MP:0006210,abnormal orbit size,2.804297e-05,2.18163
77,MP:0002050,pheochromocytoma,3.830129e-05,2.589431
79,MP:0000666,decreased prostate gland duct number,4.3192e-05,2.700373


In [50]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'MGIPhenoSingleKO').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
5,MP:0012129,failure of blastocyst formation,2.360537e-09,2.005173
8,MP:0002663,failure to form blastocele,3.603307e-09,2.004262
23,MP:0003954,abnormal Reichert's membrane morphology,8.388949e-07,2.004437
24,MP:0001771,abnormal circulating magnesium level,1.21186e-06,2.284786
33,MP:0010743,delayed suture closure,3.083532e-06,2.694599
34,MP:0002050,pheochromocytoma,3.48313e-06,3.01081
44,MP:0002348,abnormal lymph node medulla morphology,9.682417e-06,2.903855
48,MP:0009545,abnormal dermis papillary layer morphology,1.604188e-05,2.274394
60,MP:0000275,heart hyperplasia,4.592075e-05,2.012848
69,MP:0002031,increased adrenal gland tumor incidence,6.052274e-05,2.408676


In [51]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'GOCellularComponent').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
5,GO:0005606,laminin-1 complex,1e-06,2.325389
7,GO:0043256,laminin complex,5e-06,2.148635
21,GO:0019031,viral envelope,0.00061,2.839336
48,GO:0031080,nuclear pore outer ring,0.002727,2.255517
54,GO:0005666,DNA-directed RNA polymerase III complex,0.004041,2.228519
58,GO:0019908,nuclear cyclin-dependent protein kinase holoen...,0.004746,2.188418
72,GO:0016461,unconventional myosin complex,0.007764,2.067299
81,GO:0005677,chromatin silencing complex,0.010734,2.11116
85,GO:0005736,DNA-directed RNA polymerase I complex,0.011807,2.494752
93,GO:0005851,eukaryotic translation initiation factor 2B co...,0.014343,2.793472


In [52]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'GOMolecularFunction').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
3,GO:0043022,ribosome binding,7.584495e-08,2.351977
11,GO:0031545,peptidyl-proline 4-dioxygenase activity,8.13874e-06,2.4667
14,GO:0043208,glycosphingolipid binding,2.462545e-05,2.49534
30,GO:0001056,RNA polymerase III activity,0.0003860479,3.131341
34,GO:0005007,fibroblast growth factor-activated receptor ac...,0.000521453,2.250279
48,GO:0032407,MutSalpha complex binding,0.001512241,2.575431
60,GO:0015377,cation:chloride symporter activity,0.002143232,2.088132
65,GO:0047499,calcium-independent phospholipase A2 activity,0.002303207,3.545742
69,GO:0032404,mismatch repair complex binding,0.002498749,2.155177
74,GO:0016421,CoA carboxylase activity,0.002932712,2.029746


#### PC2 (the third component)


In [53]:
component_idx=2

#### what assays are driving this component?

In [54]:
np.argsort(-u_dot_d_find_assay[:, component_idx])[:5]

array([305, 366, 336, 321, 351])

In [55]:
u_dot_d_find_assay[np.argsort(-u_dot_d_find_assay[:, component_idx])[:5], component_idx]

array([0.0767717 , 0.07214937, 0.06740654, 0.05664575, 0.04790769])

In [56]:
metadata.iloc[np.argsort(-u_dot_d_find_assay[:, component_idx])[:5], :]

Unnamed: 0,sample_number,antibody,Genome assembly,Antigen class,Antigen,Cell type class,Cell type,Cell type description,Processing logs,Title,...,age,treatment,genotype,lab,age.1,health state,cell_type,tissue_type,provider,sex
305,SRX356559,sa1,hg19,TFs and others,STAG1,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"30569210,98.7,45.3,46902",GSM1233991: GM18486 SA1 1; Homo sapiens; ChIP-Seq,...,,,,,,,18486,,,
366,SRX356783,sa1,hg19,TFs and others,STAG1,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"39045039,98.3,44.9,50318",GSM1234215: GM2630 SA1 2; Homo sapiens; ChIP-Seq,...,,,,,,,2630,,,
336,SRX356747,sa1,hg19,TFs and others,STAG1,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"33660922,97.3,43.8,47950",GSM1234179: GM2588 SA1 2; Homo sapiens; ChIP-Seq,...,,,,,,,2588,,,
321,SRX356729,sa1,hg19,TFs and others,STAG1,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"32218948,97.4,46.1,42241",GSM1234161: GM2255 SA1 2; Homo sapiens; ChIP-Seq,...,,,,,,,2255,,,
351,SRX356765,sa1,hg19,TFs and others,STAG1,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"28293163,97.6,37.8,45107",GSM1234197: GM2610 SA1 2; Homo sapiens; ChIP-Seq,...,,,,,,,2610,,,


#### what genomic bins are driving this component?

In [57]:
np.argsort(-v_dot_d_find_loci[:, component_idx])[:5]

array([298320, 298321,  54352, 114709,  46679])

In [58]:
v_dot_d_find_loci[np.argsort(-v_dot_d_find_loci[:, component_idx])[:5], component_idx]

array([4.89963650e-05, 4.77846741e-05, 4.77053255e-05, 4.73716488e-05,
       4.72138889e-05])

#### where is our locus of interest, chr2_33701 (index: 40973), in this ranking?

In [59]:
np.sum(v_dot_d_find_loci[:, component_idx] >= v_dot_d_find_loci[genomic_bin_idx, component_idx])

48874

In [60]:
np.sum(v_dot_d_find_loci[:, component_idx] >= v_dot_d_find_loci[genomic_bin_idx, component_idx]) / v_dot_d_find_loci.shape[0]

0.12877133168748567

It's roughly in the top 13 percent.


#### results of the enrichment analysis

In [61]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'HumanPhenotypeOntology').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
8,HP:0012103,Abnormality of the mitochondrion,1.981297e-08,2.430178
11,HP:0003287,Abnormality of mitochondrial metabolism,5.547277e-08,2.397021
16,HP:0010972,Anemia of inadequate production,1.467984e-07,2.084019
20,HP:0200042,Skin ulcer,2.149489e-07,2.02101
22,HP:0001581,Recurrent skin infections,3.26783e-07,2.644626
23,HP:0002665,Lymphoma,3.357733e-07,2.074043
27,HP:0005406,Recurrent bacterial skin infections,4.352149e-07,4.113836
37,HP:0006429,Broad femoral neck,2.049863e-06,7.213434
38,HP:0002722,Recurrent abscess formation,2.050315e-06,3.547911
43,HP:0001733,Pancreatitis,3.059544e-06,2.46457


In [62]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'GOBiologicalProcess').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
30,GO:0060333,interferon-gamma-mediated signaling pathway,9.010372e-11,2.559231
39,GO:0045047,protein targeting to ER,3.183959e-10,2.280882
44,GO:0071346,cellular response to interferon-gamma,5.468346e-10,2.15966
45,GO:0038096,Fc-gamma receptor signaling pathway involved i...,5.472179e-10,2.147846
47,GO:0002431,Fc receptor mediated stimulatory signaling pat...,5.672283e-10,2.146109
48,GO:0038094,Fc-gamma receptor signaling pathway,5.77281e-10,2.14526
57,GO:0072599,establishment of protein localization to endop...,2.336302e-09,2.173245
60,GO:0006614,SRP-dependent cotranslational protein targetin...,3.810641e-09,2.247171
65,GO:0015740,C4-dicarboxylate transport,7.328115e-09,4.436732
67,GO:0006613,cotranslational protein targeting to membrane,8.225698e-09,2.201283


In [63]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'MGIPhenotype').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
17,MP:0005153,abnormal B cell proliferation,3.181409e-20,2.120136
21,MP:0008217,abnormal B cell activation,3.4360360000000002e-18,2.012082
37,MP:0005093,decreased B cell proliferation,7.882232e-15,2.142833
41,MP:0008180,abnormal marginal zone B cell morphology,2.404651e-14,2.22924
45,MP:0008495,decreased IgG1 level,9.733961e-14,2.015448
46,MP:0001806,decreased IgM level,1.53064e-13,2.108597
50,MP:0002362,abnormal spleen marginal zone morphology,4.167194e-13,2.034597
56,MP:0008182,decreased marginal zone B cell number,1.844706e-12,2.315187
68,MP:0005154,increased B cell proliferation,1.483511e-10,2.42464
92,MP:0003303,peritoneal inflammation,1.156149e-08,3.763917


In [64]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'MGIPhenoSingleKO').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
29,MP:0005153,abnormal B cell proliferation,2.973987e-14,2.06846
40,MP:0005461,abnormal dendritic cell morphology,1.445606e-12,2.379902
42,MP:0001806,decreased IgM level,3.291375e-12,2.209544
53,MP:0005093,decreased B cell proliferation,1.718479e-10,2.145825
70,MP:0002418,increased susceptibility to viral infection,8.788483e-09,2.034574
72,MP:0000921,demyelination,1.427529e-08,2.140232
74,MP:0008125,abnormal dendritic cell number,1.474424e-08,2.382514
76,MP:0008088,abnormal T-helper 1 cell differentiation,2.115314e-08,3.146739
77,MP:0003303,peritoneal inflammation,2.189665e-08,4.346423
82,MP:0005154,increased B cell proliferation,4.557809e-08,2.279219


In [65]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'GOCellularComponent').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
9,GO:0005758,mitochondrial intermembrane space,4.658954e-08,2.924884
15,GO:0000307,cyclin-dependent protein kinase holoenzyme com...,1.66169e-06,2.979384
20,GO:0000788,nuclear nucleosome,3.649093e-06,6.005715
27,GO:0044798,nuclear transcription factor complex,1.326775e-05,2.009501
31,GO:0031095,platelet dense tubular network membrane,2.117509e-05,3.726373
32,GO:0000786,nucleosome,2.948377e-05,2.232604
34,GO:0031094,platelet dense tubular network,6.315071e-05,3.377614
35,GO:0005826,actomyosin contractile ring,6.359909e-05,4.808933
39,GO:0005761,mitochondrial ribosome,0.0001168505,2.227518
40,GO:0090544,BAF-type complex,0.0001311371,2.334366


In [66]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'GOMolecularFunction').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
4,GO:0008140,cAMP response element binding protein binding,2.734837e-09,6.534605
9,GO:0035673,oligopeptide transmembrane transporter activity,1.666693e-08,7.323736
11,GO:0015198,oligopeptide transporter activity,1.119388e-07,6.256376
13,GO:0004715,non-membrane spanning protein tyrosine kinase ...,4.323711e-07,2.190691
15,GO:0005313,L-glutamate transmembrane transporter activity,6.843716e-07,3.050932
16,GO:0015197,peptide transporter activity,7.138872e-07,4.969029
17,GO:0015556,C4-dicarboxylate transmembrane transporter act...,7.730462e-07,5.308462
19,GO:0005310,dicarboxylic acid transmembrane transporter ac...,1.089172e-06,2.631758
21,GO:0005154,epidermal growth factor receptor binding,1.510183e-06,2.286666
23,GO:0031996,thioesterase binding,1.584143e-06,3.249673
