
## VDR chr5_1315 example -- Which component is important?

- The specific context of this variant and disease is described in this Google Doc:
  - https://docs.google.com/document/d/16GuSasXWX-5qwvKAX5-4VxtrbmsIu9UgrP311_viqQc/edit?usp=sharing
- This notebook shows how to:
  1. Given the SNP, identify which genomic bin contains it
  1. Use the genomic bin squared cosine score to find the top 3 components for that genomic bin
  1. Investigate the top components for the genomic bin
    - Use assay contribution scores to see which assays are important for the component
    - Use genomic bin contribution scores to see which other genomic bins are important for the component
    - Explore the results of the enrichment analysis


In [1]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib, collections, itertools, os, re, textwrap, logging, sys
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import matplotlib.patches as mpatches
from functools import reduce

from logging.config import dictConfig
from logging import getLogger

dictConfig(dict(
    version = 1,
    formatters = {'f': {'format': '%(asctime)s %(name)-12s %(levelname)-8s %(message)s'}},
    handlers = {
        'h': {'class': 'logging.StreamHandler','formatter': 'f',
              'level': logging.DEBUG}},
    root = {'handlers': ['h'], 'level': logging.DEBUG,},
))

matplotlib.rc('font',**{'size':16, 'family':'sans-serif','sans-serif':['HelveticaNeue', 'Helvetica']})

logger = getLogger('notebook')


In [2]:
repo_dir=os.path.realpath(
    os.path.dirname(os.path.dirname(os.getcwd()))
)


In [3]:
data_dir=os.path.realpath(
    os.path.join(os.path.dirname(os.getcwd()), 'private_data')
)

In [4]:
enrichment_data_dir=os.path.join(repo_dir, 'enrichment', 'private_data')


In [6]:
sys.path.append(os.path.join(repo_dir, 'enrichment', 'src'))
from great import read_great_res_wrapper


In [7]:
metadata_dir=os.path.join(
    repo_dir, 'metadata'
)
metadata = pd.read_table(
    os.path.join(metadata_dir, 'sample_antibody_map_v2_with_metadata.tsv'),
    sep='\t',
)

### Step 1: SNP to genomic bin
- `rs4975616` lies on chr5, within chr5:1315000-1316000 (hg19)
  - https://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=4975616
- This means the corresponding bin is chr5_1315

In [8]:
genomic_bin_df=pd.read_csv(
    os.path.join(repo_dir, 'enrichment', 'private_data', 'loci_def.bed'),
    names=['chr', 'chromStart', 'chromEnd', 'name'],
    sep='\t'
)

In [9]:
genomic_bin_df[genomic_bin_df['name'] == 'chr5_1315']

Unnamed: 0,chr,chromStart,chromEnd,name
114953,chr5,1315000,1316000,chr5_1315


This means the index of the genomic bin of our interest is 114953
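The SNP-to-bin lookup in Step 1 can be sketched as a small helper. This is a minimal sketch that assumes bins are 1 kb wide and named `<chr>_<start // 1000>`, which is inferred from `chr5_1315` spanning 1315000-1316000. The helper names, the toy table, and the example position are hypothetical, and the position is treated as 0-based like the BED coordinates.

```python
import pandas as pd

BIN_SIZE = 1000  # assumed bin width in bp (inferred from chr5_1315 -> 1315000-1316000)


def snp_to_bin_name(chrom, pos, bin_size=BIN_SIZE):
    """Return the bin name '<chrom>_<pos // bin_size>' for a 0-based position."""
    return '{}_{}'.format(chrom, pos // bin_size)


def bin_row_index(bins, name):
    """Return the row label of the single bin called `name` in a loci_def-style table."""
    hits = bins.index[bins['name'] == name]
    if len(hits) != 1:
        raise ValueError('expected one bin named {!r}, found {}'.format(name, len(hits)))
    return hits[0]


# Toy stand-in for loci_def.bed (the real table is much larger).
toy_bins = pd.DataFrame({
    'chr': ['chr5', 'chr5', 'chr5'],
    'chromStart': [1314000, 1315000, 1316000],
    'chromEnd': [1315000, 1316000, 1317000],
    'name': ['chr5_1314', 'chr5_1315', 'chr5_1316'],
})

example_pos = 1315500  # illustrative position inside the bin, not the SNP's real coordinate
name = snp_to_bin_name('chr5', example_pos)
idx = bin_row_index(toy_bins, name)
print(name, idx)
```

Applied to the real `genomic_bin_df`, the second helper would return the same row label that the filter above shows.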

### Step 2: Which component is important for a given genomic bin -- genomic bin squared cosine score
- Let's write our decomposition as X = UDV', where X is the input feature matrix, D is the diagonal matrix of singular values, U is the left singular vector matrix (on assay space), V is the right singular vector matrix (on genomic bin space), and `'` denotes matrix transposition.
- The genomic bin squared cosine score is the element-wise square of the matrix product VD after each row has been L2-normalized (scaled to Euclidean norm 1), so the scores for any given genomic bin sum to 1.
- The score is interpreted as the relative importance of each component for a given genomic bin.
- More formal definition:
  - https://docs.google.com/document/d/1YRuaIvHvjb_6SJwlml1dQDegiGlGbdfz_zN-5bneroE/edit?usp=sharing
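The definition above can be checked on a toy matrix. This is a sketch only: it uses `numpy.linalg.svd` on random data rather than the stored score files, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(6, 40)  # toy input: 6 assays x 40 genomic bins

# Thin SVD: X = U @ diag(d) @ Vt, with V = Vt.T on genomic-bin space.
U, d, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T

VD = V * d  # same as V @ np.diag(d), via broadcasting
row_norm_sq = np.sum(VD ** 2, axis=1, keepdims=True)
cos2 = VD ** 2 / row_norm_sq  # squared cosine scores, one row per bin

# Scores for each bin sum to 1.
assert np.allclose(cos2.sum(axis=1), 1.0)
# The squared norm of a row of VD equals the squared norm of that bin's column in X,
# so each score is the share of the bin's total signal carried by one component.
assert np.allclose(row_norm_sq[:, 0], np.sum(X ** 2, axis=0))
```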
 

#### read the decomposed matrices

In [10]:
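def read_decomposed_matrix(filename, compression=None):
    """Read a score CSV; return (values, row labels taken from the first column)."""
    # Treat *.gz files as gzip unless the caller says otherwise.
    if compression is None and filename.endswith('.gz'):
        compression = 'gzip'
    # os.path.join leaves `filename` unchanged when it is already an absolute path.
    df = pd.read_csv(
        os.path.join(data_dir, filename),
        compression=compression
    )
    # DataFrame.as_matrix() was removed in pandas 1.0; .values works in old and new versions.
    mat = df.iloc[:, 1:].values
    idx = df.iloc[:, 0].values
    return mat, idx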

In [11]:
d_mat_temp, d_idx = read_decomposed_matrix(os.path.join(data_dir, 'diagonalScore.csv.gz'))
d_vec = d_mat_temp[:, 0]


In [12]:
u_mat, u_idx = read_decomposed_matrix(os.path.join(data_dir, 'uScore.csv.gz'))


In [13]:
v_mat, v_idx = read_decomposed_matrix(os.path.join(data_dir, 'vScore.csv.gz'))


In [14]:
d_vec.shape, u_mat.shape, v_mat.shape, d_idx.shape, u_idx.shape, v_idx.shape

((652,), (652, 652), (379541, 652), (652,), (652,), (379541,))

#### compute matrix products, UD and VD

In [15]:
u_dot_d = np.dot(u_mat, np.diag(d_vec))


In [16]:
v_dot_d = np.dot(v_mat, np.diag(d_vec))


In [17]:
u_dot_d.shape, v_dot_d.shape

((652, 652), (379541, 652))
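As a side note, multiplying by `np.diag(d_vec)` builds a full diagonal matrix first. A lighter equivalent is to let broadcasting scale each column. The sketch below checks the equivalence on toy arrays (`v_toy` and `d_toy` are stand-ins, not the real data).

```python
import numpy as np

rng = np.random.RandomState(1)
v_toy = rng.randn(1000, 8)   # stand-in for v_mat
d_toy = rng.rand(8) + 0.5    # stand-in for d_vec

via_diag = np.dot(v_toy, np.diag(d_toy))  # form used in the cells above
via_broadcast = v_toy * d_toy             # scales column k by d_toy[k]

assert np.allclose(via_diag, via_broadcast)
```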

#### compute normalized matrices
- v_dot_d_find_pcs: genomic bin --> which PC? genomic bin squared cosine score.
- u_dot_d_find_pcs: assay       --> which PC? assay squared cosine score.
- v_dot_d_find_loci: PC --> which genomic bins? genomic bin contribution score.
- u_dot_d_find_assay: PC --> which assay? assay contribution score.

In [18]:
v_dot_d_find_pcs = (v_dot_d ** 2 ) / (np.sum(v_dot_d ** 2, axis = 1)[:,np.newaxis])


In [19]:
u_dot_d_find_pcs = (u_dot_d ** 2 ) / (np.sum(u_dot_d ** 2, axis = 1)[:,np.newaxis])


In [20]:
v_dot_d_find_loci = (v_dot_d ** 2 ) / (np.sum(v_dot_d ** 2, axis = 0)[np.newaxis, :])


In [21]:
u_dot_d_find_assay = (u_dot_d ** 2 ) / (np.sum(u_dot_d ** 2, axis = 0)[np.newaxis, :])
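The column-normalized (contribution) scores have a simple closed form when the singular vectors are orthonormal: the singular value cancels and the score is just the squared entry of V. The sketch below verifies this on a toy SVD. It assumes orthonormal columns, which this notebook does not check for the stored `vScore` file.

```python
import numpy as np

rng = np.random.RandomState(2)
X = rng.randn(5, 30)  # toy input: 5 assays x 30 genomic bins
U, d, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T
VD = V * d

# Contribution score: square, then normalize each column (component) to sum to 1.
contribution = VD ** 2 / np.sum(VD ** 2, axis=0)[np.newaxis, :]

assert np.allclose(contribution.sum(axis=0), 1.0)
# With unit-norm columns in V, d_k**2 cancels and the score equals V**2.
assert np.allclose(contribution, V ** 2)
```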


#### let's identify the top components for the genomic bin chr5_1315 (index: 114953)

In [22]:
genomic_bin_idx=114953

In [23]:
np.argsort(-v_dot_d_find_pcs[genomic_bin_idx, :])[:5]

array([16,  7, 14, 59, 38])

In [25]:
v_dot_d_find_pcs[genomic_bin_idx, np.argsort(-v_dot_d_find_pcs[genomic_bin_idx, :])[:5]]

array([0.06951082, 0.06718931, 0.04944998, 0.04625181, 0.03272474])

This means PC16 (0-based index) is the most important component for this bin, with 7.0% of the squared cosine score; PC7 is the second most important with 6.7%, and PC14 is the third with 4.9%.
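The two cells above repeat the same `argsort` expression. A small helper that returns indices and scores together keeps them in sync. `top_k` is a hypothetical name used only for this sketch.

```python
import numpy as np


def top_k(scores, k=3):
    """Return (indices, values) of the k largest entries, largest first."""
    scores = np.asarray(scores)
    order = np.argsort(-scores)[:k]
    return order, scores[order]


toy = np.array([0.10, 0.40, 0.05, 0.30, 0.15])
idx, vals = top_k(toy, k=3)
print(idx, vals)
```

In this notebook the call would look like `top_k(v_dot_d_find_pcs[genomic_bin_idx, :], k=5)`.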

### Step 3: investigation of the components

#### PC16 (the top component)

We will investigate 

1. What assays are driving this component?
1. What genomic loci are driving this component?
1. What are the top hits in the enrichment analysis?

In [26]:
component_idx=16

#### what assays are driving this component?

In [27]:
np.argsort(-u_dot_d_find_assay[:, component_idx])[:5]

array([188,  40, 435, 433,  79])

In [28]:
u_dot_d_find_assay[np.argsort(-u_dot_d_find_assay[:, component_idx])[:5], component_idx]

array([0.15408894, 0.12973887, 0.06268478, 0.04890106, 0.04715601])

The assays at these row indices account for 15.4%, 13.0%, etc. of the *assay contribution score* for this component. The next cell looks the indices up in the `metadata` table, which is assumed to have the same row order as `u_mat`.

In [29]:
metadata.iloc[np.argsort(-u_dot_d_find_assay[:, component_idx])[:5], :]

Unnamed: 0,sample_number,antibody,Genome assembly,Antigen class,Antigen,Cell type class,Cell type,Cell type description,Processing logs,Title,...,age,treatment,genotype,lab,age.1,health state,cell_type,tissue_type,provider,sex
188,SRX1027619,,hg19,No description,,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"804451442,97.3,44.1,118575",ChIP-seq of Homo sapiens: H3K4me3,...,,,,,,,LCLs,blood,Coriell,pooled male and female
40,ERX329651,,hg19,Unclassified,Unclassified,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"288209714,54.2,20.3,101059",Illumina HiSeq 2000 sequencing; Coordinated ef...,...,,,,,,,NA12891,,,male
435,SRX651493,,hg19,Histone,H3K4me1,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"141491413,92.6,2.4,104065",GSM1435518: LCL19240 H3K4me1; Homo sapiens; Ch...,...,,,,,,,,,,
433,SRX651491,,hg19,Histone,H3K4me3,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"166047212,93.9,6.4,63296",GSM1435515: LCL19238 H3K4me3; Homo sapiens; Ch...,...,,,,,,,,,,
79,ERX329690,,hg19,Unclassified,Unclassified,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"292385011,44.1,11.4,121864",Illumina HiSeq 2000 sequencing; Coordinated ef...,...,,,,,,,NA12892,,,female
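To read the scores next to the assay annotations, the lookup can be wrapped in a helper that appends the score as a column. This is a sketch under the same assumption the notebook makes, namely that row `i` of `metadata` describes row `i` of `u_mat`. The function name and the toy data are illustrative.

```python
import numpy as np
import pandas as pd


def top_assays(metadata, assay_scores, component, k=5, columns=None):
    """Return the k metadata rows with the largest score for `component`."""
    order = np.argsort(-assay_scores[:, component])[:k]
    out = metadata.iloc[order]
    if columns is not None:
        out = out[columns]
    out = out.copy()  # avoid modifying the caller's table
    out['contribution'] = assay_scores[order, component]
    return out


toy_meta = pd.DataFrame({
    'sample_number': ['A', 'B', 'C', 'D'],
    'Antigen': ['H3K4me3', 'H3K4me1', 'H3K27ac', 'H3K4me3'],
})
toy_scores = np.array([[0.1, 0.6], [0.2, 0.1], [0.3, 0.2], [0.4, 0.1]])

res = top_assays(toy_meta, toy_scores, component=0, k=2,
                 columns=['sample_number', 'Antigen'])
print(res)
```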


#### what genomic bins are driving this component?

In [30]:
np.argsort(-v_dot_d_find_loci[:, component_idx])[:5]

array([354062,   7567, 237396, 222941, 213572])

In [31]:
v_dot_d_find_loci[np.argsort(-v_dot_d_find_loci[:, component_idx])[:5], component_idx]

array([8.59789418e-05, 8.45574660e-05, 8.31897290e-05, 8.28245542e-05,
       8.27984345e-05])

These genomic bins are important for PC16. Note that the genomic bin contribution scores are very small compared to the assay contribution scores. This is expected because each component's scores sum to 1 and are spread over 379,541 genomic bins rather than 652 assays.

#### where is our locus of interest, chr5_1315 (index: 114953), in this ranking?

In [33]:
np.sum(v_dot_d_find_loci[:, component_idx] >= v_dot_d_find_loci[
    genomic_bin_idx, component_idx])

9896

In [34]:
np.sum(v_dot_d_find_loci[:, component_idx] >= v_dot_d_find_loci[
    genomic_bin_idx, component_idx]) / v_dot_d_find_loci.shape[0]

0.026073599426675906

The bin ranks 9,896th of 379,541, i.e. within roughly the top 3% of genomic bins for this component.
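The rank and its fraction are computed the same way for every component, so they can be sketched as one helper. `rank_from_top` is a hypothetical name. Ties are counted in the rank, as in the cells above.

```python
import numpy as np


def rank_from_top(scores, i):
    """Return (rank, fraction): how many entries score at least as high as entry i."""
    scores = np.asarray(scores)
    rank = int(np.sum(scores >= scores[i]))
    return rank, rank / scores.shape[0]


toy = np.array([0.05, 0.20, 0.10, 0.40, 0.25])
rank, frac = rank_from_top(toy, 1)
print(rank, frac)
```

Here the call would be `rank_from_top(v_dot_d_find_loci[:, component_idx], genomic_bin_idx)`.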


#### results of the enrichment analysis

In [35]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'HumanPhenotypeOntology').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
2,HP:0001803,Nail pits,2.891962e-09,6.075334
4,HP:0001805,Thick nail,6.483125e-09,5.133376
5,HP:0000964,Eczema,8.879538e-09,2.167302
6,HP:0004742,Abnormality of the renal collecting system,2.02927e-08,4.037247
9,HP:0000081,Duplicated collecting system,3.446247e-08,4.870098
10,HP:0002209,Sparse scalp hair,4.06413e-08,2.750177
11,HP:0008404,Nail dystrophy,4.449655e-08,2.82949
12,HP:0010515,Aplasia/Hypoplasia of the thymus,5.363755e-08,2.762773
14,HP:0000968,Ectodermal dysplasia,9.283266e-08,5.855273
17,HP:0004050,Absent hand,1.221419e-07,3.540401
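The remaining cells call the reader once per ontology. The loop below sketches how those calls could be gathered into a single table. It takes the reader as an argument and is exercised with a fake reader, because the exact behavior of `read_great_res_wrapper` is not shown in this notebook. Only the call order `(data_dir, component, ontology)` and the column names are taken from the cells and outputs here.

```python
import pandas as pd

ONTOLOGIES = [
    'HumanPhenotypeOntology', 'GOBiologicalProcess', 'MGIPhenotype',
    'MGIPhenoSingleKO', 'GOCellularComponent', 'GOMolecularFunction',
]


def collect_enrichment(reader, data_dir, component, ontologies=ONTOLOGIES, n=10):
    """Stack the top-n rows from each ontology, tagged with the ontology name."""
    frames = []
    for onto in ontologies:
        top = reader(data_dir, component, onto).head(n).copy()
        top.insert(0, 'ontology', onto)
        frames.append(top)
    return pd.concat(frames, ignore_index=True)


def fake_reader(data_dir, component, ontology):
    """Hypothetical stand-in that mimics the columns shown in the outputs."""
    return pd.DataFrame({
        '# ID': ['X:1', 'X:2', 'X:3'],
        'Desc': ['term one', 'term two', 'term three'],
        'BPval': [1e-6, 1e-4, 1e-2],
        'BFold': [3.0, 2.5, 2.1],
    })


summary = collect_enrichment(fake_reader, 'unused', 16, n=2)
print(summary.shape)
```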


In [36]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'GOBiologicalProcess').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
19,GO:0030512,negative regulation of transforming growth fac...,2.540435e-12,2.395855
41,GO:0002251,organ or tissue specific immune response,9.475817e-11,6.223745
76,GO:0051402,neuron apoptotic process,2.882242e-09,2.85782
84,GO:0030888,regulation of B cell proliferation,5.604318e-09,2.138877
100,GO:0046634,regulation of alpha-beta T cell activation,1.860626e-08,2.007563
106,GO:0070997,neuron death,3.0028e-08,2.421651
112,GO:0048661,positive regulation of smooth muscle cell prol...,3.957243e-08,2.289168
118,GO:0050871,positive regulation of B cell activation,6.246529e-08,2.055533
131,GO:0045582,positive regulation of T cell differentiation,1.212096e-07,2.006075
146,GO:0046635,positive regulation of alpha-beta T cell activ...,2.261917e-07,2.056787


In [37]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'MGIPhenotype').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
28,MP:0005087,decreased acute inflammation,1.644947e-16,2.117612
58,MP:0008734,decreased susceptibility to endotoxin shock,2.276543e-12,2.593534
64,MP:0010365,increased thymus tumor incidence,7.09107e-12,2.017243
70,MP:0003453,abnormal keratinocyte physiology,2.959545e-11,2.119364
73,MP:0001825,arrested T cell differentiation,5.264142e-11,2.134585
78,MP:0009788,increased susceptibility to bacterial infectio...,8.36341e-11,2.440007
83,MP:0008076,abnormal CD4-positive T cell differentiation,1.815346e-10,2.082001
86,MP:0009582,abnormal keratinocyte proliferation,2.818949e-10,2.368072
95,MP:0005567,decreased circulating total protein level,1.143115e-09,3.045056
97,MP:0002026,leukemia,1.724395e-09,2.103261


In [38]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'MGIPhenoSingleKO').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
42,MP:0000702,enlarged lymph nodes,2.439737e-12,2.125303
56,MP:0009788,increased susceptibility to bacterial infectio...,9.431732e-11,2.579022
58,MP:0002497,increased IgE level,2.202587e-10,2.499897
63,MP:0005567,decreased circulating total protein level,6.75446e-10,3.102436
71,MP:0011888,abnormal circulating total protein level,1.965945e-09,2.815209
74,MP:0009582,abnormal keratinocyte proliferation,2.524284e-09,3.171219
78,MP:0003453,abnormal keratinocyte physiology,4.158586e-09,2.407992
79,MP:0005213,gastric metaplasia,4.829755e-09,4.058559
87,MP:0001241,absent epidermis stratum corneum,1.176606e-08,5.220477
88,MP:0002254,reproductive system inflammation,1.664196e-08,4.08098


In [39]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'GOCellularComponent').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
8,GO:0000932,cytoplasmic mRNA processing body,2e-06,2.260428
10,GO:0016281,eukaryotic translation initiation factor 4F co...,2e-06,5.205518
11,GO:0031083,BLOC-1 complex,2e-06,3.999044
17,GO:0071565,nBAF complex,9e-06,3.095533
21,GO:0090544,BAF-type complex,3.2e-05,2.501544
24,GO:0031082,BLOC complex,8.6e-05,2.89802
26,GO:0031080,nuclear pore outer ring,0.000146,4.33877
29,GO:0034364,high-density lipoprotein particle,0.000281,3.074285
34,GO:0001527,microfibril,0.000858,2.62761
39,GO:0043205,fibril,0.001034,2.278649


In [40]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'GOMolecularFunction').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
8,GO:0019955,cytokine binding,4.168036e-08,2.041966
9,GO:0004950,chemokine receptor activity,4.486104e-08,3.42057
12,GO:0005160,transforming growth factor beta receptor binding,2.741073e-07,2.664701
22,GO:0004707,MAP kinase activity,2.718276e-06,3.480268
25,GO:0035184,histone threonine kinase activity,5.917467e-06,5.179741
29,GO:0005138,interleukin-6 receptor binding,1.401856e-05,4.101923
33,GO:0008432,JUN kinase binding,1.661109e-05,5.654611
34,GO:0005114,type II transforming growth factor beta recept...,1.769469e-05,3.428169
36,GO:0034713,type I transforming growth factor beta recepto...,2.220813e-05,2.908375
38,GO:0003823,antigen binding,2.743714e-05,2.429832


#### PC7 (the second component)


In [47]:
component_idx=7

#### what assays are driving this component?

In [42]:
np.argsort(-u_dot_d_find_assay[:, component_idx])[:5]

array([188, 441, 433, 434, 439])

In [43]:
u_dot_d_find_assay[np.argsort(-u_dot_d_find_assay[:, component_idx])[:5], component_idx]

array([0.09254222, 0.08990549, 0.08470579, 0.06732091, 0.04859752])

In [44]:
metadata.iloc[np.argsort(-u_dot_d_find_assay[:, component_idx])[:5], :]

Unnamed: 0,sample_number,antibody,Genome assembly,Antigen class,Antigen,Cell type class,Cell type,Cell type description,Processing logs,Title,...,age,treatment,genotype,lab,age.1,health state,cell_type,tissue_type,provider,sex
188,SRX1027619,,hg19,No description,,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"804451442,97.3,44.1,118575",ChIP-seq of Homo sapiens: H3K4me3,...,,,,,,,LCLs,blood,Coriell,pooled male and female
441,SRX651499,,hg19,Histone,H3K4me1,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"91066702,98.3,1.5,71137",GSM1435524: LCL12892 H3K4me1; Homo sapiens; Ch...,...,,,,,,,,,,
433,SRX651491,,hg19,Histone,H3K4me3,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"166047212,93.9,6.4,63296",GSM1435515: LCL19238 H3K4me3; Homo sapiens; Ch...,...,,,,,,,,,,
434,SRX651492,,hg19,Histone,H3K4me1,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"112405175,94.2,1.6,121976",GSM1435516: LCL19239 H3K4me1; Homo sapiens; Ch...,...,,,,,,,,,,
439,SRX651497,,hg19,Histone,H3K4me1,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"95546628,97.9,1.9,63759",GSM1435522: LCL12891 H3K4me1; Homo sapiens; Ch...,...,,,,,,,,,,


#### what genomic bins are driving this component?

In [48]:
np.argsort(-v_dot_d_find_loci[:, component_idx])[:5]

array([249953, 224313, 330510, 273660, 294763])

In [49]:
v_dot_d_find_loci[np.argsort(-v_dot_d_find_loci[:, component_idx])[:5], component_idx]

array([6.58942556e-05, 6.22829820e-05, 6.12494759e-05, 6.06528861e-05,
       5.92396118e-05])

#### where is our locus of interest, chr5_1315 (index: 114953), in this ranking?

In [50]:
np.sum(v_dot_d_find_loci[:, component_idx] >= v_dot_d_find_loci[genomic_bin_idx, component_idx])

22749

In [51]:
np.sum(v_dot_d_find_loci[:, component_idx] >= v_dot_d_find_loci[genomic_bin_idx, component_idx]) / v_dot_d_find_loci.shape[0]

0.05993818849610451

The bin ranks 22,749th of 379,541, i.e. within roughly the top 6% of genomic bins for this component.


#### results of the enrichment analysis

In [52]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'HumanPhenotypeOntology').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
1,HP:0010537,Wide cranial sutures,0.000546,2.153747
2,HP:0007648,Punctate cataract,0.000557,2.641658
3,HP:0004492,Widely patent fontanelles and sutures,0.000622,2.173684
5,HP:0000894,Short clavicles,0.001024,2.020724
7,HP:0100720,Hypoplasia of the ear cartilage,0.001549,2.317628
8,HP:0000064,Hypoplastic labia minora,0.001655,2.302681
13,HP:0000851,Congenital hypothyroidism,0.00223,2.302182
15,HP:0001194,Abnormalities of placenta and umbilical cord,0.003565,2.080636
16,HP:0003724,Shoulder girdle muscle atrophy,0.004261,2.945603
17,HP:0012056,Cutaneous melanoma,0.004444,2.586142


In [53]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'GOBiologicalProcess').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
2,GO:0032211,negative regulation of telomere maintenance vi...,3.1e-05,3.811605
5,GO:1900746,regulation of vascular endothelial growth fact...,5.4e-05,3.840211
6,GO:0007004,telomere maintenance via telomerase,8.3e-05,3.018872
7,GO:0051974,negative regulation of telomerase activity,8.7e-05,3.13106
8,GO:0032210,regulation of telomere maintenance via telomerase,0.000168,2.840701
9,GO:0045663,positive regulation of myoblast differentiation,0.000267,2.001179
11,GO:0032205,negative regulation of telomere maintenance,0.000299,2.700491
12,GO:0016233,telomere capping,0.000395,3.819518
13,GO:0006278,RNA-dependent DNA replication,0.000426,2.460045
14,GO:0032202,telomere assembly,0.000491,3.711348


In [54]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'MGIPhenotype').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
0,MP:0008453,decreased retinal rod cell number,0.00012,2.574294
5,MP:0009189,abnormal pancreatic epsilon cell morphology,0.000734,2.289573
7,MP:0006290,proboscis,0.000759,2.140978
9,MP:0005229,abnormal intervertebral disk development,0.000777,2.097987
31,MP:0009014,prolonged proestrus,0.002114,2.480344
52,MP:0001238,thin epidermis stratum spinosum,0.004226,2.377468
53,MP:0011016,increased core body temperature,0.004271,2.213799
60,MP:0010939,abnormal mandibular prominence morphology,0.004729,2.188475
63,MP:0006197,ocular hypotelorism,0.004913,2.117124
64,MP:0003701,elevated level of mitotic sister chromatid exc...,0.004945,2.54815


In [55]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'MGIPhenoSingleKO').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
1,MP:0004122,abnormal sinus arrhythmia,0.000177,2.439509
2,MP:0009189,abnormal pancreatic epsilon cell morphology,0.000203,2.699333
12,MP:0003155,abnormal telomere length,0.001356,2.42164
14,MP:0009175,abnormal pancreatic beta cell differentiation,0.001372,2.222328
18,MP:0000869,abnormal cerebellum posterior vermis morphology,0.001503,2.15352
25,MP:0012055,abnormal phrenic nerve innervation pattern to ...,0.00186,3.345969
27,MP:0008727,enlarged heart right atrium,0.002134,2.854915
28,MP:0009957,abnormal cerebellum vermis lobule morphology,0.002209,2.127916
32,MP:0002704,tubular nephritis,0.002621,2.263936
36,MP:0010207,abnormal telomere morphology,0.003135,2.160905


In [56]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'GOCellularComponent').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
0,GO:0030126,COPI vesicle coat,0.000686,2.68802
1,GO:0000783,nuclear telomere cap complex,0.000691,2.800919
2,GO:0005838,proteasome regulatory particle,0.000739,3.075104
3,GO:0030663,COPI-coated vesicle membrane,0.000841,2.633313
5,GO:0005869,dynactin complex,0.00134,5.11206
8,GO:0030137,COPI-coated vesicle,0.002135,2.312546
10,GO:0022624,proteasome accessory complex,0.004053,2.301945
11,GO:0000145,exocyst,0.005402,2.042295
13,GO:0000780,"condensed nuclear chromosome, centromeric region",0.007627,2.070883
22,GO:0005638,lamin filament,0.020245,3.257748


In [57]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'GOMolecularFunction').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
2,GO:0004716,receptor signaling protein tyrosine kinase act...,0.004002,2.011743
3,GO:0004364,glutathione transferase activity,0.00441,2.466547
4,GO:0004185,serine-type carboxypeptidase activity,0.005957,2.792532
5,GO:0030957,Tat protein binding,0.008178,2.18717
9,GO:0005172,vascular endothelial growth factor receptor bi...,0.011852,2.157702
11,GO:0004602,glutathione peroxidase activity,0.013293,2.208935
13,GO:0004952,dopamine neurotransmitter receptor activity,0.014252,2.415118
14,GO:0043175,RNA polymerase core enzyme binding,0.014268,2.10153
15,GO:0043546,molybdopterin cofactor binding,0.014603,2.782694
23,GO:0004784,superoxide dismutase activity,0.020976,2.581173


#### PC14 (the third component)


In [58]:
component_idx=14

#### what assays are driving this component?

In [59]:
np.argsort(-u_dot_d_find_assay[:, component_idx])[:5]

array([341, 340, 188, 399, 402])

In [60]:
u_dot_d_find_assay[np.argsort(-u_dot_d_find_assay[:, component_idx])[:5], component_idx]

array([0.05683107, 0.05160962, 0.04877238, 0.03051568, 0.02483197])

In [61]:
metadata.iloc[np.argsort(-u_dot_d_find_assay[:, component_idx])[:5], :]

Unnamed: 0,sample_number,antibody,Genome assembly,Antigen class,Antigen,Cell type class,Cell type,Cell type description,Processing logs,Title,...,age,treatment,genotype,lab,age.1,health state,cell_type,tissue_type,provider,sex
341,SRX356752,h3k27ac,hg19,Histone,H3K27ac,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"32072974,98.1,8.2,49893",GSM1234184: GM2610 H3K27Ac 2; Homo sapiens; Ch...,...,,,,,,,2610,,,
340,SRX356751,h3k27ac,hg19,Histone,H3K27ac,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"32977761,98.4,6.2,50931",GSM1234183: GM2610 H3K27Ac 1; Homo sapiens; Ch...,...,,,,,,,2610,,,
188,SRX1027619,,hg19,No description,,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"804451442,97.3,44.1,118575",ChIP-seq of Homo sapiens: H3K4me3,...,,,,,,,LCLs,blood,Coriell,pooled male and female
399,SRX627245,h3k27ac,hg19,Histone,H3K27ac,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"167188190,92.8,40.1,66610",GSM1420871: GM18516 H3K27ac; Homo sapiens; ChI...,...,,,,,,,GM18516,,Coriell Cell Repositories http://ccr.coriell.o...,
402,SRX627248,h3k27ac,hg19,Histone,H3K27ac,Blood,Lymphoblastoid cell line,Tissue=blood|Lineage=mesoderm|Description=pare...,"174144959,91.5,53.9,64450",GSM1420874: GM18523 H3K27ac; Homo sapiens; ChI...,...,,,,,,,GM18523,,Coriell Cell Repositories http://ccr.coriell.o...,


#### what genomic bins are driving this component?

In [62]:
np.argsort(-v_dot_d_find_loci[:, component_idx])[:5]

array([271425, 263298, 186225, 131148, 107953])

In [63]:
v_dot_d_find_loci[np.argsort(-v_dot_d_find_loci[:, component_idx])[:5], component_idx]

array([0.00013288, 0.00013214, 0.00013037, 0.00012646, 0.00012551])

#### where is our locus of interest, chr5_1315 (index: 114953), in this ranking?

In [64]:
np.sum(v_dot_d_find_loci[:, component_idx] >= v_dot_d_find_loci[genomic_bin_idx, component_idx])

18496

In [65]:
np.sum(v_dot_d_find_loci[:, component_idx] >= v_dot_d_find_loci[genomic_bin_idx, component_idx]) / v_dot_d_find_loci.shape[0]

0.04873254799876693

The bin ranks 18,496th of 379,541, i.e. within roughly the top 5% of genomic bins for this component.


#### results of the enrichment analysis

In [66]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'HumanPhenotypeOntology').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
1,HP:0000878,11 pairs of ribs,2.8e-05,3.050731
2,HP:0000921,Missing ribs,2.9e-05,2.266
4,HP:0010758,Abnormality of the premaxilla,0.000106,3.827927
5,HP:0004599,Absent or minimally ossified vertebral bodies,0.000115,2.516795
8,HP:0008921,Neonatal short-limb short stature,0.000159,2.854712
9,HP:0100569,Abnormal vertebral ossification,0.000159,2.261137
10,HP:0010182,Abnormality of the distal phalanges of the toes,0.00017,2.837249
11,HP:0006714,Aplasia/Hypoplasia of the sternum,0.000192,2.039965
16,HP:0100544,Neoplasm of the heart,0.000244,5.048471
17,HP:0001308,Tongue fasciculations,0.000265,3.092026


In [67]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'GOBiologicalProcess').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
12,GO:0006415,translational termination,3e-06,2.133207
15,GO:2001179,regulation of interleukin-10 secretion,8e-06,8.347535
19,GO:0019083,viral transcription,1.5e-05,2.073954
43,GO:0006657,CDP-choline pathway,0.000175,4.239445
44,GO:0003413,chondrocyte differentiation involved in endoch...,0.000188,2.556494
48,GO:0030224,monocyte differentiation,0.000254,2.121099
52,GO:0042501,serine phosphorylation of STAT protein,0.000323,4.309267
55,GO:0071364,cellular response to epidermal growth factor s...,0.000403,2.245723
58,GO:0001556,oocyte maturation,0.000463,2.321199
59,GO:0070305,response to cGMP,0.000511,2.663535


In [68]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'MGIPhenotype').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
0,MP:0008914,enlarged cerebellum,2.436369e-07,4.500253
6,MP:0006301,abnormal mesenchyme morphology,6.377667e-05,2.019502
12,MP:0008778,abnormal lymphangiogenesis,0.0001449509,2.418321
13,MP:0002711,decreased glucagon secretion,0.0001591809,2.221722
15,MP:0004372,bowed fibula,0.0002114836,2.181336
16,MP:0011121,decreased primordial ovarian follicle number,0.0002122111,3.162288
17,MP:0003877,abnormal serotonergic neuron morphology,0.0002279733,2.452955
19,MP:0004695,increased length of long bones,0.0002502958,2.034595
20,MP:0004680,small xiphoid process,0.0002517232,4.461269
22,MP:0003276,esophageal atresia,0.0002534461,2.323356


In [69]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'MGIPhenoSingleKO').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
7,MP:0005324,ascites,7.7e-05,2.149744
8,MP:0008778,abnormal lymphangiogenesis,0.000104,2.475064
10,MP:0004374,bowed radius,0.000126,2.038274
12,MP:0002762,ectopic cerebellar granule cells,0.000146,2.196864
13,MP:0001121,uterus hypoplasia,0.000184,2.433094
15,MP:0008828,abnormal lymph node cell ratio,0.000354,2.37059
17,MP:0003420,delayed intramembranous bone ossification,0.00039,2.063306
18,MP:0009184,abnormal PP cell morphology,0.000406,2.404322
19,MP:0004361,bowed ulna,0.000414,2.157365
24,MP:0010763,abnormal hematopoietic stem cell physiology,0.000483,2.0344


In [70]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'GOCellularComponent').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
7,GO:0022625,cytosolic large ribosomal subunit,8.6e-05,2.198213
9,GO:0000307,cyclin-dependent protein kinase holoenzyme com...,0.000148,2.471604
11,GO:0005761,mitochondrial ribosome,0.000161,2.183858
12,GO:0016282,eukaryotic 43S preinitiation complex,0.000271,2.82643
16,GO:0033290,eukaryotic 48S preinitiation complex,0.000621,2.714996
18,GO:0005763,mitochondrial small ribosomal subunit,0.000797,2.89012
19,GO:0072669,tRNA-splicing ligase complex,0.000905,4.121969
21,GO:0034451,centriolar satellite,0.001847,3.675576
22,GO:0005852,eukaryotic translation initiation factor 3 com...,0.002015,2.403722
23,GO:0034993,SUN-KASH complex,0.002117,2.857849


In [71]:
read_great_res_wrapper(enrichment_data_dir, component_idx, 'GOMolecularFunction').head(10)


Unnamed: 0,# ID,Desc,BPval,BFold
2,GO:0004679,AMP-activated protein kinase activity,0.000243,5.051433
3,GO:0019238,cyclohydrolase activity,0.00025,4.051887
4,GO:0019992,diacylglycerol binding,0.000282,2.216119
7,GO:0016814,"hydrolase activity, acting on carbon-nitrogen ...",0.000519,2.024586
8,GO:0005536,glucose binding,0.000687,4.304431
9,GO:0017169,CDP-alcohol phosphatidyltransferase activity,0.000705,3.858153
10,GO:0016780,"phosphotransferase activity, for other substit...",0.000897,2.522739
11,GO:0070016,armadillo repeat domain binding,0.001232,2.36908
12,GO:0004931,extracellular ATP-gated cation channel activity,0.00133,5.119754
13,GO:0030284,estrogen receptor activity,0.001406,2.710014
