In [1]:
!date
import pandas as pd
import numpy as np
import scipy.stats
from scipy.stats import hypergeom
import sys
sys.path.append('../GOCAM_Project/dev')
import os

import utils
import enrich
pd.options.display.max_colwidth = 100

Wed Jan  4 15:37:15 PST 2023


# Comparing Set and Fully Flattened Methods on the Cancer Dataset

In [2]:
def drop_threshold(filename, id_type, method, thresholds):
    results = {}
    for a in thresholds:
        #r = set(enrich.enrich(genes, cancer_test_set, uniprot2input, gocam_sizes, Dict, alpha = a)[4].title.values)
        r = set(enrich.enrich_wrapper(filename, id_type, method = method, alpha = a).title.values)
        results[a]=r
    return results
    

The next two cells evaluate the set differences between the results at alpha = .05 post-multiple testing correction from the set method (cancer) and the fully flattened method (cancer_ff). All significance levels below are post-FDR correction.

First, we take the set difference cancer - cancer_ff to obtain c_dif, and vice versa (cancer_ff - cancer) to obtain c_ff_dif. Focusing on c_dif, we then ask whether this set difference arises because some results are pushed just beyond the arbitrary threshold of .05 (i.e. from .04 to .06), which is not a practical difference. We generate cancer_ff_10, cancer_ff_15, and cancer_ff_50 by rerunning enrichment with .10, .15, and .50 as the post-correction significance cutoffs. We then repeat cancer - cancer_ff with cancer_ff_10, cancer_ff_15, and cancer_ff_50 substituted for cancer_ff to see how far the results that are significant under the set method are pushed down the result list of the fully flattened method. The resulting sequence of 8, 7, 5, 5 means that the 5 results from set enrichment at .05 significance (post-correction) that are not captured at .15 by the fully flattened method are still not captured at .50. This suggests that the set method of enrichment yields results that are completely missed by the fully flattened method. These 5 make up a sizeable fraction (about 24%) of the 21 results yielded by the set method.
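The threshold-relaxation procedure can be sketched with plain Python sets. The model names and the `unique_counts` helper are hypothetical toy values for illustration only; they are not output of `enrich.enrich_wrapper`.

```python
def unique_counts(reference, relaxed_by_alpha):
    """Count reference results still missing from the other method at each alpha."""
    return {a: len(reference - other) for a, other in sorted(relaxed_by_alpha.items())}

# Toy data (hypothetical): results of method A at alpha = .05 ...
method_a_05 = {"model_1", "model_2", "model_3", "model_4"}
# ... and results of method B as its cutoff is relaxed.
method_b = {
    0.05: {"model_1"},
    0.10: {"model_1", "model_2"},
    0.50: {"model_1", "model_2", "model_5"},
}

counts = unique_counts(method_a_05, method_b)
print(counts)  # {0.05: 3, 0.1: 2, 0.5: 2}
```

A count that stops shrinking as alpha grows (here 2 at both .10 and .50) corresponds to results that the other method misses no matter how far its cutoff is relaxed.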

Conversely, when repeating the above procedure for the set difference between the fully flattened method and the set method, the sequence is 25, 21, 17, 9. For reference, the fully flattened method yields 38 results at .05 significance. Comparing 8/21 with 25/38, the fully flattened method has a higher proportion of unique results in a head-to-head comparison at .05 significance. (Keep in mind that unique results are not necessarily a good thing overall, because they mean that the enrichment results are sensitive to the method of representation.) However, the sequence decreases steadily from 25 to 21 to 17 to 9, and both methods end at a proportion of roughly 25% unique results.
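As a quick arithmetic check, the two sequences quoted above can be converted to percentages of each method's results at .05. This sketch assumes the fully flattened sequence uses the same cutoffs (.05, .10, .15, .50) as the set-method sequence.

```python
set_unique = [8, 7, 5, 5]      # set-method results missing from the fully flattened method
ff_unique = [25, 21, 17, 9]    # fully flattened results missing from the set method
set_total, ff_total = 21, 38   # number of results at .05 for each method

set_pct = [round(100 * n / set_total) for n in set_unique]
ff_pct = [round(100 * n / ff_total) for n in ff_unique]
print(set_pct)  # [38, 33, 24, 24]
print(ff_pct)   # [66, 55, 45, 24]
```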

Regarding the uniqueness, this is largely produced by the increased gocam size in the fully flattened method (e.g. increasing from 13 to 122 genes is a huge penalty). To some extent, this can be mitigated by overcounting of sets, which I believe is responsible for the uniqueness of results from the fully flattened method.

However, there is theoretically another case in which the fully flattened method is superior, due to the increased background size from ~4000 to ~5000 (many genes appear in gocams only in sets, and this method allows them to contribute to the background when they would not have done so in the set method). Consider a relatively large gocam with 15 genes and no sets (so the gocam size is the same in both methods), and say that there is an overlap of k = 8 genes with the target set. The set-method background is about 80% of the fully flattened background, so, naively using a binomial approximation, the per-draw probability of choosing a gene in the gocam, N/M, in the fully flattened method is 80% of its value in the set method. The complement probability 1 - N/M is effectively unchanged because M >> N. Because (N/M) is raised to the power k while the complement factor is nearly identical in both methods, we expect the ratio of the binomial probabilities under the two values of M to be approximately (.80)^k. For larger k, the smaller background is penalized more heavily. In the case of k = 8, the ratio is about .17, giving an advantage to the fully flattened method. This likely cannot account for more than a single order of magnitude difference in p-value for the overwhelming majority of gocams, as they tend to be small, but it may become a concern if multiple gocams are linked together into larger models.
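The background-size argument can be checked numerically. This is a minimal sketch under the binomial approximation, not the hypergeometric test used by `enrich`. The gene list size `n = 300` is an assumed value for illustration, and the background sizes are the rounded ~4000 and ~5000 from the text.

```python
from math import comb

def binom_pmf(k, n, p):
    """Binomial probability of exactly k successes in n draws."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

N = 15                       # gocam size, identical in both methods (no sets)
k = 8                        # overlap with the target gene list
n = 300                      # gene list size (assumed for illustration)
M_set, M_ff = 4000, 5000     # approximate background sizes from the text

naive_ratio = (M_set / M_ff) ** k
exact_ratio = binom_pmf(k, n, N / M_ff) / binom_pmf(k, n, N / M_set)
print(f"(0.8)^k = {naive_ratio:.3f}, binomial pmf ratio = {exact_ratio:.3f}")
```

The second number comes out somewhat larger than (0.8)^k because the complement term is only approximately unchanged, but both stay within one order of magnitude.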



The next 3 cells investigate the relative and absolute changes in gocam size that result from the set method. As printed by the relative-difference cell (values truncated to integers), the mean reduction in gocam size is about 20%, with a reduction of 40% or more for a quarter of the models and a median reduction of about 3%. Based on this, a substantial proportion of models could greatly benefit from set representation.

In [3]:
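x1 = pd.read_csv('../data/gocam_sizes_mouse.csv')     # gocam sizes, set method
x2 = pd.read_csv('../data/gocam_sizes_mouse_ff.csv')  # gocam sizes, fully flattened
d = pd.Series(x2.sizes.values,index=x2.gocam).to_dict()
x1['sizes_ff'] = x1.gocam.map(d)
# keep only gocams present in both files; copy to avoid SettingWithCopyWarning
x1 = x1.dropna(subset=['sizes_ff']).copy()
x1['sizes_ff'] = x1['sizes_ff'].astype(int)
x1['diff'] = x1['sizes_ff'] - x1['sizes']
x1.query('diff < 0')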


Unnamed: 0,gocam,sizes,sizes_ff,diff
213,http://model.geneontology.org/R-HSA-4090294,12,9,-3
267,http://model.geneontology.org/R-HSA-3232118,10,9,-1
283,http://model.geneontology.org/R-HSA-3214858,10,9,-1
382,http://model.geneontology.org/R-HSA-69541,8,7,-1
446,http://model.geneontology.org/R-HSA-5649702,7,6,-1
492,http://model.geneontology.org/R-HSA-111465,6,5,-1
564,http://model.geneontology.org/R-HSA-9028731,5,4,-1
575,http://model.geneontology.org/R-HSA-9026527,5,4,-1
845,http://model.geneontology.org/R-HSA-2454202,3,2,-1


In [4]:
#absolute difference
s = x1['diff']
s = pd.to_numeric(s)
s.describe().apply(lambda x: int(x))

count    1211
mean        5
std        19
min        -3
25%         0
50%         1
75%         4
max       255
Name: diff, dtype: int64

In [5]:
#relative difference
s1 = -x1['diff']/x1['sizes_ff']*100
s1 = pd.to_numeric(s1)
s1.describe().apply(lambda x: int(x))

count    1211
mean      -20
std        27
min       -98
25%       -40
50%        -3
75%         0
max        50
dtype: int64

In [6]:
#setID2members.get('sset:EPH-ephrin oligomers')

# Repeating analysis across all datasets

In [7]:
import bokeh.io
import bokeh.layouts  # used below for bokeh.layouts.grid
import bokeh.plotting

bokeh.io.output_notebook()

In [8]:
def make_plot(method,complement_method,datasets,colors,results,alphas):
    p = bokeh.plotting.figure(
    title=f"{method} Method at FDR=.05 vs {complement_method} Method",
    frame_height=400,
    frame_width=400,
    x_range=[0, 0.55],
    y_range=[-5, 110],
    y_axis_label = f'% unique results vs {complement_method} method',
    x_axis_label = f'alpha for {complement_method} method'
    )
    x = alphas
    d = list(datasets.keys())
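    for i in range(len(d)):
        vals = results[i][:-1]  # % unique results, one value per alpha
        c = colors[i]
        # d[i][:-4] strips the '.csv' extension; results[i][-1] is the result count at alpha = .05
        if i == 0:
            label = d[i][:-4]+', total = '+str(results[i][-1])
        else:
            label = d[i][:-4]+', '+str(results[i][-1])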
        p.line(x,vals,legend_label = label,color = c,line_width = 2)
        p.circle(x,vals,legend_label = label,color = c,size = 4)
    p.xaxis.ticker=x
    p.grid.visible = False
    p.legend.location = 'right'
    return p


# Examining PI3K/AKT signaling in the P97 dataset

In [9]:
gene_list = pd.read_csv('../../Desktop/GOCAM/P97.csv',header=None,names = ['g'])
P97_test_set = gene_list.g
uniprot2input = pd.Series(P97_test_set.values,index=P97_test_set).to_dict()
x = pd.read_csv('../data/gocam_sizes_mouse.csv')
gocam_sizes = pd.Series(x.sizes.values,index=x.gocam)
Dict = utils.csv2dict('../data/ID2gocam_mouse.csv')
filtered_out_genes, filtered_list, setID2members_input_uni, setID2members_input, df_display= enrich.enrich(list(gene_list.g), P97_test_set, uniprot2input, gocam_sizes, Dict,alpha = .05)
df_display#[df_display.title.apply(lambda x: 'Hedgehog' in x)]

Unnamed: 0,title,pval (uncorrected),# genes in list,#genes in gocam,shared gene products in gocam,url
0,"PI5P, PP2A and IER3 Regulate PI3K/AKT Signaling - Reactome",2e-06,8,13,"[sset:PI4K2A/2B, sset:PP2A-subunit A, sset:PP2A-catalytic subunit C, sset:Activated SRC,LCK,EGFR...",http://model.geneontology.org/R-HSA-6811558
1,Unwinding of DNA - Reactome,2e-06,6,7,"[P25205, Q14566, P33991, P49736, P33993, P33992]",http://model.geneontology.org/R-HSA-176974


In [10]:
len(gene_list.g),len(filtered_list),len(filtered_out_genes)

(766, 338, 496)

In [11]:
#hypergeom.sf(count-1, background_gene_list_size,  gocam_size, gene_list_size)
hypergeom.sf(8-1, 5386,  122, 270),hypergeom.sf(8-1, 4008,  13, 338)

(0.26731596576293415, 2.085287471112948e-06)

# HGT is sensitive to relatively small parameter changes

Comparing enrichment of 'SCF(Skp2)-mediated degradation of p27/p21 - Reactome' in the P97 dataset in the fully flattened and set models:

In [12]:
#hypergeom.sf(count-1, background_gene_list_size,  gocam_size, gene_list_size)
hypergeom.sf(13-1, 5386,  51, 270),hypergeom.sf(12-1, 4008,  50, 318)

(8.116740023880341e-07, 0.0003898627483818345)

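Effect of gene list size: holding it at 270 in both methods moves the set-method p-value from 3.9e-04 to 8.2e-05. (The change from 270 to 318 is the net result of adding sets and removing genes that occur only in sets.)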

In [13]:
#hypergeom.sf(count-1, background_gene_list_size,  gocam_size, gene_list_size)
hypergeom.sf(13-1, 5386,  51, 270),hypergeom.sf(12-1, 4008,  50, 270)

(8.116740023880341e-07, 8.212821239277402e-05)

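Effect of background gene list size (analogous to above): additionally holding the background at 5386 in both methods moves the set-method p-value from 8.2e-05 to 4.3e-06.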

In [14]:
#hypergeom.sf(count-1, background_gene_list_size,  gocam_size, gene_list_size)
hypergeom.sf(13-1, 5386,  51, 270),hypergeom.sf(12-1, 5386,  50, 270)

(8.116740023880341e-07, 4.332611973728076e-06)

# Examining the Effect of Background Size

Background size refers to the number of entities across all models. When the blue curve is below the orange curve, the result is "automatically" significant at rank 5 or better (better meaning closer to rank 1), i.e. its significance does not depend on the p-values of the other results (see below).

Caption and parameters: Blue = uncorrected p-value, orange = Benjamini-Hochberg critical value for rank 5 (from multiple testing correction).
Alpha = .05. The number of GO-CAMs (for multiple testing correction) is calculated as background gene list size / average gocam size, where the average gocam size is arbitrarily set to 5. I assume that the background list of genes increases due to the addition of more models, which increases the number of tests being done. Thus, the BH critical value is not constant; it decreases as the background size grows.
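The comparison described in the caption can be sketched as follows. This is a minimal, standard-library-only illustration rather than the code that produced the plot. The overlap, gocam size, and gene list size are borrowed from the P97 example, the background sizes are hypothetical, and `hypergeom_sf` and `bh_critical` are helper names introduced here.

```python
from math import comb

def hypergeom_sf(k, M, N, n):
    """P(overlap >= k) when n genes are drawn from a background of M, N of them in the gocam."""
    total = comb(M, n)
    return sum(comb(N, i) * comb(M - N, n - i) for i in range(k, min(N, n) + 1)) / total

def bh_critical(rank, alpha, n_tests):
    """Benjamini-Hochberg critical value for a given rank."""
    return rank * alpha / n_tests

alpha, rank, avg_gocam_size = 0.05, 5, 5       # values from the caption
k, gocam_size, gene_list_size = 8, 13, 338     # illustrative values from the P97 example

for background in (2000, 4000, 8000, 16000):   # hypothetical background sizes
    n_tests = background / avg_gocam_size      # assumed number of GO-CAMs
    p = hypergeom_sf(k, background, gocam_size, gene_list_size)
    crit = bh_critical(rank, alpha, n_tests)
    print(f"background={background:6d}  p={p:.2e}  BH critical={crit:.2e}  below={p < crit}")
```

Note that `hypergeom_sf(k, ...)` here corresponds to `hypergeom.sf(k-1, ...)` in the cells above, since SciPy's survival function is P(X > k).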

In [15]:
import numpy as np

In [17]:
def get_pval(k,gc,gc_array,size):
    pval = np.sum(gc_array.get(gc)[k:])/size
    return pval
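A toy illustration of what `get_pval` computes. It assumes that each array holds, at index i, the number of simulated gene lists whose overlap with the gocam was exactly i. That layout is inferred from the `[k:]` slice and is not confirmed elsewhere in this notebook. The histogram below is hypothetical, and plain lists are used in place of NumPy arrays.

```python
def empirical_pval(k, counts, n_sims):
    """Fraction of simulations with overlap >= k, given a histogram of overlaps."""
    return sum(counts[k:]) / n_sims

# Hypothetical histogram from 10 simulations:
# overlap 0 seen 6 times, overlap 1 seen 3 times, overlap 2 seen once.
counts = [6, 3, 1]
print(empirical_pval(1, counts, 10))  # 0.4
print(empirical_pval(2, counts, 10))  # 0.1
```

An observed overlap larger than any simulated overlap gives an empirical p-value of 0, so the resolution of this estimate is limited to 1 / n_sims.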

In [46]:
gc_array = utils.csv2dict('sim_results__151x100000.csv')
temp = {}
for k,v in gc_array.items():
    vals = np.array([int(_[:-2]) for _ in v]) #int won't accept a string with a decimal place
    temp[k] = vals
gc_array = temp
df_nc = enrich.enrich_wrapper('cancer.csv','Gene Symbol',method='ncHGT',show_significant=False)

In [47]:
def drop_thresh_sim(thresholds):
    setID2members = utils.csv2dict('../data/setID2members.csv')
    gocam2ID = utils.csv2dict('../data/gocam2ID_mouse.csv')
    id2g_ff = utils.csv2dict('../data/ID2gocam_mouse_ff.csv')
    Dict = utils.csv2dict('../data/ID2gocam_mouse.csv')

    size = 100000
    
    results = {}
    for a in thresholds:
        alpha = a
        show_significant = True
        background_num_gocams = len(gocam2ID)
        df_sim = df_nc.copy()
        df_sim['pval (uncorrected)']=df_sim.apply(lambda x: get_pval(x['# genes in list'], x['url'],gc_array,size), axis =1)

        df_sim.sort_values('pval (uncorrected)',inplace=True)
        df_sim.reset_index(drop=True, inplace=True)
        df_sim['FDR_val'] = (df_sim.index+1)*alpha/background_num_gocams
        df_sim['Less_than'] = (df_sim['pval (uncorrected)'] < df_sim['FDR_val'])
        index = df_sim.Less_than.where(df_sim.Less_than==True).last_valid_index()
        df_significant = df_sim
        if (show_significant):
            if index is None:
                # no result passes the BH step-up test: return an empty frame
                df_significant = pd.DataFrame(columns =['title', 'pval (uncorrected)', '# genes in list','#genes in gocam','shared gene products in gocam','url'])
            else:
                df_significant = df_sim.loc[0:index].copy()
        df_sim = df_significant[['title','pval (uncorrected)', '# genes in list', '#genes in gocam','shared gene products in gocam','url']].copy()
        r = set(df_sim.title.values)
        results[a]=r
    return results

In [55]:
path = '../../Desktop/GOCAM/'
datasets = {'cancer.csv':'Gene Symbol'}

results_combined_ff_sim = []
results_combined_nc_sim = []

results_combined_sim_ff = []
results_combined_sim_nc = []
alphas = [.05,.1,.15,.2,.3,.4,.5]
for dataset, symbol_type in datasets.items():
    filename = os.path.join(path,dataset)
    
    results_s = drop_thresh_sim(alphas)
    results_ff = drop_threshold(filename, symbol_type, 'standard', alphas)
    results_nc = drop_threshold(filename, symbol_type, 'ncHGT', alphas)
    
    s_ff = []
    s_nc = []
    s_05 = results_s[.05]
    s_05_len = len(s_05)
    if s_05_len == 0:
        s_05_len = 1 #to prevent division of 0 / 0
    
    ff_s = []
    ff_nc = []
    ff_05 = results_ff[.05]
    ff_05_len = len(ff_05)
    if ff_05_len == 0:
        ff_05_len = 1 #to prevent division of 0 / 0
    
    nc_s = []
    nc_ff = []
    nc_05 = results_nc[.05]
    nc_05_len = len(nc_05)
    if nc_05_len == 0:
        nc_05_len = 1 #to prevent division of 0 / 0
        
    for a in alphas:
        s_ff.append(len(s_05-results_ff[a])/s_05_len*100)
        ff_s.append(len(ff_05-results_s[a])/ff_05_len*100)
        s_nc.append(len(s_05-results_nc[a])/s_05_len*100)
        ff_nc.append(len(ff_05-results_nc[a])/ff_05_len*100)
        nc_ff.append(len(nc_05-results_ff[a])/nc_05_len*100)
        nc_s.append(len(nc_05-results_s[a])/nc_05_len*100)
        
    s_ff.append(len(s_05))
    s_nc.append(len(s_05))
    ff_s.append(len(ff_05))
    ff_nc.append(len(ff_05))
    nc_s.append(len(nc_05))
    nc_ff.append(len(nc_05))
  
    results_combined_sim_ff.append(s_ff)
    results_combined_sim_nc.append(s_nc)
    results_combined_ff_sim.append(ff_s)
    results_combined_nc_sim.append(nc_s)
        

In [56]:
colors = ['orange','crimson','red','salmon','purple']
p1 = make_plot('Sim','Standard',datasets,colors,results_combined_sim_ff,alphas)
p2 = make_plot('Standard (gene list)','Sim',datasets,colors,results_combined_ff_sim,alphas)
p3 = make_plot('Sim',"Fisher's ncHGT (weighted step)",datasets,colors,results_combined_sim_nc,alphas)
p5 = make_plot("Fisher's ncHGT (weighted step)",'Sim',datasets,colors,results_combined_nc_sim,alphas)

g2 = bokeh.layouts.grid([[p1,p2],[p3,p5]])
bokeh.io.show(g2)

In [31]:
results_s = drop_thresh_sim(alphas)
results_ff = drop_threshold(filename, symbol_type, 'standard', alphas)

In [32]:
s = results_s[.05]
f = results_ff[.05]
f-s

{'Activation of the mRNA upon binding of the cap-binding complex and eIFs, and subsequent binding to 43S - Reactome',
 'Inhibition of replication initiation of damaged DNA by RB1/E2F1 - Reactome',
 'Interleukin-1 signaling - Reactome',
 'MAPK6/MAPK4 signaling - Reactome',
 'Senescence-Associated Secretory Phenotype (SASP) - Reactome',
 'Synthesis of PC - Reactome',
 'Translation initiation complex formation - Reactome'}

In [50]:
results_s[.5]-results_s[.05]

{"'de novo' GMP biosynthetic process (Mouse)",
 'ARMS-mediated activation - Reactome',
 'Activation of NOXA and translocation to mitochondria - Reactome',
 'Beta-catenin phosphorylation cascade - Reactome',
 'Citric acid cycle (TCA cycle) - Reactome',
 'DNA Damage/Telomere Stress Induced Senescence - Reactome',
 'Dual Incision in GG-NER - Reactome',
 'EGFR downregulation - Reactome',
 'ERBB2 Activates PTK6 Signaling - Reactome',
 'ERBB2 Regulates Cell Motility - Reactome',
 'ERBB2-EGFR signaling pathway 1 (Mouse)',
 'ERBB2-EGFR signaling pathway 2 (Mouse) ',
 'ERBB2-EGFR signaling pathway 4 (Mouse) ',
 'ERBB2-EGFR signaling pathway 5 (Mouse)',
 'ERBB2-EGFR signaling pathway 6 (Mouse) ',
 'Epidermal growth factor receptor signaling pathway 10 (Mouse) ',
 'Erythropoietin activates Phosphoinositide-3-kinase (PI3K) - Reactome',
 'Extra-nuclear estrogen signaling - Reactome',
 'FCERI mediated MAPK activation - Reactome',
 'FLT3 Signaling - Reactome',
 'G beta:gamma signalling through PI3Kga