In [1]:
import pandas as pd
import numpy as np
import sys
import os
sys.path.append('../dev')

import utils
import enrich

pd.options.display.max_colwidth = 300
pd.set_option('display.max_rows', None)

!date

Wed Feb 14 12:27:46 PST 2024


# Documentation and Example

To run an enrichment on your own data, substitute your file name and path into the following cell:

In [2]:
file_name = 'platelets_up.csv'
file_path = '../test_data/processed/'

*Alternatively, you can put the full path in the file name, as in* file_name = '../test_data/processed/platelets_up.csv' *. In that case, either omit file_path as an argument or set* file_path = ''
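The two ways of pointing at the input file can be checked against each other before running anything. The sketch below assumes the path and file name are combined in the usual `os.path.join` fashion, which this notebook does not confirm; `full_input_path` is a hypothetical helper, not part of the `enrich` module.

```python
import os

def full_input_path(file_name, file_path=''):
    """Combine path and file name (hypothetical helper, not part of enrich).

    Assumes the two are joined as os.path.join would join them; with an
    empty file_path the file name is returned unchanged.
    """
    return os.path.normpath(os.path.join(file_path, file_name))

# Both styles described above should resolve to the same location.
split_style = full_input_path('platelets_up.csv', '../test_data/processed/')
single_style = full_input_path('../test_data/processed/platelets_up.csv')
print(split_style == single_style)
```

Calling `os.path.isfile(full_input_path(file_name, file_path))` before the enrichment is a cheap way to catch a mistyped path early.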

**The file must be a csv with one gene or protein ID per line.**

***Permitted ID types are:***
- UniProtKB accession numbers (https://www.uniprot.org/help/accession_numbers)
- HGNC symbols (the names, not the IDs; i.e., GAPDH, not HGNC:4141; see https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC:4141)
- MGI gene names (mouse IDs will be cross-referenced with human IDs)

For other ID types, you can convert them here: http://mangolassi.caltech.edu/~azurebrd/cgi-bin/forms/agr_simplemine.cgi
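If your IDs live in a Python list rather than a file, a one-ID-per-line CSV can be produced as sketched below. `write_gene_list` is a hypothetical helper, and the sketch writes no header row; whether the wrapper expects one is not stated in this notebook, so compare against the example files in `../test_data/processed/` before relying on it.

```python
import csv

def write_gene_list(gene_ids, out_path):
    """Write one ID per line (hypothetical helper, not part of enrich).

    Whitespace is stripped, blank entries are dropped, and duplicates are
    removed while keeping the first occurrence. No header row is written
    (an assumption; check the expected layout of your input files).
    """
    seen = set()
    cleaned = []
    for gene_id in gene_ids:
        gene_id = gene_id.strip()
        if gene_id and gene_id not in seen:
            seen.add(gene_id)
            cleaned.append(gene_id)
    with open(out_path, 'w', newline='') as handle:
        writer = csv.writer(handle)
        for gene_id in cleaned:
            writer.writerow([gene_id])
    return cleaned
```

For example, `write_gene_list(['GAPDH', 'P4HB'], 'my_genes.csv')` writes a two-line file and returns the cleaned list.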

Below is the function **enrich_wrapper()**, which performs enrichment given a file name, gene ID type, enrichment method, and false discovery rate. The first two arguments, 'filename' and 'id_type', are required; the rest are optional. The 'method' argument selects the enrichment method (standard hypergeometric, unweighted step enrichment, or weighted step enrichment).

**def enrich_wrapper(filename, id_type, method='set', return_all=False, FDR=0.05, fpath='../test_data', display_gene_symbol=True)**

***method:*** default 'set'
- method = 'set' indicates unweighted, step-centric hypergeometric enrichment analysis. Sets and genes are weighted equally with a weight of 1.
- method = 'ncHGT' indicates weighted, step-centric enrichment analysis using Fisher's noncentral hypergeometric distribution and the BiasedUrn package.
- method = 'standard' indicates gene-list hypergeometric enrichment analysis. It is implemented here because results can vary from one tool to another depending on the backend pathway database used.
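For readers unfamiliar with the hypergeometric test that these methods build on, here is a generic textbook sketch of the upper-tail p-value. It is not the implementation used by `enrich_wrapper()`, and the function name and the numbers in the example call are purely illustrative.

```python
from math import comb

def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) for a hypergeometric draw (generic sketch, not enrich's code).

    N: size of the background population
    K: number of background items annotated to the pathway
    n: number of items in the input list
    k: observed overlap between the input list and the pathway
    """
    total = comb(N, n)
    upper = min(K, n)
    # Sum the probabilities of every overlap at least as large as k.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total

# Illustrative (made-up) numbers: 8 of 400 input genes fall in a
# 50-gene pathway, against a background of 20000 genes.
p_value = hypergeom_upper_tail(k=8, N=20000, K=50, n=400)
print(p_value)
```

A small p-value means an overlap this large would be unlikely if the input list were a random draw from the background.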

***return_all:*** default False
- if false, returns only the dataframe displaying results.
- if true, returns (gene_list, filtered_out_genes, filtered_list, setID2members_input_uni, setID2members_input, df_display). Users may want to know which of their input genes were filtered out and how the IDs were mapped, since UniProt IDs can sometimes map to more than one HGNC gene symbol.

***display_gene_symbol:*** default True
- if true, display HGNC symbols on output regardless of input ID type
- if false, display output using the same ID type as the input
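When `return_all=True`, the six returned values are easier to work with if they are given names. The sketch below wraps them in a named tuple using the field order listed above; `EnrichmentOutput` and `label_outputs` are hypothetical names introduced here, not part of the `enrich` module.

```python
from collections import namedtuple

# Field names follow the return order documented above for return_all=True.
EnrichmentOutput = namedtuple(
    'EnrichmentOutput',
    ['gene_list', 'filtered_out_genes', 'filtered_list',
     'setID2members_input_uni', 'setID2members_input', 'df_display'],
)

def label_outputs(result):
    """Wrap the 6-tuple from return_all=True in a named tuple (hypothetical helper)."""
    if len(result) != len(EnrichmentOutput._fields):
        raise ValueError(
            f'expected {len(EnrichmentOutput._fields)} values, got {len(result)}'
        )
    return EnrichmentOutput(*result)

# Intended use (not executed here, since it needs the enrich module and data):
# out = label_outputs(enrich.enrich_wrapper(file_name, 'Gene Symbol',
#                                           return_all=True, fpath=file_path))
# out.filtered_out_genes   # input genes that were filtered out
# out.df_display           # the results table
```

Accessing results by name avoids relying on tuple positions, which are easy to mix up with six values.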

## Example: Platelets in SARS-CoV-2

Below, we run enrichment analysis on platelets_up.csv using all three methods at a false discovery rate of 0.05. We then run the weighted method on platelets_down.csv with a false discovery rate of 0.5.

Manne et al., 2020

https://ashpublications.org/blood/article/136/11/1317/461106/Platelet-gene-expression-and-function-in-patients

RNA-seq of platelets from patients with COVID-19 versus healthy donors. There were 6 ICU and 4 non-ICU patients.
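Each run below is filtered by an FDR threshold, while the tables report uncorrected p-values. This notebook does not say which correction procedure `enrich_wrapper()` applies, so the sketch below shows the Benjamini-Hochberg procedure only as one common way such a threshold is used; treat it as an assumption, not a description of the package. The function name and the example p-values are made up for illustration.

```python
def benjamini_hochberg(pvals, fdr=0.05):
    """Flag which p-values pass Benjamini-Hochberg at the given FDR.

    Generic sketch; whether enrich_wrapper uses this procedure is an assumption.
    Returns a list of booleans in the same order as pvals.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    # Find the largest rank r (1-based) whose p-value is <= (r / m) * fdr.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank / m * fdr:
            cutoff_rank = rank
    # Every hypothesis ranked at or below the cutoff is accepted.
    passed = [False] * m
    for idx in order[:cutoff_rank]:
        passed[idx] = True
    return passed

# Made-up p-values for illustration only.
example = benjamini_hochberg([0.001, 0.02, 0.03, 0.5], fdr=0.05)
print(example)
```

Raising the `fdr` argument relaxes the cutoff, so more pathways pass at the cost of more expected false positives.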

**Enrichment analysis with the unweighted set method and FDR = 0.05**

In [3]:
results_set = enrich.enrich_wrapper(file_name, 'Gene Symbol', method='set', FDR=0.05, fpath=file_path)
results_set

100%|██████████████████████████████████████| 482/482 [00:00<00:00, 15814.31it/s]


Analysis run on 423 entities from 365 out of 1172 input genes


Unnamed: 0,title,pval (uncorrected),# entities in list,#entities in model,shared entities in gocam,url
0,Collagen biosynthesis and modifying enzymes - Reactome,8.974459e-09,10,12,"[PPIB, P3H1, PLOD3, P4HB, set:Prolyl 3-hydroxylases, set:COLGALT1,COLGALT2, set:4-Hyp collagen propeptides, set:Procollagen N-proteinases, set:Procollagen C-proteinases, set:Lysyl hydroxylases]",http://model.geneontology.org/R-HSA-1650814
1,Synthesis of PE - Reactome,4.97345e-05,8,15,"[PTDSS2, PCYT2, PISD, PHOSPHO1, set:LPIN, set:CHK/ETNK, set:PNPLA2/3, set:AGPAT]",http://model.geneontology.org/R-HSA-1483213


**Repeat with standard enrichment:**

In [4]:
results_standard = enrich.enrich_wrapper(file_name, 'Gene Symbol', method='standard', FDR=0.05, fpath=file_path)
results_standard

100%|██████████████████████████████████████| 482/482 [00:00<00:00, 17972.34it/s]


Unnamed: 0,title,pval (uncorrected),# entities in list,#entities in model,shared entities in gocam,url
0,ER-Phagosome pathway - Reactome,4.5e-05,13,52,"[SEC61A1, PSMD13, PSMA5, PSMD11, SEC61B, PSMA7, PSME2, PSMD4, PSMB1, PSMB7, PSMD8, PSMB5, PSMB6]",http://model.geneontology.org/R-HSA-1236974
1,Hedgehog ligand biogenesis - Reactome,4.5e-05,13,52,"[PSMD13, PSMA5, PSMD11, PSMA7, PSME2, PSMD4, PSMB1, PSMB7, PSMD8, PSMB5, P4HB, PSMB6, SYVN1]",http://model.geneontology.org/R-HSA-5358346
2,Regulation of APC/C activators between G1/S and early anaphase - Reactome,6.9e-05,13,54,"[PSMD13, PSMA5, PSMD11, PSMA7, PSME2, PSMD4, PSMB1, CDC25B, PSMB7, CDK1, PSMD8, PSMB5, PSMB6]",http://model.geneontology.org/R-HSA-176408
3,Conversion from APC/C:Cdc20 to APC/C:Cdh1 in late anaphase - Reactome,0.000111,12,49,"[PSMD13, PSMA5, PSMD11, PSMA7, PSME2, PSMD4, PSMB1, PSMB7, CDK1, PSMD8, PSMB5, PSMB6]",http://model.geneontology.org/R-HSA-176407
4,Neddylation - Reactome,0.000113,14,64,"[UBE2M, PSMD13, PSMA5, PSMD11, PSMA7, PSME2, PSMD4, PSMB1, CUL9, PSMB7, UCHL3, PSMD8, PSMB5, PSMB6]",http://model.geneontology.org/R-HSA-8951664
5,KEAP1-NFE2L2 pathway - Reactome,0.000126,13,57,"[PSMD13, PSMA5, PSMD11, PSMA7, PSME2, PSMD4, PSMB1, PSMB7, CSNK2B, PSMD8, PSMB5, PSMB6, PRDX2]",http://model.geneontology.org/R-HSA-9755511
6,SCF(Skp2)-mediated degradation of p27/p21 - Reactome,0.000136,12,50,"[PSMD13, PSMA5, PSMD11, PSMA7, PSME2, PSMD4, PSMB1, PSMB7, PSMD8, PSMB5, PSMB6, CKS1B]",http://model.geneontology.org/R-HSA-187577
7,The role of GTSE1 in G2/M progression after G2 checkpoint - Reactome,0.000136,12,50,"[PSMD13, PSMA5, PSMD11, PSMA7, PSME2, PSMD4, PSMB1, CDC25B, PSMB7, PSMD8, PSMB5, PSMB6]",http://model.geneontology.org/R-HSA-8852276
8,SCF-beta-TrCP mediated degradation of Emi1 - Reactome,0.000167,12,51,"[PSMD13, PSMA5, PSMD11, PSMA7, PSME2, PSMD4, PSMB1, PSMB7, CDK1, PSMD8, PSMB5, PSMB6]",http://model.geneontology.org/R-HSA-174113
9,GSK3B and BTRC:CUL1-mediated-degradation of NFE2L2 - Reactome,0.000175,11,44,"[PSMD13, PSMA5, PSMD11, PSMA7, PSME2, PSMD4, PSMB1, PSMB7, PSMD8, PSMB5, PSMB6]",http://model.geneontology.org/R-HSA-9762114


**Repeat with weighted step enrichment:**

In [5]:
results_weighted = enrich.enrich_wrapper(file_name, 'Gene Symbol', method='ncHGT', FDR=0.05, fpath=file_path)
results_weighted

100%|█████████████████████████████████████████| 482/482 [01:19<00:00,  6.09it/s]


Analysis run on 423 entities from 365 out of 1172 input genes


Unnamed: 0,title,pval (uncorrected),# entities in list,#entities in model,shared entities in gocam,url
0,Collagen biosynthesis and modifying enzymes - Reactome,4.68093e-07,10,12,"[PPIB, P3H1, PLOD3, P4HB, set:Prolyl 3-hydroxylases, set:COLGALT1,COLGALT2, set:4-Hyp collagen propeptides, set:Procollagen N-proteinases, set:Procollagen C-proteinases, set:Lysyl hydroxylases]",http://model.geneontology.org/R-HSA-1650814
1,Hedgehog ligand biogenesis - Reactome,4.690733e-06,13,50,"[PSMD13, PSMA5, PSMD11, PSMA7, PSME2, PSMD4, PSMB1, PSMB7, PSMD8, PSMB5, P4HB, PSMB6, SYVN1]",http://model.geneontology.org/R-HSA-5358346
2,ER-Phagosome pathway - Reactome,5.123754e-06,13,51,"[PSMD13, PSMA5, PSMD11, SEC61B, PSMA7, PSME2, PSMD4, PSMB1, PSMB7, PSMD8, PSMB5, PSMB6, set:SEC61 alpha]",http://model.geneontology.org/R-HSA-1236974
3,Regulation of APC/C activators between G1/S and early anaphase - Reactome,7.226495e-06,13,52,"[PSMD13, PSMA5, PSMD11, PSMA7, PSME2, PSMD4, PSMB1, PSMB7, CDK1, PSMD8, PSMB5, PSMB6, set:CDC25]",http://model.geneontology.org/R-HSA-176408
4,Neddylation - Reactome,9.256658e-06,14,62,"[UBE2M, PSMD13, PSMA5, PSMD11, PSMA7, PSME2, PSMD4, PSMB1, CUL9, PSMB7, PSMD8, PSMB5, PSMB6, set:UCHL3,SENP8]",http://model.geneontology.org/R-HSA-8951664
5,KEAP1-NFE2L2 pathway - Reactome,1.54032e-05,13,55,"[PSMD13, PSMA5, PSMD11, PSMA7, PSME2, PSMD4, PSMB1, PSMB7, CSNK2B, PSMD8, PSMB5, PSMB6, set:PRDX1,2,5]",http://model.geneontology.org/R-HSA-9755511
6,The role of GTSE1 in G2/M progression after G2 checkpoint - Reactome,1.767235e-05,12,50,"[PSMD13, PSMA5, PSMD11, PSMA7, PSME2, PSMD4, PSMB1, CDC25B, PSMB7, PSMD8, PSMB5, PSMB6]",http://model.geneontology.org/R-HSA-8852276
7,Conversion from APC/C:Cdc20 to APC/C:Cdh1 in late anaphase - Reactome,1.767235e-05,12,49,"[PSMD13, PSMA5, PSMD11, PSMA7, PSME2, PSMD4, PSMB1, PSMB7, CDK1, PSMD8, PSMB5, PSMB6]",http://model.geneontology.org/R-HSA-176407
8,Degradation of beta-catenin by the destruction complex - Reactome,2.044777e-05,13,54,"[PSMD13, PSMA5, PSMD11, PSMA7, PSME2, PSMD4, PSMB1, PSMB7, PSMD8, PSMB5, PSMB6, set:PP2A-subunit A, set:PP2A regulatory subunit B56]",http://model.geneontology.org/R-HSA-195253
9,SCF(Skp2)-mediated degradation of p27/p21 - Reactome,2.06935e-05,12,49,"[PSMD13, PSMA5, PSMD11, PSMA7, PSME2, PSMD4, PSMB1, PSMB7, PSMD8, PSMB5, PSMB6, CKS1B]",http://model.geneontology.org/R-HSA-187577
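With results from several methods in hand, a quick way to compare them is to look at which pathway titles they share. The sketch below works on plain lists of titles so it runs without the data; `shared_and_unique` is a hypothetical helper, and the example titles are a few copied from the tables above.

```python
def shared_and_unique(titles_by_method):
    """Compare pathway titles across methods (hypothetical helper).

    titles_by_method maps a method name to an iterable of pathway titles.
    Returns (titles found by every method, {method: titles found only by it}).
    """
    sets = {name: set(titles) for name, titles in titles_by_method.items()}
    shared = set.intersection(*sets.values()) if sets else set()
    unique = {}
    for name, titles in sets.items():
        others = set().union(*(t for other, t in sets.items() if other != name))
        unique[name] = titles - others
    return shared, unique

# A few titles copied from the result tables above, for illustration.
shared, unique = shared_and_unique({
    'set': ['Collagen biosynthesis and modifying enzymes - Reactome',
            'Synthesis of PE - Reactome'],
    'ncHGT': ['Collagen biosynthesis and modifying enzymes - Reactome',
              'Hedgehog ligand biogenesis - Reactome'],
})
print(shared)
print(unique['set'])
```

With the DataFrames from this notebook, the inputs would be the `title` columns, for example `{'set': results_set['title'], 'standard': results_standard['title'], 'ncHGT': results_weighted['title']}`.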


**Run weighted step enrichment on platelets_down.csv with FDR = 0.5:**

In [6]:
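# This run uses a different input file (platelets_down.csv) and FDR = 0.5,
# so it gets its own variable instead of overwriting results_set.
results_down_weighted = enrich.enrich_wrapper('platelets_down.csv', 'Gene Symbol', method='ncHGT', FDR=0.5, fpath=file_path)
results_down_weighted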

100%|█████████████████████████████████████████| 436/436 [02:12<00:00,  3.28it/s]


Analysis run on 309 entities from 267 out of 1088 input genes


Unnamed: 0,title,pval (uncorrected),# entities in list,#entities in model,shared entities in gocam,url
0,Role of phospholipids in phagocytosis - Reactome,1e-05,11,22,"[set:PLCdegh, set:PLC-beta, set:G-protein gamma subunit, set:PL(C)D4:3xCa2+, set:PLC-beta 1/2/3, set:PLCbz, set:G-protein alpha (q/11/14/15), set:PI3K-regulatory subunit, set:G(q) alpha 11,14,15,Q, set:PLD, set:PLC beta1,2,3]",http://model.geneontology.org/R-HSA-2029485
1,DAG and IP3 signaling - Reactome,1.9e-05,10,18,"[set:PLCdegh, set:PLC-beta, set:G-protein gamma subunit, set:PL(C)D4:3xCa2+, set:PLC-beta 1/2/3, set:PLCbz, set:G-protein alpha (q/11/14/15), set:PI3K-regulatory subunit, set:G(q) alpha 11,14,15,Q, set:PLC beta1,2,3]",http://model.geneontology.org/R-HSA-1489509
2,VEGFR2 mediated cell proliferation - Reactome,4.6e-05,11,27,"[KDR, set:PLCdegh, set:PLC-beta, set:G-protein gamma subunit, set:PL(C)D4:3xCa2+, set:PLC-beta 1/2/3, set:PLCbz, set:G-protein alpha (q/11/14/15), set:PI3K-regulatory subunit, set:G(q) alpha 11,14,15,Q, set:PLC beta1,2,3]",http://model.geneontology.org/R-HSA-5218921
3,PLC beta mediated events - Reactome,5e-05,9,15,"[set:PLCdegh, set:PLC-beta, set:G-protein gamma subunit, set:PL(C)D4:3xCa2+, set:PLC-beta 1/2/3, set:PLCbz, set:G-protein alpha (q/11/14/15), set:G(q) alpha 11,14,15,Q, set:PLC beta1,2,3]",http://model.geneontology.org/R-HSA-112043
4,Antigen activates B Cell Receptor (BCR) leading to generation of second messengers - Reactome,7.1e-05,11,25,"[PIK3R1, set:PLCdegh, set:PLC-beta, set:G-protein gamma subunit, set:PL(C)D4:3xCa2+, set:PLC-beta 1/2/3, set:PLCbz, set:G-protein alpha (q/11/14/15), set:LYN, p-SYK, set:G(q) alpha 11,14,15,Q, set:PLC beta1,2,3]",http://model.geneontology.org/R-HSA-983695
5,CLEC7A (Dectin-1) induces NFAT activation - Reactome,0.000403,9,22,"[set:PLCdegh, set:PLC-beta, set:G-protein gamma subunit, set:PL(C)D4:3xCa2+, set:PLC-beta 1/2/3, set:PLCbz, set:G-protein alpha (q/11/14/15), set:G(q) alpha 11,14,15,Q, set:PLC beta1,2,3]",http://model.geneontology.org/R-HSA-5607763
6,Regulation of insulin secretion - Reactome,0.000444,9,22,"[set:PLCdegh, set:PLC-beta, set:G-protein gamma subunit, set:PL(C)D4:3xCa2+, set:PLC-beta 1/2/3, set:PLCbz, set:G-protein alpha (q/11/14/15), set:G(q) alpha 11,14,15,Q, set:PLC beta1,2,3]",http://model.geneontology.org/R-HSA-422356
7,"Synthesis of IP2, IP, and Ins in the cytosol - Reactome",0.000664,7,17,"[set:PLCdegh, set:INPP5(3)/ITPK1, set:PL(C)D4:3xCa2+, set:PLCbz, set:IMPA1/2, set:Inositol polyphosphate 5-phosphatase, set:INPP5(4)]",http://model.geneontology.org/R-HSA-1855183
8,G alpha (12/13) signalling events - Reactome,0.000792,8,13,"[GNA13, set:Ligand:GPCR complexes that activate Gi, set:ARHGEF11,ARHGEF12, set:G-protein G12/G13 (inactive), set:G-protein gamma subunit, set:GEFs, set:G-protein alpha (12/13):GTP, set:G alpha (i)]",http://model.geneontology.org/R-HSA-416482
9,Ca2+ pathway - Reactome,0.001121,9,27,"[set:PLCdegh, set:PLC-beta, set:G-protein gamma subunit, set:PL(C)D4:3xCa2+, set:PLC-beta 1/2/3, set:PLCbz, set:G-protein alpha (q/11/14/15), set:G(q) alpha 11,14,15,Q, set:PLC beta1,2,3]",http://model.geneontology.org/R-HSA-4086398
