## Effects of Phase 2 (ignoring phase 1)
## hypoxic vs. control

GO analysis and GSEA with KEGG

In [1]:
# loading packages
library(clusterProfiler)
library(topGO)
library(dplyr)
library(KEGGREST)
library(ggplot2)



clusterProfiler v4.10.0  For help: https://yulab-smu.top/biomedical-knowledge-mining-book/

If you use clusterProfiler in published research, please cite:
T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu. clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. The Innovation. 2021, 2(3):100141



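What we need for GO analysis:
- a named list of DMGs with one score per gene. topGO's KS test ranks genes by increasing score, so it is designed for p-values; the cells below pass log2FoldChange instead, which ranks genes from the most negative to the most positive fold change.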

In [2]:
# load significant genes df for hypoxic vs. control for phase 2
data <- read.csv('/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/significant_genes/sig_p2_hc_genes.csv')

# select needed columns (really might only need l2fc)
data2 <- select(data, Row.names, log2FoldChange, pvalue, padj)

# renaming columns so they make more sense
colnames(data2) = c('gene', 'l2fc', 'pval', 'padj')
head(data2)

| | gene | l2fc | pval | padj |
|---|---|---|---|---|
| | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | LOC111100406 | -1.9786229 | 0.0003382968 | 0.036937003 |
| 2 | LOC111101237 | -2.0831676 | 0.0002697234 | 0.035842664 |
| 3 | LOC111104284 | -0.4820848 | 0.0003039575 | 0.035842664 |
| 4 | LOC111105299 | -2.5172192 | 1.231961e-05 | 0.006053034 |
| 5 | LOC111105528 | -3.1608623 | 9.945249e-05 | 0.026653268 |
| 6 | LOC111106800 | -2.3481348 | 3.482692e-05 | 0.011586848 |


In [3]:
# creating a named numeric vector: log2FoldChange values named by gene
geneList <- data2$l2fc
names(geneList) <- data2$gene

# double checking things look right
head(geneList)
class(geneList) # numeric, used in allGenes for topGO object

In [4]:
# loading conversion df of unique genes with associated GO ids
geneID2GO <- read.csv('/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/GO_enrichment_analysis/geneID2GO.txt', sep='\t')

# renaming columns
colnames(geneID2GO) = c('gene','GO_id')

# checking things make sense
head(geneID2GO)
dim(geneID2GO) # have 22,654 unique genes that have GO annotations

| | gene | GO_id |
|---|---|---|
| | `<chr>` | `<chr>` |
| 1 | LOC111133408 | GO:2001070 |
| 2 | LOC111121603 | GO:2000781,GO:2000781 |
| 3 | LOC111132389 | GO:2000145 |
| 4 | LOC111115105 | GO:1990904,GO:1990904 |
| 5 | LOC111129853 | GO:1990904,GO:1990904 |
| 6 | LOC111101512 | GO:1990904 |


In [5]:
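# topGO needs the annotations as a named list (gene -> GO ids), so re-read the
# same file with readMappings(); this overwrites the data frame loaded above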
geneID2GO <- readMappings(file = '/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/GO_enrichment_analysis/geneID2GO.txt')
geneID2GO <- geneID2GO[-1] # drops the first entry, which is the file's header row
head(geneID2GO)
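
For reference, the kind of list this cell produces can be approximated in base R. This is only a sketch, not topGO's implementation: it assumes the tab-separated, comma-delimited layout shown in the preview above, uses three of the previewed rows as inline data, and the names `raw_lines`, `parts`, and `gene2go_sketch` are made up for illustration. Unlike the file, it also collapses repeated GO ids such as the doubled `GO:2000781`.

```r
# Three rows copied from the geneID2GO preview (gene <TAB> comma-separated GO ids).
raw_lines <- c("LOC111133408\tGO:2001070",
               "LOC111121603\tGO:2000781,GO:2000781",
               "LOC111101512\tGO:1990904")

# Split each line into the gene id and the GO id string.
parts <- strsplit(raw_lines, "\t", fixed = TRUE)

# Build a named list: one character vector of unique GO ids per gene.
gene2go_sketch <- lapply(parts, function(p) {
  unique(strsplit(p[2], ",", fixed = TRUE)[[1]])
})
names(gene2go_sketch) <- vapply(parts, `[`, character(1), 1)

str(gene2go_sketch)
```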

In [6]:
geneNames <- names(geneID2GO)
head(geneNames)

In [7]:
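# selection function for topGO: flags every gene whose score is below 0.01
# note: geneList holds log2FoldChange values, not p-values, so this flags
# genes by fold change (l2fc < 0.01), not by statistical significance
topDiffGenes <- function(allScore) {
    return(allScore < 0.01)
}

x <- topDiffGenes(geneList)
sum(x) ## the number of selected genes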
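
A small sketch of how the same cutoff behaves on the two kinds of score. The three genes and their values are hypothetical (they are not taken from the data set); only `topDiffGenes()` is copied from the cell above.

```r
# Same selection rule as in the cell above.
topDiffGenes <- function(allScore) {
    return(allScore < 0.01)
}

# Hypothetical scores for three made-up genes.
l2fc <- c(geneA = -2.0, geneB = 0.8, geneC = -0.5)
pval <- c(geneA = 0.0003, geneB = 0.0002, geneC = 0.2)

# On fold changes the rule keeps genes with a negative (or near-zero) l2fc,
# whatever their p-value.
topDiffGenes(l2fc)

# On p-values it keeps the genes that are significant at the 0.01 level.
topDiffGenes(pval)
```

This would be consistent with the topGOdata summaries in the following cells, where 29 of the 31 genes are reported as significant.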

#### GO analysis for molecular function

In [8]:
# creating GO data object
GOdata_MF <- new("topGOdata", 
              description = 'DMGs in phase 2 both vs. control',
              ontology = "MF", 
              allGenes = geneList,
              geneSel = topDiffGenes,
              annot = annFUN.gene2GO, 
              gene2GO = geneID2GO)
GOdata_MF


Building most specific GOs .....

	( 30 GO terms found. )


Build GO DAG topology ..........

	( 111 GO terms and 136 relations. )


Annotating nodes ...............

	( 22 genes annotated to the GO terms. )




------------------------- topGOdata object -------------------------

 Description:
   -  DMGs in phase 2 both vs. control 

 Ontology:
   -  MF 

 31 available genes (all genes from the array):
   - symbol:  LOC111100406 LOC111101237 LOC111104284 LOC111105299 LOC111105528  ...
   - score :  -1.978622926 -2.08316756 -0.4820847941 -2.517219171 -3.160862302  ...
   - 29  significant genes. 

 22 feasible genes (genes that can be used in the analysis):
   - symbol:  LOC111100406 LOC111101237 LOC111104284 LOC111106800 LOC111107933  ...
   - score :  -1.978622926 -2.08316756 -0.4820847941 -2.348134804 0.8000492933  ...
   - 21  significant genes. 

 GO graph (nodes with at least  1  genes):
   - a graph with directed edges
   - number of nodes = 111 
   - number of edges = 136 

------------------------- topGOdata object -------------------------


In [9]:
# Kolmogorov-Smirnov test with the weight01 algorithm
resultKS_MF <- runTest(GOdata_MF, algorithm = "weight01", statistic = "ks")

# putting result into readable table
tab_MF <- GenTable(GOdata_MF, raw.p.value = resultKS_MF, topNodes = length(resultKS_MF@score), numChar = 120)

# showing top 10 GO term results
head(tab_MF, 10)


			 -- Weight01 Algorithm -- 

		 the algorithm is scoring 111 nontrivial nodes
		 parameters: 
			 test statistic: ks
			 score order: increasing


	 Level 9:	1 nodes to be scored	(0 eliminated genes)


	 Level 8:	6 nodes to be scored	(0 eliminated genes)


	 Level 7:	10 nodes to be scored	(5 eliminated genes)


	 Level 6:	17 nodes to be scored	(7 eliminated genes)


	 Level 5:	27 nodes to be scored	(12 eliminated genes)


	 Level 4:	24 nodes to be scored	(17 eliminated genes)


	 Level 3:	19 nodes to be scored	(21 eliminated genes)


	 Level 2:	6 nodes to be scored	(21 eliminated genes)


	 Level 1:	1 nodes to be scored	(22 eliminated genes)



| | GO.ID | Term | Annotated | Significant | Expected | raw.p.value |
|---|---|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<int>` | `<int>` | `<dbl>` | `<chr>` |
| 1 | GO:0051537 | 2 iron, 2 sulfur cluster binding | 1 | 1 | 0.95 | 0.091 |
| 2 | GO:0008121 | ubiquinol-cytochrome-c reductase activity | 1 | 1 | 0.95 | 0.091 |
| 3 | GO:0035091 | phosphatidylinositol binding | 1 | 1 | 0.95 | 0.136 |
| 4 | GO:0003779 | actin binding | 1 | 1 | 0.95 | 0.136 |
| 5 | GO:0005452 | solute:inorganic anion antiporter activity | 1 | 1 | 0.95 | 0.182 |
| 6 | GO:0019829 | ATPase-coupled monoatomic cation transmembrane transporter activity | 1 | 1 | 0.95 | 0.227 |
| 7 | GO:0015662 | P-type ion transporter activity | 1 | 1 | 0.95 | 0.227 |
| 8 | GO:0046872 | metal ion binding | 9 | 8 | 8.59 | 0.251 |
| 9 | GO:0070694 | deoxyribonucleoside 5'-monophosphate N-glycosidase activity | 1 | 1 | 0.95 | 0.273 |
| 10 | GO:0016791 | phosphatase activity | 1 | 1 | 0.95 | 0.318 |
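
The `Expected` column can be reproduced by hand, which helps when reading this table. The sketch below assumes that Expected = Annotated × (significant feasible genes / feasible genes); the counts come from the MF topGOdata summary above, and the helper name `expected` is made up for illustration.

```r
# Counts taken from the MF topGOdata summary above.
feasible    <- 22  # genes usable in the analysis
significant <- 21  # feasible genes flagged by topDiffGenes()

# Expected number of significant genes for a term with `annotated` genes,
# assuming Expected = annotated * significant / feasible.
expected <- function(annotated) round(annotated * significant / feasible, 2)

expected(1)  # single-gene terms in the table
expected(9)  # metal ion binding
```

Under that assumption the two calls give 0.95 and 8.59, the values in the table. Because almost every feasible gene is flagged as significant, the observed and expected counts can hardly differ for any term.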


#### GO analysis for cellular component

In [10]:
# creating GO data object
GOdata_CC <- new("topGOdata", 
              description = 'DMGs in phase 2 both vs. control',
              ontology = "CC", 
              allGenes = geneList,
              geneSel = topDiffGenes,
              annot = annFUN.gene2GO, 
              gene2GO = geneID2GO)
GOdata_CC


Building most specific GOs .....

	( 13 GO terms found. )


Build GO DAG topology ..........

	( 46 GO terms and 71 relations. )


Annotating nodes ...............

	( 17 genes annotated to the GO terms. )




------------------------- topGOdata object -------------------------

 Description:
   -  DMGs in phase 2 both vs. control 

 Ontology:
   -  CC 

 31 available genes (all genes from the array):
   - symbol:  LOC111100406 LOC111101237 LOC111104284 LOC111105299 LOC111105528  ...
   - score :  -1.978622926 -2.08316756 -0.4820847941 -2.517219171 -3.160862302  ...
   - 29  significant genes. 

 17 feasible genes (genes that can be used in the analysis):
   - symbol:  LOC111101237 LOC111108080 LOC111108426 LOC111108790 LOC111110475  ...
   - score :  -2.08316756 -1.522307064 -0.67002702 -1.992056365 -1.555819067  ...
   - 17  significant genes. 

 GO graph (nodes with at least  1  genes):
   - a graph with directed edges
   - number of nodes = 46 
   - number of edges = 71 

------------------------- topGOdata object -------------------------


In [11]:
# Kolmogorov-Smirnov test with the weight01 algorithm
resultKS_CC <- runTest(GOdata_CC, algorithm = "weight01", statistic = "ks")

# putting result into readable table
tab_CC <- GenTable(GOdata_CC, raw.p.value = resultKS_CC, topNodes = length(resultKS_CC@score), numChar = 120)

# showing top 10 GO term results
head(tab_CC, 10)


			 -- Weight01 Algorithm -- 

		 the algorithm is scoring 46 nontrivial nodes
		 parameters: 
			 test statistic: ks
			 score order: increasing


	 Level 10:	1 nodes to be scored	(0 eliminated genes)


	 Level 9:	2 nodes to be scored	(0 eliminated genes)


	 Level 8:	3 nodes to be scored	(1 eliminated genes)


	 Level 7:	4 nodes to be scored	(2 eliminated genes)


	 Level 6:	8 nodes to be scored	(3 eliminated genes)


	 Level 5:	6 nodes to be scored	(4 eliminated genes)


	 Level 4:	8 nodes to be scored	(8 eliminated genes)


	 Level 3:	11 nodes to be scored	(8 eliminated genes)


	 Level 2:	2 nodes to be scored	(11 eliminated genes)


	 Level 1:	1 nodes to be scored	(17 eliminated genes)



| | GO.ID | Term | Annotated | Significant | Expected | raw.p.value |
|---|---|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<int>` | `<int>` | `<dbl>` | `<chr>` |
| 1 | GO:0005743 | mitochondrial inner membrane | 1 | 1 | 1 | 0.059 |
| 2 | GO:0070469 | respirasome | 1 | 1 | 1 | 0.059 |
| 3 | GO:0005634 | nucleus | 4 | 4 | 4 | 0.143 |
| 4 | GO:0005783 | endoplasmic reticulum | 1 | 1 | 1 | 0.294 |
| 5 | GO:0016020 | membrane | 11 | 11 | 11 | 0.364 |
| 6 | GO:0005681 | spliceosomal complex | 1 | 1 | 1 | 0.412 |
| 7 | GO:0005856 | cytoskeleton | 1 | 1 | 1 | 0.471 |
| 8 | GO:0060170 | ciliary membrane | 1 | 1 | 1 | 0.471 |
| 9 | GO:0005886 | plasma membrane | 3 | 3 | 3 | 0.475 |
| 10 | GO:0005737 | cytoplasm | 5 | 5 | 5 | 0.552 |


#### GO analysis for biological process

In [12]:
# creating GO data object
GOdata_BP <- new("topGOdata", 
              description = 'DMGs in phase 2 both vs. control',
              ontology = "BP", 
              allGenes = geneList,
              geneSel = topDiffGenes,
              annot = annFUN.gene2GO, 
              gene2GO = geneID2GO)
GOdata_BP


Building most specific GOs .....

	( 15 GO terms found. )


Build GO DAG topology ..........

	( 145 GO terms and 274 relations. )


Annotating nodes ...............

	( 13 genes annotated to the GO terms. )




------------------------- topGOdata object -------------------------

 Description:
   -  DMGs in phase 2 both vs. control 

 Ontology:
   -  BP 

 31 available genes (all genes from the array):
   - symbol:  LOC111100406 LOC111101237 LOC111104284 LOC111105299 LOC111105528  ...
   - score :  -1.978622926 -2.08316756 -0.4820847941 -2.517219171 -3.160862302  ...
   - 29  significant genes. 

 13 feasible genes (genes that can be used in the analysis):
   - symbol:  LOC111101237 LOC111104284 LOC111106800 LOC111108080 LOC111108426  ...
   - score :  -2.08316756 -0.4820847941 -2.348134804 -1.522307064 -0.67002702  ...
   - 13  significant genes. 

 GO graph (nodes with at least  1  genes):
   - a graph with directed edges
   - number of nodes = 145 
   - number of edges = 274 

------------------------- topGOdata object -------------------------


In [13]:
# Kolmogorov-Smirnov test with the weight01 algorithm
resultKS_BP <- runTest(GOdata_BP, algorithm = "weight01", statistic = "ks")

# putting result into readable table
tab_BP <- GenTable(GOdata_BP, raw.p.value = resultKS_BP, topNodes = length(resultKS_BP@score), numChar = 120)

# showing top 10 GO term results
head(tab_BP, 10)


			 -- Weight01 Algorithm -- 

		 the algorithm is scoring 145 nontrivial nodes
		 parameters: 
			 test statistic: ks
			 score order: increasing


	 Level 13:	1 nodes to be scored	(0 eliminated genes)


	 Level 12:	1 nodes to be scored	(0 eliminated genes)


	 Level 11:	4 nodes to be scored	(1 eliminated genes)


	 Level 10:	10 nodes to be scored	(1 eliminated genes)


	 Level 9:	14 nodes to be scored	(2 eliminated genes)


	 Level 8:	13 nodes to be scored	(5 eliminated genes)


	 Level 7:	12 nodes to be scored	(7 eliminated genes)


	 Level 6:	25 nodes to be scored	(8 eliminated genes)


	 Level 5:	29 nodes to be scored	(10 eliminated genes)


	 Level 4:	18 nodes to be scored	(13 eliminated genes)


	 Level 3:	12 nodes to be scored	(13 eliminated genes)


	 Level 2:	5 nodes to be scored	(13 eliminated genes)


	 Level 1:	1 nodes to be scored	(13 eliminated genes)



| | GO.ID | Term | Annotated | Significant | Expected | raw.p.value |
|---|---|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<int>` | `<int>` | `<dbl>` | `<chr>` |
| 1 | GO:0006820 | monoatomic anion transport | 1 | 1 | 1 | 0.15 |
| 2 | GO:0009116 | nucleoside metabolic process | 1 | 1 | 1 | 0.23 |
| 3 | GO:0009159 | deoxyribonucleoside monophosphate catabolic process | 1 | 1 | 1 | 0.23 |
| 4 | GO:0009117 | nucleotide metabolic process | 2 | 2 | 2 | 0.25 |
| 5 | GO:0046856 | phosphatidylinositol dephosphorylation | 1 | 1 | 1 | 0.31 |
| 6 | GO:0016310 | phosphorylation | 2 | 2 | 2 | 0.35 |
| 7 | GO:0016567 | protein ubiquitination | 1 | 1 | 1 | 0.38 |
| 8 | GO:0006511 | ubiquitin-dependent protein catabolic process | 1 | 1 | 1 | 0.38 |
| 9 | GO:0006355 | regulation of DNA-templated transcription | 2 | 2 | 2 | 0.5 |
| 10 | GO:0006189 | 'de novo' IMP biosynthetic process | 1 | 1 | 1 | 0.54 |
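
Across the three ontologies, the raw p-values of terms with a single annotated gene look like rank fractions k / N, where N is the number of feasible genes and k is the rank of that gene. This is an observation from the tables, not a statement about topGO internals; the sketch below only does the arithmetic, and `n_feasible` is a made-up name.

```r
# Feasible gene counts from the three topGOdata summaries above.
n_feasible <- c(MF = 22, CC = 17, BP = 13)

# If a single-gene term's p-value is k / N, the smallest value it can
# take in each ontology is 1 / N.
round(1 / n_feasible, 3)

# Spot checks against the top rows of the tables:
# 2/22 (MF, 0.091), 1/17 (CC, 0.059), 2/13 (BP, 0.15).
round(c(MF = 2 / 22, CC = 1 / 17, BP = 2 / 13), 3)
```

Read this way, a single-gene term could only drop below 0.05 in the MF ontology, and only for the top-ranked gene, before any multiple-testing correction. With so few feasible genes, the lack of significant terms is expected.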


## Gene Set Enrichment Analysis with clusterProfiler
Looking for enriched KEGG pathways using a gene list ranked by log2FoldChange.

In [21]:
# already have a df with DMGs and scores - need just gene and l2fc
df <- select(data2, gene, l2fc)
head(df)

| | gene | l2fc |
|---|---|---|
| | `<chr>` | `<dbl>` |
| 1 | LOC111100406 | -1.9786229 |
| 2 | LOC111101237 | -2.0831676 |
| 3 | LOC111104284 | -0.4820848 |
| 4 | LOC111105299 | -2.5172192 |
| 5 | LOC111105528 | -3.1608623 |
| 6 | LOC111106800 | -2.3481348 |


In [22]:
# need to have conversion table for gene name to entrez id
david_df <- read.csv('/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/KEGG_pathway/p2_hypoxic_DAVID.txt', sep='\t')
# only selecting columns that I need
david_df <- select(david_df, From, To)
# renaming columns for merge
colnames(david_df) = c('gene', 'entrez_ID')
head(david_df)

| | gene | entrez_ID |
|---|---|---|
| | `<chr>` | `<int>` |
| 1 | LOC111110475 | 111110475 |
| 2 | LOC111123809 | 111123809 |
| 3 | LOC111127492 | 111127492 |
| 4 | LOC111124361 | 111124361 |
| 5 | LOC111108790 | 111108790 |
| 6 | LOC111105528 | 111105528 |


In [23]:
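# matching up dataframes so each entrez id has a log2FoldChange value
# (all = TRUE keeps genes found in only one of the two tables, filling the gap with NA)
# the result is named `merged` so it does not shadow the merge() function
merged <- merge(david_df, df, by = 'gene', all=TRUE)

# grabbing just the entrez_ID and l2fc value
merge_df <- select(merged, entrez_ID, l2fc)
head(merge_df)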

| | entrez_ID | l2fc |
|---|---|---|
| | `<int>` | `<dbl>` |
| 1 | 111100406 | -1.9786229 |
| 2 | 111101237 | -2.0831676 |
| 3 | 111104284 | -0.4820848 |
| 4 | 111105299 | -2.5172192 |
| 5 | 111105528 | -3.1608623 |
| 6 | 111106800 | -2.3481348 |
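
Because the join uses `all = TRUE`, a gene missing from the DAVID table would stay in the result with an `NA` Entrez id. The sketch below shows one way to check for that. It uses two genes from the previews above plus one made-up gene (`LOC_hypothetical`, with an invented fold change) that has no DAVID match.

```r
# Two previewed genes plus one made-up gene without a DAVID match.
david_df <- data.frame(gene = c("LOC111100406", "LOC111105528"),
                       entrez_ID = c(111100406L, 111105528L))
df <- data.frame(gene = c("LOC111100406", "LOC111105528", "LOC_hypothetical"),
                 l2fc = c(-1.9786229, -3.1608623, 1.5))

# Full outer join: unmatched genes are kept, with NA in the missing column.
merged <- merge(david_df, df, by = "gene", all = TRUE)

# Number of rows that would end up with an NA name in the ranked gene list.
sum(is.na(merged$entrez_ID))
```

Note that `length(unique(...))` counts `NA` as one value, so a uniqueness check alone would not reveal a single unmatched gene.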


In [24]:
# checking that every gene appears only once
length(unique(merge_df$entrez_ID))
length(merge_df$entrez_ID)
# both have 31, so all good there

In [26]:
# Create the ranked gene list: a vector of log2FoldChange values
kegg_gene_list <- merge_df$l2fc

# Name vector with ENTREZ ids
names(kegg_gene_list) <- merge_df$entrez_ID

# omit any NA values 
kegg_gene_list<-na.omit(kegg_gene_list)

# sort the list in decreasing order (required for clusterProfiler)
kegg_gene_list = sort(kegg_gene_list, decreasing = TRUE)

head(kegg_gene_list)
class(kegg_gene_list) # numeric
length(kegg_gene_list) # 31 genes
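
A compact sketch of the same preparation with a few sanity checks added. It uses three fold changes from the preview above, named by Entrez id, plus one made-up entry (`999999999`) with a missing score; the names `scores` and `ranked` are illustrative only.

```r
# Fold changes from the preview, named by Entrez id, plus one made-up NA entry.
scores <- c("111100406" = -1.9786229, "111101237" = -2.0831676,
            "111104284" = -0.4820848, "999999999" = NA)

# Drop missing scores, then sort from largest to smallest fold change.
ranked <- sort(scores[!is.na(scores)], decreasing = TRUE)

# Checks one might run before passing the vector to gseKEGG().
stopifnot(!is.unsorted(rev(ranked)),      # decreasing order
          !anyDuplicated(names(ranked)),  # ids are unique
          !anyNA(names(ranked)))          # every score has an id
ranked
```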

In [27]:
kegg_organism = "cvn"
kk2 <- gseKEGG(geneList     = kegg_gene_list,
               organism     = kegg_organism,
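               nPerm        = 10000, # deprecated argument; it triggers the fgseaSimple warnings in the output
               minGSSize    = 1,
               maxGSSize    = 800,
               pvalueCutoff = 1, # 1 keeps every tested pathway in the output; use 0.05 to keep only significant ones
               pAdjustMethod = "BH", # Benjamini–Hochberg FDR (false discovery rate)
               scoreType = "pos", # one-tailed test: only positive enrichment is scored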
               keyType       = "kegg")

preparing geneSet collections...

GSEA analysis...

“We do not recommend using nPerm parameter incurrent and future releases”
“You are trying to run fgseaSimple. It is recommended to use fgseaMultilevel. To run fgseaMultilevel, you need to remove the nperm argument in the fgsea function call.”
leading edge analysis...

done...



In [28]:
kk2_df <- as.data.frame(kk2)
kk2_df$Description <- sub(" -.*", "", kk2_df$Description)
head(kk2_df) # first 6 pathways; pvalueCutoff = 1 means nothing was filtered, and none has pvalue < 0.05

| | ID | Description | setSize | enrichmentScore | NES | pvalue | p.adjust | qvalue | rank | leading_edge | core_enrichment |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<int>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` | `<chr>` |
| cvn03265 | cvn03265 | Virion | 1 | 0.7 | 1.4142843 | 0.3188681 | 0.886178 | 0.886178 | 10 | tags=100%, list=32%, signal=70% | 111117672.0 |
| cvn04142 | cvn04142 | Lysosome | 1 | 0.7 | 1.4142843 | 0.3188681 | 0.886178 | 0.886178 | 10 | tags=100%, list=32%, signal=70% | 111117672.0 |
| cvn00910 | cvn00910 | Nitrogen metabolism | 1 | 0.6666667 | 1.3469374 | 0.3511649 | 0.886178 | 0.886178 | 11 | tags=100%, list=35%, signal=67% | 111135592.0 |
| cvn03040 | cvn03040 | Spliceosome | 1 | 0.5333333 | 1.0775499 | 0.4772523 | 0.886178 | 0.886178 | 15 | tags=100%, list=48%, signal=53% | 111121854.0 |
| cvn00230 | cvn00230 | Purine metabolism | 1 | 0.4666667 | 0.9428562 | 0.5443456 | 0.886178 | 0.886178 | 16 | tags=100%, list=52%, signal=50% | |
| cvn04141 | cvn04141 | Protein processing in endoplasmic reticulum | 1 | 0.3666667 | 0.7408156 | 0.6121388 | 0.886178 | 0.886178 | 13 | tags=100%, list=42%, signal=60% | |


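No KEGG pathway is significantly enriched: every pathway shown contains a single gene from the list (setSize = 1) and has p.adjust = 0.886.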
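
To see why single-gene sets carry so little information, here is a sketch of the classic GSEA running-sum enrichment score, positive side only. It is not fgsea's actual code; `running_es`, `stats`, and `in_set` are hypothetical names, and the 31 scores are invented. For a set with one gene at position k in a ranked list of N genes, the score reduces to 1 - (k - 1) / (N - 1), so it depends only on where that gene sits in the list.

```r
# Classic GSEA running-sum enrichment score (positive side only).
# stats:  statistics sorted in decreasing order
# in_set: TRUE for genes that belong to the gene set
running_es <- function(stats, in_set, weight = 1) {
  hit <- abs(stats)^weight * in_set
  # Hits step up in proportion to their weight, misses step down equally.
  step <- ifelse(in_set, hit / sum(hit), -1 / sum(!in_set))
  max(cumsum(step))
}

# Hypothetical ranked list of 31 scores; the only set member is at position 10.
stats <- seq(3, -3, length.out = 31)
in_set <- seq_along(stats) == 10

running_es(stats, in_set)  # 1 - 9/30
```

With N = 31 and k = 10 this gives 0.7, which matches the enrichmentScore and rank reported for the first two rows of the table.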