## Given phase 2, what is the effect of phase 1?
## Phase 2 = control, **looking at phase 1 hypoxic vs. control**

GO analysis and GSEA with KEGG

In [2]:
# loading packages
library(clusterProfiler)
library(topGO)
library(dplyr)
library(KEGGREST)
library(ggplot2)



clusterProfiler v4.10.0  For help: https://yulab-smu.top/biomedical-knowledge-mining-book/

If you use clusterProfiler in published research, please cite:
T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu. clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. The Innovation. 2021, 2(3):100141



What we need for GO analysis:
- a list of DMGs with one score per gene (topGO ranks scores in increasing order, as p-values would be; this notebook uses log2FoldChange instead)

In [3]:
# load significant genes df: phase 1 hypoxic vs. control, with phase 2 = control
data <- read.csv('/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/significant_genes/sig_p2c_p1hc_genes.csv')

# select needed columns (might really only need l2fc)
data2 <- select(data, Row.names, log2FoldChange, pvalue, padj)

# renaming columns so they make more sense
colnames(data2) <- c('gene', 'l2fc', 'pval', 'padj')
head(data2)

| | gene | l2fc | pval | padj |
|---|---|---|---|---|
| | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | LOC111099424 | 2.345118 | 2.000648e-05 | 0.012218955 |
| 2 | LOC111099567 | -1.827096 | 5.963168e-05 | 0.021852028 |
| 3 | LOC111100699 | 4.063722 | 9.445732e-05 | 0.026313561 |
| 4 | LOC111101020 | 1.605386 | 5.629173e-05 | 0.021852028 |
| 5 | LOC111101925 | -2.272699 | 0.0002362005 | 0.04151111 |
| 6 | LOC111103792 | -2.021323 | 4.870634e-06 | 0.005099554 |


In [4]:
# creating a named numeric vector: names = gene IDs, values = log2FoldChange
geneList <- data2$l2fc
names(geneList) <- data2$gene

# double checking things look right
head(geneList)
class(geneList) # numeric, used in allGenes for topGO object

In [5]:
# loading conversion df of unique genes with associated GO ids
geneID2GO <- read.csv('/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/GO_enrichment_analysis/geneID2GO.txt', sep='\t')

# renaming columns
colnames(geneID2GO) = c('gene','GO_id')

# checking things make sense
head(geneID2GO)
dim(geneID2GO) # have 22,654 unique genes that have GO annotations

| | gene | GO_id |
|---|---|---|
| | `<chr>` | `<chr>` |
| 1 | LOC111133408 | GO:2001070 |
| 2 | LOC111121603 | GO:2000781,GO:2000781 |
| 3 | LOC111132389 | GO:2000145 |
| 4 | LOC111115105 | GO:1990904,GO:1990904 |
| 5 | LOC111129853 | GO:1990904,GO:1990904 |
| 6 | LOC111101512 | GO:1990904 |


In [6]:
# topGO needs the annotation as a named list (gene -> GO ids), so re-read the same file with readMappings()
geneID2GO <- readMappings(file = '/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/GO_enrichment_analysis/geneID2GO.txt')
geneID2GO <- geneID2GO[-1] # removes header
head(geneID2GO)
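
As a rough illustration of the structure `readMappings()` returns, here is a base-R sketch that builds the same kind of named list (gene ID to character vector of GO IDs) from a toy copy of the first three rows shown above. The names `toy_map` and `gene2GO_list` are hypothetical, and dropping the repeated GO IDs is an optional tidy-up rather than something the notebook does.

```r
# Toy copy of the first rows of the geneID2GO table (values taken from the output above)
toy_map <- data.frame(
  gene  = c("LOC111133408", "LOC111121603", "LOC111132389"),
  GO_id = c("GO:2001070", "GO:2000781,GO:2000781", "GO:2000145"),
  stringsAsFactors = FALSE
)

# Split the comma-separated GO ids and drop repeats within each gene
gene2GO_list <- lapply(strsplit(toy_map$GO_id, ","),
                       function(ids) unique(trimws(ids)))
names(gene2GO_list) <- toy_map$gene

# Each element is a character vector of GO ids for one gene
gene2GO_list[["LOC111121603"]]
```

A list of this shape is what the `gene2GO` argument expects when `annot = annFUN.gene2GO` is used.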

In [7]:
geneNames <- names(geneID2GO)
head(geneNames)

In [8]:
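# geneSel function: flags the "significant" genes from the score vector.
# NOTE: geneList holds log2FoldChange values, not p-values, so this keeps
# genes with l2fc < 0.01 (essentially the negative fold changes)
topDiffGenes <- function(allScore) {
    return(allScore < 0.01)
}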

x <- topDiffGenes(geneList)
sum(x) ## the number of selected genes
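
To make the effect of that threshold concrete, the sketch below applies the same `< 0.01` rule to the six genes shown by `head(data2)`, once with log2FoldChange scores and once with p-values. The vectors `l2fc_toy` and `pval_toy` and the function `selectByP` are hypothetical names for illustration only.

```r
# First six genes from head(data2): log2FoldChange and raw p-value
genes    <- c("LOC111099424", "LOC111099567", "LOC111100699",
              "LOC111101020", "LOC111101925", "LOC111103792")
l2fc_toy <- setNames(c(2.345118, -1.827096, 4.063722,
                       1.605386, -2.272699, -2.021323), genes)
pval_toy <- setNames(c(2.000648e-05, 5.963168e-05, 9.445732e-05,
                       5.629173e-05, 0.0002362005, 4.870634e-06), genes)

# Same rule as in the notebook
topDiffGenes <- function(allScore) allScore < 0.01

# With l2fc scores, only the negative fold changes pass the threshold
names(l2fc_toy)[topDiffGenes(l2fc_toy)]

# Alternative (an assumption, not the notebook's method): score by p-value
selectByP <- function(allScore) allScore < 0.01
sum(selectByP(pval_toy))
```

With fold changes as scores, the rule picks out direction rather than significance, which would explain why topGO reports 31 of the 48 genes as significant. Scoring by p-value would flag every gene in this file, since it already contains only significant DMGs, so a p-value-based run would presumably need the full background gene list as `allGenes`.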

#### GO analysis for molecular function

In [14]:
# creating GO data object
GOdata_MF <- new("topGOdata", 
              description = 'DMGs in phase 1 hypoxic vs. control, phase 2 control',
              ontology = "MF", 
              allGenes = geneList,
              geneSel = topDiffGenes,
              annot = annFUN.gene2GO, 
              gene2GO = geneID2GO)
GOdata_MF


Building most specific GOs .....

	( 21 GO terms found. )


Build GO DAG topology ..........

	( 78 GO terms and 91 relations. )


Annotating nodes ...............

	( 20 genes annotated to the GO terms. )




------------------------- topGOdata object -------------------------

 Description:
   -  DMGs in phase 1 hypoxic vs. control, phase 2 control 

 Ontology:
   -  MF 

 48 available genes (all genes from the array):
   - symbol:  LOC111099424 LOC111099567 LOC111100699 LOC111101020 LOC111101925  ...
   - score :  2.345118284 -1.827095809 4.063721828 1.605386194 -2.272698799  ...
   - 31  significant genes. 

 20 feasible genes (genes that can be used in the analysis):
   - symbol:  LOC111100699 LOC111101020 LOC111103792 LOC111105248 LOC111114112  ...
   - score :  4.063721828 1.605386194 -2.021322711 -0.7323141969 -1.112104645  ...
   - 10  significant genes. 

 GO graph (nodes with at least  1  genes):
   - a graph with directed edges
   - number of nodes = 78 
   - number of edges = 91 

------------------------- topGOdata object -------------------------


In [15]:
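# Kolmogorov-Smirnov test with the weight01 algorithm
# (scores are ranked in increasing order, so with l2fc the most negative genes rank first)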
resultKS_MF <- runTest(GOdata_MF, algorithm = "weight01", statistic = "ks")

# putting result into readable table
tab_MF <- GenTable(GOdata_MF, raw.p.value = resultKS_MF, topNodes = length(resultKS_MF@score), numChar = 120)

# showing top 10 GO term results
head(tab_MF, 10)


			 -- Weight01 Algorithm -- 

		 the algorithm is scoring 78 nontrivial nodes
		 parameters: 
			 test statistic: ks
			 score order: increasing


	 Level 9:	1 nodes to be scored	(0 eliminated genes)


	 Level 8:	3 nodes to be scored	(0 eliminated genes)


	 Level 7:	7 nodes to be scored	(1 eliminated genes)


	 Level 6:	13 nodes to be scored	(2 eliminated genes)


	 Level 5:	15 nodes to be scored	(7 eliminated genes)


	 Level 4:	18 nodes to be scored	(11 eliminated genes)


	 Level 3:	15 nodes to be scored	(18 eliminated genes)


	 Level 2:	5 nodes to be scored	(20 eliminated genes)


	 Level 1:	1 nodes to be scored	(20 eliminated genes)



| | GO.ID | Term | Annotated | Significant | Expected | raw.p.value |
|---|---|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<int>` | `<int>` | `<dbl>` | `<chr>` |
| 1 | GO:0005509 | calcium ion binding | 1 | 1 | 0.5 | 0.05 |
| 2 | GO:0005544 | calcium-dependent phospholipid binding | 1 | 1 | 0.5 | 0.05 |
| 3 | GO:0003677 | DNA binding | 4 | 3 | 2.0 | 0.058 |
| 4 | GO:0008270 | zinc ion binding | 3 | 2 | 1.5 | 0.103 |
| 5 | GO:0000981 | DNA-binding transcription factor activity, RNA polymerase II-specific | 1 | 1 | 0.5 | 0.25 |
| 6 | GO:0004725 | protein tyrosine phosphatase activity | 1 | 1 | 0.5 | 0.3 |
| 7 | GO:0004842 | ubiquitin-protein transferase activity | 1 | 1 | 0.5 | 0.35 |
| 8 | GO:0016757 | glycosyltransferase activity | 1 | 1 | 0.5 | 0.45 |
| 9 | GO:0005085 | guanyl-nucleotide exchange factor activity | 1 | 1 | 0.5 | 0.5 |
| 10 | GO:0008061 | chitin binding | 1 | 0 | 0.5 | 0.55 |
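
The `Expected` column can be checked by hand. Assuming it is the number of annotated genes scaled by the overall fraction of significant genes among the feasible genes (10 of 20 for MF, per the topGOdata summary), a small sketch reproduces the values in the table. `expected_count` is a hypothetical helper, not a topGO function.

```r
# Expected significant genes for a term, assuming
# Expected = Annotated * (significant feasible genes / feasible genes)
expected_count <- function(annotated, n_sig, n_feasible) {
  annotated * n_sig / n_feasible
}

# MF ontology: 10 significant genes out of 20 feasible genes
expected_count(1, 10, 20)  # calcium ion binding: 0.5
expected_count(4, 10, 20)  # DNA binding: 2
expected_count(3, 10, 20)  # zinc ion binding: 1.5
```

The same arithmetic with 11 significant of 20 feasible genes gives 11 × 11/20 = 6.05, which matches the `membrane` row in the cellular component results.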


#### GO analysis for cellular component

In [16]:
# creating GO data object
GOdata_CC <- new("topGOdata", 
              description = 'DMGs in phase 1 hypoxic vs. control, phase 2 control',
              ontology = "CC", 
              allGenes = geneList,
              geneSel = topDiffGenes,
              annot = annFUN.gene2GO, 
              gene2GO = geneID2GO)
GOdata_CC


Building most specific GOs .....

	( 7 GO terms found. )


Build GO DAG topology ..........

	( 22 GO terms and 32 relations. )


Annotating nodes ...............

	( 20 genes annotated to the GO terms. )




------------------------- topGOdata object -------------------------

 Description:
   -  DMGs in phase 1 hypoxic vs. control, phase 2 control 

 Ontology:
   -  CC 

 48 available genes (all genes from the array):
   - symbol:  LOC111099424 LOC111099567 LOC111100699 LOC111101020 LOC111101925  ...
   - score :  2.345118284 -1.827095809 4.063721828 1.605386194 -2.272698799  ...
   - 31  significant genes. 

 20 feasible genes (genes that can be used in the analysis):
   - symbol:  LOC111099424 LOC111099567 LOC111100699 LOC111104344 LOC111112399  ...
   - score :  2.345118284 -1.827095809 4.063721828 -1.025377149 -1.196601508  ...
   - 11  significant genes. 

 GO graph (nodes with at least  1  genes):
   - a graph with directed edges
   - number of nodes = 22 
   - number of edges = 32 

------------------------- topGOdata object -------------------------


In [17]:
# Kolmogorov-Smirnov test with the weight01 algorithm
resultKS_CC <- runTest(GOdata_CC, algorithm = "weight01", statistic = "ks")

# putting result into readable table
tab_CC <- GenTable(GOdata_CC, raw.p.value = resultKS_CC, topNodes = length(resultKS_CC@score), numChar = 120)

# showing top 10 GO term results
head(tab_CC, 10)


			 -- Weight01 Algorithm -- 

		 the algorithm is scoring 22 nontrivial nodes
		 parameters: 
			 test statistic: ks
			 score order: increasing


	 Level 8:	1 nodes to be scored	(0 eliminated genes)


	 Level 7:	1 nodes to be scored	(0 eliminated genes)


	 Level 6:	2 nodes to be scored	(1 eliminated genes)


	 Level 5:	3 nodes to be scored	(1 eliminated genes)


	 Level 4:	6 nodes to be scored	(7 eliminated genes)


	 Level 3:	7 nodes to be scored	(7 eliminated genes)


	 Level 2:	1 nodes to be scored	(10 eliminated genes)


	 Level 1:	1 nodes to be scored	(20 eliminated genes)



| | GO.ID | Term | Annotated | Significant | Expected | raw.p.value |
|---|---|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<int>` | `<int>` | `<dbl>` | `<chr>` |
| 1 | GO:0016020 | membrane | 11 | 7 | 6.05 | 0.12 |
| 2 | GO:0005634 | nucleus | 6 | 4 | 3.3 | 0.29 |
| 3 | GO:0005576 | extracellular region | 1 | 0 | 0.55 | 0.6 |
| 4 | GO:0005886 | plasma membrane | 1 | 0 | 0.55 | 0.7 |
| 5 | GO:0005737 | cytoplasm | 5 | 2 | 2.75 | 0.76 |
| 6 | GO:0005730 | nucleolus | 1 | 0 | 0.55 | 0.85 |
| 7 | GO:0005794 | Golgi apparatus | 1 | 0 | 0.55 | 0.95 |
| 8 | GO:0005622 | intracellular anatomical structure | 9 | 4 | 4.95 | 1.0 |
| 9 | GO:0070013 | intracellular organelle lumen | 1 | 0 | 0.55 | 1.0 |
| 10 | GO:0031981 | nuclear lumen | 1 | 0 | 0.55 | 1.0 |


#### GO analysis for biological process

In [19]:
# creating GO data object
GOdata_BP <- new("topGOdata", 
              description = 'DMGs in phase 1 hypoxic vs. control, phase 2 control',
              ontology = "BP", 
              allGenes = geneList,
              geneSel = topDiffGenes,
              annot = annFUN.gene2GO, 
              gene2GO = geneID2GO)
GOdata_BP


Building most specific GOs .....

	( 11 GO terms found. )


Build GO DAG topology ..........

	( 145 GO terms and 288 relations. )


Annotating nodes ...............

	( 10 genes annotated to the GO terms. )




------------------------- topGOdata object -------------------------

 Description:
   -  DMGs in phase 1 hypoxic vs. control, phase 2 control 

 Ontology:
   -  BP 

 48 available genes (all genes from the array):
   - symbol:  LOC111099424 LOC111099567 LOC111100699 LOC111101020 LOC111101925  ...
   - score :  2.345118284 -1.827095809 4.063721828 1.605386194 -2.272698799  ...
   - 31  significant genes. 

 10 feasible genes (genes that can be used in the analysis):
   - symbol:  LOC111100699 LOC111119646 LOC111120850 LOC111128754 LOC111128755  ...
   - score :  4.063721828 -2.231953627 1.177318323 1.50442414 -4.766785816  ...
   - 5  significant genes. 

 GO graph (nodes with at least  1  genes):
   - a graph with directed edges
   - number of nodes = 145 
   - number of edges = 288 

------------------------- topGOdata object -------------------------


In [20]:
# Kolmogorov-Smirnov test with the weight01 algorithm
resultKS_BP <- runTest(GOdata_BP, algorithm = "weight01", statistic = "ks")

# putting result into readable table
tab_BP <- GenTable(GOdata_BP, raw.p.value = resultKS_BP, topNodes = length(resultKS_BP@score), numChar = 120)

# showing top 10 GO term results
head(tab_BP, 10)


			 -- Weight01 Algorithm -- 

		 the algorithm is scoring 145 nontrivial nodes
		 parameters: 
			 test statistic: ks
			 score order: increasing


	 Level 14:	2 nodes to be scored	(0 eliminated genes)


	 Level 13:	1 nodes to be scored	(0 eliminated genes)


	 Level 12:	2 nodes to be scored	(1 eliminated genes)


	 Level 11:	3 nodes to be scored	(1 eliminated genes)


	 Level 10:	7 nodes to be scored	(2 eliminated genes)


	 Level 9:	11 nodes to be scored	(2 eliminated genes)


	 Level 8:	11 nodes to be scored	(3 eliminated genes)


	 Level 7:	11 nodes to be scored	(4 eliminated genes)


	 Level 6:	18 nodes to be scored	(4 eliminated genes)


	 Level 5:	30 nodes to be scored	(4 eliminated genes)


	 Level 4:	26 nodes to be scored	(8 eliminated genes)


	 Level 3:	16 nodes to be scored	(9 eliminated genes)


	 Level 2:	6 nodes to be scored	(10 eliminated genes)


	 Level 1:	1 nodes to be scored	(10 eliminated genes)



| | GO.ID | Term | Annotated | Significant | Expected | raw.p.value |
|---|---|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<int>` | `<int>` | `<dbl>` | `<chr>` |
| 1 | GO:0035556 | intracellular signal transduction | 1 | 1 | 0.5 | 0.1 |
| 2 | GO:0055070 | copper ion homeostasis | 1 | 1 | 0.5 | 0.2 |
| 3 | GO:2000042 | negative regulation of double-strand break repair via homologous recombination | 1 | 1 | 0.5 | 0.3 |
| 4 | GO:0006606 | protein import into nucleus | 1 | 1 | 0.5 | 0.4 |
| 5 | GO:0016311 | dephosphorylation | 1 | 1 | 0.5 | 0.5 |
| 6 | GO:0006914 | autophagy | 1 | 0 | 0.5 | 0.6 |
| 7 | GO:0006508 | proteolysis | 1 | 0 | 0.5 | 0.7 |
| 8 | GO:0006355 | regulation of DNA-templated transcription | 1 | 0 | 0.5 | 0.8 |
| 9 | GO:0006629 | lipid metabolic process | 1 | 0 | 0.5 | 0.9 |
| 10 | GO:0000096 | sulfur amino acid metabolic process | 1 | 0 | 0.5 | 1.0 |


## Gene Set Enrichment Analysis with clusterProfiler
looking for enriched KEGG pathways with a ranked gene list

In [27]:
# already have a df with DMGs and scores - need just gene and l2fc
df <- select(data2, gene, l2fc)
head(df)
dim(df) # 48 genes

| | gene | l2fc |
|---|---|---|
| | `<chr>` | `<dbl>` |
| 1 | LOC111099424 | 2.345118 |
| 2 | LOC111099567 | -1.827096 |
| 3 | LOC111100699 | 4.063722 |
| 4 | LOC111101020 | 1.605386 |
| 5 | LOC111101925 | -2.272699 |
| 6 | LOC111103792 | -2.021323 |


In [29]:
# need to have conversion table for gene name to entrez id
# obtained from DAVID gene accession conversion tool
david_df <- read.csv('/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/KEGG_pathway/p2c_p1hc_DAVID.txt', sep='\t')
# only selecting columns that I need
david_df <- select(david_df, From, To)
# renaming columns for merge
colnames(david_df) = c('gene', 'entrez_ID')
head(david_df)
dim(david_df)

| | gene | entrez_ID |
|---|---|---|
| | `<chr>` | `<int>` |
| 1 | LOC111135994 | 111135994 |
| 2 | LOC111136466 | 111136466 |
| 3 | LOC111119748 | 111119748 |
| 4 | LOC111117745 | 111117745 |
| 5 | LOC111115741 | 111115741 |
| 6 | LOC111134401 | 111134401 |


In [30]:
# matching up dataframes so each entrez id has a log2FoldChange value
# (all = TRUE keeps unmatched rows from both sides and fills the gaps with NA)
merged <- merge(david_df, df, by = 'gene', all = TRUE)

# grabbing just the entrez_ID and l2fc value
merge_df <- select(merged, entrez_ID, l2fc)
head(merge_df)

| | entrez_ID | l2fc |
|---|---|---|
| | `<int>` | `<dbl>` |
| 1 | 111099424 | 2.345118 |
| 2 | 111099567 | -1.827096 |
| 3 | 111100699 | 4.063722 |
| 4 | 111101020 | 1.605386 |
| 5 | 111101925 | -2.272699 |
| 6 | 111103792 | -2.021323 |
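
Because the merge above is a full outer join, any gene present in only one of the two tables ends up with an `NA` in the other table's column. The toy sketch below shows the difference between that and an inner join. `david_toy` and `df_toy` are hypothetical two-row stand-ins, not the real data.

```r
# Toy conversion table and toy fold-change table with one gene in common
david_toy <- data.frame(gene      = c("LOC111099424", "LOC111135994"),
                        entrez_ID = c(111099424L, 111135994L))
df_toy    <- data.frame(gene = c("LOC111099424", "LOC111099567"),
                        l2fc = c(2.345118, -1.827096))

# Full outer join (what all = TRUE does): unmatched rows are kept with NA
outer_join <- merge(david_toy, df_toy, by = "gene", all = TRUE)

# Inner join: only genes that have both an entrez id and a fold change
inner_join <- merge(david_toy, df_toy, by = "gene")

nrow(outer_join)                  # 3 rows, two of them incomplete
sum(is.na(outer_join$entrez_ID))  # 1 gene without an entrez id
sum(is.na(outer_join$l2fc))       # 1 gene without a fold change
nrow(inner_join)                  # 1 complete row
```

If only complete pairs are wanted for the ranked list, dropping `all = TRUE` would be one way to get them without a later `NA` clean-up.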


In [31]:
# checking that every entrez ID appears only once
length(unique(merge_df$entrez_ID))
length(merge_df$entrez_ID)
# both have 111, so all good there

In [32]:
# Create the ranked gene list: a numeric vector of log2FoldChange values
kegg_gene_list <- merge_df$l2fc

# Name vector with ENTREZ ids
names(kegg_gene_list) <- merge_df$entrez_ID

# omit any NA values (drops entries with a missing l2fc; the names are not checked)
kegg_gene_list <- na.omit(kegg_gene_list)

# sort the list in decreasing order (required for clusterProfiler)
kegg_gene_list <- sort(kegg_gene_list, decreasing = TRUE)

head(kegg_gene_list)
class(kegg_gene_list) # numeric
length(kegg_gene_list) # 111 genes
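
`na.omit()` on a named vector only looks at the values, so an entry whose name (the Entrez ID) is missing would survive the clean-up. The sketch below, on a hypothetical four-element vector `ranked`, filters on both the value and the name and then checks for duplicated names before sorting.

```r
# Toy ranked list: one missing fold change and one missing entrez id
ranked <- c(2.345118, NA, -1.827096, 4.063722)
names(ranked) <- c("111099424", "111135994", NA, "111100699")

# Keep entries that have both a score and a usable name
keep   <- !is.na(ranked) & !is.na(names(ranked))
ranked <- sort(ranked[keep], decreasing = TRUE)

# 0 means no name occurs more than once
anyDuplicated(names(ranked))
ranked
```

A check like `anyDuplicated()` before calling the enrichment function is a cheap way to see whether duplicated or missing names are present in the input.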

In [33]:
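kegg_organism <- "cvn" # KEGG organism code for Crassostrea virginica
kk2 <- gseKEGG(geneList     = kegg_gene_list,
               organism     = kegg_organism,
               nPerm        = 10000, # deprecated argument; triggers the fgseaSimple warnings below
               minGSSize    = 1, # allows gene sets with a single gene
               maxGSSize    = 800,
               pvalueCutoff = 1, # 1 returns every tested pathway; use 0.05 to keep only significant ones
               pAdjustMethod = "BH", # Benjamini–Hochberg FDR (false discovery rate)
               scoreType = "pos", # one-sided test: positive enrichment only
               keyType       = "kegg")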

Reading KEGG annotation online: "https://rest.kegg.jp/link/cvn/pathway"...

Reading KEGG annotation online: "https://rest.kegg.jp/list/pathway/cvn"...

preparing geneSet collections...

GSEA analysis...

“We do not recommend using nPerm parameter incurrent and future releases”
“You are trying to run fgseaSimple. It is recommended to use fgseaMultilevel. To run fgseaMultilevel, you need to remove the nperm argument in the fgsea function call.”
“There are duplicate gene names, fgsea may produce unexpected results.”
leading edge analysis...

done...



In [34]:
kk2_df <- as.data.frame(kk2)
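# strip the organism suffix (everything from " -" onward) from the pathway names
kk2_df$Description <- sub(" -.*", "", kk2_df$Description)
head(kk2_df) # first six rows; pvalueCutoff = 1 keeps every tested pathway, and only cvn00270 has a raw pvalue < 0.05 (no row shown has p.adjust < 0.05)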

| | ID | Description | setSize | enrichmentScore | NES | pvalue | p.adjust | qvalue | rank | leading_edge | core_enrichment |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<int>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` | `<chr>` |
| cvn00270 | cvn00270 | Cysteine and methionine metabolism | 1 | 1.0 | 1.992336 | 0.02149785 | 0.3654635 | 0.3620691 | 1 | tags=100%, list=2%, signal=100% | 111100699 |
| cvn03008 | cvn03008 | Ribosome biogenesis in eukaryotes | 1 | 0.8723404 | 1.737995 | 0.1489851 | 0.4030722 | 0.3993285 | 7 | tags=100%, list=15%, signal=87% | 111129049 |
| cvn00020 | cvn00020 | Citrate cycle (TCA cycle) | 1 | 0.8297872 | 1.653215 | 0.18968103 | 0.4030722 | 0.3993285 | 9 | tags=100%, list=19%, signal=83% | 111117164 |
| cvn00630 | cvn00630 | Glyoxylate and dicarboxylate metabolism | 1 | 0.8297872 | 1.653215 | 0.18968103 | 0.4030722 | 0.3993285 | 9 | tags=100%, list=19%, signal=83% | 111117164 |
| cvn01200 | cvn01200 | Carbon metabolism | 1 | 0.8297872 | 1.653215 | 0.18968103 | 0.4030722 | 0.3993285 | 9 | tags=100%, list=19%, signal=83% | 111117164 |
| cvn01210 | cvn01210 | 2-Oxocarboxylic acid metabolism | 1 | 0.8297872 | 1.653215 | 0.18968103 | 0.4030722 | 0.3993285 | 9 | tags=100%, list=19%, signal=83% | 111117164 |
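
Every pathway in this table has `setSize` 1, so its enrichment score is determined entirely by where that one gene sits in the ranked list. As a sanity check, the sketch below computes the running-sum statistic for a single-gene set, assuming a ranked list of 48 genes (the number of DMGs in `df`) and the positive-only score used above. `single_gene_es` is a hypothetical helper, not a clusterProfiler or fgsea function.

```r
# Enrichment score for a gene set containing exactly one gene.
# The running sum drops by 1/(n_genes - 1) for every gene outside the set
# and jumps by 1 at the single hit; the score is the maximum of that sum.
single_gene_es <- function(rank, n_genes) {
  miss  <- -1 / (n_genes - 1)
  steps <- c(rep(miss, rank - 1), 1, rep(miss, n_genes - rank))
  max(cumsum(steps))
}

single_gene_es(1, 48)  # gene at rank 1: 1
single_gene_es(7, 48)  # gene at rank 7: 41/47, about 0.8723404
single_gene_es(9, 48)  # gene at rank 9: 39/47, about 0.8297872
```

These values agree with the `enrichmentScore` column, which suggests the scores here reflect little more than the rank of one gene. The four pathways sharing gene 111117164 necessarily get identical statistics.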
