## Given phase 2, what is the effect of phase 1?
## phase 2 = hypoxic, **looking at phase 1 hypoxic vs. control**

GO analysis and GSEA with KEGG

In [1]:
# loading packages
library(clusterProfiler)
library(topGO)
library(dplyr)
library(KEGGREST)
library(ggplot2)



clusterProfiler v4.10.0  For help: https://yulab-smu.top/biomedical-knowledge-mining-book/

If you use clusterProfiler in published research, please cite:
T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu. clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. The Innovation. 2021, 2(3):100141



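What we need for GO analysis:
- a named numeric vector of scores for the DMGs. topGO examples usually use p-values as the score; this notebook uses log2FoldChange instead, which changes what the selection threshold in `topDiffGenes` means.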

In [2]:
# load significant genes: phase 1 hypoxic vs. control, given phase 2 hypoxic
data <- read.csv('/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/significant_genes/sig_p2h_p1hc_genes.csv')

# select needed columns (might really only need l2fc)
data2 <- select(data, Row.names, log2FoldChange, pvalue, padj)

# renaming columns so they make more sense
colnames(data2) <- c('gene', 'l2fc', 'pval', 'padj')
head(data2)

| | gene | l2fc | pval | padj |
|---|---|---|---|---|
| | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | LOC111100393 | 0.6624208 | 0.0001535024 | 0.0330065 |
| 2 | LOC111105858 | -0.5596059 | 7.819983e-05 | 0.02180038 |
| 3 | LOC111106800 | -1.0316182 | 5.504449e-05 | 0.01726333 |
| 4 | LOC111113273 | -0.8104052 | 0.0002376525 | 0.04259072 |
| 5 | LOC111113309 | 0.6891998 | 0.0001578629 | 0.0330065 |
| 6 | LOC111117059 | -1.2192004 | 8.515913e-12 | 2.136643e-08 |


In [3]:
# creating a named numeric vector: values are log2FoldChange, names are gene IDs
geneList <- data2$l2fc
names(geneList) <- data2$gene

# double checking things look right
head(geneList)
class(geneList) # numeric, used in allGenes for topGO object

In [4]:
# loading conversion df of unique genes with associated GO ids
geneID2GO <- read.csv('/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/GO_enrichment_analysis/geneID2GO.txt', sep='\t')

# renaming columns
colnames(geneID2GO) = c('gene','GO_id')

# checking things make sense
head(geneID2GO)
dim(geneID2GO) # have 22,654 unique genes that have GO annotations

| | gene | GO_id |
|---|---|---|
| | `<chr>` | `<chr>` |
| 1 | LOC111133408 | GO:2001070 |
| 2 | LOC111121603 | GO:2000781,GO:2000781 |
| 3 | LOC111132389 | GO:2000145 |
| 4 | LOC111115105 | GO:1990904,GO:1990904 |
| 5 | LOC111129853 | GO:1990904,GO:1990904 |
| 6 | LOC111101512 | GO:1990904 |


In [5]:
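# re-read the same file as a named list (gene -> vector of GO ids), the format topGO expects
# note: this overwrites the geneID2GO data frame loaded in the previous cell
geneID2GO <- readMappings(file = '/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/GO_enrichment_analysis/geneID2GO.txt')
geneID2GO <- geneID2GO[-1] # drops the first element, which is the file's header row read as a gene
head(geneID2GO)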

In [6]:
geneNames <- names(geneID2GO)
head(geneNames)
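
To make the structure returned by `readMappings` more concrete, here is a minimal base-R sketch that builds the same kind of named list from three rows of the `head(geneID2GO)` output above. `parse_gene2go` is a hypothetical helper written for illustration, not part of topGO, and removing repeated GO ids with `unique()` is an added assumption about what is wanted.

```r
# Hypothetical helper (not part of topGO): turn "gene<TAB>GO1,GO2" lines
# into a named list of GO id vectors, the shape passed to gene2GO.
parse_gene2go <- function(lines) {
  fields <- strsplit(lines, "\t", fixed = TRUE)
  genes  <- vapply(fields, `[`, character(1), 1)
  go_ids <- lapply(fields, function(f) {
    unique(trimws(strsplit(f[2], ",", fixed = TRUE)[[1]]))
  })
  names(go_ids) <- genes
  go_ids
}

# Three rows copied from the head(geneID2GO) output above
example_lines <- c("LOC111133408\tGO:2001070",
                   "LOC111121603\tGO:2000781,GO:2000781",
                   "LOC111101512\tGO:1990904")
gene2go_example <- parse_gene2go(example_lines)
str(gene2go_example)
```

Each list element is named by gene and holds that gene's GO ids, which is why `names(geneID2GO)` in the cell above returns the annotated gene names.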

In [7]:
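# geneSel function: topGO applies this to the scores passed in allGenes
# NOTE: geneList holds log2FoldChange, not p-values, so this flags every gene
# with l2fc < 0.01 (in practice the genes with negative l2fc), not genes with p < 0.01
topDiffGenes <- function(allScore) {
    return(allScore < 0.01)
}

x <- topDiffGenes(geneList)
sum(x) # the number of selected genes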
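
Because the threshold is applied to log2FoldChange, it may help to compare it with a p-value-based selection. The sketch below uses the six rows shown in the `head(data2)` output; `toy`, `pvalList` and `topDiffGenesP` are illustrative names that do not exist in the notebook, and reusing the 0.01 cutoff is an assumption.

```r
# Six rows copied from the head(data2) output above
toy <- data.frame(
  gene = c("LOC111100393", "LOC111105858", "LOC111106800",
           "LOC111113273", "LOC111113309", "LOC111117059"),
  l2fc = c(0.6624208, -0.5596059, -1.0316182, -0.8104052, 0.6891998, -1.2192004),
  pval = c(0.0001535024, 7.819983e-05, 5.504449e-05,
           0.0002376525, 0.0001578629, 8.515913e-12)
)

# Selection on log2FoldChange (what the cell above does): picks negative l2fc
sel_l2fc <- setNames(toy$l2fc, toy$gene) < 0.01

# Selection on p-values (illustrative alternative): picks p < 0.01
pvalList <- setNames(toy$pval, toy$gene)
topDiffGenesP <- function(allScore) allScore < 0.01
sel_pval <- topDiffGenesP(pvalList)

sum(sel_l2fc)  # number of genes with l2fc below 0.01
sum(sel_pval)  # number of genes with p-value below 0.01
```

Since the input file contains only genes that were already significant, a p-value cutoff selects all of them here. A GO enrichment test is usually more informative when the scored gene set also includes non-significant genes.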

#### GO analysis for molecular function

In [9]:
# creating GO data object
GOdata_MF <- new("topGOdata", 
              description = 'DMGs in phase 1 hypoxic vs. control, phase 2 hypoxic',
              ontology = "MF", 
              allGenes = geneList,
              geneSel = topDiffGenes,
              annot = annFUN.gene2GO, 
              gene2GO = geneID2GO)
GOdata_MF


Building most specific GOs .....

	( 11 GO terms found. )


Build GO DAG topology ..........

	( 45 GO terms and 56 relations. )


Annotating nodes ...............

	( 7 genes annotated to the GO terms. )




------------------------- topGOdata object -------------------------

 Description:
   -  DMGs in phase 1 hypoxic vs. control, phase 2 hypoxic 

 Ontology:
   -  MF 

 14 available genes (all genes from the array):
   - symbol:  LOC111100393 LOC111105858 LOC111106800 LOC111113273 LOC111113309  ...
   - score :  0.6624208264 -0.5596058527 -1.031618178 -0.8104052125 0.6891998158  ...
   - 12  significant genes. 

 7 feasible genes (genes that can be used in the analysis):
   - symbol:  LOC111100393 LOC111105858 LOC111106800 LOC111113309 LOC111122794  ...
   - score :  0.6624208264 -0.5596058527 -1.031618178 0.6891998158 -1.148034325  ...
   - 5  significant genes. 

 GO graph (nodes with at least  1  genes):
   - a graph with directed edges
   - number of nodes = 45 
   - number of edges = 56 

------------------------- topGOdata object -------------------------


In [10]:
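# Kolmogorov-Smirnov (KS) test with the weight01 algorithm
# genes are ranked by their score in geneList (log2FoldChange here), in increasing order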
resultKS_MF <- runTest(GOdata_MF, algorithm = "weight01", statistic = "ks")

# putting result into readable table
tab_MF <- GenTable(GOdata_MF, raw.p.value = resultKS_MF, topNodes = length(resultKS_MF@score), numChar = 120)

# showing top 10 GO term results
head(tab_MF, 10)


			 -- Weight01 Algorithm -- 

		 the algorithm is scoring 45 nontrivial nodes
		 parameters: 
			 test statistic: ks
			 score order: increasing


	 Level 9:	1 nodes to be scored	(0 eliminated genes)


	 Level 8:	1 nodes to be scored	(0 eliminated genes)


	 Level 7:	3 nodes to be scored	(1 eliminated genes)


	 Level 6:	6 nodes to be scored	(1 eliminated genes)


	 Level 5:	10 nodes to be scored	(2 eliminated genes)


	 Level 4:	8 nodes to be scored	(4 eliminated genes)


	 Level 3:	12 nodes to be scored	(5 eliminated genes)


	 Level 2:	3 nodes to be scored	(5 eliminated genes)


	 Level 1:	1 nodes to be scored	(7 eliminated genes)



| | GO.ID | Term | Annotated | Significant | Expected | raw.p.value |
|---|---|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<int>` | `<int>` | `<dbl>` | `<chr>` |
| 1 | GO:0016491 | oxidoreductase activity | 1 | 1 | 0.71 | 0.14 |
| 2 | GO:0005524 | ATP binding | 1 | 1 | 0.71 | 0.29 |
| 3 | GO:0035091 | phosphatidylinositol binding | 1 | 1 | 0.71 | 0.29 |
| 4 | GO:0004672 | protein kinase activity | 1 | 1 | 0.71 | 0.29 |
| 5 | GO:0003779 | actin binding | 1 | 1 | 0.71 | 0.29 |
| 6 | GO:0022857 | transmembrane transporter activity | 1 | 1 | 0.71 | 0.43 |
| 7 | GO:0003677 | DNA binding | 1 | 1 | 0.71 | 0.57 |
| 8 | GO:0008270 | zinc ion binding | 1 | 1 | 0.71 | 0.71 |
| 9 | GO:0004222 | metalloendopeptidase activity | 1 | 0 | 0.71 | 0.86 |
| 10 | GO:0032559 | adenyl ribonucleotide binding | 1 | 1 | 0.71 | 1.0 |
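
The `raw.p.value` column is stored as character, so it has to be converted before it can be filtered or adjusted. The sketch below does that for the ten values in the molecular function table and applies a Benjamini-Hochberg adjustment with base R. Adjusting only the ten displayed terms is a simplification made for illustration. P-values from the weight01 algorithm are also not independent across GO terms, so treat the adjusted values as a rough check.

```r
# raw.p.value column copied from the molecular function table above;
# GenTable returns it as character, so convert before doing arithmetic
raw_p_chr <- c("0.14", "0.29", "0.29", "0.29", "0.29",
               "0.43", "0.57", "0.71", "0.86", "1.0")
raw_p <- as.numeric(raw_p_chr)

# Benjamini-Hochberg adjustment over the ten terms shown
adj_p <- p.adjust(raw_p, method = "BH")

# Number of terms passing a 0.05 cutoff, before and after adjustment
sum(raw_p < 0.05)
sum(adj_p < 0.05)
```

No term passes the cutoff either way, which is not surprising given that only 7 genes could be used for this ontology.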


#### GO analysis for cellular component

In [11]:
# creating GO data object
GOdata_CC <- new("topGOdata", 
              description = 'DMGs in phase 1 hypoxic vs. control, phase 2 hypoxic',
              ontology = "CC", 
              allGenes = geneList,
              geneSel = topDiffGenes,
              annot = annFUN.gene2GO, 
              gene2GO = geneID2GO)
GOdata_CC


Building most specific GOs .....

	( 6 GO terms found. )


Build GO DAG topology ..........

	( 21 GO terms and 31 relations. )


Annotating nodes ...............

	( 6 genes annotated to the GO terms. )




------------------------- topGOdata object -------------------------

 Description:
   -  DMGs in phase 1 hypoxic vs. control, phase 2 hypoxic 

 Ontology:
   -  CC 

 14 available genes (all genes from the array):
   - symbol:  LOC111100393 LOC111105858 LOC111106800 LOC111113273 LOC111113309  ...
   - score :  0.6624208264 -0.5596058527 -1.031618178 -0.8104052125 0.6891998158  ...
   - 12  significant genes. 

 6 feasible genes (genes that can be used in the analysis):
   - symbol:  LOC111100393 LOC111117059 LOC111122649 LOC111124218 LOC111129021  ...
   - score :  0.6624208264 -1.219200435 -0.7448971595 -0.7301369327 -0.7095935444  ...
   - 5  significant genes. 

 GO graph (nodes with at least  1  genes):
   - a graph with directed edges
   - number of nodes = 21 
   - number of edges = 31 

------------------------- topGOdata object -------------------------


In [12]:
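# Kolmogorov-Smirnov (KS) test with the weight01 algorithm
# genes are ranked by their score in geneList (log2FoldChange here), in increasing order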
resultKS_CC <- runTest(GOdata_CC, algorithm = "weight01", statistic = "ks")

# putting result into readable table
tab_CC <- GenTable(GOdata_CC, raw.p.value = resultKS_CC, topNodes = length(resultKS_CC@score), numChar = 120)

# showing top 10 GO term results
head(tab_CC, 10)


			 -- Weight01 Algorithm -- 

		 the algorithm is scoring 21 nontrivial nodes
		 parameters: 
			 test statistic: ks
			 score order: increasing


	 Level 8:	2 nodes to be scored	(0 eliminated genes)


	 Level 7:	1 nodes to be scored	(0 eliminated genes)


	 Level 6:	3 nodes to be scored	(1 eliminated genes)


	 Level 5:	3 nodes to be scored	(1 eliminated genes)


	 Level 4:	3 nodes to be scored	(3 eliminated genes)


	 Level 3:	6 nodes to be scored	(3 eliminated genes)


	 Level 2:	2 nodes to be scored	(3 eliminated genes)


	 Level 1:	1 nodes to be scored	(6 eliminated genes)



| | GO.ID | Term | Annotated | Significant | Expected | raw.p.value |
|---|---|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<int>` | `<int>` | `<dbl>` | `<chr>` |
| 1 | GO:0005829 | cytosol | 2 | 2 | 1.67 | 0.2 |
| 2 | GO:0000938 | GARP complex | 1 | 1 | 0.83 | 0.33 |
| 3 | GO:0005768 | endosome | 1 | 1 | 0.83 | 0.33 |
| 4 | GO:0005576 | extracellular region | 1 | 1 | 0.83 | 0.5 |
| 5 | GO:0016020 | membrane | 3 | 2 | 2.5 | 0.75 |
| 6 | GO:0005634 | nucleus | 2 | 2 | 1.67 | 0.8 |
| 7 | GO:0005622 | intracellular anatomical structure | 3 | 3 | 2.5 | 1.0 |
| 8 | GO:0031982 | vesicle | 1 | 1 | 0.83 | 1.0 |
| 9 | GO:0110165 | cellular anatomical entity | 6 | 5 | 5.0 | 1.0 |
| 10 | GO:0031410 | cytoplasmic vesicle | 1 | 1 | 0.83 | 1.0 |


#### GO analysis for biological process

In [13]:
# creating GO data object
GOdata_BP <- new("topGOdata", 
              description = 'DMGs in phase 1 hypoxic vs. control, phase 2 hypoxic',
              ontology = "BP", 
              allGenes = geneList,
              geneSel = topDiffGenes,
              annot = annFUN.gene2GO, 
              gene2GO = geneID2GO)
GOdata_BP


Building most specific GOs .....

	( 8 GO terms found. )


Build GO DAG topology ..........

	( 44 GO terms and 60 relations. )


Annotating nodes ...............

	( 4 genes annotated to the GO terms. )




------------------------- topGOdata object -------------------------

 Description:
   -  DMGs in phase 1 hypoxic vs. control, phase 2 hypoxic 

 Ontology:
   -  BP 

 14 available genes (all genes from the array):
   - symbol:  LOC111100393 LOC111105858 LOC111106800 LOC111113273 LOC111113309  ...
   - score :  0.6624208264 -0.5596058527 -1.031618178 -0.8104052125 0.6891998158  ...
   - 12  significant genes. 

 4 feasible genes (genes that can be used in the analysis):
   - symbol:  LOC111100393 LOC111106800 LOC111122649 LOC111133241  ...
   - score :  0.6624208264 -1.031618178 -0.7448971595 -1.087243109  ...
   - 3  significant genes. 

 GO graph (nodes with at least  1  genes):
   - a graph with directed edges
   - number of nodes = 44 
   - number of edges = 60 

------------------------- topGOdata object -------------------------


In [14]:
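# Kolmogorov-Smirnov (KS) test with the weight01 algorithm
# genes are ranked by their score in geneList (log2FoldChange here), in increasing order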
resultKS_BP <- runTest(GOdata_BP, algorithm = "weight01", statistic = "ks")

# putting result into readable table
tab_BP <- GenTable(GOdata_BP, raw.p.value = resultKS_BP, topNodes = length(resultKS_BP@score), numChar = 120)

# showing top 10 GO term results
head(tab_BP, 10)


			 -- Weight01 Algorithm -- 

		 the algorithm is scoring 44 nontrivial nodes
		 parameters: 
			 test statistic: ks
			 score order: increasing


	 Level 7:	3 nodes to be scored	(0 eliminated genes)


	 Level 6:	8 nodes to be scored	(0 eliminated genes)


	 Level 5:	11 nodes to be scored	(2 eliminated genes)


	 Level 4:	9 nodes to be scored	(4 eliminated genes)


	 Level 3:	9 nodes to be scored	(4 eliminated genes)


	 Level 2:	3 nodes to be scored	(4 eliminated genes)


	 Level 1:	1 nodes to be scored	(4 eliminated genes)



| | GO.ID | Term | Annotated | Significant | Expected | raw.p.value |
|---|---|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<int>` | `<int>` | `<dbl>` | `<chr>` |
| 1 | GO:0015031 | protein transport | 1 | 1 | 0.75 | 0.25 |
| 2 | GO:0042147 | retrograde transport, endosome to Golgi | 1 | 1 | 0.75 | 0.25 |
| 3 | GO:0006869 | lipid transport | 1 | 1 | 0.75 | 0.25 |
| 4 | GO:0007030 | Golgi organization | 1 | 1 | 0.75 | 0.25 |
| 5 | GO:0016310 | phosphorylation | 1 | 1 | 0.75 | 0.5 |
| 6 | GO:0006915 | apoptotic process | 1 | 1 | 0.75 | 0.75 |
| 7 | GO:0006325 | chromatin organization | 1 | 1 | 0.75 | 0.75 |
| 8 | GO:0043170 | macromolecule metabolic process | 1 | 0 | 0.75 | 1.0 |
| 9 | GO:0010876 | lipid localization | 1 | 1 | 0.75 | 1.0 |
| 10 | GO:0071702 | organic substance transport | 1 | 1 | 0.75 | 1.0 |


## Gene Set Enrichment Analysis with clusterProfiler
Looking for enriched KEGG pathways using a gene list ranked by log2FoldChange.

In [15]:
# already have a df with DMGs and scores - need just gene and l2fc
df <- select(data2, gene, l2fc)
head(df)
dim(df) # expect 14 rows, matching the 14 genes topGO reports for geneList above

| | gene | l2fc |
|---|---|---|
| | `<chr>` | `<dbl>` |
| 1 | LOC111100393 | 0.6624208 |
| 2 | LOC111105858 | -0.5596059 |
| 3 | LOC111106800 | -1.0316182 |
| 4 | LOC111113273 | -0.8104052 |
| 5 | LOC111113309 | 0.6891998 |
| 6 | LOC111117059 | -1.2192004 |


In [16]:
# need to have conversion table for gene name to entrez id
# obtained from DAVID gene accession conversion tool
david_df <- read.csv('/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/KEGG_pathway/p2h_p1hc_DAVID.txt', sep='\t')
# only selecting columns that I need
david_df <- select(david_df, From, To)
# renaming columns for merge
colnames(david_df) = c('gene', 'entrez_ID')
head(david_df)
dim(david_df)

| | gene | entrez_ID |
|---|---|---|
| | `<chr>` | `<int>` |
| 1 | LOC111135443 | 111135443 |
| 2 | LOC111129382 | 111129382 |
| 3 | LOC111134678 | 111134678 |
| 4 | LOC111100393 | 111100393 |
| 5 | LOC111129021 | 111129021 |
| 6 | LOC111113309 | 111113309 |


In [17]:
# matching up dataframes so each Entrez ID has a log2FoldChange value
# (result is named `merged` so the base merge() function is not shadowed)
merged <- merge(david_df, df, by = 'gene', all = TRUE)

# grabbing just the entrez_ID and l2fc value
merge_df <- select(merged, entrez_ID, l2fc)
head(merge_df)

| | entrez_ID | l2fc |
|---|---|---|
| | `<int>` | `<dbl>` |
| 1 | 111100393 | 0.6624208 |
| 2 | 111105858 | -0.5596059 |
| 3 | 111106800 | -1.0316182 |
| 4 | 111113273 | -0.8104052 |
| 5 | 111113309 | 0.6891998 |
| 6 | 111117059 | -1.2192004 |


In [18]:
# checking that every Entrez ID is unique
length(unique(merge_df$entrez_ID))
length(merge_df$entrez_ID)
# the two counts should be equal; note that unique() counts all NA ids as one value
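
Because the merge uses `all = TRUE`, any gene missing from the DAVID table ends up with an `NA` Entrez ID, and those `NA`s can distort the uniqueness check. The sketch below reproduces that on toy tables. The ids come from the outputs above, while the missing DAVID row for `LOC111105858` is an assumption made only for illustration.

```r
# Toy tables; the absent DAVID row for LOC111105858 is an illustrative assumption
df_toy    <- data.frame(gene = c("LOC111100393", "LOC111105858", "LOC111113309"),
                        l2fc = c(0.6624208, -0.5596059, 0.6891998))
david_toy <- data.frame(gene = c("LOC111100393", "LOC111113309"),
                        entrez_ID = c(111100393L, 111113309L))

# Full outer join, as in the merge() call above
full_join_toy <- merge(david_toy, df_toy, by = "gene", all = TRUE)

# Genes that did not receive an Entrez ID
unmapped <- as.character(full_join_toy$gene[is.na(full_join_toy$entrez_ID)])

# Duplicate check that ignores the NA entries
n_dup <- sum(duplicated(na.omit(full_join_toy$entrez_ID)))

unmapped
n_dup
```

Listing the unmapped genes explicitly makes it easier to see how many DMGs are lost before the KEGG step.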

In [22]:
# Create the ranked gene list for GSEA (log2FoldChange values)
kegg_gene_list <- merge_df$l2fc

# Name the vector with Entrez ids
names(kegg_gene_list) <- merge_df$entrez_ID

# omit entries with a missing l2fc value or a missing Entrez id
kegg_gene_list <- kegg_gene_list[!is.na(kegg_gene_list) & !is.na(names(kegg_gene_list))]

# sort the list in decreasing order (required for clusterProfiler)
kegg_gene_list <- sort(kegg_gene_list, decreasing = TRUE)

head(kegg_gene_list)
class(kegg_gene_list) # numeric
length(kegg_gene_list) # number of genes with both an Entrez id and an l2fc value
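
One detail of this step is easy to miss: `na.omit()` on a named vector removes entries whose value is `NA`, but keeps entries whose name is `NA`. The sketch below shows the difference on a toy table; the `NA` entries are assumptions added for illustration.

```r
# Toy merged table; the NA entries are illustrative assumptions
merge_toy <- data.frame(
  entrez_ID = c(111100393L, 111105858L, NA, 111113309L),
  l2fc      = c(0.6624208, -0.5596059, -1.0316182, NA)
)

ranked <- merge_toy$l2fc
names(ranked) <- merge_toy$entrez_ID

# na.omit() alone drops the NA value but keeps the entry whose name is NA
after_na_omit <- na.omit(ranked)
anyNA(names(after_na_omit))

# Drop entries missing either the value or the name, then sort descending
keep   <- !is.na(ranked) & !is.na(names(ranked))
ranked <- sort(ranked[keep], decreasing = TRUE)
ranked
```

A ranked list built this way contains only entries that `gseKEGG` can attempt to map.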

In [23]:
kegg_organism <- "cvn" # KEGG organism code for Crassostrea virginica
kk2 <- gseKEGG(geneList      = kegg_gene_list,
               organism      = kegg_organism,
               nPerm         = 10000,
               minGSSize     = 1,
               maxGSSize     = 800,
               pvalueCutoff  = 1, # 1 reports every tested pathway; use 0.05 to keep only significant ones
               pAdjustMethod = "BH", # Benjamini-Hochberg FDR (false discovery rate)
               scoreType     = "pos", # one-tailed test for positive enrichment, although the list also has negative l2fc
               keyType       = "kegg")

preparing geneSet collections...

--> Expected input gene ID: 111123515,111123467,111105314,111133902,111112920,111103956



ERROR: Error in check_gene_id(geneList, geneSets): --> No gene can be mapped....
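
The error means that none of the names in `kegg_gene_list` matched a gene in any `cvn` pathway gene set. The expected ids have the same numeric format as the Entrez ids used here, so one plausible explanation (an assumption, not verified) is that none of the few DMGs in this comparison is annotated to a KEGG pathway. The sketch below checks the overlap directly. `kegg_overlap` is a hypothetical helper, and the `cvn:` prefix on the toy KEGG ids is added for illustration.

```r
# Hypothetical helper: which of our ids appear among the KEGG gene ids?
# KEGG-style ids may carry an organism prefix such as "cvn:", which is stripped.
kegg_overlap <- function(our_ids, kegg_ids, organism = "cvn") {
  kegg_clean <- sub(paste0("^", organism, ":"), "", kegg_ids)
  intersect(as.character(our_ids), unique(kegg_clean))
}

# Toy check: our ids from the merge_df output above, KEGG ids from the
# "Expected input gene ID" message above
our_ids  <- c(111100393L, 111105858L, 111106800L)
kegg_ids <- c("cvn:111123515", "cvn:111123467", "cvn:111105314")
kegg_overlap(our_ids, kegg_ids)  # character(0): nothing in common

# With network access, the pathway-annotated genes could be fetched with
# KEGGREST (already loaded in this notebook); left disabled in this sketch.
run_online <- FALSE
if (run_online) {
  gene2path <- KEGGREST::keggLink("pathway", "cvn")
  kegg_overlap(names(kegg_gene_list), names(gene2path))
}
```

If the overlap against the full KEGG gene set is empty, no choice of `gseKEGG` parameters will produce a result for this gene list.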


In [21]:
# requires kk2 from the gseKEGG cell; that call failed, so this cell fails too
kk2_df <- as.data.frame(kk2)
# drop the organism suffix (everything from " -" onward) from the pathway names
kk2_df$Description <- sub(" -.*", "", kk2_df$Description)
head(kk2_df)

ERROR: Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': object 'kk2' not found
