- Quick start
- Introduction
- How to install
- Inputs
- Parameters
- Data preparation for running coGSEA
- Output
coGSEA(ElistObject = elist, contrastMatrix = contrast, ENTREZGenesId = elist$genes$ENTREZ, geneSetCollection = "H", specie = "Mus musculus", directoryPath = "/path/to/existing/dir")
coGSEA (comparative Gene Set Enrichment Analysis) is a R package developped to help you compare and combine up to 12 different methods of Gene Set Enrichment Analysis. In the current version, those 12 methods include :
- camera
- gage Results below generated with gage
2.24.0
Install manually from tar.gz archive provided here in dependancies - globaltest
- gsva
- ssgsea
- zscore
- plage
- ora
- padog
- roast
- safe
- setrank
Disclaimer : This tool is largely inspired by the eGSEA R package (and contains some of its code).
A longer and more detailed description and test of coGSEA can be found in the doc directory
devtools::install_github("maxibor/coGSEA",build_vignettes = TRUE)
-
ElistObject : a Elist object. Your Elist object must be either a limma-voom object in case of RNAseq data, or have a estimated dispersion in case of MicroArray data. Dispersion estimates can be performed using the
estimateDisp()
function from the edgeR package. -
contrastMatrix : the contrast matrix object for your experiment. For some help to make a contrast matrix, you can check the vignette of the limma package.
-
ENTREZGenesIds : the Elist object containing the ENTREZ gene identifiers. For example
elist$genes$ENTREZ
-
geneSetCollection : A geneset collection from the MSigDB database. Currently only three are supported :
H
(Hallmark) and two subset ofC2_KEGG
andC2_REACTOME
(Kegg and Reactome). You can also use your own geneset collection. In this case, you need to provide an list object with genesets and genes. Such an example file can be found here -
Specie : The organism from which the data were extracted. Currently only
Homo sapiens
andMus musculus
are supported. Default :Mus musculus
-
directoryPath : a path to an existing directory where coGSEA results and plots will be saved. Default :
"./"
Example :"~/coGSEA_results"
-
alpha : Alpha p value threshold. Default :
0.05
example :0.05
-
pvalAdjMethod : p value adjustment method for multiple testing. Default :
BH
.
To select among the following methods :holm
hochberg
hommel
bonferroni
BH
(Benjamini Hochberg)BY
(Benjamini Yekutieli)fdr
(False Discovery rate)none
-
pvalCombMethod : the method to combine all the p.values of one geneset accros the different GSEA methods. Default :
sumlog
To select among the following methods :sumz
(sum z method)votep
(vote counting method)minimump
(Wilkinson's method)sumlog
(Fisher's method) - selected by defaultsump
(sum p method)logitp
(logit method)meanp
(mean p method)maximump
(Wilkinson's method)
-
min.intersection.size : Graphical Parameter to select the minimum size of an intersection of genesets between different methods. Default :
1
Example :2
-
GSEA.Methods : between 1 and 12 methods to select from the methods listed in the introduction. Default :
c("camera", "gage","globaltest", "gsva", "ssgsea", "zscore", "plage", "ora", "padog", "roast","safe")
Example :c("camera", "gage","globaltest")
-
num.workers : number of thread for multithreading to decrease execution time. Default :
4
Example :4
-
shinyMode : Boolean value. Shouldn't be changed. Only used when running coGSEA is running in the background of a shiny application. Default :
FALSE
This example dataset is from a article about Astrocymotas tumors published in 2011 by Liu et al.
You can download it on the Gene Expression Omnibus database with the accession number GSE19728
library(affy)
library(hgu133plus2.db)
library(edgeR)
library(gage)
library(coGSEA)
setwd("~/GitLab/GSE19728/data/")
celfiles = ReadAffy()
celfiles = rma(celfiles)
intensity = exprs(celfiles)
intensity = cbind(rownames(intensity), intensity)
colnames(intensity)[1] = "PROBEID"
intensity = as.data.frame(intensity)
intensity$PROBEID = as.character(intensity$PROBEID)
annots = select(hgu133plus2.db, intensity$PROBEID, "ENTREZID", "PROBEID")
res = merge(intensity, annots, by= "PROBEID")
resmin = res[,2:ncol(res)]
cname = colnames(resmin)
resmin = apply(resmin, 2, as.numeric)
colnames(resmin)= cname
resmin = as.data.frame(resmin)
result = aggregate(. ~ ENTREZID, resmin, mean)
result$ENTREZID = levels(as.factor(as.character(resmin$ENTREZID)))
rownames(result) = result$ENTREZID
result = result[,-1]
boxplot(result, las = 2)
colnames(result) = gsub(".CEL","", colnames(result))
print(colnames(result))
result2 = cbind(result$GSM492649_Astrocytomas_N, result$GSM525014, result$GSM525015, result$GSM525016, result$`GSM492662_Astrocytomas_T4-1`, result$`GSM492663_Astrocytomas_T4-2` , result$`GSM492664_Astrocytomas_T4-3`, result$`GSM492665_Astrocytomas_T4-4`, result$`GSM492666_Astrocytomas_T4-5`)
colnames(result2) = c("GSM492649_Astrocytomas_N", "GSM525014", "GSM525015", "GSM525016","GSM492662_Astrocytomas_T4-1", "GSM492663_Astrocytomas_T4-2", "GSM492664_Astrocytomas_T4-3", "GSM492665_Astrocytomas_T4-4", "GSM492666_Astrocytomas_T4-5" )
rownames(result2) = rownames(result)
Normal = c(rep(1,4),rep(0,5))
Tumor = c(rep(0,4),rep(1,5))
design = cbind(Normal, Tumor)
rownames(design) = colnames(result2)
contr.matrix = makeContrasts(NormalVSTumor = Normal - Tumor, levels = design)
temp = new("EList")
temp$design = design
temp$E = as.matrix(result2)
rownames(temp$E) = as.numeric(rownames(temp$E))
temp$genes$ENTREZ = rownames(result2)
temp$common.dispersion = estimateDisp(temp$E, design = temp$design)$common.dispersion
temp$samples = colnames(result2)
gs = gage::kegg.gsets(species = "hsa", id.type = "entrez")
geneset = gs$kg.sets
Some GSEA methods do not work properly with exotic gene names, so we need to simplify them
nameshorter = function(names){
namemod = c()
for (i in seq(1,length(names))){
namemod[i] = paste(strsplit(names[i], " ")[[1]][-1], sep = "", collapse = " ")
namemod[i] = gsub("/","", names[i])
namemod[i] = gsub(" ","_", names[i])
}
return(namemod)
}
names(geneset) = nameshorter(names(geneset))
names(geneset) = gsub("/","_",names(geneset))
saveRDS(contr.matrix, "~/GitLab/GSE19728/contrast.rds")
saveRDS(temp, "~/GitLab/GSE19728/elist.rds")
saveRDS(geneset, "~/GitLab/GSE19728/geneset.rds")
elist = readRDS("~/GitLab/GSE19728/elist.rds")
contrast = readRDS("~/GitLab/GSE19728/contrast.rds")
geneset = readRDS("~/GitLab/GSE19728/geneset.rds")
coGSEA(ElistObject = elist, contrastMatrix = contrast, ENTREZGenesIds = elist$genes$ENTREZ, geneSetCollection = geneset,specie = "Homo sapiens", directoryPath = "~/GitLab/GSE19728/results", alpha = 0.05, pvalAdjMethod = "BH", pvalCombMethod = "sumlog",min.intersection.size = 1, GSEA.Methods = c("camera", "gage","globaltest", "gsva", "ssgsea", "zscore", "ora", "padog", "roast","safe"), num.workers = 4, shinyMode = FALSE)
You can download this example dataset on the Gene Expression Omnibus database with the accession number GSE63310.
This dataset was analyzed in a very detailed article on how to do differential expression analysis that we strongly advise you to read.
The file you're looking for is : GSE63310_RAW.tar
library(edgeR)
library(limma)
library(Mus.musculus)
library(coGSEA)
setwd("~/GitLab/GSE63310/data/")
files <- c(
"GSM1545535_10_6_5_11.txt",
"GSM1545536_9_6_5_11.txt",
"GSM1545538_purep53.txt",
"GSM1545539_JMS8-2.txt",
"GSM1545540_JMS8-3.txt",
"GSM1545541_JMS8-4.txt",
"GSM1545542_JMS8-5.txt",
"GSM1545544_JMS9-P7c.txt",
"GSM1545545_JMS9-P8c.txt")
x <- readDGE(files, columns=c(1,3))
colnames(x) = substring(colnames(x), 12, nchar(colnames(x)))
group <- as.factor(c("LP", "ML", "Basal", "Basal", "ML", "LP", "Basal", "ML", "LP"))
x$samples$group <- group
lane <- as.factor(rep(c("L004","L006","L008"), c(3,4,2)))
x$samples$lane <- lane
x$samples
geneid <- rownames(x)
genes <- select(Mus.musculus, keys=geneid, columns=c("SYMBOL", "TXCHROM"),keytype="ENTREZID")
dim(genes)
head(genes)
genes <- genes[!duplicated(genes$ENTREZID),]
cpm <- cpm(x)
#Removing genes lowly expressed (genes with 0 expression across all samples)
table(rowSums(x$counts==0)==9)
keep.exprs <- rowSums(cpm>1)>=3
x <- x[keep.exprs,, keep.lib.sizes=FALSE]
dim(x)
x <- calcNormFactors(x, method = "TMM")
x$samples$norm.factors
design <- model.matrix( ~ 0 + group + lane)
colnames(design) <- gsub("group", "", colnames(design))
rownames(design) = colnames(x)
contr.matrix <- makeContrasts(
BasalvsLP = Basal-LP,
BasalvsML = Basal - ML,
LPvsML = LP - ML,
levels = colnames(design))
contr.matrix
v <- voom(x, design, plot=F)
v$genes$ENTREZ = rownames(v$E)
saveRDS(v, "~/GitLab/GSE63310/elist.rds")
saveRDS(contr.matrix, "~/GitLab/GSE63310/contrast.rds")
elist = readRDS("~/GitLab/GSE63310/elist.rds")
contrast = readRDS("~/GitLab/GSE63310/contrast.rds")
coGSEA(ElistObject = elist, contrastMatrix = contrast, ENTREZGenesIds = elist$genes$ENTREZ, geneSetCollection = "C2_KEGG",specie = "Mus musculus", directoryPath = "~/GitLab/GSE63310/results", alpha = 0.05, pvalAdjMethod = "BH", pvalCombMethod = "sumlog",min.intersection.size = 1, GSEA.Methods = c("camera", "gage","globaltest", "gsva", "ssgsea", "zscore", "ora", "padog", "roast","safe"), num.workers = 4, shinyMode = FALSE)
Files
-
abbreviations_[condition_name].csv
: gene-sets abbreviations mapping to gene-sets accession numbers and names. -
geneset_collection_genes.csv
: genes of the dataset present in the gene-sets -
result_[condition_name].csv
: result table with indivudual methods ranking, p.value, adjusted p.value. Average ranks, Average logFC, and combined p.value
Plots
-
coGSEA_methods_runtime.pdf
: barplot of the runtime of the different GSE methods (in seconds) -
[condition_name]_clustering.pdf
: Clustering of the GSE methods on the gene-sets ranks (bases on raw p.values). -
[condition_name]_correlation.pdf
: Correlation Plot of the GSEA methods on the gene-sets ranks (bases on raw p.values). -
[condition_name]_pca.pdf
: PCA of the GSEA methods on the gene-sets ranks (bases on raw p.values). -
[condition_name]_eigen_fall.pdf
: Fall of the eigen values of the PCA. -
[condition_name]_snailplot.pdf
: Mset Ensemblist plot showing the intersection of the different gene-sets found significantly enriched (p.value < alpha) by each GSEA method. Methods retrieving 0 significantly enriched gene-sets are not included. -
[condition_name]_upsetr.pdf
: UpsetR Ensemblist plot showing the intersection of the different gene-sets found significantly enriched (p.value < alpha) by each GSEA method. Methods retrieving 0 significantly enriched gene-sets are not included. Only exclusive intersections are taken into account. -
[condition_name]_heatmap.pdf
: Binary heatmap showing whether a gene-set is found differentially (red) expressed (p.value < alpha) by a GSEA method or not (blue). Only the gene-sets belonging the biggest intersections size (n), and the second biggest intersections size (n-1) are included. -
[condition_name]_advanced_sumplot_rank.pdf
: Summary plot combining both the logFC on the y-axis and the -log10(p.value) on the x-axis. The size of the bubbles indicates the size of the gene-set (number of genes) while the color indicates the average rank of the gene-set. -
[condition_name]_advanced_sumplot_direction.pdf
: Summary plot combining both the logFC on the y-axis and the -log10(p.value) on the x-axis. The size of the bubbles indicates the significance (combination of p.value and logFC) while the color indicates the direction of logFC. -
[condition_name]_rank_pval_corplot.pdf
: Scatterplot of combined p.values and average ranks with -log10(combined adjusted p.value) on x-axis, and average rank on the y-axis. Sperman correlation coefficient shown in the title.