Table of Content
- Quick start
- Introduction
- How to install
- Inputs
- Parameters
- Data preparation for running coGSEA
- Output
Quick start
coGSEA(ElistObject = elist, contrastMatrix = contrast, ENTREZGenesId = elist$genes$ENTREZ, geneSetCollection = "H", specie = "Mus musculus", directoryPath = "/path/to/existing/dir")
Introduction
coGSEA (comparative Gene Set Enrichment Analysis) is a R package developped to help you compare and combine up to 12 different methods of Gene Set Enrichment Analysis. In the current version, those 12 methods include :
- camera
- gage Results below generated with gage
2.24.0
Install manually from tar.gz archive provided here in dependancies - globaltest
- gsva
- ssgsea
- zscore
- plage
- ora
- padog
- roast
- safe
- setrank
Disclaimer : This tool is largely inspired by the eGSEA R package (and contains some of its code).
A longer and more detailed description and test of coGSEA can be found in the doc directory
How to install
devtools::install_github("maxibor/coGSEA",build_vignettes = TRUE)
Inputs
-
ElistObject : a Elist object. Your Elist object must be either a limma-voom object in case of RNAseq data, or have a estimated dispersion in case of MicroArray data. Dispersion estimates can be performed using the
estimateDisp()
function from the edgeR package. -
contrastMatrix : the contrast matrix object for your experiment. For some help to make a contrast matrix, you can check the vignette of the limma package.
-
ENTREZGenesIds : the Elist object containing the ENTREZ gene identifiers. For example
elist$genes$ENTREZ
Parameters
-
geneSetCollection : A geneset collection from the MSigDB database. Currently only three are supported :
H
(Hallmark) and two subset ofC2_KEGG
andC2_REACTOME
(Kegg and Reactome). You can also use your own geneset collection. In this case, you need to provide an list object with genesets and genes. Such an example file can be found here -
Specie : The organism from which the data were extracted. Currently only
Homo sapiens
andMus musculus
are supported. Default :Mus musculus
-
directoryPath : a path to an existing directory where coGSEA results and plots will be saved. Default :
"./"
Example :"~/coGSEA_results"
-
alpha : Alpha p value threshold. Default :
0.05
example :0.05
-
pvalAdjMethod : p value adjustment method for multiple testing. Default :
BH
.
To select among the following methods :holm
hochberg
hommel
bonferroni
BH
(Benjamini Hochberg)BY
(Benjamini Yekutieli)fdr
(False Discovery rate)none
-
pvalCombMethod : the method to combine all the p.values of one geneset accros the different GSEA methods. Default :
sumlog
To select among the following methods :sumz
(sum z method)votep
(vote counting method)minimump
(Wilkinson's method)sumlog
(Fisher's method) - selected by defaultsump
(sum p method)logitp
(logit method)meanp
(mean p method)maximump
(Wilkinson's method)
-
min.intersection.size : Graphical Parameter to select the minimum size of an intersection of genesets between different methods. Default :
1
Example :2
-
GSEA.Methods : between 1 and 12 methods to select from the methods listed in the introduction. Default :
c("camera", "gage","globaltest", "gsva", "ssgsea", "zscore", "plage", "ora", "padog", "roast","safe")
Example :c("camera", "gage","globaltest")
-
num.workers : number of thread for multithreading to decrease execution time. Default :
4
Example :4
-
shinyMode : Boolean value. Shouldn't be changed. Only used when running coGSEA is running in the background of a shiny application. Default :
FALSE
Data preparation for running coGSEA
MicroArray data
This example dataset is from a article about Astrocymotas tumors published in 2011 by Liu et al.
You can download it on the Gene Expression Omnibus database with the accession number GSE19728
Loading the necessary packages for this analysis
library(affy)
library(hgu133plus2.db)
library(edgeR)
library(gage)
library(coGSEA)
rma
Loading the CEL FILES and Normalizing them usin setwd("~/GitLab/GSE19728/data/")
celfiles = ReadAffy()
celfiles = rma(celfiles)
Getting the expression data, and the Probe Names
intensity = exprs(celfiles)
intensity = cbind(rownames(intensity), intensity)
colnames(intensity)[1] = "PROBEID"
intensity = as.data.frame(intensity)
intensity$PROBEID = as.character(intensity$PROBEID)
Getting the annotations of Probes IDS to ENTREZ accession numbers
annots = select(hgu133plus2.db, intensity$PROBEID, "ENTREZID", "PROBEID")
Merge expression matrix and annotation
res = merge(intensity, annots, by= "PROBEID")
Getting rid of PROBE ids and type casting
resmin = res[,2:ncol(res)]
cname = colnames(resmin)
resmin = apply(resmin, 2, as.numeric)
colnames(resmin)= cname
resmin = as.data.frame(resmin)
Aggregating the PROBES matching the same ENTREZ accession number by averaging them
result = aggregate(. ~ ENTREZID, resmin, mean)
result$ENTREZID = levels(as.factor(as.character(resmin$ENTREZID)))
rownames(result) = result$ENTREZID
result = result[,-1]
Visualizing in boxplot to check normalization
boxplot(result, las = 2)
.CEL
filename extension
Changing column names to remove the colnames(result) = gsub(".CEL","", colnames(result))
print(colnames(result))
Selecting only the samples we are interested in
result2 = cbind(result$GSM492649_Astrocytomas_N, result$GSM525014, result$GSM525015, result$GSM525016, result$`GSM492662_Astrocytomas_T4-1`, result$`GSM492663_Astrocytomas_T4-2` , result$`GSM492664_Astrocytomas_T4-3`, result$`GSM492665_Astrocytomas_T4-4`, result$`GSM492666_Astrocytomas_T4-5`)
colnames(result2) = c("GSM492649_Astrocytomas_N", "GSM525014", "GSM525015", "GSM525016","GSM492662_Astrocytomas_T4-1", "GSM492663_Astrocytomas_T4-2", "GSM492664_Astrocytomas_T4-3", "GSM492665_Astrocytomas_T4-4", "GSM492666_Astrocytomas_T4-5" )
rownames(result2) = rownames(result)
Preparing the design matrix
Normal = c(rep(1,4),rep(0,5))
Tumor = c(rep(0,4),rep(1,5))
design = cbind(Normal, Tumor)
rownames(design) = colnames(result2)
Preparing the contrast matrix
contr.matrix = makeContrasts(NormalVSTumor = Normal - Tumor, levels = design)
Preparing expression list object
temp = new("EList")
temp$design = design
temp$E = as.matrix(result2)
rownames(temp$E) = as.numeric(rownames(temp$E))
temp$genes$ENTREZ = rownames(result2)
temp$common.dispersion = estimateDisp(temp$E, design = temp$design)$common.dispersion
temp$samples = colnames(result2)
Preparing gene set collection
gs = gage::kegg.gsets(species = "hsa", id.type = "entrez")
geneset = gs$kg.sets
Some GSEA methods do not work properly with exotic gene names, so we need to simplify them
Function for simplifying gene sets names
nameshorter = function(names){
namemod = c()
for (i in seq(1,length(names))){
namemod[i] = paste(strsplit(names[i], " ")[[1]][-1], sep = "", collapse = " ")
namemod[i] = gsub("/","", names[i])
namemod[i] = gsub(" ","_", names[i])
}
return(namemod)
}
Simplifying gene sets names
names(geneset) = nameshorter(names(geneset))
names(geneset) = gsub("/","_",names(geneset))
Saving necessary objects to RDS files
saveRDS(contr.matrix, "~/GitLab/GSE19728/contrast.rds")
saveRDS(temp, "~/GitLab/GSE19728/elist.rds")
saveRDS(geneset, "~/GitLab/GSE19728/geneset.rds")
Reading necessary objects (generated above) from RDS files
elist = readRDS("~/GitLab/GSE19728/elist.rds")
contrast = readRDS("~/GitLab/GSE19728/contrast.rds")
geneset = readRDS("~/GitLab/GSE19728/geneset.rds")
Running coGSEA analysis
coGSEA(ElistObject = elist, contrastMatrix = contrast, ENTREZGenesIds = elist$genes$ENTREZ, geneSetCollection = geneset,specie = "Homo sapiens", directoryPath = "~/GitLab/GSE19728/results", alpha = 0.05, pvalAdjMethod = "BH", pvalCombMethod = "sumlog",min.intersection.size = 1, GSEA.Methods = c("camera", "gage","globaltest", "gsva", "ssgsea", "zscore", "ora", "padog", "roast","safe"), num.workers = 4, shinyMode = FALSE)
RNAseq data
You can download this example dataset on the Gene Expression Omnibus database with the accession number GSE63310.
This dataset was analyzed in a very detailed article on how to do differential expression analysis that we strongly advise you to read.
The file you're looking for is : GSE63310_RAW.tar
Loading necessary packages
library(edgeR)
library(limma)
library(Mus.musculus)
library(coGSEA)
Reading files
setwd("~/GitLab/GSE63310/data/")
files <- c(
"GSM1545535_10_6_5_11.txt",
"GSM1545536_9_6_5_11.txt",
"GSM1545538_purep53.txt",
"GSM1545539_JMS8-2.txt",
"GSM1545540_JMS8-3.txt",
"GSM1545541_JMS8-4.txt",
"GSM1545542_JMS8-5.txt",
"GSM1545544_JMS9-P7c.txt",
"GSM1545545_JMS9-P8c.txt")
x <- readDGE(files, columns=c(1,3))
Simplifying file names
colnames(x) = substring(colnames(x), 12, nchar(colnames(x)))
Grouping by sample condition
group <- as.factor(c("LP", "ML", "Basal", "Basal", "ML", "LP", "Basal", "ML", "LP"))
x$samples$group <- group
Grouping by lane
lane <- as.factor(rep(c("L004","L006","L008"), c(3,4,2)))
x$samples$lane <- lane
x$samples
Annotation
geneid <- rownames(x)
genes <- select(Mus.musculus, keys=geneid, columns=c("SYMBOL", "TXCHROM"),keytype="ENTREZID")
dim(genes)
head(genes)
Getting rid of duplicated annoations by keeping only the first one
genes <- genes[!duplicated(genes$ENTREZID),]
Count per million of reads
cpm <- cpm(x)
Removing genes lowly expressed (genes with 0 expression across all samples)
#Removing genes lowly expressed (genes with 0 expression across all samples)
table(rowSums(x$counts==0)==9)
keep.exprs <- rowSums(cpm>1)>=3
x <- x[keep.exprs,, keep.lib.sizes=FALSE]
dim(x)
Normlization with edgeR
x <- calcNormFactors(x, method = "TMM")
x$samples$norm.factors
Making the design matrix
design <- model.matrix( ~ 0 + group + lane)
colnames(design) <- gsub("group", "", colnames(design))
rownames(design) = colnames(x)
Making the contrast matrix
contr.matrix <- makeContrasts(
BasalvsLP = Basal-LP,
BasalvsML = Basal - ML,
LPvsML = LP - ML,
levels = colnames(design))
contr.matrix
Applying voom transformation
v <- voom(x, design, plot=F)
v$genes$ENTREZ = rownames(v$E)
Saving objects
saveRDS(v, "~/GitLab/GSE63310/elist.rds")
saveRDS(contr.matrix, "~/GitLab/GSE63310/contrast.rds")
Reading RDS objects (previously genereated above)
elist = readRDS("~/GitLab/GSE63310/elist.rds")
contrast = readRDS("~/GitLab/GSE63310/contrast.rds")
Running coGSEA
coGSEA(ElistObject = elist, contrastMatrix = contrast, ENTREZGenesIds = elist$genes$ENTREZ, geneSetCollection = "C2_KEGG",specie = "Mus musculus", directoryPath = "~/GitLab/GSE63310/results", alpha = 0.05, pvalAdjMethod = "BH", pvalCombMethod = "sumlog",min.intersection.size = 1, GSEA.Methods = c("camera", "gage","globaltest", "gsva", "ssgsea", "zscore", "ora", "padog", "roast","safe"), num.workers = 4, shinyMode = FALSE)
Output
Files
-
abbreviations_[condition_name].csv
: gene-sets abbreviations mapping to gene-sets accession numbers and names. -
geneset_collection_genes.csv
: genes of the dataset present in the gene-sets -
result_[condition_name].csv
: result table with indivudual methods ranking, p.value, adjusted p.value. Average ranks, Average logFC, and combined p.value
Plots
-
coGSEA_methods_runtime.pdf
: barplot of the runtime of the different GSE methods (in seconds) -
[condition_name]_clustering.pdf
: Clustering of the GSE methods on the gene-sets ranks (bases on raw p.values). -
[condition_name]_correlation.pdf
: Correlation Plot of the GSEA methods on the gene-sets ranks (bases on raw p.values). -
[condition_name]_pca.pdf
: PCA of the GSEA methods on the gene-sets ranks (bases on raw p.values). -
[condition_name]_eigen_fall.pdf
: Fall of the eigen values of the PCA. -
[condition_name]_snailplot.pdf
: Mset Ensemblist plot showing the intersection of the different gene-sets found significantly enriched (p.value < alpha) by each GSEA method. Methods retrieving 0 significantly enriched gene-sets are not included. -
[condition_name]_upsetr.pdf
: UpsetR Ensemblist plot showing the intersection of the different gene-sets found significantly enriched (p.value < alpha) by each GSEA method. Methods retrieving 0 significantly enriched gene-sets are not included. Only exclusive intersections are taken into account. -
[condition_name]_heatmap.pdf
: Binary heatmap showing whether a gene-set is found differentially (red) expressed (p.value < alpha) by a GSEA method or not (blue). Only the gene-sets belonging the biggest intersections size (n), and the second biggest intersections size (n-1) are included. -
[condition_name]_advanced_sumplot_rank.pdf
: Summary plot combining both the logFC on the y-axis and the -log10(p.value) on the x-axis. The size of the bubbles indicates the size of the gene-set (number of genes) while the color indicates the average rank of the gene-set. -
[condition_name]_advanced_sumplot_direction.pdf
: Summary plot combining both the logFC on the y-axis and the -log10(p.value) on the x-axis. The size of the bubbles indicates the significance (combination of p.value and logFC) while the color indicates the direction of logFC. -
[condition_name]_rank_pval_corplot.pdf
: Scatterplot of combined p.values and average ranks with -log10(combined adjusted p.value) on x-axis, and average rank on the y-axis. Sperman correlation coefficient shown in the title.