/
example_workflow.Rmd
240 lines (170 loc) · 6 KB
/
example_workflow.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
---
title: "Example of using mitch package for Infinium methylation analysis"
author: "The GMEA team"
date: "`r Sys.Date()`"
output:
html_document:
toc: true
toc_float: true
fig_width: 7
fig_height: 7
theme: cosmo
bibliography: references.bib
csl: epigenetics.csl
---
## Introduction
Here we are looking at putting together the easiest possible workflow for the package.
Source code: https://github.com/markziemann/gmea/blob/main/example_workflow/example_workflow.Rmd
## Requirements
Load packages.
Important: ensure that the mitch version used is 1.15.1 or higher.
```{r,packages}
suppressPackageStartupMessages({
library("limma")
library("eulerr")
library("IlluminaHumanMethylation450kanno.ilmn12.hg19")
library("IlluminaHumanMethylationEPICanno.ilm10b4.hg19")
library("HGNChelper")
library("tictoc")
library("mitch")
library("kableExtra")
library("beeswarm")
library("missMethyl")
library("gridExtra")
library("png")
})
if( packageVersion("mitch") < "1.15.1") {
warning("This workflow requires mitch version 1.15.1 or higher")
}
```
## Load methylation data
The limma[@Ritchie2015-zn] data table is read in.
This dataset is available from GEO (GSE49667)[@Zhang2013-oy].
The limma data was obtained using a previously prepared workflow [@Maksimovic2017-zv].
```{r,loaddata}
dm <- read.csv("DMPs.csv.gz",header=TRUE)
head(dm)
```
Note that this object has probes with their own column, not simply as row names.
## Load pathways
Gene ontologies were downloaded in GMT format from MSigDB on 15th Jan 2024[@Liberzon2015-um;@The_Gene_Ontology_Consortium2023-fm].
The GMT file is read into R using the mitch function `gmt_import()`.
```{r,loadpathways}
gene_sets <- gmt_import("c5.go.v2023.2.Hs.symbols.gmt")
```
## Curate the annotation
One of the critical parts of this workflow is the establishment of probe-gene relationships.
This controls how the probe data is aggregated to make the gene level scores.
In this demonstration we are using all probes on the HM450K chip.
Alternative approaches could use only promoters CpGs for example.
As these annotations are several years old, many of the annotated gene names are no longer
current.
To remedy this, the gene names are screened with the HGNChelper package and any defunct
symbols get updated to the newer gene name, so they will be recognised properly in the
gene sets.
The process for making a gene table for EPIC data is included down below.
```{r,anno1}
tic()
anno <- getAnnotation(IlluminaHumanMethylation450kanno.ilmn12.hg19)
myann <- data.frame(anno[,c("UCSC_RefGene_Name","UCSC_RefGene_Group","Islands_Name","Relation_to_Island")])
gp <- myann[,"UCSC_RefGene_Name",drop=FALSE]
gp2 <- strsplit(gp$UCSC_RefGene_Name,";")
names(gp2) <- rownames(gp)
gp2 <- lapply(gp2,unique)
gt1 <- stack(gp2)
colnames(gt1) <- c("gene","probe")
gt1$probe <- as.character(gt1$probe)
dim(gt1)
str(gt1)
length(unique(gt1$gene))
toc() #6.0s
tic()
#new.hgnc.table <- getCurrentHumanMap()
new.hgnc.table <- readRDS("../figs/new.hgnc.table.rds")
fix <- checkGeneSymbols(gt1$gene,map=new.hgnc.table)
fix2 <- fix[which(fix$x != fix$Suggested.Symbol),]
length(unique(fix2$x))
gt1$gene <- fix$Suggested.Symbol
toc() # 36 sec (2788 gene names updated)
head(gt1)
```
## Mitch pipeline
The first part is to import the data into mitch.
This only works properly when mitch's mapGeneIds function uses `mean` to aggregate (line 69 of mitch.R).
```{r,mg1}
tic()
m <- mitch_import(x=dm,DEtype="limma",geneTable=gt1,geneIDcol="Name")
toc() #2.5 sec
head(m) %>%
kbl(caption = "Example of differential gene methylation scores used for pathway analysis") %>%
kable_paper("hover", full_width = F)
```
Now run the enrichment analysis.
Note that the results are not sorted by p-value, rather S.distance, an enrichment score.
I think this works better for interpretation.
```{r,mitch1}
tic()
mres <- mitch_calc(x=m,genesets=gene_sets,minsetsize=5, priority="effect")
toc() #7 secs
mtable <- mres$enrichment_result
up <- subset(mtable,s.dist>0 & p.adjustANOVA<0.05)
dn <- subset(mtable,s.dist<0 & p.adjustANOVA<0.05)
nrow(up)
nrow(dn)
head(up,10) %>%
kbl(caption = "Top significant pathways with higher methylation") %>%
kable_paper("hover", full_width = F)
head(dn,10) %>%
kbl(caption = "Top significant pathways with lower methylation") %>%
kable_paper("hover", full_width = F)
```
Now make a barplot of these top findings.
```{r, barplot}
top <- rbind(up[1:10,],dn[1:10,])
top <- top[order(top$s.dist),]
barnames <- gsub("_"," ",top$set)
cols <- as.character(((sign(top$s.dist)+1)/2)+1)
cols <- gsub("1","blue",gsub("2","red",cols))
par(mar = c(5.1, 29.1, 4.1, 2.1))
barplot(abs(top$s.dist),horiz=TRUE,las=1,names.arg=barnames, cex.names=0.8,
cex.axis=0.8,col=cols, xlab="Enrichment score",main="Gene Ontologies")
mtext("whole gene")
grid()
par( mar = c(5.1, 4.1, 4.1, 2.1) )
```
There are some interesting results there that give hints to the differences between naive T and Treg cells.
Make a html report and some charts.
```{r,report1}
mitch_report(res=mres,outfile="example_mitchreport.html",overwrite=TRUE)
mitch_plots(res=mres,outfile="example_mitchcharts.pdf")
```
## Make gene table for EPIC array data
In case you are working with EPIC array data, the probe-gene table can be make with this snippet of code.
```{r,gt2}
tic()
anno <- getAnnotation(IlluminaHumanMethylationEPICanno.ilm10b4.hg19)
myann <- data.frame(anno[,c("UCSC_RefGene_Name","UCSC_RefGene_Group","Islands_Name","Relation_to_Island")])
gp <- myann[,"UCSC_RefGene_Name",drop=FALSE]
gp2 <- strsplit(gp$UCSC_RefGene_Name,";")
names(gp2) <- rownames(gp)
gp2 <- lapply(gp2,unique)
gt <- stack(gp2)
colnames(gt) <- c("gene","probe")
gt$probe <- as.character(gt$probe)
dim(gt)
str(gt)
toc() #9.0s
tic()
#new.hgnc.table <- getCurrentHumanMap()
new.hgnc.table <- readRDS("../figs/new.hgnc.table.rds")
fix <- checkGeneSymbols(gt$gene,map=new.hgnc.table)
fix2 <- fix[which(fix$x != fix$Suggested.Symbol),]
length(unique(fix2$x))
gt$gene <- fix$Suggested.Symbol
toc()
```
## Session information
```{r,save}
sessionInfo()
```
## Bibliography