-
Notifications
You must be signed in to change notification settings - Fork 1
/
Example_of_data_analysis.Rmd
executable file
·513 lines (425 loc) · 20.5 KB
/
Example_of_data_analysis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
---
title: "Example of data analyis and tree aggregation"
author:
- name: Ruizhu HUANG
affiliation:
- Institute of Molecular Life Sciences, University of Zurich.
- SIB Swiss Institute of Bioinformatics.
- name: Charlotte Soneson
affiliation:
- Institute of Molecular Life Sciences, University of Zurich.
- SIB Swiss Institute of Bioinformatics.
- name: Mark Robinson
affiliation:
- Institute of Molecular Life Sciences, University of Zurich.
- SIB Swiss Institute of Bioinformatics.
package: treeAGG
output:
BiocStyle::html_document
vignette: >
%\VignetteIndexEntry{Tree Aggregation}
%\VignetteEncoding{UTF-8}
%\VignetteEngine{knitr::rmarkdown}
editor_options:
chunk_output_type: console
bibliography: treeAGG_vignette.bib
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
```
# Introduction
The arrangement of hypotheses in a hierarchical structure appears in many
research fields and often indicates different resolutions at which data can be
viewed and results can be interpreted. Examples include studies of microbial
abundance where the OTUs (operational taxonomic units, often assumed to
represent species) can be arranged as leaves in a phylogenetic tree, and
single-cell studies where the individual cells can be organized hierarchically,
e.g. via clustering. In both these examples, a question arises on which
resolution level the signal in the data (e.g. differential abundance of features
between groups) is best interpreted, or put another way, on what resolution the
main 'driver' of the signal can be found. For example, in the microbial
abundance data, a whole family of species may react similarly to a particular
stimulus. In such situations, it would be more informative to summarize the
differential abundance analysis at this level of resolution, rather than
providing a long list of individual differentially abundant OTUs. An additional
reason for aggregating data to higher levels of the phylogenetic tree is that
the abundances of individual species or OTUs may be low, limiting the power of
the statistical test to pick up any significant differences, whereas by
aggregating the signal from multiple related OTUs, the detection power can be
increased. For the single-cell data example, similarly, if a treatment truly
acts on a high level, affecting the abundance of, e.g., all B-cells, it is
helpful if this is picked up by the analysis pipeline, rather than returning a
large number of significantly differentially abundant subclusters of the main
B-cell cluster. The `r Biocpkg("treeAGG")` package is designed to analyze data
that can be organized in a hierarchical structure, and to find the most
informative resolution at which to interpret the signal of interest.
# Methods {#flowchart}
## The main structure of the pipeline provided in `r Biocpkg("treeAGG")`
```{r flowchart, echo=FALSE, fig.cap= "The workflow of treeAGG package."}
knitr::include_graphics("Flowchart.png")
```
In the `r Biocpkg("treeAGG")` package, we provide an easy-to-use pipeline
(Figure \@ref(fig:flowchart)) to find the most informative level of the
hierarchical structure for interpretation. The starting point for the analysis
is a hierarchical structure (e.g., a phylogenetic tree), a data matrix with
values for all leaves in the hierarchical structure, and annotation tables for
rows and columns of the matrix. From here, users can follow the standard five
steps outlined in Figure \@ref(fig:flowchart) to find the main 'driver' of the
signal. The details of the five steps are listed as below. The example of each
step is in Section \@ref(example).
* Step 1: The data on the level of the leaf nodes is collected and stored in a
`leafSummarizedExperiment` container.
* Step 2: The data on the level of the internal nodes is generated and stored in
a `treeSummarizedExperiment` container.
* Step 3: The phenotypic outcome association test is performed at each node of
the tree, and a p-value is derived for each node. If the differential abundance
analysis is desired, `runEdgeR`, a wrapper using functions from the
`r Biocpkg("treeAGG")` package, could be used. Users are free to choose any
suitable software to do their customized analysis. What to be expected is that a
p-value is derived for each node.
* Step 4: The min-P algorithm is applied to do tree aggregation.
+ It starts from the leaf nodes of the tree, and compare the p-values on
them to those on their parent nodes. If a parent node has a smaller value
than its descendant nodes, select the parent node; otherwise, select the
descendants.
+ The selected nodes are compared to nodes on a higher level as previous
step. The root node is referred as the highest level.
+ The comparison is repeated until the root is reached.
+ Among the selected nodes, those with null hypotheses rejected are kept.
* Step 5: The result is printed out.
To do customized analysis, the extended data matrix can be extracted after step
2, and the results of which can be integrated back into the pipeline to do step
4 and 5. Note that the table extracted has more rows than the original table
because it includes data for the entities that represent the internal nodes of
the tree. In each step, we have listed the available functions to use in blue
texts.
We suggest users to come back to this workflow (Figure \@ref(fig:flowchart))
after they read Section \@ref(example).
## More details {#details}
Consider the case that we have a table that contains measurments of entities
collected from samples with different phenotypic outcomes, and a tree that
represents the hierarchical structure of the entities. Entities are in the rows
of the table, and samples are in the columns. Each entity could be mapped to a
node of the tree.
To arrange hypotheses in the hierarchical structure, we need that each node of
the tree could be mapped to a row of the table. However, in most case, only the
data of entities corresponding to the leaf nodes of the tree could be observed.
For the internal nodes, the data has to be generated from that of leaf nodes.
Users could decide how to reasonably create the data for the internal nodes. For
the abundance data, we suggest to create the value for an internal node by
summing the count of its descendant leaf nodes.
With the data ready, we test at each node of the tree the null hypothesis (H0:
There is no association between the differential abundance and the phenotypic
outcome.) Multiple testing correction methods, such as the Benjamin-Hochberg
procedure, could be applied then to decide whether the null hypothesis at a node
should be rejected.
It's likely that an internal node found to be significantly associated with the
phenotypic outcome is driven by some of its descendant nodes. In other words,
only some of its descendant nodes are phenotypic outcome associated, and this
signal is not diluted enough by other descendant nodes that are not phenotypic
outcome associated. In the tree aggregation step, we try to pinpoint the nodes
that drive the association.
# Examples {#example}
The packages below are required.
```{r}
suppressPackageStartupMessages({
library(treeAGG)
library(edgeR)
library(S4Vectors)
library(ggtree)})
```
## The standard part of the pipeline {#pline1}
In this section, a toy data is used to show how to perform the standard five
steps outlined in Figure \@ref(fig:flowchart).
### Step 1: Data preparation for the leaf nodes {#step1}
At the begining, we have a tree structure (*tinyTree*), a table (*toyData*) and
the annotation data for the rows and columns of the table (*rowD* and *colD*).
Entities are in the rows of the table *toyData* and samples are in the columns.
Each entity could be mapped to a leaf node of the tree. The first five samples
belong to the group G1, and the other five to G2.
The table is generated in a way that only the first five entities have
differential proportion among groups. The back ground of the branches that the
five entities on is colored as cyan. Our goal is find the optimal level on the
tree, indicated by the violet squares, to interpret this difference.
```{r}
# tree
data("tinyTree")
```
```{r sigTree, fig.cap="The differential pattern on tinyTree.", echo=FALSE}
ggtree(tinyTree) +
geom_text2(aes(subset = isTip, label = label), hjust = -0.2) +
geom_hilight(node = 18, fill = "cyan", alpha = 0.1) +
geom_hilight(node = 13, fill = "cyan", alpha = 0.1) +
geom_point2(aes(subset = (node %in% c(13, 18))), size = 4,
color = "darkviolet", shape = 23, stroke = 3)
```
```{r}
# table
# p1 and p2 differ in the first five values
p1 <- rep(0.1, 10)
p2 <- c(rep(0.05, 3), rep(0.175, 2), rep(0.1, 5) )
set.seed(1)
# randomly generate the table from multinomial distribution
toyData <- cbind(rmultinom(n = 5, size = 1000, prob = p1),
rmultinom(n = 5, size = 1000, prob = p2))
# annotation data for rows
# The true differential information between groups is in Diff
rowD <- DataFrame(nodeLab = tinyTree$tip.label,
Diff = rep(c(TRUE, FALSE), each = 5))
# annotation data for columns
colD <- DataFrame(trt = sample(letters[1:3], size = 10,
replace = TRUE),
group = rep(c("G1", "G2"), each = 5))
```
All data above could be stored in a `leafSummarizedExperiment` container using
the function `leafSummarizedExperiment`. More details about the
`leafSummarizedExperiment` class could be found in the help page
`?leafSummarizedExperiment` or in the vignette *Introduction to
leafSummarizedExperiment and treeSummarizedExperiment*.
```{r}
lse <- leafSummarizedExperiment(assays = list(toyData),
rowData = rowD,
colData = colD,
tree = tinyTree)
```
### Step 2: Data generation for the internal nodes
The data in Section \@ref(step1) is on the level of the leaf nodes. To
generate data for the internal nodes, the function `nodeValue` can be used.
Users could decide how to generate the values for internal nodes via `fun`.
Here, we consider the value at an internal node to be the sum of the values at
its descendants (`fun = sum`).
```{r}
tse <- nodeValue(data = lse, fun = sum)
```
The output *tse* is a `treeSummarizedExperiment` object. More details about the
`treeSummarizedExperiment` class could be found in the help page
`?treeSummarizedExperiment` or in the vignette *Introduction to
leafSummarizedExperiment and treeSummarizedExperiment*.
### Step 3: Data analysis
With the data ready, the statistical analysis could be performed using suitable
softwares. Here, `runEdgeR` is shown as an example to do differential abundance
analysis. It's a wrapper of functions from `r Biocpkg("edgeR")`. The detailed
analysis performed is shown step by step in Section \@ref(pline2).
```{r}
new_tse <- runEdgeR(obj = tse, use.assays = 1,
design = NULL, contrast = NULL,
normalize = TRUE, method = "TMM",
adjust.method = "BH")
```
To do analysis, we assign *tse* to the `obj` argument. If there multiple
matrix-like elements in the `assays` of *tse*, we could use `use.assays` to
indicate which elements are used for the analysis. Here, the first one is used
(`use.assays = 1`). The design matrix and contrasts can be customized
correspondingly via `design` and `contrast`, and they will be saved in the
`metadata` of *new_tse* for later check. If the design matrix is not given
(`design = NULL`), `colData` is used to generate a design matrix. If the
contrast is not provided (`contrast = NULL`), the last coefficient is tested in
the `glmLRT` step (see `?edgeR::glmLRT` and Section \@ref(pline2)).
To print out the result, `topNodes` can be used. The output is a `list`. More
details about its structure could be found in Section \@ref(mtab).
```{r}
(tN1 <- topNodes(data = new_tse))
```
### Step 4: Tree aggregation {#step4}
The tree aggregation can be done in one step using `treeAGG`.
```{r}
outP <- treeAGG(data = new_tse, agg.by = "PValue",
sigf.by = "FDR", sigf.limit = 0.05,
message = TRUE)
```
Based on the analysis result (`data = new_tse`), we specify the name of the
column in *tN1* for aggregation via `agg.by`, and the name of the column storing
the adjusted p-value via `sigf.by`. The threshold for the adjusted p-value can
be decided via `sigf.limit`. To see the running process, `message = TRUE` is
used.
### Step 5: Result visualisation {#step5}
We could use `topNodes` again to print out the result after aggregation. If some
columns in `rowData` and `linkData` are of interest, they could be printed out
too via `col.rowData` and `col.linkData`, respectively.
```{r}
(tN2 <- topNodes(data = outP,
col.rowData = "Diff",
col.linkData = "nodeNum"))
```
Compared to *tN1*, *tN2* has three more columns, *aggKeep*, *Diff* and
*nodeNum*. The column *aggKeep* is created in the step of the tree aggregation,
the other two are from `rowData` and `linkData` as we specified in the
arguments.
The next step is to check whether the optimal level for the interpretation is
found. The true differential information is stored in the column `Diff` of
`rowData`.
```{r}
# Extract data having nodes that are truely differentially abundant
outP_sel <- outP[rowData(outP)$Diff %in% TRUE, ]
# the true optimal level of the signal
# use signalNode to remove the redundant nodes
truth <- signalNode(tree = treeData(outP_sel),
node = linkData(outP_sel)$nodeNum)
```
The estimated optimal level to interpret the differential abundance pattern from
`r Biocpkg("treeAGG")` is as below.
```{r}
# the estimated signal level
res <- tN2$result_assay1$contrastNULL
found <- res$nodeNum[res$aggKeep]
```
The node number (`nodeNum`) instead of the node label is extracted because
internal nodes might not have labels in the tree provided.
```{r fig.height= 4}
p <- treePlot(tree = treeData(outP) ,
branch = truth,
point = found,
layout = "rectangular")
# using the function geom_text2 from the ggtree packages to add labels
p + geom_text2(aes(label = label), hjust = -0.3)
```
We use `treePlot` (`?treePlot`) to visualize the results. The branches with blue
edges are truly differentially abundant. The results obtained from the minP
aggregation algorithm implemented in `r Biocpkg("treeAGG")` are shown as orange
points. Compared to Figure \@ref(fig:sigTree), the orange points are exactly in
the same location as the violet squares. That indicates the true signal is found
correctly. More details about how to use `r Biocpkg("ggtree")` could be seen
[here](https://bioconductor.org/packages/release/bioc/html/ggtree.html).
## The customized part of the pipeline {#pline2}
### Data analysis {#cDA}
As described in Figure \@ref(fig:flowchart), the table stored in `assays` could
be extracted to do customized analysis. Note: To make sure we know which row
corresponds to which node of the tree, we need to use `use.nodeLab = TRUE`.
```{r}
# extract the abundance table
dat <- assays(tse, use.nodeLab = TRUE)[[1]]
```
Users are free to choose any suitable software to perform analysis for their
data. Here, we follow the routine steps of using the `r Biocpkg("edgeR")`
package to analyze the toy data.
```{r}
# calculate total count for each sample
# The total count is the sum of cell counts of clusters on the leaf level of the
# tree.
tip_tse <- tse[linkData(tse)$isLeaf, ]
tipDat <- assays(tip_tse, use.nodeLab = TRUE)[[1]]
libSize <- apply(tipDat, 2, sum)
# create DGEList
y <- DGEList(counts = dat, lib.size = libSize,
remove.zeros = FALSE)
# calculate normalisation factors
y <- calcNormFactors(object = y, method = "TMM")
# construct design matrix
sample_inf <- colData(tse)
design <- model.matrix(~ trt + group, data = sample_inf)
# estimate dispersion
y <- estimateDisp(y, design = design)
# fit the negative binomial GLMs
fit <- glmFit(y, design = design, prior.count = 0.125)
# run likelihood ratio tests
# contrast is not specified here, so the last coefficient is tested.
lrt <- glmLRT(fit, contrast = NULL)
# Use Benjamin-Hochberg method to do multiple testing correction
# n is set to Inf below, because we want to have the results of all entities.
out <- topTags(lrt, n = Inf, adjust.method = "BH")$table
head(out)
```
The output *out* has five columns, one of which is the nominal p-value
(*PValue*) and one is the adjusted p-value (*FDR*). These columns are required
by the tree aggregation step.
If other softwares are used to do analysis, the output is expected to has at
least two columns that contain the nominal and adjusted p-values, and each row
representing a node of the tree. The column names are not necessary to be
`PValue` and `FDR`, as they could be specified in the function `treeAGG` (see
Section \@ref(step4)).
### Back to the standard part of the pipeline
To prepare for the tree aggregation, we now put the analysis result in the
`rowData` of *tse* via `updateTSE`.
```{r}
outList <- list(assay1 = list(out))
new_tse1 <- updateTSE(result = outList, tse = tse,
use.assays = 1, design = design,
contrast = NULL, fit = list(fit))
```
The result *out* is changed to a list object *outList* before it is assigned to
the argument `result`. To update the `treeSummarizedExperiment` object (`tse =
tse`), the result (`result = outList`) and the information, such as the design
matrix (`design`), the contrasts (`contrast`) and the number of matrix-like
elements in `assays` (`use.assays`) that were used, should be provided. We could
also optionally store the result *fit* created by the `glmFit` function for
later use, for example, to quickly get new result when specifying a new
contrast.
The analysis performed by `runEdgeR` is exactly the same as the code in this
section.
```{r}
all.equal(new_tse, new_tse1)
```
Now, we are back to the standard part of the pipeline and could follow Section \@ref(step4) and Section \@ref(step5) to get the final results.
## Additional
### The application of pipeline on multiple tables {#mtab}
It's possible to perform analysis and aggregation simultaneously on multiple
elements of `assays` with different contrasts.
The data *tseM* with two elements in the `assays` is created.
```{r}
tseM <- treeSummarizedExperiment(assays = list(dat, 2*dat),
tree = treeData(tse),
rowData = rowData(tse),
colData = colData(tse))
```
To use both elements, we specify `use.assays = c(1, 2)`. Two different contrasts
are applied (`contrast`) to the analysis of each element.
```{r}
# data analysis
new_tseM <- runEdgeR(obj = tseM, use.assays = c(1, 2),
design = NULL,
contrast = list(contrast1 = c(0, 1, -1, 0),
contrast2 = c(0, 0, 0, 1)),
normalize = TRUE, method = "TMM",
adjust.method = "BH")
```
If the customized analysis was performed, the results from multiple elements
could be written back in one step.
To simplify the document, we directly use the output from Section \@ref(cDA)
to show how to do it.
```{r}
# assume two contrasts are used in the analysis
cList <- list(contrast1 = c(0, 1, -1, 0),
contrast2 = c(0, 0, 0, 1))
# the results from the first element of assays
# name the results as the list of the contrasts (cList)
res1 <- list(contrast1 = out,
contrast2 = 2*out)
# the results from the second element of assays
# name the results as the list of the contrasts (cList)
res2 <- list(contrast1 = 0.5*out,
contrast2 = 3*out)
# combine the results into a list
# name the results with 'assay' followed by a number
# e.g., if the first and the third table in the assays were used for analysis
# then, we name the result with assay1 assay3
res <- list(assay1 = res1,
assay2 = res2)
# keep both results and data in the container
test_tseM <- updateTSE(result = res, tse = tseM,
use.assays = c(1, 2),
design = NULL,
contrast = cList)
```
Now, the output could be obtained using `topNodes` as mentioned before. The
output is a `list`.
```{r}
rD4 <- topNodes(test_tseM)
class(rD4)
```
Each element of the list is the analysis result of a table in `assays`.
```{r}
# Here, we have two elements: one is the result of the first table,
# and the other is the result of the second table
names(rD4)
```
Each sub-element is a `list` too. It stores the analysis result of a table in
`assays` under a specific contrast.
```{r}
class(rD4$result_assay1)
names(rD4$result_assay1)
```
Here, the table below is exactly the data *out* that was put in.
```{r}
# There are five columns
head(rD4$result_assay1$contrast1)
```