/
hierarchical.Rmd
372 lines (303 loc) · 15.2 KB
/
hierarchical.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
---
title: "Hierarchical Consensus Partitioning"
author: "Zuguang Gu (z.gu@dkfz.de)"
date: '`r Sys.Date()`'
output: rmarkdown::html_vignette
---
```{r, echo = FALSE, message = FALSE}
library(markdown)
options(markdown.HTML.options = c(options('markdown.HTML.options')[[1]], "toc"))
library(knitr)
knitr::opts_chunk$set(
error = FALSE,
tidy = FALSE,
message = FALSE,
fig.align = "center")
options(markdown.HTML.stylesheet = "custom.css")
options(rmarkdown.html_vignette.check_title = FALSE)
options(width = 100)
library(ComplexHeatmap)
library(circlize)
library(cola)
```
## The problem
Choosing the best k (number of partitions) is not an easy problem for
consensus partitioning. In consensus partitioning, various metrics based on
the consensus matrix are normally calculated, _e.g._ PAC scores (or 1-PAC) or
mean silhouette scores, and the best k is normally selected based on the
"extremal method", _i.e._ to select the k that corresponds to the highest or
lowest values of the metrics. When the number of partitions is small, it is
relatively easy to determine the best k with high confidence, while when the
real number of clusters gets very large, it is difficult to identify the
correct or approximate k for several reasons, some of which we list in the
following: 1) Variation in the "big clusters" affect the eCDF of the consensus
matrix stronger than variation in the "small clusters", this can strongly
affect PAC scores. 2) Groups showing weaker differences (we can call them
secondary groups) are more difficult to separate especially when there are
already other groups showing distinct differences (we can call them major
groups). 3) The curve of various metrics against k gets flattened for large k
and the value of k with the extremal values will be less distinct.
The following four figures illustrate the eCDF curves of a consensus matrix,
1-PAC, mean silhouette and concordance scores for different k where k ranges
from 2 to 10 (from the analysis [here](
https://jokergoo.github.io/cola_examples/GBM_450K/GBM_450K_cgi_all_subgroup_cola_report/cola_report.html#MAD-pam)).
Basically, when k >= 5, the eCDF curves have a long plateau with less and less
curvature, _i.e._ they lose their step-like shape, which results in 1-PAC
getting almost stable for k >= 5. Also for the curves of mean silhouette and
concordance scores against k, they are almost flattened for k >= 3. If using
the "extremal method", k = 6 is taken as the best k because 1-PAC selects k =
9 while mean silhouette and concordance select k = 6.
<img src="large_k_problem_select_partition_number-min.png" width="100%" />
When inspecting the consensus heatmaps for different k (_cf._ heatmaps below),
actually it is difficult to assess whether the partitioning for k = 6 is
better than any other k in [5, 10].
<img src="large_k_problem_consensus_heatmap-min.png" width="100%" />
The problems of the "big clusters / small clusters" or "major clusters /
secondary clusters" in selecting the best k are mainly due to the consensus
partitioning procedure that all samples are taken into account equally. From
version 2.0.0, we proposed a new method which tries to solve this issue by
applying consensus partitioning in a hierarchical way. Simply speaking, one
could first classify the samples into n<sub>major</sub> groups
(n<sub>major</sub> is a small number, major clusters), then for each subgroup
of samples, one could repeatedly apply consensus clustering. By this means,
theoretically, small clusters or secondary clusters could be detected in later
steps of the hierarchical procedure.
## The workflow
The figure below illustrates the workflow of the hierarchical consensus partitioning.
<img src="hierarchical_partitioning_workflow.png" width="400" />
The steps are:
1. The input matrix _M_ is treated as the top node in the hierarchy, with a
node label "0".
2. Apply multiple combinations of top-value methods and partitioning methods
on the matrix with a small set of k (_e.g._, 2-4).
3. Select the consensus partitioning result (_i.e._ with a specific top-value
method and a partitioning method) which shows the highest mean silhouette scores with
its best k. This consensus partitioning is treated as the partitioning on
the node.
4. If this partitioning is not stable (_e.g._, mean silhouette < 0.9), the hierarchical
partitioning stops and the columns of the matrix are treated as a subgroup.
5. If the number of signatures or the fraction of signatures to the total
number of rows in the matrix is too small, it means the partitioning does
not show biological meaningful results and the hierarchical partitioning
stops. The columns are treated as a subgroup.
6. If the partioning passes the filtering on step 4 and 5, each subgroup is treated
a child node of the current node.
7. For each submatrix, if the number of columns is too small, the hierarchical
partitioning stops and the set of columns is treated as a subgroup.
8. If the submatrix has enough columns, go to step 2 to perform consensus
partioning recursively.
The process of the hierarchical consensus partitioning is saved as a
dendrogram internally.
## The usage
In this section, we demonstrate the functionalities of hierarchical consensus
partitioning. The design of these new functionalities tries to be as
consistent as the functions for normal consensus partitioning in **cola**,
_i.e._, the function `consensus_partition()` or
`run_all_consensus_partition_methods()`. Thus, you may find many functions
having the same names as the functions for normal consensus partitioning.
The following code applies hierarchical consensus partitioning on the Golub
dataset. Function `hierarchical_partition()` applies the analysis where the
main arguments are very similar as in `consensus_partition()` or
`run_all_consensus_partition_methods()`, which are the input matrix, the
sample annotations and the number of cores. The function returns a
`HierarchicalPartition` object.
```{r, eval = FALSE}
library(golubEsets) # from Bioconductor
data(Golub_Merge)
m = exprs(Golub_Merge)
colnames(m) = paste0("sample_", colnames(m))
anno = pData(Golub_Merge)
m[m <= 1] = NA
m = log10(m)
m = adjust_matrix(m)
# apply quantile normalization
library(preprocessCore) # from Bioconductor
cn = colnames(m)
rn = rownames(m)
m = normalize.quantiles(m)
colnames(m) = cn
rownames(m) = rn
set.seed(123)
golub_cola_rh = hierarchical_partition(
m, anno = anno[, c("ALL.AML"), drop = FALSE]
)
```
Some important arguments in `hierarchical_partition()` are listed as follows:
- `top_n`: Number of rows with top values. The value can be a vector of
integers. On each node, there is an additional filtering on rows of the
submatrices to remove those rows with very small variances, which results in
the reducing number of rows in the submatrices. Here `top_n` can be set as a
vector of values less than 1 which are treated as the fraction of rows of
every submatrix.
- `top_value_method`: A single or a vector of top-value methods.
- `partition_method`: A single or a vector of partitioning methods. All
combinations of `top_value_method` and `partition_method` are tried.
- `combination_method`: Instead of specifying `top_value_method` and
`partition_method`, all methods that are going to be tried can also be
specified with `combination_method`. The value can be a vector in a form of
`c("SD:hclust", "ATC:skmeans", ...)` or a data frame with two columns
`data.frame(c("SD", "ATC"), c("hclust", "skmeans"))`.
- `anno`: Annotation for the columns. The value can be a vector or a data
frame. Note the rows in `anno` should be corresponded to the matrix columns.
- `anno_col`: Colors for annotations. The value should be a list of colors
where a named vector for discrete color mapping and a color mapping function
generated by `circlize::colorRamp2()` for continuous color mapping.
- `mean_silhouette_cutoff`: Cut off for mean silhouette_cutoff score to decide whether a partition
is stable and to be split further more.
- `min_samples`: Miminal number of samples to perform partitioning.
- `subset`: Number of subset columns to perform partitioning. If the current
number of columns in the submatrix is higher than `subset`,
`consensus_partition_by_down_sampling()` instead of `consensus_partition()`
will be applied. It will be discussed in more details in the Section [**Work
with huge datasets**](hierarchical.html#toc_5).
- `min_n_signatures`: On each node that has a partition, `get_signatures()` is
applied to find the number of signatures under the best k. If the number of
signatures is less than `min_n_signatures`, it means the partitioning might
not be significantly different and the hierarchical consensus partitioning
stops.
- `min_p_signatures`: This is the fraction of signatures to the total number
of rows of the original matrix. This filtering is a companion of
`min_n_signatures`.
- `max_k`: Maximal number of groups to try for consensus partitioning.
Normally this value should be set to a small value, because more subgroups
will be found during the hierarchical consensus partitioning process.
- `cores`: `hierarchical_partition()` supports parallel computing. This is
the number of cores to use.
The object `golub_cola_rh` is already generated and shipped in **cola** package, so we directly load it.
```{r}
data(golub_cola_rh)
golub_cola_rh
```
Directly entering `golub_cola_rh` prints the hierarchy. As you already see in
the previous output, the node in the hierarchy is encoded in a special way. As
explained in previous text, on each node, the columns are split into several
groups and a digit (the subgroup index in current partitioning) is appended with the current node label to the
children node labels. Thus, the length (or `nchar`) of the label represents
the depth of that node in the hierarchy and from the node label, it is also
straightforward to infer its parent node. _E.g._, a node with label `0123` has its
parent node `012`.
Also you can find the functions that can be applied to the `HierarchicalPartition` object
in previous output.
The first function you may try is to see how the columns are separated and the hierarchy
of the subgroups. This can be done by `collect_classes()` function:
```{r}
collect_classes(golub_cola_rh)
```
There are several metrics saved for each node which can be retrieved by `node_info()`.
```{r}
node_info(golub_cola_rh)
```
There are following columns from `node_info()`:
- `id`: The node id or label.
- `best_method`: Because on each node, multiple methods set in `combination_method` are tried
and the method that gives the highest 1-PAC under its best k is finally selected. The best
method is saved here.
- `depth`: The depth of the node in the hierarchy.
- `best_k`: The best k used.
- `n_columns`: Number of columns of the submatrix.
- `n_signatures`: Number of signatures.
- `p_signatures`: Fraction of signatures to the total number of rows of the matrix.
- `is_leaf`: Whether the node is a leaf node.
These values are useful to merge the children nodes.
Most functions for dealing with the `HierarchicalPartition` object accept a `merge_node` argument,
where you can set different paremeters to select the children node to merge. These parameters
should be set by the function `merge_node_param()` function. And there are the four parameters
can be adjusted:
- `depth`
- `min_n_signatures`
- `min_p_signatures`
`min_n_signatures` measures how much you trust the classification on the bioloigcal point of view.
In the following, we demonstrate to manuplate the dendrogram by
setting different `min_n_signatures` values.
```{r, eval = FALSE}
collect_classes(golub_cola_rh, merge_node = merge_node_param(min_n_signatures = 174))
collect_classes(golub_cola_rh, merge_node = merge_node_param(min_n_signatures = 263))
collect_classes(golub_cola_rh, merge_node = merge_node_param(min_n_signatures = 2041))
```
```{r, echo = FALSE, fig.width = 7}
p1 = grid.grabExpr(collect_classes(golub_cola_rh, merge_node = merge_node_param(min_n_signatures = 174),
row_title = "min_n_signatures >= 174"))
p2 = grid.grabExpr(collect_classes(golub_cola_rh, merge_node = merge_node_param(min_n_signatures = 263),
row_title = "min_n_signatures >= 263"))
p3= grid.grabExpr(collect_classes(golub_cola_rh, merge_node = merge_node_param(min_n_signatures = 2041),
row_title = "min_n_signatures >= 2041"))
grid.newpage()
pushViewport(viewport(x = 0, width = 1/3, just = "left"))
grid.draw(p1)
popViewport()
pushViewport(viewport(x = 1/3, width = 1/3, just = "left"))
grid.draw(p2)
popViewport()
pushViewport(viewport(x = 2/3, width = 1/3, just = "left"))
grid.draw(p3)
popViewport()
```
We can also compare to the normal consensus partitioning classification:
```{r}
data(golub_cola)
golub_cola_cp = golub_cola["ATC:skmeans"]
collect_classes(golub_cola_rh,
anno = cbind(get_anno(golub_cola_rh),
cola_cp = factor(get_classes(golub_cola_cp, k = suggest_best_k(golub_cola_cp))[, "class"])),
anno_col = c(get_anno_col(golub_cola_rh))
)
```
`get_classes()` returns the subgroups of columns:
```{r}
get_classes(golub_cola_rh)
```
Also you can control `merge_node` argument to decide on which level of hierarchy you want.
```{r}
get_classes(golub_cola_rh, merge_node = merge_node_param(min_n_signatures = 263))
```
`suggest_best_k()` extracts the the best k as well as related metrics for the
best partitions on each node.
```{r}
suggest_best_k(golub_cola_rh)
```
Another important function which gives a direct feeling of how the subgrouping look like
is to check the signatures that are significantly different between subgroups.
Similarly as normal consensus partitioning, we can use `get_signatures()` here.
The function basically applies `get_signatures,ConsesusPartition-method()` on the partition
on every node and collect all the signatures as the signatures of the hierarchical
consensus partitioning.
```{r}
get_signatures(golub_cola_rh, verbose = FALSE)
```
Other useful functions are `dimension_reduction()`, `compare_signatures()` and
`test_to_known_factors()`. The usages are as follows:
```{r}
dimension_reduction(golub_cola_rh)
```
```{r}
compare_signatures(golub_cola_rh)
```
```{r}
test_to_known_factors(golub_cola_rh)
```
Note all these functions mentioned above allow the `merge_node` argument to adjust
the hierarchy.
## Automated reporting
One of the key advantage of **cola** package is it automates the complete analysis.
There is also a `cola_report()` function for `HierarchicalPartition` class and it
automates the complete analysis as well. Simply run:
```{r, eval = FALSE}
rh = hierarchical_partition(...)
cola_report(rh, output_dir = ...)
```
## Work with huge datasets
In the vignette ["Work with Big Datasets"](work_with_big_datasets.html), we introduced
a `DownSamplingConsensusPartition` class and its corresponding method `consensus_partition_by_down_sampling()`
which performs consensus partitioning on a randomly sampled subset of columns and predict the subgroup
labels for the remaining columns from the ... Here `hierarchical_partition()` also supports down sampling
which makes it possible to work on extremly large datasets.
The only thing for dealing with huge datasets is to set the `subset` argument.
```{r, eval = FALSE}
hierarchical_partition(..., subset = ...)
```
On each node, to consider the euqal sizes of groups, we first perform a fast k-means and random sample columns with
different weight according to the size the groups.
## Session info
```{r}
sessionInfo()
```