/
hclust.Rmd
240 lines (206 loc) · 8.07 KB
/
hclust.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
---
title: "Hierarchical Clustering"
author: "Stefano Monti"
output:
html_document:
theme: united
toc: yes
code_folding: show
css: "../style/BS831.css"
---
```{r include=FALSE}
knitr::opts_chunk$set(message=FALSE, warning=FALSE, eval=TRUE)
devtools::load_all(".")
library(Biobase)
library(pheatmap)
library(tidyverse)
```
```{r eval=FALSE}
library(Biobase)
library(BS831)
library(pheatmap)
library(tidyverse)
```
## Clustering of Real Data: the Drug Matrix
We will use the DrugMatrix subset previously described for our
examples. This data corresponds to chemical perturbation experiments,
whereby rats are exposed to different chemicals, and their liver's RNA
is profiled.
We upload the data and perform a drastic variation filtering to obtain
a small and easy to manage dataset.
```{r drugMatrix, message=FALSE, warning=FALSE}
data(dm10)
DM <- variationFilter(dm10,ngenes=1000,do.plot=TRUE,verbose=FALSE)
```
## Hierarchical Clustering
Let us analyze the data by carrying out hierarchical clustering. We'll
use `heatmap.plus` to visualize the data. Let us first
define a simple function to create a color gradient to be used for
coloring the gene expression heatmaps.
```{r colGradient}
colGradient <- function( cols, length, cmax=255 )
{
## e.g., to create a white-to-red gradient with 10 levels
##
## colGradient(cols=c('white','red'),length=10)
##
## or, to create a blue-to-white-to-red gradients with 9 colors (4 blue's, white, 4 red's)
##
## colGradient(cols=c('blue','white','red'),length=9)
##
ramp <- grDevices::colorRamp(cols)
rgb( ramp(seq(0,1,length=length)), max=cmax )
}
```
Next, let us perform the actual clustering. While the function
`heatmap.plus` can carry out hierarchical clustering
internally, we explicitely call `hclust` outside the
function call to illustrate its use (this will also save computation
time should one want to call heatmap.plus multiple times with
different color coding or other changes that would not affect the
clustering).
Notice that the heatmap color is somewhat saturated. In the module
`R/Heatmaps.Rmd` we present additional visualization functions
that do a better job at color-coding the heatmap.
```{r hclust}
chemicals <- unique(pData(dm10)[,"CHEMICAL"])
annot <- data.frame(CARC=dm10$Carcinogen_liv,
CHEM=dm10$CHEMICAL,
row.names = Biobase::sampleNames(dm10))
annot_col <- list(
CARC=c("NON-CARC"="green","CARCINOGEN"="pink"),
CHEM=data.frame(chem=chemicals,cols=rainbow(length(chemicals))) %>% tibble::deframe())
## color gradient for the expression levels (blue=down-regulated; white=neutral; red=up-regulated)
bwrPalette <- colGradient(c("blue","white","red"),length=9)
## cluster rows (genes) and columns (samples)
hc_row <- hclust(as.dist(1-cor(t(exprs(DM)))),method="ward.D2") # genes by correlation
hc_col <- hclust(dist(t(exprs(DM))),method="ward.D2") # samples by euclidean distance (default)
pheatmap(exprs(DM),
color=bwrPalette,
annotation_col = annot,
annotation_colors = annot_col,
cluster_rows=hc_row,
cluster_cols=hc_col,
show_rownames = FALSE,
show_colnames = FALSE,
scale = "row",
fontsize = 5)
```
```{r hclust.old, eval=FALSE}
## color coding of the samples indicating carcinogenicity status and chemical
chemicals <- unique(pData(dm10)[,"CHEMICAL"])
CSC <- cbind(CARC=c("white","green","pink")[match(pData(dm10)[,"Carcinogen_liv"],c(NA,"NON-CARC","CARCINOGEN"))],
CHEM=rainbow(length(chemicals))[match(pData(dm10)[,"CHEMICAL"],chemicals)])
## color gradient for the expression levels (blue=down-regulated; white=neutral; red=up-regulated)
bwrPalette <- colGradient(c("blue","white","red"),length=13)
## cluster rows (genes) and columns (samples)
hc.row <- hclust(as.dist(1-cor(t(exprs(DM)))),method="ward.D2") # genes by correlation
hc.col <- hclust(dist(t(exprs(DM))),method="ward.D2") # samples by euclidean distance (default)
## draw the heatmap (hide row and col labels since they'd be unreadable)
heatmap.plus(exprs(DM),
Rowv=as.dendrogram(hc.row),
Colv=as.dendrogram(hc.col),
col=bwrPalette,
ColSideColors=CSC,
labCol=NA,labRow=NA)
pheatmap(DM)
```
### Optimal Leaf Ordering
As discussed in class, hierarchical clustering induces a _partial_
ordering of the dendogram leaves (i.e., of the clustered items),
modulo the 'flipping' of any of the sub-trees. However, one can obtain
a _total_ ordering by using the leaf-ordering algorithm developed by
<A
HREF="http://bioinformatics.oxfordjournals.org/content/19/9/1070">Bar-Joseph
et al. (2001)</A>, which minimizes the distance betwees adjacent items
in two distinct sub-trees.
```{r hcopt}
library(cba) # load the necessary pacakge
## First, call hclust, then call the leaf-ordering algorithm and re-order the dendrogram
hc_col <- hclust(dist(t(exprs(DM))),method="ward.D2")
ord <- order.optimal(dist(t(exprs(DM))),merge=hc_col$merge)
hc_col$merge <- ord$merge
hc_col$order <- ord$order
## visualize the results (and compare, visually, to the results w/o optimal leaf ordering above)
pheatmap(exprs(DM),
color=bwrPalette,
annotation_col = annot,
annotation_colors = annot_col,
cluster_rows=hc_row,
cluster_cols=hc_col,
show_rownames = FALSE,
show_colnames = FALSE,
scale = "row",
fontsize = 5)
## since this is a step always worth performing, let's define a simple function that implements the necessary steps
hcopt <- function(d, HC=NULL, method = "ward.D", members = NULL)
{
if ( is.null(HC) ) {
HC <- hclust(d,method=method,members=members)
}
ORD <- order.optimal(d,merge=HC$merge)
HC$merge <- ORD$merge
HC$order <- ORD$order
HC
}
## ok, now we can easily use the function in place of hclust
ho_row <- hcopt(as.dist(1-cor(t(exprs(DM)))),method="ward.D2") # genes by correlation
ho_col <- hcopt(dist(t(exprs(DM))),method="ward.D2") # samples by euclidean distance (default)
pheatmap(exprs(DM),
color=bwrPalette,
annotation_col = annot,
annotation_colors = annot_col,
cluster_rows=ho_row,
cluster_cols=ho_col,
show_rownames = FALSE,
show_colnames = FALSE,
scale = "row",
fontsize = 5)
```
### Alternative Agglomeration Rules
In the examples above, we used the <A
HREF="https://en.wikipedia.org/wiki/Ward's_method">Ward method</A> as
our agglomeration rule. One of the nice properties of Ward is that it
tends to yield well-balanced trees. Let us not look at alternative
agglomeration rules.
```{r linkage}
## complete linkage
ho_row1 <- hcopt(as.dist(1-cor(t(exprs(DM)))),method="complete") # genes by correlation
ho_col1 <- hcopt(dist(t(exprs(DM))),method="complete") # samples by euclidean distance (default)
pheatmap(exprs(DM),
color=bwrPalette,
annotation_col = annot,
annotation_colors = annot_col,
cluster_rows=ho_row1,
cluster_cols=ho_col1,
show_rownames = FALSE,
show_colnames = FALSE,
scale = "row",
fontsize = 5)
## single linkage
ho_row2 <- hclust(as.dist(1-cor(t(exprs(DM)))),method="single") # genes by correlation
ho_col2 <- hcopt(dist(t(exprs(DM))),method="single") # samples by euclidean distance (default)
pheatmap(exprs(DM),
color=bwrPalette,
annotation_col = annot,
annotation_colors = annot_col,
cluster_rows=ho_row2,
cluster_cols=ho_col2,
show_rownames = FALSE,
show_colnames = FALSE,
scale = "row",
fontsize = 5)
## average linkage
ho_row3 <- hclust(as.dist(1-cor(t(exprs(DM)))),method="average") # genes by correlation
ho_col3 <- hcopt(dist(t(exprs(DM))),method="average") # samples by euclidean distance (default)
pheatmap(exprs(DM),
color=bwrPalette,
annotation_col = annot,
annotation_colors = annot_col,
cluster_rows=ho_row3,
cluster_cols=ho_col3,
show_rownames = FALSE,
show_colnames = FALSE,
scale = "row",
fontsize = 5)
```