define_rb.R
#' Define Rare Biosphere
#'
#' Classifies species in each sample into "Rare", "Undetermined" or "Abundant". Other classifications are allowed.
#'
#' @details
#' **Overview**
#'
#' Function to cluster species abundance with the Partitioning Around Medoids (PAM)
#' algorithm (Kaufman and Rousseeuw, 1991).
#' By default, we propose the division into three clusters (k = 3),
#' which can have the convenient description of: "rare", "undetermined" and "abundant".
#' The phylogenetic units from the cluster with the lowest median abundance are considered
#' to be the "rare biosphere".
#'
#' @details
#' **The classification vector**
#'
#' The classification vector (argument classification_vector) represents the
#' different clusters to be used, by ascending order of median abundance.
#' To change the number of clusters, change the number of elements in the
#' classification vector, **but order matters!** Depending on the number of clusters used,
#' you can change the meaning that best applies to your research.
#'
#' For example, you can use a classification vector with the designations: "very rare", "rare",
#' "abundant" and "very abundant", which would apply k = 4 under the hood.
#' It is possible to use any number of clusters, as long as it is between 2 and
#' the maximum possible k.
#'
#' The maximum possible k is the number of different abundance scores observed in a single sample.
#' Note, however, that we do not recommend any clustering with k > 10,
#' and we also do not recommend k = 2 (we explain this in more detail in Pascoal et al., 2023,
#' and in the vignette `vignette("explore-classifications")`).
#'
#' @details
#' **Automatic selection of the number of clusters**
#'
#' To automatically select the number of clusters (i.e., the value of k), set the argument **automatic = TRUE**. For details on
#' complete automation of [define_rb()], please see the documentation for [suggest_k()]. Briefly, the k with the best average Silhouette score
#' is selected from a range of k values between 3 and 10. It is also possible to select k based on other indices ("Davies-Bouldin" or "Calinski-Harabasz").
#'
#' If you want a more fine-grained analysis of k values, we provide several functions:
#' - [evaluate_k()];
#' - [check_avgSil()];
#' - [check_DB()];
#' - [check_CH()].
#'
#' @details
#' **Verify clustering results**
#'
#' If half of the taxa of any cluster have a Silhouette score below 0.5 in any sample, then a warning is issued.
#' The warning reports the number of times this issue occurred.
#' You can inspect alternatives to reduce the occurrences of bad clustering,
#' but it is possible that, in some situations, you just can't find an optimal clustering.
#'
#' The detailed output gives you access to all of the clustering results:
#' - `pam_object` is a list with the original results from the k-medoids clustering,
#'   see the [cluster::pam()] documentation.
#' - `Level` is an integer indicating the specific cluster attributed by the [cluster::pam()]
#'   function to each observation. Its order is random.
#' - `Silhouette_scores` provides the Silhouette score obtained for each
#'   observation, i.e. a score for each taxon.
#' - `Cluster_median_abundance` provides the median taxon abundance of each cluster.
#' - `median_Silhouette` provides the median Silhouette score obtained for each cluster.
#' - `Evaluation` indicates if the Silhouette score obtained for a given observation
#'   is below the median Silhouette of its cluster and sample.
#'
#' You can make your own plots and analysis, but we also provide another function,
#' [plot_ulrb()], which illustrates the results obtained.
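#'
#' For instance, the detailed pam results can also be inspected directly on a
#' fitted pam object (a minimal sketch on a toy abundance vector; the values are
#' illustrative only, not taken from any package dataset):
#'
#' ```r
#' fit <- cluster::pam(c(1, 2, 3, 100, 110), k = 2)
#' fit$clustering       # cluster membership of each observation
#' fit$silinfo$widths   # per-observation Silhouette scores
#' ```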
#'
#' @details
#' **Partition around medoids (pam)**
#'
#' To calculate k-medoids, we used the Partitioning Around Medoids (PAM)
#' algorithm, which is described in Chapter 2 of "Finding Groups in Data: An Introduction to Cluster Analysis"
#' (Kaufman and Rousseeuw, 1991) and implemented in the cluster package by the [cluster::pam()] function.
#'
#' Briefly, the pam algorithm is divided into two main phases: **build** and **swap**.
#'
#' The first phase (**build**) selects k observations as cluster representatives. The first
#' observation selected as representative is the one that minimizes the sum of the dissimilarities to the
#' remaining observations. The second, third and so on repeat the same process, until k clusters have
#' been formed.
#'
#' The **build** steps are:
#'
#' 1 - Propose a candidate representative, an observation \eqn{i} that has not been selected yet
#'
#' 2 - For another observation, \eqn{j}, let \eqn{D_j} be the dissimilarity between \eqn{j} and its most similar
#' selected representative; and let \eqn{d(j,i)} be the dissimilarity between \eqn{j}
#' and the candidate \eqn{i}
#'
#' 3 - If \eqn{D_j > d(j,i)}, then \eqn{j} contributes to the selection of \eqn{i}; its contribution is:
#' \deqn{C_{ji} = max(D_j - d(j,i), 0)}
#'
#' 4 - Calculate the total gain obtained by \eqn{i}, \deqn{\sum_{j}C_{ji}}
#'
#' 5 - From all possible candidates, select the one that maximizes the previous total gain,
#' \deqn{max_i \sum_jC_{ji}}
#'
#' 6 - Repeat until k observations have been selected as cluster representatives.
#'
#' The purpose of the next phase, **swap**, is to improve the representatives
#' of the clusters. The principle is to swap cluster representatives with
#' non-representative observations and calculate the total sum of dissimilarities between each
#' observation and the closest medoid. The swapping continues until no further improvement is
#' possible, i.e., when the minimum sum of dissimilarities of the clusters is reached.
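#'
#' As a small worked example of steps 2 and 3 (with illustrative numbers): if an
#' observation \eqn{j} is at dissimilarity \eqn{D_j = 5} from its closest selected
#' representative and at \eqn{d(j,i) = 2} from a candidate \eqn{i}, then \eqn{j}
#' contributes \eqn{C_{ji} = max(5 - 2, 0) = 3} to the gain of \eqn{i}; if instead
#' \eqn{d(j,i) = 7}, its contribution would be \eqn{max(5 - 7, 0) = 0}.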
#'
#'
#' @details
#' **Notes**:
#'
#' Note that the **ulrb** package considers each sample as an independent
#' community of phylogenetic units, which means clustering is also independent across
#' different samples.
#' Thus, be aware that you will have clustering results and metrics for each
#' single sample,
#' which is why we also provide functions to analyze results across
#' any number of samples (see: [plot_ulrb()] for clustering results
#' and [evaluate_k()] for k selection).
#'
#'
#' @param data A tibble with, at least, a column for Abundance and Sample. Additional columns are allowed.
#' @param classification_vector A vector of strings with the names for each cluster, from lower to higher abundance. Default is c("Rare", "Undetermined", "Abundant").
#' @param samples_col String with name of column with sample names.
#' @param abundance_col String with name of column with abundance values.
#' @param simplified Logical (TRUE/FALSE). The default (FALSE) provides additional columns with detailed [cluster::pam()] results
#' and Silhouette scores. If TRUE, only the Classification column is added to the original input data.
#' @param automatic By default (FALSE), the function assumes a classification into "Rare", "Undetermined" or "Abundant". If TRUE, it will automatically select the number of classifications (k),
#' based on the index argument.
#' @inheritParams suggest_k
#'
#' @returns The input data.frame with extra columns containing the classification and additional metrics (if simplified = FALSE).
#' @seealso [suggest_k()], [evaluate_k()], [plot_ulrb()], [cluster::pam()]
#' @export
#'
#' @references Kaufman, L., & Rousseeuw, P. J. (1991). Chapter 2 of Finding Groups in Data: An Introduction to Cluster Analysis. Biometrics, 47(2), 788.
#'
#'
#' @examples
#' \donttest{
#' library(dplyr)
#' # Sample ID's
#' sample_names <- c("ERR2044662", "ERR2044663", "ERR2044664",
#'                   "ERR2044665", "ERR2044666", "ERR2044667",
#'                   "ERR2044668", "ERR2044669", "ERR2044670")
#'
#' # If data is in wide format, with samples in cols
#' nice_tidy <- prepare_tidy_data(nice,
#'                                sample_names = sample_names,
#'                                samples_in = "cols")
#'
#' # Straightforward with tidy format
#' define_rb(nice_tidy)
#'
#' # Closer look
#' classified_table <- define_rb(nice_tidy)
#' classified_table %>%
#'   select(Sample, Abundance, Classification) %>%
#'   head()
#'
#'
#' # Automatic decision, instead of a predefined definition
#' define_rb(nice_tidy, automatic = TRUE) %>% select(Sample, Abundance, Classification)
#'
#' # Automatic decision, using Davies-Bouldin index,
#' # instead of average Silhouette score (default)
#' define_rb(nice_tidy, automatic = TRUE, index = "Davies-Bouldin") %>%
#'   select(Sample, Abundance, Classification)
#'
#' # User defined classifications
#' user_classifications <- c("very rare",
#'                           "rare",
#'                           "undetermined",
#'                           "abundant",
#'                           "very abundant")
#'
#' define_rb(nice_tidy, classification_vector = user_classifications) %>%
#'   select(Sample, Abundance, Classification)
#'
#' # Easy to incorporate in big pipes
#' # Remove Archaea
#' # Remove taxa below 10 reads
#' # Classify according to a different set of classifications
#' nice_tidy %>%
#'   filter(Domain != "sk__Archaea") %>%
#'   filter(Abundance > 10) %>%
#'   define_rb(classification_vector = c("very rare",
#'                                       "rare",
#'                                       "abundant",
#'                                       "very abundant")) %>%
#'   select(Sample, Abundance, Classification)
#'
#' # An example that summarises results
#' nice_tidy %>%
#'   filter(Domain != "sk__Archaea") %>%
#'   filter(Abundance > 10) %>%
#'   define_rb(classification_vector = c("very rare",
#'                                       "rare",
#'                                       "abundant",
#'                                       "very abundant")) %>%
#'   select(Sample, Abundance, Classification) %>%
#'   group_by(Sample, Classification) %>%
#'   summarise(totalAbundance = sum(Abundance))
#'}
#' @import dplyr
#' @importFrom rlang .data
#'
define_rb <- function(data,
                      classification_vector = c("Rare", "Undetermined", "Abundant"),
                      samples_col = "Sample",
                      abundance_col = "Abundance",
                      simplified = FALSE,
                      automatic = FALSE,
                      index = "Average Silhouette Score",
                      ...) {
  # If automatic, use suggest_k()
  if (isTRUE(automatic)) {
    message("Automatic option set to TRUE, so the classification vector was overwritten.")
    automatic_k <- suggest_k(data, index = index, ...)
    classification_vector <- seq_len(automatic_k)
    message(paste0("k = ", automatic_k, " based on ", index, "."))
  }
  # Define the number of clusters based on the possible classifications
  k <- length(classification_vector)
  # Match samples_col and abundance_col with Sample and Abundance, respectively
  data <-
    data %>%
    rename(Sample = all_of(samples_col),
           Abundance = all_of(abundance_col))
  # Calculate k-medoids
  ## Apply the clustering algorithm to each sample
  clustered_data <-
    data %>%
    filter(.data$Abundance > 0, !is.na(.data$Abundance)) %>%
    group_by(.data$Sample, .add = TRUE) %>%
    tidyr::nest() %>%
    mutate(pam_object = purrr::map(.x = data,
                                   .f = ~cluster::pam(.x$Abundance,
                                                      k = k,
                                                      diss = FALSE))) %>%
    # Extract cluster membership and per-observation Silhouette scores;
    # silinfo$widths is sorted by cluster and decreasing width, so restore
    # the original observation order before unnesting.
    mutate(Level = purrr::map(.x = .data$pam_object,
                              .f = ~.x$clustering),
           Silhouette_scores = purrr::map(.x = .data$pam_object,
                                          .f = ~{
                                            widths <- .x$silinfo$widths
                                            widths[order(as.numeric(rownames(widths))), "sil_width"]
                                          })) %>%
    tidyr::unnest(cols = c("data", "Level", "Silhouette_scores"))
  # Make the classification table
  classification_table <- clustered_data %>%
    group_by(.data$Sample, Level = as.factor(.data$Level), .add = TRUE) %>%
    summarize(Cluster_median_abundance = stats::median(.data$Abundance),
              .groups = "drop_last") %>%
    arrange(.data$Sample, .data$Cluster_median_abundance) %>%
    mutate(Classification = factor(classification_vector, levels = classification_vector))
  # Apply classification_table to classify the clusters
  classified_clusters <- clustered_data %>%
    mutate(Level = as.factor(.data$Level)) %>%
    left_join(classification_table, by = c("Sample", "Level"))
  # Evaluate clustering quality per Sample and Classification
  classified_clusters <-
    classified_clusters %>%
    group_by(.data$Sample, .data$Classification) %>%
    tidyr::nest() %>%
    mutate(median_Silhouette = purrr::map(.x = data,
                                          .f = ~stats::median(.x$Silhouette_scores))) %>%
    mutate(Evaluation = purrr::map(.x = .data$median_Silhouette,
                                   .f = ~case_when(.x > 0.9 ~ "Very good",
                                                   .x > 0.75 ~ "Good",
                                                   .x > 0.5 ~ "Sufficient",
                                                   TRUE ~ "Bad"))) %>%
    tidyr::unnest(cols = c(data, median_Silhouette, Evaluation))
  ## The next steps only check whether a warning is necessary; the output is classified_clusters
  clusters_report <- classified_clusters %>%
    select("Sample", "Classification", "median_Silhouette", "Evaluation") %>%
    distinct() %>%
    group_by(.data$Classification) %>%
    count(.data$Evaluation)
  # Count the number of sample-classification pairs with bad scores
  bad_samples <- clusters_report %>%
    filter(.data$Evaluation == "Bad") %>%
    pull(.data$n) %>%
    sum()
  if (bad_samples > 0) {
    warning(paste(bad_samples, "sample-classification pairs got a bad Silhouette score. Consider changing the number of classifications."))
    message("If half the observations within a classification are below a 0.5 Silhouette score, we consider the clustering 'Bad'.")
    message("Check the 'Evaluation' column for more details.")
  }
  # Option to simplify the output
  if (isTRUE(simplified)) {
    # Remove the detailed columns, keeping only the Classification
    # (the original result was discarded here; it must be reassigned)
    classified_clusters <- classified_clusters %>%
      select(-"Level", -"pam_object",
             -"Evaluation", -"median_Silhouette",
             -"Silhouette_scores",
             -"Cluster_median_abundance")
  }
  return(classified_clusters)
}
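
# Outside the package context, the core step of define_rb() can be reproduced
# in a few lines: k-medoids on one sample's abundances, then labelling clusters
# by ascending median abundance. This is a standalone illustration, not part of
# the package source; it assumes only the cluster package, and the toy
# abundance vector is invented for the example.

```r
library(cluster)

# Toy abundance vector for a single sample (illustrative values)
abundance <- c(1, 2, 2, 3, 150, 155, 160, 9000, 9500)

# k-medoids with k = 3, mirroring the default classification vector
pam_fit <- pam(abundance, k = 3, diss = FALSE)

# Median abundance of each cluster, then map ascending ranks to labels
cluster_medians <- tapply(abundance, pam_fit$clustering, median)
labels <- c("Rare", "Undetermined", "Abundant")
classification <- labels[rank(cluster_medians)][pam_fit$clustering]

classification
```

The `rank()` step is what makes the labelling robust: pam's cluster ids are
arbitrary, so clusters are ordered by median abundance before labels are
attached, just as the classification table does above.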