-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature request: fix/constrain cluster assignment in anticlustering() #46
Comments
Hi! I think there is quite a lot to be said about this ;) First, thanks for sharing your application, it is indeed interesting to know what features would be useful in Second, in short: There probably will not be a feature that implements your request in the near future. However, if you share more about your concrete data set, we might be able to find a solution (using the available Third, with some background: inducing constraints on cluster membership is definitely something I will more be looking into in the future. There has been research on this topic in cluster analysis, e.g., on inducing "must link" constraints (these items must be in the same cluster) or "cannot link" constraints (these items cannot be in the same cluster). I believe considering such constraints makes a lot of sense in the context of anticlustering (maybe even more than in cluster analysis). I even "hacked" an implementation for "must link" constraints, which currently resides on a development branch on this repository. (I think it worked quite well, but I did not yet insert the functionality into the That said, there are already ways to induce / respect certain constraints on cluster memberships, by combining the arguments I am leaving this issue open because I intend to work on combining cluster membership constraints with anticlustering in the future. I am not sure if in the final implementation, the argument(s) would work as you describe, but it is possible. |
Thank you for this thoughtful response! I'm looking forward to seeing what you come up with. I've tried to think of something to solve my problem along what you outlined: a Thanks so much for your generous offer to give it a try. If you don't find the time never mind, I'll find some other way that is "good enough". Here are some fake data that look very much like my real data. There should be 3 clusters, the first element should be fixed to cluster 3, the second can be assigned to either 2 or 3. All others can be freely assigned. N <- 100
classes <-
data.frame(
pct_boys = rnorm(N, 50, 3),
pct_natlang = rnorm(N, 60, 5),
n_students = rbinom(N, 30, .5),
allowed_clusters = c("3", "2/3", rep("all", N - 2))
) |
I think this problem could be solved by using a combination of the arguments library(anticlust)
clusters <- anticlustering(classes[,-4], K = 4, objective = "kplus", standardize = TRUE)
clusters
#> [1] 1 2 1 3 3 2 1 2 2 4 2 1 4 4 3 2 3 1 2 3 1 3 3 2 4 4 2 3 2 2 1 4 4 3 3 2 4 1 1 2 4 2 1 1 2 4 3 2 2
#> [50] 4 3 1 4 3 2 2 1 3 1 1 4 1 4 4 3 1 1 3 3 1 1 4 4 4 4 3 3 1 3 3 3 2 1 4 4 2 3 1 4 4 1 3 4 3 2 2 1 4
#> [99] 2 2
cluster_levels_ <- c(unique(clusters[1:2]), setdiff(1:4, clusters[1:2]))
new_cluster_labels <- as.numeric(as.character(
factor(clusters, levels = cluster_levels_, labels = c(3, 2, 1, 4))
))
new_cluster_labels
#> [1] 3 2 3 1 1 2 3 2 2 4 2 3 4 4 1 2 1 3 2 1 3 1 1 2 4 4 2 1 2 2 3 4 4 1 1 2 4 3 3 2 4 2 3 3 2 4 1 2 2
#> [50] 4 1 3 4 1 2 2 3 1 3 3 4 3 4 4 1 3 3 1 1 3 3 4 4 4 4 1 1 3 1 1 1 2 3 4 4 2 1 3 4 4 3 1 4 1 2 2 3 4
#> [99] 2 2 But maybe I am missing something. |
You're right, this works for the example I provided, and in fact it also works for the real data I'm currently working with. Though the data are not complete yet and there might well be classes added with constraints that can't be dealt with in this way. In my example, if the allowed cluster for item 1 had been 1 instead of 3, the approach would not have worked any more, right? N <- 100
classes <-
data.frame(
pct_boys = rnorm(N, 50, 3),
pct_natlang = rnorm(N, 60, 5),
n_students = rbinom(N, 30, .5),
allowed_clusters = c("1", "2/3", "1/3", "2", "3", "1/2", rep("all", N - 5))
) |
In this case, I would randomly assign the cluster membership in the beginning and then use the Here is example code how to solve the present problem, ensuring that the constraints are met: library(anticlust)
N <- 100
classes <-
data.frame(
pct_boys = rnorm(N, 50, 3),
pct_natlang = rnorm(N, 60, 5),
n_students = rbinom(N, 30, .5),
allowed_clusters = c("1", "2/3", "1/3", "2", "3", "1/2", rep("all", N - 6))
)
# create random initial assignments for the restricted items:
initial_clusters <- sapply(strsplit(classes$allowed_clusters, "/"), sample, size = 1)
initial_clusters <- as.numeric(initial_clusters[initial_clusters != "all"])
N_restricted_items <- length(initial_clusters)
# add random clusters for the other items, ensuring each group has the same number of elements:
K <- 4
remaining_N_per_cluster <- rep(N/K, K) - c(table(initial_clusters), 0) # 0 is hard coded, may not work in each case
remaining_clusters <- sample(rep(1:K, remaining_N_per_cluster))
# combine
initial_clusters <- c(initial_clusters, remaining_clusters)
# check group sizes:
table(initial_clusters)
# if there is more than one item per type of constraint, these should have the same value in `categories`
categories <- c(1:6, rep(7, N-6))
clusters <- anticlustering(
classes[,-4],
K = initial_clusters,
objective = "kplus",
standardize = TRUE,
categories = categories
)
table(clusters)
# check out the first six items:
classes$allowed_clusters[1:6]
clusters[1:6]
initial_clusters[1:6]
# check quality of solution:
mean_sd_tab(classes[,-4], clusters)
For your "actual" data some of the code might need to be changed, in particular if you have more than one item per constraint. Moreover, in some cases the variables are hard coded (sometimes I just use "6" because in this particular example six items are constrained). |
I might elaborate on the use of the arguments The "standard" anticlustering algorithm ( In this particular example, I think the biggest difficulty is to set up the initial clustering while respecting the group number and the size of each group. Note that groups may also be of different size. |
Sorry for the long delay, and thanks a lot for the code and the explanation. The documentation on |
Okay, thanks for your feedback! I will close this issue here, if you have additional requirements to cluster membership restrictions you may however reuse it if you wish. |
Hello. For my application of anticlust, it would be tremendously helpful to use a "must link" constraint -- is this available? My use case involves the assignment of hundreds of tissue samples to batches (anticlusters) of 16. The batches (anticlusters) should be balanced on features such as patient demographics and condition severity. |
Hi Tomiko, there is no (at least not yet) a dedicated argument for inducing must link constraints in anticlustering. I assume what you wish for might be possible using the current instruments, but some manual data preparation will be necessary. I was testing around some code and I think whether your endeavour is successful strongly depends on the structure of your data. If you could share some data, that would be great. Or at least I would need the following information: What is the total sample size (N), the number of variables (M), and most importantly: what is the exact distribution of the patient IDs (so how often does each patient occur in the data set). If you provide me with this information, I will see what is possible. Cheers Martin |
Well this was very interesting for me, so I just provide you with the code that I produced so far (implementing must-link constraints with anticlustering is on my to do list anyway, might start now as well). I guess the code should work, but having the information I was inquiring about in my last post would help even more. With your data, I guess the strategy involving the arguments First, you have to specify an initial vector where all must-link constraints are satisfied. In So, I first generate some fake data that might resemble your use case. set.seed(1234)
# Generate some example data, N = 128, M = 5
K <- 8
group_sizes <- 16
N <- group_sizes * K # here 128
M <- 5
data <- matrix(rnorm(N * M), ncol = M)
# Generate random patient IDs
ID <- sample(N, replace = TRUE)
while (any(table(ID) > 8)) { # ensure that "at most 8" constraint is fulfilled
ID <- sample(N, replace = TRUE)
}
table(table(ID)) # most IDs occur only once, the maximum here was five
#> 1 2 3 4 5
#> 46 18 8 3 2 Now, we have to create the vector where all "patients" with the same ID have the same value. This is an example of how I would do it, it is somewhat long code (something like this really should be in an # All data points having the same ID should have the same initial value in the argument `K`:
# Initialize all as NA
init <- rep(NA, N)
# use for loop to satisfy must-link constraints, will not work in all cases (I believe this can
# be solved optimally in polynomial time but I do not have this algorithm right now):
cluster_sizes <- rep(0, K)
multiple_IDs <- as.numeric(names(table(ID)[table(ID) > 1]))
for (current_id in multiple_IDs) {
random_order_clusters <- sample(K)
for (k in random_order_clusters) {
# only fill into cluster if it fits
if ((cluster_sizes[k] + sum(ID == current_id)) > group_sizes) {
if (k == random_order_clusters[K]) {
stop("Sorry, this failed") # can happen
}
next
}
init[ID == current_id] <- k
cluster_sizes[k] <- cluster_sizes[k] + sum(ID == current_id)
break
}
}
init
#> [1] 5 3 5 NA 7 5 8 1 NA 8 NA 6 6 NA 6 NA 8 5 8 5 NA 5 NA 1 NA
#> [26] 6 7 NA NA 5 3 3 7 NA 6 NA 4 8 4 4 1 NA NA NA 3 4 3 5 3 4
#> [51] 2 8 8 3 3 8 1 3 3 NA 1 8 3 6 NA 1 6 NA 2 5 NA 5 6 NA NA
#> [76] 3 5 2 NA NA NA NA 5 7 NA NA 8 NA NA 5 4 3 4 3 3 NA 4 1 3 8
#> [101] 6 1 4 1 5 NA 6 NA 4 NA NA 6 7 NA 7 5 NA NA NA NA NA 5 NA NA NA
#> [126] NA NA NA
# validate that the correct number of cases were already assigned:
sum(!is.na(init)) == sum(ID %in% multiple_IDs)
#> TRUE Now, we generate a distance matrix and replace the values between data points that belong to the same patient by a large value to ensure that anticlustering will not place them in the same group by accident. # Replace distances by large value to ensure must link constraint during optimization
# (uses some internal anticlust functions, but this could also be done quite easily "by hand")
distances <- as.matrix(dist(data)) # important this is a matrix
copy <- distances
copy <- anticlust:::edit_distances(copy, anticlust:::replace_na_by_index(init), sum(distances) + 1) Now, we fill the remaining values in # assign elements that have no group (unfortunately, this "simple" task is quite difficult in general,
# and generates a lot of boiler code)
target_groups <- rep(group_sizes, K)
table_assigned <- data.frame(K = 1:K, size = 0)
df <- as.data.frame(table(init))
df$K <- df$init
df$init <- NULL
together <- merge(table_assigned, df, by = "K", all = TRUE)
table_assigned <- ifelse(is.na(together$Freq), 0, together$Freq)
freq_not_assigned <- target_groups - table_assigned
init[is.na(init)] <- anticlust:::sample_(rep(1:K, freq_not_assigned))
table(init) # what we wanted
#> 1 2 3 4 5 6 7 8
#> 16 16 16 16 16 16 16 16
# validate that must-link constraints were preserved
same <- as.numeric(names(table(ID)[table(ID) > 1]))
for (i in same) {
stopifnot(all(init[ID == i] == init[ID == i][1]))
}
The groups <- anticlustering(copy, K = init, method = "local-maximum")
# validate that must-link constraints are still preserved
for (i in same) {
stopifnot(all(groups[ID == i] == groups[ID == i][1]))
} We can see that the anticlustered version is better than the initial grouping through the objective value, and I verified that the constraints are still met in the end. # is optimized solution better than initial solution?
diversity_objective(distances, init)
#> [1] 2911.154
diversity_objective(distances, groups)
#> [1] 2971.869 If you have any questions regarding the code, please tell me, and please tell me if it worked for you. |
For my application of anticlust it would be very useful if assignment of individual elements to clusters could be fixed or constrained a priori in
anticlustering()
. Instead of considering all K clusters for the constrained element, the algorithm would consider only a specific subset of clusters.My use case is the assignment of versions of a psychological test to school classes during field testing. A small subset of classes have asked to use or not use a specific version; I still want to balance the covariates (averaged student characteristics) between versions across all classes taking these constraints into account.
A list of possible cluster memberships would be a straightforward way of specifying the constraints. Empty (NULL) list elements could denote unconstrained cluster selection. For example, with four elements and three clusters, the following list would specify unconstrained cluster selection for elements 1 and 2, constrain element 3 to cluster 2, and allow only clusters 2 or 3 for element 4:
Maybe this is already possible somehow but I was unable to figure out how. Also of course, it may well be that this is not possible to implement for some reason. But I still thought it worthwhile to signal that there is demand for this feature (if only from me…).
Finally, thank you for the anticlust package!
The text was updated successfully, but these errors were encountered: