
Feature request: fix/constrain cluster assignment in anticlustering() #46

Closed

uhkeller opened this issue Dec 10, 2022 · 11 comments

@uhkeller

For my application of anticlust, it would be very useful if the assignment of individual elements to clusters could be fixed or constrained a priori in anticlustering(). Instead of considering all K clusters for a constrained element, the algorithm would consider only a specific subset of clusters.

My use case is the assignment of versions of a psychological test to school classes during field testing. A small subset of classes has asked to use, or not to use, a specific version; I still want to balance the covariates (averaged student characteristics) between versions across all classes while taking these constraints into account.

A list of possible cluster memberships would be a straightforward way of specifying the constraints. Empty (NULL) list elements could denote unconstrained cluster selection. For example, with four elements and three clusters, the following list would specify unconstrained cluster selection for elements 1 and 2, constrain element 3 to cluster 2, and allow only clusters 2 or 3 for element 4:

list(
  NULL,       # unconstrained assignment for element 1
  c(1, 2, 3), # unconstrained assignment for element 2 (since we only have 3 clusters)
  2,          # element 3 fixed to cluster 2
  c(2, 3)     # element 4 constrained to clusters 2 or 3
)

Maybe this is already possible somehow, but I was unable to figure out how. Of course, it may also well be that this is not possible to implement for some reason. But I still thought it worthwhile to signal that there is demand for this feature (if only from me…).

Finally, thank you for the anticlust package!

@m-Py
Owner

m-Py commented Dec 12, 2022

Hi! I think there is quite a lot to be said about this ;)

First, thanks for sharing your application; it is genuinely useful to know which features would be helpful in anticlust, and the feature you describe, as well as its implementation as an argument, makes sense to me.

Second, in short: There probably will not be a feature that implements your request in the near future. However, if you share more about your concrete data set, we might be able to find a solution (using the available anticlust features) that works for that particular problem, but maybe not in general.

Third, some background: inducing constraints on cluster membership is definitely something I will be looking into more in the future. There has been research on this topic in cluster analysis, e.g., on inducing "must link" constraints (these items must be in the same cluster) or "cannot link" constraints (these items cannot be in the same cluster). I believe considering such constraints makes a lot of sense in the context of anticlustering (maybe even more than in cluster analysis). I even "hacked" an implementation for "must link" constraints, which currently resides on a development branch of this repository. (I think it worked quite well, but I have not yet inserted the functionality into the anticlustering() function; it is also quite a hacky heuristic that I would need to document a little better...) This must-link approach differs from the input format you describe and solves a somewhat different problem, but I assume both approaches share the same "problem", which complicates a general implementation in anticlust: Specifying constraints in the way you describe may lead to a conflict that cannot be resolved, and even checking whether the constraints can be satisfied at all is computationally infeasible for large data sets. (I guess the biggest problem is finding an initial cluster assignment that satisfies all of the constraints, which sometimes may not be possible.) So, working out a feature as you describe would take some work, and I would also like to do some theoretical work on this topic, which will take some time.

That said, there are already ways to induce / respect certain constraints on cluster memberships by combining the arguments K and categories in a "clever" way: K can also be a vector of length N and describe the cluster membership of all elements before the anticlustering optimization starts; categories describes which elements serve as exchange partners for each other during the optimization. So if an item that starts in cluster 1 is only swapped with elements in clusters 1, 2 and 3, it will remain in one of these groups (provided that none of these other items is swapped with items in cluster 4...). This is not as much as your feature would allow, but maybe it is enough for a concrete application. If you have an example data set, I should be able to check whether such an approach would work for you.
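
A minimal (untested) sketch of that idea, using made-up data and a simplified constraint where the first element is simply fixed to its initial cluster (the variable names and values here are invented for illustration):

library(anticlust)

N <- 10
x <- matrix(rnorm(N * 2), ncol = 2)
init <- c(1, sample(rep(1:2, length.out = N - 1)))  # element 1 starts in cluster 1
partners <- c(1, rep(2, N - 1))                     # element 1 gets a unique category, so it is never swapped
res <- anticlustering(x, K = init, categories = partners)
res[1]  # element 1 remains in cluster 1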

I am leaving this issue open because I intend to work on combining cluster membership constraints with anticlustering in the future. I am not sure whether the argument(s) in the final implementation will work exactly as you describe, but it is possible.

@uhkeller
Author

Thank you for this thoughtful response! I'm looking forward to seeing what you come up with.

I've tried to come up with something to solve my problem along the lines you outlined (a K vector in conjunction with clever categories), but evidently I'm not clever enough…

Thanks so much for your generous offer to give it a try. If you don't find the time, never mind; I'll find some other way that is "good enough".

Here are some fake data that look very much like my real data. There should be 3 clusters; the first element should be fixed to cluster 3, and the second can be assigned to either cluster 2 or 3. All others can be assigned freely. pct_boys, pct_natlang and n_students should be balanced across clusters.

N <- 100
classes <- 
  data.frame(
    pct_boys = rnorm(N, 50, 3),
    pct_natlang = rnorm(N, 60, 5),
    n_students = rbinom(N, 30, .5),
    allowed_clusters = c("3", "2/3", rep("all", N - 2))
  )

@m-Py
Owner

m-Py commented Dec 13, 2022

I think this problem could be solved by using a combination of the arguments K and categories to restrict the group membership of the first two elements. But before I try out any code, I have a question for clarification: Is this even a restriction? I think you could just use "normal" anticlustering and then rename the groups. The group label of the first item is renamed to "3", and the group label of the second item becomes either "2" or "3", depending on whether the two items are assigned to the same group. This code would do the renaming consistently with your requirement:

library(anticlust)
clusters <- anticlustering(classes[,-4], K = 4, objective = "kplus", standardize = TRUE)
clusters
#>  [1] 1 2 1 3 3 2 1 2 2 4 2 1 4 4 3 2 3 1 2 3 1 3 3 2 4 4 2 3 2 2 1 4 4 3 3 2 4 1 1 2 4 2 1 1 2 4 3 2 2
#> [50] 4 3 1 4 3 2 2 1 3 1 1 4 1 4 4 3 1 1 3 3 1 1 4 4 4 4 3 3 1 3 3 3 2 1 4 4 2 3 1 4 4 1 3 4 3 2 2 1 4
#> [99] 2 2

cluster_levels_ <- c(unique(clusters[1:2]), setdiff(1:4, clusters[1:2]))

new_cluster_labels <- as.numeric(as.character(
  factor(clusters, levels = cluster_levels_, labels = c(3, 2, 1, 4))
))
new_cluster_labels
#>  [1] 3 2 3 1 1 2 3 2 2 4 2 3 4 4 1 2 1 3 2 1 3 1 1 2 4 4 2 1 2 2 3 4 4 1 1 2 4 3 3 2 4 2 3 3 2 4 1 2 2
#> [50] 4 1 3 4 1 2 2 3 1 3 3 4 3 4 4 1 3 3 1 1 3 3 4 4 4 4 1 1 3 1 1 1 2 3 4 4 2 1 3 4 4 3 1 4 1 2 2 3 4
#> [99] 2 2

But maybe I am missing something.

@uhkeller
Author

You're right, this works for the example I provided, and in fact it also works for the real data I'm currently working with. However, the data are not complete yet, and classes might well be added with constraints that can't be dealt with in this way. In my example, if the allowed cluster for item 1 had been 1 instead of 3, the approach would not have worked anymore, right?
For example, let's say the data look like this (it is not unlikely that the real data will end up being similar):

N <- 100
classes <- 
  data.frame(
    pct_boys = rnorm(N, 50, 3),
    pct_natlang = rnorm(N, 60, 5),
    n_students = rbinom(N, 30, .5),
    allowed_clusters = c("1", "2/3", "1/3", "2", "3", "1/2", rep("all", N - 5))
  )

@m-Py
Owner

m-Py commented Dec 13, 2022

In this case, I would randomly assign the cluster memberships at the beginning and then use the categories argument to ensure that items with constraints are not swapped to a different cluster. If you have more items with the same constraint (here, each "type of constraint" occurs only once), they should all have the same value in categories.

Here is example code showing how to solve the present problem, ensuring that the constraints are met:

library(anticlust)

N <- 100
classes <- 
  data.frame(
    pct_boys = rnorm(N, 50, 3),
    pct_natlang = rnorm(N, 60, 5),
    n_students = rbinom(N, 30, .5),
    allowed_clusters = c("1", "2/3", "1/3", "2", "3", "1/2", rep("all", N - 6))
  )


# create random initial assignments for the restricted items:
initial_clusters <- sapply(strsplit(classes$allowed_clusters, "/"), sample, size = 1)
initial_clusters <- as.numeric(initial_clusters[initial_clusters != "all"])
N_restricted_items <- length(initial_clusters)

# add random clusters for the other items, ensuring each group has the same number of elements:
K <- 4
remaining_N_per_cluster <- rep(N/K, K) - c(table(initial_clusters), 0) # the 0 pads cluster 4, which no constraint mentions; this is hard coded and may not work in every case
remaining_clusters <- sample(rep(1:K, remaining_N_per_cluster))

# combine
initial_clusters <- c(initial_clusters, remaining_clusters)
# check group sizes:
table(initial_clusters)

# if there is more than one item per type of constraint, these should have the same value in `categories`
categories <- c(1:6, rep(7, N-6))

clusters <- anticlustering(
  classes[,-4], 
  K = initial_clusters, 
  objective = "kplus", 
  standardize = TRUE,
  categories = categories
)
table(clusters)

# check out the first six items:
classes$allowed_clusters[1:6]
clusters[1:6]
initial_clusters[1:6]

# check quality of solution:
mean_sd_tab(classes[,-4], clusters)

For your "actual" data some of the code might need to be changed, in particular if you have more than one item per constraint. Moreover, in some cases the variables are hard coded (sometimes I just use "6" because in this particular example six items are constrained).

@m-Py
Owner

m-Py commented Dec 13, 2022

Let me elaborate on the use of the arguments K and categories here. This combination of K and categories has not been well documented (yet), so at least some words in this issue might help users:

The "standard" anticlustering algorithm (method = "exchange" or method = "local-maximum") works by first initializing a (random) cluster assignment. It is possible users to specify this initial assignment themselves, as I did in my code above with the argument K. Even though K usually means the number of clusters, in anticlust, K is generally used to describe how the clusters should look like; so K can be [a] the number of clusters, [b] the size of each cluster, or [c] an assignment of items to clusters. After the clusters have been initialized, an exchange procedure is executed where items are swapped between clusters. The categories argument restricts which items serve as exchange partners for each other. Items that have the same value in categories are considered exchange partners; items that have different values are not exchange partners. In my code above, the six restricted items have a unique value in categories, which effectively means they are never changed with another item and therefore remain in the cluster that they were initialized with. All of the other items however are exchanged with one another, because they all have the same value (here "7") in categories. Thus, we restrict the cluster membership of the first six items while all of the other items can join any cluster they like.

In this particular example, I think the biggest difficulty is setting up the initial clustering while respecting the number of groups and the size of each group. Note that groups may also be of different sizes.
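
As a minimal sketch of the three forms of K described above (made-up data; only the behaviour already described in this thread is assumed):

library(anticlust)

x <- matrix(rnorm(30 * 2), ncol = 2)

anticlustering(x, K = 3)                      # [a] number of clusters
anticlustering(x, K = c(10, 10, 10))          # [b] size of each cluster
anticlustering(x, K = sample(rep(1:3, 10)))   # [c] initial assignment of all 30 items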

@uhkeller
Author

uhkeller commented Jan 2, 2023

Sorry for the long delay, and thanks a lot for the code and the explanation. The documentation on K and categories was clear enough I think, but my stressed pre-holiday brain failed to put two and two together. The approach worked perfectly.

@m-Py
Owner

m-Py commented Feb 6, 2023

Okay, thanks for your feedback! I will close this issue here; if you have additional requirements regarding cluster membership restrictions, feel free to reuse it.

@m-Py m-Py closed this as completed Feb 6, 2023
@tomiko20

tomiko20 commented May 7, 2024

(Quoting m-Py's comment of Dec 12, 2022 above, on "must link" / "cannot link" constraints and the K/categories workaround.)

Hello. For my application of anticlust, it would be tremendously helpful to use a "must link" constraint -- is this available? My use case involves the assignment of hundreds of tissue samples to batches (anticlusters) of 16. The batches (anticlusters) should be balanced on features such as patient demographics and condition severity.
A given patient can have 1 or more (up to 8) samples. As all samples from a particular patient should be assigned to the same batch (anticluster), the "must link" constraint could ensure this. Or is there another way to make sure a group of samples is assigned to the same (anti-)cluster? Thank you for this package, and for any help!

@m-Py
Owner

m-Py commented May 7, 2024

Hi Tomiko,

there is no dedicated argument (at least not yet) for inducing must-link constraints in anticlustering(). I assume what you wish for might be possible using the current tools, but some manual data preparation will be necessary. I have been testing some code, and I think whether your endeavour succeeds strongly depends on the structure of your data. If you could share some data, that would be great. Or at least I would need the following information:

What is the total sample size (N), the number of variables (M), and, most importantly, what is the exact distribution of the patient IDs (i.e., how often each patient occurs in the data set)? If you provide me with this information, I will see what is possible.

Cheers

Martin

@m-Py
Owner

m-Py commented May 7, 2024

Well, this was very interesting for me, so I'll just provide the code I have produced so far (implementing must-link constraints with anticlustering is on my to-do list anyway, so I might as well start now). I guess the code should work, but having the information I asked about in my last post would help even more.

With your data, I guess the strategy involving the arguments K and categories in your quoted post will not work. But another strategy might work, which I first document here, involving the arguments K and x (i.e., the data).

First, you have to specify an initial vector in which all must-link constraints are satisfied. In anticlustering(), this vector is passed to the argument K. You have to create this vector yourself (e.g., using my code below). That means you need a vector whose length equals the number of rows in your data set, where all elements belonging to the same patient have the same number. When this is accomplished, the remaining entries have to be filled in such a way that the size constraints of your application are met (so, 16 samples per cluster).

So, I first generate some fake data that might resemble your use case.

set.seed(1234)
# Generate some example data, N = 128, M = 5
K <- 8
group_sizes <- 16
N <- group_sizes * K # here 128
M <- 5
data <- matrix(rnorm(N * M), ncol = M)

# Generate random patient IDs
ID <- sample(N, replace = TRUE)
while (any(table(ID) > 8)) { # ensure that "at most 8" constraint is fulfilled
  ID <- sample(N, replace = TRUE)
}

table(table(ID)) # most IDs occur only once, the maximum here was five
#> 1  2  3  4  5
#> 46 18  8  3  2

Now, we have to create the vector where all "patients" with the same ID have the same value. This is an example of how I would do it; it is somewhat lengthy code (something like this really should be in an anticlust function and exposed as an argument, but that is not yet done). I use a for loop to accomplish this; it is slightly ugly but should work in most cases, I guess (if it does not work for you, tell me; then your data structure might be more complicated, and this first step is actually the difficult one):

# All data points having the same ID should have the same initial value in the argument `K`:

# Initialize all as NA
init <- rep(NA, N)

# use for loop to satisfy must-link constraints, will not work in all cases (I believe this can
# be solved optimally in polynomial time but I do not have this algorithm right now):
cluster_sizes <- rep(0, K)
multiple_IDs <- as.numeric(names(table(ID)[table(ID) > 1]))

for (current_id in multiple_IDs) {
  random_order_clusters <- sample(K)
  for (k in random_order_clusters) {
    # only fill into cluster if it fits
    if ((cluster_sizes[k] + sum(ID == current_id)) > group_sizes) {
      if (k == random_order_clusters[K]) {
        stop("Sorry, this failed") # can happen
      }
      next
    }
    init[ID == current_id] <- k
    cluster_sizes[k] <- cluster_sizes[k] + sum(ID == current_id)
    break
  }
}
init
#>   [1]  5  3  5 NA  7  5  8  1 NA  8 NA  6  6 NA  6 NA  8  5  8  5 NA  5 NA  1 NA
#>  [26]  6  7 NA NA  5  3  3  7 NA  6 NA  4  8  4  4  1 NA NA NA  3  4  3  5  3  4
#>  [51]  2  8  8  3  3  8  1  3  3 NA  1  8  3  6 NA  1  6 NA  2  5 NA  5  6 NA NA
#>  [76]  3  5  2 NA NA NA NA  5  7 NA NA  8 NA NA  5  4  3  4  3  3 NA  4  1  3  8
#> [101]  6  1  4  1  5 NA  6 NA  4 NA NA  6  7 NA  7  5 NA NA NA NA NA  5 NA NA NA
#> [126] NA NA NA

# validate that the correct number of cases were already assigned:
sum(!is.na(init))  == sum(ID %in% multiple_IDs)
#>  TRUE

Now, we generate a distance matrix and replace the distances between data points that belong to the same patient by a large value, to ensure that the optimization will not split them into different groups by accident.

# Replace distances by large value to ensure must link constraint during optimization
# (uses some internal anticlust functions, but this could also be done quite easily "by hand")
distances <- as.matrix(dist(data)) # important this is a matrix
copy <- distances
copy <- anticlust:::edit_distances(copy, anticlust:::replace_na_by_index(init), sum(distances) + 1)
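
For reference, here is a rough "by hand" alternative to the edit_distances() step (an untested sketch; it starts again from the unedited distance matrix and simply overwrites the pairwise distances among same-patient rows with a large value):

# "By hand" alternative (sketch): start from the raw distance matrix again
copy <- distances
big_value <- sum(distances) + 1
for (i in as.numeric(names(table(ID)[table(ID) > 1]))) {
  rows <- which(ID == i)
  copy[rows, rows] <- big_value  # all pairwise distances among this patient's samples
}
diag(copy) <- 0                  # restore zero self-distances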

Now, we fill the remaining values in init arbitrarily, which is done using the following code:

# assign elements that have no group (unfortunately, this "simple" task is quite difficult in general,
# and generates a lot of boilerplate code)
target_groups <- rep(group_sizes, K)
table_assigned <- data.frame(K = 1:K, size = 0)
df <- as.data.frame(table(init))
df$K <- df$init
df$init <- NULL
together <- merge(table_assigned, df, by = "K", all = TRUE)
table_assigned <- ifelse(is.na(together$Freq), 0, together$Freq)
freq_not_assigned <- target_groups - table_assigned
init[is.na(init)] <- anticlust:::sample_(rep(1:K, freq_not_assigned))
table(init) # what we wanted
#>  1  2  3  4  5  6  7  8
#> 16 16 16 16 16 16 16 16

# validate that must-link constraints were preserved
same <- as.numeric(names(table(ID)[table(ID) > 1]))
for (i in same) {
  stopifnot(all(init[ID == i] == init[ID == i][1]))
}

The init vector can now be used as input for the K argument in anticlustering(), and we use the adjusted distance matrix to ensure that the same-patient data points remain in the same group:

groups <- anticlustering(copy, K = init, method = "local-maximum")

# validate that must-link constraints are still preserved
for (i in same) {
  stopifnot(all(groups[ID == i] == groups[ID == i][1]))
}

We can see from the objective value that the anticlustered version is better than the initial grouping, and I verified that the constraints are still met in the end.

# is optimized solution better than initial solution?
diversity_objective(distances, init)
#> [1] 2911.154
diversity_objective(distances, groups)
#> [1] 2971.869

If you have any questions regarding the code, please tell me, and please tell me if it worked for you.
