Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to use k-medoids? #36

Closed
waTeim opened this issue Jan 28, 2015 · 9 comments
Closed

How to use k-medoids? #36

waTeim opened this issue Jan 28, 2015 · 9 comments
Labels

Comments

@waTeim
Copy link

waTeim commented Jan 28, 2015

The documentation for k-medoids requires a cost matrix C, and parameter k, the number of clusters. But C must be a kxm matrix, so k can be inferred from C, why are both necessary? Also the matlab version doesn't require C as input at all. And also, C must be re-calculated whenever a new candidate medoid is selected, I don't see any hooks to allow this, is it possible this is half completed, and does't do step 5 like described in the wiki? Or maybe it's expected that step 5 is done outside.

@johnmyleswhite
Copy link
Member

I think @lendle may be the main person who knows about this code.

@lendle
Copy link
Contributor

lendle commented Feb 2, 2015

The k-medoids code was rewritten by @cyocum in #22.

@lendle
Copy link
Contributor

lendle commented Feb 2, 2015

I just looked at the implementation and docs. C is actually an n x n matrix (not k x n).

The docs say "C – The cost matrix, where C[i,j] is the cost of assigning sample j to the medoid i" is a bit unclear. The ith row does correspond to the ith medoid for i = 1, ..., k, but corresponds to the cost associated with assigning each sample to a medoid defined by sample i for i = 1, ..., n.
@waTeim, would "C – The cost matrix, where C[i,j] is the cost of assigning sample j to a cluster with medoid sample i" be more clear?

@waTeim
Copy link
Author

waTeim commented Feb 2, 2015

I don't think so. Are you sure because that approach has a lot of problems. It's not really the algorithm as documented in Wikipedia, it should be kxn and be re-calculated every iteration because the medoids can change. If it's nxn then that becomes unusable because take for instance how I was going to use it -- 90,000 data points gives rise to a 90000x90000 matrix which uses up more memory that the machine I have to run it on (40 GB RAM), and frankly that's kinda small compared to todays dataset sizes.

@lindahua
Copy link
Contributor

lindahua commented Mar 9, 2015

The current approach takes as input a pre-computed pairwise cost matrix. When n is very large, one should use a different algorithm.

@waTeim
Copy link
Author

waTeim commented Mar 10, 2015

Yea, that's more efficient if the number of rows^2 is storable, but in this case, nope. And like I said, this isn't even the largest of the datasets. I was looking around for a simple implementation of PAM to submit as a PR so there there would be options, but am stuck on the "compare cost of each non-medioid with a mediod cost and swap if lower" seems too inefficient to implement directly like that.

@kingzbauer
Copy link

This might be related. I keep getting this error when trying to run kmedoids

julia> kmedoids(mat, 8)
ERROR: AssertionError: !(isempty(grp))
 in _find_medoid at /root/.julia/v0.4/Clustering/src/kmedoids.jl:189
 in _kmedoids! at /root/.julia/v0.4/Clustering/src/kmedoids.jl:100
 in kmedoids at /root/.julia/v0.4/Clustering/src/kmedoids.jl:39

mat is a distance matrix.

@diegozea
Copy link

diegozea commented May 6, 2016

@waTeim If your matrix is symmetric, maybe https://github.com/diegozea/PairwiseListMatrices.jl can be useful to you.

@alyst alyst added the question label Mar 28, 2019
@alyst
Copy link
Member

alyst commented Mar 24, 2023

This is an old question, that does not apply to the new distance-based API.

@alyst alyst closed this as completed Mar 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

7 participants