How to use k-medoids? #36

waTeim · 2015-01-28T19:12:16Z

The documentation for k-medoids requires a cost matrix C, and parameter k, the number of clusters. But C must be a kxm matrix, so k can be inferred from C, why are both necessary? Also the matlab version doesn't require C as input at all. And also, C must be re-calculated whenever a new candidate medoid is selected, I don't see any hooks to allow this, is it possible this is half completed, and does't do step 5 like described in the wiki? Or maybe it's expected that step 5 is done outside.

johnmyleswhite · 2015-02-02T15:48:44Z

I think @lendle may be the main person who knows about this code.

lendle · 2015-02-02T17:24:00Z

The k-medoids code was rewritten by @cyocum in #22.

lendle · 2015-02-02T17:39:42Z

I just looked at the implementation and docs. C is actually an n x n matrix (not k x n).

The docs say "C – The cost matrix, where C[i,j] is the cost of assigning sample j to the medoid i" is a bit unclear. The ith row does correspond to the ith medoid for i = 1, ..., k, but corresponds to the cost associated with assigning each sample to a medoid defined by sample i for i = 1, ..., n.
@waTeim, would "C – The cost matrix, where C[i,j] is the cost of assigning sample j to a cluster with medoid sample i" be more clear?

waTeim · 2015-02-02T18:35:04Z

I don't think so. Are you sure because that approach has a lot of problems. It's not really the algorithm as documented in Wikipedia, it should be kxn and be re-calculated every iteration because the medoids can change. If it's nxn then that becomes unusable because take for instance how I was going to use it -- 90,000 data points gives rise to a 90000x90000 matrix which uses up more memory that the machine I have to run it on (40 GB RAM), and frankly that's kinda small compared to todays dataset sizes.

lindahua · 2015-03-09T07:44:14Z

The current approach takes as input a pre-computed pairwise cost matrix. When n is very large, one should use a different algorithm.

waTeim · 2015-03-10T02:17:40Z

Yea, that's more efficient if the number of rows^2 is storable, but in this case, nope. And like I said, this isn't even the largest of the datasets. I was looking around for a simple implementation of PAM to submit as a PR so there there would be options, but am stuck on the "compare cost of each non-medioid with a mediod cost and swap if lower" seems too inefficient to implement directly like that.

kingzbauer · 2016-04-29T15:11:04Z

This might be related. I keep getting this error when trying to run kmedoids

julia> kmedoids(mat, 8)
ERROR: AssertionError: !(isempty(grp))
 in _find_medoid at /root/.julia/v0.4/Clustering/src/kmedoids.jl:189
 in _kmedoids! at /root/.julia/v0.4/Clustering/src/kmedoids.jl:100
 in kmedoids at /root/.julia/v0.4/Clustering/src/kmedoids.jl:39

mat is a distance matrix.

diegozea · 2016-05-06T20:16:50Z

@waTeim If your matrix is symmetric, maybe https://github.com/diegozea/PairwiseListMatrices.jl can be useful to you.

alyst · 2023-03-24T21:03:17Z

This is an old question, that does not apply to the new distance-based API.

alyst added the question label Mar 28, 2019

alyst closed this as completed Mar 24, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How to use k-medoids? #36

How to use k-medoids? #36

waTeim commented Jan 28, 2015

johnmyleswhite commented Feb 2, 2015

lendle commented Feb 2, 2015

lendle commented Feb 2, 2015

waTeim commented Feb 2, 2015

lindahua commented Mar 9, 2015

waTeim commented Mar 10, 2015

kingzbauer commented Apr 29, 2016

diegozea commented May 6, 2016

alyst commented Mar 24, 2023

How to use k-medoids? #36

How to use k-medoids? #36

Comments

waTeim commented Jan 28, 2015

johnmyleswhite commented Feb 2, 2015

lendle commented Feb 2, 2015

lendle commented Feb 2, 2015

waTeim commented Feb 2, 2015

lindahua commented Mar 9, 2015

waTeim commented Mar 10, 2015

kingzbauer commented Apr 29, 2016

diegozea commented May 6, 2016

alyst commented Mar 24, 2023