Clustering measurement/ clustering error #39

mkcedward · 2017-04-10T03:27:29Z

I recently had both categorical and numeric data to cluster and found that k-prototypes fits my case. I can fit and predict on my data, but I cannot find a good way to identify an "optimal"* number of clusters, because I am unable to extract the "cost" for every centroid/data point.

Looking into the "k_prototypes" function, I found that it returns the "best" centroids but not all centroids. So I tried training the model, predicting on the data, and finding the cost for every single data point. However, the "predict" function returns only the label, not both label and cost. Would the "cost" returned by "_labels_cost" help in this case?

After studying the source code, would it be a good idea to:

1. return "all_centroids" and "all_costs" from the "k_prototypes" function, so that we can get the cost per centroid?
2. return "cost" from the "predict" function, so that we can measure the distance for prediction/classification?
3. return the average cost rather than the total cost? When trying to find an optimal number of clusters, the cost must get smaller as the number of clusters grows, so the total cost cannot indicate whether a given cost is good or not.

*In my current situation, the lowest cost is the optimal result.

nicodv · 2017-04-12T18:18:32Z

Basically, after training there is a .cost_ attribute on the clusterer available for this. This is compatible with how scikit-learn does it, and I'd like to stick to that.

This is documented in the KPrototypes class: https://github.com/nicodv/kmodes/blob/master/kmodes/kprototypes.py#L365

To find the optimal number of clusters, simply run the algorithm in a loop with varying k.

mkcedward · 2017-04-13T03:14:02Z

Regarding finding an optimal number of clusters: the definition of ".cost_" is "sum distance of all points to their respective cluster centroids". I am wondering how I can find the optimal number of clusters based on ".cost_".

For example (an extreme case for easier explanation), I have 10 data points and want to know whether I should cluster them into 2 clusters or 10 clusters.
For 2 clusters: the total distance is 5 for Cluster A and 3 for Cluster B, so the best cost is 3. For 10 clusters: as I only have 10 data points, the cost for every centroid should be 0? So the optimal number of clusters would be 10.

nicodv · 2017-04-13T23:22:41Z

Cost is defined as the sum of dissimilarities of all points with their closest centroids, not as a cost per cluster. (You have to look at the total picture instead of cluster-by-cluster, because if you leave 1 cluster out, the cost of the other changes because the data points would need to get re-assigned.)

In your example, you would get cost=8 in the first scenario and cost=0 in the second. Simply re-run the algorithm across a range of k and see what minimizes cost.
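To make the definition above concrete, here is a minimal pure-Python sketch of a total cost computed as the sum of each point's dissimilarity to its closest centroid. It is not taken from the kmodes source. The helper names (`dissimilarity`, `total_cost`), the toy data, and the `gamma` value are all hypothetical, and the mixed dissimilarity (squared Euclidean distance on numeric values plus `gamma` times the number of categorical mismatches) is an assumption chosen for illustration.

```python
# Illustrative sketch only; not taken from the kmodes source.
# Assumed dissimilarity: squared Euclidean distance on the numeric part
# plus gamma times the number of mismatched categorical values.

def dissimilarity(point, centroid, gamma):
    (p_num, p_cat), (c_num, c_cat) = point, centroid
    numeric = sum((a - b) ** 2 for a, b in zip(p_num, c_num))
    mismatches = sum(a != b for a, b in zip(p_cat, c_cat))
    return numeric + gamma * mismatches


def total_cost(points, centroids, gamma):
    # Each point contributes only its dissimilarity to the closest centroid.
    return sum(min(dissimilarity(p, c, gamma) for c in centroids) for p in points)


# Hypothetical toy data: (numeric values, categorical values) per point.
points = [
    ((0.0,), ("red",)),
    ((1.0,), ("red",)),
    ((10.0,), ("blue",)),
    ((12.0,), ("red",)),
]
gamma = 0.5  # arbitrary weight for the categorical part

two_centroids = [((0.5,), ("red",)), ((11.0,), ("blue",))]
cost_k2 = total_cost(points, two_centroids, gamma)

# With one centroid per point, every point sits on its own centroid.
cost_kn = total_cost(points, points, gamma)

print(cost_k2, cost_k2 / len(points))  # total and average cost for k=2
print(cost_kn)                         # total cost for k=n
```

In this toy case the total cost for k=2 is 3.0 (0.75 per point), and it drops to 0.0 once every point has its own centroid, which matches the cost=0 scenario described above. Dividing by the number of points does not change that ordering, since the number of points is the same for every k.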

nicodv added the question label Apr 12, 2017

nicodv closed this as completed Jun 6, 2017
