Clustering measurement/ clustering error #39

Closed
mkcedward opened this issue Apr 10, 2017 · 3 comments

@mkcedward

mkcedward commented Apr 10, 2017

I have both categorical and numeric data to cluster, and k-prototypes fits my case. I can fit the model and predict on my data, but I cannot find a good way to identify an "optimal"* number of clusters, because I am unable to extract the "cost" for every centroid or data point.

Looking into the "k_prototypes" function, I found that it returns only the "best" centroids, not all of them. So I tried training the model, predicting on the data, and computing the cost for every single data point. However, the "predict" function returns only the labels, not both labels and cost. Does the "cost" returned by "_labels_cost" help in this case?

After studying the source code, would it be a good idea to

  1. return "all_centroids" and "all_costs" from the "k_prototypes" function, so that we can get the cost per centroid?
  2. return "cost" from the "predict" function, so that we can measure the distance for prediction/classification?
  3. return the average cost rather than the total cost? When trying to find an optimal number of clusters, the total cost always gets smaller as the number of clusters grows, so it cannot indicate whether a given cost is good or not.

*In my current situation, the lowest cost is the optimal result.

@nicodv
Owner

nicodv commented Apr 12, 2017

Basically, after training there is a .cost_ attribute on the clusterer available for this. This is compatible with how scikit-learn does it, and I'd like to stick to that.

This is documented in the KPrototypes class: https://github.com/nicodv/kmodes/blob/master/kmodes/kprototypes.py#L365

To find the optimal number of clusters, simply run the algorithm in a loop with varying k and compare the resulting .cost_ values.
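A minimal sketch of such a loop, not taken from the kmodes source. The helper name `costs_by_k` and the factory `kprototypes_factory` are hypothetical names introduced here, and the `init`, `n_init` and `random_state` values are arbitrary choices. The only library features assumed are the `KPrototypes` constructor, `fit(X, categorical=...)` and the `.cost_` attribute mentioned above.

```python
def costs_by_k(X, categorical, ks, make_model):
    """Fit one clusterer per k and collect its total cost.

    make_model is a hypothetical factory: it takes k and returns an
    unfitted clusterer that exposes fit(X, categorical=...) and, after
    fitting, a cost_ attribute.
    """
    costs = {}
    for k in ks:
        model = make_model(k)
        model.fit(X, categorical=categorical)
        costs[k] = model.cost_
    return costs


def kprototypes_factory(k):
    # Imported lazily so costs_by_k itself does not require kmodes.
    from kmodes.kprototypes import KPrototypes

    # The init, n_init and random_state values are illustrative assumptions.
    return KPrototypes(n_clusters=k, init="Cao", n_init=5, random_state=0)
```

With kmodes installed and a mixed-type array `X` whose columns 0 and 1 are categorical (an assumed layout), `costs_by_k(X, [0, 1], range(2, 9), kprototypes_factory)` would return a dict mapping each k to its `.cost_`.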

@mkcedward
Author

mkcedward commented Apr 13, 2017

Regarding finding an optimal number of clusters: the definition of ".cost_" is "sum distance of all points to their respective cluster centroids". I am wondering how I can find the optimal number of clusters based on ".cost_".

For example (an extreme case for easier explanation), I have 10 data points and want to know whether I should group them into 2 clusters or 10 clusters.
With 2 clusters, the total distance is 5 for Cluster A and 3 for Cluster B, so the best cost is 3. With 10 clusters, as I only have 10 data points, the cost of every centroid should be 0? So the optimal number of clusters is 10.

@nicodv
Owner

nicodv commented Apr 13, 2017

Cost is defined as the sum of dissimilarities of all points with their closest centroids, not as a cost per cluster. (You have to look at the total picture instead of cluster-by-cluster, because if you leave one cluster out, the cost of the others changes: its data points would need to be re-assigned.)

In your example, you would get cost=8 in the first scenario and cost=0 in the second. Simply re-run the algorithm across a range of k and see what minimizes cost.
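The arithmetic in this example can be checked with a small pure-Python sketch. It is an illustration only, not the library's code: it assumes the dissimilarity is squared Euclidean distance on the numeric part plus `gamma` times the number of mismatches on the categorical part, with `gamma=1.0`. The ten data points and both centroids are made up so that the per-cluster sums come out as 5 and 3, and the function names are hypothetical.

```python
def dissimilarity(point, centroid, gamma=1.0):
    """Assumed mixed dissimilarity: squared Euclidean + gamma * mismatches."""
    num_p, cat_p = point
    num_c, cat_c = centroid
    numeric = sum((a - b) ** 2 for a, b in zip(num_p, num_c))
    categorical = sum(a != b for a, b in zip(cat_p, cat_c))
    return numeric + gamma * categorical


def assign_and_cost(points, centroids, gamma=1.0):
    """Return each point's closest-centroid label and its cost."""
    labels, costs = [], []
    for p in points:
        d = [dissimilarity(p, c, gamma) for c in centroids]
        best = min(range(len(centroids)), key=d.__getitem__)
        labels.append(best)
        costs.append(d[best])
    return labels, costs


# Made-up data: 6 points around centroid A, 4 points around centroid B.
points = [
    ((0.0, 0.0), ("red",)), ((1.0, 0.0), ("red",)), ((-1.0, 0.0), ("red",)),
    ((0.0, 1.0), ("red",)), ((0.0, -1.0), ("red",)), ((0.0, 0.0), ("blue",)),
    ((10.0, 10.0), ("blue",)), ((11.0, 10.0), ("blue",)),
    ((9.0, 10.0), ("blue",)), ((10.0, 10.0), ("red",)),
]
two_centroids = [((0.0, 0.0), ("red",)), ((10.0, 10.0), ("blue",))]

labels, costs = assign_and_cost(points, two_centroids)
per_cluster = [sum(c for l, c in zip(labels, costs) if l == k) for k in (0, 1)]
total_k2 = sum(costs)  # the single number compared across k

# With one centroid per point, every point sits on its own centroid.
_, costs_k10 = assign_and_cost(points, points)
total_k10 = sum(costs_k10)
```

Here `per_cluster` holds the two per-cluster sums and `total_k2` their total, while `total_k10` is the total when every point is its own centroid.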

nicodv closed this as completed Jun 6, 2017