Clustering measurement/ clustering error #39
Comments
Basically, after training there is a cost_ attribute. This is documented in the KPrototypes class: https://github.com/nicodv/kmodes/blob/master/kmodes/kprototypes.py#L365 To find the optimal number of clusters, simply run the algorithm in a loop with varying |
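A minimal sketch of such a loop, for illustration only. The helper name sweep_costs and the make_model factory are hypothetical and not part of kmodes; the only things assumed about the estimator are a fit() method and a cost_ attribute after fitting, as described above. The commented usage assumes kmodes is installed and that columns 0 and 1 of X are the categorical ones.

```python
def sweep_costs(make_model, X, ks, **fit_kwargs):
    """Fit one model per candidate k and collect its total cost.

    make_model is a hypothetical factory: it takes k and returns an unfitted
    estimator that exposes fit() and, after fitting, a cost_ attribute.
    """
    costs = {}
    for k in ks:
        model = make_model(k)
        model.fit(X, **fit_kwargs)
        costs[k] = model.cost_  # total cost over all points for this k
    return costs


# Possible usage with kmodes (assumes `pip install kmodes`, and that columns
# 0 and 1 of X are the categorical ones):
#
#   from kmodes.kprototypes import KPrototypes
#   costs = sweep_costs(
#       lambda k: KPrototypes(n_clusters=k, init="Cao", random_state=0),
#       X, range(2, 9), categorical=[0, 1],
#   )
#   for k, cost in costs.items():
#       print(k, cost)
```

Because the cost generally keeps shrinking as the number of clusters grows, the usual way to read the result is to look for the point where the decrease levels off, rather than to take the minimum.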
Regarding finding an optimal number of clusters: the definition of ".cost_" is "sum distance of all points to their respective cluster centroids". I am wondering how I can find the optimal number of clusters based on ".cost_". For example (an extreme case, for easier explanation), I have 10 data points and want to know whether I should cluster them into 2 clusters or 10 clusters.
Cost is defined as the sum of dissimilarities of all points with their closest centroids, not as a cost per cluster. (You have to look at the total picture instead of cluster-by-cluster, because if you leave 1 cluster out, the cost of the others changes because the data points would need to get re-assigned.) In your example, you would get cost=8 in the first scenario, cost=0 in the second. Simply re-run the algorithm across a range of |
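To make the definition concrete, here is a small pure-Python sketch of a total cost computed that way. It is not the kmodes implementation: it assumes squared Euclidean distance on the numeric part plus gamma times the number of mismatches on the categorical part, and the ten data points and the two centroids are made up for illustration (they are not the example from the comment above, so the numbers differ).

```python
def dissimilarity(point, centroid, n_numeric, gamma=1.0):
    """Mixed dissimilarity between one point and one centroid.

    Assumption: the first n_numeric fields are numeric, the rest categorical.
    """
    numeric = sum((a - b) ** 2 for a, b in zip(point[:n_numeric], centroid[:n_numeric]))
    mismatches = sum(a != b for a, b in zip(point[n_numeric:], centroid[n_numeric:]))
    return numeric + gamma * mismatches


def total_cost(points, centroids, n_numeric, gamma=1.0):
    """Sum over all points of the dissimilarity to the closest centroid."""
    return sum(
        min(dissimilarity(p, c, n_numeric, gamma) for c in centroids)
        for p in points
    )


# Ten made-up points: one numeric field and one categorical field.
points = [(float(i), "a" if i < 5 else "b") for i in range(10)]

# Scenario 1: two hand-picked centroids.
cost_2 = total_cost(points, [(2.0, "a"), (7.0, "b")], n_numeric=1)

# Scenario 2: ten centroids, one sitting on every point.
cost_10 = total_cost(points, points, n_numeric=1)

print(cost_2, cost_10)  # 20.0 0.0
```

With one centroid per point the cost is always zero, which is why the lowest cost on its own cannot pick the number of clusters: the comparison has to weigh the drop in cost against the number of clusters added.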
Recently, I had both categorical and numeric data to cluster and found that k-prototypes fits my case. I am able to fit and predict my data, but I cannot find a good way to identify an "optimal"* number of clusters because I am unable to extract the "cost" for every centroid/data point.
Looking into the "k_prototypes" function, I found that it returns the "best" centroids but not all centroids. So I tried training the model, predicting on the data, and finding the cost for every single data point. However, the "predict" function returns only the labels, not both labels and cost. Does the "cost" returned by "_labels_cost" help in this case?
After studying the source code, would it be good to
*In my current situation, the lowest cost is the optimal result.