Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Silhouette kmean #39

Merged
merged 10 commits into from
Sep 13, 2021
Merged

Silhouette kmean #39

merged 10 commits into from
Sep 13, 2021

Conversation

rogerkuou
Copy link
Contributor

Replace the elbow-curve method with the Silhouette analysis, to determine the number of k in K-Mean.

Copy link
Contributor

@fnattino fnattino left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much @rogerkuou, looks great! Very nice idea to add the former k-means algorithm to the legacy folder. I think we should leave the set up as the results object as it was earlier, but for the rest all looks good!

cgc/kmeans.py Show resolved Hide resolved
cgc/kmeans.py Outdated Show resolved Hide resolved
[1, 1, 0]
[0, 0, 0, 1, 1],
[0, 0, 0, 1, 1],
[1, 1, 0, 0, 0],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make it more self evident, wouldn't it be better to have Z[2,2] and Z[3,2] set to 1 as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx Francesco! I changed this, and also found that equal numbers in a cluster may cause some statistics to be 0, which may lead to a 0 dividing issue. Therefore I also added a small perturbation to Z.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mmh, having maximum value equal to the minimum value in a co-cluster should currently raise a warning, but this is fine I think (the NaN values are then replaced with zeros afterwards).

@rogerkuou
Copy link
Contributor Author

Hi @fnattino, I updated the PR. Would you like to have a look?

Copy link
Contributor

@meiertgrootes meiertgrootes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice work! I agree with @fnattino's comment but it looks like that has largely been addressed.
I have been wondering whether the ability to select or provide which statistical measures to use for the k-means might be desirable.

cgc/legacy/kmeans.py Outdated Show resolved Hide resolved
Fix statistic bug in legacy
@rogerkuou rogerkuou merged commit 354615e into development Sep 13, 2021
@fnattino fnattino deleted the silhouette_kmean branch September 15, 2021 12:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants