PUBDEV-6447 constrained kmeans POC #4067

Merged
merged 20 commits into from
Dec 12, 2019

Conversation

maurever
Contributor

@maurever maurever commented Nov 13, 2019

Constrained K-means - Experimental

Compute K-means subject to a minimum cluster size constraint.

Implemented according to https://pdfs.semanticscholar.org/ecad/eb93378d7911c2f7b9bd83a8af55d7fa9e06.pdf

JIRA: https://0xdata.atlassian.net/projects/PUBDEV/issues/PUBDEV-6447

Currently, only a serial version of the minimum-cost flow calculation is implemented. A map-reduce version will be implemented soon.
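
For readers unfamiliar with the approach, the constrained assignment step can be sketched as a serial minimum-cost flow problem. This is only an illustrative sketch and not the code in this PR: the function name `constrained_assign`, the graph layout (source, points, clusters, an overflow node, sink), and the simple queue-based Bellman-Ford search are all assumptions chosen for brevity, and the approach is far too slow for real datasets.

```python
from collections import deque


def constrained_assign(points, centers, min_sizes):
    """Assign each point to one center, minimising total squared distance,
    while giving center h at least min_sizes[h] points.

    Hypothetical helper: solves the assignment as a min-cost flow using
    successive shortest augmenting paths. Assumes at least one center.
    """
    n, k = len(points), len(centers)
    if len(min_sizes) != k or sum(min_sizes) > n:
        raise ValueError("infeasible cluster size constraints")

    # Node ids: 0 = source, 1..n = points, n+1..n+k = clusters, then overflow, sink.
    src, overflow, sink = 0, n + k + 1, n + k + 2
    graph = [[] for _ in range(n + k + 3)]  # edge = [to, capacity, cost, reverse index]

    def add_edge(u, v, cap, cost):
        graph[u].append([v, cap, cost, len(graph[v])])
        graph[v].append([u, 0, -cost, len(graph[u]) - 1])

    for i, p in enumerate(points):
        add_edge(src, 1 + i, 1, 0)  # every point supplies one unit of flow
        for h, c in enumerate(centers):
            d2 = sum((a - b) ** 2 for a, b in zip(p, c))
            add_edge(1 + i, 1 + n + h, 1, d2)
    for h in range(k):
        add_edge(1 + n + h, sink, min_sizes[h], 0)  # mandatory minimum
        add_edge(1 + n + h, overflow, n, 0)         # anything above the minimum
    # Only the unconstrained remainder may bypass the mandatory arcs.
    add_edge(overflow, sink, n - sum(min_sizes), 0)

    while True:
        # Queue-based Bellman-Ford: residual arcs can have negative cost.
        dist = [float("inf")] * len(graph)
        prev = [None] * len(graph)  # (node, edge index) used to reach each node
        in_queue = [False] * len(graph)
        dist[src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            in_queue[u] = False
            for idx, (v, cap, cost, _) in enumerate(graph[u]):
                if cap > 0 and dist[u] + cost < dist[v] - 1e-12:
                    dist[v] = dist[u] + cost
                    prev[v] = (u, idx)
                    if not in_queue[v]:
                        queue.append(v)
                        in_queue[v] = True
        if prev[sink] is None:
            break  # no augmenting path left: all points are assigned
        # Source arcs have capacity 1, so each path carries exactly one unit.
        v = sink
        while v != src:
            u, idx = prev[v]
            graph[u][idx][1] -= 1
            graph[v][graph[u][idx][3]][1] += 1
            v = u

    labels = []
    for i in range(n):
        for v, cap, _, _ in graph[1 + i]:
            if 1 + n <= v <= n + k and cap == 0:  # saturated point -> cluster arc
                labels.append(v - 1 - n)
                break
    return labels
```

With points `[(0,), (1,), (2,), (10,)]`, centers `[(1,), (10,)]`, and minimum sizes `[2, 2]`, the point at 2 is moved to the second cluster because it is the cheapest one to reassign.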

EDIT:

  • solve all bugs
  • improve finding minimal reduced weight
  • performance is still very bad even for small datasets

-> mark as experimental
-> remove from Python/R API
-> remove documentation

@maurever maurever self-assigned this Nov 13, 2019
@maurever maurever force-pushed the maurever_PUBDEV-6447_constrained_kmeans branch from bccb82c to 6e11999 Compare November 14, 2019 09:41
@maurever maurever force-pushed the maurever_PUBDEV-6447_constrained_kmeans branch from a67e5e4 to 93b9806 Compare December 2, 2019 15:54
@maurever
Contributor Author

maurever commented Dec 9, 2019

@angela0xdata, could you review the documentation part of this PR, please? Thank you!

Contributor

@michalkurka michalkurka left a comment

Given the experimental nature of this feature, I think this PR is okay to merge.

@@ -36,6 +36,8 @@
#' @param categorical_encoding Encoding scheme for categorical features Must be one of: "AUTO", "Enum", "OneHotInternal", "OneHotExplicit",
#' "Binary", "Eigen", "LabelEncoder", "SortByResponse", "EnumLimited". Defaults to AUTO.
#' @param export_checkpoints_dir Automatically export generated models to this directory.
#' @param cluster_size_constraints Specify how many points should be at least in each cluster. The length of constraints array has to be same as

minor edit:
Specify how many points should be at least in each cluster. The length of constraints array must be the same as the number of clusters (experimental).
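
The rule stated in this wording can be expressed as a small check. The following is a minimal sketch under stated assumptions: the function name `check_cluster_size_constraints` is hypothetical and not part of H2O, and the additional requirements (non-negative sizes, total not exceeding the number of rows) are not stated in the PR but follow from what a feasible assignment needs.

```python
def check_cluster_size_constraints(constraints, k, n_rows):
    """Validate minimum cluster sizes and return how many rows remain free.

    Hypothetical helper for illustration only; not an H2O function.
    """
    # Rule from the parameter description: one constraint per cluster.
    if len(constraints) != k:
        raise ValueError(f"expected {k} constraints, got {len(constraints)}")
    # Assumption: a minimum size cannot be negative.
    if any(c < 0 for c in constraints):
        raise ValueError("minimum cluster sizes must be non-negative")
    # Assumption: the minimums together cannot need more rows than exist.
    required = sum(constraints)
    if required > n_rows:
        raise ValueError(f"constraints need {required} rows, data has {n_rows}")
    return n_rows - required
```

For example, `check_cluster_size_constraints([10, 20, 30], k=3, n_rows=100)` returns 40, the number of rows that can be placed in any cluster.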

@ABartzGit

@maurever, I added a simple example to h2o-bindings/bin/custom/python/gen_kmeans.py (similar to other KMeans options).


@ABartzGit ABartzGit left a comment

Assuming my change didn't break anything. (h2o-3 built locally without error.)

@maurever
Contributor Author

maurever commented Dec 9, 2019

Assuming my change didn't break anything.

No, it is OK. It is my code that causes the test to fail. Thank you very much for your improvements and review, @angela0xdata!

5 participants