
Add sample weights to KPrototypes. #171

Merged: 10 commits from kklein:master into nicodv:master on Mar 17, 2022
Conversation

@kklein (Contributor) commented Mar 14, 2022

As of now, every data point contributes equally to the loss function and to the derived cluster updates.

Yet, in some use cases, it might be desirable to attach weights to individual data points.

This PR introduces sample_weights, a sequence of numeric values, as an optional parameter for KPrototypes' fit method as well as for all downstream functions.

Basic input validation and tests are provided.
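
A minimal usage sketch of the new parameter (the toy data, weights, and constructor arguments below are made up for illustration; the parameter name sample_weights follows this PR's diff and may differ in later releases):

    import numpy as np
    from kmodes.kprototypes import KPrototypes

    # Two numeric columns followed by one categorical column.
    X = np.array([
        [1.0, 2.0, "a"],
        [1.5, 2.5, "a"],
        [8.0, 9.0, "b"],
        [8.5, 9.5, "b"],
    ], dtype=object)

    # One non-negative weight per row; a larger weight makes that point
    # count more toward the loss function and the centroid updates.
    sample_weights = [1, 1, 1, 5]

    kproto = KPrototypes(n_clusters=2, n_init=2, random_state=42)
    kproto.fit(X, categorical=[2], sample_weights=sample_weights)
    print(kproto.cluster_centroids_)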

@kklein (Contributor, Author) commented Mar 14, 2022

Hi @nicodv ! Happy to hear your thoughts. :)

@nicodv (Owner) left a comment

Wonderful contribution, @kklein . Thank you!

I left some comments and questions.

kmodes/kprototypes.py — review thread resolved (outdated)
@@ -130,13 +130,17 @@ def __init__(self, n_clusters=8, max_iter=100, num_dissim=euclidean_dissim,
                           "Setting n_init to 1.")
             self.n_init = 1

-    def fit(self, X, y=None, categorical=None):
+    def fit(self, X, y=None, categorical=None, sample_weights=None):
@nicodv (Owner) commented:

I think we can also add it to KModes, since you've done most of the legwork for that already.

This can be a follow-up PR.

@kklein (Contributor, Author) commented Mar 17, 2022:

It also seemed to me that it would be fitting and consistent to enable the functionality for KModes as well.

If possible, I would appreciate it if this could be done in a follow-up PR.

@nicodv (Owner) commented:

Sure, that would be great.

@@ -513,3 +527,16 @@ def _split_num_cat(X, categorical):
                        if ii not in categorical]]).astype(np.float64)
     Xcat = np.asanyarray(X[:, categorical])
     return Xnum, Xcat


+def _validate_sample_weights(sample_weights, n_samples):
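
The hunk above cuts off before the function body; a plausible sketch of what such a validator might check (the exact checks in the merged code may differ):

    def _validate_sample_weights(sample_weights, n_samples):
        """Return validated sample weights, or None if none were given."""
        if sample_weights is None:
            return None
        if len(sample_weights) != n_samples:
            raise ValueError(
                f"sample_weights should be of length {n_samples}, "
                f"got {len(sample_weights)}."
            )
        if not all(isinstance(weight, (int, float)) for weight in sample_weights):
            raise ValueError("sample_weights elements should be numeric.")
        if any(weight < 0 for weight in sample_weights):
            raise ValueError("sample_weights elements should be non-negative.")
        return sample_weights
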
@nicodv (Owner) commented:

If we're enabling this for both KModes and KPrototypes, I suggest moving this method to the former.

@kklein (Contributor, Author) commented:

Noted (for a possible follow-up PR)!

kmodes/tests/test_kprototypes.py — three review threads resolved (outdated)
kmodes/kprototypes.py — review thread resolved (outdated)
             # Initial assignment to clusters
             clust = np.argmin(
                 num_dissim(centroids[0], Xnum[ipoint]) + gamma *
                 cat_dissim(centroids[1], Xcat[ipoint], X=Xcat, membship=membship)
             )
             membship[clust, ipoint] = 1
-            cl_memb_sum[clust] += 1
+            cl_memb_sum[clust] += sample_weight
@nicodv (Owner) commented:

Ultimately, we calculate the mean by dividing cl_attr_sum by this, the cl_memb_sum: https://github.com/nicodv/kmodes/blob/master/kmodes/kprototypes.py#L471

If we apply the weight to both the numerator and denominator, they cancel out, no?

Shouldn't we solely apply the weight to cl_attr_sum?

@kklein (Contributor, Author) commented:

Most definitely! This slipped through the cracks: 978d973

@nicodv (Owner) commented:

I do wonder now how the unit test that tests for a single overweighted sample was able to pass. 🤔

@kklein (Contributor, Author) commented:

Yeah, so do I. I think the problem affected only one of the numerical/categorical feature types. Maybe the other one being correct was sufficient to push the centroid to the right point?
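
For reference, a standalone sketch of the weighted-mean arithmetic discussed above (the values are made up, not taken from the test suite): the weights enter both the attribute sum and the membership sum, and they only fully cancel when all weights are equal.

    import numpy as np

    x = np.array([0.0, 0.0, 10.0])  # numeric attribute values in one cluster
    w = np.array([1.0, 1.0, 5.0])   # a single heavily overweighted sample

    # Weighted mean: weights appear in both the attribute sum (numerator,
    # cf. cl_attr_sum) and the membership sum (denominator, cf. cl_memb_sum).
    weighted_mean = (w * x).sum() / w.sum()  # 50 / 7, approx. 7.14

    # With uniform weights the weighted mean reduces to the plain mean,
    # which is the sense in which the weights "cancel out".
    plain_mean = x.mean()  # approx. 3.33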

@coveralls commented Mar 17, 2022:
Coverage increased (+1.3%) to 97.908% when pulling 03e9ac6 on kklein:master into 370d64b on nicodv:master.

@kklein requested a review from @nicodv on Mar 17, 2022, 15:52
@nicodv (Owner) left a comment

LGTM

@nicodv merged commit 9222369 into nicodv:master on Mar 17, 2022
@kklein (Contributor, Author) commented Mar 17, 2022

Thanks a bunch for your fast and very useful feedback! :)

@nicodv mentioned this pull request on Mar 30, 2022