
Add sample weights to KPrototypes. #171

Merged: 10 commits from kklein:master into nicodv:master on Mar 17, 2022
Conversation

@kklein (Contributor) commented Mar 14, 2022

As of now, every data point contributes equally to the loss function and to the derived cluster updates.

Yet, in some use cases, it might be desirable to attach weights to individual data points.

This PR introduces sample_weights, a sequence of numeric values, as an optional parameter for KPrototypes' fit method as well as for all downstream functions.

Basic input validation and tests are provided.
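
A minimal usage sketch of the new parameter (the toy data, weights, and constructor arguments below are made up for illustration; the parameter name sample_weights follows this PR's diff and may differ in later releases):

    import numpy as np
    from kmodes.kprototypes import KPrototypes

    # Two numeric columns followed by one categorical column.
    X = np.array([
        [1.0, 2.0, "a"],
        [1.5, 2.5, "a"],
        [8.0, 9.0, "b"],
        [8.5, 9.5, "b"],
    ], dtype=object)

    # One non-negative weight per row; a larger weight makes that point
    # count more toward the loss function and the centroid updates.
    sample_weights = [1, 1, 1, 5]

    kproto = KPrototypes(n_clusters=2, n_init=2, random_state=42)
    kproto.fit(X, categorical=[2], sample_weights=sample_weights)
    print(kproto.cluster_centroids_)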

@kklein (Contributor, Author) commented Mar 14, 2022

Hi @nicodv ! Happy to hear your thoughts. :)

@nicodv (Owner) left a comment

Wonderful contribution, @kklein . Thank you!

I left some comments and questions.

kmodes/kprototypes.py — review thread resolved (outdated)
@@ -130,13 +130,17 @@ def __init__(self, n_clusters=8, max_iter=100, num_dissim=euclidean_dissim,
                           "Setting n_init to 1.")
             self.n_init = 1

-    def fit(self, X, y=None, categorical=None):
+    def fit(self, X, y=None, categorical=None, sample_weights=None):
@nicodv (Owner) commented:

I think we can also add it to KModes, since you've done most of the legwork for that already.

This can be a follow-up PR.

@kklein (Contributor, Author) commented Mar 17, 2022:

It also seemed to me that it would be fitting and consistent to enable the functionality for KModes as well.

If possible, I would appreciate it if this could be done in a follow-up PR.

@nicodv (Owner) commented:

Sure, that would be great.

@@ -513,3 +527,16 @@ def _split_num_cat(X, categorical):
                        if ii not in categorical]]).astype(np.float64)
     Xcat = np.asanyarray(X[:, categorical])
     return Xnum, Xcat


+def _validate_sample_weights(sample_weights, n_samples):
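
The hunk above cuts off before the function body; a plausible sketch of what such a validator might check (the exact checks in the merged code may differ):

    def _validate_sample_weights(sample_weights, n_samples):
        """Return validated sample weights, or None if none were given."""
        if sample_weights is None:
            return None
        if len(sample_weights) != n_samples:
            raise ValueError(
                f"sample_weights should be of length {n_samples}, "
                f"got {len(sample_weights)}."
            )
        if not all(isinstance(weight, (int, float)) for weight in sample_weights):
            raise ValueError("sample_weights elements should be numeric.")
        if any(weight < 0 for weight in sample_weights):
            raise ValueError("sample_weights elements should be non-negative.")
        return sample_weights
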
@nicodv (Owner) commented:

If we're enabling this for both KModes and KPrototypes, I suggest moving this method to the former.

@kklein (Contributor, Author) commented:

Noted (for a possible follow-up PR)!

kmodes/tests/test_kprototypes.py — three review threads resolved (outdated)
kmodes/kprototypes.py — review thread resolved (outdated)
             # Initial assignment to clusters
             clust = np.argmin(
                 num_dissim(centroids[0], Xnum[ipoint]) + gamma *
                 cat_dissim(centroids[1], Xcat[ipoint], X=Xcat, membship=membship)
             )
             membship[clust, ipoint] = 1
-            cl_memb_sum[clust] += 1
+            cl_memb_sum[clust] += sample_weight
@nicodv (Owner) commented:

Ultimately, we calculate the mean by dividing cl_attr_sum by this, the cl_memb_sum: https://github.com/nicodv/kmodes/blob/master/kmodes/kprototypes.py#L471

If we apply the weight to both the numerator and denominator, they cancel out, no?

Shouldn't we solely apply the weight to cl_attr_sum?

@kklein (Contributor, Author) commented:

Most definitely! This slipped through the cracks: 978d973

@nicodv (Owner) commented:

I do wonder now how the unit test that tests for a single overweighted sample was able to pass. 🤔

@kklein (Contributor, Author) commented:

Yeah, so do I. I think the problem affected only one of the numerical/categorical feature types. Maybe the other one being correct was sufficient to push the centroid to the right point?
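
For reference, a standalone sketch of the weighted-mean arithmetic discussed above (the values are made up, not taken from the test suite): the weights enter both the attribute sum and the membership sum, and they only fully cancel when all weights are equal.

    import numpy as np

    x = np.array([0.0, 0.0, 10.0])  # numeric attribute values in one cluster
    w = np.array([1.0, 1.0, 5.0])   # a single heavily overweighted sample

    # Weighted mean: weights appear in both the attribute sum (numerator,
    # cf. cl_attr_sum) and the membership sum (denominator, cf. cl_memb_sum).
    weighted_mean = (w * x).sum() / w.sum()  # 50 / 7, approx. 7.14

    # With uniform weights the weighted mean reduces to the plain mean,
    # which is the sense in which the weights "cancel out".
    plain_mean = x.mean()  # approx. 3.33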

@coveralls commented Mar 17, 2022:
Coverage increased (+1.3%) to 97.908% when pulling 03e9ac6 on kklein:master into 370d64b on nicodv:master.

@kklein requested a review from @nicodv on Mar 17, 2022, 15:52
@nicodv (Owner) left a comment

LGTM

@nicodv merged commit 9222369 into nicodv:master on Mar 17, 2022
@kklein (Contributor, Author) commented Mar 17, 2022

Thanks a bunch for your fast and very useful feedback! :)

@nicodv mentioned this pull request on Mar 30, 2022