
Implemented Ng et al.'s dissimilarity measure #44

Merged: 12 commits from benandow:master into nicodv:master on Nov 15, 2017

Conversation

3 participants
benandow (Contributor) commented Jun 7, 2017

Implemented Ng et al.'s dissimilarity measure [1], including the necessary modifications to the base libraries. Feel free to pull it into the main branch.

[1] Michael K. Ng, Mark Junjie Li, Joshua Zhexue Huang, and Zengyou He, "On the Impact of Dissimilarity Measure in k-Modes Clustering Algorithm," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 3, March 2007.

coveralls commented Jun 7, 2017

Coverage decreased (-1.6%) to 95.134% when pulling 9785a5b on benandow:master into a30da80 on nicodv:master.

nicodv (Owner) commented Jun 7, 2017

This is awesome, thank you!

I have 2 remarks, though:

  • Do we need separate dissimilarity measures between the initialization and the main loop of the algorithm? I think it makes the API less clear, and I don't see the added value.
  • The new dissimilarity measure needs to be properly unit tested before we can merge this.
benandow (Contributor) commented Jun 7, 2017

Not a problem! It's a very useful library. Thanks for releasing it!

To address your remarks:

  • That's a fair point. I'll revert it to use the same dissimilarity measure for the initialization and the main loop. The main reason I implemented it with different parameters for the dissimilarity measures is that it is more efficient to just invoke the simple matching_diss measure for initialization, as Ng et al.'s measure inherently produces the same result there because the clusters are empty at that point.
    • Explanation: The benefit of Ng et al.'s dissimilarity measure is that it essentially accounts for cluster impurity by penalizing attribute mismatches within a cluster. When the attribute value of the current record xi,j matches the corresponding attribute value of the cluster centroid zl,j, you look at the current members of the cluster and calculate phi(zl,j, xi,j) as 1 - |Cl,j,r| / |Cl|, where |Cl| is the number of objects in the lth cluster, and |Cl,j,r| is the number of objects in the lth cluster whose jth attribute has the same attribute value as the current record xi,j. If the cluster is empty and the attribute value of the centroid matches the attribute value of the record (i.e., xi,j == zl,j), then there is no penalty (1 - |Cl,j,r| / |Cl| is treated as 0, since both |Cl,j,r| and |Cl| equal 0). If xi,j != zl,j, the penalty is 1. Therefore, when the clusters are empty, such as during centroid initialization, Ng et al.'s dissimilarity measure produces the same result as the simple matching_diss measure.
  • I'll try to find some free time to include unit tests.

Best,
-Ben
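For readers following along, the penalty Ben describes can be sketched in a few lines of NumPy. This is an illustrative reimplementation, not the PR's actual ng_diss; in particular, the (n_clusters, n_points) 0/1 layout of membship is an assumption:

```python
import numpy as np

def ng_diss(centroids, x, X, membship):
    """Illustrative sketch of Ng et al.'s dissimilarity between a point
    x and each centroid. centroids is (n_clusters, n_attrs), X is the
    training data, membship is a (n_clusters, n_points) 0/1 matrix.
    On a mismatch the penalty is 1; on a match it is
    1 - |C_{l,j,r}| / |C_l|, which reduces to simple matching when
    the cluster is empty."""
    n_clusters, n_attrs = centroids.shape
    dist = np.zeros(n_clusters)
    for l in range(n_clusters):
        members = X[membship[l] == 1]
        cl_size = len(members)
        for j in range(n_attrs):
            if centroids[l, j] != x[j]:
                dist[l] += 1.0  # plain mismatch penalty
            elif cl_size > 0:
                same = np.sum(members[:, j] == x[j])
                dist[l] += 1.0 - same / cl_size  # cluster-impurity penalty
            # empty cluster + match -> no penalty, as explained above
    return dist
```

On empty clusters every term is either 0 (match) or 1 (mismatch), so the result coincides with matching_diss, which is exactly why a separate measure during initialization adds no value.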

benandow added some commits Jun 7, 2017

Unified init+kmodes dissimilarity functions, added unit tests for ng_diss, and added error checking to dissimilarity function for membship array
coveralls commented Jun 7, 2017

Coverage decreased (-0.1%) to 96.602% when pulling 5c08222 on benandow:master into a30da80 on nicodv:master.

benandow (Contributor) commented Jun 7, 2017

As discussed above, I've reverted k-modes and k-prototypes to use the same dissimilarity measure for initialization and the main loop. I added unit tests for the Ng dissimilarity function, and also error checking in the Ng dissimilarity measure to ensure that the 'membship' array has the correct shape.

xcids = np.where(np.in1d(memj.ravel(), [1]).reshape(memj.shape))
return float((np.take(X, xcids, axis=0)[0][:, idr] == b[idr]).sum(0))

def calc_dissim(b, X, memj, idr, idj):

nicodv (Owner) commented Jun 9, 2017

idj is not used in the function at all?

benandow (Contributor) commented Jun 9, 2017

Good catch! I removed that parameter. I had implemented support for a weighting function, which was removed for the PR; that parameter was a leftover.

@@ -373,6 +376,9 @@ def fit_predict(self, X, y=None, **kwargs):
"""
return self.fit(X, **kwargs).labels_

def genMembshipArray(self):

nicodv (Owner) commented Jun 9, 2017

Let's make this a function in utils/init.py

benandow (Contributor) commented Jun 9, 2017

Moved it and updated the references.

coveralls commented Jun 9, 2017

Coverage decreased (-0.1%) to 96.602% when pulling 7e4e865 on benandow:master into a30da80 on nicodv:master.

@@ -415,6 +416,9 @@ def fit(self, X, y=None, categorical=None):
self.verbose)
return self

def genMembshipArray(self):

nicodv (Owner) commented Jun 9, 2017

Let's make this a function in utils/init.py

benandow (Contributor) commented Jun 9, 2017

Moved it and updated the references.

nicodv (Owner) commented Jun 9, 2017

@benandow , in test_kmodes and test_kprototypes, we need to add tests that use this dissimilarity measure to prove that the end-to-end flow works, and that the results are as expected.

coveralls commented Jun 9, 2017

Coverage decreased (-0.1%) to 96.585% when pulling 1d9526a on benandow:master into a30da80 on nicodv:master.

benandow added some commits Jun 9, 2017

Bugfix for invoking fit then predicting on new data. Need to keep track of initial feature vector for attribute frequency calculation in ng_diss. Although this means that the encoded FV will not be freed by the GC! Alternative is to build a map of attribute frequencies at the beginning, but this adds unnecessary memory overhead if not predicting new data (e.g., just invoking fit or fit_predict). Therefore, I believe that maintaining a reference to the encoded FV is the best choice.
benandow (Contributor) commented Jun 9, 2017

@nicodv I added test cases in test_kmodes and test_kprototypes to show that the end-to-end flow works and the results are as expected when using the Ng dissimilarity measure.

After testing, I also submitted a fix for a bug that occurred when invoking fit and then trying to predict on new data. I had overlooked that we need to keep track of the initial encoded feature vector for the attribute frequency calculation in Ng's dissimilarity measure. I implemented the patch; however, this means that the encoded FV will not be freed by the GC. An alternative solution is to build a map of attribute frequencies at the beginning, so that the encoded FV can be freed, but this adds unnecessary memory overhead when not predicting new data (e.g., if you're just invoking fit or fit_predict). Therefore, I believe that maintaining a reference to the encoded FV is the best choice.
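For reference, the rejected alternative (precomputing attribute frequencies up front so the encoded feature vector could be freed) could look roughly like this; the names and array layout are hypothetical, not the PR's code:

```python
from collections import defaultdict
import numpy as np

def build_attr_freqs(X, membship):
    """Hypothetical sketch: precompute the |C_{l,j,r}| counts so the
    encoded training matrix X need not stay referenced for predict().

    Returns freqs[l][j][value] -> count of cluster-l members whose j-th
    attribute equals value, plus the per-cluster sizes |C_l|."""
    n_clusters = membship.shape[0]
    n_attrs = X.shape[1]
    freqs = [[defaultdict(int) for _ in range(n_attrs)]
             for _ in range(n_clusters)]
    sizes = membship.sum(axis=1)
    for l in range(n_clusters):
        for i in np.where(membship[l] == 1)[0]:
            for j in range(n_attrs):
                freqs[l][j][X[i, j]] += 1
    return freqs, sizes
```

With such a table, the match penalty in Ng's measure becomes 1 - freqs[l][j][x[j]] / sizes[l], at the cost of holding counts for every (cluster, attribute, value) triple even when predict is never called, which is the memory overhead Ben is weighing against keeping the FV alive.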

coveralls commented Jun 9, 2017

Coverage decreased (-0.1%) to 96.585% when pulling ef93f15 on benandow:master into a30da80 on nicodv:master.

nicodv (Owner) commented Aug 15, 2017

@benandow , I'm revisiting this PR presently.

I don't think we should save the (potentially very large) training matrix as self.X. At predict time, it's just for calculating cost anyway, so not really crucial.

Could we use the prediction matrix X in the predict method for calculating cost? How can we deal with membship then? Or shall we simply use matching dissimilarity at predict time?

I think I prefer the last option. What are your thoughts?

EDIT:
Please review #52 for my suggested solution.

@nicodv nicodv merged commit ef93f15 into nicodv:master Nov 15, 2017

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
nicodv (Owner) commented Nov 15, 2017

I went with simple matching dissimilarity when predicting. A warning is printed so that users are aware that Ng's dissimilarity is not used at predict time.
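This merged behaviour can be sketched as follows (illustrative function and parameter names, not the library's actual API): when the model was fit with Ng's measure, predict falls back to simple matching dissimilarity and warns the user.

```python
import warnings
import numpy as np

def matching_dissim(centroids, x):
    """Simple matching dissimilarity of point x to each centroid:
    the count of mismatching attributes."""
    return np.sum(centroids != x, axis=1)

def predict_labels(centroids, Xnew, fit_with_ng=True):
    """Sketch of predict-time behaviour (names are illustrative):
    always use simple matching, warning when Ng's measure was used
    during fitting so the user knows about the substitution."""
    if fit_with_ng:
        warnings.warn("Ng's dissimilarity is not used when predicting; "
                      "falling back to simple matching dissimilarity.")
    return np.array([np.argmin(matching_dissim(centroids, x))
                     for x in Xnew])
```

The substitution is safe in the sense that no cluster membership information about the new points exists yet, which is precisely what Ng's impurity term would need.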

@benandow , thanks again for your contribution!
