Implemented Ng et al.'s dissimilarity measure #44
This is awesome, thank you! I have 2 remarks, though:
Not a problem! It's a very useful library. Thanks for releasing it! To address your remarks:
Best,
…diss, and added error checking to dissimilarity function for membship array
As discussed above, I've reverted k-modes and k-prototypes to use the same dissimilarity measure for initialization and the main loop. I also added unit tests for the Ng dissimilarity function, along with error checking in that function to ensure that the membship array has the correct shape.
    xcids = np.where(np.in1d(memj.ravel(), [1]).reshape(memj.shape))
    return float((np.take(X, xcids, axis=0)[0][:, idr] == b[idr]).sum(0))

def calc_dissim(b, X, memj, idr, idj):
nicodv
Jun 9, 2017
Owner
idj is not used in the function at all?
benandow
Jun 9, 2017
Author
Contributor
Good catch! I removed that parameter. I had implemented support for a weighting function, which I took out for this PR; the parameter was a leftover from it.
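For readers following the thread, here is a minimal sketch of the measure under discussion, assuming the definition in Ng et al.: a mismatching attribute costs 1, and a matching attribute costs 1 minus the fraction of the cluster's members that share that value. The function name `ng_dissim_sketch` and the dense `(n_clusters, n_points)` layout of `membship` are illustrative assumptions, not the code in this PR.

```python
import numpy as np


def ng_dissim_sketch(centroids, x, X, membship):
    """Dissimilarity between one point `x` and each centroid (hypothetical helper).

    centroids: (n_clusters, n_attrs) categorical modes
    x:         (n_attrs,) point to compare
    X:         (n_points, n_attrs) training data
    membship:  (n_clusters, n_points) 0/1 cluster membership (assumed layout)
    """
    centroids = np.asarray(centroids)
    x = np.asarray(x)
    X = np.asarray(X)
    membship = np.asarray(membship)

    # Shape check in the spirit of the error checking mentioned in the thread.
    if membship.shape != (centroids.shape[0], X.shape[0]):
        raise ValueError("membship must have shape (n_clusters, n_points)")

    out = np.zeros(centroids.shape[0])
    for k, mode in enumerate(centroids):
        members = X[membship[k] == 1]
        size = members.shape[0]
        for r, value in enumerate(mode):
            if x[r] != value:
                out[k] += 1.0  # mismatch: full cost
            elif size > 0:
                # match: cost shrinks as more cluster members share this value
                out[k] += 1.0 - np.sum(members[:, r] == value) / size
    return out
```

Unlike simple matching, a match here is not free: it is only free when every member of the cluster carries the same value for that attribute.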
…ant from weighting function)
@@ -373,6 +376,9 @@ def fit_predict(self, X, y=None, **kwargs):
        """
        return self.fit(X, **kwargs).labels_

    def genMembshipArray(self):
nicodv
Jun 9, 2017
Owner
Let's make this a function in utils/init.py
benandow
Jun 9, 2017
Author
Contributor
Moved it and updated the references.
@@ -415,6 +416,9 @@ def fit(self, X, y=None, categorical=None):
                     self.verbose)
        return self

    def genMembshipArray(self):
nicodv
Jun 9, 2017
Owner
Let's make this a function in utils/init.py
benandow
Jun 9, 2017
Author
Contributor
Moved it and updated the references.
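As a rough illustration of what such a standalone utility might look like, the sketch below turns a vector of cluster labels into a 0/1 membership array. The name `labels_to_membship`, its signature, and the `(n_clusters, n_points)` layout are assumptions made for this example; the helper actually moved in this PR may differ.

```python
import numpy as np


def labels_to_membship(labels, n_clusters):
    """Build a 0/1 membership array of shape (n_clusters, n_points).

    Hypothetical stand-in for the helper discussed in the review;
    the name and signature are assumptions.
    """
    labels = np.asarray(labels)
    membship = np.zeros((n_clusters, labels.shape[0]), dtype=np.uint8)
    # Row k gets a 1 in every column whose point is assigned to cluster k.
    membship[labels, np.arange(labels.shape[0])] = 1
    return membship
```

Keeping this as a free function means both the k-modes and k-prototypes classes can share it instead of each carrying its own method.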
@benandow, in test_kmodes and test_kprototypes we need to add tests that use this dissimilarity measure, to prove that the end-to-end flow works and that the results are as expected.
…ck of initial feature vector for attribute frequency calculation in ng_diss. Although this means that the encoded FV will not be freed by the GC! Alternative is to build a map of attribute frequencies at the beginning, but this adds unnecessary memory overhead if not predicting new data (e.g., just invoking fit or fit_predict). Therefore, I believe that maintaining a reference to the encoded FV is the best choice.
…ure and results are as expected.
@nicodv I added test cases to test_kmodes and test_kprototypes showing that the end-to-end flow works and that the results are as expected when using the Ng dissimilarity measure. While testing, I also fixed a bug that occurred when invoking fit and then predicting on new data: I had overlooked that we need to keep track of the initial encoded feature vector (FV) for the attribute frequency calculation in Ng's dissimilarity measure. The patch keeps a reference to it, which means the encoded FV will not be freed by the GC. An alternative is to build a map of attribute frequencies up front so that the encoded FV can be freed, but this adds unnecessary memory overhead when not predicting new data (e.g., when only invoking fit or fit_predict). I therefore believe that maintaining a reference to the encoded FV is the best choice.
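To make the alternative concrete, here is a minimal sketch of a per-cluster attribute frequency map that could stand in for the training matrix at prediction time. The helper names `build_attr_freqs` and `match_fraction` are hypothetical and are not part of this PR.

```python
from collections import Counter, defaultdict


def build_attr_freqs(X, labels):
    """Count, per cluster and attribute index, how often each value occurs.

    Hypothetical helper: returns (freqs, sizes) where
    freqs[cluster][attr_index] is a Counter of values and
    sizes[cluster] is the number of members in that cluster.
    """
    freqs = defaultdict(lambda: defaultdict(Counter))
    sizes = Counter()
    for row, label in zip(X, labels):
        sizes[label] += 1
        for r, value in enumerate(row):
            freqs[label][r][value] += 1
    return freqs, sizes


def match_fraction(freqs, sizes, cluster, attr_index, value):
    """Fraction of a cluster's members that have `value` at `attr_index`."""
    if sizes[cluster] == 0:
        return 0.0
    return freqs[cluster][attr_index][value] / sizes[cluster]
```

The map is bounded by the number of distinct (cluster, attribute, value) combinations rather than the number of rows, but it has to be built and stored even when predict is never called, which is the overhead weighed in the comment above.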
@benandow, I'm revisiting this PR presently. I don't think we should save the (potentially very large) training matrix as […] Could we use the prediction matrix […] I think I prefer the last option. What are your thoughts? EDIT: […]
I went with using simple matching dissimilarity when predicting. A warning is printed so that the user is aware that Ng's dissimilarity is not used when predicting. @benandow, thanks again for your contribution!
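A minimal sketch of that fallback, assuming the warning is raised through Python's `warnings` module (the thread only says a warning is printed). The names `matching_dissim_sketch` and `predict_label_sketch` and the `trained_with_ng` flag are hypothetical.

```python
import warnings

import numpy as np


def matching_dissim_sketch(centroids, x):
    """Simple matching: count of attributes where `x` differs from each centroid."""
    return np.sum(np.asarray(centroids) != np.asarray(x), axis=1)


def predict_label_sketch(centroids, x, trained_with_ng=True):
    """Assign `x` to the nearest centroid using simple matching (hypothetical)."""
    if trained_with_ng:
        # Assumed mechanism; the real code may report this differently.
        warnings.warn("Ng's dissimilarity is not used when predicting; "
                      "falling back to simple matching dissimilarity.")
    return int(np.argmin(matching_dissim_sketch(centroids, x)))
```

This avoids holding on to the training matrix after fitting, at the cost of prediction using a different measure than the one used during training.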
Implemented Ng et al.'s dissimilarity measure[1] including necessary modifications to the base libraries. Feel free to pull into the main branch.
[1] Michael K. Ng, Mark Junjie Li, Joshua Zhexue Huang, and Zengyou He, "On the Impact of Dissimilarity Measure in k-Modes Clustering Algorithm", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 29, No. 3, January, 2007