negative gap values #44

Closed
shahabh opened this issue Feb 15, 2020 · 5 comments

Comments

@shahabh

shahabh commented Feb 15, 2020

Can someone please explain why gap values in this implementation are negative?
Based on the formulas in the paper, a negative gap value means that the distances in the random clusterings are smaller than in the original clustering, i.e. the random ones are better than the original one. I have verified this in another book as well.
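
For readers following along, here is a minimal sketch of the sign convention being discussed. It assumes the standard definition, Gap(k) = mean(log W_ref) - log W_obs, where W is the within-cluster dispersion for a given k. The function name and the dispersion numbers are hypothetical, chosen only to show when the value goes negative; this is not the library's code.

```python
import math
from statistics import mean


def gap_value(observed_dispersion, reference_dispersions):
    """Gap for one k: mean log dispersion of the references minus log dispersion of the data."""
    expected_log_ref = mean([math.log(w) for w in reference_dispersions])
    return expected_log_ref - math.log(observed_dispersion)


# Hypothetical dispersions for three random reference datasets.
reference = [40.0, 50.0, 45.0]

# Data that clusters more tightly than the references gives a positive gap.
well_clustered = gap_value(10.0, reference)

# Data that clusters more loosely than the references gives a negative gap.
poorly_clustered = gap_value(60.0, reference)

print(well_clustered > 0, poorly_clustered < 0)
```

Under this definition a negative value is possible but, as the question says, it would mean the reference data clusters more tightly than the real data.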

@dvukolov
Contributor

This is due to #38, which has been fixed recently.

@milesgranger
Owner

milesgranger commented Feb 17, 2020

Yes, that's right, fixed thanks to you @dvukolov 👍

@shahabh you can do pip install gap-stat==2.0.0.rc1 for that fix. The only (planned) difference between that release and 2.0.0 is that I still need to do the same fix on the Rust side of things.

@shahabh
Author

shahabh commented Feb 17, 2020

Thanks for your reply.
Now it gives this error:

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-19-6205513b5d14> in <module>
      3 
      4 
----> 5 n_clusters = optimalK(X, cluster_array=np.arange(1, 15))
      6 print('Optimal clusters: ', n_clusters)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\gap_statistic\optimalK.py in __call__(self, X, n_refs, cluster_array)
    129 
    130         # Calculate the gaps for each cluster count.
--> 131         for gap_calc_result in engine(X, n_refs, cluster_array):
    132 
    133             # Assign this loop's gap statistic to gaps

~\AppData\Local\Continuum\anaconda3\lib\site-packages\gap_statistic\optimalK.py in _process_with_rust(self, X, n_refs, cluster_array)
    317         Process gap stat using pure rust
    318         """
--> 319         import gapstat_rs
    320 
    321         for (


ModuleNotFoundError: No module named 'gapstat_rs'

Also, I was wondering whether the new implementation just turns the negative values into the same positive values, or whether the generated values will be different.
Did the previous version, with the negative values, still produce the correct optimal k?

@dvukolov
Contributor

The Rust part has not been fixed yet; please use the default parallel backend, "joblib". The positive values will be completely different. As for the optimal number of clusters, I believe that depends on the data. In my case, the outcome didn't change much. However, it was very important to increase the number of reference datasets and use a custom clusterer to obtain stable results.
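
The point about the number of reference datasets can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions: one-dimensional uniform reference data and k = 1, so no clustering step is needed. All function names here are hypothetical and none of this is the gap_statistic API. It estimates the reference term of the gap repeatedly and compares how much the estimate moves from run to run when few or many reference datasets are averaged.

```python
import math
import random
from statistics import mean, stdev


def log_dispersion_k1(points):
    """Log of the k=1 dispersion: sum of squared distances to the overall mean."""
    centre = mean(points)
    return math.log(sum((p - centre) ** 2 for p in points))


def expected_log_ref(n_points, n_refs, rng):
    """Monte Carlo estimate of the reference term, averaged over n_refs uniform datasets."""
    logs = []
    for _ in range(n_refs):
        reference = [rng.uniform(0.0, 1.0) for _ in range(n_points)]
        logs.append(log_dispersion_k1(reference))
    return mean(logs)


def spread_over_runs(n_refs, n_runs=200, n_points=30, seed=0):
    """Standard deviation of the estimate across repeated, independent runs."""
    rng = random.Random(seed)
    estimates = [expected_log_ref(n_points, n_refs, rng) for _ in range(n_runs)]
    return stdev(estimates)


spread_few = spread_over_runs(n_refs=3)
spread_many = spread_over_runs(n_refs=50)

print(f"run-to-run spread with 3 reference datasets:  {spread_few:.4f}")
print(f"run-to-run spread with 50 reference datasets: {spread_many:.4f}")
```

The spread with more reference datasets should come out clearly smaller, which is consistent with the advice above: a noisy reference term makes the gap curve, and therefore the chosen number of clusters, vary between runs.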

@milesgranger
Owner

milesgranger commented Feb 18, 2020

Going to close this as the Rust side is fixed now. Once CI has finished, you can do pip install gap-stat[rust]==2.0.0 for the updated Rust changes if you want; otherwise use joblib as @dvukolov suggested. 👍
