
Determining the optimal number of clusters #46

Open
eugeniahrho opened this issue Jun 10, 2017 · 7 comments

eugeniahrho commented Jun 10, 2017

Hi, I've been using kmodes (https://www.rdocumentation.org/packages/klaR/versions/0.6-12/topics/kmodes) from klaR, an R package, to cluster my data set. I wanted to try kmodes in Python to see if I get similar results. However, I don't see how to determine the optimal number of clusters in the Python version of kmodes.

In the klaR package, I can use the $withindiff component to get the within-cluster simple-matching distance for each cluster. This lets me calculate the sum of errors for k = 2, 3, 4, ..., and select the optimal number of clusters based on the largest difference in the sum of errors between successive clusterings with varying k values.

In kmodes for Python, how do you determine the optimal k?

nicodv (Owner) commented Jun 16, 2017

Simply by running the clustering for multiple k values, as there currently is no wrapper that does this for you automatically.

It would be nice to combine this with the silhouette plot mentioned here

PRs are welcome. :)
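The loop described above can be sketched as follows. In practice `costs[k]` would come from something like `KModes(n_clusters=k, init='Huang').fit(X).cost_`; the numbers below are made-up stand-ins for illustration, and `pick_elbow` is a hypothetical helper, not part of the package:

```python
# Sketch of choosing k by the "elbow" on the k-modes cost curve.
# costs maps each candidate k to the clustering cost obtained for it.

def pick_elbow(costs):
    """Return the k at which the cost drop from the previous k is largest."""
    ks = sorted(costs)
    # drop in cost when going from one k to the next
    drops = {k2: costs[k1] - costs[k2] for k1, k2 in zip(ks, ks[1:])}
    return max(drops, key=drops.get)

costs = {2: 540.0, 3: 410.0, 4: 395.0, 5: 390.0}  # illustrative values only
best_k = pick_elbow(costs)  # largest drop is from k=2 to k=3, so best_k == 3
```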

@dexdimas

And how do you determine the optimal k for k-prototypes?

I am working on clustering mixed categorical and numerical attributes. I stumbled across your k-prototypes implementation and want to apply it to my case. However, I'm a bit confused about how to evaluate the result of the k-prototypes algorithm (e.g., how to determine the optimal k).

Since it was mentioned that a silhouette plot would do the trick, I've been thinking of swapping the Euclidean distance for the k-prototypes cost function to determine the intra- and inter-cluster distances in the silhouette analysis.

Do you think that would work?
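The idea above can be sketched as a pairwise dissimilarity that mirrors the k-prototypes cost: squared Euclidean distance on the numeric attributes plus gamma times simple matching on the categorical ones. This is a sketch assuming numpy; the gamma value and column split below are illustrative, not taken from any fitted model:

```python
import numpy as np

def kproto_dissim(Xnum, Xcat, gamma):
    """Pairwise k-prototypes-style dissimilarity matrix:
    squared Euclidean on numeric parts + gamma * simple matching on categorical parts."""
    n = Xnum.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            num = np.sum((Xnum[i] - Xnum[j]) ** 2)
            cat = np.sum(Xcat[i] != Xcat[j])
            D[i, j] = D[j, i] = num + gamma * cat
    return D

# Toy example: 3 records, 2 numeric and 2 categorical attributes.
Xnum = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
Xcat = np.array([["a", "x"], ["a", "y"], ["b", "y"]])
D = kproto_dissim(Xnum, Xcat, gamma=0.5)
```

The resulting matrix could then be fed to a silhouette routine that accepts precomputed distances, e.g. `sklearn.metrics.silhouette_score(D, labels, metric="precomputed")`.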


doyager commented Mar 10, 2019

Hi @dexdimas , @nicodv , All

I am also working with k-prototypes and trying to find the optimal k value. Can you please share your experience/approach for finding the optimal k when using k-prototypes? It would be great if you could share some code and links.

Any suggestions for plotting very high-dimensional data? I am working with 56 features: 35 categorical columns (3 of which have about 10,000 categories; the others have about 10-12 categories), 11 numerical columns, and 10 binary columns, with a data size of 80 million records.

PS: I am trying to find patterns and outliers, specifically outliers that would not fit in with normal clusters. I am using health care data.

Thank you in advance , any help is appreciated.


supreetkt commented Mar 21, 2019

Hi @nicodv,

I'm working on an implementation of the silhouette score that uses dissimilarity (between each pair of elements of the array) as the distance metric and returns the optimal number of clusters, k. What other metric would you consider a good basis for the silhouette score calculation?
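For a purely categorical data set, one option along these lines is to build a precomputed simple-matching distance matrix and run the silhouette calculation on it. The sketch below reimplements the silhouette coefficient by hand to show the mechanics; it is not the package's API, and in practice `sklearn.metrics.silhouette_score(D, labels, metric="precomputed")` would do the same job:

```python
import numpy as np

def matching_dissim_matrix(X):
    """Pairwise simple-matching distance: fraction of attributes that differ."""
    X = np.asarray(X)
    n = X.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = np.mean(X[i] != X[j])
    return D

def silhouette(D, labels):
    """Mean silhouette coefficient from a precomputed distance matrix."""
    labels = np.asarray(labels)
    scores = []
    for i in range(len(labels)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself from its own cluster
        a = D[i, same].mean() if same.any() else 0.0
        b = min(D[i, labels == c].mean() for c in set(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Toy data: two clean categorical clusters, so the silhouette should be 1.0.
X = [["a", "x"], ["a", "x"], ["b", "y"], ["b", "y"]]
labels = [0, 0, 1, 1]
score = silhouette(matching_dissim_matrix(X), labels)
```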

@PabloVergara

You can use the silhouette score for the numerical variables and keep using the cost function for the full mixed data, with a small change in kprototypes.py (screenshot omitted):

and this piece of code in the implementation:

import time
from sklearn import metrics
from kmodes.kprototypes import KPrototypes

# data: pandas DataFrame with 9 numerical columns (0..8) followed by
# a categorical column at index 9
lista = []
for i in range(20, 23):
    nc = i
    start = time.time()
    kp = KPrototypes(n_clusters=nc, init='Cao', n_init=22, verbose=1,
                     random_state=4, n_jobs=8)
    clusters = kp.fit_predict(data.values, categorical=[9])
    end = time.time()
    # the silhouette score is computed on the numerical columns only;
    # kp.best comes from the patch to kprototypes.py shown above
    # (it is not a standard kmodes attribute)
    lista.append([i,
                  'Silhouette Coefficient: %0.3f' % metrics.silhouette_score(data.iloc[:, 0:9], kp.labels_),
                  'cost: %0.3f' % kp.cost_,
                  'time (s): %0.3f' % (end - start),
                  'best run: %0.3f' % (list(kp.best.keys())[0] + 1)])

This gives a partial result (screenshot omitted).

@matiasscorsetti

Hello,

How do you calculate the silhouette score for k-prototypes if I have a silhouette score for the categorical data (Hamming) and a silhouette score for the numerical data (Euclidean)?
Should I take a weighted average of the two coefficients according to the gamma value?

How would this weighted average be calculated?

It could be done this way:

( silhouette_category * kp.gamma ) + ( silhouette_numeric * (1 - kp.gamma ) )

thanks


arnaud-nt2i commented Mar 12, 2021

@matiasscorsetti
gamma is not in [0, 1] (a proportionality coefficient) but in [0, +inf).

From reading the R implementation of "silhouette_kproto" (line 1134 in the RDocumentation source; gamma is called lambda there), it seems to me they weight the two silhouette values as follows:
( silhouette_category * gamma ) + ( silhouette_numeric )

but I may be wrong...

an idea @nicodv ?
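One way to sidestep the score-weighting question entirely (a sketch, not klaR's or this package's API) is to combine the two distance matrices first, using the same gamma the model used, and compute a single silhouette from the combined matrix. The matrices and gamma below are illustrative toy values:

```python
import numpy as np

def combined_distances(Dnum, Dcat, gamma):
    """Mix numeric and categorical distance matrices with the model's gamma,
    mirroring how the k-prototypes cost combines the two parts."""
    return Dnum + gamma * Dcat

# Toy symmetric distance matrices for 3 records (illustrative values only).
Dnum = np.array([[0.0, 1.0, 4.0],
                 [1.0, 0.0, 2.0],
                 [4.0, 2.0, 0.0]])
Dcat = np.array([[0.0, 1.0, 2.0],
                 [1.0, 0.0, 1.0],
                 [2.0, 1.0, 0.0]])
D = combined_distances(Dnum, Dcat, gamma=0.5)
```

The combined matrix `D` could then be passed to e.g. `sklearn.metrics.silhouette_score(D, labels, metric="precomputed")`, avoiding any averaging of two separately computed silhouettes.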
