How to achieve same clustering result #143

Open
xuesongle opened this Issue Feb 15, 2018 · 2 comments

xuesongle commented Feb 15, 2018

Hi, I am using a clustering algorithm to cluster all the SIFT points extracted from 3 images (I intend to create a codebook for bag of visual words). I notice that the generated centroids are different each time I run the program. I understand that the random selection of seeds gives different KMeans clustering results. Is there any class in the library that lets me get the same clustering result on every run, for example through a fixed set of seeds or a fixed selection algorithm? Thanks.

Member

jonhare commented Feb 15, 2018

Sure, there are a couple of ways to do this:

  1. On your xxxKMeans instance, call setInit() with your own custom xxxKMeansInit instance in which you control the random seed (or do the init in a totally different way if you like).
  2. The default xxxKMeansInit uses randomly selected points from the ones being clustered. It does this by calling the getRandomRows method of a DataSource<xxx[]> object that represents the data you've passed in (if you pass in a native array, an ArrayBackedDataSource<xxx[]> is created for you). If you create the ArrayBackedDataSource<xxx[]> yourself, you can set the random generator in the constructor and thus control the seed yourself.
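
To illustrate the idea behind both options, here is a minimal, library-independent sketch of seeded initialisation. It does not use the OpenIMAJ classes named above; the class `SeededInitSketch` and its methods `pickRows` and `initialCentroids` are hypothetical names made up for this example. It only shows that drawing the initial centroids from a `java.util.Random` with a fixed seed gives the same starting points on every run.

```java
import java.util.Arrays;
import java.util.Random;

/** Illustration only: hypothetical helper, not part of OpenIMAJ. */
final class SeededInitSketch {

    /**
     * Picks k distinct row indices out of n with a partial Fisher-Yates shuffle.
     * The same seed always yields the same indices.
     */
    static int[] pickRows(int n, int k, long seed) {
        if (k < 0 || k > n) {
            throw new IllegalArgumentException("need 0 <= k <= n");
        }
        Random rng = new Random(seed); // fixed seed => reproducible sequence
        int[] idx = new int[n];
        for (int i = 0; i < n; i++) {
            idx[i] = i;
        }
        for (int i = 0; i < k; i++) {
            int j = i + rng.nextInt(n - i); // uniform in [i, n)
            int tmp = idx[i];
            idx[i] = idx[j];
            idx[j] = tmp;
        }
        return Arrays.copyOf(idx, k);
    }

    /** Copies the selected rows so they can serve as initial centroids. */
    static byte[][] initialCentroids(byte[][] data, int k, long seed) {
        int[] rows = pickRows(data.length, k, seed);
        byte[][] centroids = new byte[k][];
        for (int i = 0; i < k; i++) {
            centroids[i] = data[rows[i]].clone();
        }
        return centroids;
    }

    public static void main(String[] args) {
        // Stand-in data: 1207 random 128-dimensional byte descriptors.
        Random fill = new Random(7);
        byte[][] data = new byte[1207][128];
        for (byte[] row : data) {
            fill.nextBytes(row);
        }
        byte[][] a = initialCentroids(data, 200, 42L);
        byte[][] b = initialCentroids(data, 200, 42L);
        System.out.println(Arrays.deepEquals(a, b)); // prints true
    }
}
```

With the real library, the same principle would apply by passing a seeded random generator into whichever init or data-source object you construct yourself, as described in the two options above.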

I'm not sure what your use case is. In practice, however, you might be better off creating the codebook once and saving the centroids; you can then load them and reuse them as often as you like to create BoVW representations for new images without the overhead of redoing the clustering.
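
As a rough sketch of the "cluster once, then reuse" approach, the following writes a set of byte centroids to a file and reads them back using plain `java.io`. This is an assumption-laden illustration, not the library's own serialisation: the class `CentroidStoreSketch`, its methods, and the file layout (count, dimension, then raw bytes) are all made up for this example, and it assumes every centroid has the same length.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

/** Illustration only: hypothetical helper with a made-up file layout. */
final class CentroidStoreSketch {

    /** Writes the number of centroids, their dimension, then the raw bytes. */
    static void save(byte[][] centroids, File file) {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(file)))) {
            out.writeInt(centroids.length);
            out.writeInt(centroids.length == 0 ? 0 : centroids[0].length);
            for (byte[] c : centroids) {
                out.write(c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads centroids previously written by save(). */
    static byte[][] load(File file) {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            int k = in.readInt();
            int dim = in.readInt();
            byte[][] centroids = new byte[k][dim];
            for (byte[] row : centroids) {
                in.readFully(row);
            }
            return centroids;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Saves to a temporary file, reloads, and reports whether the data survived. */
    static boolean roundTrips(byte[][] centroids) {
        try {
            File tmp = File.createTempFile("centroids", ".bin");
            tmp.deleteOnExit();
            save(centroids, tmp);
            return Arrays.deepEquals(centroids, load(tmp));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[][] centroids = { {1, 2, 3}, {-4, 5, 6} };
        System.out.println(roundTrips(centroids)); // prints true
    }
}
```

Because the stored centroids are loaded unchanged, every later BoVW computation uses exactly the same codebook, regardless of how the original clustering was seeded.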

Author

xuesongle commented Feb 15, 2018

Hi, jonhare:

Thanks for your prompt reply. Yes, I did save the clustering result to a binary file and reload it to recreate the ByteCentroidsResult object. I raised the question because the clustering result varies from run to run, which raises a concern about the efficiency of codebook generation. I guess this is partly because I was clustering a small data set, 1207 SIFT points into 200 classes (using the given example), so the random choice of initial centroids has a large effect. With a larger data set, for example 1207*10 SIFT points in 200 classes, the generated centroids become more stable. Thanks for your answers; they solve the puzzle.
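
The arithmetic behind that guess can be made explicit. The small sketch below (the class `PointsPerClusterSketch` is a hypothetical name used only here) computes the average number of points per cluster for the two data set sizes mentioned above. With only about 6 points per cluster on average, it seems plausible that the choice of initial centroids dominates the result, whereas about 60 points per cluster leaves more data to average over.

```java
/** Illustration only: average number of points available per cluster. */
final class PointsPerClusterSketch {

    static double averagePointsPerCluster(int points, int clusters) {
        if (clusters <= 0) {
            throw new IllegalArgumentException("clusters must be positive");
        }
        return (double) points / clusters;
    }

    public static void main(String[] args) {
        // 1207 SIFT points into 200 clusters: about 6.0 points per cluster.
        System.out.printf("%.1f%n", averagePointsPerCluster(1207, 200));
        // 1207*10 SIFT points into 200 clusters: about 60.4 points per cluster.
        System.out.printf("%.1f%n", averagePointsPerCluster(1207 * 10, 200));
    }
}
```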
