Minimum number of clusters argument #15

Closed
GarryGelade opened this issue Mar 23, 2019 · 21 comments
@GarryGelade

I think it might be useful to have a minclusters argument for the Optimal_Clusters_ type functions.

Instead of processing all numbers of clusters between 1 and maxclusters, the function would just examine the numbers between minclusters and maxclusters.

This would enable users to explore selected segments of the clustering space without having to process large numbers of cluster solutions. This would be very useful for large clustering problems which can be quite time-consuming.

Another useful option would be the ability to specify a vector of cluster numbers, such as (1,10,20,30,40,50), which would allow quick exploration of the cluster space.

Thanks
Garry
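
To make the request concrete, here is a minimal sketch of evaluating only selected cluster counts. It uses base R's stats::kmeans rather than ClusterR, and the helper name evaluate_selected_k is hypothetical, so treat it as an illustration of the idea and not as the package's implementation.

```r
# Hypothetical helper (not a ClusterR function): run k-means only for the
# requested cluster counts and collect the total within-cluster sum of squares.
evaluate_selected_k <- function(data, k_values, nstart = 5, seed = 1) {
  k_values <- sort(unique(as.integer(k_values)))
  stopifnot(all(k_values >= 1), max(k_values) < nrow(data))
  set.seed(seed)
  wss <- vapply(k_values, function(k) {
    stats::kmeans(data, centers = k, nstart = nstart)$tot.withinss
  }, numeric(1))
  data.frame(clusters = k_values, total_withinss = wss)
}

set.seed(42)
toy <- matrix(rnorm(600), ncol = 3)   # 200 toy observations, 3 columns
res <- evaluate_selected_k(toy, c(1, 10, 20, 30))
print(res)
```

Only four models are fitted here instead of thirty, which is where the time saving on large problems would come from.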

@mlampros
Owner

hello @GarryGelade and I'm sorry for the late reply,

you are right, this would actually be a nice feature. I took a look at the relevant code snippets and I have to modify

  • the R code of the 'Optimal_Clusters_KMeans'
  • the Rcpp code of the 'Optimal_Clusters_GMM' and 'Optimal_Clusters_Medoids'

The options will be

  • a number such as 6 or 9 or 12 etc.
  • a numeric vector (contiguous or non-contiguous) such as 1:5, 2:8 or c(10,20,30)
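
The two options above could be normalized into one explicit search grid roughly as follows. The helper expand_cluster_grid is hypothetical, and the assumption that a single number n means 1, 2, ..., n is taken from the behaviour described in the original request, not from the package source.

```r
# Hypothetical helper: turn a 'max_clusters'-style argument into the explicit
# vector of cluster counts to evaluate.
expand_cluster_grid <- function(max_clusters) {
  stopifnot(is.numeric(max_clusters), length(max_clusters) >= 1,
            all(max_clusters >= 1),
            all(max_clusters == round(max_clusters)))
  if (length(max_clusters) == 1) {
    seq_len(max_clusters)                    # single number: 1, 2, ..., n
  } else {
    sort(unique(as.integer(max_clusters)))   # vector: evaluate exactly these
  }
}

expand_cluster_grid(6)
expand_cluster_grid(2:8)
expand_cluster_grid(c(10, 20, 30))
```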

If you are not in a hurry, it might take a couple of days, as I'm currently working on other stuff too. In any case, I'll notify you once I upload the updated version to GitHub. Thanks.

@GarryGelade
Author

GarryGelade commented Mar 24, 2019 via email

@mlampros
Owner

mlampros commented Mar 26, 2019

@GarryGelade,

I attempted to modify the 'Optimal_Clusters_KMeans' function yesterday. It is possible, however the plotting of non-contiguous sequences might break things, so I'd have to re-implement it from scratch and I currently do not have the time. Is plotting of sequences (except for single values) a requirement for you, or did you only have in mind a vector holding the results of the evaluation metric?

@GarryGelade
Author

GarryGelade commented Mar 26, 2019 via email

@mlampros
Owner

@GarryGelade,

I've completed the first function, 'Optimal_Clusters_KMeans' (it applies to both KMeans_rcpp and MiniBatchKmeans). Please read the NEWS.md file about the limitations. If this is one of the functions that you intended to use, please give it a try and let me know. I've also added test cases for the applicable 'criteria'. You can download the updated version (1.1.9) using

devtools::install_github('mlampros/ClusterR')

I'll continue tomorrow with the other two functions ('Optimal_Clusters_GMM' and 'Optimal_Clusters_Medoids').

@GarryGelade
Author

GarryGelade commented Mar 27, 2019 via email

@mlampros
Owner

@GarryGelade,

can you try with 'dependencies = FALSE'? The problem appears during the removal of the old version of 'Rcpp'. So if you have 'Rcpp' installed and its version is >= 0.12.5, use

devtools::install_github('mlampros/ClusterR', dependencies = FALSE)
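
Before skipping dependencies, it may help to confirm that the installed Rcpp really meets the minimum version mentioned above. The helper name below is hypothetical, and the check is only a sketch using base R utilities.

```r
# Hypothetical helper: TRUE if Rcpp is installed and at least 'min_version'.
rcpp_is_recent_enough <- function(min_version = "0.12.5") {
  installed <- nzchar(system.file(package = "Rcpp"))
  installed && utils::packageVersion("Rcpp") >= min_version
}

if (rcpp_is_recent_enough()) {
  message("Rcpp is recent enough; 'dependencies = FALSE' is worth trying.")
} else {
  message("Install or update Rcpp first, e.g. install.packages('Rcpp').")
}
```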

@GarryGelade
Author

GarryGelade commented Mar 27, 2019 via email

@GarryGelade
Author

GarryGelade commented Mar 27, 2019 via email

@mlampros
Owner

@GarryGelade,

I've uploaded the updated versions of the other two functions too ('Optimal_Clusters_GMM' and 'Optimal_Clusters_Medoids'). I'll keep this issue open for a few days before I upload the updated version to CRAN, so let me know if they do not work as expected.

@GarryGelade
Author

GarryGelade commented Mar 28, 2019 via email

@mlampros
Owner

mlampros commented Mar 28, 2019

@GarryGelade thanks for making me aware of this error,

I tried to reproduce this error with the 3 data sets included in the ClusterR package

  • dietary_survey_IBS
  • soybean
  • mushroom (I passed a distance matrix to 'Optimal_Clusters_Medoids' using the Gower distance)

but I can't reproduce your error with 'max_clusters = 2', which means the error might have to do with your data set. Would you mind sharing a reproducible example using your 'train.data.std' (if possible), so that I can fix the error and add a test case for it? Thanks.

@GarryGelade
Author

GarryGelade commented Mar 29, 2019 via email

@mlampros
Owner

hi @GarryGelade,

if you also receive the error with a subset of your initial data, then you can send me the subset.

@GarryGelade
Author

GarryGelade commented Mar 30, 2019 via email

@GarryGelade
Author

GarryGelade commented Mar 30, 2019 via email

@GarryGelade
Author

GarryGelade commented Mar 30, 2019 via email

@GarryGelade
Author

GarryGelade commented Mar 30, 2019 via email

@mlampros
Owner

hi @GarryGelade,

what do you mean by 'First half of data' and 'Second half of data'? I don't see any observations. Are you trying to upload the data to a specific account? Thanks.

@mlampros
Owner

@GarryGelade just an additional note,

it is highly probable that the std::bad_alloc error that you receive is related to the size of your data and the RAM of your personal computer.
You receive this error in the Optimal_Clusters_Medoids() function, which takes your data and computes a distance matrix. That means that if your data consists of 100,000 observations, the Optimal_Clusters_Medoids() function will first attempt to build a distance matrix of size 100,000 x 100,000.
There are some threads on the web which can give you a hint on how much memory your data set will require, such as this one.
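
As a rough back-of-the-envelope check, the memory for such a matrix can be estimated as follows. This sketch assumes a dense matrix of 8-byte doubles; the helper name is hypothetical and the package's actual memory use may differ.

```r
# Rough estimate: a dense n x n matrix of doubles needs n^2 * 8 bytes.
distance_matrix_gib <- function(n_obs, bytes_per_value = 8) {
  (as.numeric(n_obs)^2 * bytes_per_value) / 1024^3
}

distance_matrix_gib(1e5)   # about 74.5 GiB
distance_matrix_gib(1e4)   # about 0.75 GiB
```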
If this is the case, then I would suggest that you use the Clara Medoids function when you compute the optimal clusters, which performs clustering based on samples of the input data set. You can find more information about clara_samples and clara_sample_size in the package documentation:

Optimal_Clusters_Medoids(data, 
                         max_clusters, 
                         distance_metric,
                         criterion = "dissimilarity", 
                         clara_samples = 0,
                         clara_sample_size = 0, 
                         minkowski_p = 1, 
                         swap_phase = TRUE,
                         threads = 1, 
                         verbose = FALSE, 
                         plot_clusters = TRUE,
                         seed = 1)
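
A call that switches on the Clara sampling might look like the sketch below. The argument names come from the signature above. The values are illustrative, the random matrix stands in for a real data set, and clara_sample_size is assumed to be the fraction of rows drawn per sample, so please check the package documentation before relying on it.

```r
# Sketch only: runs if the ClusterR package is installed.
if (requireNamespace("ClusterR", quietly = TRUE)) {
  set.seed(1)
  x <- matrix(rnorm(2000 * 4), ncol = 4)   # stand-in for a large data set

  opt <- ClusterR::Optimal_Clusters_Medoids(
    data = x,
    max_clusters = 5,
    distance_metric = "euclidean",
    criterion = "dissimilarity",
    clara_samples = 5,         # number of samples drawn from the data
    clara_sample_size = 0.2,   # assumed: fraction of rows used per sample
    plot_clusters = FALSE
  )
}
```

With non-zero clara_samples and clara_sample_size, the clustering works on samples, so the full 2000 x 2000 distance matrix should not need to be built.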

@stale

stale bot commented Apr 11, 2019

This is Robo-lampros because the Human-lampros is lazy. This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 7 days if no further activity occurs. Feel free to re-open a closed issue and the Human-lampros will respond.

@stale stale bot added the stale label Apr 11, 2019
@stale stale bot closed this as completed Apr 18, 2019