Minimum number of clusters argument #15

GarryGelade · 2019-03-23T15:30:12Z

I think it might be useful to have a minclusters argument for the Optimal_Clusters_ type functions.

Instead of processing all numbers of clusters between 1 and maxclusters, the function would just examine the numbers between minclusters and maxclusters.

This would enable users to explore selected segments of the clustering space without having to process large numbers of cluster solutions. This would be very useful for large clustering problems which can be quite time-consuming.

Another useful option would be the ability to specify a vector of cluster numbers such as (1,10,20,30,40,50), which would allow quick exploration of the cluster space.

Thanks
Garry

mlampros · 2019-03-24T12:37:29Z

hello @GarryGelade and I'm sorry for the late reply,

you are right, this would actually be a nice feature. I took a look at the relevant code snippets and I have to modify

* the R code of the 'Optimal_Clusters_KMeans'
* the Rcpp code of the 'Optimal_Clusters_GMM' and 'Optimal_Clusters_Medoids'

The options will be

* a number such as 6 or 9 or 12 etc.
* a numeric vector (contiguous or non-contiguous) such as 1:5, 2:8 or c(10,20,30)

If you are not in a hurry it might take a couple of days as I currently work on other stuff too. In any case I'll notify you once I upload the updated version on Github. thanks.

GarryGelade · 2019-03-24T13:44:12Z

Dear Lampros

Great! Thanks so much.

Regards
Garry

mlampros · 2019-03-26T11:32:39Z

@GarryGelade,

I attempted to modify the 'Optimal_Clusters_KMeans' function yesterday. It is possible; however, the plotting of non-contiguous sequences might break things, so I'll have to re-implement it from scratch and I currently do not have the time. Is plotting of sequences (except for single values) a requirement for you, or were you thinking solely of a vector consisting of the results based on the evaluation metric?

GarryGelade · 2019-03-26T12:43:28Z

Dear Lampros

I am not so interested in the plot, as I can reproduce that myself. A vector of evaluation metric scores would be fine.

Regards
Garry

mlampros · 2019-03-26T21:38:40Z

@GarryGelade,

I've completed the first function, 'Optimal_Clusters_KMeans' (it applies to both KMeans_rcpp and MiniBatchKmeans). Please read the NEWS.md file about the limitations. If this is one of the functions that you intended to use, please give it a try and let me know. I've also added test cases for the applicable 'criteria'. You can download the updated version (1.1.9) using

devtools::install_github('mlampros/ClusterR')

I'll continue tomorrow with the other two functions ('Optimal_Clusters_GMM' and 'Optimal_Clusters_Medoids').

GarryGelade · 2019-03-27T10:51:55Z

Dear Lampros

Unfortunately I got an installation error

devtools::install_github('mlampros/ClusterR')
Downloading GitHub repo mlampros/ClusterR@master
from URL https://api.github.com/repos/mlampros/ClusterR/zipball/master
Installing ClusterR
Installing 1 package: ggplot2
Installing package into ‘C:/rPackages’
(as ‘lib’ is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/ggplot2_3.1.0.zip'
Content type 'application/zip' length 3623184 bytes (3.5 MB)
downloaded 3.5 MB
package ‘ggplot2’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\garry\AppData\Local\Temp\RtmpaikF3W\downloaded_packages
Installing 1 package: gmp
Installing package into ‘C:/rPackages’
(as ‘lib’ is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/gmp_0.5-13.5.zip'
Content type 'application/zip' length 1109717 bytes (1.1 MB)
downloaded 1.1 MB
package ‘gmp’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\garry\AppData\Local\Temp\RtmpaikF3W\downloaded_packages
Installing 1 package: gtools
Installing package into ‘C:/rPackages’
(as ‘lib’ is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/gtools_3.8.1.zip'
Content type 'application/zip' length 325812 bytes (318 KB)
downloaded 318 KB
package ‘gtools’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\garry\AppData\Local\Temp\RtmpaikF3W\downloaded_packages
Installing 1 package: Rcpp
Installing package into ‘C:/rPackages’
(as ‘lib’ is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/Rcpp_1.0.1.zip'
Content type 'application/zip' length 4509616 bytes (4.3 MB)
downloaded 4.3 MB
package ‘Rcpp’ successfully unpacked and MD5 sums checked
Warning: cannot remove prior installation of package ‘Rcpp’
The downloaded binary packages are in
C:\Users\garry\AppData\Local\Temp\RtmpaikF3W\downloaded_packages
Installing 1 package: RcppArmadillo
Installing package into ‘C:/rPackages’
(as ‘lib’ is unspecified)
also installing the dependency ‘Rcpp’
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/Rcpp_1.0.1.zip'
Content type 'application/zip' length 4509616 bytes (4.3 MB)
downloaded 4.3 MB
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/RcppArmadillo_0.9.300.2.0.zip'
Content type 'application/zip' length 2252589 bytes (2.1 MB)
downloaded 2.1 MB
package ‘Rcpp’ successfully unpacked and MD5 sums checked
Warning: cannot remove prior installation of package ‘Rcpp’
package ‘RcppArmadillo’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\garry\AppData\Local\Temp\RtmpaikF3W\downloaded_packages
"C:/PROGRA~1/R/R-35~1.0/bin/x64/R" --no-site-file --no-environ --no-save --no-restore --quiet CMD INSTALL \
"C:/Users/garry/AppData/Local/Temp/RtmpaikF3W/devtools1e50491d5d36/mlampros-ClusterR-59c0cab" --library="C:/rPackages" --install-tests
ERROR: dependency 'Rcpp' is not available for package 'ClusterR'
* removing 'C:/rPackages/ClusterR'
In R CMD INSTALL
Installation failed: Command failed (1)

Any thoughts?

Garry

mlampros · 2019-03-27T11:16:40Z

@GarryGelade,

can you try with 'dependencies = FALSE'? The problem appears during removal of the old version of 'Rcpp'. So in case you have 'Rcpp' installed and its version is >= 0.12.5, use

devtools::install_github('mlampros/ClusterR', dependencies = FALSE)
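Before using 'dependencies = FALSE' it may help to confirm that the installed 'Rcpp' meets the >= 0.12.5 requirement. The following is only a sketch in base R; the variable names are illustrative, and it deliberately avoids loading 'Rcpp' so that the package files are not locked by the running session:

```r
# Check whether Rcpp is installed without loading it.
# system.file() returns "" when the package cannot be found.
rcpp_installed <- nzchar(system.file(package = "Rcpp"))

# packageVersion() reads the installed DESCRIPTION file; the comparison
# against a version string is supported for 'package_version' objects.
rcpp_ok <- rcpp_installed && utils::packageVersion("Rcpp") >= "0.12.5"

if (rcpp_ok) {
  message("Rcpp >= 0.12.5 found; 'dependencies = FALSE' should be enough.")
} else {
  message("Install or update Rcpp first, ideally in a fresh R session.")
}
```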

GarryGelade · 2019-03-27T12:20:06Z

Thanks. It was actually a session problem. When I restarted R clean the installation worked.

GarryGelade · 2019-03-27T15:20:26Z

Dear Lampros

Looks like it works!

# assumes ClusterR, ggplot2 and magrittr (for %>%) are attached
maxclus <- c(1, 10, 20, 30, 40, 50, 60, 70)
optK <- Optimal_Clusters_KMeans(train.data.std, maxclus, criterion = "BIC", fK_threshold = 0.85,
                                num_init = 1, max_iters = 100, initializer = "kmeans++", tol = 1e-04,
                                plot_clusters = TRUE, verbose = TRUE, tol_optimal_init = 0.3, seed = 1)
bic <- cbind(optK, maxclus) %>% as.data.frame()
names(bic) <- c("BIC", "nclus")
ggplot(bic, aes(y = BIC, x = nclus)) + geom_line() + geom_point() +
  theme_bw() + geom_vline(xintercept = 16, linetype = "dotted")

Regards,
Garry

mlampros · 2019-03-27T20:14:28Z

@GarryGelade,

I uploaded the updated versions of the other two functions too ('Optimal_Clusters_GMM' and 'Optimal_Clusters_Medoids'). I'll keep this issue open for a few days before I upload the updated version to CRAN, so let me know in case they do not work as expected.

GarryGelade · 2019-03-28T01:22:05Z

GMM seems to be working OK, but I get an error with medoids

maxclus <- 2
opt <- Optimal_Clusters_Medoids(train.data.std, maxclus, distance_metric = "euclidean",
                                criterion = "dissimilarity", clara_samples = 0, clara_sample_size = 0,
                                minkowski_p = 1, swap_phase = TRUE, threads = 1, verbose = FALSE,
                                plot_clusters = FALSE, seed = 1)

Error in OptClust(data, pass_vector, distance_metric, FALSE, clara_samples, :
  std::bad_alloc

mlampros · 2019-03-28T07:16:01Z

@GarryGelade thanks for making me aware of this error,

I tried to reproduce this error with the 3 data sets included in the ClusterR package

* dietary_survey_IBS
* soybean
* mushroom (I passed a distance matrix to the 'Optimal_Clusters_Medoids' using the gower distance)

but I can't reproduce your error with 'max_clusters = 2', which means this error might have to do with your data set. Would you mind sharing a reproducible example using your 'train.data.std' (if possible) so that I can fix the error and add a test case for this purpose. thanks.

GarryGelade · 2019-03-29T18:55:23Z

Dear Lampros

I tried to send you my data, but it is 4Mb, and it seems to have been rejected by the server. I will try to zip it.

This is the mail system at host outmx-028.london.gridhost.co.uk.

I'm sorry to have to inform you that your message could not be delivered to one or more recipients. It's attached below.

For further assistance, please send mail to postmaster. If you do so, please include this problem report. You can delete your own text from the attached returned message.

The mail system

<reply@reply.github.com>: message size 5701448 exceeds size limit 5120000 of server in-2.smtp.github.com[192.30.253.171]

mlampros · 2019-03-30T06:43:55Z

hi @GarryGelade,

if you also receive the error with a subset of your initial data, then you can send me the subset.

GarryGelade · 2019-03-30T08:07:22Z

Dear Lampros

The size of the data makes a difference. When I use a dataset of 50000 examples, my computer completely freezes.

I will send you the data in 2 parts.

Train.data.std1.RDS = rows 1:50000
Train.data.std2.RDS = rows 50001:100000

Regards

GarryGelade · 2019-03-30T08:08:31Z

First half of data

GarryGelade · 2019-03-30T08:09:17Z

Second half of data

GarryGelade · 2019-03-30T08:14:28Z

NB if I only use 500 rows, the function performs OK, so the problem is something to do with large datasets.

mlampros · 2019-03-30T13:54:48Z

hi @GarryGelade,

what do you mean by 'First half of data' and 'Second half of data'? I don't see any observations. Are you attempting to upload the data to a specific account? thanks.

mlampros · 2019-03-30T19:34:39Z

@GarryGelade just an additional note,

it is highly probable that the std::bad_alloc error that you receive is related to the size of your data and the RAM of your personal computer.
You receive this error in the Optimal_Clusters_Medoids() function, which takes your data and computes a distance matrix. That means if your data consists of 100,000 observations, the Optimal_Clusters_Medoids() function will first attempt to build a distance matrix of size 100,000 x 100,000.
There are some threads on the web which can give you a hint on how much memory your data set will occupy (require), such as this one.
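As a rough back-of-the-envelope check, the memory for such a matrix can be estimated in base R. This is only a sketch: it assumes the distance matrix is stored densely with 8-byte doubles, and the helper name dist_matrix_gib is made up for illustration.

```r
# Approximate memory (in GiB) for a dense n x n matrix of doubles.
# Assumption: 8 bytes per value and no sparse or triangular storage.
dist_matrix_gib <- function(n_obs, bytes_per_value = 8) {
  n_obs^2 * bytes_per_value / 1024^3
}

dist_matrix_gib(500)     # roughly 0.002 GiB
dist_matrix_gib(50000)   # roughly 18.6 GiB
dist_matrix_gib(100000)  # roughly 74.5 GiB
```

Under these assumptions a 500-row subset is trivial, while 50,000 rows or more would already exceed the RAM of most personal computers.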
If this is the case, I would suggest that you use the Clara Medoids function when you compute the optimal clusters, which performs clustering based on samples of the input data set. You can find more information about the clara_samples and clara_sample_size parameters in the package documentation:

Optimal_Clusters_Medoids(data, 
                         max_clusters, 
                         distance_metric,
                         criterion = "dissimilarity", 
                         clara_samples = 0,
                         clara_sample_size = 0, 
                         minkowski_p = 1, 
                         swap_phase = TRUE,
                         threads = 1, 
                         verbose = FALSE, 
                         plot_clusters = TRUE,
                         seed = 1)
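A call with the Clara options switched on might look like the sketch below. The synthetic matrix stands in for 'train.data.std', and the values 5 and 0.2 are assumptions chosen for illustration, not recommendations; as far as I understand the package documentation, clara_samples is the number of samples to draw and clara_sample_size is the fraction of rows used in each sample.

```r
# Sketch only: runs when ClusterR is installed, using made-up data.
if (requireNamespace("ClusterR", quietly = TRUE)) {
  set.seed(1)
  toy_data <- matrix(rnorm(2000 * 5), ncol = 5)  # stand-in for 'train.data.std'

  opt_clara <- ClusterR::Optimal_Clusters_Medoids(
    toy_data,
    max_clusters = 5,
    distance_metric = "euclidean",
    criterion = "dissimilarity",
    clara_samples = 5,        # assumed value: number of samples to draw
    clara_sample_size = 0.2,  # assumed value: fraction of rows per sample
    minkowski_p = 1,
    swap_phase = TRUE,
    threads = 1,
    verbose = FALSE,
    plot_clusters = FALSE,
    seed = 1
  )
}
```

With sampling enabled, each distance computation works on a subset of the rows rather than on the full data set, which is what keeps the memory requirement manageable.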

stale · 2019-04-11T20:16:44Z

This is Robo-lampros because the Human-lampros is lazy. This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 7 days if no further activity occurs. Feel free to re-open a closed issue and the Human-lampros will respond.

stale bot added the stale label Apr 11, 2019

stale bot closed this as completed Apr 18, 2019
