Minimum number of clusters argument #15
Comments
Hello @GarryGelade, and sorry for the late reply. You are right, this would actually be a nice feature. I took a look at the relevant code snippets and I have to modify
* the R code of 'Optimal_Clusters_KMeans'
* the Rcpp code of 'Optimal_Clusters_GMM' and 'Optimal_Clusters_Medoids'
The options will be
* a number such as 6 or 9 or 12 etc.
* a numeric vector (contiguous or non-contiguous) such as 1:5, 2:8 or c(10,20,30)
If you are not in a hurry, it might take a couple of days, as I am currently working on other stuff too. In any case I'll notify you once I upload the updated version to GitHub. Thanks. |
Dear Lampros
Great! Thanks so much.
Regards
Garry
|
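The two accepted forms listed above (a single number, or a numeric vector that may be non-contiguous) could be reduced to one vector of cluster counts before any model is fitted. A minimal sketch in R follows; `normalize_clusters` is a hypothetical helper and not part of ClusterR, and expanding a single number to 1..n is an assumption based on the existing behaviour described in this issue.

```r
# Hypothetical helper (not a ClusterR function): turn the user-supplied
# argument into a sorted vector of unique, positive cluster counts.
normalize_clusters <- function(max_clusters) {
  if (!is.numeric(max_clusters) || length(max_clusters) == 0 || anyNA(max_clusters)) {
    stop("'max_clusters' must be a non-empty numeric vector without NAs")
  }
  if (any(max_clusters < 1) || any(max_clusters != round(max_clusters))) {
    stop("'max_clusters' must contain positive whole numbers")
  }
  if (length(max_clusters) == 1) {
    # Assumption: a single number keeps its old meaning, i.e. 1..max_clusters
    return(seq_len(max_clusters))
  }
  sort(unique(as.integer(max_clusters)))
}

normalize_clusters(6)              # single number
normalize_clusters(2:8)            # contiguous vector
normalize_clusters(c(10, 20, 30))  # non-contiguous vector
```

Validating and de-duplicating up front means the fitting loop only ever sees a clean vector, whichever form the user passed.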
I attempted to modify the 'Optimal_Clusters_KMeans' function yesterday. It is possible; however, plotting non-contiguous sequences might break things, so I would have to re-implement the plotting from scratch, and I currently do not have the time. Is plotting of sequences (except for single values) a requirement for you, or did you have in mind only a vector of results for the evaluation metric? |
Dear Lampros
I am not so interested in the plot, as I can reproduce that myself. A vector of evaluation metric scores would be fine.
Regards
Garry
|
I've completed the first function, 'Optimal_Clusters_KMeans' (applies to both KMeans_rcpp and MiniBatchKmeans). Please read the NEWS.md file <https://github.com/mlampros/ClusterR/blob/master/NEWS.md> about the limitations. If this is one of the functions that you intended to use, please give it a try and let me know. I've also added test cases for the applicable 'criteria'. You can download the updated version (1.1.9) using devtools::install_github('mlampros/ClusterR')
I'll continue tomorrow with the other two functions ('Optimal_Clusters_GMM' and 'Optimal_Clusters_Medoids'). |
Dear Lampros
Unfortunately I got an installation error:
devtools::install_github('mlampros/ClusterR')
Downloading GitHub repo mlampros/ClusterR@master
from URL https://api.github.com/repos/mlampros/ClusterR/zipball/master
Installing ClusterR
Installing 1 package: ggplot2
Installing package into ‘C:/rPackages’
(as ‘lib’ is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/ggplot2_3.1.0.zip'
Content type 'application/zip' length 3623184 bytes (3.5 MB)
downloaded 3.5 MB
package ‘ggplot2’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\garry\AppData\Local\Temp\RtmpaikF3W\downloaded_packages
Installing 1 package: gmp
Installing package into ‘C:/rPackages’
(as ‘lib’ is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/gmp_0.5-13.5.zip'
Content type 'application/zip' length 1109717 bytes (1.1 MB)
downloaded 1.1 MB
package ‘gmp’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\garry\AppData\Local\Temp\RtmpaikF3W\downloaded_packages
Installing 1 package: gtools
Installing package into ‘C:/rPackages’
(as ‘lib’ is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/gtools_3.8.1.zip'
Content type 'application/zip' length 325812 bytes (318 KB)
downloaded 318 KB
package ‘gtools’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\garry\AppData\Local\Temp\RtmpaikF3W\downloaded_packages
Installing 1 package: Rcpp
Installing package into ‘C:/rPackages’
(as ‘lib’ is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/Rcpp_1.0.1.zip'
Content type 'application/zip' length 4509616 bytes (4.3 MB)
downloaded 4.3 MB
package ‘Rcpp’ successfully unpacked and MD5 sums checked
Warning: cannot remove prior installation of package ‘Rcpp’
The downloaded binary packages are in
C:\Users\garry\AppData\Local\Temp\RtmpaikF3W\downloaded_packages
Installing 1 package: RcppArmadillo
Installing package into ‘C:/rPackages’
(as ‘lib’ is unspecified)
also installing the dependency ‘Rcpp’
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/Rcpp_1.0.1.zip'
Content type 'application/zip' length 4509616 bytes (4.3 MB)
downloaded 4.3 MB
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/RcppArmadillo_0.9.300.2.0.zip'
Content type 'application/zip' length 2252589 bytes (2.1 MB)
downloaded 2.1 MB
package ‘Rcpp’ successfully unpacked and MD5 sums checked
Warning: cannot remove prior installation of package ‘Rcpp’
package ‘RcppArmadillo’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\garry\AppData\Local\Temp\RtmpaikF3W\downloaded_packages
"C:/PROGRA~1/R/R-35~1.0/bin/x64/R" --no-site-file --no-environ --no-save --no-restore --quiet CMD INSTALL \
"C:/Users/garry/AppData/Local/Temp/RtmpaikF3W/devtools1e50491d5d36/mlampros-ClusterR-59c0cab" --library="C:/rPackages" --install-tests
ERROR: dependency 'Rcpp' is not available for package 'ClusterR'
* removing 'C:/rPackages/ClusterR'
In R CMD INSTALL
Installation failed: Command failed (1)
Any thoughts?
Garry
|
Can you try with 'dependencies = FALSE'? The problem appears during the removal of the old version of 'Rcpp'. So if you have 'Rcpp' installed and its version is >= 0.12.5, use devtools::install_github('mlampros/ClusterR', dependencies = FALSE)
|
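The condition in the suggestion above (Rcpp installed, version >= 0.12.5) can be checked before installing. The sketch below is only an illustration: `rcpp_is_recent_enough` is a hypothetical helper, and the install call is left commented out so that nothing is installed by running it. Both checks read package metadata without loading Rcpp, which should matter on Windows, where a package that is loaded in the session generally cannot be replaced.

```r
# Hypothetical helper: TRUE when an installed Rcpp already satisfies the
# minimum version quoted in this thread (0.12.5). Rcpp itself is not loaded.
rcpp_is_recent_enough <- function(min_version = "0.12.5") {
  installed <- nzchar(system.file(package = "Rcpp"))
  installed && utils::packageVersion("Rcpp") >= min_version
}

if (rcpp_is_recent_enough()) {
  message("Rcpp is recent enough; installing without dependencies is an option")
  # devtools::install_github('mlampros/ClusterR', dependencies = FALSE)
} else {
  message("Install or update Rcpp first, ideally from a fresh R session")
}
```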
Thanks. It was actually a session problem. When I restarted R clean the installation worked.
|
Dear Lampros
Looks like it works!
library(ClusterR)
library(ggplot2)
library(magrittr)  # provides %>%
maxclus <- c(1, 10, 20, 30, 40, 50, 60, 70)
optK <- Optimal_Clusters_KMeans(train.data.std, maxclus, criterion = "BIC",
                                fK_threshold = 0.85, num_init = 1, max_iters = 100,
                                initializer = "kmeans++", tol = 1e-04, plot_clusters = TRUE,
                                verbose = TRUE, tol_optimal_init = 0.3, seed = 1)
# one BIC score per requested number of clusters
bic <- cbind(optK, maxclus) %>% as.data.frame()
names(bic) <- c("BIC", "nclus")
# the trailing '+' keeps theme_bw() and geom_vline() part of the same plot
ggplot(bic, aes(y=BIC, x = nclus)) + geom_line() + geom_point() +
  theme_bw() + geom_vline(xintercept = 16, linetype="dotted")
Regards, Garry
|
I uploaded the updated versions of the other two functions too ('Optimal_Clusters_GMM' and 'Optimal_Clusters_Medoids'). I'll keep this issue open for a few days before I upload the updated version to CRAN, so let me know if they do not work as expected. |
GMM seems to be working OK, but I get an error with medoids:
maxclus <- 2
opt <- Optimal_Clusters_Medoids(train.data.std, maxclus, distance_metric = "euclidean",
criterion = "dissimilarity", clara_samples = 0,
clara_sample_size = 0, minkowski_p = 1, swap_phase = TRUE,
threads = 1, verbose = FALSE, plot_clusters = FALSE, seed = 1)
Error in OptClust(data, pass_vector, distance_metric, FALSE, clara_samples, :
std::bad_alloc
|
@GarryGelade thanks for making me aware of this error. I tried to reproduce it with the 3 data sets included in the ClusterR package
* dietary_survey_IBS
* soybean
* mushroom (I passed a distance matrix to 'Optimal_Clusters_Medoids' using the Gower distance)
but I can't reproduce your error with 'max_clusters = 2', which means the error might have to do with your data set. Would you mind sharing a reproducible example using your 'train.data.std' (if possible), so that I can fix the error and add a test case for it? Thanks. |
Dear Lampros
I tried to send you my data, but it is 4 MB, and it seems to have been rejected by the server. I will try to zip it.
This is the mail system at host outmx-028.london.gridhost.co.uk.
I'm sorry to have to inform you that your message could not be delivered to one or more recipients. It's attached below.
For further assistance, please send mail to postmaster.
If you do so, please include this problem report. You can delete your own text from the attached returned message.
The mail system
<reply@reply.github.com>:
message size 5701448 exceeds size limit 5120000 of server
in-2.smtp.github.com[192.30.253.171]
|
Hi @GarryGelade, if you also receive the error with a subset of your initial data, then you can send me the subset. |
Dear Lampros
The size of the data makes a difference.
When I use a dataset of 50000 examples, my computer completely freezes.
I will send you the data in 2 parts.
Train.data.std1.RDS = rows 1:50000
Train.data.std2.RDS = rows 50001:100000
Regards
|
First half of data
|
Second half of data
|
NB if I only use 500 rows, the function performs OK, so the problem is something to do with large datasets.
|
Hi @GarryGelade, what do you mean by 'First half of data' and 'Second half of data'? I don't see any observations. Are you trying to upload the data to a specific account? Thanks. |
@GarryGelade just an additional note: it is highly probable that the std::bad_alloc error that you receive is related to the size of your data and the RAM of your computer.
Optimal_Clusters_Medoids(data,
                         max_clusters,
                         distance_metric,
                         criterion = "dissimilarity",
                         clara_samples = 0,
                         clara_sample_size = 0,
                         minkowski_p = 1,
                         swap_phase = TRUE,
                         threads = 1,
                         verbose = FALSE,
                         plot_clusters = TRUE,
                         seed = 1)
|
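A back-of-the-envelope calculation makes the memory explanation plausible. std::bad_alloc is the C++ exception raised when a memory allocation fails. The sketch below assumes (this is not confirmed in the thread) that with clara_samples = 0 and clara_sample_size = 0 the medoids code builds a dense n x n dissimilarity matrix of 8-byte doubles; `dissimilarity_gib` is a hypothetical helper written only for this estimate.

```r
# Rough size, in GiB, of a dense n x n matrix of doubles (8 bytes each).
# Assumption: the full pairwise dissimilarity matrix is held in memory.
dissimilarity_gib <- function(n_rows) {
  as.numeric(n_rows)^2 * 8 / 2^30
}

dissimilarity_gib(500)    # the subset that ran without problems
dissimilarity_gib(50000)  # the subset that froze the machine
```

Under that assumption, 500 rows need about 2 MB while 50000 rows need roughly 18.6 GiB, which would exceed the RAM of most desktop machines.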
This is Robo-lampros because the Human-lampros is lazy. This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 7 days if no further activity occurs. Feel free to re-open a closed issue and the Human-lampros will respond. |
I think it might be useful to have a minclusters argument for the Optimal_Clusters_ type functions.
Instead of processing all numbers of clusters between 1 and maxclusters, the function would just examine the numbers between minclusters and maxclusters.
This would enable users to explore selected segments of the clustering space without having to process large numbers of cluster solutions. This would be very useful for large clustering problems which can be quite time-consuming.
Another useful option would be the ability to specify a vector of cluster numbers such as (1,10,20,30,40,50), which would allow quick exploration of the cluster space.
Thanks
Garry
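To illustrate the request, here is a minimal sketch of evaluating only selected numbers of clusters. It deliberately uses base R's stats::kmeans and the total within-cluster sum of squares as a stand-in metric, not ClusterR or its criteria; `evaluate_clusters` and the toy data are made up for the example.

```r
# Sketch only: base R's stats::kmeans (not ClusterR), with the total
# within-cluster sum of squares as a stand-in evaluation metric.
evaluate_clusters <- function(data, clusters, seed = 1) {
  scores <- vapply(clusters, function(k) {
    set.seed(seed)  # same starting point for every k, for reproducibility
    stats::kmeans(data, centers = k, nstart = 5)$tot.withinss
  }, numeric(1))
  names(scores) <- clusters
  scores
}

set.seed(1)
toy <- matrix(rnorm(200 * 2), ncol = 2)   # made-up data for illustration
evaluate_clusters(toy, c(2, 5, 10, 20))   # four fits instead of twenty
```

The result is a named vector with one score per requested cluster count, which is enough to draw a custom plot afterwards.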