Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add predict methods #24

Merged
merged 10 commits into from
Oct 15, 2021
Merged

Add predict methods #24

merged 10 commits into from
Oct 15, 2021

Conversation

vspinu
Copy link
Contributor

@vspinu vspinu commented Aug 18, 2021

I love this package because it's fills the bill for all clustering methods in one place. But the interface and UI is rather non-Rish.

As a starter I am adding predict methods for cluster objects. For this I

  • have streamlined class names (preserving the old ones for compatibility). Classes with spaces (like cluster medoids silhouette) are really non-standard and are cumbersome to work with when defining generics.
  • removed classes from predictor output. Mostly because those are same classes as for the actual fitted cluster objects. Same class should not be used to represent different data structures.

If these changes are ok with you I will add tests, update vignette etc. Also, time permitting, will add print, plot and summary methods to make the UI of the package more user friendly for the seasoned R users.

With this change one can do:

     gmm <- GMM(dat, 2, "maha_dist", "random_subset", 10, 10)
     predict(gmm, dat)

instead of the current

     gmm <- GMM(dat, 2, "maha_dist", "random_subset", 10, 10)
     predict_GMM(dat, gmm$centroids, gmm$covariance_matrices, gmm$weights)

  - no spaces in class names
  - don't use same class name for different data structures
@mlampros
Copy link
Owner

mlampros commented Aug 18, 2021

@vspinu thanks for the contribution. give me some time to review the changes

out <- predict_Medoids(newdata, MEDOIDS = object$medoids,
distance_metric = object$distance_metric,
fuzzy = fuzzy, threads = threads)
if (fuzzy)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the predict_Medoids() function when the user specifies fuzzy = TRUE then the function returns 3 objects:

  • clusters
  • fuzzy_clusters
  • dissimilarity

Is your intention to follow the same output format of the predict() function as is the case with other algorithms that return either a vector or a matrix (in case of binary classification for instance)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When fuzzy = TRUE the returned object is a matrix with class probabilities. This is consistent with methods in other packages which return a matrix of probabilities/memberships.

We can add an extra argument clusters = TRUE which would return a full list. This would be similar to se.fit = TRUE argument of predict.glm for example. I don't think this is strictly necessary though given that the workhorse predict_Medoids will still stay.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, then no need of an extra parameter. I thought initially that the previous predict functions will be deprecated. Thus, it's fine the 'predict.MedoidsCluster()' function.

@mlampros
Copy link
Owner

mlampros commented Aug 18, 2021

@vspinu thanks again for the additions and the corrections (regarding the classes). I'll be glad if you add also tests and update the vignette. It will be also nice to have the print, plot and summary methods.
I'd rather prefer to keep the old predict functions than adding deprecation messages. I think keeping the old functions is also necessary for the users that already use these functions in their workflows or packages.
Once this PR is completed please mention your necessary personal information (vorname, lastname, e-mail) so that I can add you as a contributor to the package DESCRIPTION file. thanks again.

@vspinu
Copy link
Contributor Author

vspinu commented Aug 18, 2021

I'll be glad if you add also tests and update the vignette. It will be also nice to have the print, plot and summary methods.

Ok. I will spend some time on this.

I'd rather prefer to keep the old predict functions than adding deprecation messages.

Of course. I didn't mean to deprecate those.

@github-actions
Copy link

This is Robo-lampros because the Human-lampros is lazy. This PR has been automatically marked as stale because it has not had recent activity. It will be closed after 7 days if no further activity occurs.

@github-actions github-actions bot added the stale label Aug 30, 2021
@mlampros
Copy link
Owner

I've added a message so that the PR is not closed from the github-actions-bot till the changes are applied (tests, etc.)

@vspinu
Copy link
Contributor Author

vspinu commented Aug 31, 2021

Sorry for stalling this. Things started piling on my weekends suddenly, will try to resume this week.

@github-actions github-actions bot removed the stale label Aug 31, 2021
@github-actions
Copy link

This is Robo-lampros because the Human-lampros is lazy. This PR has been automatically marked as stale because it has not had recent activity. It will be closed after 7 days if no further activity occurs.

@github-actions github-actions bot added the stale label Sep 12, 2021
@vspinu
Copy link
Contributor Author

vspinu commented Sep 13, 2021

I have re-added the comment, added tests and adjusted the vignette to use predict() method. Also added print() methods. The objects are now printed succinctly as follows:

> gmm
GMM Cluster
 Call: GMM(data = dat, gaussian_comps = 3) 
 Data cols: 42 
 Centroids: 3 
> km
KMeans Cluster
 Call: KMeans_rcpp(data = dat, clusters = clust, num_init = 5, max_iters = 100,      initializer = "optimal_init") 
 Data cols: 42 
 Centroids: 2 
 BSS/SS: 0.2800772 
 SS: 1312201 = 944683.6 (WSS) + 367517.6 (BSS)
> cm
Medoids Cluster
 Call: Clara_Medoids(data = dat, clusters = 2, samples = 5, sample_size = 0.2,      swap_phase = TRUE, fuzzy = T) 
 Data cols: 42 
 Medoids: 2 
 Cluster stats:
                                  2          3
   index                        278         34
   number_obs                   226        174
   max_dissimilarity      91.879999 114.495518
   average_dissimilarity  36.482054  67.212241
   isolation               1.396992   1.740851

You can modify those as you see fit.

One problem is Kmeans_arma which returns only centroids instead of the complete structure as returned by KMeans_rcpp. So the returned object is not classed and I have added a FIXME comment there.

Regarding the UI, for consistency and simplicity, I would propose to actually have only 3 top level functions: KMeans, Medoids and GMM instead of the current (KMeans_rcpp, Kmeans_arma, Cluster_Medoids, Clara_Medoids and GMM). KMeans would have a method argument "rcpp" or "arma", Medoids() would have method argument "pam" or "clara". WDYT?

@github-actions github-actions bot removed the stale label Sep 13, 2021
@mlampros
Copy link
Owner

First of all thanks for all your implementations. Let me see the changes in the next days.

Regarding the UI, I think it's a good idea but we have to add deprecation messages for the previous functions so that the users can get accustomed to the new functions.

@github-actions
Copy link

This is Robo-lampros because the Human-lampros is lazy. This PR has been automatically marked as stale because it has not had recent activity. It will be closed after 7 days if no further activity occurs.

@github-actions github-actions bot added the stale label Sep 25, 2021
@mlampros
Copy link
Owner

I've added a message so that this PR is not closed.

Copy link
Owner

@mlampros mlampros left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks

@mlampros
Copy link
Owner

@vspinu,

can you also update the NEWS.md file by adding your changes under version 1.2.6?

@vspinu
Copy link
Contributor Author

vspinu commented Sep 30, 2021

Done! I have also bumped the dev version.

@github-actions github-actions bot removed the stale label Sep 30, 2021
@github-actions
Copy link

This is Robo-lampros because the Human-lampros is lazy. This PR has been automatically marked as stale because it has not had recent activity. It will be closed after 7 days if no further activity occurs.

@github-actions github-actions bot added the stale label Oct 12, 2021
@mlampros
Copy link
Owner

thank you for all the changes, I'll need a few days to do the following:

  • update the DESCRIPTION file with your name
  • submit the new version to CRAN

@mlampros
Copy link
Owner

mlampros commented Oct 13, 2021

@vspinu,
specific tests fail. are the failures related to your changes? It seems that the functions of the failed tests require a matrix.

@vspinu
Copy link
Contributor Author

vspinu commented Oct 13, 2021

Yes. It has to do with the fact that interally C++ is using positional references instead of names. So each time one modifies the fields of the result internal code would break.

I have fixed that by addressing with names. I had to rename a few internal slots for that. This also makes the internal names consistent with the user level output. I don't think you will object that, but please have a look at the last commit.

@mlampros
Copy link
Owner

thank you @vspinu I'll have a look today in the afternoon

@github-actions github-actions bot removed the stale label Oct 13, 2021
  Had to rename internal slots for internal consistency reason:
    - bst_sample_silhouette_matrix -> silhouette_matrix
    - bst_dissimilarity -> best_dissimilarity
    - bst_sample_dissimilarity_matrix -> dissimilarity_matrix

  This is also consistent with user level fields names.
@mlampros
Copy link
Owner

thank you very much @vspinu for all the changes. Give me a few days to update the version on CRAN.
I'll add also your details in the DESCRIPTION file, is it ok by you?

person("Vitalie", "Spinu", role = "ctb")

Let me know. thanks.

@mlampros mlampros merged commit c69e8d1 into mlampros:master Oct 15, 2021
@vspinu
Copy link
Contributor Author

vspinu commented Oct 17, 2021

I'll add also your details in the DESCRIPTION file, is it ok by you?

Sure. Thanks! I still think my contribution is a bit too slim for a contributor tag, but ok. I will try to get some more stuff in as I go with using the package.

@mlampros
Copy link
Owner

@vspinu,

I intended today to submit the new version of the ClusterR package to CRAN and I ran an R CMD build after I cloned the Github repository. However, on Linux I receive the following error:

> system("R CMD build ClusterR")
* checking for fileClusterR/DESCRIPTION... OK
* preparingClusterR:
* checking DESCRIPTION meta-information ... OK
* cleaning src
* installing the package to build vignettes
* creating vignettes ... ERROR
--- re-buildingthe_clusterR_package.Rmdusing rmarkdown
Quitting from lines 189-198 (the_clusterR_package.Rmd) 
Error: processing vignette 'the_clusterR_package.Rmd' failed with diagnostics:
no applicable method for 'predict' applied to an object of class "c('KMeansCluster', 'k-means clustering')"
--- failed re-buildingthe_clusterR_package.RmdSUMMARY: processing the following file failed:the_clusterR_package.RmdError: Vignette re-building failed.
Execution halted

Can you reproduce this error locally with the cloned repository? It seems that it's related to this pull request.

@mlampros
Copy link
Owner

I explained in another PR on how I build and check an R package that includes Rcpp code. Therefore, if you want to reproduce the error can you follow these steps?

@vspinu
Copy link
Contributor Author

vspinu commented Oct 24, 2021

I am pretty sure everything is ok and it's some strange local issue. Those methods are registered in the NAMESPACE file. Make sure you don't introduce any diffs to NAMESPACE with your process.

R CMD build ClusterR and checks do work ok for me. It also works on travis, right?

I have also triggered a winbuild (don't be surprised to see the email with a result).

@mlampros
Copy link
Owner

thanks @vspinu, the winbuild looks ok (see the attached). the github action of this PR worked too. I'm wondering why I get this error. I'll have a look again and notify you.

00check_winbuilder.log

@mlampros
Copy link
Owner

mlampros commented Oct 24, 2021

you were right about the differences in the NAMESPACE file but I do not understand why do I get these diffs? I'm not experienced with S3 Methods. What I do is I make a clean build, check and install after I remove the .Rd files of the 'man' folder and create the documentation using devtools::document(). Is there something wrong with this procedure?
For instance before the clean build, check I have

S3method(predict,GMMCluster)
S3method(predict,KMeansCluster)
S3method(predict,MedoidsCluster)
S3method(print,GMMCluster)
S3method(print,KMeansCluster)
S3method(print,MedoidsCluster)

in the NAMESPACE and after I use devtools::document() I have

S3method(print,GMMCluster)
S3method(print,KMeansCluster)
S3method(print,MedoidsCluster)

do we have to use Collate in the DESCRIPTION file now that the ClusterR package makes use of S3 methods?

@vspinu
Copy link
Contributor Author

vspinu commented Oct 24, 2021

I also just use devtools::document. Can it be an old version of devtools or roxygen2?

@mlampros
Copy link
Owner

I use the latest versions of 'devtools' (2.4.2) and 'roxygen2' (7.1.2) on Linux with R version 4.1.1 (2021-08-10)

Which Operating System do you use? Is it possible that you can run the `devtools::document()' function on a Linux machine? It seems to me that it has to do with how we document the S3 methods (a similar SO issue)

When I run devtools::document() I receive,

S3method(print,GMMCluster)
S3method(print,KMeansCluster)
S3method(print,MedoidsCluster)
export(print.GMMCluster)
export(predict.GMMCluster)
export(predict.KMeansCluster)
export(predict.MedoidsCluster)

whereas currently in this repo we have,

S3method(predict,GMMCluster)
S3method(predict,KMeansCluster)
S3method(predict,MedoidsCluster)
S3method(print,GMMCluster)
S3method(print,KMeansCluster)
S3method(print,MedoidsCluster)

The Run Rscript -e 'tic::script()' that the Github Action of this repo uses just only 'builds' and 'checks' the package and does not create the documentation or updates the NAMESPACE.

I know that 'devtools::document' is not a requirement but if I'm not able to re-create the .Rd files then I might have issues in the future in case that I want to update any function of the package

@vspinu
Copy link
Contributor Author

vspinu commented Oct 25, 2021

When I run devtools::document() I receive,

But that's exactly what is needed. It fixes your original issue, right?

The export(print.GMMCluster) is optional and is probably there because you have added #' @export print.GMMCluster. I don't think it's necessary to export those functions, but it's surely up to you.

I am on linux.

@mlampros
Copy link
Owner

But that's exactly what is needed. It fixes your original issue, right?

I'm still receiving the error,

> system("R CMD build ClusterR")
* checking for fileClusterR/DESCRIPTION... OK
* preparingClusterR:
* checking DESCRIPTION meta-information ... OK
* cleaning src
* installing the package to build vignettes
* creating vignettes ... ERROR
--- re-buildingthe_clusterR_package.Rmdusing rmarkdown
Quitting from lines 189-198 (the_clusterR_package.Rmd) 
Error: processing vignette 'the_clusterR_package.Rmd' failed with diagnostics:
no applicable method for 'predict' applied to an object of class "c('KMeansCluster', 'k-means clustering')"
--- failed re-buildingthe_clusterR_package.RmdSUMMARY: processing the following file failed:the_clusterR_package.RmdError: Vignette re-building failed.
Execution halted

and my NAMESPACE file includes the following,

# Generated by roxygen2: do not edit by hand

S3method(print,GMMCluster)
S3method(print,KMeansCluster)
S3method(print,MedoidsCluster)
export(AP_affinity_propagation)
export(AP_preferenceRange)
export(Clara_Medoids)
export(Cluster_Medoids)
export(GMM)
export(KMeans_arma)
export(KMeans_rcpp)
export(MiniBatchKmeans)
export(Optimal_Clusters_GMM)
export(Optimal_Clusters_KMeans)
export(Optimal_Clusters_Medoids)
export(Silhouette_Dissimilarity_Plot)
export(center_scale)
export(distance_matrix)
export(external_validation)
export(plot_2d)
export(predict.GMMCluster)
export(predict.KMeansCluster)
export(predict.MedoidsCluster)
export(predict_GMM)
export(predict_KMeans)
export(predict_MBatchKMeans)
export(predict_Medoids)
import(gtools)
importFrom(Rcpp,evalCpp)
importFrom(ggplot2,aes)
importFrom(ggplot2,element_blank)
importFrom(ggplot2,geom_point)
importFrom(ggplot2,ggplot)
importFrom(ggplot2,scale_shape_manual)
importFrom(ggplot2,scale_size_manual)
importFrom(ggplot2,theme)
importFrom(gmp,as.bigz)
importFrom(gmp,asNumeric)
importFrom(gmp,chooseZ)
importFrom(grDevices,dev.cur)
importFrom(grDevices,dev.off)
importFrom(graphics,abline)
importFrom(graphics,axis)
importFrom(graphics,barplot)
importFrom(graphics,legend)
importFrom(graphics,lines)
importFrom(graphics,mtext)
importFrom(graphics,par)
importFrom(graphics,plot)
importFrom(graphics,text)
importFrom(graphics,title)
importFrom(stats,na.omit)
importFrom(utils,globalVariables)
importFrom(utils,setTxtProgressBar)
importFrom(utils,txtProgressBar)
useDynLib(ClusterR, .registration = TRUE)

I don't know why and you can't reproduce it although on linux

@vspinu
Copy link
Contributor Author

vspinu commented Oct 27, 2021

That's a bit crazy. CMD build is not touching the namespace. So what is changing your namespace then? Didn't you say that devtools::document generates the NAMESPACE with predict exports in this comment?

I have a hunch. Could you add predict to the list of imported functions from stats here.

Do you have stats package in your search path when you run devtools::document? For me stats is always loaded by default.

@mlampros
Copy link
Owner

devtools::document generates the NAMESPACE with predict exports in this comment?

In this comment I modified you function to make the R CMD build work without success. Don't take this comment into account.

I have a hunch. Could you add predict to the list of imported functions from stats here.

I tried this as well but I receive the same error.

For me stats is always loaded by default.

Althouth the stats package is a System Library package I always deal with its functions as if it were a User Library package. This because I receive NOTEs when I perform R CMD check --as-cran
Therefore, if functions of the stats package exist in my code then I always include it in the imports of the DESCRIPTION file.

Although I had the chance to go through the S3 and S4 classes in the past, I preferred in general the R6 classes due to its simplicity and familiarity with Python classes. One of the reasons that I'm now not in position to recognize the source of the problem that arises from the predict error during the R CMD build of the vignettes.

@mlampros
Copy link
Owner

I submitted the updated version to CRAN (1.2.6). This was actually a problem with the R version I used. After updating to 4.1.1 the issue was resolved.

@vspinu
Copy link
Contributor Author

vspinu commented Jan 28, 2022

Perfectos! Happy to hear it solved by itself :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants