Feature selection/reduction for clustering #541
Coming back to one of my original questions:
I think if we could pick an appropriate performance measure, this would definitely help. It would have been my first try anyway. Unfortunately, my experience with clustering is very limited, so I'd appreciate any help in picking the measure.
Well, you asked for the reason why we currently disallow normal SFS with a measure for clustering. I would really like somebody who knows more about this to weigh in, so we can offer something in mlr that is an accepted approach in this scenario, and not something we come up with in an ad-hoc fashion...
In principle there's nothing stopping us from optimising e.g. the Dunn index (and I think this should already be possible for tuning the parameters of the learner?). Since there's no ground truth in clustering, and hence you can't really do something completely wrong, I don't have anything against supporting this. I've had a brief look at the sparcl package, and it doesn't seem to implement a way of assigning new data points to clusters. This would need to be implemented for integration with mlr.
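For reference, the Dunn index mentioned here (smallest inter-cluster distance divided by largest intra-cluster diameter) can be computed in a few lines of base R. The helper name `dunn_index` is made up for illustration and is not an mlr or sparcl function; packages such as clValid also provide an implementation.

```r
# Dunn index = min inter-cluster distance / max intra-cluster diameter.
dunn_index <- function(x, cluster) {
  d <- as.matrix(dist(x))
  ks <- unique(cluster)
  # smallest distance between points in different clusters
  inter <- min(sapply(ks, function(i) {
    others <- ks[ks != i]
    min(sapply(others, function(j) min(d[cluster == i, cluster == j])))
  }))
  # largest distance between points in the same cluster
  intra <- max(sapply(ks, function(k) max(d[cluster == k, cluster == k])))
  inter / intra
}

set.seed(1)
cl <- kmeans(iris[, 1:4], centers = 3)$cluster
di <- dunn_index(iris[, 1:4], cl)
di
```

Larger values indicate compact, well-separated clusters, so a feature selection wrapper would maximise this quantity.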
I think this sounds very reasonable. The Dunn index should work fine for selecting the features. For the sparcl package:
Another thing I just read about: you can actually use random forests in unsupervised mode to do clustering. Would it be possible to include this functionality? Then the random forest importance could be used as a filter, as in the supervised learning case... I am just throwing out ideas here. If you think they are garbage, just tell me.
What exactly is "using random forests in unsupervised mode"?
I read about it here; the idea is explained in the second answer. I also read about it in several other posts, but there were just too many of them to keep track of all. However, this seems to be a fairly common strategy, so I would assume that this functionality is already implemented somewhere in R, preferably of course in the randomForest package you are using in mlr.
The unsupervised mode creates a set of synthetic data by a univariate bootstrap of the features (which breaks any dependence structure between the features), creates a label ("synthetic" vs. "real"), and then predicts this label using a random forest. Then you can do clustering using some sort of decomposition of the proximity matrix (restricted to the rows and columns which correspond to the real data), whose (i, j)-th entry gives the proportion of trees in which real observations i and j co-occurred in the same terminal node. I guess you can get a permutation importance from this as well.
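The construction described above can be sketched in a few lines of R. This is only an illustration of the idea (it assumes the CRAN package randomForest is installed, and all variable names are ours), not mlr code:

```r
library(randomForest)

x <- iris[, 1:4]
# univariate bootstrap of each feature destroys the dependence structure
synth <- as.data.frame(lapply(x, sample, replace = TRUE))
dat <- rbind(x, synth)
y <- factor(rep(c("real", "synthetic"), each = nrow(x)))

set.seed(1)
fit <- randomForest(dat, y, proximity = TRUE)
# keep only the proximities among the real observations
prox <- fit$proximity[seq_len(nrow(x)), seq_len(nrow(x))]
```

The matrix `prox` is then what gets decomposed or fed to a clustering method.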
@zmjones Do you know if any R package already implements this? Sounds like it would be a non-trivial amount of work to implement ourselves.
randomForest does it, e.g.:

```r
library(randomForest)
data(iris)
# omitting the response runs randomForest in unsupervised mode
# (there is no type = "unsupervised" argument)
fit = randomForest(iris[, -ncol(iris)], proximity = TRUE)
fit$proximity
```

... followed by some decomposition of the resultant matrix. I have an implementation of it that works for the other packages that I am working on now, but it will probably be a while before that ends up on CRAN.
It is described (poorly, imo) in this paper. As far as I am aware, there hasn't been anything else written about the method in particular.
Would it be feasible to port this to mlr, or is your package going to expose it in some way we can use from mlr?
Yeah, when I have it on CRAN I will integrate it. We can use the canonical implementation in randomForest without my stuff, though. I guess the trick with using it for clustering is going to be choosing a good method for decomposition/clustering of the proximity matrix. Then we can just call your new classification-via-clustering function.
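One possible choice of decomposition, purely as a sketch (nothing in the thread fixes this method): treat `1 - proximity` as a dissimilarity and run average-linkage hierarchical clustering on it. Assumes the CRAN package randomForest is installed:

```r
library(randomForest)

set.seed(1)
# omitting the response runs randomForest in unsupervised mode
fit <- randomForest(iris[, 1:4], proximity = TRUE)

# hierarchical clustering on 1 - proximity as a dissimilarity
hc <- hclust(as.dist(1 - fit$proximity), method = "average")
cl <- cutree(hc, k = 3)
table(cl, iris$Species)
```

Other decompositions (e.g. classical MDS via `cmdscale` followed by k-means) would fit the same slot; which one works best is exactly the open question raised here.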
Wait, wouldn't this work the other way round, i.e. clustering via classification?
Well, the point of the unsupervised random forest is to get an RF measure of similarity between observations using only the features, which is usually then decomposed and used for clustering. I am not sure what you mean by clustering via classification. You mean to learn the random forest classifier using the target feature, then compute the proximity matrix and decompose that for clustering? That wouldn't really be unsupervised.
Well, I'm just not sure what you mean by the last sentence in your previous comment. I don't see how the classification-via-clustering would be used in this context.
"That wouldn't really be unsupervised"? I am confused about what you are confused about :)
"Then we can just call your new classification via clustering function."
I am probably off my rocker. I don't know why you would want to do classification this way, sorry. What I meant was that if you can do clustering with the RF in this way (by applying a decomposition method to the unsupervised proximity matrix), then you could plug this into the classification-via-clustering function. Does that make more sense?
So then the end goal would be to do classification? Sorry, I'm slightly lost.
Yes, you could do classification with the RF clustering algorithm, either by applying something like kNN directly to the proximity matrix or by decomposing it with something else and then plugging it into your function. Like I said though, I am off my rocker. I don't think that would be ideal: you would just do classification with the RF directly, which would (I suspect) be superior in all cases.
OK... after the general confusion last week, there doesn't seem to have been any further development in this matter. I was wondering if you had any further ideas?
Hi mlr experts,
I am attaching an e-mail conversation I had with Bernd at the bottom, so that we can get a little more input on the matter.
Here are the core points in English:
If you have any ideas/input on the matter, it would be very helpful.
Thanks,
Sebastian
Here is the e-mail conversation between Bernd and me (sorry, it's all German):
On 22.10.2015 10:16, Sebastian Wandernoth wrote: