Unsupervised random forest with ranger #514
Not out of the box, but isn't it working like this?

```r
synth <- as.data.frame(lapply(iris[, 1:4], function(x) {
  sample(x, length(x), replace = TRUE)
}))
dat <- rbind(data.frame(y = 0, iris[, 1:4]),
             data.frame(y = 1, synth))
dat$y <- factor(dat$y)
rf.fit <- ranger(y ~ ., dat, keep.inbag = TRUE)
prox <- extract_proximity_oob(rf.fit, dat)[1:nrow(iris), 1:nrow(iris)]
```

Maybe we should add this as a function?
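A convenience wrapper could look roughly like the sketch below. The name `rf_unsupervised` is hypothetical, and `extract_proximity_oob` is assumed to be the same helper used above:

```r
library(ranger)

# Hypothetical wrapper (sketch only): fit an unsupervised RF by contrasting
# the real data against a synthetic copy sampled from the marginal distributions,
# then return the OOB proximities restricted to the original observations.
# extract_proximity_oob is assumed to be available as used above.
rf_unsupervised <- function(x, ...) {
  x <- as.data.frame(x)
  # Synthetic data: sample each column independently, with replacement
  synth <- as.data.frame(lapply(x, function(col) {
    sample(col, length(col), replace = TRUE)
  }))
  dat <- rbind(data.frame(y = 0, x),
               data.frame(y = 1, synth))
  dat$y <- factor(dat$y)
  fit <- ranger(y ~ ., dat, keep.inbag = TRUE, ...)
  extract_proximity_oob(fit, dat)[seq_len(nrow(x)), seq_len(nrow(x))]
}
```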
That looks pretty good. Using some code I found on the "StatQuest" YouTube channel, we get this:

```r
library(ggplot2)

# Generate the distance matrix and the % of variation for the x and y axes.
# Note: prox is already a similarity matrix, so convert 1 - prox with as.dist()
# rather than recomputing distances between its rows with dist().
row.names(prox) <- row.names(iris)
distance_matrix <- as.dist(1 - prox)
PCA_object <- cmdscale(distance_matrix, eig = TRUE, x.ret = TRUE)
PCA_variation <- round(PCA_object$eig / sum(PCA_object$eig) * 100, 1)
PCA_values <- PCA_object$points

# Combine into a plot-ready data frame
PCA_data <- data.frame(Species = iris$Species,
                       X = PCA_values[, 1],
                       Y = PCA_values[, 2])

# Plotting
ggplot(PCA_data, aes(x = X, y = Y, colour = Species)) +
  geom_point(size = 3) +
  labs(x = paste0("PC1 - ", PCA_variation[1], "%"),
       y = paste0("PC2 - ", PCA_variation[2], "%"))
```

![plot_zoom_png](https://user-images.githubusercontent.com/35432752/82638442-31a7b500-9c07-11ea-97e1-259007043422.png)
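Beyond the MDS plot, the same RF dissimilarity can also be fed into clustering, which is how Shi & Horvath use it. A minimal sketch, assuming the `prox` matrix from above is available (the choice of `hclust` with average linkage and `k = 3` is illustrative, not prescribed by the thread):

```r
# Hierarchical clustering on the RF dissimilarity 1 - prox (assumes prox exists)
rf_dist <- as.dist(1 - prox)
hc <- hclust(rf_dist, method = "average")
clusters <- cutree(hc, k = 3)

# Compare the recovered clusters to the known species labels
table(clusters, iris$Species)
```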
It would be great to have this run out of the box when no response variable is provided. I understand the theory of how an unsupervised RF should work (I guess), but not the actual implementation. The way you created the synthetic data set is described identically here: http://gradientdescending.com/unsupervised-random-forest-example/ However, I don't know whether there are other good practices involved. Did you come to a conclusion regarding "The proximity should just be computed for trees where both observations are OOB?" in #234?
At least, sampling from the empirical distributions works better than sampling from uniform distributions, according to Shi & Horvath (2006). But there might be newer results or alternatives.
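For reference, the two synthetic-data schemes being compared can be sketched like this (illustrative helpers, not part of ranger; `synth_empirical` and `synth_uniform` are made-up names, and the uniform variant only makes sense for numeric columns):

```r
# Sample each feature independently from its empirical (marginal) distribution
synth_empirical <- function(x) {
  as.data.frame(lapply(x, function(col) {
    sample(col, length(col), replace = TRUE)
  }))
}

# Alternative: draw each numeric feature uniformly over its observed range
synth_uniform <- function(x) {
  as.data.frame(lapply(x, function(col) {
    runif(length(col), min(col), max(col))
  }))
}
```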
Yes, both observations have to be OOB to have a proper OOB proximity measure (it's also the same in …).
Hello again, thanks for your answer with regard to OOB proximity. I'm testing the unsupervised version on my full data set (300 x 17000). This part:

```r
synth <- as.data.frame(lapply(iris[, 1:4], function(x) {
  sample(x, length(x), replace = TRUE)
}))
```

can take quite a while and (at least on my machine) does not make use of multiple cores. Do you have any ideas on how to speed it up?

EDIT: I realized that in your example you use a data.frame, whereas I had a matrix (which also gave a nonsensical result). Converting from matrix to data.frame made it work in less than a second. However, as I don't know much about performance tuning, there might still be some potential there.
You could parallelize the …
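One way to parallelize the column-wise sampling (a sketch only; forked workers via `mclapply` need a Unix-like system, so this falls back to a single core on Windows). Note that, per the EDIT above, the matrix-to-data.frame conversion may already remove the bottleneck:

```r
library(parallel)

# Parallel version of the synthetic-data step.
# mclapply uses forking, which is unavailable on Windows; fall back to 1 core there.
cores <- if (.Platform$OS.type == "unix") detectCores() else 1
synth <- as.data.frame(mclapply(iris[, 1:4], function(x) {
  sample(x, length(x), replace = TRUE)
}, mc.cores = cores))
```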
Hello Marvin, following this thread, I am just curious whether you are considering integrating this functionality in … The current implementations in … Thanks!
Hello there,

I was wondering if it is possible to generate unsupervised forests as in the RandomForest package? This is the only use case where I have to fall back to RandomForest, and it would be great to have this parallelized and all under one hood. Thanks for the ranger package, anyway!

Cheers
René