
Unsupervised random forest with ranger #514

Open
RJ333 opened this issue May 20, 2020 · 6 comments

Comments

@RJ333

RJ333 commented May 20, 2020

Hello there,

I was wondering if it is possible to generate unsupervised forests as in the randomForest package? This is the only use case where I have to fall back to randomForest, and it would be great to have this parallelized and all under one roof.

Thanks for the ranger package, anyway!

Cheers

René

@mnwright
Member

Not out of the box, but doesn't it work like this:

library(ranger)

# Build a synthetic "noise" class by resampling each column independently,
# which keeps the marginal distributions but destroys the dependence structure
synth <- as.data.frame(lapply(iris[, 1:4], function(x) {
  sample(x, length(x), replace = TRUE)
}))

# Label real data 0 and synthetic data 1, then fit a classifier on the mix
dat <- rbind(data.frame(y = 0, iris[, 1:4]),
             data.frame(y = 1, synth))
dat$y <- factor(dat$y)
rf.fit <- ranger(y ~ ., dat, keep.inbag = TRUE)

# Keep only the proximities among the real observations
prox <- extract_proximity_oob(rf.fit, dat)[1:nrow(iris), 1:nrow(iris)]

with extract_proximity_oob() from #234 (comment)?

Maybe we should add this as a function?
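For reference, the OOB proximity idea can be sketched in base R. This is an illustrative implementation, not the actual `extract_proximity_oob()` from #234; `oob_proximity` and its arguments are hypothetical names:

```r
# Sketch: OOB proximity from per-tree inbag counts and terminal node IDs.
# inbag: n x ntree matrix, inbag[i, t] = times obs. i was sampled for tree t
# nodes: n x ntree matrix, nodes[i, t] = terminal node of obs. i in tree t
oob_proximity <- function(inbag, nodes) {
  n <- nrow(inbag)
  prox  <- matrix(0, n, n)
  denom <- matrix(0, n, n)
  for (t in seq_len(ncol(inbag))) {
    oob      <- inbag[, t] == 0                    # observations OOB in tree t
    same     <- outer(nodes[, t], nodes[, t], "==") # same terminal node?
    both_oob <- outer(oob, oob, "&")                # pairs where both are OOB
    prox  <- prox  + (same & both_oob)
    denom <- denom + both_oob
  }
  prox / pmax(denom, 1)  # avoid division by zero for pairs never jointly OOB
}
```

With ranger, the node matrix should be obtainable via `predict(rf.fit, dat, type = "terminalNodes")$predictions` and the inbag counts via `simplify2array(rf.fit$inbag.counts)` when the forest was fitted with `keep.inbag = TRUE`.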

@RJ333
Author

RJ333 commented May 22, 2020

That looks pretty good. Using some code I found on the "StatQuest" YouTube channel, we get this:

library(ggplot2)
# Generate dist matrix and % of variation for x and y axis
row.names(prox) <- row.names(iris)
distance_matrix <- dist(1 - prox)
PCA_object <- cmdscale(distance_matrix, eig = TRUE, x.ret = TRUE)
PCA_variation <- round(PCA_object$eig/sum(PCA_object$eig) * 100, 1)
PCA_values <- PCA_object$points

# Combine to plot ready data frame
PCA_data <- data.frame(Species = iris$Species,
  X = PCA_values[, 1],
  Y = PCA_values[, 2])

# plotting
ggplot(PCA_data, aes(x = X, y = Y, colour = Species)) +
  geom_point(size = 3) +
  labs(x = paste("PC1 - ", PCA_variation[1], "%", sep = ""),
    y = paste("PC2 - ", PCA_variation[2], "%", sep = ""))

![plot_zoom_png](https://user-images.githubusercontent.com/35432752/82638442-31a7b500-9c07-11ea-97e1-259007043422.png)

It would be great to have this work out of the box when no response variable is provided. I understand the theory of how an unsupervised RF should work (I think), but not the actual implementation. The way you created the synthetic data set is described identically here:

http://gradientdescending.com/unsupervised-random-forest-example/

However, I don't know if there are other good practices involved.

Did you come to a conclusion with regard to "The proximity should just be computed for trees where both observations are OOB?" in #234?

@mnwright
Member

However, I don't know if there are other good practices involved.

At least the sampling from empirical distributions works better than from uniform distributions, according to Shi & Horvath (2006). But there might be newer results or alternatives.
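The two sampling schemes can be sketched side by side in base R; `make_synth` is a hypothetical helper for illustration, not part of ranger:

```r
# Two ways to build the synthetic "noise" class (cf. Shi & Horvath 2006):
# sample each column from its empirical marginal distribution, or from a
# uniform over its range. The empirical version preserves the marginals
# while destroying the dependence between columns.
make_synth <- function(x, method = c("empirical", "uniform")) {
  method <- match.arg(method)
  as.data.frame(lapply(x, function(col) {
    if (method == "empirical") {
      sample(col, length(col), replace = TRUE)
    } else {
      runif(length(col), min = min(col), max = max(col))
    }
  }))
}
```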

Did you come to a conclusion with regard to "The proximity should just be computed for trees where both observations are OOB?" in #234?

Yes, both observations have to be OOB to have a proper OOB proximity measure (it's also the same in randomForest(..., oob.prox = TRUE)).

@RJ333
Author

RJ333 commented Jun 4, 2020

Hello again,

Thanks for your answer with regard to OOB proximity.

I'm testing the unsupervised version on my full data set (300 x 17000). This part

synth <- as.data.frame(lapply(iris[, 1:4], function(x) {
  sample(x, length(x), replace = TRUE)
}))

can take quite a while and (at least on my machine) does not make use of multiple cores. Do you have any ideas on how to speed it up?

EDIT: I realized that in your example you use a data.frame, whereas I had a matrix (which also gave a nonsensical result). Converting from matrix to data.frame made it work in less than a second. However, as I don't know much about performance tuning, there might still be some potential there.

@mnwright
Member

mnwright commented Jun 5, 2020

You could parallelize the lapply(), but reading your edit that shouldn't be necessary.
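For the record, a sketch of the parallelized version using base R's parallel package (assumptions: the resampling is the bottleneck; on Windows, forking is unavailable, so it falls back to serial):

```r
library(parallel)

# Parallelized version of the synthetic-data step. mclapply() forks on
# Unix-alikes; on Windows it only supports mc.cores = 1, so fall back to
# serial there (or use parLapply() with a PSOCK cluster instead).
n_cores <- if (.Platform$OS.type == "windows") 1L else max(1L, detectCores() - 1L)
synth <- as.data.frame(mclapply(iris[, 1:4], function(x) {
  sample(x, length(x), replace = TRUE)
}, mc.cores = n_cores))
```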

@coforfe

coforfe commented Apr 29, 2021

Hello Marvin,

Following this thread, I am just curious whether you are considering integrating this functionality into ranger.

The current implementations in randomForest and randomForestSRC are not fast enough; I think it could be faster with the C++ implementation here in your package.

Thanks!
Carlos.
