
Unsupervised random forest with ranger #514

Open
RJ333 opened this issue May 20, 2020 · 6 comments

Comments

@RJ333

RJ333 commented May 20, 2020

Hello there,

I was wondering if it is possible to generate unsupervised forests as in the randomForest package? This is the only use case where I have to fall back to randomForest, and it would be great to have this parallelized and all under one roof.

Thanks for the ranger package, anyway!

Cheers

René

@mnwright
Member

Not out of the box, but doesn't it work like this:

library(ranger)

# Build a synthetic "noise" class by resampling each column independently,
# which keeps the marginal distributions but destroys the dependence structure
synth <- as.data.frame(lapply(iris[, 1:4], function(x) {
  sample(x, length(x), replace = TRUE)
}))

# Label real data 0 and synthetic data 1, then fit a classifier on the mix
dat <- rbind(data.frame(y = 0, iris[, 1:4]),
             data.frame(y = 1, synth))
dat$y <- factor(dat$y)
rf.fit <- ranger(y ~ ., dat, keep.inbag = TRUE)

# Keep only the proximities among the real observations
prox <- extract_proximity_oob(rf.fit, dat)[1:nrow(iris), 1:nrow(iris)]

with extract_proximity_oob() from #234 (comment)?

Maybe we should add this as a function?
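For reference, the OOB proximity idea can be sketched in base R. This is an illustrative implementation, not the actual `extract_proximity_oob()` from #234; `oob_proximity` and its arguments are hypothetical names:

```r
# Sketch: OOB proximity from per-tree inbag counts and terminal node IDs.
# inbag: n x ntree matrix, inbag[i, t] = times obs. i was sampled for tree t
# nodes: n x ntree matrix, nodes[i, t] = terminal node of obs. i in tree t
oob_proximity <- function(inbag, nodes) {
  n <- nrow(inbag)
  prox  <- matrix(0, n, n)
  denom <- matrix(0, n, n)
  for (t in seq_len(ncol(inbag))) {
    oob      <- inbag[, t] == 0                    # observations OOB in tree t
    same     <- outer(nodes[, t], nodes[, t], "==") # same terminal node?
    both_oob <- outer(oob, oob, "&")                # pairs where both are OOB
    prox  <- prox  + (same & both_oob)
    denom <- denom + both_oob
  }
  prox / pmax(denom, 1)  # avoid division by zero for pairs never jointly OOB
}
```

With ranger, the node matrix should be obtainable via `predict(rf.fit, dat, type = "terminalNodes")$predictions` and the inbag counts via `simplify2array(rf.fit$inbag.counts)` when the forest was fitted with `keep.inbag = TRUE`.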

@RJ333
Author

RJ333 commented May 22, 2020

That looks pretty good. Using some code I found on the "StatQuest" YouTube channel, we get this:

library(ggplot2)
# Generate dist matrix and % of variation for x and y axis
row.names(prox) <- row.names(iris)
distance_matrix <- dist(1 - prox)
PCA_object <- cmdscale(distance_matrix, eig = TRUE, x.ret = TRUE)
PCA_variation <- round(PCA_object$eig/sum(PCA_object$eig) * 100, 1)
PCA_values <- PCA_object$points

# Combine to plot ready data frame
PCA_data <- data.frame(Species = iris$Species,
  X = PCA_values[, 1],
  Y = PCA_values[, 2])

# plotting
ggplot(PCA_data, aes(x = X, y = Y, colour = Species)) +
  geom_point(size = 3) +
  labs(x = paste("PC1 - ", PCA_variation[1], "%", sep = ""),
    y = paste("PC2 - ", PCA_variation[2], "%", sep = ""))

![plot_zoom_png](https://user-images.githubusercontent.com/35432752/82638442-31a7b500-9c07-11ea-97e1-259007043422.png)

It would be great to have this work out of the box when no response variable is provided. I understand the theory of how an unsupervised RF should work (I think), but not the actual implementation. The way you created the synthetic data set is described identically here:

http://gradientdescending.com/unsupervised-random-forest-example/

However, I don't know if there are other good practices involved.

Did you come to a conclusion with regard to "The proximity should just be computed for trees where both observations are OOB?" in #234?

@mnwright
Member

However, I don't know if there are other good practices involved.

At least the sampling from empirical distributions works better than from uniform distributions, according to Shi & Horvath (2006). But there might be newer results or alternatives.
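The two sampling schemes can be sketched side by side in base R; `make_synth` is a hypothetical helper for illustration, not part of ranger:

```r
# Two ways to build the synthetic "noise" class (cf. Shi & Horvath 2006):
# sample each column from its empirical marginal distribution, or from a
# uniform over its range. The empirical version preserves the marginals
# while destroying the dependence between columns.
make_synth <- function(x, method = c("empirical", "uniform")) {
  method <- match.arg(method)
  as.data.frame(lapply(x, function(col) {
    if (method == "empirical") {
      sample(col, length(col), replace = TRUE)
    } else {
      runif(length(col), min = min(col), max = max(col))
    }
  }))
}
```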

Did you come to a conclusion with regard to "The proximity should just be computed for trees where both observations are OOB?" in #234?

Yes, both observations have to be OOB to have a proper OOB proximity measure (it's also the same in randomForest(..., oob.prox = TRUE)).

@RJ333
Author

RJ333 commented Jun 4, 2020

Hello again,

Thanks for your answer with regard to OOB proximity.

I'm testing the unsupervised version on my full data set (300 x 17000). This part

synth <- as.data.frame(lapply(iris[, 1:4], function(x) {
  sample(x, length(x), replace = TRUE)
}))

can take quite a while and (at least on my machine) does not make use of multiple cores. Do you have any ideas on how to speed it up?

EDIT: I realized that in your example you use a data.frame, whereas I had a matrix (which also gave a nonsensical result). Converting from matrix to data.frame made it work in less than a second. However, as I don't know much about performance tuning, there might still be some potential there.

@mnwright
Member

mnwright commented Jun 5, 2020

You could parallelize the lapply(), but reading your edit that shouldn't be necessary.
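For the record, a sketch of the parallelized version using base R's parallel package (assumptions: the resampling is the bottleneck; on Windows, forking is unavailable, so it falls back to serial):

```r
library(parallel)

# Parallelized version of the synthetic-data step. mclapply() forks on
# Unix-alikes; on Windows it only supports mc.cores = 1, so fall back to
# serial there (or use parLapply() with a PSOCK cluster instead).
n_cores <- if (.Platform$OS.type == "windows") 1L else max(1L, detectCores() - 1L)
synth <- as.data.frame(mclapply(iris[, 1:4], function(x) {
  sample(x, length(x), replace = TRUE)
}, mc.cores = n_cores))
```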

@coforfe

coforfe commented Apr 29, 2021

Hello Marvin,

Following this thread, I am just curious whether you are considering integrating this functionality into ranger.

The current implementations in randomForest and randomForestSRC are not fast enough; I think it could be faster with the C++ implementation here in your package.

Thanks!
Carlos.
