Out-of-Bag Proximity Matrix #234
Comments
Should the proximity be computed only over trees where both observations are OOB?
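One way to write down the definition being asked about, as a sketch. The notation is an assumption for illustration and is not taken from ranger or edarf: $T$ is the number of trees, $O_t$ is the set of observations that are out-of-bag for tree $t$, and $\ell_t(i)$ is the terminal node that observation $i$ lands in for tree $t$.

$$
\operatorname{prox}_{\mathrm{OOB}}(i, j) = \frac{\sum_{t=1}^{T} \mathbb{1}[i \in O_t,\ j \in O_t]\ \mathbb{1}[\ell_t(i) = \ell_t(j)]}{\sum_{t=1}^{T} \mathbb{1}[i \in O_t,\ j \in O_t]}
$$

The ratio is undefined when no tree has both observations out-of-bag, since the denominator is then zero.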
Based on

```r
extract_proximity_oob = function(fit, olddata) {
  # In-bag counts are only stored when the forest was fit with keep.inbag = TRUE
  if (is.null(fit$inbag.counts)) {
    stop("call ranger with keep.inbag = TRUE")
  }

  # Terminal node ID of every observation in every tree (n x ntree)
  pred = predict(fit, olddata, type = "terminalNodes")$predictions
  n = nrow(pred)
  prox = matrix(NA, n, n)

  # Get inbag counts as an n x ntree matrix; 0 means the observation is OOB
  inbag = simplify2array(fit$inbag.counts)

  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      # Use only trees where both obs are OOB
      tree_idx = inbag[i, ] == 0 & inbag[j, ] == 0
      # NaN if the pair is never jointly OOB
      prox[i, j] = sum(pred[i, tree_idx] == pred[j, tree_idx]) / sum(tree_idx)
    }
  }

  prox
}
```

@zmjones: I'd like to create a PR in zmjones/edarf, but we don't have the original data in the ranger object, so it's probably not possible with
Awesome, thanks!
I've written a cpp function to take care of the inner loops in the above. Also, there are some short circuits built in:
I'm no cpp expert, so this is probably horrible code, but it gets the job done.

```r
extract_proximity_oob = function(fit, olddata) {
  if (is.null(fit$inbag.counts)) {
    stop("call ranger with keep.inbag = TRUE")
  }
  pred = predict(fit, olddata, type = "terminalNodes")$predictions
  prox = matrix(NA, nrow(pred), nrow(pred))
  ntree = ncol(pred)
  ns = nrow(prox)
  # Get inbag counts
  inbag = simplify2array(fit$inbag.counts)
  oob_proximity <- extract_proximity_oob_cpp(pred = pred, prox = prox, inbag = inbag, ntree = ntree, ns = ns)
  return(oob_proximity)
}
```

```cpp
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericMatrix extract_proximity_oob_cpp(NumericMatrix pred,
                                        NumericMatrix prox,
                                        NumericMatrix inbag,
                                        int ntree,
                                        int ns) {
  // define variables
  NumericVector tree_idx;
  double same_bag_total;
  double pred_total;

  for (int i = 0; i < ns; i++) {
    // loop down the rows, printing progress every 100 rows
    // (Rcout rather than std::cout so output goes through R's console)
    if (i % 100 == 0) {
      Rcout << "i: " << i << "\n";
    }
    for (int j = 0; j < ns; j++) {
      if (i == j) {
        // self similarity
        prox(i, j) = 1;
      } else if (j < i) {
        // we are below the diagonal, we can copy down the already computed similarity
        prox(i, j) = prox(j, i);
      } else {
        // in-bag counts are non-negative, so a sum of 0 means both obs are OOB
        tree_idx = inbag(i, _) + inbag(j, _);
        pred_total = 0;
        same_bag_total = 0;
        for (int nt = 0; nt < ntree; nt++) {
          if (tree_idx[nt] == 0) {
            same_bag_total += 1;
            if (pred(i, nt) == pred(j, nt)) {
              pred_total += 1;
            }
          } // end if
        } // end for nt
        // NaN if the pair is never jointly OOB
        prox(i, j) = pred_total / same_bag_total;
      }
    } // end for j
  } // end for i
  return prox;
}
```
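The loop above divides by the number of trees in which both observations are OOB. If a pair is never jointly OOB, that count is zero and the result is NaN. Below is a minimal sketch of one way to make that case explicit, written as plain C++ without Rcpp so it compiles on its own. The function name `oob_pair_proximity`, the `std::vector` inputs, and the `fallback` parameter are assumptions for illustration, not part of ranger or edarf.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of ranger or edarf): OOB proximity for one
// pair of observations.
//   nodes_i, nodes_j: terminal node ID of each observation, one entry per tree
//   inbag_i, inbag_j: in-bag count of each observation per tree (0 means OOB)
//   fallback:         value returned when the pair is never jointly OOB
double oob_pair_proximity(const std::vector<int>& nodes_i,
                          const std::vector<int>& nodes_j,
                          const std::vector<int>& inbag_i,
                          const std::vector<int>& inbag_j,
                          double fallback) {
  // All four vectors must have one entry per tree.
  assert(nodes_i.size() == nodes_j.size());
  assert(nodes_i.size() == inbag_i.size());
  assert(nodes_i.size() == inbag_j.size());

  std::size_t both_oob = 0;   // trees where both observations are OOB
  std::size_t same_node = 0;  // of those, trees where they share a terminal node
  for (std::size_t t = 0; t < nodes_i.size(); ++t) {
    if (inbag_i[t] == 0 && inbag_j[t] == 0) {
      ++both_oob;
      if (nodes_i[t] == nodes_j[t]) {
        ++same_node;
      }
    }
  }

  // Avoid 0 / 0 when no tree has both observations out-of-bag.
  if (both_oob == 0) {
    return fallback;
  }
  return static_cast<double>(same_node) / static_cast<double>(both_oob);
}
```

Which fallback makes sense (a missing value, zero, or something else) depends on how the proximity matrix is used afterwards, so it is left to the caller here.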
@mnwright just to let you know, the solution can be demanding on memory.

```
> pred %>% dim
[1] 107880    500
> prox <- matrix(NA, nrow(pred), nrow(pred))
Error: cannot allocate vector of size 43.4 Gb
```
I don't think this is specific to this method; it's just the amount of memory needed for a 107880 x 107880 matrix (it seems NAs need 32 bits each).
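The 43.4 Gb in the error message is consistent with 4 bytes per element: 107880 x 107880 x 4 bytes is about 43.4 GiB. Here is a small sketch of that arithmetic, assuming a dense square matrix and that R's "Gb" means 2^30 bytes. The function names are hypothetical.

```cpp
#include <cstdint>

// Bytes needed for a dense n x n matrix with the given element size.
std::uint64_t dense_matrix_bytes(std::uint64_t n, std::uint64_t bytes_per_element) {
  return n * n * bytes_per_element;
}

// Convert bytes to GiB (2^30 bytes).
double to_gib(std::uint64_t bytes) {
  return static_cast<double>(bytes) / (1024.0 * 1024.0 * 1024.0);
}
```

With `n = 107880`, 4-byte elements need 46,552,377,600 bytes. Once the matrix holds proximities as doubles (8 bytes each), the requirement doubles, so the limit here is the n x n size and not the OOB calculation itself.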
It is possible to create a proximity matrix using `type = "terminalNodes"` in the `predict` statement or with the function `edarf::extract_proximity`. However, if you want the proximity matrix for the same data that you used to train the random forest, it will use all of the trees, not just the out-of-bag ones, which will result in optimistic proximity metrics.

With `randomForest`, you can use the argument `oob.prox=TRUE` to calculate the proximity matrix on training data with out-of-bag trees.

Is there a way to get the out-of-bag estimate of the terminal nodes? If not, can this option be added?
Thanks!