Out-of-Bag Proximity Matrix #234

andland · 2017-11-07T23:24:16Z

It is possible to create a proximity matrix using type = "terminalNodes" in the predict statement or with the function edarf::extract_proximity. However, if you want the proximity matrix for the same data that you used to train the random forest, it will use all of the trees, not just the out-of-bag ones, which will result in optimistic proximity metrics.

With randomForest, you can use the argument oob.prox=TRUE to calculate the proximity matrix on training data with out-of-bag trees.

Is there a way to get the out-of-bag estimate of the terminal nodes? If not, can this option be added?

Thanks!


mnwright · 2017-11-08T07:34:26Z

So the proximity should be computed only over trees where both observations are OOB?

mnwright · 2017-11-08T08:17:43Z

Based on edarf::extract_proximity I've created a function for this:

extract_proximity_oob = function(fit, olddata) {
  pred = predict(fit, olddata, type = "terminalNodes")$predictions
  prox = matrix(NA, nrow(pred), nrow(pred))
  ntree = ncol(pred)
  n = nrow(prox)
  
  if (is.null(fit$inbag.counts)) {
    stop("call ranger with keep.inbag = TRUE")
  }
  
  # Get inbag counts
  inbag = simplify2array(fit$inbag.counts)
  
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      # Use only trees where both obs are OOB
      tree_idx = inbag[i, ] == 0 & inbag[j, ] == 0
      # Note: this is NaN (0 / 0) if the pair shares no OOB tree
      prox[i, j] = sum(pred[i, tree_idx] == pred[j, tree_idx]) / sum(tree_idx)
    }
  }
  
  prox
}
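
For reference, the per-pair quantity computed in the loop above can be written out on its own. This is only a minimal sketch in plain C++ under stated assumptions: `oob_proximity_pair` is a hypothetical helper, not part of ranger or edarf, and it takes the in-bag counts and terminal node IDs of two observations as plain vectors with one entry per tree. It also makes the edge case explicit: if the pair shares no OOB tree, the ratio is 0 / 0, so the sketch returns NaN.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper (not part of ranger or edarf): OOB proximity of one pair.
// inbag_x[t] - how often observation x was drawn into tree t (0 means OOB)
// nodes_x[t] - terminal node ID of observation x in tree t
double oob_proximity_pair(const std::vector<int>& inbag_i,
                          const std::vector<int>& inbag_j,
                          const std::vector<int>& nodes_i,
                          const std::vector<int>& nodes_j) {
  const std::size_t ntree = inbag_i.size();
  assert(inbag_j.size() == ntree);
  assert(nodes_i.size() == ntree);
  assert(nodes_j.size() == ntree);

  int shared_oob = 0;  // trees where both observations are OOB
  int same_node = 0;   // of those, trees where both end in the same leaf
  for (std::size_t t = 0; t < ntree; ++t) {
    if (inbag_i[t] == 0 && inbag_j[t] == 0) {
      ++shared_oob;
      if (nodes_i[t] == nodes_j[t]) {
        ++same_node;
      }
    }
  }

  // No shared OOB tree: the proximity is undefined for this pair.
  if (shared_oob == 0) {
    return std::numeric_limits<double>::quiet_NaN();
  }
  return static_cast<double>(same_node) / shared_oob;
}
```

With few trees or a large sample fraction, some pairs may never be OOB together, so it is probably worth checking the result for NaN before using it as a distance.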

@zmjones: I'd like to create a PR in zmjones/edarf but we don't have the original data in the ranger object so it's probably not possible with newdata = NULL as in randomForest, correct?

andland · 2017-11-08T19:03:48Z

Awesome, thanks!

aptperson · 2018-02-23T03:02:26Z

I've written a cpp function to take care of the inner loops in the above. Also, there are some short circuits built in:

if i == j, we know the proximity is 1
if i > j, we are below the diagonal and we can copy the proximity from above the diagonal

I'm no cpp expert, so this is probably horrible code, but it gets the job done.

extract_proximity_oob = function(fit, olddata) {
  pred = predict(fit, olddata, type = "terminalNodes")$predictions
  prox = matrix(NA, nrow(pred), nrow(pred))
  ntree = ncol(pred)
  ns = nrow(prox)
  
  if (is.null(fit$inbag.counts)) {
    stop("call ranger with keep.inbag = TRUE")
  }
  
  # Get inbag counts
  inbag = simplify2array(fit$inbag.counts)
  
  oob_proximity <- extract_proximity_oob_cpp(pred = pred, prox = prox, inbag = inbag, ntree = ntree, ns = ns)

  return(oob_proximity)
}

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericMatrix extract_proximity_oob_cpp(NumericMatrix pred,
                                        NumericMatrix prox,
                                        NumericMatrix inbag,
                                        int ntree,
                                        int ns) {
  
  // define variables
  NumericVector tree_idx;
  double same_bag_total;
  double pred_total;
  
  for(int i = 0; i < ns; i++ ){
    // loop down the rows
    if(i % 100 == 0){
      // progress output; Rcout writes to the R console
      Rcout << "i: " << i << "\n";
    }
    
    for(int j = 0; j < ns; j ++){

      if(i == j){
        // self similarity
        prox(i, j) = 1;
      }else if(j < i){
        // we are below the diagonal, we can copy down the already computed similarity
        prox(i, j) = prox(j, i);
      }else{
        // the sum is 0 only for trees where both observations are OOB
        tree_idx = inbag(i, _) + inbag(j, _);

        pred_total = 0;
        same_bag_total = 0;

        for(int nt = 0; nt < ntree; nt ++){
          if(tree_idx[nt] == 0){
            same_bag_total += 1;

            if(pred(i, nt) == pred(j, nt)){
              pred_total += 1;
            }
          } // end if
        } // end for nt

        // NaN (0 / 0) if the pair shares no OOB tree
        prox(i, j) = pred_total / same_bag_total;
      }
    } // end for j
    
  } // end for i
  
  return prox;
}
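
The two short circuits (diagonal set to 1, lower triangle copied from the upper one) do not depend on Rcpp and can be sketched in isolation. The following is a rough standalone illustration, not the code from this comment: `fill_symmetric` and the `pair_fn` callback are hypothetical names, and the matrix is assumed to be stored row-major in a flat vector. The pair function is evaluated only for i < j, so it runs n(n-1)/2 times instead of n^2 times.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper: fill an n x n symmetric matrix (row-major, flat vector)
// by evaluating pair_fn only above the diagonal and mirroring each result.
std::vector<double> fill_symmetric(
    std::size_t n,
    const std::function<double(std::size_t, std::size_t)>& pair_fn) {
  assert(static_cast<bool>(pair_fn));  // a callable must be supplied

  std::vector<double> prox(n * n, 0.0);
  for (std::size_t i = 0; i < n; ++i) {
    prox[i * n + i] = 1.0;  // self-proximity, as assumed in the thread
    for (std::size_t j = i + 1; j < n; ++j) {
      const double p = pair_fn(i, j);  // computed once per unordered pair
      prox[i * n + j] = p;             // above the diagonal
      prox[j * n + i] = p;             // mirrored below the diagonal
    }
  }
  return prox;
}
```

Passing the per-pair OOB computation as `pair_fn` would give the same matrix as the nested loops, with roughly half the pair evaluations.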

grayskripko · 2019-06-18T16:26:53Z

@mnwright just to let you know, the solution can be demanding on memory.

> pred %>% dim
[1] 107880    500

> prox <- matrix(NA, nrow(pred), nrow(pred))
Error: cannot allocate vector of size 43.4 Gb

mnwright · 2019-06-21T10:05:38Z

I don't think this is specific to this method; it's just the amount of memory needed for a 107880 x 107880 matrix (it seems each NA needs 32 bits):
107880^2 * 32 / 8 / 2^30 = 43.35528
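
The same arithmetic can be wrapped in a small function to check other sizes before allocating. This is just a sketch: `dense_matrix_gib` is a hypothetical name, and the byte widths are assumptions about R's storage (4 bytes per logical or integer cell, 8 bytes per double cell). Since the proximities themselves are doubles, the filled matrix would presumably need about twice the figure above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: GiB needed for a dense n x n matrix
// with bytes_per_cell bytes per entry.
double dense_matrix_gib(std::size_t n, std::size_t bytes_per_cell) {
  assert(bytes_per_cell > 0);
  // Work in double to avoid integer overflow for large n.
  const double cells = static_cast<double>(n) * static_cast<double>(n);
  const double gib = 1024.0 * 1024.0 * 1024.0;  // 2^30 bytes
  return cells * static_cast<double>(bytes_per_cell) / gib;
}
```

For n = 107880 this gives about 43.36 GiB at 4 bytes per cell, matching the error message, and about 86.71 GiB at 8 bytes per cell.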

mnwright closed this as completed Dec 14, 2017

mnwright mentioned this issue May 22, 2020

Unsupervised random forest with ranger #514
