Out-of-Bag Proximity Matrix #234

Closed
andland opened this issue Nov 7, 2017 · 6 comments

Comments

@andland

andland commented Nov 7, 2017

It is possible to create a proximity matrix using type = "terminalNodes" in the predict statement or with the function edarf::extract_proximity. However, if you want the proximity matrix for the same data that you used to train the random forest, it will use all of the trees, not just the out-of-bag ones, which will result in optimistic proximity metrics.

With randomForest, you can use the argument oob.prox=TRUE to calculate the proximity matrix on training data with out-of-bag trees.

Is there a way to get the out-of-bag estimate of the terminal nodes? If not, can this option be added?

Thanks!

@mnwright
Member

mnwright commented Nov 8, 2017

The proximity should just be computed for trees where both observations are OOB?
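
To make that definition concrete, here is a small hand-made example in base R. It is only a sketch: the `pred` and `inbag` matrices are invented toy data (3 observations, 4 trees), not output from a real ranger fit. `pred` stands in for the terminal node IDs and `inbag` for the in-bag counts, where 0 means the observation is OOB for that tree.

```r
# Toy data (made up for illustration): rows are observations, columns are trees
pred <- matrix(c(1, 2, 3, 1,
                 1, 2, 4, 2,
                 2, 2, 3, 1), nrow = 3, byrow = TRUE)   # terminal node IDs
inbag <- matrix(c(0, 1, 0, 0,
                  0, 0, 0, 2,
                  1, 0, 0, 0), nrow = 3, byrow = TRUE)  # in-bag counts, 0 = OOB

# Trees in which observations 1 and 2 are both OOB (trees 1 and 3 here)
both_oob <- inbag[1, ] == 0 & inbag[2, ] == 0

# Share of those trees in which the two observations land in the same node
prox_12 <- sum(pred[1, both_oob] == pred[2, both_oob]) / sum(both_oob)
prox_12  # 0.5: same node in tree 1, different nodes in tree 3
```

Trees where either observation is in-bag (trees 2 and 4 for this pair) do not contribute to the numerator or the denominator.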

@mnwright
Member

mnwright commented Nov 8, 2017

Based on edarf::extract_proximity I've created a function for this:

extract_proximity_oob = function(fit, olddata) {
  # OOB status is only available if the forest was grown with keep.inbag = TRUE
  if (is.null(fit$inbag.counts)) {
    stop("call ranger with keep.inbag = TRUE")
  }

  # Terminal node ID of each observation in each tree (n x ntree)
  pred = predict(fit, olddata, type = "terminalNodes")$predictions
  n = nrow(pred)
  prox = matrix(NA_real_, n, n)

  # In-bag counts as an n x ntree matrix; 0 means the observation is OOB
  inbag = simplify2array(fit$inbag.counts)

  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      # Use only trees where both obs are OOB
      tree_idx = inbag[i, ] == 0 & inbag[j, ] == 0
      # Result is NaN (0/0) if no tree has both obs OOB
      prox[i, j] = sum(pred[i, tree_idx] == pred[j, tree_idx]) / sum(tree_idx)
    }
  }

  prox
}

@zmjones: I'd like to create a PR in zmjones/edarf but we don't have the original data in the ranger object so it's probably not possible with newdata = NULL as in randomForest, correct?

@andland
Author

andland commented Nov 8, 2017

Awesome, thanks!

@aptperson

I've written a cpp function to take care of the inner loops in the above. Also, there are some short circuits built in:

  1. if i == j, we know the proximity is 1
  2. if i > j, we are below the diagonal and we can copy the proximity from above the diagonal

I'm no cpp expert, so this is probably horrible code, but it gets the job done.

extract_proximity_oob = function(fit, olddata) {
  pred = predict(fit, olddata, type = "terminalNodes")$predictions
  prox = matrix(NA, nrow(pred), nrow(pred))
  ntree = ncol(pred)
  ns = nrow(prox)
  
  if (is.null(fit$inbag.counts)) {
    stop("call ranger with keep.inbag = TRUE")
  }
  
  # Get inbag counts
  inbag = simplify2array(fit$inbag.counts)
  
  oob_proximity <- extract_proximity_oob_cpp(pred = pred, prox = prox, inbag = inbag, ntree = ntree, ns = ns)

  return(oob_proximity)
}

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericMatrix extract_proximity_oob_cpp(NumericMatrix pred,
                                        NumericMatrix prox,
                                        NumericMatrix inbag,
                                        int ntree,
                                        int ns) {

  for (int i = 0; i < ns; i++) {
    // print progress every 100 rows; Rcout writes to the R console
    if (i % 100 == 0) {
      Rcout << "i: " << i << "\n";
    }

    for (int j = 0; j < ns; j++) {
      if (i == j) {
        // self similarity
        prox(i, j) = 1;
      } else if (j < i) {
        // we are below the diagonal, we can copy down the already computed similarity
        prox(i, j) = prox(j, i);
      } else {
        double same_bag_total = 0;  // trees where both i and j are OOB
        double pred_total = 0;      // of those, trees where they share a terminal node

        for (int nt = 0; nt < ntree; nt++) {
          // both observations are OOB when both in-bag counts are zero
          if (inbag(i, nt) == 0 && inbag(j, nt) == 0) {
            same_bag_total += 1;
            if (pred(i, nt) == pred(j, nt)) {
              pred_total += 1;
            }
          }
        } // end for nt

        // NA when no tree has both observations OOB (avoids 0/0)
        prox(i, j) = same_bag_total > 0 ? pred_total / same_bag_total : NA_REAL;
      }
    } // end for j
  } // end for i

  return prox;
}
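
For readers who want to check the two short circuits without compiling anything, they can also be sketched in plain R. This is an illustrative sketch, not part of ranger or edarf: `oob_proximity_upper` is a hypothetical helper that takes the terminal node matrix and the in-bag count matrix directly, and the toy matrices are made up. Like the short circuit above, it assumes every observation is OOB in at least one tree, so the diagonal is set to 1.

```r
# Hypothetical helper (not from any package).
# pred:  n x ntree matrix of terminal node IDs
# inbag: n x ntree matrix of in-bag counts (0 means out-of-bag)
oob_proximity_upper <- function(pred, inbag) {
  n <- nrow(pred)
  oob <- inbag == 0
  prox <- diag(1, n)                 # short circuit 1: proximity of i with itself is 1
  if (n < 2) return(prox)

  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {           # visit only cells above the diagonal
      both <- oob[i, ] & oob[j, ]
      prox[i, j] <- if (any(both)) {
        mean(pred[i, both] == pred[j, both])
      } else {
        NA_real_                     # no tree has both observations OOB
      }
      prox[j, i] <- prox[i, j]       # short circuit 2: mirror below the diagonal
    }
  }
  prox
}

# Toy data (made up): 3 observations, 4 trees
pred <- matrix(c(1, 2, 3, 1,
                 1, 2, 4, 2,
                 2, 2, 3, 1), nrow = 3, byrow = TRUE)
inbag <- matrix(c(0, 1, 0, 0,
                  0, 0, 0, 2,
                  1, 0, 0, 0), nrow = 3, byrow = TRUE)

prox <- oob_proximity_upper(pred, inbag)
prox
```

Visiting only the upper triangle roughly halves the number of pairwise comparisons, but the full n x n result matrix is still allocated.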

@grayskripko

grayskripko commented Jun 18, 2019

@mnwright just to let you know, the solution can be demanding on memory.

> pred %>% dim
[1] 107880    500

> prox <- matrix(NA, nrow(pred), nrow(pred))
Error: cannot allocate vector of size 43.4 Gb

@mnwright
Member

I don't think this is specific to this method; it's just the amount of memory needed for a 107880 x 107880 matrix (NA is logical by default, and R stores a logical in 32 bits). In GiB:
107880^2 * 32 / 8 / 2^30 = 43.35528
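
The same arithmetic can be reproduced in R. The sketch below is a rough estimate rather than a measurement. It also covers the double case: as soon as a numeric proximity is assigned into the logical NA matrix, R coerces the whole matrix to double at 8 bytes per cell. The `gib` helper is a hypothetical convenience function.

```r
n <- 107880
gib <- function(bytes) bytes / 2^30   # bytes to GiB

size_logical <- gib(n^2 * 4)  # matrix(NA, n, n): 4 bytes per cell, about 43.36 GiB
size_double  <- gib(n^2 * 8)  # after coercion to double: 8 bytes per cell, about 86.71 GiB

# Small-scale check of the per-cell sizes (object.size includes a small header)
small_logical <- object.size(matrix(NA, 1000, 1000))
small_double  <- object.size(matrix(NA_real_, 1000, 1000))
print(small_logical, units = "MB")
print(small_double, units = "MB")
```

So for data of this size a dense proximity matrix will not fit in memory on most machines, regardless of how the OOB proximities are computed.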
