speeding up `test_clusters_approx` #5

aguang · 2022-03-21T16:59:48Z

Like others who have opened issues, I'm interested in running this neat method to look at the difference of means in my clusters from single cell data. (Well, really I would like to adjust p-values for differential expression analysis, similar to #3 and #4, but as I understand it that extension is not possible right now). However running test_clusters_approx is painfully slow, as in #2. I took a look at the source code and it seems like the function could easily be sped up by running the piece where it runs cl_fun in a for loop in parallel (using future or any other parallellization package):

From

for(j in 1:ndraws) {
    if(phi[j] < 0) next
    
    # Compute perturbed data set
    Xphi <- X
    Xphi[cl == k1, ] <- t(orig_k1 + (phi[j] - stat)*k1_constant)
    Xphi[cl == k2, ] <- t(orig_k2 + (phi[j] - stat)*k2_constant)
    
    # Recluster the perturbed data set
    cl_Xphi <- cl_fun(Xphi)
    if(preserve_cl(cl, cl_Xphi, k1, k2)) {
      log_survives[j] <- -(phi[j]/scale_factor)^2/2 + (q-1)*log(phi[j]/scale_factor) - (q/2 - 1)*log(2) - log(gamma(q/2)) - log(scale_factor) -
        stats::dnorm(phi[j], mean=stat, sd=scale_factor, log=TRUE)
    }
  }

to

future_lapply(X = 1:ndraws, FUN = function(j) {
    if(phi[j] < 0) next
    
    # Compute perturbed data set
    Xphi <- X
    Xphi[cl == k1, ] <- t(orig_k1 + (phi[j] - stat)*k1_constant)
    Xphi[cl == k2, ] <- t(orig_k2 + (phi[j] - stat)*k2_constant)
    
    # Recluster the perturbed data set
    cl_Xphi <- cl_fun(Xphi)
    if(preserve_cl(cl, cl_Xphi, k1, k2)) {
      log_survives[j] <- -(phi[j]/scale_factor)^2/2 + (q-1)*log(phi[j]/scale_factor) - (q/2 - 1)*log(2) - log(gamma(q/2)) - log(scale_factor) -
        stats::dnorm(phi[j], mean=stat, sd=scale_factor, log=TRUE)
    }
  }

Would you be willing to implement this? I am also happy to make the modifications and submit a pull request for it.

The text was updated successfully, but these errors were encountered:

lucylgao · 2022-03-23T16:08:55Z

Thank you very much for the suggestion! I totally agree that it would be helpful to make the code less embarrassingly parallel.

I just pushed a commit that replaces the for loop in the test_clusters_approx function with future_lapply, as per your suggestion.

lucylgao closed this as completed Mar 23, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

speeding up `test_clusters_approx` #5

speeding up `test_clusters_approx` #5

aguang commented Mar 21, 2022

lucylgao commented Mar 23, 2022

speeding up test_clusters_approx #5

speeding up test_clusters_approx #5

Comments

aguang commented Mar 21, 2022

lucylgao commented Mar 23, 2022

speeding up `test_clusters_approx` #5

speeding up `test_clusters_approx` #5