Annoy algorithm in `bluster` cannot accelerate the clustering process #12

Yunuuuu · 2022-11-06T09:08:47Z

Hi, it seems the Annoy algorithm in bluster does not accelerate the clustering process. `uncorrected_sce` is a SingleCellExperiment object whose `PCA` entry in reducedDims has dimensions 111455 x 50.
`seurat_obj` is converted from `uncorrected_sce`.

microbenchmark::microbenchmark(
    scran::clusterCells(
        uncorrected_sce,
        use.dimred = "PCA",
        BLUSPARAM = bluster::SNNGraphParam(
            k = 25L, type = "rank"
        )
    ),
    scran::clusterCells(
        uncorrected_sce,
        use.dimred = "PCA",
        BLUSPARAM = bluster::SNNGraphParam(
            k = 25L,
            BNPARAM = BiocNeighbors::AnnoyParam()
        )
    ),
    scran::clusterCells(
        uncorrected_sce,
        use.dimred = "PCA",
        BLUSPARAM = bluster::SNNGraphParam(
            k = 25L, type = "jaccard", cluster.fun = "louvain"
        )
    ),
    scran::clusterCells(
        uncorrected_sce,
        use.dimred = "PCA",
        BLUSPARAM = bluster::SNNGraphParam(
            k = 25L, type = "jaccard", cluster.fun = "louvain",
            BNPARAM = BiocNeighbors::AnnoyParam()
        )
    ),
    {
        seurat_nei <- FindNeighbors(
            seurat_obj,
            dims = 1:50,
            k.param = 25L
        )
        seurat_nei <- FindClusters(seurat_nei, resolution = 1)
    },
    times = 1L
)

       min         lq       mean     median         uq        max neval expr
2519.89881 2519.89881 2519.89881 2519.89881 2519.89881 2519.89881     1 scran::clusterCells(uncorrected_sce, use.dimred = "PCA", BLUSPARAM = bluster::SNNGraphParam(k = 25L, type = "rank"))
2402.44808 2402.44808 2402.44808 2402.44808 2402.44808 2402.44808     1 scran::clusterCells(uncorrected_sce, use.dimred = "PCA", BLUSPARAM = bluster::SNNGraphParam(k = 25L, BNPARAM = BiocNeighbors::AnnoyParam()))
  59.89610   59.89610   59.89610   59.89610   59.89610   59.89610     1 scran::clusterCells(uncorrected_sce, use.dimred = "PCA", BLUSPARAM = bluster::SNNGraphParam(k = 25L, type = "jaccard", cluster.fun = "louvain"))
  66.01533   66.01533   66.01533   66.01533   66.01533   66.01533     1 scran::clusterCells(uncorrected_sce, use.dimred = "PCA", BLUSPARAM = bluster::SNNGraphParam(k = 25L, type = "jaccard", cluster.fun = "louvain", BNPARAM = BiocNeighbors::AnnoyParam()))
  39.35732   39.35732   39.35732   39.35732   39.35732   39.35732     1 { seurat_nei <- FindNeighbors(seurat_obj, dims = 1:50, k.param = 25L); seurat_nei <- FindClusters(seurat_nei, resolution = 1) }


Yunuuuu · 2022-11-06T09:03:19Z

@LTLA, thanks for your reply and for developing the great single-cell toolkit in Bioconductor. Since I only used the first 50 PCs for clustering, Annoy cannot provide any speed-up. I recently compared the performance of scran and Seurat. Seurat uses the Annoy algorithm by default and clusters cells with Louvain on Jaccard weights, which I found much faster than scran; with the same parameters in scran I cannot get similar performance. Is it possible to run faster here?

Yunuuuu · 2022-11-06T09:04:21Z

Here is my sessionInfo:

[R]> sessionInfo()
R version 4.2.1 (2022-06-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.1 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so

locale:
 [1] LC_CTYPE=en_US.UTF-8      
 [2] LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=zh_CN.UTF-8      
 [8] LC_NAME=C                 
 [9] LC_ADDRESS=C              
[10] LC_TELEPHONE=C            
[11] LC_MEASUREMENT=zh_CN.UTF-8
[12] LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices
[5] utils     datasets  methods   base     

other attached packages:
 [1] here_1.0.1                 
 [2] batchelor_1.14.0           
 [3] SingleCellExperiment_1.20.0
 [4] SummarizedExperiment_1.28.0
 [5] Biobase_2.58.0             
 [6] GenomicRanges_1.50.0       
 [7] GenomeInfoDb_1.34.0        
 [8] IRanges_2.32.0             
 [9] S4Vectors_0.36.0           
[10] BiocGenerics_0.44.0        
[11] MatrixGenerics_1.10.0      
[12] matrixStats_0.62.0         

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.9               
 [2] bluster_1.8.0            
 [3] compiler_4.2.1           
 [4] progressr_0.11.0         
 [5] XVector_0.38.0           
 [6] bitops_1.0-7             
 [7] BiocNeighbors_1.16.0     
 [8] tools_4.2.1              
 [9] DelayedMatrixStats_1.20.0
[10] zlibbioc_1.44.0          
[11] statmod_1.4.37           
[12] metapod_1.6.0            
[13] digest_0.6.30            
[14] jsonlite_1.8.3           
[15] lattice_0.20-45          
[16] pkgconfig_2.0.3          
[17] rlang_1.0.6              
[18] igraph_1.3.5             
[19] Matrix_1.5-1             
[20] DelayedArray_0.24.0      
[21] cli_3.4.1                
[22] parallel_4.2.1           
[23] GenomeInfoDbData_1.2.9   
[24] cluster_2.1.4            
[25] locfit_1.5-9.6           
[26] rprojroot_2.0.3          
[27] grid_4.2.1               
[28] scuttle_1.8.0            
[29] BiocParallel_1.32.0      
[30] limma_3.54.0             
[31] irlba_2.3.5.1            
[32] edgeR_3.40.0             
[33] magrittr_2.0.3           
[34] BiocSingular_1.14.0      
[35] codetools_0.2-18         
[36] sparseMatrixStats_1.10.0 
[37] beachmat_2.14.0          
[38] rsvd_1.0.5               
[39] dqrng_0.3.0              
[40] ResidualMatrix_1.8.0     
[41] ScaledMatrix_1.6.0       
[42] RCurl_1.98-1.9           
[43] scran_1.26.0

LTLA · 2022-11-06T09:49:18Z

Works fine for me.

library(bluster)
library(BiocNeighbors)

m <- matrix(rnorm(5e6), ncol=50)
system.time(X <- clusterRows(m, BLUSPARAM = SNNGraphParam(cluster.fun="louvain", BNPARAM=AnnoyParam())))
##    user  system elapsed 
## 181.504   1.076 182.632 

system.time(X <- clusterRows(m, BLUSPARAM = SNNGraphParam(cluster.fun="louvain"))) 
# ... DNF after 10 minutes...

For smaller datasets, there may not be any speed-up as the Annoy approach writes its index to file for more general multi-threading via BPPARAM; the file I/O offsets any performance gain from approximation.
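
One rough way to see where the break-even point lies on a given machine is to time the neighbor search on its own at a small size. The sketch below assumes BiocNeighbors is installed; the matrix size and `k` are arbitrary illustrative choices rather than values from this thread, and the timings will differ between machines.

```r
library(BiocNeighbors)

set.seed(42)
# Illustrative size only (2,000 points x 50 dimensions); not from the thread.
small <- matrix(rnorm(2000 * 50), ncol = 50)

# Exact search with the KMKNN algorithm (the BiocNeighbors default).
t_exact <- system.time(
    exact <- findKNN(small, k = 10, BNPARAM = KmknnParam())
)

# Approximate search with Annoy.
t_annoy <- system.time(
    approximate <- findKNN(small, k = 10, BNPARAM = AnnoyParam())
)

# Elapsed wall-clock time of the two searches, in seconds.
c(exact = t_exact[["elapsed"]], annoy = t_annoy[["elapsed"]])

# Crude agreement measure: fraction of neighbor slots that hold the
# same index in both results, compared position by position.
mean(exact$index == approximate$index)
```

Repeating this at increasing numbers of rows gives a feel for when the approximation starts to pay off.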

Session information

Still on the last devel cycle, but there weren't any major changes in the latest release, so it shouldn't matter.

R version 4.2.1 RC (2022-06-17 r82506)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.6 LTS

Matrix products: default
BLAS:   /home/luna/Software/R/R-4-2-branch-devel/lib/libRblas.so
LAPACK: /home/luna/Software/R/R-4-2-branch-devel/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] BiocNeighbors_1.15.1 bluster_1.7.0       

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.9           lattice_0.20-45      codetools_0.2-18    
 [4] grid_4.2.1           stats4_4.2.1         magrittr_2.0.3      
 [7] cli_3.4.1            rlang_1.0.6          S4Vectors_0.35.4    
[10] Matrix_1.5-1         BiocParallel_1.31.12 igraph_1.3.5        
[13] parallel_4.2.1       compiler_4.2.1       pkgconfig_2.0.3     
[16] BiocGenerics_0.43.4  cluster_2.1.4

LTLA · 2022-11-06T20:43:05Z

FWIW the vast majority of time is spent inside igraph's cluster_* functions. With 100,000 random cells, the nearest-neighbor search and graph construction take about 30 seconds; the rest of it (> 4 minutes) is spent inside cluster_louvain.
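
The split between graph construction and community detection can be checked by timing the two stages separately. This is a minimal sketch, assuming bluster, BiocNeighbors and igraph are installed; it uses 10,000 random rows rather than 100,000 to keep the run short, and `k = 10` and `type = "jaccard"` are illustrative choices.

```r
library(bluster)
library(BiocNeighbors)
library(igraph)

set.seed(100)
# Illustrative size (10,000 x 50); the comment above refers to 100,000 cells.
m <- matrix(rnorm(10000 * 50), ncol = 50)

# Stage 1: approximate neighbor search plus shared-nearest-neighbor graph.
t_graph <- system.time(
    g <- makeSNNGraph(m, k = 10, type = "jaccard", BNPARAM = AnnoyParam())
)

# Stage 2: community detection on the finished graph.
t_cluster <- system.time(
    communities <- cluster_louvain(g)
)

# Elapsed wall-clock time per stage, in seconds.
c(graph = t_graph[["elapsed"]], clustering = t_cluster[["elapsed"]])
```

If the second number dominates, swapping the neighbor search algorithm will not change the total much, whichever BNPARAM is used.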

You can speed up the graph construction by parallelizing it via BPPARAM, but it'll probably just save you a few seconds or so. The actual clustering itself is serial so it doesn't benefit from parallelization, at least not in igraph's C implementation.

IIRC Seurat had their own implementation of the Louvain algorithm. I don't know whether or not this is of the same quality as igraph's implementation, but if it's noticeably faster, they may be taking some short-cuts that igraph does not.

Yunuuuu · 2022-11-07T14:38:27Z

Thanks for your detailed explanation. I'll stick with the Bioconductor single-cell toolkit instead of pursuing performance blindly.

From my side it is fine to close the issue now. I also found the same point, that Annoy gives no benefit for small datasets, in http://bioconductor.org/books/3.15/OSCA.advanced/dealing-with-big-data.html#fast-approximations, so I'll read this book again.

Thanks for your help @LTLA
