Annoy algorithm in bluster cannot accelerate the clustering process #12

Open
Yunuuuu opened this issue Nov 6, 2022 · 5 comments

@Yunuuuu

Yunuuuu commented Nov 6, 2022

Hi, it seems the Annoy algorithm in bluster does not accelerate the clustering process. Here, uncorrected_sce is a SingleCellExperiment object whose PCA entry in reducedDims has dimensions 111455 x 50.
seurat_obj is converted from uncorrected_sce.

microbenchmark::microbenchmark(
    scran::clusterCells(
        uncorrected_sce,
        use.dimred = "PCA",
        BLUSPARAM = bluster::SNNGraphParam(
            k = 25L, type = "rank"
        )
    ),
    scran::clusterCells(
        uncorrected_sce,
        use.dimred = "PCA",
        BLUSPARAM = bluster::SNNGraphParam(
            k = 25L,
            BNPARAM = BiocNeighbors::AnnoyParam()
        )
    ),
    scran::clusterCells(
        uncorrected_sce,
        use.dimred = "PCA",
        BLUSPARAM = bluster::SNNGraphParam(
            k = 25L, type = "jaccard", cluster.fun = "louvain"
        )
    ),
    scran::clusterCells(
        uncorrected_sce,
        use.dimred = "PCA",
        BLUSPARAM = bluster::SNNGraphParam(
            k = 25L, type = "jaccard", cluster.fun = "louvain",
            BNPARAM = BiocNeighbors::AnnoyParam()
        )
    ),
    {
        seurat_nei <- FindNeighbors(
            seurat_obj,
            dims = 1:50,
            k.param = 25L
        )
        seurat_nei <- FindClusters(seurat_nei, resolution = 1)
    },
    times = 1L
)
expr
 [1] scran::clusterCells(uncorrected_sce, use.dimred = "PCA", BLUSPARAM = bluster::SNNGraphParam(k = 25L, type = "rank"))
 [2] scran::clusterCells(uncorrected_sce, use.dimred = "PCA", BLUSPARAM = bluster::SNNGraphParam(k = 25L, BNPARAM = BiocNeighbors::AnnoyParam()))
 [3] scran::clusterCells(uncorrected_sce, use.dimred = "PCA", BLUSPARAM = bluster::SNNGraphParam(k = 25L, type = "jaccard", cluster.fun = "louvain"))
 [4] scran::clusterCells(uncorrected_sce, use.dimred = "PCA", BLUSPARAM = bluster::SNNGraphParam(k = 25L, type = "jaccard", cluster.fun = "louvain", BNPARAM = BiocNeighbors::AnnoyParam()))
 [5] { seurat_nei <- FindNeighbors(seurat_obj, dims = 1:50, k.param = 25L); seurat_nei <- FindClusters(seurat_nei, resolution = 1) }

 expr        min         lq       mean     median         uq        max neval
  [1] 2519.89881 2519.89881 2519.89881 2519.89881 2519.89881 2519.89881     1
  [2] 2402.44808 2402.44808 2402.44808 2402.44808 2402.44808 2402.44808     1
  [3]   59.89610   59.89610   59.89610   59.89610   59.89610   59.89610     1
  [4]   66.01533   66.01533   66.01533   66.01533   66.01533   66.01533     1
  [5]   39.35732   39.35732   39.35732   39.35732   39.35732   39.35732     1
@Yunuuuu
Author

Yunuuuu commented Nov 6, 2022

@LTLA, thanks for your reply and for developing the great single-cell toolkit in Bioconductor. Since I only used the first 50 PCs for clustering, Annoy cannot provide any optimization. I recently compared the performance of scran and Seurat. Seurat uses the Annoy algorithm by default and clusters cells with Louvain on Jaccard weights, which I found much faster than scran with the same parameters, and I cannot get similar performance from scran. Is it possible to run faster here?

@Yunuuuu
Author

Yunuuuu commented Nov 6, 2022

Here is my sessionInfo:

[R]> sessionInfo()
R version 4.2.1 (2022-06-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.1 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so

locale:
 [1] LC_CTYPE=en_US.UTF-8      
 [2] LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=zh_CN.UTF-8      
 [8] LC_NAME=C                 
 [9] LC_ADDRESS=C              
[10] LC_TELEPHONE=C            
[11] LC_MEASUREMENT=zh_CN.UTF-8
[12] LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices
[5] utils     datasets  methods   base     

other attached packages:
 [1] here_1.0.1                 
 [2] batchelor_1.14.0           
 [3] SingleCellExperiment_1.20.0
 [4] SummarizedExperiment_1.28.0
 [5] Biobase_2.58.0             
 [6] GenomicRanges_1.50.0       
 [7] GenomeInfoDb_1.34.0        
 [8] IRanges_2.32.0             
 [9] S4Vectors_0.36.0           
[10] BiocGenerics_0.44.0        
[11] MatrixGenerics_1.10.0      
[12] matrixStats_0.62.0         

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.9               
 [2] bluster_1.8.0            
 [3] compiler_4.2.1           
 [4] progressr_0.11.0         
 [5] XVector_0.38.0           
 [6] bitops_1.0-7             
 [7] BiocNeighbors_1.16.0     
 [8] tools_4.2.1              
 [9] DelayedMatrixStats_1.20.0
[10] zlibbioc_1.44.0          
[11] statmod_1.4.37           
[12] metapod_1.6.0            
[13] digest_0.6.30            
[14] jsonlite_1.8.3           
[15] lattice_0.20-45          
[16] pkgconfig_2.0.3          
[17] rlang_1.0.6              
[18] igraph_1.3.5             
[19] Matrix_1.5-1             
[20] DelayedArray_0.24.0      
[21] cli_3.4.1                
[22] parallel_4.2.1           
[23] GenomeInfoDbData_1.2.9   
[24] cluster_2.1.4            
[25] locfit_1.5-9.6           
[26] rprojroot_2.0.3          
[27] grid_4.2.1               
[28] scuttle_1.8.0            
[29] BiocParallel_1.32.0      
[30] limma_3.54.0             
[31] irlba_2.3.5.1            
[32] edgeR_3.40.0             
[33] magrittr_2.0.3           
[34] BiocSingular_1.14.0      
[35] codetools_0.2-18         
[36] sparseMatrixStats_1.10.0 
[37] beachmat_2.14.0          
[38] rsvd_1.0.5               
[39] dqrng_0.3.0              
[40] ResidualMatrix_1.8.0     
[41] ScaledMatrix_1.6.0       
[42] RCurl_1.98-1.9           
[43] scran_1.26.0            

@LTLA
Owner

LTLA commented Nov 6, 2022

Works fine for me.

library(bluster)
library(BiocNeighbors)

m <- matrix(rnorm(5e6), ncol=50)
system.time(X <- clusterRows(m, BLUSPARAM = SNNGraphParam(cluster.fun="louvain", BNPARAM=AnnoyParam())))
##    user  system elapsed 
## 181.504   1.076 182.632 

system.time(X <- clusterRows(m, BLUSPARAM = SNNGraphParam(cluster.fun="louvain"))) 
# ... DNF after 10 minutes...

For smaller datasets, there may not be any speed-up as the Annoy approach writes its index to file for more general multi-threading via BPPARAM; the file I/O offsets any performance gain from approximation.
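
A minimal sketch of how one might check this on a small dataset, assuming the BiocNeighbors 1.16-era `findKNN()` interface from the session information in this thread. The matrix size and the variable names (`small`, `nn_exact`, `nn_annoy`) are made up for illustration; which search is faster will depend on the data size and the machine.

```r
library(BiocNeighbors)

set.seed(100)
# Hypothetical small input: 2,000 "cells" by 50 "PCs".
small <- matrix(rnorm(2000 * 50), ncol = 50)

# Exact nearest-neighbor search (KMKNN, the default in BiocNeighbors).
t_exact <- system.time(
    nn_exact <- findKNN(small, k = 25, BNPARAM = KmknnParam())
)

# Approximate nearest-neighbor search with Annoy.
t_annoy <- system.time(
    nn_annoy <- findKNN(small, k = 25, BNPARAM = AnnoyParam())
)

# Elapsed seconds for each search.
c(exact = t_exact[["elapsed"]], annoy = t_annoy[["elapsed"]])

# Average fraction of each cell's Annoy neighbors that also appear
# among its exact neighbors (a rough measure of approximation quality).
overlap <- mean(vapply(
    seq_len(nrow(small)),
    function(i) mean(nn_annoy$index[i, ] %in% nn_exact$index[i, ]),
    numeric(1)
))
overlap
```

Timing only the neighbor search, rather than the whole `clusterRows()` call, keeps the comparison from being swamped by the community detection step.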

Session information

Still on the last devel cycle, but there weren't any major changes in the latest release, so it shouldn't matter.
R version 4.2.1 RC (2022-06-17 r82506)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.6 LTS

Matrix products: default
BLAS:   /home/luna/Software/R/R-4-2-branch-devel/lib/libRblas.so
LAPACK: /home/luna/Software/R/R-4-2-branch-devel/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] BiocNeighbors_1.15.1 bluster_1.7.0       

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.9           lattice_0.20-45      codetools_0.2-18    
 [4] grid_4.2.1           stats4_4.2.1         magrittr_2.0.3      
 [7] cli_3.4.1            rlang_1.0.6          S4Vectors_0.35.4    
[10] Matrix_1.5-1         BiocParallel_1.31.12 igraph_1.3.5        
[13] parallel_4.2.1       compiler_4.2.1       pkgconfig_2.0.3     
[16] BiocGenerics_0.43.4  cluster_2.1.4       

@LTLA
Owner

LTLA commented Nov 6, 2022

FWIW the vast majority of time is spent inside igraph's cluster_* functions. With 100,000 random cells, the nearest-neighbor search and graph construction take about 30 seconds; the rest of it (> 4 minutes) is spent inside cluster_louvain.
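
To see this split on your own data, the two stages can be timed separately. This is a rough sketch rather than what `clusterRows()` does internally line for line: it assumes `bluster::makeSNNGraph()` and `igraph::cluster_louvain()` as available in the versions listed in this thread, and it uses a deliberately small random matrix (`m_small`, a made-up name) so that it finishes quickly.

```r
library(bluster)
library(BiocNeighbors)

set.seed(42)
# Hypothetical input: 5,000 random "cells" by 50 "PCs".
m_small <- matrix(rnorm(5000 * 50), ncol = 50)

# Stage 1: nearest-neighbor search plus shared-nearest-neighbor graph construction.
t_graph <- system.time(
    g <- makeSNNGraph(m_small, k = 25, type = "jaccard", BNPARAM = AnnoyParam())
)

# Stage 2: community detection on the weighted graph.
t_cluster <- system.time(
    comm <- igraph::cluster_louvain(g)
)

# Elapsed seconds per stage.
c(graph = t_graph[["elapsed"]], louvain = t_cluster[["elapsed"]])

# Cluster sizes.
table(igraph::membership(comm))
```

If the second number dominates, switching the neighbor search algorithm will not change the total runtime much.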

You can speed up the graph construction by parallelizing it via BPPARAM, but it'll probably just save you a few seconds or so. The actual clustering itself is serial so it doesn't benefit from parallelization, at least not in igraph's C implementation.
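
A sketch of passing BPPARAM to the graph construction step, assuming the `makeSNNGraph()` signature from bluster 1.8 (newer releases may prefer a different threading argument, so check the documentation for your version). The worker count and the matrix `m_par` are arbitrary choices for illustration; `MulticoreParam` relies on forking, so `SnowParam` may be the better choice on Windows.

```r
library(bluster)
library(BiocNeighbors)
library(BiocParallel)

set.seed(7)
# Hypothetical input: 5,000 random "cells" by 50 "PCs".
m_par <- matrix(rnorm(5000 * 50), ncol = 50)

# Only the neighbor search and graph construction are parallelized here;
# the community detection that follows still runs serially.
g_par <- makeSNNGraph(
    m_par,
    k = 25,
    BNPARAM = AnnoyParam(),
    BPPARAM = MulticoreParam(workers = 2)
)

comm_par <- igraph::cluster_louvain(g_par)
length(unique(igraph::membership(comm_par)))
```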

IIRC Seurat had their own implementation of the Louvain algorithm. I don't know whether or not this is of the same quality as igraph's implementation, but if it's noticeably faster, they may be taking some short-cuts that igraph does not.

@Yunuuuu
Author

Yunuuuu commented Nov 7, 2022

Thanks for your detailed explanation. I'll stick with the Bioconductor single-cell toolkit instead of pursuing performance blindly.

So from my side it would be OK to close the issue now. The same point, that we won't get any benefit from Annoy for small datasets, is also described at http://bioconductor.org/books/3.15/OSCA.advanced/dealing-with-big-data.html#fast-approximations and I'll read this book again.

Thanks for your help @LTLA
