rebin_peaks: `Error in data.table::rbindlist(rebinned_peaks, use.names = TRUE, idcol = "assay")` #103

Closed
bschilder opened this issue Jul 30, 2022 · 5 comments

@bschilder
Collaborator

bschilder commented Jul 30, 2022

1. Bug description

An error prevents peaks from being rebinned, but only with certain datasets and certain bin sizes.
bin_size=400 ran fine, whereas bin_size=200 triggers the error.

Console output

Loading required namespace: BSgenome.Hsapiens.UCSC.hg38
Standardising peak files in bins of 200 bp. 
|==================================================================================================| 100%

Error in data.table::rbindlist(rebinned_peaks, use.names = TRUE, idcol = "assay") : 
  Total rows in the list is 3894401091 which is larger than the maximum number of rows, currently 2147483647
In addition: There were 20 warnings (use warnings() to see them) 
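
The limit quoted in the error, 2147483647, equals R's `.Machine$integer.max`, and the reported total of 3894401091 rows is roughly 1.8 times that. Below is a minimal sketch of a pre-flight check that could be run on a list of per-assay tables before stacking them. `check_stack_size` is a hypothetical helper, not an EpiCompare or data.table function, and the toy data frames only stand in for the real rebinned tables.

```r
# Hypothetical helper (not part of EpiCompare or data.table):
# report whether a list of tables can be stacked without exceeding max_rows.
check_stack_size <- function(tables, max_rows = .Machine$integer.max) {
  # Sum as doubles; summing integer row counts could itself overflow to NA.
  total_rows <- sum(vapply(tables, function(x) as.numeric(nrow(x)), numeric(1)))
  list(total_rows = total_rows, fits = total_rows <= max_rows)
}

# Toy stand-ins for the per-assay rebinned tables (assumed shape).
toy_tables <- list(
  assayA = data.frame(bin = 1:5, score = 0),
  assayB = data.frame(bin = 1:5, score = 1)
)
res <- check_stack_size(toy_tables)

# Figures taken from the error message above.
reported_total <- 3894401091
limit <- 2147483647
exceeds <- reported_total > limit
```

A check like this would only surface the problem earlier with a clearer message; it would not reduce the number of rows.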

Expected behaviour

Peaks rebin regardless of peak file inputs or bin size.

2. Reproducible example

Code

# Peak-calling output directories, one per sequencing batch
batches <- c("scTS_3_30_jun_2022",
             "reseq_k562_pfc_18_jul_2022")
batch_paths <- file.path("/home/bms20/RDS/project/neurogenomics-lab/live/Data/tip_seq/processed_data",
                         batches,
                         "03_peak_calling/04_called_peaks/")

# Gather the stringent peak files from each batch, then flatten the per-batch results
peaks <- lapply(batch_paths,
                FUN = function(dir) {
    message("Processing: ", dir)
    EpiCompare::gather_files(dir = dir,
                             type = "peaks.stringent",
                             nfcore_cutandrun = TRUE,
                             workers = 50)
}) |> unlist()

# bin_size = 200 triggers the error; bin_size = 400 ran without it
peaks_rebinned <- EpiCompare::rebin_peaks(peakfiles = peaks,
                                          genome_build = "hg38",
                                          bin_size = 200,
                                          workers = 20)

Data

Unfortunately, this does require access to our HPC lab folder, as I haven't seen this error come up with any other datasets. As such, happy to take the lead on fixing this, perhaps with input from @Al-Murphy if necessary.

I'll also bump the version as mentioned in #102, but @serachoi1230 you'll need to push upstream to Bioc devel once all the changes are ready.

3. Session info

R Under development (unstable) (2022-02-25 r81808)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so

locale:
 [1] LC_CTYPE=en_US.UTF-8          LC_NUMERIC=C                  LC_TIME=en_US.UTF-8          
 [4] LC_COLLATE=en_US.UTF-8        LC_MONETARY=en_US.UTF-8       LC_MESSAGES=en_US.UTF-8      
 [7] LC_PAPER=en_US.UTF-8          LC_NAME=en_US.UTF-8           LC_ADDRESS=en_US.UTF-8       
[10] LC_TELEPHONE=en_US.UTF-8      LC_MEASUREMENT=en_US.UTF-8    LC_IDENTIFICATION=en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
  [1] graphlayouts_0.8.0                      pbapply_1.5-0                          
  [3] lattice_0.20-45                         rJava_1.0-6                            
  [5] vctrs_0.4.1                             expm_0.999-6                           
  [7] mgcv_1.8-40                             RBGL_1.73.0                            
  [9] blob_1.2.3                              survival_3.3-1                         
 [11] spatstat.data_2.2-0                     later_1.3.0                            
 [13] nloptr_2.0.3                            DBI_1.1.3                              
 [15] BRGenomics_1.9.0                        R.utils_2.12.0                         
 [17] SingleCellExperiment_1.19.0             rappdirs_0.3.3                         
 [19] uwot_0.1.11                             jpeg_0.1-9                             
 [21] zlibbioc_1.43.0                         rgeos_0.5-9                            
 [23] OrganismDbi_1.39.1                      ChIPseeker_1.33.1                      
 [25] dnet_1.1.7                              htmlwidgets_1.5.4                      
 [27] mvtnorm_1.1-3                           future_1.27.0                          
 [29] echoconda_0.99.6                        leiden_0.4.2                           
 [31] pbkrtest_0.5.1                          parallel_4.2.0                         
 [33] irlba_2.3.5                             tidygraph_1.2.1                        
 [35] Rcpp_1.0.9                              readr_2.1.2                            
 [37] KernSmooth_2.23-20                      DT_0.23                                
 [39] promises_1.2.0.1                        DelayedArray_0.23.0                    
 [41] limma_3.53.4                            Hmisc_4.7-0                            
 [43] graph_1.75.0                            EpiCompare_0.99.21                     
 [45] fs_1.5.2                                googleAuthR_2.0.0                      
 [47] fastmatch_1.1-3                         RhpcBLASctl_0.21-247.1                 
 [49] basilisk_1.9.2                          echodata_0.99.11                       
 [51] digest_0.6.29                           png_0.1-7                              
 [53] sctransform_0.3.3                       scatterpie_0.1.7                       
 [55] cowplot_1.1.1                           DOSE_3.23.2                            
 [57] here_1.0.1                              crul_1.2.0                             
 [59] ggraph_2.0.5                            pkgconfig_2.0.3                        
 [61] GO.db_3.15.0                            gridBase_0.4-7                         
 [63] spatstat.random_2.2-0                   iterators_1.0.14                       
 [65] minqa_1.2.4                             reticulate_1.25                        
 [67] SummarizedExperiment_1.27.1             xfun_0.31                              
 [69] zoo_1.8-10                              tidyselect_1.1.2                       
 [71] reshape2_1.4.4                          purrr_0.3.4                            
 [73] ica_1.0-3                               gprofiler2_0.2.1                       
 [75] snow_0.4-4                              viridisLite_0.4.0                      
 [77] rtracklayer_1.57.0                      rlang_1.0.4                            
 [79] hexbin_1.28.2                           glue_1.6.2                             
 [81] ensembldb_2.21.2                        RColorBrewer_1.1-3                     
 [83] orthogene_1.3.1                         registry_0.5-1                         
 [85] variancePartition_1.27.2                matrixStats_0.62.0                     
 [87] ArchR_1.0.2                             MatrixGenerics_1.9.1                   
 [89] stringr_1.4.0                           ggsignif_0.6.3                         
 [91] DESeq2_1.37.4                           GGally_2.1.2                           
 [93] httpuv_1.6.5                            class_7.3-20                           
 [95] XGR_1.1.8                               DO.db_2.9                              
 [97] annotate_1.75.0                         webshot_0.5.3                          
 [99] jsonlite_1.8.0                          XVector_0.37.0                         
[101] bit_4.0.4                               mime_0.12                              
[103] BSgenome.Hsapiens.UCSC.hg38_1.4.4       gridExtra_2.3                          
[105] gplots_3.1.3                            Rsamtools_2.13.3                       
[107] Exact_3.1                               stringi_1.7.8                          
[109] RcppRoll_0.3.0                          spatstat.sparse_2.1-1                  
[111] scattermore_0.8                         phenomix_0.99.3                        
[113] rbibutils_2.2.8                         yulab.utils_0.0.5                      
[115] bitops_1.0-7                            cli_3.3.0                              
[117] Rdpack_2.4                              RSQLite_2.2.15                         
[119] tidyr_1.2.0                             heatmaply_1.3.0                        
[121] homologene_1.4.68.19.3.27               data.table_1.14.2                      
[123] HGNChelper_0.8.1                        echoannot_0.99.7                       
[125] rstudioapi_0.13                         TSP_1.2-1                              
[127] GenomicAlignments_1.33.0                nlme_3.1-158                           
[129] qvalue_2.29.0                           locfit_1.5-9.6                         
[131] VariantAnnotation_1.43.2                listenv_0.8.0                          
[133] miniUI_0.1.1.1                          gridGraphics_0.5-1                     
[135] R.oo_1.25.0                             httpcode_0.3.0                         
[137] ggnetwork_0.5.10                        dbplyr_2.2.1                           
[139] BiocGenerics_0.43.0                     readxl_1.4.0                           
[141] lifecycle_1.0.1                         ExperimentHub_2.5.0                    
[143] munsell_0.5.0                           cellranger_1.1.0                       
[145] R.methodsS3_1.8.2                       caTools_1.18.2                         
[147] codetools_0.2-18                        Biobase_2.57.1                         
[149] GenomeInfoDb_1.33.3                     lmtest_0.9-40                          
[151] xlsxjars_0.6.1                          htmlTable_2.4.1                        
[153] ontologyIndex_2.7                       supraHex_1.35.0                        
[155] xtable_1.8-4                            ROCR_1.0-11                            
[157] BiocManager_1.30.18                     Signac_1.7.0                           
[159] xlsx_0.6.5                              abind_1.4-5                            
[161] farver_2.1.1                            parallelly_1.32.1                      
[163] AnnotationHub_3.5.0                     RANN_2.6.1                             
[165] aplot_0.1.6                             biovizBase_1.45.0                      
[167] sparsesvd_0.2                           SeuratObject_4.1.0                     
[169] RNOmni_1.0.0                            ggtree_3.5.1                           
[171] GenomicRanges_1.49.0                    BiocIO_1.7.1                           
[173] ggbio_1.45.0                            RcppAnnoy_0.0.19                       
[175] goftest_1.2-3                           patchwork_1.1.1                        
[177] tibble_3.1.8                            dichromat_2.0-0.1                      
[179] ggdendro_0.1.23                         cluster_2.1.3                          
[181] future.apply_1.9.0                      Seurat_4.1.1                           
[183] dendextend_1.16.0                       GeneOverlap_1.33.0                     
[185] Matrix_1.4-1                            tidytree_0.3.9                         
[187] ellipsis_0.3.2                          prettyunits_1.1.1                      
[189] ggridges_0.5.3                          igraph_1.3.4                           
[191] fgsea_1.23.0                            downloadR_0.99.3                       
[193] gargle_1.2.0                            basilisk.utils_1.9.1                   
[195] seqPattern_1.29.0                       spatstat.utils_2.3-1                   
[197] htmltools_0.5.3                         BiocFileCache_2.5.0                    
[199] piggyback_0.1.4                         yaml_2.3.5                             
[201] GenomicFeatures_1.49.5                  utf8_1.2.2                             
[203] plotly_4.10.0                           interactiveDisplayBase_1.35.0          
[205] XML_3.99-0.10                           ewceData_1.5.0                         
[207] e1071_1.7-11                            ggpubr_0.4.0                           
[209] foreign_0.8-82                          fitdistrplus_1.1-8                     
[211] BiocParallel_1.31.10                    bit64_4.0.5                            
[213] echotabix_0.99.7                        rootSolve_1.8.2.3                      
[215] foreach_1.5.2                           ProtGenerics_1.29.0                    
[217] Biostrings_2.65.1                       spatstat.core_2.4-4                    
[219] progressr_0.10.1                        GOSemSim_2.23.0                        
[221] MAGMA.Celltyping_2.0.4                  memoise_2.0.1                          
[223] evaluate_0.15                           geneplotter_1.75.0                     
[225] tzdb_0.3.0                              lmom_2.9                               
[227] curl_4.3.2                              fansi_1.0.3                            
[229] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 osfr_0.2.8                             
[231] tensor_1.5                              checkmate_2.1.0                        
[233] aod_1.3.2                               cachem_1.0.6                           
[235] interp_1.1-3                            deldir_1.0-6                           
[237] babelgene_22.3                          dir.expiry_1.5.0                       
[239] impute_1.71.0                           ggplot2_3.3.6                          
[241] rjson_0.2.21                            openxlsx_4.2.5                         
[243] rstatix_0.7.0                           ggrepel_0.9.1                          
[245] genomation_1.29.0                       rprojroot_2.0.3                        
[247] tools_4.2.0                             magrittr_2.0.3                         
[249] RCurl_1.98-1.7                          proxy_0.4-27                           
[251] car_3.1-0                               ape_5.6-2                              
[253] ggplotify_0.1.0                         xml2_1.3.3                             
[255] httr_1.4.3                              assertthat_0.2.1                       
[257] rmarkdown_2.14                          AnnotationFilter_1.21.0                
[259] boot_1.3-28                             globals_0.15.1                         
[261] R6_2.5.1                                nnet_7.3-17                            
[263] progress_1.2.2                          genefilter_1.79.0                      
[265] KEGGREST_1.37.3                         treeio_1.21.0                          
[267] gtools_3.9.3                            BiocVersion_3.16.0                     
[269] EWCE_1.5.5                              splines_4.2.0                          
[271] carData_3.0-5                           ggfun_0.0.6                            
[273] colorspace_2.0-3                        RCircos_1.2.2                          
[275] generics_0.1.3                          stats4_4.2.0                           
[277] base64enc_0.1-3                         pillar_1.8.0                           
[279] Rgraphviz_2.41.1                        tweenr_1.0.2                           
[281] sp_1.5-0                                GenomeInfoDbData_1.2.8                 
[283] plyr_1.8.7                              gtable_0.3.0                           
[285] rvest_1.0.2                             zip_2.2.0                              
[287] restfulr_0.0.15                         latticeExtra_0.6-30                    
[289] knitr_1.39                              shadowtext_0.1.2                       
[291] biomaRt_2.53.2                          IRanges_2.31.0                         
[293] fastmap_1.1.0                           doParallel_1.0.17                      
[295] seriation_1.3.6                         AnnotationDbi_1.59.1                   
[297] broom_1.0.0                             BSgenome_1.65.2                        
[299] scales_1.2.0                            filelock_1.0.2                         
[301] backports_1.4.1                         plotrix_3.8-2                          
[303] S4Vectors_0.35.1                        lme4_1.1-30                            
[305] enrichplot_1.17.0                       gld_2.6.5                              
[307] hms_1.1.1                               ggforce_0.3.3                          
[309] Rtsne_0.16                              dplyr_1.0.9                            
[311] shiny_1.7.2                             MungeSumstats_1.5.5                    
[313] polyclip_1.10-0                         grid_4.2.0                             
[315] DescTools_0.99.45                       lazyeval_0.2.2                         
[317] Formula_1.2-4                           crayon_1.5.1                           
[319] MASS_7.3-58                             reshape_0.8.9                          
[321] viridis_0.6.2                           rpart_4.1.16                           
[323] compiler_4.2.0                          spatstat.geom_2.4-0    

@bschilder bschilder added the bug Something isn't working label Jul 30, 2022
@bschilder bschilder self-assigned this Jul 30, 2022
@bschilder bschilder added this to To do in EpiCompare via automation Jul 30, 2022
@Al-Murphy
Contributor

This seems to be a size limit on what rbindlist can handle, but I'm not 100% sure. I'd first try running it on just one thread, but I think this might require a per-chromosome approach to the binning (like we discussed previously).
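
One possible reading of the per-chromosome suggestion, sketched in base R: bin each chromosome separately and keep one bins-by-assays matrix per chromosome, so that no single table has to hold a row for every bin of every assay. This is an illustration under assumptions, not how `rebin_peaks` works or how the issue was eventually resolved. The chromosome sizes, peaks, and the helper `bin_one_chrom` are all made up, and peaks are assumed to lie within their chromosome.

```r
# Toy inputs: made-up chromosome lengths and peaks (not hg38, not the lab's data).
chrom_sizes <- c(chr1 = 1000, chr2 = 650)
bin_size <- 200
peaks <- data.frame(
  chrom = c("chr1", "chr1", "chr2"),
  start = c(150, 790, 10),   # 1-based, inclusive coordinates
  end   = c(420, 810, 90),
  assay = c("A", "B", "A"),
  stringsAsFactors = FALSE
)
assays <- sort(unique(peaks$assay))

# Hypothetical helper: bin one chromosome into a bins x assays 0/1 matrix.
bin_one_chrom <- function(chrom) {
  n_bins <- ceiling(chrom_sizes[[chrom]] / bin_size)
  mat <- matrix(0L, nrow = n_bins, ncol = length(assays),
                dimnames = list(NULL, assays))
  p <- peaks[peaks$chrom == chrom, , drop = FALSE]
  for (j in seq_len(nrow(p))) {
    # Mark every bin that the peak overlaps.
    first_bin <- (p$start[j] - 1) %/% bin_size + 1
    last_bin  <- min((p$end[j] - 1) %/% bin_size + 1, n_bins)
    mat[first_bin:last_bin, p$assay[j]] <- 1L
  }
  mat
}

# One matrix per chromosome; each stays far below the row limit on its own.
chroms <- stats::setNames(names(chrom_sizes), names(chrom_sizes))
binned <- lapply(chroms, bin_one_chrom)
```

Keeping assays as columns rather than stacking them as rows means the row count grows with the number of bins only, not with bins times files.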

@bschilder
Collaborator Author

> This seems to be a size limit on what rbindlist can handle, but I'm not 100% sure. I'd first try running it on just one thread, but I think this might require a per-chromosome approach to the binning (like we discussed previously).

Ohhh, that makes a lot of sense. I'll play around with this function a bit and see if I can improve its efficiency (#101).
This might also involve identifying tiles that are all 0s (which can be quite a few) earlier in the script and omitting them from subsequent steps.

@Al-Murphy
Contributor

> This might also involve identifying tiles that are all 0s (which can be quite a few) earlier in the script and omitting them from subsequent steps.

I'm not sure that's a good idea, as you'd then be losing the places where all peak files have noted 'no peak'. It would also mean the correlation value for a given pair could change when other peak datasets are added, which isn't desirable. So I'd avoid doing this.
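
A toy illustration of this concern, using made-up 0/1 peak calls over six bins (the vectors and the helper `cor_after_dropping_empty` are hypothetical): when all-zero bins are dropped, the correlation between `x` and `y` depends on whether a third dataset `z` is included, whereas with all bins kept it does not.

```r
# Made-up binary peak calls for three datasets over six bins.
x <- c(1, 1, 0, 0, 0, 0)
y <- c(1, 0, 1, 0, 0, 0)
z <- c(0, 0, 0, 1, 1, 0)

# Hypothetical helper: drop bins that are zero in every included dataset,
# then correlate columns a and b on the remaining bins.
cor_after_dropping_empty <- function(mat, a, b) {
  keep <- rowSums(mat) > 0
  cor(mat[keep, a], mat[keep, b])
}

# Keeping every bin: the x-y correlation does not depend on other datasets.
cor_all_bins <- cor(x, y)

# Dropping all-zero bins: the x-y correlation depends on what else is included.
cor_xy_only <- cor_after_dropping_empty(cbind(x = x, y = y), "x", "y")
cor_with_z  <- cor_after_dropping_empty(cbind(x = x, y = y, z = z), "x", "y")
```

Here `cor_all_bins` is 0.25, while the filtered values are -0.5 for the pair alone and about 0.17 once `z` is included, because `z` rescues two bins that would otherwise be dropped.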

@bschilder
Collaborator Author

> I'm not sure that's a good idea, as you'd then be losing the places where all peak files have noted 'no peak'. It would also mean the correlation value for a given pair could change when other peak datasets are added, which isn't desirable. So I'd avoid doing this.

Good point, this is also something I was concerned about. But from the perspective of a single-cell dataset, when you construct a Seurat object one of the initial steps is to filter out any features (bins) without any counts. Same goes for generating CTDs in EWCE, the idea being that they don't contribute to differentiating your samples. But I agree, it does affect the correlation structure and is a bit arbitrary since it's dependent on the samples you're including.

@bschilder
Collaborator Author

Resolved by implementing improvements described here: #101

EpiCompare automation moved this from To do to Done Aug 4, 2022