rtracklayer possible memory leakage #12

Closed
cparsania opened this issue Oct 22, 2018 · 8 comments

cparsania commented Oct 22, 2018

Hi, I have been scratching my head trying to understand the memory usage of one of my complex R scripts. What I found is quite surprising. Calling the function rtracklayer::readGFF() occupies quite a bit of memory (about 400 MB) in R even without loading the rtracklayer package. See the step-by-step use case and output below.

Amount of memory used in a fresh R session (NOTE: no packages or objects have been loaded beforehand):

## fresh R session memory usage 
pryr::mem_used()
39.4 MB

Amount of memory used once rtracklayer::readGFF() has been executed:

## read gff using rtracklayer::readGFF() (NOTE: no file supplied to the function)

rtracklayer::readGFF()
pryr::mem_used()
416 MB

As we can see, roughly 10x more memory is occupied even though nothing was returned from the rtracklayer::readGFF() call. How can this be explained, and how can I prevent R from occupying the additional ~400 MB of memory while using rtracklayer::readGFF()?

cparsania changed the title from "R bioconductor package rtracklayer possible memory leakage" to "rtracklayer possible memory leakage" on Oct 22, 2018
lawremi (Owner) commented Oct 22, 2018

Maybe @hpages has some ideas?

lawremi (Owner) commented Oct 22, 2018

Are you sure this isn't just from loading the rtracklayer namespace?

cparsania (Author) commented:

Yes, I checked on two different PCs and got the same output.

lawremi (Owner) commented Oct 22, 2018

But your example does not isolate calling readGFF(), which basically just throws an error immediately, from loading the rtracklayer namespace, which happens as soon as you call any function in the namespace.

Calling rtracklayer::readGFF() in a fresh session is going to load the rtracklayer namespace and all of its dependent namespaces, which is a ton of code, class definitions, method metadata, etc. Not surprising at all that it would consume 400 MB.
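
A minimal sketch of how one could separate those two effects, assuming pryr is installed (loadNamespace() loads without attaching; the exact numbers will vary by machine):

pryr::mem_used()
# baseline of a fresh session, e.g. ~40 MB
invisible(loadNamespace("rtracklayer"))   # load the namespace only, call nothing
pryr::mem_used()
# the ~400 MB jump already happens here
try(rtracklayer::readGFF())               # errors immediately (no file supplied)
pryr::mem_used()
# essentially unchanged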

hpages (Contributor) commented Oct 22, 2018

Here is what I get (in a fresh R session):

library(pryr)
pryr::mem_used()
# 26.1 MB
suppressMessages(library(rtracklayer))
pryr::mem_used()
# 400 MB
rtracklayer::readGFF()
# Error in .make_filexp_from_filepath(filepath) : 
#   argument "filepath" is missing, with no default
pryr::mem_used()
# 400 MB

So yes, as Michael suggested, loading/attaching rtracklayer and all its 20 or so dependencies (direct and indirect) is what consumes about 400 MB of RAM, not calling rtracklayer::readGFF() per se. See sessionInfo() below for all the packages that are loaded/attached after doing library(pryr); library(rtracklayer). That's an average of about 20 MB per loaded/attached package, which is in line with what loading/attaching pryr itself consumes.

H.

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS

Matrix products: default
BLAS: /home/hpages/R/R-3.5.1/lib/libRblas.so
LAPACK: /home/hpages/R/R-3.5.1/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] rtracklayer_1.40.6   GenomicRanges_1.32.7 GenomeInfoDb_1.16.0 
[4] IRanges_2.14.12      S4Vectors_0.18.3     BiocGenerics_0.26.0 

loaded via a namespace (and not attached):
 [1] lattice_0.20-35             matrixStats_0.54.0         
 [3] XML_3.98-1.16               Rsamtools_1.32.3           
 [5] Biostrings_2.48.0           GenomicAlignments_1.16.0   
 [7] bitops_1.0-6                grid_3.5.1                 
 [9] zlibbioc_1.26.0             XVector_0.20.0             
[11] Matrix_1.2-14               BiocParallel_1.14.2        
[13] tools_3.5.1                 Biobase_2.40.0             
[15] RCurl_1.95-4.11             DelayedArray_0.6.6         
[17] compiler_3.5.1              SummarizedExperiment_1.10.1
[19] GenomeInfoDbData_1.1.0     
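
As a rough illustration of that per-package cost, one could watch memory grow as namespaces are loaded one by one in a fresh session (the package subset below is an arbitrary sketch; each step also pulls in any of that package's dependencies that are not yet loaded, so the numbers are per step rather than strictly per package):

pkgs <- c("S4Vectors", "IRanges", "GenomicRanges", "Rsamtools", "rtracklayer")
for (pkg in pkgs) {
  before <- pryr::mem_used()
  suppressMessages(loadNamespace(pkg))    # load without attaching
  cat(pkg, ":", round((pryr::mem_used() - before) / 1e6), "MB\n")
}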

cparsania (Author) commented:

Hi, as @lawremi says, even without attaching/loading rtracklayer, ~400 MB of memory is occupied just from rtracklayer::readGFF(). See the session info and memory usage below.

> pryr::mem_used()
434 MB

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.1

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.19                XVector_0.20.0              magrittr_1.5                GenomicRanges_1.32.7        BiocGenerics_0.26.0         zlibbioc_1.26.0             GenomicAlignments_1.16.0    IRanges_2.14.12             BiocParallel_1.14.2        
[10] lattice_0.20-35             stringr_1.3.1               GenomeInfoDb_1.16.0         tools_3.5.1                 grid_3.5.1                  SummarizedExperiment_1.10.1 parallel_3.5.1              Biobase_2.40.0              matrixStats_0.54.0         
[19] yaml_2.2.0                  Matrix_1.2-14               GenomeInfoDbData_1.1.0      rtracklayer_1.40.6          pryr_0.1.4                  S4Vectors_0.18.3            bitops_1.0-6                codetools_0.2-15            RCurl_1.95-4.11            
[28] DelayedArray_0.6.6          stringi_1.2.4               compiler_3.5.1              Biostrings_2.48.0           Rsamtools_1.32.3            stats4_3.5.1                XML_3.98-1.16              

I raised this issue because I am working with limited resources (shinyapps.io allows 1 GB of memory free of charge), so 400 MB is quite expensive for that. I called rtracklayer::readGFF() within a function and thought that once the function had executed, all the memory would be freed, but it wasn't. Does that make sense?
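
A sketch of the kind of wrapper described here, assuming it looks roughly like the following (read_annotation and the file path are made up for illustration), and of why the memory is not released afterwards:

read_annotation <- function(gff_file) {
  # hypothetical wrapper; the first call loads the rtracklayer namespace as a side effect
  rtracklayer::readGFF(gff_file)
}

x <- read_annotation("annotation.gff3")   # "annotation.gff3" is a placeholder path
rm(x)
gc()                                      # the result and the function's locals can be reclaimed...
"rtracklayer" %in% loadedNamespaces()     # ...but the namespace stays loaded (TRUE), so the ~400 MB remains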

hpages (Contributor) commented Oct 23, 2018

You seem to misunderstand what @lawremi said and what I also tried to show you above. When you call rtracklayer::readGFF() in a fresh session, this automatically loads the rtracklayer package and its 20 dependencies. That's because you can't use anything from a package without loading its namespace first. So there is no way to do this "without attaching/loading rtracklayer". The action of loading rtracklayer is what consumes 400 MB of RAM. This has nothing to do with the readGFF() function itself or with a memory leak. You get the same thing by referencing any symbol defined in the rtracklayer package in a fresh session (prefixing the symbol with rtracklayer::):

library(pryr)
pryr::mem_used()
# 26.1 MB

rtracklayer::TwoBitFile
# function (path) 
# {
#     if (!isSingleString(path)) 
#         stop("'filename' must be a single string, specifying a path")
#     new("TwoBitFile", resource = twoBitPath(path))
# }
# <bytecode: 0x1c2132b8>
# <environment: namespace:rtracklayer>

pryr::mem_used()
# 399 MB

Note that I didn't even try to execute any code from the rtracklayer package here. I just typed rtracklayer::TwoBitFile followed by <ENTER>, which had the effect of loading the rtracklayer package and displaying the definition of the TwoBitFile() function.
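
As a quick follow-up sketch, the jump corresponds to the namespaces that get loaded by that single symbol lookup, which can be listed directly:

before <- loadedNamespaces()
invisible(rtracklayer::TwoBitFile)    # merely touching an exported symbol loads the namespace
setdiff(loadedNamespaces(), before)   # rtracklayer plus its direct and indirect dependencies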

Unfortunately it is not realistic to use Bioconductor on a machine with only 1 GB of RAM. We don't have precise requirements, and it very much depends on what your use case is, but in my experience it's hard to run any typical Bioconductor workflow with less than 3 GB of RAM. Some workflows will require much more than that e.g. 8 GB or even more...

cparsania (Author) commented:

OK. Thank you for the detailed explanation.
