Refactor rseqc BAM subsampling for memory efficiency #37
I had a stab at rewriting how the BAM subsampling works for the RSeQC gene body coverage.
The previous use of `cat` and `shuf` required the entire BAM file to be read into memory, which kind of defeated the purpose a little and meant that the process still broke with large files (see #36).

Here I use `samtools view` with the `-s` flag, which subsamples randomly in a single pass. However, this only works with a fraction, not a target read number, so I have to try to calculate the fraction based on the file size. This is a bit of a guess currently - needs testing with some big files to see how many reads the subsampling actually spits out.
I'm not very happy with how this turned out to be honest. Code can hopefully be simplified a bit and logic moved from bash to groovy when nextflow-io/nextflow#731 is implemented. Could potentially also use a channel filtering approach as suggested here by @pditommaso (this would probably be nicest).
Bonus: added more command line flags for STAR processes to tell it how much memory is available.
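One STAR option relevant to this kind of change is `--limitBAMsortRAM`, which caps the memory STAR uses when sorting the output BAM (value in bytes). A hedged sketch with an illustrative hard-coded value - a real pipeline would pass the task's allocated memory instead:

```shell
# Illustrative only: 32 GB expressed in bytes
mem_bytes=$(( 32 * 1000 * 1000 * 1000 ))

# --limitBAMsortRAM bounds the RAM used for coordinate-sorting the BAM
star_cmd="STAR --runThreadN 4 --limitBAMsortRAM ${mem_bytes} --outSAMtype BAM SortedByCoordinate"
echo "$star_cmd"
```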
Phil