Refactor rseqc BAM subsampling for memory efficiency #37
I had a stab at rewriting how the BAM subsampling works for the RSeQC gene body coverage.
The previous use of `cat` and `shuf` required the entire BAM file to be read into memory, which kind of defeated the purpose a little and meant that the process still broke with large files (see #36).

Here I use `samtools view` with the `-s` flag, which subsamples randomly in a single pass. However, this only works with a fraction, not a target read number, so I have to try to calculate the fraction based on the file size. This is a bit of a guess currently - needs testing with some big files to see how many reads the subsampling actually spits out.
I'm not very happy with how this turned out to be honest. Code can hopefully be simplified a bit and logic moved from bash to groovy when nextflow-io/nextflow#731 is implemented. Could potentially also use a channel filtering approach as suggested here by @pditommaso (this would probably be nicest).
Bonus: added more command line flags for STAR processes to tell it how much memory is available.
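One STAR option relevant to this kind of change is `--limitBAMsortRAM`, which caps the memory STAR uses when sorting the output BAM (value in bytes). A hedged sketch with an illustrative hard-coded value - a real pipeline would pass the task's allocated memory instead:

```shell
# Illustrative only: 32 GB expressed in bytes
mem_bytes=$(( 32 * 1000 * 1000 * 1000 ))

# --limitBAMsortRAM bounds the RAM used for coordinate-sorting the BAM
star_cmd="STAR --runThreadN 4 --limitBAMsortRAM ${mem_bytes} --outSAMtype BAM SortedByCoordinate"
echo "$star_cmd"
```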
Phil