Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor rseqc BAM subsampling for memory efficiency #37

Merged
merged 3 commits into from
Jul 11, 2018

Conversation

ewels
Copy link
Member

@ewels ewels commented Jul 11, 2018

I had a stab at rewriting how the BAM subsampling works for the RSeQC gene body coverage.

The previous use of cat and shuf required the entire BAM file to be read into memory, which kind of defeated the purpose a little and meant that the process still broke with large files (see #36).

Here I use samtools view with the -s flag which subsamples randomly with a single pass. However, this only works with a fraction, not a read number target, so I have to try to calculate this based on the filesize.

This is a bit of a guess currently - needs testing with some big files to see how many reads the subsampling actually spits out.

I'm not very happy with how this turned out to be honest. Code can hopefully be simplified a bit and logic moved from bash to groovy when nextflow-io/nextflow#731 is implemented. Could potentially also use a channel filtering approach as suggested here by @pditommaso (this would probably be nicest).

Bonus: added more command line flags for STAR processes to tell it how much memory is available.

Phil

@ewels ewels added feature-request help wanted Extra attention is needed labels Jul 11, 2018
@ewels ewels requested a review from a team July 11, 2018 11:00
@ewels
Copy link
Member Author

ewels commented Jul 11, 2018

Rewrote a little to move the logic into groovy. Code is a lot tidier now.

Copy link
Contributor

@pditommaso pditommaso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks excellent!

Copy link
Member

@apeltzer apeltzer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks really good to me! I like the possibility to have different behavior dependent on file sizes!

@apeltzer
Copy link
Member

As its just the missing release (rel nf-core/tools#72 ) that generates an error, I'm merging this in.

@apeltzer apeltzer closed this Jul 11, 2018
@apeltzer apeltzer reopened this Jul 11, 2018
@apeltzer apeltzer merged commit 7a32e09 into nf-core:master Jul 11, 2018
@ewels ewels deleted the rseqc_subsamp_refactor branch July 12, 2018 07:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature-request help wanted Extra attention is needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants