Revert to use BigWig (geneBody_coverage2.py) to calculate the gene body coverage? #195

Closed
jun-wan opened this issue Apr 12, 2019 · 14 comments


@jun-wan
Contributor

jun-wan commented Apr 12, 2019

The geneBody_coverage2.py script (which takes a bigWig file as input) is much improved in RSeQC v3.0.0 thanks to the pyBigWig package. It took ~50 min for a 12 GB BAM file (pipeline duration: 6h 33m 33s). In contrast, geneBody_coverage.py on a subsampled BAM (7.6 GB) took ~17 hrs (pipeline duration: 1d 15h 44m 56s), which is much longer.
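
To put the reported timings side by side, here is a small sketch that converts the durations quoted above into seconds and computes the ratios. The helper name `to_seconds` is made up for illustration, and the step timings are the approximate figures from the comment ("~50min", "~17hrs"), so the ratios are rough.

```python
import re


def to_seconds(text: str) -> int:
    """Convert a duration such as '1d 15h 44m 56s' into seconds."""
    units = {"d": 86400, "h": 3600, "m": 60, "s": 1}
    return sum(int(n) * units[u] for n, u in re.findall(r"(\d+)\s*([dhms])", text))


# Timings as reported in the comment above (step timings are approximate).
step_bigwig = to_seconds("50m")
step_subsampled = to_seconds("17h")
pipeline_bigwig = to_seconds("6h 33m 33s")
pipeline_subsampled = to_seconds("1d 15h 44m 56s")

print(f"step speed-up: {step_subsampled / step_bigwig:.1f}x")              # 20.4x
print(f"pipeline speed-up: {pipeline_subsampled / pipeline_bigwig:.1f}x")  # 6.1x
```

Note that the two runs used different inputs (12 GB full BAM vs. 7.6 GB subsampled BAM), so this is not a like-for-like benchmark.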

@ewels
Member

ewels commented May 7, 2019

Interesting! We have already removed the subsampling approach, though. Could you compare it to the latest version with geneBody_coverage3.py instead?

@jun-wan
Contributor Author

jun-wan commented May 15, 2019

@ewels Our currently deployed pipeline uses geneBody_coverage2.py (bigWig file as input), and I do not know when it was changed to the subsampled approach:

rnaseq/main.nf

Lines 810 to 812 in 37f260d

input:
file bam from bam_subsampled.concat(bam_skipSubsampFiltered)
file bed12 from bed_genebody_coverage.collect()

geneBody_coverage3.py? Did you mean geneBody_coverage2.py in RSeQC v3.0.0 (the latest version)? I have tested it on quite a few projects; it worked well and took just 1-2 hrs.

Shall we revert to using bigWig?
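
For readers unfamiliar with the bigWig route, the following is a minimal sketch of the two commands it involves, built as argument lists without running anything. It assumes deepTools `bamCoverage` is the tool that creates the bigWig (the thread does not say which tool is used) and that the BAM is already sorted and indexed. The function name and file names are hypothetical.

```python
import shlex
from pathlib import Path
from typing import List


def bigwig_genebody_commands(bam: str, bed12: str, threads: int = 1) -> List[List[str]]:
    """Build the BAM -> bigWig -> geneBody_coverage2.py command chain."""
    prefix = Path(bam).stem
    bigwig = f"{prefix}.bigwig"
    return [
        # Assumption: deepTools bamCoverage generates the coverage track.
        ["bamCoverage", "-b", bam, "-o", bigwig, "-p", str(threads)],
        # RSeQC script that reads the bigWig plus a BED12 gene model.
        ["geneBody_coverage2.py", "-i", bigwig, "-r", bed12, "-o", prefix],
    ]


# Hypothetical file names, for illustration only.
for cmd in bigwig_genebody_commands("sample.bam", "genes.bed", threads=4):
    print(shlex.join(cmd))
```

In a Nextflow process the same two commands would sit in the `script:` block, with the subsampled BAM channel replaced by the full BAM.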

@ewels
Member

ewels commented May 18, 2019

I'm a bit lost with this... @apeltzer, can you remember where we are with this process? Any thoughts on the above?

@apeltzer
Member

I'm a bit lost too, but I'll try to summarize.

1.) Initially, this pipeline ran geneBody_coverage.py on the unmodified BAM, which turned out to be too slow.
2.) Subsampling was introduced to speed this up, though the runtime was still significant.
3.) Based on an issue, we then introduced the bigWig-based geneBody_coverage2.py script using RSeQC 2.X a while ago; I think I implemented that during the hackathon in Stockholm in August. In testing, this made things a bit faster but still not really fast.
4.) We then dropped support for it again and reverted to 2.), which is currently in place, because the RSeQC 3.0.0 script without bigWig (geneBody_coverage.py) is apparently faster. It is faster than the formerly used RSeQC 2.X version, but NOT fast on large input files.

That's just the historical summary of what we have already tried. We can, of course, use RSeQC 3.0.0 and revert to subsampled + bigWig, which should be by far the fastest way of doing this, though I haven't tested it so far.

@alneberg
Member

@apeltzer,

"It is faster than the formerly used RSeQC 2.X version, but NOT fast on large input files."

but this seems to be OK (from @jun-wan's words above):

"It took ~50min. for 12GB BAM file"

@jun-wan
Contributor Author

jun-wan commented May 20, 2019

@apeltzer yes, if we use bigwig as the input of geneBody_coverage2.py in RSeQC 3.0, I think we don't have to subsample the BAM.
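
One way to see why a bigWig input may not need subsampling: a gene body profile can be built by sampling a fixed number of positions per transcript from a precomputed coverage track, so the work scales with the number of transcripts and not with the number of reads. The sketch below illustrates that idea only; it is an assumption about the general approach, not RSeQC's actual code, and `coverage_at` is a stand-in for a bigWig lookup.

```python
from typing import Callable, List, Sequence, Tuple

Exon = Tuple[int, int]  # half-open [start, end) interval


def percentile_positions(exons: Sequence[Exon], n_points: int = 100) -> List[int]:
    """Pick n_points evenly spaced bases along the spliced transcript."""
    bases = [pos for start, end in exons for pos in range(start, end)]
    if n_points < 2 or len(bases) < n_points:
        return []  # transcript too short to sample
    step = (len(bases) - 1) / (n_points - 1)
    return [bases[round(i * step)] for i in range(n_points)]


def gene_body_profile(
    transcripts: Sequence[Tuple[Sequence[Exon], str]],
    coverage_at: Callable[[int], float],
    n_points: int = 100,
) -> List[float]:
    """Sum coverage at each percentile position, oriented 5' to 3'."""
    profile = [0.0] * n_points
    for exons, strand in transcripts:
        positions = percentile_positions(exons, n_points)
        if strand == "-":
            positions.reverse()  # minus-strand genes run right to left
        for i, pos in enumerate(positions):
            profile[i] += coverage_at(pos)
    return profile


def toy_coverage(pos: int) -> float:
    """Hypothetical coverage track: signal equals the coordinate."""
    return float(pos)


profile = gene_body_profile([([(0, 50), (100, 150)], "-")], toy_coverage)
print(profile[0], profile[-1])  # 149.0 0.0
```

With a real bigWig, `coverage_at` would be replaced by a query against the file, which is where a library such as pyBigWig comes in.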

@apeltzer
Member

Hey both,

If it is indeed fast, do we need to change anything else? Or do we just revert to BAM -> bigWig -> geneBody_coverage2.py instead of running the normal geneBody_coverage.py on the subsampled BAM?

@apeltzer
Member

This just came in on the Slack channel:

"The genebody_coverage analysis step takes too long to run and my wall time limit is 72 hours... this makes my job fail since I am running it on 350 single-end raw files. Is there a way of deactivating this step in the pipeline?"

@apeltzer
Member

So maybe we could combine the two: subsample the BAM, create a bigWig from the subsampled BAM, and run geneBody_coverage2.py on that bigWig. That should be faster than using either option (bigWig or subsampling) on its own.

@jun-wan
Contributor Author

jun-wan commented May 20, 2019

I think "just revert back to BAM -> bigWig -> geneBody_coverage2.py" without subsampling is the fastest way, since subsampling itself takes some time but won't speed things up much.

@apeltzer
Member

apeltzer commented May 20, 2019

Because it just came up in the Slack channel: @jun-wan, could you test using the Qualimap feature in the dev branch with genebody_coverage disabled? It might be that we can simply disable that part if the newly built-in Qualimap module already does this properly and much faster 👍

@drpatelh
Member

[Image: qualimap_gene_coverage_profile-1]

@drpatelh
Member

drpatelh commented Jun 18, 2019

Qualimap does indeed generate the gene body coverage. It is probably a good idea to remove the genebody_coverage process from the script, given the problems we have had with it in the past and especially since it's redundant now.

@drpatelh
Member

This has been removed in PR #195, and will be fixed when the salmon branch is merged into dev
