
Logging performance of de novo variant discovery with racon #303

Open
leoisl opened this issue Oct 31, 2022 · 5 comments

leoisl (Collaborator) commented Oct 31, 2022

To be done by @mbhall88. Could you please use this commit?

leoisl (Collaborator, Author) commented Oct 31, 2022

@mbhall88 could you also please report the max RAM usage? I am quite concerned about this part of the racon-based de novo discovery: https://github.com/rmcolq/pandora/blob/12a08c5483c19fc12411e174970d31c86e842a2d/src/denovo_discovery/discover_main.cpp#L205-L206

This is a dictionary from locus names to the subreads that map to each locus, as inferred by pandora map. This structure could potentially get very large, as we basically store a substring of every read that maps to each locus (it is just the region of the read that maps to that specific locus, but still...), roughly the shape sketched below. There are potentially many better ways to store this info, but I also want to avoid premature optimisation, and will only work on this if RAM is indeed an issue.
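For illustration only (the type names below are hypothetical and not taken from the pandora source), the structure being described is roughly a map from locus name to copies of the mapped read regions; a leaner, interval-based layout is sketched alongside it:

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical sketch -- these are NOT the actual pandora types.
// One entry per locus; each entry holds a copy of the region of every
// read that pandora map assigned to that locus (the "subread").
using LociToSubReads = std::map<std::string, std::vector<std::string>>;

// Memory grows roughly with (number of loci) x (reads per locus) x
// (mean subread length), because every mapped read region is copied.
// One possible leaner layout (again, purely illustrative) would store
// references into the original reads instead of copies:
struct SubReadInterval {
    std::size_t read_index;  // which read in the input
    std::size_t start;       // start of the mapped region on that read
    std::size_t length;      // length of the mapped region
};
using LociToIntervals = std::map<std::string, std::vector<SubReadInterval>>;
```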

mbhall88 (Member) commented Nov 1, 2022

The nanopore runtime and memory usage of drprg with the previous version of de novo discovery are in mbhall88/drprg#15 (comment), and the Illumina figures are in mbhall88/drprg#15 (comment).

Pandora accounts for the vast majority of drprg's runtime and memory, so these serve as good benchmark baselines, especially given that the only thing changing between runs is the pandora version.

leoisl (Collaborator, Author) commented Nov 1, 2022

I just finished running pandora discover at this commit on the paper ONT data. Comparison with the paper run:

Paper run

  • Threads: 16
  • Runtime: 7.59h
  • RAM: 10.2 GB

commit 02f9ec

  • Threads: 16
  • Runtime: 7.58h
  • RAM: 16.7 GB

I think the runtime is totally fine; we got almost exactly the same runtime.

RAM usage is much higher though, probably related to this: #303 (comment). It will be the RAM bottleneck of the pipeline (pandora compare on ONT data takes 15.7 GB; this new denovo implementation takes 1 GB more).

Of course this is data dependent, so let's see how @mbhall88's benchmarks compare.

iqbal-lab (Collaborator) commented

This RAM use is, IMO, acceptable for the moment; I wouldn't postpone the merge to reduce RAM.

mbhall88 (Member) commented Nov 7, 2022

Okay, so here are the updated benchmark figures.

Runtime

Illumina


The median is 91 s, down from 98 s, so it is faster now on Illumina!

Nanopore


The median is now 171 s, up from 163 s.

Memory

Illumina


The median is 46 MB, down from 54 MB, so memory is marginally down too.

Nanopore


The median is now 270 MB, up from 240 MB.


All in all, I think this is acceptable.
