How long does it take to build msbwt? #1

Closed
gengyuncong opened this issue Nov 14, 2016 · 3 comments

Comments

@gengyuncong

gengyuncong commented Nov 14, 2016

Hi Matt,

I'm interested in FMLRC and want to use it to correct my PacBio reads. I have ~50× Illumina coverage (~125 Gb in total) and am building the msbwt for the Illumina reads, but the build is taking a very long time. Here is the relevant part of the log file:
[2016-10-31 16:12:16] INFO: Formatting sequences for merging...
[2016-11-04 22:27:21] INFO: Beginning MSBWT construction...
[2016-11-05 00:20:34] INFO: Processing groups of size 256...
[2016-11-06 18:33:50] INFO: Processing groups of size 512...
[2016-11-09 19:07:49] INFO: Processing groups of size 1024...
[2016-11-13 12:56:12] INFO: Processing groups of size 2048...
What does the "size" mean, and when can I expect the build to finish?
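For reference, the time spent in each stage can be read directly off the timestamps in the log above. The following is a minimal sketch that parses those six lines and reports the gap between consecutive entries; the function name `stage_durations` is made up for illustration and is not part of msbwt. The last logged stage (groups of size 2048) has no following entry, so its duration is unknown and is not reported.

```python
from datetime import datetime

# The six log lines quoted above, copied verbatim.
LOG = """\
[2016-10-31 16:12:16] INFO: Formatting sequences for merging...
[2016-11-04 22:27:21] INFO: Beginning MSBWT construction...
[2016-11-05 00:20:34] INFO: Processing groups of size 256...
[2016-11-06 18:33:50] INFO: Processing groups of size 512...
[2016-11-09 19:07:49] INFO: Processing groups of size 1024...
[2016-11-13 12:56:12] INFO: Processing groups of size 2048...
"""


def stage_durations(log_text):
    """Return (message, elapsed) pairs for every stage that has a successor."""
    entries = []
    for line in log_text.splitlines():
        # Split "[timestamp] message" at the closing bracket.
        stamp, _, message = line.partition("] ")
        when = datetime.strptime(stamp.lstrip("["), "%Y-%m-%d %H:%M:%S")
        entries.append((when, message))
    # A stage lasts from its own timestamp until the next entry's timestamp.
    return [
        (message, following - when)
        for (when, message), (following, _) in zip(entries, entries[1:])
    ]


durations = stage_durations(LOG)
for message, elapsed in durations:
    print(f"{elapsed!s:>18}  {message}")
```

Each completed merge stage in this log took longer than the one before it, which is consistent with the build being impractical at this data size.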

Thank you for your time.
Yuncong Geng

@holtjma
Owner

holtjma commented Nov 14, 2016

Yuncong,

It looks like you are building this with the BWT merge algorithm that is part of MSBWT. While that method will work on large Illumina datasets, I don't recommend it: it was not designed for them and will take a prohibitively long time to build the BWT. Instead, I recommend building the BWT with ropebwt2 and converting it to msbwt. That approach should be able to build the BWT of a dataset that large in a matter of hours (I usually run them overnight and they're done by the time I get back to work the next day).
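As a rough illustration of the ropebwt2 route, the sketch below only assembles a shell pipeline as a string; it does not run anything. The stages and flags (`ropebwt2 -LR`, `msbwt convert`, and the `tr NT TN` swaps around ropebwt2) are assumptions based on the pipeline described in the FMLRC README and should be checked against the current README and each tool's help output before use. The file name `reads.fq.gz`, the output directory `illumina_msbwt`, and the helper `ropebwt2_to_msbwt_command` are hypothetical.

```python
import shlex


def ropebwt2_to_msbwt_command(fastq_gz, out_dir):
    """Build (but do not execute) a ropebwt2-to-msbwt shell pipeline."""
    stages = [
        ["gunzip", "-c", fastq_gz],     # stream the compressed FASTQ
        ["awk", "NR % 4 == 2"],         # keep only the sequence line of each record
        ["sort"],                       # sort the reads lexicographically
        ["tr", "NT", "TN"],             # swap N and T before ropebwt2 (assumed step)
        ["ropebwt2", "-LR"],            # one read per line, skip reverse strand (assumed flags)
        ["tr", "NT", "TN"],             # swap N and T back
        ["msbwt", "convert", out_dir],  # convert the stream into msbwt's format (assumed subcommand)
    ]
    # Quote every argument so the string is safe to paste into a shell.
    return " | ".join(
        " ".join(shlex.quote(arg) for arg in stage) for stage in stages
    )


command = ropebwt2_to_msbwt_command("reads.fq.gz", "illumina_msbwt")
print(command)
```

Building the command as a string keeps the sketch free of side effects; review the printed pipeline and then run it manually once the flags have been confirmed.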

As a word of warning, we have not tested FMLRC with an Illumina dataset that large. Most of our tests were on organisms with smaller genomes and therefore smaller Illumina and PacBio inputs.

Matt

@fidelram

Maybe you could add this info to the FMLRC README, which mainly suggests using msbwt.

@holtjma
Owner

holtjma commented Nov 22, 2016

@fidelram I just checked to make sure this was in the README for FMLRC. It is, but probably not emphasized as much as I would like. I'll look into promoting it more with the next commit.

@gengyuncong I'm closing this issue. Feel free to re-open it or start another issue if you have more questions about building the BWT.

@holtjma holtjma closed this as completed Nov 22, 2016