
Multithread support for bwa index #104

Closed
unode opened this issue Jan 15, 2017 · 13 comments


unode commented Jan 15, 2017

Hi all,

Current databases are becoming increasingly large. Recently I found myself indexing a large FASTA file, which took over 200 CPU hours (single-threaded).

Searching for multithreaded support for bwa index, I landed on a five-year-old mailing-list thread that mentions the existence of some sort of patch. I couldn't find any reference to the patch itself, though.

Regardless, is there any ongoing or planned work to make bwa index parallelizable in some form?


lh3 commented Jan 16, 2017

No, there is no pull request on multi-threaded indexing. Implementing it would take quite some time and might not dramatically improve performance, especially when building the index within limited memory.

Generally, to build a large index, you may consider using a larger block size (option "-b"). This option defaults to 10,000,000. You may increase it to 100,000,000 or even larger, depending on your input. This may save you some time.
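As a concrete illustration of the suggestion above (a sketch only: it builds and prints the command line rather than running bwa, and assumes a hypothetical reference file `ref.fa`):

```python
# Sketch: raise -b from bwa's default of 10,000,000 to 10x that value.
# The command is printed, not executed; "ref.fa" is a placeholder name.
DEFAULT_B = 10_000_000           # bwa index's default block size
block_size = DEFAULT_B * 10      # 100,000,000, as suggested above

cmd = ["bwa", "index", "-b", str(block_size), "ref.fa"]
print(" ".join(cmd))             # bwa index -b 100000000 ref.fa
```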


unode commented Jan 17, 2017

@lh3 Thanks, increasing -b does seem to improve speed considerably.

However I don't quite understand the impact of changing this option. At least during indexing, I don't see any significant memory increase even with values as large as 10,000,000,000.

What's the trade-off or otherwise, why isn't the default value larger?


lh3 commented Jan 17, 2017

-b specifies how many bases to process in a batch. The memory used by one batch is 8 * {-b} bytes. If you have a "reference genome" larger than 200Gb, you won't observe an obvious memory increase with -b set to 10G. For a 3Gb human genome, however, setting -b to 10G will make the peak RAM 8 times as high during the BWT construction phase.
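The arithmetic above can be made concrete. This is a sketch based only on the 8-bytes-per-base figure stated here, not on bwa's source:

```python
def batch_mem_bytes(b: int) -> int:
    # Memory used by one batch during BWT construction: 8 * {-b} bytes,
    # per the explanation above.
    return 8 * b

# Default -b of 10,000,000 -> an 80 MB batch.
print(batch_mem_bytes(10_000_000))        # 80000000
# -b raised to 10G -> an 80 GB batch (only reached if the
# reference is actually that large).
print(batch_mem_bytes(10_000_000_000))    # 80000000000
```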


unode commented Jan 17, 2017

So if I understand correctly, the ideal -b value is around # of bases / 8.
Wouldn't it be possible to have this value adjusted automatically?
From what I gather, there's a first pass that packs the FASTA file. Is the -b value already used at this stage? If not, could this stage be used to calculate the ideal -b value?

On the other hand, if finding the ideal -b during the "pack" phase is impractical, would it be reasonable to have:

  • -b set to "auto" by default
  • if -b is set to "auto" perform a full file scan to calculate the ideal -b.
  • if -b is set to anything but "auto", skip the full file scan and use the given value.
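A minimal sketch of the proposed "auto" behaviour (hypothetical helper, not bwa's actual logic; `total_bases` stands in for the reference length bwa would already know):

```python
DEFAULT_B = 10_000_000  # bwa index's default block size

def resolve_block_size(b, total_bases: int) -> int:
    # Hypothetical resolution of the proposed "auto" setting:
    # "auto" picks total bases / 8 (the "ideal" value suggested above,
    # floored at the current default); any explicit value is used as-is.
    if b == "auto":
        return max(DEFAULT_B, total_bases // 8)
    return int(b)

# A 160Gb reference would get -b = 20,000,000,000 under this heuristic.
print(resolve_block_size("auto", total_bases=160_000_000_000))
```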


lh3 commented Jan 18, 2017

-b is only used when bwa generates "ref.fa.bwt". At that step, bwa index already knows the total length of the reference. -b was added when I wanted to index nt. I have only done that once, so I didn't bother to explore the optimal -b in general. Yes, it should be possible to adjust -b automatically, but first I need to run some experiments to see how speed is affected by -b. Thanks for the suggestion anyway.


unode commented Jan 18, 2017

From the tests I've been running, changing the -b value from the default of 10,000,000 to 500,000,000 when indexing a ~90Gb FASTA file made the entire process roughly 6 times faster.
I'm now also trying a value of 20,000,000,000, computed by dividing textLength by 8. If this scales well, I expect a gain of at least 8 times.


lh3 commented Jan 19, 2017

Thanks for the data. A 6-fold speedup is a lot, much larger than my initial guess. I will consider automatically adjusting -b in a future version of bwa.

unode added a commit to unode/ngless that referenced this issue Jan 30, 2017
BWA's default indexing parameters are quite conservative. This leads to
a small memory footprint at the cost of more CPU hours.
With large databases (~100GB) default settings require over 2 weeks of
CPU time. Increasing the default blocksize will increase the memory
footprint but will reduce indexing time 3 to 6 fold.

This patch increases the blocksize to roughly 1/10th of the filesize.
The memory footprint should be about the size of the database.

As per lh3/bwa#104 this patch may become
obsolete once this functionality is built into bwa.
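The heuristic described in the commit message can be sketched as follows (an illustration of the stated rule of thumb, not the ngless code itself; the memory estimate reuses the 8-bytes-per-base figure from earlier in the thread):

```python
def ngless_block_size(file_size: int) -> int:
    # Roughly 1/10th of the file size, as the commit message describes.
    return file_size // 10

# For a ~100 GB database, batch memory at 8 bytes per base is then
# about 0.8x the file size, i.e. "about the size of the database".
size = 100_000_000_000
b = ngless_block_size(size)
print(b)        # 10000000000
print(8 * b)    # 80000000000
```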
@serge2016

Hello! Any news on this thread?


emrahkirdok commented Apr 10, 2020

> Thanks for the data. A 6-fold speedup is a lot, much larger than my initial guess. I will consider automatically adjusting -b in a future version of bwa.

Hi, I hope everyone in this thread is doing OK. I am working with large FASTA files and wondering whether this feature has been implemented in the current version, or will be any time soon. Or should I keep optimising -b myself?
Best wishes

@jorondo1

Hi! I am also curious whether anything has changed since this thread was started. Cheers


Stack7 commented Oct 23, 2023

Hi! I would be very happy to see any news in this thread! I am dreaming about a threads option. It would be great! Cheers!


MDBL403 commented May 10, 2024

any news on threads?!


lh3 commented May 10, 2024

There won't be multi-threaded indexing. I have explained the rationale above.
