
Multithread support for bwa index #104

Closed
unode opened this issue Jan 15, 2017 · 13 comments


unode commented Jan 15, 2017

Hi all,

Current databases are becoming increasingly large. Recently I found myself indexing a large FASTA file, which took over 200 CPU hours (single-threaded).

Searching for multithreaded support for bwa index, I landed on a five-year-old mailing-list thread that mentions the existence of some sort of patch. I couldn't find any reference to the patch itself, though.

Regardless, is there any ongoing or planned work to make bwa index parallelizable in some form?


lh3 commented Jan 16, 2017

No, there is no pull request on multi-threaded indexing. Implementing it would take quite some time and might not dramatically improve performance, especially when building the index within limited memory.

Generally, to build a large index, you may consider using a larger block size (option "-b"). This option defaults to 10,000,000. You may increase it to 100,000,000 or even larger, depending on your input. This may save you some time.
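As a concrete illustration of the suggestion above (a sketch only: it builds and prints the command line rather than running bwa, and assumes a hypothetical reference file `ref.fa`):

```python
# Sketch: raise -b from bwa's default of 10,000,000 to 10x that value.
# The command is printed, not executed; "ref.fa" is a placeholder name.
DEFAULT_B = 10_000_000           # bwa index's default block size
block_size = DEFAULT_B * 10      # 100,000,000, as suggested above

cmd = ["bwa", "index", "-b", str(block_size), "ref.fa"]
print(" ".join(cmd))             # bwa index -b 100000000 ref.fa
```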


unode commented Jan 17, 2017

@lh3 Thanks, increasing -b does seem to improve speed considerably.

However I don't quite understand the impact of changing this option. At least during indexing, I don't see any significant memory increase even with values as large as 10,000,000,000.

What's the trade-off or otherwise, why isn't the default value larger?


lh3 commented Jan 17, 2017

-b specifies how many bases to process in a batch. The memory used by one batch is 8 * {-b} bytes. If you have a "reference genome" larger than 200Gb, you won't observe an obvious memory increase with -b set to 10G. For a 3Gb human genome, however, setting -b to 10G will make the peak RAM 8 times as high during the BWT construction phase.
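The arithmetic above can be made concrete. This is a sketch based only on the 8-bytes-per-base figure stated here, not on bwa's source:

```python
def batch_mem_bytes(b: int) -> int:
    # Memory used by one batch during BWT construction: 8 * {-b} bytes,
    # per the explanation above.
    return 8 * b

# Default -b of 10,000,000 -> an 80 MB batch.
print(batch_mem_bytes(10_000_000))        # 80000000
# -b raised to 10G -> an 80 GB batch (only reached if the
# reference is actually that large).
print(batch_mem_bytes(10_000_000_000))    # 80000000000
```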


unode commented Jan 17, 2017

So if I understand correctly, the ideal -b value is around # of bases / 8.
Wouldn't it be possible to have this value adjusted automatically?
From what I gather, there's a first pass that packs the FASTA file. Is the -b value already used at this stage? If not, could this stage be used to calculate the ideal -b value?

On the other hand, if finding the ideal -b during the "pack" phase is impractical, would it be reasonable to have:

  • -b set to "auto" by default
  • if -b is set to "auto" perform a full file scan to calculate the ideal -b.
  • if -b is set to anything but "auto", skip the full file scan and use the given value.
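A minimal sketch of the proposed "auto" behaviour (hypothetical helper, not bwa's actual logic; `total_bases` stands in for the reference length bwa would already know):

```python
DEFAULT_B = 10_000_000  # bwa index's default block size

def resolve_block_size(b, total_bases: int) -> int:
    # Hypothetical resolution of the proposed "auto" setting:
    # "auto" picks total bases / 8 (the "ideal" value suggested above,
    # floored at the current default); any explicit value is used as-is.
    if b == "auto":
        return max(DEFAULT_B, total_bases // 8)
    return int(b)

# A 160Gb reference would get -b = 20,000,000,000 under this heuristic.
print(resolve_block_size("auto", total_bases=160_000_000_000))
```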


lh3 commented Jan 18, 2017

-b is only used when bwa generates "ref.fa.bwt". At that step, bwa index already knows the total length of the reference. -b was added when I wanted to index nt. I have only done that once, so I didn't bother to explore the optimal -b in general. Yes, it should be possible to adjust -b automatically, but first I need to run some experiments to see how speed is affected by -b. Thanks for the suggestion anyway.


unode commented Jan 18, 2017

From the tests I've been running, changing the -b value from the default of 10,000,000 to 500,000,000 when indexing a ~90Gb FASTA file made the entire process roughly 6 times faster.
I'm now also trying a value of 20,000,000,000, computed by dividing textLength by 8. If this scales well, I expect a gain of at least 8 times.


lh3 commented Jan 19, 2017

Thanks for the data. A 6-fold speedup is a lot, much larger than my initial guess. I will consider automatically adjusting -b in a future version of bwa.

unode added a commit to unode/ngless that referenced this issue Jan 30, 2017
BWA's default indexing parameters are quite conservative. This leads to
a small memory footprint at the cost of more CPU hours.
With large databases (~100GB) default settings require over 2 weeks of
CPU time. Increasing the default blocksize will increase the memory
footprint but will reduce indexing time 3 to 6 fold.

This patch increases the blocksize to roughly 1/10th of the filesize.
The memory footprint should be about the size of the database.

As per lh3/bwa#104 this patch may become
obsolete once this functionality is built into bwa.
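The heuristic described in the commit message can be sketched as follows (an illustration of the stated rule of thumb, not the ngless code itself; the memory estimate reuses the 8-bytes-per-base figure from earlier in the thread):

```python
def ngless_block_size(file_size: int) -> int:
    # Roughly 1/10th of the file size, as the commit message describes.
    return file_size // 10

# For a ~100 GB database, batch memory at 8 bytes per base is then
# about 0.8x the file size, i.e. "about the size of the database".
size = 100_000_000_000
b = ngless_block_size(size)
print(b)        # 10000000000
print(8 * b)    # 80000000000
```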
@serge2016

Hello! Any news on this thread?


emrahkirdok commented Apr 10, 2020

> Thanks for the data. A 6-fold speedup is a lot, much larger than my initial guess. I will consider automatically adjusting -b in a future version of bwa.

Hi, I hope everyone in this thread is doing OK. I am working with large FASTA files and wondering whether this feature has been implemented in the current version, or will be any time soon. Or should I keep optimising -b myself?
Best wishes

@jorondo1

Hi! I am also curious whether anything has changed since this thread was started. Cheers


Stack7 commented Oct 23, 2023

Hi! I would be very happy to see any news in this thread! I am dreaming about a threads option. It would be great! Cheers!


MDBL403 commented May 10, 2024

any news on threads?!


lh3 commented May 10, 2024

There won't be multi-threaded indexing. I have explained the rationale above.
