shuffling reads leads to different outputs #192

agrippa · 2018-04-11T14:45:51Z

We've noticed some unexpected behavior with bwa-mem, and wanted to open a discussion here on it.

In brief, simply shuffling the order of reads in an input can significantly change the alignment produced by bwa-mem (up to ~8% of reads map to a different location on the reference). Most of the changed mappings are low-confidence, but anecdotally some high-confidence mappings change as well.

We spent a good chunk of time diagnosing this behavior, and isolated most of it to the use of the variable 'id' on line 201 of bwamem_pair.c introduced in this commit:

66585b7#diff-9faaf4191963cb131c46bd920d5fec78

as well as a similar use of 'id' in a call to hash_64 in bwamem.c introduced here:

4219e58#diff-112f35e17420ca725945f8d3849929ea

Here 'id' seems to be the read's offset in the input, which would explain why its use leads to changes in alignment when we shuffle reads. Unfortunately, we're not very familiar with the bwa-mem code base, so it's unclear what the original motivation behind these changes was or whether this is resolvable.

So our main questions are: why/how is 'id' being used in these cases? How is this helpful for alignments today? And is there an alternate approach that would remove this non-determinism in the face of semantically identical data (albeit re-ordered)?


lh3 · 2018-04-11T22:11:02Z

You need a seed to randomly place equally best hits. id is such a seed. In addition to this, bwa-mem estimates an insert size distribution for each batch, which differs from batch to batch slightly.
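To illustrate the point about seeding, here is a minimal Python sketch, not bwa-mem's actual code. It assumes a read has `n_ties` equally good candidate locations and one is chosen by hashing a seed. `mix64` is a stand-in integer mixer (the splitmix64 finalizer), not bwa's `hash_64`, and `pick_by_name` is a hypothetical alternative that bwa-mem does not offer. Seeding by input position ties the choice to read order; seeding by read name would not.

```python
import hashlib

MASK64 = (1 << 64) - 1

def mix64(x: int) -> int:
    """splitmix64 finalizer, used here as a stand-in integer hash."""
    x &= MASK64
    x ^= x >> 30
    x = (x * 0xBF58476D1CE4E5B9) & MASK64
    x ^= x >> 27
    x = (x * 0x94D049BB133111EB) & MASK64
    x ^= x >> 31
    return x

def pick_by_index(read_index: int, n_ties: int) -> int:
    """Choose among tied hits using the read's position in the input as seed."""
    return mix64(read_index) % n_ties

def pick_by_name(read_name: str, n_ties: int) -> int:
    """Hypothetical alternative: seed from the read name, so order is irrelevant."""
    digest = hashlib.blake2b(read_name.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "little") % n_ties

def place_reads(names, n_ties, by_index):
    """Return {read name: chosen tie index} for reads in the given order."""
    placed = {}
    for i, name in enumerate(names):
        if by_index:
            placed[name] = pick_by_index(i, n_ties)
        else:
            placed[name] = pick_by_name(name, n_ties)
    return placed
```

With index seeding, the choice for a read depends only on where it sits in the input, so swapping two reads swaps their choices. With name seeding, any ordering of the same reads gives the same placement, but only for this tie-breaking step and only if read names are unique.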

Bwa-mem guarantees identical output if you provide the same input. However, reordering will change output. With insert size estimate turned on, which is a very useful feature, it is impossible to give identical output when you reorder reads.
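The batch effect can be sketched the same way. This is a simplified illustration under stated assumptions, not bwa-mem's estimator: it assumes reads are processed in fixed-size batches and that each batch estimates its insert size as a plain mean of the pairs it contains (the real estimate is more involved). `batch_means` and the numbers are made up for the example.

```python
from statistics import fmean

def batch_means(insert_sizes, batch_size):
    """Per-batch mean insert size, with batches taken in input order."""
    return [
        fmean(insert_sizes[i:i + batch_size])
        for i in range(0, len(insert_sizes), batch_size)
    ]

# Same six pairs, two different input orders.
original = [300, 310, 290, 500, 305, 295]
shuffled = [500, 300, 310, 290, 305, 295]

means_original = batch_means(original, batch_size=3)
means_shuffled = batch_means(shuffled, batch_size=3)
```

Here the first batch has a mean of 300.0 in the original order and 370.0 after shuffling, even though the overall set of pairs is unchanged. Any pairing decision that depends on the per-batch estimate can therefore change when reads move between batches.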

lh3 closed this as completed Apr 11, 2018

tnguyensanger mentioned this issue Jul 29, 2020

How to make bwa short read alignment deterministic for testing malariagen/pipelines#41

Closed
