We've noticed some unexpected behavior with bwa-mem, and wanted to open a discussion here on it.
To put it briefly, we notice that simply shuffling the ordering of reads in an input can significantly change the alignment produced by bwa-mem (up to ~8% of reads mapping to a different location on the reference). While many of these changed mappings are low confidence, anecdotally some high confidence mappings will also change.
We spent a good chunk of time diagnosing this behavior, and isolated most of it to the use of the variable 'id' on line 201 of bwamem_pair.c introduced in this commit:
66585b7#diff-9faaf4191963cb131c46bd920d5fec78
as well as a similar use of 'id' in a call to hash_64 in bwamem.c introduced here:
4219e58#diff-112f35e17420ca725945f8d3849929ea
Here 'id' seems to basically be the offset of a read in the input, which would explain why its use leads to changes in alignment when we shuffle reads. Unfortunately, we're not very familiar with the bwa-mem code base, so it's unclear what the original motivation behind these changes was or whether this is something resolvable.
So our main questions are: why/how is 'id' being used in these cases? How is this helpful for alignments today? And is there an alternate approach that would remove this non-determinism in the face of semantically identical data (albeit re-ordered)?
You need a seed to randomly place equally best hits, and id is such a seed. In addition, bwa-mem estimates an insert size distribution for each batch, which differs slightly from batch to batch.
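To make the seeding point concrete, here is a minimal sketch of tie-breaking among equally good hits. It is not bwa-mem's code: `mix64` is a SplitMix64-style mixer standing in for hash_64, and `place_by_name` (seeding from the read name) is a hypothetical alternative shown only for contrast. Seeding from the input offset is repeatable for a fixed input order but can change when reads are shuffled, while seeding from the name does not depend on order.

```python
import hashlib

MASK64 = (1 << 64) - 1


def mix64(key: int) -> int:
    """SplitMix64-style finalizer; a stand-in mixer, NOT bwa's hash_64."""
    z = key & MASK64
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
    return z ^ (z >> 31)


def name_seed(read_name: str) -> int:
    """Hypothetical order-independent seed derived from the read name."""
    digest = hashlib.sha256(read_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little")


def place_by_index(reads):
    """Break ties using the read's offset in the input (order-dependent)."""
    return {
        name: cands[mix64(i) % len(cands)]
        for i, (name, cands) in enumerate(reads)
    }


def place_by_name(reads):
    """Break ties using a seed from the read name (order-independent)."""
    return {name: cands[name_seed(name) % len(cands)] for name, cands in reads}


if __name__ == "__main__":
    # Each read has a list of equally best candidate positions (made-up data).
    reads = [("readA", [100, 5000, 9000]), ("readB", [200, 7000]), ("readC", [42])]
    shuffled = list(reversed(reads))
    print("by index:", place_by_index(reads), place_by_index(shuffled))
    print("by name: ", place_by_name(reads), place_by_name(shuffled))
```

Note that this only covers the random placement of tied hits. It does nothing about the per-batch insert size estimate, which also depends on read order.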
Bwa-mem guarantees identical output if you provide the same input. However, reordering the reads will change the output. With insert size estimation turned on, which is a very useful feature, it is impossible to give identical output when you reorder reads.
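The batch effect can be seen with a toy example. The sketch below assumes a plain mean and standard deviation per batch, which is simpler than whatever bwa-mem actually computes, and the insert sizes and batch size are made up. The same eight values, split into batches of four in two different orders, give different per-batch estimates.

```python
from statistics import mean, pstdev


def batch_estimates(insert_sizes, batch_size):
    """Return (mean, std) of the insert sizes in each consecutive batch."""
    estimates = []
    for start in range(0, len(insert_sizes), batch_size):
        batch = insert_sizes[start:start + batch_size]
        estimates.append((mean(batch), pstdev(batch)))
    return estimates


# Same multiset of insert sizes in two different input orders (made-up data).
ORIGINAL = [300, 310, 290, 305, 400, 410, 390, 405]
REORDERED = [300, 400, 310, 410, 290, 390, 305, 405]

if __name__ == "__main__":
    for label, sizes in (("original", ORIGINAL), ("reordered", REORDERED)):
        for i, (mu, sd) in enumerate(batch_estimates(sizes, batch_size=4)):
            print(f"{label} batch {i}: mean={mu:.2f} std={sd:.2f}")
```

Any pairing decision that depends on the batch estimate can therefore change when a read lands in a different batch, even though the data as a whole is unchanged.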