shuffling reads leads to different outputs #192

Closed
agrippa opened this issue Apr 11, 2018 · 1 comment

Comments


agrippa commented Apr 11, 2018

We've noticed some unexpected behavior with bwa-mem, and wanted to open a discussion here on it.

In brief, simply shuffling the order of reads in an input can significantly change the alignment produced by bwa-mem (up to ~8% of reads map to a different location on the reference). While many of these changed mappings are low confidence, anecdotally some high-confidence mappings change as well.

We spent a good chunk of time diagnosing this behavior, and isolated most of it to the use of the variable 'id' on line 201 of bwamem_pair.c introduced in this commit:

66585b7#diff-9faaf4191963cb131c46bd920d5fec78

as well as a similar use of 'id' in a call to hash_64 in bwamem.c introduced here:

4219e58#diff-112f35e17420ca725945f8d3849929ea

Here 'id' seems to be the offset of the read within the input, which would explain why its use leads to changes in alignment when we shuffle reads. Unfortunately, we're not very familiar with the bwa-mem code base, so it's unclear what the original motivation behind these changes was or whether this is resolvable.

So our main questions are: why/how is 'id' being used in these cases? How is this helpful for alignments today? And is there an alternate approach that would remove this non-determinism in the face of semantically identical data (albeit re-ordered)?

Owner

lh3 commented Apr 11, 2018

You need a seed to randomly place equally best hits. id is such a seed. In addition to this, bwa-mem estimates an insert size distribution for each batch, which differs from batch to batch slightly.
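To make the seeding point concrete, here is a minimal Python sketch of tie-breaking among equally good hits. It is not bwa-mem's code: `pick_hit`, `place_by_position`, and `place_by_name` are hypothetical names, and SHA-256 stands in for whatever hash the aligner really uses. The position-based variant mirrors the role `id` plays in this discussion; the name-based variant is only an assumption about what an order-independent seed could look like, and it would address the tie-breaking part alone, not the per-batch insert size estimate.

```python
import hashlib


def pick_hit(candidates, seed):
    """Pick one of several equally good hits, deterministically from `seed`."""
    digest = hashlib.sha256(str(seed).encode()).digest()
    return candidates[int.from_bytes(digest[:8], "big") % len(candidates)]


def place_by_position(reads):
    # Seed is the read's offset in the input, so the choice depends on order.
    return {name: pick_hit(hits, idx) for idx, (name, hits) in enumerate(reads)}


def place_by_name(reads):
    # Hypothetical alternative: seed is the read name, independent of order.
    return {name: pick_hit(hits, name) for name, hits in reads}


# Made-up reads, each with the same four equally good candidate locations.
hits = ["chr1:100", "chr5:2000", "chr9:30", "chrX:7"]
reads = [(f"read{i}", hits) for i in range(200)]
reordered = list(reversed(reads))

same_by_position = place_by_position(reads) == place_by_position(reordered)
same_by_name = place_by_name(reads) == place_by_name(reordered)
```

Both variants give identical output for identical input; only the name-based one is stable under reordering.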

Bwa-mem guarantees identical output if you provide the same input. However, reordering will change output. With insert size estimate turned on, which is a very useful feature, it is impossible to give identical output when you reorder reads.
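The batch effect can be illustrated the same way. The sketch below uses made-up insert sizes and a plain per-batch mean as a stand-in for bwa-mem's real estimator, which is more involved; `batch_insert_size_means` and the batch size of 1000 are assumptions for illustration only.

```python
import random
import statistics


def batch_insert_size_means(insert_sizes, batch_size):
    """Estimate the mean insert size separately for each consecutive batch."""
    return [
        statistics.mean(insert_sizes[i:i + batch_size])
        for i in range(0, len(insert_sizes), batch_size)
    ]


rng = random.Random(0)
sizes = [rng.gauss(300, 30) for _ in range(10_000)]  # synthetic insert sizes
shuffled = sizes[:]
rng.shuffle(shuffled)

original_estimates = batch_insert_size_means(sizes, 1000)
shuffled_estimates = batch_insert_size_means(shuffled, 1000)
```

The two inputs hold exactly the same values, so a global estimate would be unchanged, but the reads landing in each batch differ after shuffling and so do the per-batch estimates. Any pairing decision that depends on the batch estimate can then differ too.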
