Fix exponential runtime for longer adapters #695
Merged
Fixes #694
This eliminates the exponential algorithm. This does not affect the speed of KmerFinder because it is oblivious to the number of patterns searched, as long as their combined length is smaller than a machine word.
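To illustrate why the number of patterns does not matter, here is a minimal sketch of a bit-parallel (shift-and) multi-pattern search. It is not the KmerFinder implementation; the function names `build_masks` and `contains_any` and the 64-bit word size are assumptions made for this example. All patterns are packed into one bit field, so each sequence character costs one shift, one OR and one AND regardless of how many patterns there are.

```python
WORD_SIZE = 64  # assumed machine word size for this sketch


def build_masks(patterns):
    """Pack all patterns side by side into one bit field."""
    char_masks = {}  # character -> bits of the positions where it occurs
    init_mask = 0    # first position of every pattern
    end_mask = 0     # last position of every pattern
    offset = 0
    for pattern in patterns:
        if not pattern:
            raise ValueError("empty patterns are not supported")
        init_mask |= 1 << offset
        for i, ch in enumerate(pattern):
            char_masks[ch] = char_masks.get(ch, 0) | (1 << (offset + i))
        offset += len(pattern)
        end_mask |= 1 << (offset - 1)
    if offset > WORD_SIZE:
        raise ValueError("combined pattern length exceeds the word size")
    return char_masks, init_mask, end_mask


def contains_any(sequence, patterns):
    """Return True if any pattern occurs in sequence."""
    char_masks, init_mask, end_mask = build_masks(patterns)
    state = 0
    for ch in sequence:
        # Advance every partial match by one position, start a new match
        # for each pattern, then keep only positions that agree with ch.
        state = ((state << 1) | init_mask) & char_masks.get(ch, 0)
        if state & end_mask:
            return True
    return False
```

For example, `contains_any("ACGTACGT", ["GTA", "TTT"])` finds `GTA`, while `contains_any("ACGTACGT", ["TTT", "GGG"])` finds nothing. The inner loop is identical in both cases; only the precomputed masks differ.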
The only disadvantage is an increase in the number of false positives:
Before:
After:
The older algorithm arranged the kmers so that a minimal number of them was needed to prove the absence of the adapter, which reduced the number of false positives. As is visible, the estimated random hit chance is up 0.26 percentage points and the actual hit rate is up 0.40 percentage points. (The value labeled as a percentage on the last line is actually a fraction. This is already addressed in the code, which now displays a proper percentage, but this output is still from old runs.)
In practice, this means alignment is run needlessly for 1 in 250 more reads than with the old algorithm. I think this is acceptable.
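The "1 in 250" figure follows directly from the 0.40 percentage point increase in the actual hit rate. A quick check (the variable names are only for illustration):

```python
from fractions import Fraction

extra_hit_rate_pp = Fraction("0.40")      # increase in percentage points
extra_fraction = extra_hit_rate_pp / 100  # as a fraction of all reads
reads_per_extra_alignment = 1 / extra_fraction
print(reads_per_extra_alignment)  # 250
```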
The kmer search set construction code can probably be simplified and refactored further, but this is the best I can do for now on a limited time budget.