Fix exponential runtime for longer adapters #695

rhpvorderman · 2023-04-12T10:04:26Z

Fixes #694

This eliminates the exponential algorithm. The change does not affect the speed of KmerFinder, which is oblivious to the number of patterns searched as long as their combined length is smaller than a machine word.
The only disadvantage is an increase in the number of false positives:

Before:

kmer       start  stop  considered sites  hit chance by random sequence (%)
AGA        -3     None  1                 1.56
AGAT       -4     None  1                 0.39
AGATC      -19    None  15                1.45
GGAAG      -19    None  15                1.45
AGATCGG    -33    None  27                0.16
AAGAGCA    -33    None  27                0.16
AACTCCAG   -33    None  26                0.04
CACGTCTG   -33    None  26                0.04
CACGTC     -29    None  24                0.58
AGATCGGA   0      None  143               0.22
CGTCTGAAC  0      None  142               0.05
TCCAGTCA   0      None  143               0.22
AGAGCACA   0      None  143               0.22
Chance for profile hit by random sequence: 6.39%

Percentage possible adapters: 0.1595648

After:

kmer       start  stop  considered sites  hit chance by random sequence (%)
AGA        -3     None  1                 1.56
AGAT       -4     None  1                 0.39
AGATC      -19    None  15                1.45
GGAAG      -19    None  15                1.45
AGATCGG    -29    None  23                0.14
CACGTC     -29    None  24                0.58
AAGAGCA    -29    None  23                0.14
CGTCTGA    -33    None  27                0.16
ACTCCAG    -33    None  27                0.16
AGATCGGA   -33    None  26                0.04
AGAGCACA   -33    None  26                0.04
GTCTGAAC   0      None  143               0.22
AGATCGGAA  0      None  142               0.05
TCCAGTCA   0      None  143               0.22
GAGCACAC   0      None  143               0.22
Chance for profile hit by random sequence: 6.65%

Percentage possible adapters: 0.163504

The older algorithm arranged the kmers so that a minimal number of them was required to prove the absence of the adapter. This reduced the number of false positives. As the tables show, the estimated random hit chance is up 0.26 percentage points and the actual hit rate is up 0.40 percentage points. (The "percentage" on the last line should read "fraction"; this is already addressed in the code, which now displays a proper percentage, but the output above is from old runs.)
In practice, this means that the alignment is run needlessly for about 1 in 250 more reads than with the old algorithm. I think this is acceptable.
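
For reference, the "hit chance" columns above can be reproduced with a simple model. This is a sketch under stated assumptions, not the code in kmer_heuristic.py: it assumes uniformly random, independent bases and treats sites and kmers as independent. The function names and the `BEFORE`/`AFTER` lists (kmer length and considered sites, copied from the two tables) are only illustrative.

```python
def kmer_hit_chance(k: int, sites: int) -> float:
    """Chance that a given kmer of length k matches at one or more of
    `sites` positions in a uniformly random sequence (assumption)."""
    per_site = 0.25 ** k
    return 1.0 - (1.0 - per_site) ** sites


def profile_hit_chance(kmer_sites) -> float:
    """Chance that at least one kmer in the profile hits, assuming the
    individual kmers hit independently of each other."""
    miss = 1.0
    for k, sites in kmer_sites:
        miss *= 1.0 - kmer_hit_chance(k, sites)
    return 1.0 - miss


# (kmer length, considered sites) for each row of the two tables above.
BEFORE = [(3, 1), (4, 1), (5, 15), (5, 15), (7, 27), (7, 27), (8, 26),
          (8, 26), (6, 24), (8, 143), (9, 142), (8, 143), (8, 143)]
AFTER = [(3, 1), (4, 1), (5, 15), (5, 15), (7, 23), (6, 24), (7, 23),
         (7, 27), (7, 27), (8, 26), (8, 26), (8, 143), (9, 142),
         (8, 143), (8, 143)]

print(f"before: {100 * profile_hit_chance(BEFORE):.2f}%")  # before: 6.39%
print(f"after:  {100 * profile_hit_chance(AFTER):.2f}%")   # after:  6.65%
```

Under this model, the two extra kmers in the new profile account for the 0.26 percentage point increase.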

The kmer search set construction code can probably be simplified and refactored further, but this is the best I can do for now on a limited time budget.
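
To illustrate why search speed need not depend on the number of patterns, here is a minimal bit-parallel shift-and sketch in Python. It is an assumption about the general technique, not KmerFinder's actual implementation; `build_masks` and `contains_any` are hypothetical names. All kmers are packed into one bit field, so each base of the read costs one shift, one OR and one AND, no matter how many kmers are searched.

```python
def build_masks(kmers):
    """Pack all kmers into a single bit field.

    Returns per-base bit masks, a mask with the first bit of every kmer,
    a mask with the last bit of every kmer, and the total bits used.
    """
    masks = {}
    init = 0
    accept = 0
    offset = 0
    for kmer in kmers:
        init |= 1 << offset
        for i, base in enumerate(kmer):
            masks[base] = masks.get(base, 0) | (1 << (offset + i))
        offset += len(kmer)
        accept |= 1 << (offset - 1)
    return masks, init, accept, offset


def contains_any(sequence, kmers):
    """Return True if any of the kmers occurs in sequence."""
    masks, init, accept, _total_bits = build_masks(kmers)
    state = 0
    for base in sequence:
        # One update per base, regardless of the number of kmers.
        state = ((state << 1) | init) & masks.get(base, 0)
        if state & accept:
            return True
    return False


print(contains_any("TTAGATCGGAAGAGC", ["AGATC", "GGAAG"]))  # True
print(contains_any("AGATGGAAC", ["AGATC", "GGAAG"]))        # False
```

Python integers are unbounded, so this sketch has no length limit. In a fixed-width implementation, the total number of bits used (10 in this example) would have to fit in a machine word, which is where the combined-length restriction comes from.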

rhpvorderman added 10 commits April 12, 2023 10:02

Add a main section to kmer_heuristic.py for easier debugging

1471a6c

Simplify kmer selection algorithm

acb7040

Refactor name and add docstring

88f9714

Make kmer_chunks result deterministic

30643f5

Fix a test that expected an old function name

df401b8

Make the adapter a 1000 base pairs for the slowness test

48be9be

Eliminate invalid syntax

1eea545

remove unused imports

c9729b6

Ignore mypy for test and benchmarking code

248ebf3

Properly report percentage sign

e3212c8

marcelm merged commit 27ab483 into marcelm:main Apr 12, 2023
15 checks passed

rhpvorderman deleted the fixkmerfinder branch April 12, 2023 11:59
