Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix exponential runtime for longer adapters #695

Merged
merged 10 commits into from Apr 12, 2023

Conversation

rhpvorderman
Copy link
Collaborator

@rhpvorderman rhpvorderman commented Apr 12, 2023

Fixes #694

This eliminates the exponential algorithm. This does not affect the speed of KmerFinder because it is oblivious to the number of patterns searched, as long as their combined length is smaller than a machine word.
The only disadvantage is an increase in the number of false positives:

before:

kmer    start   stop    considered sites        hit chance by random sequence (%)
AGA             -3      None    1       1.56
AGAT            -4      None    1       0.39
AGATC           -19     None    15      1.45
GGAAG           -19     None    15      1.45
AGATCGG         -33     None    27      0.16
AAGAGCA         -33     None    27      0.16
AACTCCAG        -33     None    26      0.04
CACGTCTG        -33     None    26      0.04
CACGTC          -29     None    24      0.58
AGATCGGA        0       None    143     0.22
CGTCTGAAC       0       None    142     0.05
TCCAGTCA        0       None    143     0.22
AGAGCACA        0       None    143     0.22
Chance for profile hit by random sequence: 6.39%

Percentage possible adapters: 0.1595648

After:

kmer    start   stop    considered sites        hit chance by random sequence (%)
AGA             -3      None    1       1.56
AGAT            -4      None    1       0.39
AGATC           -19     None    15      1.45
GGAAG           -19     None    15      1.45
AGATCGG         -29     None    23      0.14
CACGTC          -29     None    24      0.58
AAGAGCA         -29     None    23      0.14
CGTCTGA         -33     None    27      0.16
ACTCCAG         -33     None    27      0.16
AGATCGGA        -33     None    26      0.04
AGAGCACA        -33     None    26      0.04
GTCTGAAC        0       None    143     0.22
AGATCGGAA       0       None    142     0.05
TCCAGTCA        0       None    143     0.22
GAGCACAC        0       None    143     0.22
Chance for profile hit by random sequence: 6.65%

Percentage possible adapters: 0.163504

The older algorithm arranged the kmers so it required a minimal amount to proof the non-presence of the adapter. This reduced the number of false positives. As is visible, the estimated random hit chance is up 0.26 percentage points and the actual hit rate 0.40 percentage points (the percentage on the last line should be 'fraction', this is already adressed in the code by displaying a proper percentage, but this is still from old runs.).
In practice this means alignment is run for nothing for 1 in 250 reads more than with the old algorithm. I think this is acceptable.

The kmer search set construction code can probably be simplified and refactored further but this is the best I can do for now on a limited time budget.

@marcelm marcelm merged commit 27ab483 into marcelm:main Apr 12, 2023
15 checks passed
@rhpvorderman rhpvorderman deleted the fixkmerfinder branch April 12, 2023 11:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cutadapt 4.3 runs excruciatingly long
2 participants